Responsibilities
- In this role, you will work closely with machine learning researchers and directly accelerate their work, but you don't need to be a machine learning expert yourself.
- We value people who can quickly obtain a deep technical understanding of new domains and enjoy being self-directed and identifying the most important problems to solve.
- You’ll be managing training HPC clusters at Luma from provisioning to performance tuning.
- Areas of work will include observability, distributed job tracing, GPU diagnostics, software environment management, and additional tooling, as well as work on the actual code to enable necessary features.
- We believe that increasing compute is a huge lever to AI progress. You will have a direct impact on our ability to grow to an unprecedented scale and likewise produce unprecedented results.
Experience
- 8+ years of experience as an infrastructure or DevOps engineer working on large, complex distributed systems.
- Deep understanding of networking; experience in HPC networking is a bonus.
- Experience developing high-quality software in a general-purpose programming language, preferably including Python.
- Excellent problem-solving skills and attention to detail.
- Experience with GPUs in large scale clusters is strongly preferred.
- Strong knowledge of observability and monitoring in distributed systems.
- Tenacious at troubleshooting hardware and network topology failures in distributed systems.
- Independently driven and able to own problems and build solutions end to end.
- Experience with large scale data center operations, proficiency in cloud orchestration and system tools.
- Please note this role is not meant for recent grads.
Compensation
- In addition to cash base pay, you'll also receive a sizable grant of Luma's equity.
- The pay range for this position is $180,000-$220,000/yr for the Bay Area. Base pay offered will vary depending on job-related knowledge, skills, candidate location, and experience.