Responsibilities
- Collaborate with researchers and engineers to specify the availability, performance, correctness, and efficiency requirements of the current and future versions of our GPU infrastructure.
- Work with multiple GPU cloud providers to scale up, scale down, maintain, and monitor our thousands of GPUs across many clusters.
- Design and implement solutions to ensure the scalability of our infrastructure to meet rapidly increasing demands.
- Implement and manage monitoring systems to proactively identify issues and anomalies in our production environment.
- Implement fault-tolerant and resilient design patterns to minimize service disruptions.
- Build and maintain automation tools to streamline repetitive tasks and improve system reliability.
- Participate in an on-call rotation alongside other infrastructure developers to respond to critical incidents and ensure 24/7 system availability.
- Develop and maintain service level objectives (SLOs) and service level indicators (SLIs) to measure and ensure system reliability.
Experience
- 10+ years of proven work experience as a reliability engineer, production engineer, infrastructure software engineer, or in a similar role at a fast-paced, rapidly scaling company.
- Strong proficiency in GPU cloud infrastructure, including the underlying concepts of scheduling, scaling, cloud storage, networking and security.
- Proficiency in programming/scripting languages.
- Experience with containerization technologies and container orchestration platforms like Kubernetes or equivalent.
- Knowledge of IaC tools such as Terraform, CloudFormation, or equivalent.
- Excellent problem-solving and troubleshooting skills.
- Strong communication and collaboration skills.
- Experience with observability tools such as DataDog, Prometheus, Grafana, Splunk, the ELK stack, or similar.
- Knowledge of security best practices in cloud environments.
- Experience as an SRE within the AI/ML space is strongly preferred.
Compensation
- The pay range for this position in California is $200,000 - $250,000/yr; however, the base pay offered may vary depending on job-related knowledge, skills, candidate location, and experience. We also offer competitive equity packages in the form of stock options and a comprehensive benefits plan.