Institute for Foundation Models

Machine Learning Infrastructure Engineer

Sunnyvale, CA
Python, DeepSpeed, FSDP, FairScale, Horovod, PyTorch, JAX, Slurm, Kubernetes, Ray, NCCL, GLOO, CUDA, Triton
Description

Team: Engineering

Location: Sunnyvale, CA

Commitment: Full-time

Workplace Type: On-site

Salary:


• Comprehensive medical, dental, and vision 
• 401(k) program 
• Generous PTO, sick leave, and holidays
• Paid parental leave and family-friendly benefits
• On-site amenities and perks: Complimentary lunch, gym access, and a short walk to the Sunnyvale Caltrain station

About the Institute for Foundation Models
We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you’ll have the opportunity to work on the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.

The Role 

We're looking for a distributed ML infrastructure engineer to help extend and scale our training systems. You’ll work side-by-side with world-class researchers and engineers to: 
• Extend distributed training frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod) 
• Implement distributed optimizers from mathematical specs 
• Build robust config + launch systems across multi-node, multi-GPU clusters 
• Own experiment tracking, metrics logging, and job monitoring for external visibility 
• Improve training system reliability, maintainability, and performance

While much of the work will support large-scale pre-training, pre-training experience is not required. Strong infrastructure and systems experience is what we value most.

Key Responsibilities 

• Distributed Framework Ownership – Extend or modify training frameworks (e.g., DeepSpeed, FSDP) to support new use cases and architectures. 
• Optimizer Implementation – Translate mathematical optimizer specs into distributed implementations. 
• Launch Config & Debugging – Create and debug multi-node launch scripts with flexible batch sizes, parallelism strategies, and hardware targets. 
• Metrics & Monitoring – Build systems for experiment tracking, job monitoring, and logging usable by collaborators and researchers. 
• Infra Engineering – Write production-quality code and tests for ML infra in PyTorch or JAX; ensure reliability and maintainability at scale. 

Qualifications
Must-Haves: 
• 5+ years of experience in ML systems, infra, or distributed training 
• Experience modifying distributed ML frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod) 
• Strong software engineering fundamentals (Python, systems design, testing) 
• Proven multi-node experience (e.g., Slurm, Kubernetes, Ray) and debugging skills (e.g., NCCL/GLOO) 
• Ability to implement algorithms across GPUs/nodes based on mathematical specs 
• Experience working on an ML platform, infrastructure, and/or distributed inference optimization team
• Experience with large-scale machine learning workloads (strong ML fundamentals) 

Nice-to-Haves: 
• Exposure to mixed-precision training (e.g., bf16, fp8) with accuracy validation 
• Familiarity with performance profiling, kernel fusion, or memory optimization 
• Open-source contributions or published research (MLSys, ICML, NeurIPS) 
• CUDA or Triton kernel experience 
• Experience with large-scale pre-training  
• Experience building custom training pipelines at scale and modifying them for custom needs 
• Deep familiarity with training infrastructure and performance tuning 