Institute for Foundation Models

Senior Distributed Systems Engineer

Sunnyvale, CA

Team: Engineering

Location: Sunnyvale, CA

Workplace Type: On-site

About the Institute of Foundation Models
The Institute of Foundation Models (IFM) designs and operates ultra-scale GPU supercomputing systems to train next-generation foundation models. We believe performance, fault tolerance, and scalability are co-designed across model architecture, communication systems, runtime, and hardware topology.
This role sits at the core of that effort, driving communication performance, distributed reliability, and cross-layer optimization for large-scale training workloads.

The Mission
We are looking for a deeply technical engineer to co-design and optimize the communication stack for large-scale distributed training, including hybrid parallelism and Mixture-of-Experts (MoE) workloads.
This is not a network operations role. This is a systems-level engineering position focused on performance engineering, distributed debugging, and communication-runtime co-design.
· Design and optimize expert-parallel and hybrid-parallel communication patterns
· Drive high-performance hierarchical collectives for MoE workloads
· Co-design runtime orchestration with communication topology awareness
· Reduce tail latency and improve determinism across thousands of GPUs
· Architect fault-tolerant distributed execution under real-world cluster failures

Core Technical Scope
· Communication-compute overlap and topology-aware collective optimization
· Deep debugging of NCCL, RDMA, and custom communication layers
· Hybrid expert parallel strategies in modern large-scale MoE systems
· Elastic and resilient distributed job orchestration concepts
· Congestion analysis and routing optimization across InfiniBand/RoCE fabrics
· Microbenchmarking and performance modeling for communication-heavy workloads

Expected Technical Depth
· Hybrid expert parallel communication for Mixture-of-Experts training
· Scaling behavior under network pressure
· Distributed orchestration for elastic, large-scale training
· Fault detection and recovery in distributed GPU workloads
· Cross-layer bottlenecks: GPU ↔ NIC ↔ PCIe ↔ NVSwitch ↔ Fabric ↔ Scheduler

Required Background
· Experience optimizing distributed training at 1,000+ GPU scale (or equivalent depth)
· Hands-on expertise with RDMA, InfiniBand, RoCE, and GPUDirect RDMA
· Deep familiarity with NCCL and/or UCX internals
· Strong systems programming ability (C/C++, Rust, or Go)
· Strong familiarity with modern model training frameworks such as PyTorch
· Ability to troubleshoot and profile training performance issues related to communication bottlenecks
· Ability to translate research ideas into production-grade optimizations
· Experience debugging distributed hangs, desynchronization, and performance regressions

What We Mean by "Hardcore"
· You can explain why communication degrades at scale and how to fix it
· You have improved real cluster throughput via communication redesign
· You can trace a distributed hang across ranks and identify the root cause
· You are comfortable working at the boundary between hardware and runtime

Application Requirements
· Include a link to your GitHub (required)
· Provide links to relevant distributed systems, HPC, or large-scale training projects
· Include a list of publications and/or public technical reports (if applicable)
· Describe the hardest distributed debugging problem you solved
· Include measurable performance improvements you have delivered

Academic Qualifications
Master’s degree, or Bachelor’s degree plus 1 year of relevant experience.

Visa Sponsorship
This position is eligible for visa sponsorship.

Benefits Include
· Comprehensive medical, dental, and vision benefits
· Bonus
· 401(k) plan
· Generous paid time off, sick leave, and holidays
· Paid parental leave
· Employee Assistance Program
· Life and disability insurance