Meta

Software Engineer, Systems ML - Scaling / Performance

Menlo Park, CA
USD 251k - 251k
PyTorch C++ Python TensorFlow Deep Learning
Description
In this role, you will be a member of the Network.AI Software team, part of the larger DC networking organization. The team develops and owns NCCL (NVIDIA Collective Communications Library), which enables multi-GPU and multi-node data communication through HPC-style collectives. NCCL is integrated into PyTorch and sits on the critical path of multi-GPU distributed training. In other words, nearly every distributed GPU-based ML workload in Meta production goes through the software stack the team owns. At a high level, the team works on enablement, reliability, and performance for large-scale GenAI Llama training. Currently, the team is focused on NCCL feature development to enable reliable, high-performance distributed Llama training. The team's work includes developing customized NCCL and distributed training features, automating production processes, benchmarking software, optimizing performance, and enhancing the software stacks around NCCL and PyTorch. We are seeking a technical lead to drive scaling, reliability, and performance for Generative AI/LLM training.
Software Engineer, Systems ML - Scaling / Performance Responsibilities
  • Tech Lead for the overall distributed ML enablement and performance on Meta's large-scale GPU training infra with a focus on GenerativeAI/LLM scaling
Minimum Qualifications
  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience.
  • Proven C/C++ and Python programming skills
  • Proven track record of leading successful projects
  • Effective leadership and communication skills
Preferred Qualifications
  • PhD in Computer Science, Computer Engineering, or relevant technical field
  • Experience with collective communication library development (MPI, NCCL, etc.)
  • Experience with NCCL and distributed GPU performance analysis on RoCE/InfiniBand
  • Experience working with DL frameworks like PyTorch, Caffe2 or TensorFlow
  • Experience with both data parallel and model parallel training, such as Distributed Data Parallel, Fully Sharded Data Parallel (FSDP), Tensor Parallel, and Pipeline Parallel
  • Experience in AI framework and trainer development on accelerating large-scale distributed deep learning models
  • Experience in HPC and parallel computing
  • Knowledge of GPU architectures and CUDA programming
  • Knowledge of ML, deep learning, and LLMs
For those who live in or expect to work from California if hired for this position, please click here for additional information.
About Meta
Meta builds technologies that help people connect, find communities, and grow businesses. When Facebook launched in 2004, it changed the way people connect. Apps like Messenger, Instagram and WhatsApp further empowered billions around the world. Now, Meta is moving beyond 2D screens toward immersive experiences like augmented and virtual reality to help build the next evolution in social technology. People who choose to build their careers by building with us at Meta help shape a future that will take us beyond what digital connection makes possible today—beyond the constraints of screens, the limits of distance, and even the rules of physics.
Meta is committed to providing reasonable support (called accommodations) in our recruiting processes for candidates with disabilities, long term conditions, mental health conditions or sincerely held religious beliefs, or who are neurodivergent or require pregnancy-related support. If you need support, please reach out to accommodations-ext@fb.com.
$85.10/hour to $251,000/year + bonus + equity + benefits

Individual compensation is determined by skills, qualifications, experience, and location. Compensation details listed in this posting reflect the base hourly rate, monthly rate, or annual salary only, and do not include bonus, equity or sales incentives, if applicable. In addition to base compensation, Meta offers benefits. Learn more about benefits at Meta.
Meta
Augmented Reality Metaverse Social Media Social Network Virtual Reality

