You're A Great Fit If
- You have extensive experience building distributed training infrastructure for language and multimodal models, with hands-on expertise in frameworks like PyTorch Distributed, DeepSpeed, or Megatron-LM
- You're passionate about solving complex systems challenges in large-scale model training, from efficient multimodal data loading to sophisticated sharding strategies to robust checkpointing mechanisms
- You have a deep understanding of hardware accelerators and networking topologies, with the ability to optimize communication patterns for different parallelism strategies
- You're skilled at identifying and resolving performance bottlenecks in training pipelines, whether they occur in data loading, computation, or communication between nodes
- You have experience working with diverse data types (text, images, video, audio) and can build data pipelines that handle heterogeneous inputs efficiently
What Sets You Apart
- You've implemented custom sharding techniques (tensor/pipeline/data parallelism) to scale training across distributed GPU clusters of varying sizes
- You have experience optimizing data pipelines for multimodal datasets with sophisticated preprocessing requirements
- You've built fault-tolerant checkpointing systems that can handle complex model states while minimizing training interruptions
- You've contributed to open-source training infrastructure projects or frameworks
- You've designed training infrastructure that works efficiently for both parameter-efficient specialized models and massive multimodal systems
What You'll Actually Do
- Design and implement high-performance, scalable training infrastructure that efficiently utilizes our GPU clusters for both specialized and large-scale multimodal models
- Build robust data loading systems that eliminate I/O bottlenecks and enable training on diverse multimodal datasets
- Develop sophisticated checkpointing mechanisms that balance memory constraints with recovery needs across different model scales
- Optimize communication patterns between nodes to minimize the overhead of distributed training for long-running experiments
- Collaborate with ML engineers to implement new model architectures and training algorithms at scale
- Create monitoring and debugging tools to ensure training stability and resource efficiency across our infrastructure
What You'll Gain
- The opportunity to solve some of the hardest systems challenges in AI, working at the intersection of distributed systems and cutting-edge multimodal machine learning
- Experience building infrastructure that powers the next generation of foundation models across the full spectrum of model scales
- The satisfaction of seeing your work directly enable breakthroughs in model capabilities and performance
