Staff ML Systems Engineer, Distributed Systems
Team: Systems Engineering
Location: Seattle, WA; Irvine, CA
Commitment: Full time
Workplace Type: On-site
Salary:
Our salary range is highly competitive with the market, and we consider each individual's background and experience when determining final salary. Base pay offered may vary depending on geographic location, job-related knowledge, skills, and experience.
In addition to competitive compensation, FieldAI offers comprehensive benefits, equity participation, and the opportunity to contribute to cutting-edge advancements in AI and robotics.
We are seeking a Senior / Staff ML Systems Engineer to architect and build the distributed infrastructure that powers large-scale machine learning workflows across the organization.
This role sits at the intersection of machine learning, distributed systems, and platform engineering. You will be responsible for designing scalable systems that support data processing, model training, evaluation, and post-processing pipelines while enabling ML teams to efficiently develop, operate, and scale production-grade workflows.
You will play a critical role in defining the architectural patterns, tooling, and infrastructure that underpin our machine learning platform.
What You'll Get To Do
- Design and build scalable distributed machine learning pipelines across data processing, model training, evaluation, and post-processing workflows.
- Architect distributed execution systems, including parallelization strategies, workload scheduling, resource allocation, and fault tolerance mechanisms.
- Develop reusable abstractions, frameworks, and libraries that simplify distributed pipeline development.
- Optimize performance across distributed CPU and GPU environments, improving throughput, utilization, and reliability.
- Design systems that effectively manage data partitioning, memory utilization, serialization overhead, and compute efficiency.
- Partner closely with ML engineers, data engineers, and infrastructure teams to productionize research workflows and enable large-scale model development.
- Establish best practices and engineering standards for distributed machine learning infrastructure.
- Evaluate and guide decisions around distributed computing frameworks, infrastructure technologies, and system design trade-offs.
- Improve observability, debugging, monitoring, and operational tooling for distributed systems at scale.
What You Have
- 5+ years of experience building distributed systems, backend infrastructure, machine learning platforms, or large-scale data processing systems.
- Strong Python programming skills, including experience with concurrency, performance optimization, and systems development.
- Experience with distributed computing frameworks such as Ray, Spark, Dask, Flink, or similar technologies.
- Experience designing and scaling data pipelines or machine learning workflows.
- Strong system design skills with demonstrated expertise in scalability, reliability, and performance optimization.
- Experience diagnosing and resolving bottlenecks in distributed environments.
- Ability to work cross-functionally and drive technical decisions across multiple teams.
The Extras That Set You Apart
- Experience building infrastructure for machine learning training and inference systems.
- Familiarity with modern ML frameworks such as PyTorch or TensorFlow.
- Experience with multi-node or multi-GPU training architectures, including DDP, FSDP, DeepSpeed, or similar technologies.
- Experience operating Kubernetes-based infrastructure and large-scale cloud systems.
- Deep understanding of distributed systems concepts including data locality, serialization costs, scheduling, and resource management.
- Experience with distributed debugging, observability, and workflow orchestration platforms.
- Proven ability to establish technical direction and influence architecture across organizations.
