ML Platform Engineer
Location: Austin, TX
Department: Infrastructure
About the team
The ML Platform team at Avride builds the infrastructure that powers large-scale ML training and data processing for autonomous driving. We sit between Cloud Platform and ML engineers, turning low-level compute, storage, and networking primitives into an ML platform that teams actually use — scalable orchestration, distributed compute, and production-grade tooling for the full model lifecycle.
About the role
As an ML Platform Engineer at Avride, you will own critical pieces of the ML stack: workflow orchestration, distributed execution, resource governance, and performance. You will shape how ML teams across the company run experiments and train models at scale. You will build the abstractions and services that make training workloads on Kubernetes reliable, cost-efficient, and fast, with an excellent developer experience.
What you will do
- Build and scale our ML compute platform on Kubernetes, using Argo Workflows for training, evaluation, and data processing orchestration
- Design and implement core platform capabilities, including a Ray-based internal SDK for distributed execution, and multi-tenant resource governance — scheduling, priorities, quotas, and policy enforcement across GPU, CPU, memory, and IO
- Improve end-to-end training throughput and platform efficiency by optimizing data access patterns, caching, and removing bottlenecks in storage, network, and resource contention
- Work directly with ML teams to debug complex workload issues, drive root-cause analysis, and turn recurring problems into platform-level fixes
- Evaluate, integrate, and extend open-source tooling (Argo Workflows, Ray, Kubernetes ecosystem) to meet evolving platform needs
What you will need
- Strong proficiency in Python or Go; C++ is a plus
- Track record of designing and building scalable, maintainable systems and services
- Experience operating production services end-to-end: APIs, reliability practices, observability
- Deep knowledge of Kubernetes: how scheduling, resource management, controllers, and pod lifecycle actually behave under pressure
- Solid Linux and systems debugging skills: performance investigation, networking, storage/IO
- Ability to troubleshoot complex production issues across logs, metrics, and traces and drive them to resolution
Nice to have
- Experience with Argo Workflows, Ray, MLflow, or comparable distributed ML tooling
- Hands-on experience building or operating large-scale ML training systems: GPU scheduling, distributed training, training data pipelines
- Track record of optimizing resource usage and performance in distributed environments
Candidates are required to be authorized to work in the U.S. The employer is not offering relocation sponsorship, and remote work options are not available.
Avride is an equal opportunity employer and is committed to providing reasonable accommodations to qualified applicants and employees with disabilities to ensure they have equal access to employment opportunities. Avride complies with the Americans with Disabilities Act (ADA). If you need a reasonable accommodation to assist with the application or hiring process, or to perform the essential functions of a job, please email [email protected].
