Staff Machine Learning Engineer (Infra)
Location: San Francisco
Department: Machine Learning
Who are we?
Aarki is an AI-driven company specializing in mobile advertising solutions designed to fuel revenue growth. We leverage AI to discover audiences in a privacy-first environment through trillions of contextual bidding signals and proprietary behavioral models. Our audience engagement platform includes creative strategy and execution. We handle 5 million mobile ad requests per second from over 10 billion devices, driving performance for both publishers and brands. We are headquartered in San Francisco, CA, with a global presence across the United States, EMEA, and APAC.
Role Overview
We are seeking a Staff Machine Learning Engineer (Infra) to design, build, and operate the model training and deployment infrastructure that powers our Demand-Side Platform (DSP). This role focuses on building scalable, flexible, and reliable systems for training models on billions of records across bidding, ranking, pacing, and fraud use cases.
You will work at the intersection of machine learning, data platforms, and infrastructure, with a strong focus on automation, reproducibility, and reliability. The ideal candidate has experience building production-grade ML training systems and is motivated by improving the velocity and reliability of model development.
What will you do?
- Own and evolve shared ML infrastructure for training, deployment, and lifecycle management; deliver measurable gains in reliability, cost, and developer velocity.
- Lead cross-pod initiatives end-to-end (design → build → production), reducing org bottlenecks and aligning stakeholders on goals and success metrics.
- Build scalable training and orchestration systems (Prefect-first) for billion-scale datasets with strong failure recovery and backfill support.
- Build and operate high-throughput, low-latency serving/inference systems for DSP models (bidding, ranking, pacing, fraud), including safe rollouts and performance guardrails.
- Establish ML observability across the lifecycle: data quality, training stability, drift/anomalies, and regression monitoring with actionable alerting and runbooks.
- Standardize reproducibility and governance: versioning, lineage/traceability, and experiment tracking (MLflow), with clear production readiness criteria.
- Drive operational excellence for owned components: on-call ownership, incident response, postmortems, and reliability improvements.
- Build foundations for feature management (feature pipelines/feature store) and offline/online consistency guarantees.
What are we looking for?
- 6+ years building and operating production ML systems, including training pipelines and online inference.
- Strong Python and Spark skills for large-scale data processing (on-prem/YARN environments preferred).
- Proven experience with workflow orchestration for ML (Prefect or similar) and production-grade automation.
- Experience designing and operating serving systems in high-throughput, low-latency environments (REST/gRPC, canary/rollback strategies).
- Strong DevOps/MLOps practices: CI/CD, automated testing, infrastructure as code, and reliability engineering.
- Strong understanding of experimentation and reproducibility: dataset/model versioning, lineage, and traceability; MLflow familiarity preferred.
- Solid grounding in core ML methods to evaluate and diagnose model/data issues.
- Strong communication skills across ML and engineering stakeholders.
Nice-to-Have
- Familiarity with systems programming languages such as C++ or Rust.
- Strong grasp of probability, statistics, and data analysis principles.
- Ad-tech familiarity: auction dynamics, pacing, fraud signals, creative personalization.
