Essential AI

Member of Technical Staff: Machine Learning Infrastructure Engineer

San Francisco, CA
Up to USD 225k
Machine Learning Kubernetes Docker AWS GCP
Description

About Us

Essential AI’s mission is to deepen the partnership between humans and computers, unlocking collaborative capabilities that far exceed what could be achieved today. We believe that building delightful end-user experiences requires innovating across the stack - from the UX all the way down to models that achieve the best user value per FLOP.

We believe that a small, focused team of motivated individuals can create outsized breakthroughs. We are building a world-class multi-disciplinary team of people who are excited to solve hard real-world AI problems. We are well-capitalized and supported by March Capital and Thrive Capital, with participation from AMD, Franklin Venture Partners, Google, KB Investment, and NVIDIA.

The Role

The Machine Learning Infrastructure Engineer will be responsible for architecting and building the compute infra that powers training and serving of our models. This requires a full understanding of the complete backend stack, from frameworks to compilers to runtimes to kernels. In addition, the role requires familiarity with tools and services common in cloud-based infra, such as Kubernetes and Docker.

What you’ll be working on

  • Design, build, and maintain scalable machine learning infrastructure to support our model training, inference and applications

  • Design and implement scalable machine learning and distributed systems that enable training and scaling of LLMs. Work on parallelism methods to make training fast and reliable

  • Develop tools and frameworks to automate and streamline ML experimentation and management

  • Collaborate with other researchers and product engineers to bring magical product experiences through large language models

  • Work on lower levels of the stack to build high-performing training and serving infrastructure, including researching new techniques and writing custom kernels as needed to achieve improvements

  • Optimize performance and efficiency across different accelerators

What we are looking for

  • A strong understanding of the architectures of new AI accelerators such as TPU, IPU, and HPU, and their tradeoffs.

  • Knowledge of parallel computing concepts and distributed systems.

  • Prior experience in performance tuning of training and/or inference LLM workloads. Experience with MLPerf or internal production workloads will be valued.

  • 6+ years of relevant industry experience in leading the design of large-scale & production ML infra systems.

  • Experience with training and building large language models using frameworks such as Megatron and DeepSpeed, and deployment frameworks such as vLLM, TGI, and TensorRT-LLM

  • Comfortable working under the hood with kernel languages like OpenAI Triton and Pallas, and compilers like XLA

  • Experience with INT8/FP8 training and inference, quantization and/or distillation

  • Knowledge of container technologies like Docker and Kubernetes and cloud platforms like AWS, GCP, etc.

  • Intermediate fluency with networking fundamentals such as VPCs, subnets, routing tables, and firewalls

We encourage you to apply for this position even if you don’t meet all of the above requirements but want to spend time pushing on these techniques.

We are based in-person in SF and work fully onsite 5 days a week. We offer relocation assistance to new employees.

The base pay range target for the role seniority described in this job description is up to $225,000 in San Francisco, CA. Final offer amounts depend on various job-related factors, including where you place on our internal performance ladders, which are based on factors including past work experience, relevant education, and performance on our interviews and our benchmarks against market compensation data. In addition to cash pay, full-time regular positions are eligible for equity, 401(k), health benefits, and other benefits like daily onsite lunches and snacks; some of these benefits may be available for part-time or temporary positions.

Essential AI commits to providing a work environment free of discrimination and harassment, as well as equal employment opportunity regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity or veteran status. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. You may view all of Essential AI’s recruiting notices here, including our EEO policy, recruitment scam notice, and recruitment agency policy.

