FlexAI

Staff AI Runtime Engineer

Santa Clara, CA
Python C++ PyTorch TensorFlow JAX Kubernetes Ray TorchElastic
Description

Staff AI Runtime Engineer

Location: Santa Clara, CA

Department: Engine

Role Overview

At FlexAI, we’re building a high-performance, cloud-agnostic AI compute platform designed for next-generation training and inference workloads. As a Staff AI Runtime Engineer, you’ll play a pivotal role in the design, development, and optimization of the core runtime infrastructure that powers distributed training and deployment of large AI models (LLMs and beyond).


This is a hands-on leadership role - perfect for a systems-minded software engineer who thrives at the intersection of AI workloads, runtimes, and performance-critical infrastructure. You’ll own critical components of our PyTorch-based stack, lead technical direction, and collaborate across engineering, research, and product to push the boundaries of elastic, fault-tolerant, high-performance model execution.


What You'll Do

Lead Runtime Design & Development:

  • Own the core runtime architecture supporting AI training and inference at scale.
  • Design resilient and elastic runtime features (e.g. dynamic node scaling, job recovery) within our custom PyTorch stack.
  • Optimize distributed training reliability, orchestration, and job-level fault tolerance.

Drive Performance at Scale:

  • Profile and enhance low-level system performance across training and inference pipelines.
  • Improve packaging, deployment, and integration of customer models in production environments.
  • Ensure consistent throughput, latency, and reliability metrics across multi-node, multi-GPU setups.

Build Internal Tooling & Frameworks:

  • Design and maintain libraries and services that support model lifecycle: training, checkpointing, fault recovery, packaging, and deployment.
  • Implement observability hooks, diagnostics, and resilience mechanisms for deep learning workloads.
  • Champion best practices in CI/CD, testing, and software quality across the AI Runtime stack.

Collaborate & Mentor:

  • Work cross-functionally with Research, Infrastructure, and Product teams to align runtime development with customer and platform needs.
  • Guide technical discussions, mentor junior engineers, and help scale the AI Runtime team’s capabilities.



What You’ll Need to Be Successful

  • 8+ years of experience in systems/software engineering, with deep exposure to AI runtime, distributed systems, or compiler/runtime interaction.
  • Experience in delivering PaaS services.
  • Proven experience optimizing and scaling deep learning runtimes (e.g. PyTorch, TensorFlow, JAX) for large-scale training and/or inference.
  • Strong programming skills in Python and C++ (Go or Rust is a plus).
  • Familiarity with distributed training frameworkslow-level performance tuning, and resource orchestration.
  • Experience working with multi-GPUmulti-node, or cloud-native AI workloads.
  • Solid understanding of containerized workloads, job scheduling, and failure recovery in production environments.

Nice to Have

  • Contributions to PyTorch internals or open-source DL infrastructure projects.
  • Familiarity with LLM training pipelinescheckpointing, or elastic training orchestration.
  • Experience with KubernetesRayTorchElastic, or custom AI job orchestrators.
  • Background in systems researchcompilers, or runtime architecture for HPC or ML.
  • Start up previous experience

This position is In-Person and located at our Santa Clara, CA Office.


What We Offer

  • A competitive salary and benefits package
  • Work on cutting-edge AI infrastructure
  • Build products used by developers and enterprises
  • High ownership, fast execution, real impact
  • Collaborative, high-caliber team

About the Company

About FlexAI

Build and Deploy AI the right way, anywhere.

The FlexAI Compute Infrastructure Platform provides an "end-to-end AI compute layer" for running and managing workloads across any cloud, any GPU, and any deployment model (public, hybrid, or on-prem). It brings together "1-click simplicity" for users with "enterprise-grade orchestration, security, and automation" under the hood.


Founded by Brijesh Tripathi, who bring experience from Nvidia, Apple, Tesla, Intel and Zoox, FlexAI is not just building a product – we’re shaping the future of AI. Our teams are strategically distributed across Silicon Valley and Bengaluru, united by a shared mission: to deliver more compute with less complexity.

If you're passionate about shaping the future of artificial intelligence, driving innovation, and contributing to a sustainable and inclusive AI ecosystem, FlexAI is the place for you !

FlexAI
FlexAI

0 applies

0 views

There are more than 50,000 engineering jobs:

Subscribe to membership and unlock all jobs

Engineering Jobs

60,000+ jobs from 4,500+ well-funded companies

Updated Daily

New jobs are added every day as companies post them

Refined Search

Use filters like skill, location, etc to narrow results

Become a member

🥳🥳🥳 452 happy customers and counting...

Overall, over 80% of customers chose to renew their subscriptions after the initial sign-up.

To try it out

For active job seekers

For those who are passive looking

Cancel anytime

Frequently Asked Questions

  • We prioritize job seekers as our customers, unlike bigger job sites, by charging a small fee to provide them with curated access to the best companies and up-to-date jobs. This focus allows us to deliver a more personalized and effective job search experience.
  • We've got over 200,000 jobs from 15,000+ vetted companies. No fake or sleazy jobs here!
  • We aggregate jobs from 15,000+ companies' career pages, so you can be sure that you're getting the most up-to-date and relevant jobs.
  • We're the only job board *for* software engineers, *by* software engineers… in case you needed a reminder! We add thousands of new jobs daily and offer powerful search filters just for you. 🛠️
  • Every single hour! We add 2,000-3,000 new jobs daily, so you'll always have fresh opportunities. 🚀
  • Typically, job searches take 3-6 months. EchoJobs helps you spend more time applying and less time hunting. 🎯
  • Check daily! We're always updating with new jobs. Set up job alerts for even quicker access. 📅

What Fellow Engineers Say