FlexAI

Staff DevOps Engineer/Site Reliability Engineer

Bangalore, India
Kubernetes Docker Terraform Python Go Bash Rust AWS GCP Azure Prometheus Grafana OpenTelemetry GitOps
Description

Location: Bangalore, India

Department: InfraOps

Role Overview

FlexAI is looking for a Staff DevOps Engineer / SRE to define our infrastructure strategy, establish SRE best practices, and build systems capable of running large-scale AI workloads across distributed, multi-cloud environments.


You’ll work closely with developers to ensure our platform is reliable, performant, and scalable — without slowing down product velocity.


What You’ll Do

Own Reliability & Architecture:

  • Design and evolve the infrastructure backbone for our AI and PaaS platform
  • Build highly available, fault-tolerant, and scalable systems
  • Define and drive SRE practices (SLIs, SLOs, error budgets)

Build Infrastructure at Scale:

  • Lead Infrastructure as Code using Pulumi
  • Own and scale Kubernetes clusters and containerized workloads
  • Standardize and automate infrastructure for global deployments

CI/CD & Automation:

  • Design and scale CI/CD pipelines for fast, reliable releases
  • Build self-healing systems and automated remediation workflows
  • Drive GitOps and platform engineering practices

Observability & Performance:

  • Implement end-to-end observability using VictoriaMetrics and Grafana (metrics, logs, traces)
  • Identify and resolve performance bottlenecks (latency, throughput, cost)
  • Lead incident response, root cause analysis, and postmortems

Leadership & Collaboration:

  • Partner with backend, AI, runtime, and security teams
  • Guide infrastructure decisions and scaling strategy
  • Mentor engineers and raise the bar on reliability and engineering standards

Security & Resilience:

  • Embed security into infrastructure and deployment workflows
  • Design for resilience (disaster recovery, chaos testing, capacity planning)

What You'll Need to Be Successful

  • 8+ years of experience in DevOps, SRE, or Infrastructure Engineering
  • Proven experience operating large-scale, distributed systems in production
  • Deep expertise in:
    • Kubernetes & container orchestration
    • Pulumi (or similar IaC tools)
    • Cloud or hybrid environments (AWS, GCP, Azure, or on-prem)
    • Observability stacks (Prometheus, Grafana, OpenTelemetry)
  • Strong experience with CI/CD, automation, and release engineering
  • Proficiency in Python, Go, or Bash
  • Strong systems thinking and debugging skills in high-scale environments
  • Experience defining and operating with SLOs / SLAs
  • Experience in startup environments
  • Comfortable leveraging AI coding tools and agents to move faster

Nice to Have

  • Experience with AI/ML infrastructure or GPU workloads
  • Familiarity with distributed or high-performance compute systems
  • Exposure to platform engineering / internal developer platforms
  • Experience scaling systems from beta to production

Why FlexAI

  • Work on cutting-edge AI infrastructure
  • Build systems that power developers and enterprises
  • High ownership, fast execution, real impact
  • Collaborative, high-caliber team

About the Company

About FlexAI

Build and Deploy AI the right way, anywhere.

The FlexAI Compute Infrastructure Platform provides an "end-to-end AI compute layer" for running and managing workloads across any cloud, any GPU, and any deployment model (public, hybrid, or on-prem). It brings together "1-click simplicity" for users with "enterprise-grade orchestration, security, and automation" under the hood.


Founded by Brijesh Tripathi, who brings experience from Nvidia, Apple, Tesla, Intel and Zoox, FlexAI is not just building a product – we’re shaping the future of AI. Our teams are strategically distributed across Silicon Valley and Bengaluru, united by a shared mission: to deliver more compute with less complexity.

If you're passionate about shaping the future of artificial intelligence, driving innovation, and contributing to a sustainable and inclusive AI ecosystem, FlexAI is the place for you!
