Staff DevOps Engineer/SRE
Location: Bangalore, India
Department: InfraOps
Role Overview
FlexAI is looking for a Staff DevOps / SRE Engineer to define our infrastructure strategy, establish SRE best practices, and build systems capable of running large-scale AI workloads across distributed, multi-cloud environments.
You’ll work closely with developers to ensure our platform is reliable, performant, and scalable — without slowing down product velocity.
What You’ll Do
Own Reliability & Architecture:
- Design and evolve the infrastructure backbone for our AI and PaaS platform
- Build highly available, fault-tolerant, and scalable systems
- Define and drive SRE practices (SLIs, SLOs, error budgets)
Build Infrastructure at Scale:
- Lead Infrastructure as Code using Pulumi
- Own and scale Kubernetes clusters and containerized workloads
- Standardize and automate infrastructure for global deployments
CI/CD & Automation:
- Design and scale CI/CD pipelines for fast, reliable releases
- Build self-healing systems and automated remediation workflows
- Drive GitOps and platform engineering practices
Observability & Performance:
- Implement end-to-end observability using VictoriaMetrics and Grafana (metrics, logs, traces)
- Identify and resolve performance bottlenecks (latency, throughput, cost)
- Lead incident response, root cause analysis, and postmortems
Leadership & Collaboration:
- Partner with backend, AI, runtime, and security teams
- Guide infrastructure decisions and scaling strategy
- Mentor engineers and raise the bar on reliability and engineering standards
Security & Resilience:
- Embed security into infrastructure and deployment workflows
- Design for resilience (disaster recovery, chaos testing, capacity planning)
What You'll Need to Be Successful
- 8+ years of experience in DevOps, SRE, or Infrastructure Engineering
- Proven experience operating large-scale, distributed systems in production
- Deep expertise in:
  - Kubernetes & container orchestration
  - Pulumi (or similar IaC tools)
  - Cloud or hybrid environments (AWS, GCP, Azure, or on-prem)
  - Observability stacks (Prometheus, Grafana, OpenTelemetry)
- Strong experience with CI/CD, automation, and release engineering
- Proficiency in Python, Go, or Bash
- Strong systems thinking and debugging skills in high-scale environments
- Experience defining and operating with SLOs / SLAs
- Experience in startup environments
- Comfortable leveraging AI coding tools and agents to move faster
Nice to Have
- Experience with AI/ML infrastructure or GPU workloads
- Familiarity with distributed or high-performance compute systems
- Exposure to platform engineering / internal developer platforms
- Experience scaling systems from beta to production
Why FlexAI
- Work on cutting-edge AI infrastructure
- Build systems that power developers and enterprises
- High ownership, fast execution, real impact
- Collaborative, high-caliber team
About FlexAI
Build and Deploy AI the right way, anywhere.
The FlexAI Compute Infrastructure Platform provides an end-to-end AI compute layer for running and managing workloads across any cloud, any GPU, and any deployment model (public, hybrid, or on-prem). It brings together 1-click simplicity for users with enterprise-grade orchestration, security, and automation under the hood.
Founded by Brijesh Tripathi, who brings experience from Nvidia, Apple, Tesla, Intel, and Zoox, FlexAI is not just building a product; we’re shaping the future of AI. Our teams are strategically distributed across Silicon Valley and Bengaluru, united by a shared mission: to deliver more compute with less complexity.
If you're passionate about shaping the future of artificial intelligence, driving innovation, and contributing to a sustainable and inclusive AI ecosystem, FlexAI is the place for you!
