Senior Site Reliability Engineer
Team: Cloud Services Engineering
Location: Bengaluru
Commitment: Full-Time
Workplace Type: Hybrid
We’re a fast-moving AI security company building AI-native infrastructure and applications powered by LLMs and autonomous agents. Our stack is deeply integrated with AWS, Kubernetes, and OpenAI-based systems, and we’re rethinking reliability in a world where software can reason, adapt, and self-heal.
We’re hiring a Senior Site Reliability Engineer to own reliability across our cloud-native and AI-driven platform. You’ll work at the intersection of distributed systems, Kubernetes operations, and LLM-powered automation, building systems that don’t just scale but also think and fix themselves.
WHAT YOU BRING
- 5+ years in SRE / DevOps / Platform Engineering.
- Strong hands-on experience with:
  - AWS infrastructure at scale
  - Kubernetes (production-grade clusters)
- Proven ability to debug complex distributed systems under pressure.
- Strong coding skills (Python or Go); you build internal platforms and tools.
- Experience implementing monitoring, alerting, and incident management systems.
Bonus (AI / LLM Focus)
- Experience working with LLM APIs such as the OpenAI API.
- Familiarity with agent frameworks like:
  - LangChain
  - AutoGen
- Built or experimented with:
  - AI agents for DevOps / SRE workflows
  - Retrieval-Augmented Generation (RAG) systems
  - Vector databases (Pinecone, Weaviate, etc.)
- Exposure to AIOps or intelligent automation systems.
WHAT YOU WILL BE DOING
- Own uptime, reliability, and performance of services running on AWS + Kubernetes (EKS).
- Design and implement self-healing infrastructure using automation and AI agents.
- Build LLM-powered operational tooling using APIs such as the OpenAI API for:
  - Intelligent alert triage
  - Incident summarization
  - Root cause analysis
  - Runbook automation
- Manage and scale Kubernetes workloads:
  - Deployments, autoscaling, resource optimization
  - Cluster reliability and cost efficiency
- Build and evolve observability systems:
  - Metrics (Prometheus), dashboards (Grafana)
  - Logs (ELK / OpenSearch)
  - Tracing (OpenTelemetry)
- Define and enforce SLOs, SLAs, and error budgets tied to business metrics.
- Automate infrastructure using Terraform and CI/CD pipelines.
- Lead incident response, postmortems, and continuous reliability improvements.
- Introduce chaos engineering practices to proactively test system resilience.
