Site Reliability Engineer - SRE
Team: Engineering
Location: India
Commitment: Full Time
Workplace Type: Remote
As a Senior Site Reliability Engineer at Drivetrain, you will be a cornerstone of our engineering organization, ensuring our fast-growing SaaS platform remains highly available, performant, and secure. At this stage of our growth, scaling infrastructure efficiently while maintaining the rigorous security and reliability standards required for financial data is paramount. You will take ownership of our multi-cloud infrastructure, drive automation, champion observability, and collaborate closely with development teams to build a culture of reliability from code commit to production.
Key Responsibilities
Cloud Infrastructure & Orchestration
- Multi-Cloud Management: Architect, manage, and continuously optimize highly available cloud infrastructure across both AWS and GCP. Balance workload demands to ensure maximum cost-efficiency, scalability, and strict security compliance across both platforms.
- Advanced Kubernetes Orchestration: Lead the design, deployment, and management of scalable Kubernetes clusters. Utilize configuration management tools like Kustomize to enforce standardized, repeatable, and automated deployment configurations across all environments.
- Service Mesh & Security Integration: Implement and maintain service mesh technologies (e.g., Istio, Linkerd) to secure, control, and observe service-to-service communication. Drive container security best practices, including image scanning, runtime protection, and strict RBAC enforcement.
CI/CD & Automation
- Pipeline Engineering: Architect, maintain, and optimize robust CI/CD pipelines using Git and Jenkins. Focus on reducing deployment friction, accelerating release velocity, and enforcing automated testing and security gates.
- Infrastructure as Code (IaC): Treat infrastructure as software. Write, review, and maintain Terraform modules to provision and manage cloud resources predictably and safely.
- Operational Automation: Aggressively reduce operational toil. Develop robust Python scripts and tooling to automate routine maintenance, data backups, scaling operations, and system recovery processes.
Observability & Reliability
- Comprehensive Monitoring: Design and enhance our observability stack to provide deep, real-time insights into system health. Manage and scale tools including Prometheus, Grafana, ELK/EFK stack, AWS CloudWatch, and GCP Operations Suite.
- Reliability Engineering: Spearhead reliability initiatives critical to a scaling SaaS platform. Drive rigorous capacity planning exercises to stay ahead of growth.
- Incident Management & SLOs: Own the incident response lifecycle. Facilitate blameless postmortems to extract actionable learnings. Define, track, and enforce SLIs, SLOs, and SLAs, ensuring the platform consistently meets its reliability guarantees.
Collaboration & Leadership
- DevOps Culture: Act as an embedded reliability advocate. Collaborate closely with software engineers early in the development lifecycle to ensure applications are designed for deployability, scalability, and resilience.
- Continuous Improvement: Proactively identify system bottlenecks and architectural weaknesses. Contribute to process improvements, build internal developer tooling, and maintain comprehensive documentation to elevate team productivity and system understanding.
Required Proficiency & Qualifications
- Experience: 5+ years of hands-on experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles, preferably within a fast-paced SaaS environment.
- Cloud Platforms: Deep, proven proficiency in AWS (EC2, EKS, RDS, VPC, IAM, S3) AND GCP (GKE, Compute Engine, Cloud SQL, IAM, Cloud Storage). Ability to navigate and optimize multi-cloud architectures.
- Containerization: Expert-level knowledge of Docker and Kubernetes, including advanced deployment strategies and lifecycle management.
- Automation/IaC: Strong programming skills in Python and extensive experience with Terraform.
- Observability: Hands-on expertise building dashboards and alerting systems using Prometheus, Grafana, and log aggregation stacks (ELK/EFK).
- Networking & Security: Solid understanding of cloud networking (VPC peering, load balancing, DNS) and zero-trust security principles in a containerized environment.
