System Reliability Engineer / T2 Support Engineer
Location: Gurugram, Haryana, India
Department: Technology
Workplace: On-site
Description
About the Role
We are looking for an engineer who enjoys understanding how systems behave in real production, not just writing features. This role is responsible for maintaining the reliability, stability, and smooth functioning of our live platform running on Google Cloud.
You will act as the first technical owner of production systems: monitoring services, investigating alerts, resolving issues, and performing controlled configuration and operational changes. This role works closely with backend developers, QA, and infrastructure teams to prevent incidents and reduce downtime.
This is not a call-center support role and not a pure development role; it is a hands-on technical position focused on debugging, incident handling, and system operations.
Tech Stack
- Google Cloud Platform (Compute, Logging, Monitoring)
- Java (Spring Boot based microservices)
- MongoDB
- Apache Kafka (event-driven architecture)
- Redis cache
- Linux servers
Key Responsibilities
Production Monitoring & Alert Handling
- Monitor application health, latency, errors, consumer lag, database connections, and resource utilization
- Acknowledge and investigate monitoring alerts
- Perform first-level troubleshooting and stabilize services
- Identify whether an issue is infrastructure, application, database, or messaging related
Incident Response
- Participate in on-call rotation
- Diagnose production incidents and restore services with minimal downtime
- Safely restart services, scale instances, or roll back deployments when required
- Communicate incident status to stakeholders
Technical Support & Operational Changes
- Handle technical support tickets requiring engineering understanding
- Update configurations and feature flags
- Manage scheduled jobs / cron triggers
- Trigger or replay events in Kafka
- Assist in minor Java configuration/code fixes when needed
- Coordinate production releases
Database & Messaging Operations
- Investigate MongoDB performance issues and slow queries
- Monitor and resolve Kafka consumer lag and stuck messages
- Manage Redis cache behavior (TTL, eviction, connection issues)
Logs & RCA
- Analyze logs and metrics to determine root cause of issues
- Prepare basic Root Cause Analysis (RCA) reports
- Suggest preventive actions to reduce recurring incidents
Requirements
Required Skills
Core Technical Skills
- Good understanding of Linux commands and server behavior
- Experience analyzing application logs and debugging runtime issues
- Basic Java knowledge (stack trace reading, configuration changes, rebuild & deploy)
- Practical experience with MongoDB (indexes, connections, slow queries)
- Understanding of Kafka concepts (consumer, offset, lag, partitions)
- Basic Redis knowledge (caching behavior, TTL)
Cloud & Tools
- Hands-on experience with any cloud platform (GCP preferred / AWS acceptable)
- Experience using monitoring tools (GCP Monitoring, Prometheus, Grafana, ELK, or similar)
- Understanding of REST APIs and HTTP status codes
What We Expect From You
- Ability to investigate problems logically rather than randomly restarting services
- Comfort working with live production systems
- Willingness to participate in on-call support
- Strong ownership mindset and attention to detail
- Good communication during incidents
Good to Have
- Experience in e-commerce, fintech, logistics, or high-traffic systems
- Exposure to CI/CD pipelines and deployments
- Basic scripting (Shell or Python)
- Experience writing RCA documents
Experience
3 to 6 years of relevant experience in production support, application support, SRE, DevOps operations, or similar roles.
Benefits
Why Join Us
- Direct exposure to real distributed systems
- Hands-on production debugging experience
- Opportunity to learn system architecture deeply
- Close interaction with development and platform teams
Important Note
This role involves handling live production systems and occasional on-call responsibilities. Candidates interested only in feature development or pure infrastructure automation may not find this role suitable.
