Lead Site Reliability Engineer
Team: Engineering
Location: Bangalore - EC
Commitment: Full-time
Workplace Type: onsite
Responsibilities:
- System Reliability: Ensuring the reliability of software systems by designing, implementing, and maintaining scalable and reliable infrastructure.
- Automation: Developing automation tools and scripts to streamline operational tasks, reduce manual intervention, and improve overall system efficiency.
- Incident Response and Resolution: Monitoring system performance and responding to incidents promptly to minimize downtime and ensure high availability.
- Capacity Planning: Analyzing system usage patterns and forecasting future capacity needs to ensure that the infrastructure can handle current and future demands.
- Performance Optimization: Identifying and addressing performance bottlenecks in software systems through optimization and tuning.
- Infrastructure as Code (IaC): Implementing infrastructure as code practices, using tools like Terraform or Ansible, to define and manage infrastructure in a version-controlled and automated manner.
- Monitoring and Logging: Implementing and maintaining monitoring and logging solutions to gain insights into system behavior, troubleshoot issues, and proactively address potential problems.
- Security: Collaborating with security teams to implement and maintain security best practices in infrastructure and application
- Disaster Recovery Planning: Developing and maintaining disaster recovery plans to ensure that systems can quickly recover from major outages or failures
- Continuous Improvement: Continuously analyzing system performance, reliability, and incidents to identify areas for improvement and implementing changes to enhance overall system resilience.
- Team Leadership: Ability to lead and motivate a team of SREs, providing guidance and support.
- Mentorship and Coaching: Providing mentorship and coaching to team members to foster their professional development.
- Conflict Resolution: Skill in resolving conflicts and addressing challenges within the team
Skills:
- Programming Languages: Proficiency in one or more programming languages, commonly Python, Go, Shell, Bash.
- Automation and Scripting: Strong automation skills using tools like Ansible, Puppet, Chef, or custom scripts. Knowledge of Infrastructure as Code (IaC) tools like Terraform
- Containerization and Orchestration: Experience with containerization technologies like Docker and container orchestration platforms like Kubernetes.
- Cloud Computing: Proficiency in any of the cloud platforms such as AWS, Azure, or Google Cloud Platform, and knowledge of managing infrastructure in the cloud.
- Monitoring and Logging: Familiarity with monitoring tools (e.g., Prometheus, Grafana, ELK stack) and logging frameworks to track system performance and troubleshoot issues.
- Networking: Understanding of networking concepts, protocols, and troubleshooting skills.
- Security: Knowledge of security best practices, including encryption, access controls, and vulnerability management.
- Continuous Integration/Continuous Deployment (CI/CD): Understanding and implementation of CI/CD pipelines for automated testing and deployment.
- Load Balancing: Experience in incident response, troubleshooting, and resolution.
- Version Control: Proficient use of version control systems like Git.
Experience & Qualifications:
- 7 - 9 years of experience in site reliability engineering.
- B.Tech/M.Tech in computer science, information technology or a related field.
- Having experience working for a product organization is a plus.
- Certifications from cloud service providers like AWS Certified DevOps Engineer, Google Cloud Professional DevOps Engineer, or Microsoft Certified is a plus
There are more than 50,000 engineering jobs:
Subscribe to membership and unlock all jobs
Engineering Jobs
60,000+ jobs from 4,500+ well-funded companies
Updated Daily
New jobs are added every day as companies post them
Refined Search
Use filters like skill, location, etc to narrow results
Become a member
🥳🥳🥳 452 happy customers and counting...
Overall, over 80% of customers chose to renew their subscriptions after the initial sign-up.
To try it out
For active job seekers
For those who are passive looking
Cancel anytime
Frequently Asked Questions
- We prioritize job seekers as our customers, unlike bigger job sites, by charging a small fee to provide them with curated access to the best companies and up-to-date jobs. This focus allows us to deliver a more personalized and effective job search experience.
- We've got over 200,000 jobs from 15,000+ vetted companies. No fake or sleazy jobs here!
- We aggregate jobs from 15,000+ companies' career pages, so you can be sure that you're getting the most up-to-date and relevant jobs.
- We're the only job board *for* software engineers, *by* software engineers… in case you needed a reminder! We add thousands of new jobs daily and offer powerful search filters just for you. 🛠️
- Every single hour! We add 2,000-3,000 new jobs daily, so you'll always have fresh opportunities. 🚀
- Typically, job searches take 3-6 months. EchoJobs helps you spend more time applying and less time hunting. 🎯
- Check daily! We're always updating with new jobs. Set up job alerts for even quicker access. 📅
What Fellow Engineers Say
