Coralogix is a modern, full-stack observability platform transforming how businesses process and understand their data. Our unique architecture powers in-stream analytics without reliance on expensive indexing or hot storage. We specialize in comprehensive monitoring of logs, metrics, traces, and security events, with features such as APM, RUM, SIEM, Kubernetes monitoring, and more, all enhancing operational efficiency and reducing observability spend by up to 70%.
We are seeking a Site Reliability Engineering (SRE) Group Leader to join our fast-paced and dynamic environment. In this role, you will be at the forefront of ensuring the availability, stability, and performance of Coralogix's production platform. You will lead three specialized teams focusing on production availability and stability, observability, and production insights, while maintaining 99.9% uptime and ensuring immediate response to production issues. This role requires deep expertise in cloud technologies, Kubernetes, and the observability ecosystem. You'll work collaboratively across teams, setting objectives, defining metrics, and driving measurable improvements in platform reliability.
Key Responsibilities
- Production Reliability: Ensure the platform achieves and maintains 99.9% uptime by implementing robust SRE practices.
- Incident Response: Oversee immediate response to any production issues, ensuring timely resolution and minimizing downtime.
- Strategic Leadership: Lead and mentor three teams specializing in production availability, observability, and production insights, fostering a culture of accountability and collaboration.
- Cloud and Kubernetes Expertise: Drive optimization and reliability improvements using cloud technologies, Kubernetes, and Kubernetes operators.
- Observability Leadership: Develop and enhance observability solutions, ensuring comprehensive monitoring, alerting, and actionable insights across production systems.
- Data-Driven Decision-Making: Leverage production insights and metrics to drive system optimization and improvements.
- Cross-Team Collaboration: Partner with engineering, product, and support teams to align on priorities, objectives, and deliverables for production excellence.
Qualifications
- Production Focus: Extensive experience managing large-scale production systems with a focus on maintaining high availability (≥99.9%).
- Incident Management Expertise: Proven ability to manage incident response processes and ensure rapid resolution of production issues.
- Observability Knowledge: Strong understanding of observability tools like Prometheus, Grafana, OpenTelemetry, and the broader observability ecosystem.
- Leadership Skills: Proven ability to manage and scale engineering teams, with experience leading multiple teams or groups.
- OKR Experience: Ability to define objectives, measure performance, and drive results through OKR frameworks.
- Problem-Solving Skills: Demonstrated expertise in troubleshooting and optimizing distributed systems and cloud environments.
- Collaboration Skills: Strong ability to work across teams and departments, aligning technical efforts with organizational goals.
Preferred Qualifications:
- Experience at companies in the observability domain (e.g., Datadog, New Relic, Sumo Logic).
- Familiarity with incident management tools (PagerDuty, OpsGenie, etc.) and chaos engineering practices.
- Background in designing and implementing SLOs for production systems.
- Experience optimizing systems for high-throughput and low-latency workloads.
Cultural Fit
We’re seeking candidates who are hungry, humble, and smart. Coralogix fosters a culture of innovation and continuous learning, where team members are encouraged to challenge the status quo and contribute to our shared mission. If you thrive in dynamic environments and are eager to shape the future of observability solutions, we’d love to hear from you.
Coralogix is an equal opportunity employer and encourages applicants from all backgrounds to apply.