Site Reliability Engineer
Department: Tech Operations
Employment Type: Full Time
Location: Tbilisi, Georgia, Remote
About Intermedia
We are looking for an SRE to improve reliability and operational readiness with a strong focus on metrics, alerting, and event management. You will build and maintain monitoring using Prometheus/VictoriaMetrics, integrate alerts and events with BigPanda, and participate in on-call rotations to drive fast incident response and continuous improvement across Windows and Linux environments.
Key Responsibilities
- Build and operate metrics/monitoring platforms: Prometheus and/or VictoriaMetrics (scrape configs, exporters, recording rules)
- Design and maintain alerting strategy: thresholds, anomaly detection where applicable, alert routing, deduplication, and noise reduction
- Integrate monitoring/alerting and events with BigPanda (correlation, enrichment, routing, incident workflows)
- Create and maintain dashboards and operational visibility (Grafana or equivalent)
- Develop and maintain runbooks, operational playbooks, and incident response procedures
- Participate in on-call shifts: triage alerts, manage incidents, coordinate response, and lead communication during outages
- Perform root-cause analysis and postmortems, and implement corrective/preventive actions
- Improve service reliability via SLOs/SLIs, capacity planning, and automation to reduce toil
- Support monitoring for core infrastructure and services on Windows and Linux, including HA components and clusters
- Collaborate with DevOps/Engineering to instrument applications and standardize telemetry (metrics, logs, traces where applicable)
Skills, Knowledge and Expertise
- Bachelor's degree in Computer Science or a related field
- Experience in SRE / Operations / DevOps with production incident ownership
- Hands-on experience with Prometheus and/or VictoriaMetrics (exporters, alert rules, recording rules, troubleshooting)
- Experience integrating alerting/event pipelines with BigPanda (or similar event correlation tools)
- Strong troubleshooting skills across Linux and Windows systems (networking, OS, services)
- Ability to build reliable alerting with minimal noise (correlation, grouping, suppression, maintenance windows)
- Experience with Git-based workflows for monitoring-as-code and configuration management
Nice to have
- Grafana administration and dashboard design standards
- Log management (ELK/EFK, Loki) and/or tracing (OpenTelemetry)
- Automation skills (Python, PowerShell, Bash) and configuration tools (Ansible)
- Messaging/cache/proxy operations: RabbitMQ, Redis, Nginx
- Experience with Windows clustering or HA environments
- Experience defining SLOs/SLIs and operational KPIs
- Experience managing VoIP components and protocols (SIP, FreeSWITCH, OpenSIPS, session border controllers)
- Experience with load balancing components (F5 LTM, F5 GTM)
- Experience with virtualization platforms such as VMware or Hyper-V
- Experience administering AWS or Azure tenants
- Participation in a rotating on-call schedule (including nights/weekends as needed)
- Ownership of incident response: rapid triage, escalation, mitigation, and follow-up improvements
- Commitment to improving monitoring quality to reduce alert fatigue and improve MTTR
Diversity, Inclusion, and Equal Opportunity