Description

Company Overview

With 80,000 customers across 150 countries, UKG is the largest U.S.-based private software company in the world. And we’re only getting started. Ready to bring your bold ideas and collaborative mindset to an organization that still has so much more to build and achieve? Read on.

At UKG, you get more than just a job. You get to work with purpose. Our team of U Krewers is on a mission to inspire every organization to become a great place to work through our award-winning HR technology built for all.

Here, we know that you’re more than your work. That’s why our benefits help you thrive personally and professionally, from wellness programs and tuition reimbursement to U Choose, a customizable expense reimbursement program that can be used for more than 200 needs that best suit you and your family, from student loan repayment to childcare to pet insurance. Our inclusive culture, active and engaged employee resource groups, and caring leaders value every voice and support you in doing the best work of your career. If you’re passionate about our purpose, which is people, then we can’t wait to support whatever gives you purpose. We’re united by purpose, inspired by you.

We are seeking a highly skilled and motivated Site Reliability Engineering (SRE) & Observability Manager to lead our site reliability engineering efforts and establish best-in-class observability practices across our infrastructure. As an SRE and Observability leader, you will be responsible for ensuring the reliability, availability, and performance of our systems, as well as implementing robust monitoring, alerting, and reporting frameworks to provide actionable insights into our systems' health.
In this role, you will work closely with engineering, operations, and product teams to create a culture of proactive reliability, incident management, and continuous improvement, driving the performance and operational excellence of mission-critical services.

Key Responsibilities
• Lead and manage a team of SREs responsible for the design, deployment, and maintenance of scalable, highly available, and reliable systems.
• Define and enforce service-level objectives (SLOs), service-level indicators (SLIs), and error budgets to ensure reliability and performance targets are met.
• Design and implement automated systems for scaling, monitoring, deployment, and incident response, minimizing manual intervention.
• Collaborate with software engineering teams to build a culture of “you build it, you run it,” advocating for best practices in code quality, testing, and operationalization of software.
• Develop and maintain incident response processes, including post-mortems, root cause analysis, and continuous improvement cycles.
• Drive the development and improvement of internal tooling to streamline operations and increase team efficiency.
• Participate in on-call rotation for production systems as needed, and help lead efforts to automate and reduce on-call load.
• Lead the implementation and evolution of observability practices, including monitoring, logging, tracing, and alerting systems, ensuring comprehensive visibility into production systems.
• Own the development and maintenance of real-time dashboards and reporting systems to provide key stakeholders with actionable insights into system health.
• Establish best practices for capturing telemetry data and ensuring data is consistent, meaningful, and actionable.
• Work with engineering teams to build observability into the software development lifecycle from the start rather than treating it as an afterthought.
• Continuously tune alerting mechanisms to improve the signal-to-noise ratio, ensuring that alerts are timely and relevant.
• Own the performance and reliability of key monitoring and observability platforms, ensuring they meet the needs of the organization.
• Lead and mentor a growing team of SREs, helping them to grow their skills, increase efficiency, and align with organizational goals.
• Drive a culture of collaboration and shared responsibility for reliability and performance between development, operations, and product teams.
• Collaborate with engineering teams to address reliability challenges early in the design and development process, including architecting for failure, capacity planning, and disaster recovery.
• Provide leadership during major incidents and drive the incident management process, ensuring rapid and effective resolutions, and fostering a blameless post-incident review process.
• Work closely with other technical leaders to establish and execute on strategic goals, ensuring that reliability and observability align with business objectives.

Qualifications
Required
• Proven experience (5+ years) in Site Reliability Engineering, DevOps, or a similar infrastructure-focused role.
• Strong experience managing teams, including mentoring and developing engineers.
• In-depth knowledge of observability tools and frameworks such as Prometheus, Grafana, ELK Stack, Splunk, Datadog, and New Relic.
• Experience with cloud platforms (AWS, GCP, Azure) and containerization technologies (Docker, Kubernetes).
• Proficiency in infrastructure as code (IaC) tools such as Terraform, CloudFormation, or similar.
• Expertise in automation, CI/CD pipelines, and release engineering best practices.
• Strong understanding of distributed systems, microservices architecture, and fault tolerance.
• Experience with incident management, postmortem analysis, and improving operational reliability.
• Strong programming or scripting skills in languages like Python, Go, Bash, or similar.
• Experience with load balancing, auto-scaling, and performance optimization techniques.
• Familiarity with security best practices in a cloud-native environment.
Preferred
• Bachelor’s degree or higher in Computer Science, Engineering, or a related field (or equivalent experience).
• Experience in a leadership or managerial role with direct responsibility for team growth and operational excellence.
• Experience with chaos engineering or other techniques for validating system resiliency.
• Experience with API management, service meshes (e.g., Istio), and distributed tracing (e.g., OpenTelemetry).
• Certification or formal training in cloud technologies or SRE methodologies (e.g., Google Professional Cloud Architect, Kubernetes Certification).

Skills & Attributes
• Leadership: Ability to inspire, mentor, and grow a team of engineers while setting clear goals and expectations.
• Analytical: Strong problem-solving skills, with the ability to analyze and troubleshoot complex systems and data.
• Communication: Excellent communication skills, both written and verbal, with the ability to interact with technical and non-technical stakeholders.
• Collaboration: Proven ability to work cross-functionally with engineering, product, and operational teams.
• Proactive: Ability to anticipate challenges, take ownership, and drive change within a dynamic environment.

Where we’re going

UKG is on the cusp of something truly special. Worldwide, we already hold the #1 market share position for workforce management and the #2 position for human capital management. Tens of millions of frontline workers start and end their days with our software, with billions of shifts managed annually through UKG solutions today. Yet it’s our AI-powered product portfolio designed to support customers of all sizes, industries, and geographies that will propel us into an even brighter tomorrow!

UKG is proud to be an equal opportunity employer and is committed to promoting diversity and inclusion in the workplace, including the recruitment process. 

Disability Accommodation 

For individuals with disabilities who need additional assistance at any point in the application and interview process, please email UKGCareers@ukg.com.
