Nscale

Senior Site Reliability Engineer

United Kingdom
Linux, Prometheus, Grafana, Alertmanager, Loki, Cortex, Ansible, Terraform, Python, Bash, SNMP, IPMI, Kubernetes
Description

Senior Site Reliability Engineer

Location: UK

Department: AI Infrastructure

About Nscale

Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business outcomes, including cost management, rapid innovation, and environmental responsibility.

At Nscale, our software engineers form the backbone of our product offering. We build state-of-the-art AI products that allow our clients to move quickly in an increasingly competitive digital landscape.

We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you’ll build trust through openness and transparency, where everyone is inspired to do their best work. If you join our team, you’ll be contributing to building the technology that powers the future.

About the Role

As a Site Reliability Engineer (SRE), you will be responsible for the reliability, performance, and availability of critical systems, applications, and services. You will work closely with engineering teams to implement best practices for monitoring, automation, incident response, and capacity planning. Your role involves building highly available, scalable systems across hybrid environments spanning data centres, on-premises hardware, and cloud platforms. In this multi-function role, you will take a team-centric approach to ensuring service uptime and performance.

What you'll do

Systems administration: Manage core services, including observability platforms and incident management systems; reduce manual toil through automation; and ensure the seamless operation of critical infrastructure platforms.

A key aspect of this role involves building and maintaining observability tooling, with a focus on the monitoring stack. You will help design and operate a reliable observability infrastructure in a Linux environment using open-source tools such as Prometheus, Grafana, Alertmanager, Loki, and related services. Your work will ensure systems are instrumented for detailed visibility, enabling high availability and actionable insights across distributed environments, with predictive monitoring and alerting for internal engineering and operational layers.

Throughout the development lifecycle, you will encourage a proactive SRE culture where errors are identified early and systems are continuously improved. You will champion accountability and a shared sense of responsibility across production infrastructure, at all key touch points and at scale.

  • Build and support a multi-site, infrastructure-based monitoring stack, including components such as Prometheus, Grafana, Alertmanager, Loki, and Cortex/Mimir, that scales seamlessly across physical and virtual systems and software stacks.
  • Develop automation scripts and infrastructure-as-code templates for on-prem and hybrid environments, improving operational efficiency and day-to-day infrastructure management.
  • Collaborate closely with distributed teams to establish and maintain SLIs/SLOs for critical services, and ensure systems meet their defined SLAs/SLOs and are observable, performant, and available.
  • Perform incident response and maintain the alerting pipeline for infrastructure, applications, and services, including integration with remote storage backends and custom metrics exporters.
  • Contribute to internal resources, analysis, and continuous improvement by conducting postmortems and fostering a blameless culture.
  • Develop documentation, guides, runbooks, and best practices for SRE and operational engineering.

About You

  • Strong experience with Linux systems administration and infrastructure automation (e.g., Ansible, Terraform).
  • Proven background in building and maintaining SRE systems in production-grade environments.
  • Hands-on experience operating and scaling Prometheus-based monitoring solutions in distributed, multi-tenant environments (including Thanos, Grafana and components like Cortex/Mimir).
  • Solid understanding of networking fundamentals, hardware infrastructure, and managing multiple data centre environments.
  • Demonstrated scripting and/or development skills in at least one language (e.g., Python, Bash), with a bias towards automating and improving operational workflows.
  • Strong knowledge of SNMP, IPMI, and other data centre/hardware protocols.
  • Competence in metrics- and log-based observability platforms and tooling aligned with cloud-native and distributed architectures, including Prometheus, Loki, and cloud tooling, with an observability-first mindset.
  • Familiarity with incident response, root cause analysis, and driving technical postmortems.
  • Strong grasp of observability principles, including metrics, logging, and tracing, with a focus on SLA/SLO delivery and improvement.
  • Exposure to remote write solutions and remote storage backends such as Cortex or Mimir, and comfort with CNCF pipelines and modern observability strategies (e.g., native client-goers).
  • Familiarity with hardware lifecycle management and tools for managing bare-metal environments.

What We Can Offer You

At Nscale, you'll find a collaborative, supportive, and innovative environment where your contributions spark real impact. We're building something extraordinary, and we want you at the core.

  • Highly competitive package (base + equity) with reviews every 12 months. 🚀
  • Join the fastest-growing tech startup: your chance to push boundaries, collaborate with brilliant minds, and make your mark on cutting-edge AI. ✨
  • Expect a dynamic progression plan tailored to your ambitions. Grow by trying new things, leading, challenging the status quo, and owning your impact, always with our full support.
  • Human-First Flexibility: We treat you as humans first. 🫶🏽 Our flexible workplace trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments.

Join our thriving remote-first team. Geography is no barrier to impact or connection. We build seamless virtual collaboration, empowering you, wherever you work.

 

Equal Opportunities Statement

At Nscale, we are committed to fostering an inclusive, diverse, and equitable workplace. We believe that a variety of perspectives enriches our work environment, and we encourage applications from candidates of all backgrounds, experiences, and abilities. We strongly encourage applications from people of colour, the LGBTQ+ community, people with disabilities, neurodivergent people, parents, carers, and people from lower socio-economic backgrounds.

If there’s anything we can do to accommodate your specific situation, please let us know.

For information on how Nscale handles candidate personal data, please see our Employee & Candidate Privacy Notice: Here.
