Director of Software Engineering (Fleet Management)
Location: London
Department: AI Infrastructure
About Nscale
Nscale is taking on the hyperscalers by building a vertically integrated GenAI cloud platform. We own the data centres, software, and applications that power today's AI applications using sustainable technology solutions. We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you'll build trust through openness and transparency, where everyone is inspired to do their best work. Collaboration is key, and we work together swiftly and respectfully, embracing adaptability and resilience in all we do.
About the Role
We're hiring a Director of Software Engineering (Fleet Management) to lead the team that keeps Nscale's bare-metal GPU fleet running. This is a hands-on role where you will be a core individual contributor as well as a leader of people and technology: you'll write production Python deployed using Helm and Kubernetes, design distributed systems, and steer architecture - while also hiring, mentoring, and driving delivery for a growing engineering team.
Fleet Manager automates the entire operational lifecycle of our compute infrastructure, from initial device enrolment through multi-day burn-in testing to ongoing health monitoring and automated remediation. The problems are challenging and the stakes are high: the software you design and build will determine Nscale’s success in scaling its GPU fleet to meet demand, and put you at the centre of some of the highest-impact work in the company.
What you'll work on
- Large-scale, business-critical automation that configures BMCs, manages DHCP reservations, drives bare-metal provisioning state machines, and runs GPU burn-in tests and remediation workflows.
- Complex workflow orchestration and event-driven state machines that span multiple days, survive crashes, resume from checkpoints, support human-in-the-loop approval gates, and let thousands of concurrent idempotent workflows operate without stepping on each other.
- Multi-site hub-and-spoke infrastructure tooling that works across geographically distributed data centres with independent trust boundaries.
- Integration with data-centre inventory management tooling (DCIM), bare-metal provisioning systems, credential stores, and monitoring infrastructure, and keeping data consistent across them.
- Observability: structured logging, metrics, distributed tracing and tooling that lets operators troubleshoot effectively.
What you'll lead
- A team of highly talented software engineers building hardware lifecycle automation, leading from the front.
- The technical roadmap and architecture for how Nscale provisions, validates, monitors, and remediates hardware at massive scale.
- Writing code in critical areas of the codebase, shipping to production regularly, and setting the bar for execution: getting things done.
- Engineering standards: code review, testing, CI/CD, incident response, and on-call practices.
- Tight collaboration with Product, Infrastructure, Platform, SRE, and UI/UX to capture requirements early, align on interfaces, and ship integrations that meet operator needs.
- Hiring and developing engineers who thrive in a high-autonomy, high-accountability environment.
About You
- 10+ years building, owning, and operating complex distributed systems, with at least 2 years leading engineering teams.
- Hands-on experience with workflow orchestration (Temporal, Airflow, Prefect, or similar).
- Bare-metal expertise across compute, networking, and storage: BMC/IPMI/Redfish, PXE boot, DHCP, VLAN management, and provisioning systems like Ironic, MAAS, or equivalent.
- Confidence working at the intersection of software and physical infrastructure - debugging sometimes means asking "is the cable plugged in?"
- You've built systems that had to be fault-tolerant, resumable, and observable (so failures don’t turn into 3am pages).
- You stay effective while context-switching between deep work, judgement calls, and people leadership - writing a workflow activity, reviewing an ADR, and unblocking a team member in the same morning.
- You use modern AI tools as a force multiplier to speed up specs, scaffolding, tests, refactors, data exploration, incident triage, and docs.
Ways to stand out
- You've worked with OpenStack Ironic, NetBox, or similar data centre inventory and management platforms.
- You’ve used HPC workload schedulers such as Slurm.
- You've designed multi-site architectures for infrastructure tooling.
- You've built hardware burn-in, validation, or remediation automation.
- You’ve owned results storage, analysis, and reporting for large-scale computational testing. Experience with HPC simulations or ML training is a plus.
What We Can Offer You
At Nscale, you'll find a collaborative, supportive, and innovative environment where your contributions spark real impact. We're building something extraordinary, and we want you at the core.
- Highly competitive package (base + equity) with reviews every 12 months. 🚀
- Join the fastest-growing tech startup: your chance to push boundaries, collaborate with brilliant minds, and make your mark on cutting-edge AI. ✨
- Expect a dynamic progression plan tailored to your ambitions. Grow by trying new things, leading, challenging the status quo, and owning your impact, always with our full support.
- Human-First Flexibility: We treat you as humans first. 🫶🏽 Our flexible workplace trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments.
- Join our thriving remote-first team. Geography is no barrier to impact or connection. We build seamless virtual collaboration, empowering you, wherever you work.
Equal Opportunities Statement
At Nscale, we are committed to fostering an inclusive, diverse, and equitable workplace. We believe that a variety of perspectives enriches our work environment, and we encourage applications from candidates of all backgrounds, experiences, and abilities. We strongly encourage applications from people of colour, the LGBTQ+ community, people with disabilities, neurodivergent people, parents, carers, and people from lower socio-economic backgrounds.
If there’s anything we can do to accommodate your specific situation, please let us know.
For information on how Nscale handles candidate personal data, please see our Employee & Candidate Privacy Notice: Here.
