NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions from artificial intelligence to autonomous cars. NVIDIA is looking for phenomenal people like you to help us accelerate the next wave of artificial intelligence.
We are seeking a highly skilled and experienced Staff Software Engineer to lead the design, deployment, and management of our large-scale GPU clusters. These clusters will power AI workloads across multiple teams and projects, making a significant impact on the future of machine learning and artificial intelligence at NVIDIA. Join our engineering team and collaborate with researchers, AI engineers, and infrastructure teams to ensure our GPU clusters perform efficiently, scale well, and remain reliable.
The ideal candidate has a passion for operational excellence, automation, and working in a multi-cloud environment. You will collaborate with a diverse and experienced team, constantly improving infrastructure provisioning and resiliency to ensure a high level of service availability.
What you will be doing:
Design, deploy and support large-scale, distributed GPU clusters to run high-performance AI and machine learning workloads.
Continuously improve infrastructure provisioning, management, and monitoring through automation.
Ensure the highest level of uptime and quality of service (QoS) through operational excellence, proactive monitoring, and incident resolution.
Support globally distributed cloud environments such as AWS, GCP, Azure, or OCI, as well as on-premises infrastructure.
Define and implement service level objectives (SLOs) and service level indicators (SLIs) to measure and ensure infrastructure quality.
Write high-quality Root Cause Analysis (RCA) reports for production-level incidents and work towards preventing future occurrences.
Participate in the team's on-call rotation to support critical infrastructure.
Drive the evaluation and integration of new GPU technologies, such as GB200, and new cloud technologies to improve system performance.
What we need to see:
BS degree in Computer Science at minimum (or equivalent experience), with 7+ years of software engineering experience, including at least 3 years managing GPU clusters or similar high-performance computing environments.
Expertise in designing, deploying, and running production-level cloud services.
Proficiency with orchestration and containerization tools like Kubernetes, Docker, or similar.
Experience coding/scripting in at least two high-level programming languages (e.g., Python, Go, Ruby).
Strong proficiency with Linux operating systems and TCP/IP fundamentals.
Proficient in modern CI/CD techniques, GitOps, and Infrastructure as Code (IaC) using tools such as Terraform or Ansible.
Diligent, with strong communication and documentation skills.
Ways to stand out from the crowd:
Experience managing large-scale Slurm and/or BCM deployments in production environments.
Expertise in modern container networking and storage architectures.
Proven track record of defining and driving operational excellence in highly distributed, high-performance environments.
You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.