NVIDIA is looking for a hardworking Senior Compute Cluster Deployment Engineer to join our Professional Services team.
You'll join a small team working around the globe to build some of the most cutting-edge data centers in the world. This role focuses on deploying server and compute clusters built on brand-new GPU platforms that power AI and machine learning. You'll work with some of the world's largest and most sophisticated customers and supercomputers. You'll work alongside our InfiniBand and Ethernet network engineers to deploy a complete solution for customers looking to adopt NVIDIA solutions into their business.
Opportunities for global travel and learning about the newest GPU-related technologies are plentiful as we seek to build, shape and expand this new aspect of our business.
What you will be doing:
Manage and maintain AI/HPC infrastructure in Linux-based environments for new and existing customers.
Support operational and reliability aspects of large-scale AI clusters, with a focus on performance at scale, real-time monitoring, logging, and alerting.
Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation and refinement.
Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
Provide feedback to internal teams by opening bugs, documenting workarounds, and suggesting improvements.
Be part of an on-call rotation to support production systems.
What we need to see:
5+ years providing in-depth support and deployment services, solving problems for hardware and software products.
Knowledge and experience with Linux System Administration, process management, package management, task scheduling, kernel management, boot procedures/troubleshooting, performance reporting/optimization/logging, network-routing/advanced networking (tuning and monitoring).
Experience with cluster management technologies, e.g., Bright Cluster Manager.
Scripting proficiency.
Good social skills, with the ability to own and deliver resolutions for customer-blocking issues as they arise.
Superb communication and presentation/oral skills.
Excellent verbal and written English skills.
Strong organizational skills and ability to prioritize/multi-task easily with limited supervision.
Candidates should have a minimum of a four-year degree from an accredited university or college in Computer Science, or Electrical or Computer Engineering.
Industry-standard Linux certifications.
Ways to stand out from the crowd:
InfiniBand experience.
Experience with GPU-focused hardware/software.
Experience with MPI.
Automation tooling background (Ansible, Salt, Puppet, etc.).
Experience with Ethernet and storage technologies.
Widely considered to be one of the technology world’s most desirable employers, NVIDIA offers highly competitive salaries and a comprehensive benefits package. As you plan your future, see what we can offer you and your family at www.nvidiabenefits.com/.