Roles & Responsibilities
- Provide enterprise-level operational support to Managed Services customers for incident, problem, and change management activities
- Design, deploy, and manage Kubernetes clusters optimized for HPC workloads, with a focus on integrating and managing NVIDIA DGX systems.
- Optimize cluster performance, resource utilization, and cost-effectiveness, specifically addressing the unique requirements of DGX systems.
- Implement monitoring, logging, and alerting solutions for HPC Linux clusters, Kubernetes, and DGX infrastructure
- Ensure the security of the Kubernetes infrastructure and HPC workloads, including the protection of sensitive data processed by DGX systems.
- Troubleshoot and resolve issues related to Kubernetes, DGX systems, HPC applications, and infrastructure
- Stay up to date on the latest technologies and trends in Kubernetes, HPC, and NVIDIA DGX systems, including new hardware and software releases
- Work across technical teams to troubleshoot complex infrastructure issues
- Create and maintain detailed documentation
- Serve as a subject matter expert and escalation point for HPC technologies
- Work with vendors to resolve infrastructure issues
- Communicate transparently with customers and internal teams
- Participate in on-call rotation
- Complete training and certification as assigned to further skills and knowledge
Qualifications
- Bachelor’s degree or equivalent in Information Systems or a related field. Unique education, specialized experience, skills, knowledge, training, or certification may be substituted for education
- 5+ years of expert-level experience managing infrastructure in high-performance computing environments, including configuration, troubleshooting, and best practices
- Strong understanding of Kubernetes architecture, components, and networking
- Hands-on experience with deploying, managing, and optimizing NVIDIA DGX systems preferred
- Linux engineering experience with Red Hat, Ubuntu, and Rocky distributions
- Experience with deploying and managing Kubernetes clusters in production environments, including those with GPU acceleration
- Experience with HPC workloads, schedulers (e.g., SLURM, PBS, Torque), and applications, particularly in the context of AI/ML and deep learning
- Experience with containerization technologies (e.g., Docker, Singularity)
- Experience with Infrastructure-as-Code (IaC) tools (e.g., Terraform, Ansible)
- Experience with monitoring and logging tools (e.g., Prometheus, Grafana) and with integrating them with Elastic Observability
- Strong scripting skills (e.g., Bash, Python)
- Excellent problem-solving and troubleshooting skills
- Experience configuring, maintaining and troubleshooting Kubernetes
- Experience with storage technology (e.g., Ceph, VAST Data Platform) and distributed file systems (e.g., Lustre, GPFS, NFS, GlusterFS)
- Experience configuring, maintaining, and troubleshooting NVIDIA/Mellanox (Cumulus OS) switches is a plus
- Experience with both Ethernet and InfiniBand networking is a plus
- 1+ years working with an enterprise ITSM system; ServiceNow is a bonus
- Managed Services or consulting experience is required
- Strong background with customer service
- High-level problem-solving and communication skills
- Strong oral and written communication skills
- Related certifications are a bonus