Description

AHEAD builds platforms for digital business. By weaving together advances in cloud infrastructure, automation and analytics, and software delivery, we help enterprises deliver on the promise of digital transformation.

At AHEAD, we prioritize creating a culture of belonging, where all perspectives and voices are represented, valued, respected, and heard. We create spaces to empower everyone to speak up, make change, and drive the culture at AHEAD.

We are an equal opportunity employer and do not discriminate based on an individual's race, national origin, color, gender, gender identity, gender expression, sexual orientation, religion, age, disability, marital status, or any other protected characteristic under applicable law, whether actual or perceived.

We embrace all candidates who will contribute to the diversification and enrichment of ideas and perspectives at AHEAD.

The High-Performance Computing Infrastructure Engineer is primarily responsible for the overall health and maintenance of HPC infrastructure in our managed services customers' environments. Our HPC Infrastructure Engineers are valued members of the Managed Services Infrastructure Practice, responsible for Tier 3 incident management, service request management, and change management infrastructure support for all Managed Services customers.

Roles & Responsibilities

Provide enterprise-level operational support to Managed Services customers for incident, problem, and change management activities
Design, deploy, and manage Kubernetes clusters optimized for HPC workloads, with a focus on integrating and managing NVIDIA DGX systems.
Optimize cluster performance, resource utilization, and cost-effectiveness, specifically addressing the unique requirements of DGX systems.
Implement monitoring, logging, and alerting solutions for HPC Linux clusters, Kubernetes, and DGX infrastructure
Ensure the security of the Kubernetes infrastructure and HPC workloads, including the protection of sensitive data processed by DGX systems.
Troubleshoot and resolve issues related to Kubernetes, DGX systems, HPC applications, and infrastructure
Stay up to date on the latest technologies and trends in Kubernetes, HPC, and NVIDIA DGX systems, including new hardware and software releases
Work across technical teams to troubleshoot complex infrastructure issues
Create and maintain detailed documentation
Serve as a subject matter expert and escalation point for HPC technologies
Work with vendors to resolve infrastructure issues
Communicate transparently with customers and internal teams
Participate in on-call rotation
Complete training and certification as assigned to further skills and knowledge

Qualifications

Bachelor’s degree or equivalent in Information Systems or a related field. Unique education, specialized experience, skills, knowledge, training, or certification may be substituted for education
5+ years of expert-level experience managing infrastructure in high-performance computing environments, including configuration, troubleshooting, and best practices
Strong understanding of Kubernetes architecture, components, and networking
Hands-on experience with deploying, managing, and optimizing NVIDIA DGX systems preferred
Linux engineer with experience in RedHat, Ubuntu, and Rocky distributions
Experience with deploying and managing Kubernetes clusters in production environments, including those with GPU acceleration
Experience with HPC workloads, schedulers (e.g., SLURM, PBS, Torque), and applications, particularly in the context of AI/ML and deep learning
Experience with containerization technologies (e.g., Docker, Singularity)
Experience with Infrastructure-as-Code (IaC) tools (e.g., Terraform, Ansible)
Experience with monitoring and logging tools (e.g., Prometheus, Grafana) and experience integrating with Elastic Observability
Strong scripting skills (e.g., Bash, Python)
Excellent problem-solving and troubleshooting skills
Experience configuring, maintaining and troubleshooting Kubernetes
Experience with storage technology (e.g., Ceph, Vast Data Platform) and distributed file systems (e.g., Lustre, GPFS, NFS, GlusterFS)
Experience configuring, maintaining and troubleshooting NVIDIA/Mellanox (Cumulus OS) switches a plus
Experience with both Ethernet and InfiniBand networking a plus
1+ years working with an enterprise ITSM system; ServiceNow experience is a bonus
Managed Services or consulting experience is required
Strong background with customer service
High-level problem-solving and communication skills
Strong oral and written communication skills
Related certifications are a bonus

Why AHEAD:

Through our daily work and internal groups like Moving Women AHEAD and RISE AHEAD, we value and benefit from diversity of people, ideas, experience, and everything in between.

We fuel growth by stacking our office with top-notch technologies in a multi-million-dollar lab, by encouraging cross-department training and development, and by sponsoring certifications and credentials for continued learning.

USA Employment Benefits include:

- Medical, Dental, and Vision Insurance

- 401(k)

- Paid company holidays

- Paid time off

- Paid parental and caregiver leave

- Plus more! See https://www.aheadbenefits.com/ for additional details.

The compensation range indicated in this posting reflects the On-Target Earnings (“OTE”) for this role, which includes a base salary and any applicable target bonus amount. This OTE range may vary based on the candidate’s relevant experience, qualifications, and geographic location.

