Why do you charge job seekers to use EchoJobs?

Unlike bigger job sites, we treat job seekers as our customers. We charge a small fee to provide curated access to the best companies and up-to-date jobs, which lets us deliver a more personalized and effective job search experience.

How many software engineering jobs are on EchoJobs?

We've got over 200,000 jobs from 15,000+ vetted companies. No fake or sleazy jobs here!

So, where do the jobs come from?

We aggregate jobs from 15,000+ companies' career pages, so you can be sure that you're getting the most up-to-date and relevant jobs.

What makes EchoJobs different?

We're the only job board *for* software engineers, *by* software engineers… in case you needed a reminder! We add thousands of new jobs daily and offer powerful search filters just for you. 🛠️

How often are new jobs added?

Every single hour! We add 2,000-3,000 new jobs daily, so you'll always have fresh opportunities. 🚀

How fast can I find a job?

Typically, job searches take 3-6 months. EchoJobs helps you spend more time applying and less time hunting. 🎯

How often should I check EchoJobs?

Check daily! We're always updating with new jobs. Set up job alerts for even quicker access. 📅

Description

Principal Deployment Engineer

Location: Seattle, US

Department: AI Infrastructure

Principal Deployment Engineer – GPU Supercluster Bringup

About Us

We are building AI infrastructure for frontier-scale workloads. Our platform is designed for high-density, high-performance GPU clusters that push the limits of power, networking, and distributed compute.

As a startup, we move fast, operate with ownership, and expect technical leaders to define standards—not just follow them.

The Role

We are hiring a Principal Deployment Engineer to architect and lead the bringup of large-scale GPU clusters (hundreds to thousands of GPUs). This is a technical leadership role responsible for defining how we deploy, validate, and scale AI superclusters across sites.

You will own the full lifecycle of deployment—from rack design and fabric architecture to cluster validation frameworks and production readiness standards. You will set the bar for performance, reliability, and operational excellence.

This role combines deep hands-on expertise with system-level thinking and cross-functional leadership.

What You’ll Do

End-to-End Supercluster Bringup Ownership

Define the technical standards for node, rack, and full-cluster bringup.
Lead large-scale GPU cluster deployments (multi-rack, multi-pod environments).
Architect high-performance network fabrics (IB, RoCE, Ethernet) optimized for AI workloads.
Establish cluster-level acceptance criteria and validation frameworks.

Performance & Fabric Architecture

Tune and validate NCCL, RDMA, GPUDirect, and collective operations at scale.
Identify and eliminate performance bottlenecks across hardware, topology, and firmware layers.
Drive congestion control and fabric optimization strategies.
Define performance benchmarking methodology for AI training workloads.

Deployment Strategy & Scalability

Design repeatable deployment models for multi-site expansion.
Build automation frameworks for provisioning and cluster validation.
Establish deployment SLAs, quality gates, and operational readiness standards.
Reduce time-to-capacity while increasing reliability.

Technical Leadership

Serve as the escalation point for complex bringup and performance issues.
Mentor senior engineers and shape infrastructure best practices.
Influence hardware selection, rack topology, and data center design decisions.
Partner with executive leadership on infrastructure scaling strategy.

What We’re Looking For

Required

10+ years of experience in large-scale infrastructure or HPC environments.
Proven experience bringing up large GPU clusters (hundreds+ GPUs).
Deep expertise in high-speed networking (InfiniBand, RoCE, Ethernet fabrics).
Strong understanding of server architecture (PCIe, NUMA, memory hierarchy).
Experience debugging performance issues across compute and network layers.
Strong automation and systems-level thinking.

Strongly Preferred

Experience scaling AI training clusters for frontier models.
Experience with liquid cooling or ultra-high-density deployments.
Knowledge of distributed storage systems (Lustre, Ceph, NVMe-oF).
Experience defining infrastructure standards in a fast-growing organization.

What Success Looks Like

Superclusters are brought online quickly, predictably, and at peak performance.
Deployment processes scale from first cluster to multi-site expansion.
Infrastructure becomes a competitive advantage.
You define the technical blueprint for how we scale AI infrastructure.

For information on how Nscale handles candidate personal data, please see our Employee & Candidate Privacy Notice.

Nscale

There are more than 50,000 engineering jobs:

Subscribe to membership and unlock all jobs

Engineering Jobs

60,000+ jobs from 4,500+ well-funded companies

Updated Daily

New jobs are added every day as companies post them

Refined Search

Use filters such as skill and location to narrow results

Become a member

🥳🥳🥳 452 happy customers and counting...

Overall, over 80% of customers chose to renew their subscriptions after the initial sign-up.

To try it out

For active job seekers

For passive job seekers

Cancel anytime
