Senior Network Engineer - Supercomputing
Team: Engineering
Location: Sunnyvale, CA
Commitment: Full-time
Workplace Type: On-site
Job Responsibilities
- Design & Optimization: Develop and tune RDMA-based communication systems leveraging NVIDIA GPUs, Mellanox NICs (InfiniBand, RoCE), and low-level networking technologies to support ultra-fast data transfers between nodes.
- Performance Engineering: Implement and optimize GPUDirect RDMA to enable direct memory access between GPUs and network interfaces, minimizing CPU overhead.
- Automation & Monitoring: Build network-aware software and observability tools with extensive metrics coverage, automate configuration management, and ensure robust, secure deployment pipelines through Infrastructure-as-Code (IaC) best practices.
- Integration & Collaboration: Integrate RDMA solutions within Kubernetes-based workloads and containerized environments. Collaborate closely with AI researchers, network engineers, and infrastructure teams to accelerate data pipelines and optimize collective communications using NCCL, MPI, and SHARP.
- Troubleshooting: Quickly investigate, debug, and resolve network-side issues across the full stack—from physical InfiniBand fabrics to high-level orchestration services—ensuring continuous operational excellence.
Tech Stack
- Languages & Tools: Python, Go, Rust, C/C++
- Networking Protocols & Technologies: TCP/IP, BGP, RDMA, InfiniBand, RoCE, SHARP, GPUDirect RDMA
- AI & HPC Communication Frameworks: NCCL, MPI
- Container & Orchestration: Kubernetes
- Cluster Management: Slurm
- Monitoring & Automation: Prometheus, Grafana, Ansible, Terraform
- Hardware: NVIDIA GPUs, Mellanox networking solutions
Professional Experience
- High-Performance Networks: Hands-on experience with NVIDIA RDMA technologies (e.g., GPUDirect RDMA, RoCE, InfiniBand) in HPC or AI supercomputing environments.
- Job Scheduling & Cluster Management: Familiarity with the Slurm workload manager and experience troubleshooting and optimizing network performance within Slurm-managed environments.
- Advanced Communication Frameworks: Proven expertise in optimizing distributed systems using NCCL, SHARP, MPI, or similar frameworks tailored for GPU-accelerated workloads.
- Programming & System Optimization: Proficiency in Python, Go, and low-level programming languages such as Rust, C, or C++ to design and optimize networking software.
- Networking Fundamentals: In-depth knowledge of network protocols (TCP/IP, BGP, RDMA) and network architectures, both physical and logical.
- Kubernetes & Containerization: Familiarity with Kubernetes networking and experience integrating RDMA into containerized environments.
- Troubleshooting & Debugging: Strong analytical and debugging skills with a track record of rapidly resolving network-side errors and performance bottlenecks.
- Collaboration & Metrics-Driven Approach: Experience working closely with network engineers and systems architects, using extensive metrics to drive prioritization and improvements.
