High Performance Computing (HPC) Engineer
Team: Engineering
Location: Palo Alto, CA
Commitment: Full Time
Workplace Type: Onsite
Job Description
- GPU Cluster Management: Design, deploy, and maintain high-performance GPU clusters, ensuring their stability, reliability, and scalability. Monitor and manage cluster resources to maximize utilization and efficiency.
- Distributed/Parallel Training: Implement distributed computing techniques to enable parallel training of large deep learning models across multiple GPUs and nodes. Optimize data distribution and synchronization to achieve faster convergence and reduced training times.
- Performance Optimization: Fine-tune GPU clusters and deep learning frameworks to achieve optimal performance for specific workloads. Identify and resolve performance bottlenecks through profiling and system analysis.
- Deep Learning Framework Integration: Collaborate with data scientists and machine learning engineers to integrate distributed training capabilities into GenBio AI’s model development and deployment frameworks.
- Scalability and Resource Management: Ensure that the GPU clusters can scale effectively to handle increasing computational demands. Develop resource management strategies to prioritize and allocate computing resources based on project requirements.
- Troubleshooting and Support: Troubleshoot and resolve issues related to GPU clusters, distributed training, and performance anomalies. Provide technical support to users and resolve technical challenges efficiently.
- Documentation: Create and maintain documentation related to GPU cluster configuration, distributed training workflows, and best practices to ensure knowledge sharing and seamless onboarding of new team members.
Job Requirements
- Master’s or Ph.D. degree in Computer Science or a related field, with a focus on High-Performance Computing, Distributed Systems, or Deep Learning.
- 2+ years of proven experience managing GPU clusters, including installation, configuration, and optimization.
- Strong expertise in distributed deep learning and parallel training techniques.
- Proficiency in popular deep learning frameworks and libraries such as PyTorch, Megatron-LM, and DeepSpeed.
- Programming skills in Python and experience with GPU programming platforms and libraries (e.g., CUDA, cuDNN).
- Knowledge of performance profiling and optimization tools for HPC and deep learning.
- Familiarity with resource management and scheduling systems (e.g., SLURM, Kubernetes).
- Strong background in distributed systems, cloud computing (AWS, GCP), and containerization (Docker, Kubernetes).
