Staff Platform Engineer - High Performance Computing Platform Management
Team: Big Data Platform
Location: Singapore, Singapore
Commitment: Full-time
Workplace Type: Onsite
Role
- We are seeking an experienced HPC Staff Engineer to join our team, responsible for managing and optimizing our HPC infrastructure platform. The successful candidate will have a deep understanding of HPC systems, architectures and technologies, as well as experience with managing large-scale computing environments. The role will involve designing, implementing and maintaining the HPC infrastructure platform, ensuring high availability, scalability and performance.
Responsibilities
- Lead a team to deliver a resilient, scalable and secure HPC platform, including compute nodes, storage systems, networks and job scheduling systems.
- Lead, design, implement and manage the HPC infrastructure platform to meet organisational needs.
- Design and implement storage solutions for HPC workloads to ensure efficient data storage and retrieval.
- Design and implement high-performance networking solutions, including InfiniBand, Ethernet, and other interconnects.
- Plan and manage HPC resource capacity, including forecasting, procurement and deployment of new hardware and software.
- Manage HPC clusters, including optimizing, monitoring and troubleshooting cluster performance, as well as managing job scheduling and resource allocation.
- Ensure the security and compliance of the HPC infrastructure platform, including managing access controls, implementing security patches, and conducting regular security checks.
- Collaborate with stakeholders such as data scientists and developers to optimize application performance on the HPC platform, and provide technical support for its users.
Requirements (Minimum Qualifications)
- Background in Computer Science, Computer Engineering, or a related field.
- 8+ years of experience in managing HPC systems, including experience with Linux, Unix, or other operating systems.
- Strong knowledge of HPC architectures, including clusters, grids, and clouds.
- Experience with HPC job scheduling systems, such as Slurm, Torque and LSF.
- Strong understanding of storage systems, including SANs, NAS, and object storage.
- Experience with high-performance networking, including InfiniBand, Ethernet, and other interconnects.
- Experience with cloud computing platforms, such as AWS, Azure, or Google Cloud.
- Experience with scripting languages, such as Python, Perl, or Bash.
- Experience with containerization (Docker, Kubernetes) and proficiency in a range of complementary technologies, including Knative, Run:AI, Grafana, Prometheus, Kyverno, ArgoCD, Rancher and NVIDIA BCM, as well as knowledge of the NVIDIA SuperPOD architecture.
- Experience in leading engineering teams.
Nice to Have
- Certifications: NVIDIA AI Infrastructure and Operations, and Certified Kubernetes Administrator (CKA).
- Experience with machine learning or deep learning frameworks, such as TensorFlow or PyTorch.
- Familiarity with agile development methodologies and version control systems, such as Git.
Why join us?
- The work is purposeful and meaningful
- You will work with the best engineers
- We work with modern technologies and tech stacks
- We have an excellent engineering culture and work-life balance
- We aspire to engineering and operational excellence
- We empower you to innovate
- We grow together as a family
