THE TEAM: AMD's Data Center GPU organization is transforming the industry with our AI-based graphics processors. Our primary objective is to design exceptional products that drive the evolution of computing experiences, serving as the cornerstone for enterprise data centers, Artificial Intelligence (AI), HPC, and embedded systems. If this resonates with you, come and join our Data Center GPU organization, where we are building amazing AI-powered products with amazing people.

THE ROLE: We are seeking a highly motivated and skilled GPU Cluster System/Network Engineer to join our dynamic team. In this role, you will be at the forefront of optimizing and achieving peak performance for GPU clusters. The ideal candidate will have a strong background in GPU architectures, parallel computing, and hands-on experience in system-level performance tuning and debug methodologies. The team fosters and encourages continuous technical innovation to showcase successes as well as facilitate continuous career development.

THE PERSON: The Cluster System/Network Engineer plays a critical role in shaping the future of AI/ML training and inferencing systems as they move into the Ethernet era. This individual will collaborate with a broad range of internal and external partners, including NIC, Switch, and Software Enablement teams, to integrate state-of-the-art technology solutions that pave the way for Ethernet to be used as a viable network technology for the GPU-to-GPU communication required during AI inferencing and training.
KEY RESPONSIBILITIES:
- Performance Optimization: Collaborate with hardware and software teams to enhance the overall performance of GPU clusters, focusing on aspects such as RDMA throughput, latency, and collective communications
- Benchmarking and Analysis: Develop and execute comprehensive benchmarking strategies to assess baseline performance, analyze bottlenecks, and identify areas for improvement within GPU cluster environments
- Scalability Testing: Evaluate the scalability of GPU clusters by conducting thorough testing under various workloads, ensuring optimal performance across different cluster sizes, configurations, and networking technologies (IB and RoCE)
- Performance Profiling: Utilize profiling tools and methodologies to analyze and identify performance bottlenecks, providing actionable insights for improvement
- Performance Tuning: Implement optimization strategies, including but not limited to protocol enhancements, load balancing techniques, and parallel processing optimizations
- Documentation: Create detailed documentation of performance analysis, tuning efforts, and outcomes, providing clear and concise reports for internal teams and stakeholders
- Collaboration: Work closely with cross-functional teams, including hardware engineers, software developers, and system architects, to integrate performance improvements into the GPU cluster architecture
- Continuous Learning: Stay current with the latest developments in GPU architectures, parallel processing, and emerging technologies to drive continuous improvement in GPU cluster performance

PREFERRED EXPERIENCE:
- Proven experience in optimizing the performance of GPU clusters
- Strong understanding of GPU architectures, parallel computing concepts, and network protocols
- Proficiency in scripting languages (e.g., Python, Bash) for automation and performance analysis
- Experience with system-level performance analysis tools and methodologies for GPU clusters
- Analytical mindset with excellent problem-solving and debug skills
- Familiarity with cluster management tools and systems
- Excellent communication and collaboration skills for effective teamwork
- RDMA network configuration, troubleshooting, and performance tuning
- Linux kernel networking expertise
- Machine learning and/or HPC system design

ACADEMIC CREDENTIALS:
- Bachelor’s or Master’s degree in computer science or equivalent experience

#LI-RW1 #LI-HYBRID
At AMD, your base pay is one part of your total rewards package. Your base pay will depend on where your skills, qualifications, experience, and location fit into the hiring range for the position. You may be eligible for incentives based upon your role such as either an annual bonus or sales incentive. Many AMD employees have the opportunity to own shares of AMD stock, as well as a discount when purchasing AMD stock if voluntarily participating in AMD’s Employee Stock Purchase Plan. You’ll also be eligible for competitive benefits described in more detail here. AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.
Tags: USD $192,570.00/Yr. to USD $275,100.00/Yr., US Careers (External)