Microsoft

Principal High Performance Computing (HPC) / Artificial Intelligence (AI) Load Planning Engineer

US
USD 137k - 294k
C# Java JavaScript Python Machine Learning Azure C++
Description

The Microsoft Azure Artificial Intelligence production team is looking for a Principal High Performance Computing (HPC) / Artificial Intelligence (AI) Load Planning Engineer to drive the design, validation, and orchestration of the multi-megawatt-scale solutions needed to manage the power draw of high-throughput, Graphics Processing Unit (GPU)-enabled AI training clusters. Azure is building the world’s largest supercomputers to meet the massive computational demands of AI workloads, as shown by HPC virtual machines such as ND H100 v5 that have already made their mark on the Top500, MLPerf, and Graph500 rankings. Robust solutions to stabilize the power draw of these large clusters are needed to operate them safely.

 

As a Principal High Performance Computing (HPC) / Artificial Intelligence (AI) Load Planning Engineer, you will provide the best practices that drive architectural changes. You will also influence the roadmap of relevant software and hardware components. Your work will directly impact the business goals of a wide range of users and facilitate the next wave of growth and innovation in AI and, more generally, in HPC in the cloud.


At supercomputing scale, novel tools and techniques are needed to maintain the reliability, runtime performance, and health of the system and of running jobs so that they continue to meet the expectations of users. The responsibilities of this position are to use state-of-the-art methods; design, build, and validate novel tools; find operational gaps; and instrument features to achieve the smooth operation of cloud-native supercomputers.

 

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Required Qualifications:

  • Bachelor's Degree in Computer Engineering, Electrical Engineering, or related technical field AND 6+ years technical engineering experience in software design and development, with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
    • OR equivalent experience
  • 3+ years of experience in Power Architecture
  • 3+ years of experience in running and analyzing HPC or AI applications on clusters
  • 3+ years familiarity with HPC environments and systems

Other Requirements:

 

  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
    • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

 

Preferred Qualifications:

  • Master's or PhD in Computer Science, Electrical Engineering, or related areas
  • Exposure to operational challenges of running HPC systems (availability, fault tolerance) and mitigation mechanisms
  • Previous experience with running and troubleshooting machine learning workloads on GPU clusters
  • Exposure to Cloud Computing, Virtualization and Container Technologies

 

Software Engineering IC5 - The typical base pay range for this role across the U.S. is USD $137,600 - $267,000 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $180,400 - $294,000 per year.
  
Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay    
  
Microsoft will accept applications for the role until July 10, 2024. 

 

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations and ordinances. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you need assistance and/or a reasonable accommodation due to a disability during the application or the recruiting process, please send a request via the Accommodation request form.

 

Benefits/perks listed below may vary depending on the nature of your employment with Microsoft and the country where you work.

 

#azurecorejobs

You will join a team of engineers and researchers with experience in high performance computing infrastructure, acutely familiar with the behavior of bulk synchronous loads in large scale systems, middleware, and software. The following values drive us:

  • Drive for Results: We’re here to build great products. We take on whatever work is right for the product and strive for the best possible results.
  • Modesty and Adaptability: The right answer is more important than being right. We search for solutions as a team, adapt quickly and value transparent and open feedback.

Your mission will be to help ensure the Azure platform is consistent on power and performance, can scale on demand, and is engineered to withstand unparalleled computing demand from customer workloads. You will help build a test-driven engineering culture to reduce regressions and bugs in production and will set a higher bar for infrastructure quality. This is in addition to the responsibilities below:

 

  • Manages, oversees, provides guidance to, and reviews the work of individual contributors and people managers to accomplish operational plans and results.
  • Provides oversight and support to the Cloud Operations and Innovation group in developing and implementing programs; reports on policy issues regarding current and long-range planning, advising the Azure HPC leadership and recommending solutions.
  • Develops, coordinates, communicates, and implements procedures for reviewing all solutions to drive power design discussions within Azure AI+HPC.
  • Presents to councils, boards, leadership forums, and customers, and represents Azure HPC at meetings and events; attends evening meetings and events based on organizational responsibilities and/or requirements.
  • Evaluates policies/ideas for necessary updates, changes, and additions; develops and recommends options and implementation plans of power stabilization features.
  • Develops best practices to operate and monitor supercomputers running complex workloads.
  • Identifies, tracks, and assesses features to manage power draw or manage power swings in GPU hardware, rack-level instruments or datacenters; compiles and submits data, analyses, and reports.
  • Coordinates with department and leadership to create and implement the annual work program, including assignments to staff and participation in software development and review process.
  • Ensures resolution of problems and controversial or difficult technical issues by working with other employees, departments, architects, datacenter teams, software developers, and product/program managers.
  • Assigns, manages, or conducts special studies pertaining to planning and zoning.
  • Establishes and maintains effective working relationships with those interacted with during work regardless of race, color, religious creed, national origin, ancestry, sex, sexual orientation, gender identity, age, genetic information, disability, political affiliation, military service, or diverse cultural and linguistic backgrounds.
  • Reviews and evaluates work methods and procedures and meets with management staff to identify and resolve problems.
  • Assesses and monitors workload; identifies opportunities for improvement and implements changes.
  • Selects, trains, motivates, and evaluates employees; provides or coordinates staff training; works with employees to correct deficiencies; implements discipline procedures per established policies, procedures, and executive guidance.
  • Oversees and participates in the development and administration of the departmental budget; approves the forecast of funds needed for staffing, equipment, materials, and supplies; approves expenditures and implements budgetary adjustments as appropriate and necessary.
  • Embody our culture and values