The Azure AI Infrastructure team is looking for passionate engineers to build the largest deep-learning infrastructure service at Microsoft. In this role you will build new components that bring the latest innovations in AI infrastructure onto the Azure AI Platform. You will partner with top engineering talent within Azure AI Infrastructure and across Azure to work on cluster orchestration, job scheduling, storage, networking, containerization and operating system integration. Your work will enable various AI languages and runtimes on Azure AI Infrastructure to bring distributed deep learning training and inferencing to life. In addition, you will build the infrastructure components required to build, deploy, monitor and service the highly available and scalable Microsoft Service Fabric and Kubernetes clusters under your care. You will lead development and customer support from the frontline and establish architecture, service excellence guidelines and a high quality bar.
Candidates must have a track record of delivering engineering and service excellence on a mid-to-large-scale service.
Who Are We?
We are engineers on Azure AI Infrastructure. We believe that building a planet-scale AI supercomputer from the ground up, one that addresses the fundamental pain points of data scientists and AI practitioners and takes AI to unprecedented scale, is the opportunity of a lifetime. If you share this dream, come join us!
What Is Azure AI Infrastructure?
High-scale AI workloads constantly test the limits of the infrastructure stack. Large-scale model training and inference on huge volumes of training data across hundreds to thousands of GPUs is a true engineering challenge. Azure AI Infrastructure is a globally distributed, multi-tenant service that provides robust, cost-effective and competitive AI infrastructure (compute, networking and storage) for AI training and inferencing. By abstracting workloads from the underlying infrastructure, Azure AI Infrastructure creates a shared pool of resources that can be dynamically provisioned for full utilization of expensive GPU compute, enabling data scientists to productively build, scale, experiment with, and iterate on their models on top of a robust, performant, scalable and cost-effective distributed infrastructure built for AI. In Azure AI Infrastructure, we constantly seek to apply the best ideas from AI, ML, distributed systems, distributed databases, information retrieval, networking, and security.
Qualifications
- 8+ years of coding experience in one of C#, C, C++, Rust, or Go
- Experience working with the Linux operating system and Kubernetes cluster orchestration
- Experience with improving service operations or engineering fundamentals
- Excellent collaboration skills
- A master’s or bachelor’s degree in computer science or a related field
- At least 5 years of experience building and shipping production software or services
#IDCAIPlatformHiring
Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or the recruiting process, please send a request via the Accommodation request form.
Benefits/perks listed below may vary depending on the nature of your employment with Microsoft and the country where you work.
The ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
Responsibilities
- Deliver a robust container orchestration platform for Azure AI Infrastructure
- Design and build the scheduling sub-system that is responsible for delivering on the SLAs for AI training and inferencing workloads
- Design and build storage and caching systems for efficient DNN training and inferencing
- Design and build control plane APIs for creation and management of training jobs and inference model metadata
- Deliver node management, fault detection and node repair as a service to improve job/model reliability
- Deliver world-class monitoring systems and telemetry pipelines to enhance service and job observability for both end-users and operators
- Codify security and compliance requirements by building and strengthening system defenses against malicious attacks and exploits
- Leverage performance and profiling tools to identify hot spots and bottlenecks across hardware and software boundaries (CPU, GPU, microcode, OS, and networking code) and drive end-to-end job performance