Description

InstaDeep, founded in 2014, is a pioneering AI company at the forefront of innovation. With strategic offices in major cities worldwide, including London, Paris, Berlin, Tunis, Kigali, Cape Town, Boston, and San Francisco, InstaDeep collaborates with giants like Google DeepMind and prestigious educational institutions like MIT, Stanford, Oxford, UCL, and Imperial College London. We are a Google Cloud Partner and a select NVIDIA Elite Service Delivery Partner. We have been recognized among notable players in AI and fast-growing companies, including a place among Europe's 1000 fastest-growing companies in 2022, as listed by Statista and the Financial Times. Our recent acquisition by BioNTech has further solidified our commitment to leading the industry.

Join us to be a part of the AI revolution!

Are you an aspiring ML Ops intern eager to work on cutting-edge AI-powered scheduling systems? Do you want to apply your knowledge of Kubernetes, reinforcement learning, and resource management in a high-impact project?

If so, keep reading: this internship might be the perfect fit for you.

Topic:

AI-Powered HPC Co-Scheduler for Kubernetes with Reinforcement Learning (RL)

About the Project:

High-performance computing (HPC) systems are pivotal for large-scale scientific computing and data analytics. Managing these resources efficiently is critical for maximizing system utilization and minimizing job completion times. Traditional schedulers often struggle to adapt to the dynamic resource demands of HPC workloads, especially AI training jobs that require precise resource allocation.

This internship provides a unique opportunity to implement cutting-edge scheduling strategies using Reinforcement Learning (RL) within a Kubernetes environment. Drawing inspiration from the ASAX architecture, you will help develop a Kubernetes-native solution aimed at optimizing resource allocation for complex workloads involving AI training, balancing CPU, GPU, memory, and other critical resources in real time.

Key Responsibilities:

Research & Development: Study and apply concepts from the ASAX HPC co-scheduler model, focusing on multi-objective resource metrics like CPU, memory, network I/O, and GPU, essential for AI training and scientific workflows.

Algorithm Design: Design and implement expert systems to manage co-scheduling within Kubernetes. You’ll analyze real-time cluster states and application performance metrics to optimize container placement across multiple objectives.

Reinforcement Learning (RL) Integration: Implement RL algorithms to learn optimal policies for co-scheduling jobs, particularly AI training workloads, improving efficiency and resource utilization.

Model Training & Serving:

Training: Develop and refine an RL model based on Kubernetes resource metrics, creating a continuous training pipeline that adapts over time.

Serving: Deploy the trained model in Kubernetes, enabling real-time scheduling decisions that dynamically adjust resource allocations for AI workloads.

Kubernetes Integration: Seamlessly integrate the RL-powered scheduler into Kubernetes, working with components like kube-scheduler for precise control over resource allocation.

Simulation & Testing: Evaluate the scheduler in simulated Kubernetes environments, comparing its performance to traditional scheduling methods and measuring gains in efficiency.
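To give a flavor of the kind of problem described in the responsibilities above, here is a minimal, illustrative sketch of multi-objective container placement. It is not InstaDeep's or ASAX's actual method: the `Node` class, the resource names, the weights, and the scoring rule (weighted post-placement utilization of the requested resources) are all assumptions made for this example. In an RL setting, a score like this could plausibly serve as one ingredient of a reward signal, while a learned policy would replace the fixed rule.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence

# Hypothetical resource dimensions; a real cluster exposes many more.
RESOURCES = ("cpu", "mem", "gpu")


@dataclass
class Node:
    """Toy stand-in for a cluster node (not a Kubernetes API object)."""
    name: str
    capacity: Dict[str, float]  # total resources on the node
    used: Dict[str, float]      # resources already allocated

    def free(self, resource: str) -> float:
        return self.capacity[resource] - self.used[resource]


def fits(node: Node, demand: Dict[str, float]) -> bool:
    """True if the node has enough free capacity for every resource."""
    return all(node.free(r) >= demand[r] for r in RESOURCES)


def placement_score(node: Node, demand: Dict[str, float],
                    weights: Dict[str, float]) -> float:
    """Weighted utilization of the requested resources after placement.

    Higher means tighter packing. Assumes fits(node, demand) is True.
    """
    requested = [r for r in RESOURCES if demand[r] > 0]
    if not requested:
        return 0.0
    weighted = sum(
        weights[r] * (node.used[r] + demand[r]) / node.capacity[r]
        for r in requested
    )
    return weighted / sum(weights[r] for r in requested)


def pick_node(nodes: Sequence[Node], demand: Dict[str, float],
              weights: Dict[str, float]) -> Optional[Node]:
    """Return the feasible node with the highest score, or None."""
    feasible = [n for n in nodes if fits(n, demand)]
    if not feasible:
        return None
    return max(feasible, key=lambda n: placement_score(n, demand, weights))


# Made-up cluster state and jobs, for illustration only.
nodes = [
    Node("node-a", {"cpu": 16, "mem": 64, "gpu": 2}, {"cpu": 4, "mem": 16, "gpu": 0}),
    Node("node-b", {"cpu": 16, "mem": 64, "gpu": 0}, {"cpu": 12, "mem": 32, "gpu": 0}),
    Node("node-c", {"cpu": 32, "mem": 128, "gpu": 4}, {"cpu": 8, "mem": 32, "gpu": 3}),
]
weights = {"cpu": 1.0, "mem": 1.0, "gpu": 2.0}  # GPUs weighted as the scarcer resource

training_job = {"cpu": 4, "mem": 16, "gpu": 1}
preprocessing_job = {"cpu": 4, "mem": 8, "gpu": 0}

print(pick_node(nodes, training_job, weights).name)       # node-c
print(pick_node(nodes, preprocessing_job, weights).name)  # node-b
```

With these made-up numbers, the GPU job lands on the node whose GPUs are already mostly in use, and the CPU-only job goes to the CPU-only node, which keeps GPU capacity free for later training jobs. A fixed rule like this is the sort of baseline an RL-based scheduler would be compared against in simulation.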

Learning Outcomes:

Apply RL techniques to solve real-world resource scheduling challenges.

Gain in-depth experience with Kubernetes, container orchestration, and resource management for AI and HPC jobs.

Develop and serve machine learning models in production environments for dynamic, real-time decision-making.

Collaborate with experts in AI, HPC, and cloud-native systems, building a strong professional network.

Who you are

Enrolled in or recently completed a degree in Computer Science, AI, or a related field.
Hands-on experience with Kubernetes, container orchestration, and cloud-native infrastructure.
Knowledge of RL algorithms, decision trees, and multi-objective optimization.
Familiarity with machine learning frameworks (e.g., TensorFlow, PyTorch) and model-serving tools (e.g., TensorFlow Serving, TorchServe).
Strong programming skills in Python; familiarity with Go is a plus.
Knowledge of RL applications in scheduling and resource management for AI workloads is an advantage.

How to Apply:

Submit your resume, a brief cover letter, and a link to your GitHub profile. In your cover letter, explain your interest in AI-powered Kubernetes scheduling and highlight any relevant experience.
Deadline: 6th of November.

Our commitment to our people

We empower individuals to celebrate their uniqueness here at InstaDeep. Our team comes from all walks of life, and we’re proud to continue encouraging and supporting applicants from underrepresented groups across the globe. Our commitment to creating an authentic environment comes from our ability to learn and grow from our diversity, and how better to experience this than by joining our team? We operate on a hybrid work model with guidance to work at the office at least 2 to 3 days per week to encourage close collaboration and innovation. We are continuing to review the situation with the well-being of InstaDeepers at the forefront of our minds.

Right to work: Please note that you will require the legal right to work in the location you are applying for.

Dev Ops / ML Ops Intern
