Description

Job Responsibilities

Infrastructure Development: Identify and resolve infrastructure gaps to ensure reliable, efficient, and scalable solutions.
AI/ML Solutions: Develop advanced AI/ML infrastructure solutions to enhance the efficiency of our ML teams.
System Design: Design and implement solutions for distributed storage systems, scheduling systems, high availability, and core reliability issues within large-scale GPU clusters.
Performance Optimization: Monitor and optimize the performance of our AI/ML infrastructure, ensuring high availability, scalability, and efficient resource utilization.
Automation Tools: Develop and deploy automation tools, monitoring solutions, and operational strategies to streamline infrastructure management and reduce manual tasks.
Collaboration: Work with various teams, including ML developers, data engineers, and DevOps professionals, to create a cohesive and integrated AI/ML infrastructure ecosystem.
Parallel Training: Optimize large-scale parallel training for state-of-the-art deep learning algorithms, including large language models, multi-modality models, diffusion, and reinforcement learning.
Research & Development: Research and develop our machine learning systems, including accelerated computing architecture, management, and monitoring.
Deployment: Deploy machine learning systems for distributed training and inference.
Cross-layer Optimization: Manage cross-layer optimization across systems, AI algorithms, and machine learning hardware (GPU, ASIC).

Minimum Qualifications

Bachelor's degree in Computer Science, Engineering, or a related technical field.
5-8+ years of experience in software engineering, with a strong background in developing and managing large-scale distributed systems, ideally within the AI/ML infrastructure domain.
Proficiency in programming languages such as Python, Go, or C++, with knowledge of cloud computing platforms such as AWS or Azure.
Familiarity with machine learning algorithms, platforms, and frameworks such as PyTorch and JAX. Basic understanding of GPU and/or ASIC functionality.
Expertise in at least one of the following programming languages in a Linux environment: C/C++, CUDA, Python.
Familiarity with open-source distributed scheduling, orchestration, and storage frameworks such as Kubernetes (K8s), YARN (Flink, MapReduce), HDFS, Redis, and S3, with practical experience in machine learning system development.
Mastery of distributed systems principles and participation in the design, development, and maintenance of large-scale distributed systems.
Strong communication and collaboration abilities, effective in working with diverse teams and individuals.

Preferred Qualifications

In-depth understanding of AI/ML workflows, including model training, data processing, and inference pipelines.
Practical experience with containerization technologies (Docker, Kubernetes), automation tools, and monitoring solutions (Prometheus, Grafana).
Exceptional problem-solving skills, capable of analyzing complex systems, identifying bottlenecks, and implementing scalable solutions.
A passion for continuous learning and staying abreast of new technologies and best practices in the AI/ML infrastructure space.
Experience with GPU-based high-performance computing and RDMA high-performance networking (MPI, NCCL, ibverbs).
Familiarity with distributed training framework optimizations (e.g., DeepSpeed, FSDP, Megatron, GSPMD).
Knowledge of AI compiler stacks (torch.fx, XLA, MLIR).
Experience with large-scale data processing and parallel computing.
In-depth CUDA programming and performance tuning experience (CUTLASS, Triton).

About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build the next generation of AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance, and other benefits, as well as flexibility in terms of remote work. The US base salary range for this full-time position is: $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Please see our privacy policy at https://www.together.ai/privacy




Together AI

Software Engineer, LLM Training Frameworks Engineer

Ugh.. sorry 😔 This job is closed.
