Site Reliability Engineer - AI Infrastructure
Department: Engineering
Location: Global Remote / San Francisco, CA
Employment Type: FullTime
Site Reliability Engineer - AI Infrastructure
Location: Global Remote / San Francisco · Full-Time
About Andromeda
Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved only for hyperscalers.
We began with a single managed cluster — but it filled almost instantly. Since then, we’ve been quietly building the systems, network, and orchestration layer that makes the world’s AI infrastructure more accessible.
Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most. Our platform routes training and inference jobs across global supply, unlocking flexibility and efficiency in one of the fastest-growing markets on earth.
Our long-term vision is to build the liquidity layer for global AI compute — a marketplace that moves the infrastructure and workloads powering AGI not dissimilar to the flows of capital in the world's financial markets.
We are expanding to new frontiers to find the brightest that work in AI infrastructure, research and engineering.
What You’ll Do
Provision, configure, and operate Kubernetes-based clusters for customers across multiple providers.
Build automation and tooling to streamline cluster deployments and integrations.
Debug customer issues across networking, storage, scheduling, and system layers.
Improve reliability and scalability of both training and inference infrastructure.
Design and implement monitoring, alerting, and observability for critical systems.
Collaborate with engineering and product teams to plan and deliver infrastructure for new services.
Participate in on-call and incident response, leading postmortems and reliability improvements.
What We’re Looking For
5+ years experience in SRE, DevOps, or infrastructure engineering roles.
Strong Linux systems and networking fundamentals.
Deep experience with Kubernetes and container orchestration at scale.
Proficiency with Infrastructure-as-Code (Terraform, Helm, Ansible, etc.).
Strong automation and scripting skills (Python, Go, or Bash).
Experience with observability stacks (Prometheus, Grafana, Loki, Datadog, etc.).
Track record of operating production systems and leading incident response.
Nice to Have
Exposure to ML/AI infrastructure or GPU-based systems (CUDA, Slurm, Triton, etc.).
Familiarity with high-performance networking (InfiniBand, NVLink) or distributed storage (VAST, Weka, Ceph).
Customer-facing support or consulting experience.
Why You’ll Love It Here
This is a builder’s role. You’ll have ownership and autonomy to shape how our systems run, working directly with customers and providers while building the foundation for reliable, scalable AI infrastructure.
There are more than 50,000 engineering jobs:
Subscribe to membership and unlock all jobs
Engineering Jobs
60,000+ jobs from 4,500+ well-funded companies
Updated Daily
New jobs are added every day as companies post them
Refined Search
Use filters like skill, location, etc to narrow results
Become a member
🥳🥳🥳 452 happy customers and counting...
Overall, over 80% of customers chose to renew their subscriptions after the initial sign-up.
To try it out
For active job seekers
For those who are passive looking
Cancel anytime
Frequently Asked Questions
- We prioritize job seekers as our customers, unlike bigger job sites, by charging a small fee to provide them with curated access to the best companies and up-to-date jobs. This focus allows us to deliver a more personalized and effective job search experience.
- We've got over 200,000 jobs from 15,000+ vetted companies. No fake or sleazy jobs here!
- We aggregate jobs from 15,000+ companies' career pages, so you can be sure that you're getting the most up-to-date and relevant jobs.
- We're the only job board *for* software engineers, *by* software engineers… in case you needed a reminder! We add thousands of new jobs daily and offer powerful search filters just for you. 🛠️
- Every single hour! We add 2,000-3,000 new jobs daily, so you'll always have fresh opportunities. 🚀
- Typically, job searches take 3-6 months. EchoJobs helps you spend more time applying and less time hunting. 🎯
- Check daily! We're always updating with new jobs. Set up job alerts for even quicker access. 📅
What Fellow Engineers Say
