Member of Technical Staff, DevOps / Infrastructure Engineering
Location: Anywhere - Remote
Department: First Principles Foundation
About FirstPrinciples:
FirstPrinciples is a non-profit organization building an autonomous AI Physicist to understand the nature of reality: the underlying structure, governing principles, and fundamental laws of our universe. We're developing an intelligent system that can explore theoretical frameworks, reason across disciplines, and generate novel insights to tackle the deepest unsolved problems in physics. By combining AI, symbolic reasoning, and autonomous research capabilities, we're developing a platform that goes beyond analyzing existing knowledge to actively contribute to physics research. Our goal is to accelerate progress on the questions that have captivated humanity for centuries.
We operate as a global nonprofit organization, with a Canadian foundation and a US-based 501(c)(3).
Job Description:
We're seeking a Member of Technical Staff, DevOps / Infrastructure Engineering to architect, automate, and scale the infrastructure that underpins our large-scale model training and research workflows. This role spans both cloud environments (AWS) and HPC infrastructure (Buzz & Lambda HPC GPU clusters with high-speed interconnects), requiring you to design and codify the systems, pipelines, and automation that enable our researchers and engineers to move fast with confidence. This is not a "click in the console" role - you'll bring strong fundamentals in Unix/Linux, experience in CI/CD and infrastructure-as-code, and a systems mindset to build automation and establish the standards that power breakthrough scientific discoveries. You'll be instrumental in building the reliable, scalable foundation that powers our autonomous AI Physicist while partnering closely with training engineers and researchers.
Key Responsibilities:
Infrastructure Architecture & Automation:
- Design and run large-scale pre-training experiments for both dense and MoE architectures, from experiment planning through multi-week production runs.
- Architect hybrid infrastructure solutions that span cloud and on-premises HPC environments seamlessly.
- Automate configuration management and drift detection using tools like Ansible, Salt, or Chef.
- Build systems that reduce operational toil and establish guardrails that let researchers focus on experiments, not operations.
CI/CD & Developer Experience:
- Build and own comprehensive CI/CD pipelines for training workflows, evaluation jobs, internal tools, and services with rollback capabilities, observability, and safety built in.
- Develop tooling for developer workflows including reproducible builds, ephemeral environments, secrets management, and cluster resource allocation.
- Create self-service infrastructure patterns that empower researchers and engineers.
- Design infrastructure that accelerates experimentation while maintaining reliability and reproducibility.
HPC & GPU Cluster Management:
- Manage and extend HPC environments including GPU clusters, InfiniBand networks, job schedulers (Slurm/Kubernetes hybrid), and container orchestration.
- Operate containerized and scheduled workloads efficiently across Docker, Kubernetes, and Slurm environments.
- Optimize cluster scheduling and resource allocation for high-performance GPU workloads.
- Debug GPU driver quirks, Slurm job issues, and InfiniBand networking hiccups as they arise.
Monitoring, Observability & Reliability:
- Implement comprehensive monitoring, logging, and alerting across all infrastructure layers using Prometheus, Grafana, ELK/EFK, and OpenTelemetry.
- Establish SLOs/SLIs for infrastructure reliability and create observability dashboards for long-horizon training runs.
- Build observability stacks that provide visibility into both system health and job-level performance.
- Proactively detect and resolve infrastructure issues before they impact research workflows.
Security & Compliance:
- Implement and manage secrets management and identity security solutions (Vault, KMS, IAM).
- Champion security best practices, IAM policies, and compliance standards across hybrid infrastructure.
- Design infrastructure with least privilege principles and strong security hygiene from the start.
- Maintain zero-trust security posture and comprehensive auditing capabilities.
Collaboration:
- Partner closely with training engineers and researchers to translate research needs into robust infrastructure solutions.
- Document best practices, create runbooks, and evangelize DevOps culture across the organization.
- Mentor teammates on infrastructure patterns, automation techniques, and operational excellence.
- Enable efficient pre-training runs and safe deployment of new infrastructure patterns through collaboration.
Qualifications:
- Educational Background: Bachelor's or Master's degree in Computer Science, Engineering, or related field.
- Experience: 3-10+ years in DevOps, Infrastructure, or SRE roles with proven hands-on systems engineering experience (not just certification-based - if you're a strong intermediate engineer ready to own infrastructure and grow into a senior role, we want to hear from you).
- Strong Unix/Linux systems background including kernel tuning, networking, storage, and process control experience.
- Infrastructure-as-Code experience with Terraform, Pulumi, or CloudFormation.
- Expertise building CI/CD systems and reproducible build pipelines (GitHub Actions, GitLab CI, Jenkins, etc.).
- Hands-on experience with AWS (EC2, S3, IAM, VPC, etc.) and cloud infrastructure fundamentals.
- Cluster orchestration and job scheduling experience with Kubernetes and Slurm.
- Monitoring and observability stack experience (Prometheus, Grafana, ELK/EFK, OpenTelemetry).
- Demonstrated success scaling infrastructure for high-performance or GPU workloads.
- Track record of managing GPU-accelerated clusters or HPC infrastructure.
- Experience automating workflows to reduce toil and scaling deployments safely.
- Skills: Strong programming skills in at least one general-purpose language (Python, Go, or Rust) plus Bash fluency.
- Collaboration & Communication: Ability to work cross-functionally. Strong communicator who can simplify complex topics for diverse audiences.
- Mindset: Entrepreneurial & mission-driven, comfortable in a fast-growing, startup-style environment, and motivated by the ambition of tackling one of the greatest scientific challenges in history.
- Demonstrated passion for physics and for making scientific knowledge accessible and impactful.
Bonus Skills:
- Prior work with HPC vendors or AI compute providers (Buzz HPC, NVIDIA DGX, Lambda, CoreWeave).
- Experience designing self-service infrastructure or internal developer platforms.
- Deep familiarity with GPU cluster management, scheduling, and high-throughput networking (InfiniBand).
- Cost management and optimization experience for large-scale compute infrastructure.
- Build system fluency and comfort with modern build tools (CMake, Bazel, Meson, Buck, Ninja).
- Experience supporting AI/ML research environments and training pipeline infrastructure.
What Excites Us (Beyond the technical qualifications, we're looking for someone who):
- Thinks automation first - You reflexively reduce toil by codifying repeatable operations rather than clicking through UIs.
- Build system love - Reproducibility and robust CI/CD excite you, not bore you. You're eager to build a state-of-the-art platform (your own Death Star) that researchers love using.
- DevOps philosophy - You understand why DevOps exists and live and breathe the philosophy, not just use the tools.
- HPC comfort - You can (or want to learn to) debug Slurm jobs, GPU driver quirks, or InfiniBand hiccups without blinking.
- Cloud + HPC pragmatism - You know (or are eager to learn) when to use AWS primitives versus optimizing HPC schedulers.
- Security from day one - You design infrastructure with least privilege and secrets management from the start, not as an afterthought.
- Collaborative builder - You help mentor and elevate the team, not just build in isolation.
Application Process:
- Interested candidates are invited to submit their resume, a cover letter detailing their qualifications and vision for the role, and references. Please include "Member of Technical Staff, DevOps / Infrastructure Engineering" in the cover letter.
Join us at FirstPrinciples and be a part of a transformative journey where science drives progress and unlocks the potential of humanity.