Senior Site Reliability Engineer
Department: Engineering
Location: San Francisco, CA
Employment Type: Full-Time
Who We Are
Hyperbolic Labs is on a mission to democratize AI by breaking down the barriers to computing power with our Open-Access AI Cloud. By aggregating computing resources across the globe, we offer an innovative GPU marketplace and AI inference service that promise affordability and accessibility for all. As pioneers at the intersection of AI and open-source technology, we believe in an open future where AI innovation is limited only by imagination, not by access to resources. We're looking for forward-thinking individuals who share our passion for making AI universally accessible, secure, and affordable. Join us in building a platform that empowers innovators everywhere to turn their visionary AI projects into reality.
As we prepare for growth after our Series A, our team, led by co-founders with PhDs in AI, Math, and Computer Science, is poised to redefine computing.
About the Role
We're seeking a Site Reliability Engineer to ensure Hyperbolic's GPU marketplace and AI infrastructure operate with exceptional reliability, performance, and security. As an aggregator of compute resources from hundreds of global suppliers, our SLOs, trust, and economic efficiency are product-critical. You'll be responsible for defining and maintaining service level objectives for job success rates, building robust incident response systems, managing capacity across our distributed GPU network, and implementing secure rollout and rollback mechanisms that keep our platform running smoothly 24/7.
In this role, you'll establish the reliability standards that define customer trust in our platform, design monitoring and alerting systems that provide deep visibility into our infrastructure, build automation for capacity management and resource allocation, lead incident response and post-mortem processes, and work closely with engineering teams to improve system resilience. You'll also focus on security and infrastructure hardening, ensuring strong isolation between tenants and suppliers, implementing key management systems, and building compliance frameworks. This is a high-impact position where your work directly influences our ability to deliver on our promise of affordable, accessible AI compute at scale.
Who You Are
Expert in site reliability engineering with proven experience defining, monitoring, and maintaining SLOs and SLAs for production systems
Strong background in capacity planning and management, including forecasting, resource allocation, and cost optimization for distributed systems
Experienced in incident response, on-call rotations, and post-mortem processes with a track record of reducing MTTR and improving system resilience
Deep knowledge of deployment systems including progressive rollouts, canary deployments, feature flags, and automated rollback mechanisms
Proficient in observability tools and practices including metrics, logging, tracing, and alerting systems (Prometheus, Grafana, ELK stack, or similar)
Strong understanding of infrastructure security including tenant isolation, workload isolation, network segmentation, and security hardening
Experience with secrets management, key management systems (KMS), certificate management, and secure credential rotation
Knowledge of compliance frameworks and security best practices for cloud platforms (SOC 2, ISO 27001, or similar)
Excellent problem-solving skills, with the ability to debug complex distributed systems issues under pressure
Strong automation mindset with experience using infrastructure-as-code, configuration management, and CI/CD pipelines
Preferred Qualifications
Experience operating GPU infrastructure, AI/ML platforms, or compute marketplaces at scale
Background in distributed systems, peer-to-peer networks, or decentralized infrastructure
Knowledge of multi-tenancy security patterns, container security, and runtime security tools
Experience with chaos engineering, fault injection, and resilience testing
Familiarity with cost optimization strategies for cloud infrastructure and GPU resources
Experience building and operating systems with demanding uptime requirements (99.9%+ SLAs)
Background at companies like AWS, Google Cloud, Azure, or fast-growing infrastructure startups
Contributions to open-source reliability, observability, or security tools
Hyperbolic is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.
