About SandboxAQ
SandboxAQ is a high-growth company delivering AI solutions that address some of the world's greatest challenges. The company’s Large Quantitative Models (LQMs) power advances in life sciences, financial services, navigation, cybersecurity, and other sectors.
We are a global, tech-focused team of experts in AI, chemistry, cybersecurity, physics, mathematics, medicine, engineering, and other specialties. The company emerged from Alphabet Inc. as an independent, growth capital-backed company in 2022, funded by leading investors and supported by a braintrust of industry leaders.
At SandboxAQ, we’ve cultivated an environment that encourages creativity, collaboration, and impact. By investing deeply in our people, we’re building a thriving, global workforce poised to tackle the world's epic challenges. Join us to advance your career in pursuit of an inspiring mission, in a community of like-minded people who value entrepreneurialism, ownership, and transformative impact.
About the Role
As a Senior Staff Site Reliability Engineer at SandboxAQ, you will be responsible for maintaining and improving the reliability, performance, and scalability of our infrastructure and services. You will work closely with engineering teams to ensure that our systems are resilient, highly available, and optimized for performance. Your expertise will guide the development of reliable software, and you will play a key role in shaping the reliability culture within the organization.
What You'll Do
- Incident Management: Lead efforts in incident response, root cause analysis, and postmortem processes, while developing strategies to minimize incidents and reduce recovery times.
- Capacity Planning: Analyze system performance and growth trends, and create capacity plans to ensure systems scale appropriately as demand increases.
- Monitoring & Observability: Design and maintain comprehensive monitoring, logging, and alerting solutions to ensure quick detection and resolution of system anomalies.
- Collaboration with Engineering Teams: Partner with software engineers, product teams, and DevOps to design systems that are both reliable and performant.
- Cost Optimization: Identify opportunities to optimize infrastructure costs while maintaining system reliability and performance.
- Automation & Tools Development: Build and improve automation tools, monitoring systems, and deployment pipelines to streamline operations and increase efficiency.
- Mentorship & Leadership: Mentor junior and mid-level engineers, providing technical leadership and guidance on SRE best practices, incident management, and system design.
- On-Call Rotation: Participate in an on-call rotation to respond to system outages and provide support for mission-critical systems.
About You
- 10+ years of experience in Site Reliability Engineering, DevOps, or similar roles.
- Strong experience with cloud platforms (AWS, GCP, or Azure), containerization (Docker, Kubernetes), and infrastructure-as-code (Terraform, CloudFormation).
- Proven ability to lead post-incident reviews and drive continuous improvement in system reliability.
- Excellent communication and collaboration skills, with the ability to work effectively with cross-functional teams.
- Expertise in systems administration, networking, and security in a cloud-native environment.
- Deep understanding of monitoring, observability, and logging tools (Prometheus, Grafana, ELK, Datadog, etc.).
- Proficiency in scripting languages (e.g., Python, Go, Bash) and configuration management tools (e.g., Ansible, Chef, Puppet).
- Experience designing and implementing scalable and reliable microservices architectures.
- Strong knowledge of CI/CD pipelines and related tools (CircleCI, Jenkins, GitLab, etc.).
- Location: Located on the East Coast.
Nice to Haves
- Experience with large-scale distributed systems and databases (e.g., Kafka, PostgreSQL, Cassandra, MySQL).
- Experience with service mesh (e.g., Istio, Linkerd) and serverless architectures.
- Strong understanding of compliance and security frameworks.
- Familiarity with chaos engineering practices and tools (e.g., Gremlin, Chaos Monkey).
The US base salary range for this full-time position is expected to be $183k-$304k per year. Our salary ranges are determined by role and level. Within the range, individual pay is determined by factors including job-related skills, experience, and relevant education or training. This role may be eligible for annual discretionary bonuses and equity.
SandboxAQ welcomes all.