Staff Site Reliability Engineer
Location: Remote
Department: Infrastructure & Security
Location Type: REMOTE
Employment Type: FULL_TIME
About the Role
What You'll Do
- Infrastructure & Kubernetes Orchestration
  - Designing, deploying, and maintaining production Kubernetes (EKS) clusters to ensure enterprise-grade availability for our users.
  - Eliminating manual configuration by building and managing a scalable infrastructure state entirely through Terraform.
  - Optimizing the AWS footprint (specifically EC2, RDS, and S3) to balance high performance with cost-efficiency and reliability.
- AI-Assisted Operations & Automation
  - Exploring and deploying agentic workflows for AI-assisted runbooks that automate complex operational decisions and repetitive tasks.
  - Building and evolving deployment pipelines using GitHub Actions or Semaphore to ensure delivery is both rapid and safe.
  - Focusing on toil reduction by developing internal tools that replace manual operational work with intelligent, autonomous systems.
- Observability & Incident Management
  - Driving the evolution of the observability stack in Datadog by implementing the sophisticated metrics, traces, and logs needed to meet SLOs.
  - Leading incident response efforts and facilitating the blameless postmortems that help systematically reduce recovery time (MTTR).
  - Defining and monitoring the SLIs and SLOs that ensure the platform consistently meets rigorous healthcare performance standards.
- Compliance & Collaboration
  - Ensuring every piece of infrastructure remains fully compliant with HIPAA and other critical healthcare regulatory requirements.
  - Mentoring engineers across the company on reliability best practices and contributing a clinical-safety perspective to cross-functional design reviews.
Why You Might Be a Good Fit
- You are a deeply proficient engineer who excels at the intersection of cloud infrastructure, automation, and system design.
- You possess a meticulous approach to observability and a passion for finding the "root cause" rather than just applying a patch.
- You enjoy exploring the "next frontier" of SRE, including how AI and agentic tools can make operations more efficient.
- You thrive in fast-paced environments where technical rigor is balanced with pragmatism and clinical-grade safety.
This Might Not Be The Right Fit If...
- You prefer working on static infrastructure rather than evolving systems through code and automation.
- You are uncomfortable with the "agile" pace of tech-driven platform development or integrating AI tools into your daily workflow.
- You prefer a siloed role that does not involve active participation in incident response or collaborative postmortems.
Your Qualifications
- 8+ years of experience in SRE, DevOps, or Platform roles managing production environments at scale.
- Expert technical depth in AWS (EKS, EC2, RDS, S3) and production-grade Kubernetes management.
- Proficiency with modern tooling including Terraform (IaC), Datadog (Observability), and CI/CD systems.
- Strong coding and scripting skills in Python, Bash, Ruby, or Go.
- Preferred: experience building agentic workflows or AI-assisted tooling to drive operational efficiency.
- A "rigor-first" mindset with a dedication to HIPAA-compliant, high-availability architecture.
The national pay range for this role is $140,000.00 – $170,000.00 per year. Actual compensation will be determined by factors such as the candidate's geographic market, experience, skills, and qualifications. Certain roles may also be eligible for additional compensation, such as stock options and bonuses, along with a comprehensive benefits package that includes medical, dental, and vision coverage, unlimited PTO, and a 401(k) plan. If your compensation requirement is greater than our posted range, please still consider applying; a determination can be made based on unique qualifications. Expected compensation ranges for this role may change over time.
