Shein

Senior Site Reliability Engineer

San Diego, CA
USD 92k - 149k
Linux Kubernetes Kafka Elasticsearch Redis Consul Etcd Zookeeper Python Go Prometheus Grafana Git Ansible Hadoop Spark APISIX Nginx
Description

Senior Site Reliability Engineer

Location: San Diego

Department: Site Reliability Engineering

About SHEIN 

SHEIN is a global online fashion and lifestyle retailer, offering SHEIN branded apparel and products from a global network of vendors, all at affordable prices. Headquartered in Singapore, with more than 15,000 employees operating from offices around the world, SHEIN is committed to making the beauty of fashion accessible to all, promoting its industry-leading, on-demand production methodology, for a smarter, future-ready industry. 

Position Summary 

We are seeking a Senior Site Reliability Engineer (Official Title: Senior Site Reliability Engineer I) with deep experience operating and evolving large-scale, mission-critical systems where availability and reliability are non-negotiable. At SHEIN, Site Reliability Engineers are hybrid software and systems engineers responsible for keeping production services always on while enabling the platform to scale rapidly and safely. In this role, you will own and support complex services and infrastructure, ensuring they consistently meet reliability and performance expectations.

The SRE team owns and maintains critical open-source and in-house technologies that underpin the platform and serves as a core contributor to major engineering initiatives. We are accountable for driving platform operability forward by reducing incident frequency, minimizing MTTR, and improving system resilience, efficiency, and resource utilization. You will work closely with global, cross-functional teams to design, build, and evolve observability and operational tooling (metrics, logs, traces, alerting, and automation) that provides deep visibility into system behavior. Through hands-on engineering and operational excellence, you will proactively identify risks and failure modes, help prevent incidents before they occur, and lead fast, effective responses when they do.

To succeed in this role, you will combine strong software engineering skills, solid expertise in Linux, networking, and distributed systems, and a passion for solving problems of scale, complexity, and reliability. Your work will directly contribute to delivering a stable, scalable, and high-performing experience for customers worldwide.

Job Responsibilities

  • Keep SHEIN’s mission-critical production systems running 24/7/365, participating in on-call rotations and acting decisively during incidents. 
  • Triage and resolve production incidents, leveraging AI-assisted log analysis and anomaly detection to accelerate root cause identification; drive continuous improvements that reduce MTTR and prevent recurrence.
  • Monitor and manage capacity planning and resource utilization, partnering with cross-functional teams to ensure systems scale safely while remaining cost-effective.
  • Own and operate core open-source infrastructure such as APISIX, Nginx, Kubernetes, Kafka, Elasticsearch, Redis, Consul, etcd, ZooKeeper, and other large-scale distributed systems.
  • Design, build, and maintain observability solutions (metrics, logs, traces, alerting), incorporating AI-powered anomaly detection and intelligent alert correlation to surface actionable signals from high-volume telemetry, improving system visibility and resiliency.
  • Automate operational workflows and eliminate manual toil through scripting, tooling, and process improvements, including the use of AI-assisted development tools (e.g., Claude Code) to accelerate the building and iteration of internal operational platforms.
  • Develop and maintain technical documentation, including runbooks, architecture diagrams, operational procedures, and on-call playbooks.
  • Work closely with global engineering teams to improve infrastructure reliability and performance through better system design and operational discipline.

Job Requirements

  • Bachelor’s degree in Computer Science, Information Systems, or a related technical discipline, or equivalent practical experience. 
  • 3+ years of experience owning and operating large-scale, high-traffic, 24/7 production systems, ideally in cloud or cloud-native environments. 
  • Solid foundations in Linux, networking, and distributed systems, with the ability to debug complex production issues end to end.
  • Hands-on experience with incident response, troubleshooting, and performance optimization in distributed systems.
  • Experience applying AI/LLM-powered tools to reliability engineering, including designing and building automation or internal tools using AI-assisted development tools (e.g., Claude Code).
  • Strong software engineering skills with experience building automation, tooling, or platforms in languages such as Python or Go. 
  • Experience operating or supporting open-source infrastructure components such as APISIX, Nginx, Kubernetes, Kafka, Elasticsearch, Redis, Consul, etcd, and ZooKeeper.
  • Experience with observability and monitoring systems (Prometheus, Grafana, Zabbix, etc.) and performance analysis.
  • Familiarity with Git, CI/CD pipelines, and configuration management tools (e.g., Ansible).
  • A strong sense of ownership, a systematic approach to problem-solving, and a passion for making systems more reliable.
  • Strong communication skills and the ability to collaborate effectively with geographically distributed teams.

Nice to Have

  • Bilingual fluency in Mandarin and English.
  • Kubernetes Administrator certification or equivalent real-world experience.
  • Experience operating big data platforms (Hadoop, YARN, HBase, Hive, Spark).
  • Experience applying AI/LLM-powered tools to reliability engineering, including designing and building automation or internal tools using AI-assisted development platforms (e.g., Claude Code).

Benefits and Perks 

  • Bonus eligible
  • Healthcare (medical, dental, vision, prescription drugs) 
  • Health Savings Account with Employer Funding 
  • Flexible Spending Accounts (Healthcare and Dependent care) 
  • Company-Paid Basic Life/AD&D insurance 
  • Company-Paid Short-Term and Long-Term Disability 
  • Voluntary Benefit Offerings (Voluntary Life/AD&D, Hospital Indemnity, Critical Illness, and Accident) 
  • Employee Assistance Program 
  • Business Travel Accident Insurance 
  • 401(k) Savings Plan with discretionary company match and access to a financial advisor  
  • Vacation, paid holidays, floating holiday and sick days   
  • Employee discounts 
  • Free weekly catered lunch 
  • Dog-friendly office (available at select locations) 
  • Free gym access (available at select locations) 
  • Free swag giveaways 
  • Annual Holiday Party 
  • Invitations to pop-ups and other company events 
  • Complimentary daily office snacks and beverages

Pay Range
$92,400 - $148,800 USD