WHO WE ARE
Zeta Global (NYSE: ZETA) is the Data-Powered Marketing Cloud that leverages advanced artificial intelligence (AI) and trillions of consumer signals to make it easier for marketers to acquire, grow, and retain customers more efficiently. Through the Zeta Marketing Platform (ZMP), our vision is to make sophisticated marketing simple by unifying identity, intelligence, and omnichannel activation into a single platform – powered by one of the industry’s largest proprietary databases and AI. Our enterprise customers across multiple verticals are empowered to personalize experiences with consumers at an individual level across every channel, delivering better results for marketing programs. Zeta was founded in 2007 by David A. Steinberg and John Sculley and is headquartered in New York City with offices around the world.
We’re looking for experienced Site Reliability Engineers (SREs) who can write production-grade code, have mastery of SLIs, SLOs, and error budgets, and are passionate about building scalable observability systems.
If you:
· Can code confidently in Python or Golang and solve real-world problems through automation (not only scripting).
· Have hands-on experience implementing SLIs, SLOs, and distributed tracing in production.
· Understand Kubernetes, Terraform, and Infrastructure as Code tools.
· Are excited about working with high-throughput, distributed systems processing millions of transactions daily…
Then this role might be for you!
Key Responsibilities:
· Design, implement, and manage SLOs, SLIs, and error budgets, ensuring reliability aligns with user expectations and business objectives.
· Develop production-grade software to enhance system reliability and reduce manual toil through automation.
· Implement and optimize observability solutions using tools like OpenTelemetry, with a focus on high-cardinality metrics, distributed tracing, and actionable insights.
· Drive postmortem processes and lead in-depth root cause analyses for incidents, ensuring lessons learned are effectively applied to prevent recurrence.
· Define and monitor MTTx metrics (MTTA, MTTR, MTTF), using them to guide system improvements and measure reliability progress.
· Collaborate with engineering teams to design systems with reliability and scalability in mind, incorporating capacity planning, resiliency patterns, and modern deployment strategies (e.g., Canary, Blue-Green).
· Lead design reviews for alerting strategies, ensuring effective signal-to-noise ratios in monitoring and incident management.
· Advocate for and implement best practices in incident response and system design to achieve optimal uptime and performance.
Your experience:
Strong Coding Background:
· 3+ years of experience as an SRE or in a similar role with hands-on coding.
· 2+ years of software development experience in Python or Golang, with a focus on building maintainable, production-quality code.
SRE Expertise:
· Deep understanding of SRE principles, particularly SLIs, SLOs, error budgets, and their real-world application.
· Hands-on experience conducting postmortems and implementing observability at scale.
Observability Skills:
· Expertise in designing and implementing end-to-end observability solutions using tools like OpenTelemetry, Prometheus, Grafana, or Honeycomb.
· Experience with distributed tracing and handling high-cardinality metrics in production environments.
Infrastructure Knowledge:
· 3+ years of experience with AWS and proficiency in Kubernetes, Terraform, and Infrastructure as Code (IaC) tools.
· Strong understanding of distributed systems, microservices architectures, and containerization (Docker, Kubernetes).
Monitoring and Automation:
· Hands-on experience with CI/CD tools and practices (GitOps, Jenkins, ArgoCD) and building automated pipelines.
· Familiarity with tools and frameworks for incident management and operational automation.
Additional Skills:
· Knowledge of modern deployment strategies (e.g., Canary, Blue-Green) and resiliency patterns (e.g., circuit breakers, retries).
· Experience with Kafka or similar distributed messaging systems.
· Strong analytical skills for statistical analysis of metrics to identify and resolve performance bottlenecks.
BENEFITS & PERKS
- Unlimited PTO
- Excellent medical, dental, and vision coverage
- Employee Equity and Stock Purchase Plan
- Employee discounts, virtual wellness classes, pet insurance, and more
COMPENSATION RANGE
The compensation range for this role is $140,000.00 - $190,000.00, depending on location and experience.
PEOPLE & CULTURE AT ZETA
Zeta considers applicants for employment without regard to, and does not discriminate on the basis of, an individual’s sex, race, color, religion, age, disability, status as a veteran, or national or ethnic origin; nor does Zeta discriminate on the basis of sexual orientation or gender identity or expression.
We’re committed to building a workplace culture of trust and belonging, so everyone feels invited to bring their whole selves to work. We provide a forum for employees to celebrate, support and advocate for one another. Learn more about our commitment to diversity, equity and inclusion here: https://zetaglobal.com/blog/a-look-into-zetas-ergs/
ZETA IN THE NEWS!
https://zetaglobal.com/press/?cat=press-release