Senior Platform & Reliability Engineer (SRE)
Department: EPD (Engineering, Product, Design)
Location: San Francisco
Employment Type: Full-Time
Agency Notice: We are not currently working with recruiting agencies for this role. Please do not contact Vizcom employees regarding this position. Any resumes submitted without a prior agreement will be considered unsolicited.
About Vizcom
Vizcom is a visual creation platform that combines modern web tooling with AI-powered workflows. Our stack includes a React/TypeScript frontend, Node/Koa + PostGraphile API services, PostgreSQL, Redis, BullMQ queues, and Kubernetes-based production infrastructure.
We’re hiring a senior owner of stability and infrastructure to ensure the platform is reliable, fast, and resilient as we scale.
Role Mission
Own service reliability end-to-end: prevent incidents, reduce blast radius when failures happen, and lead fast, high-quality recovery when production degrades.
This is a hands-on technical leadership role with authority to set reliability standards and enforce production guardrails.
Compensation
$200,000 – $250,000 base salary + meaningful equity
What You’ll Own
Reliability bar: Set and enforce SLIs/SLOs/error budgets for critical user flows.
Production architecture resilience: Drive failure isolation across API, workers, queues, and dependencies so one subsystem cannot take down core access.
Kubernetes runtime reliability: Define probe contracts, rollout/rollback standards, graceful shutdown behavior, scaling/resource policies, and startup safety.
Queue + job safety (BullMQ/Redis): Own poison pill containment and workload isolation.
Incident command quality: Lead Sev1/Sev2 response end-to-end (containment, communications, technical direction, RCA, corrective action execution).
Reliability operating system: Own observability quality (signals over noise), on-call effectiveness, runbooks, and postmortem discipline.
Release safety authority: Gate risky deploys and enforce reliability guardrails when production health is at risk.
Traits We’re Looking For
Calm, structured incident commander under pressure.
Thinks in failure modes and blast radius by default.
Pragmatic: can stabilize quickly, then implement durable fixes.
High ownership and strong written communication.
First 90 Days
Establish baseline reliability metrics and identify top platform risks.
Tighten incident response mechanics (roles, comms cadence, runbooks, status updates).
Deliver high-impact hardening fixes across probes/startup paths/queue safety.
Publish a prioritized 6–12 month reliability roadmap with clear ownership and milestones.
If possible, please describe one incident you personally led and send it to [email protected]:
1) what failed,
2) how you contained it, and
3) what permanent fixes you shipped and measured.
