Lifesight

Platform Reliability Engineer

Bengaluru, India
CI/CD, Cloud Infrastructure, Incident Management, Terraform, Python, Bash, TypeScript, Go, PostgreSQL, Redis, AWS, GCP, Azure
Description

Platform Reliability Engineer (SRE / DevSecOps)

Location: Bengaluru, India

Department: Technology

Skills: CI/CD, Platform Engineering, Cloud Infrastructure, Incident Management

About the role

We are building an AI-native software factory that rapidly launches SaaS products into the market. Our product engineers use AI-assisted development tools to build fast. We are hiring a senior, hands-on Platform Reliability Engineer to make sure every product we launch is production-grade: deployable, observable, secure, scalable, resilient, and cost-efficient.

You will own the shared production layer across our portfolio. Your job is to turn “it works” into “it runs reliably for customers.” You will define the standards, tooling, and operating practices that let a small engineering team launch and maintain many products without operational chaos.

What you will own

This role owns productionization across the service lifecycle: deployment standards, production readiness reviews, observability, SLOs and alerting, automated recovery, scalability, security hardening, and incident response. The goal is to reduce manual operational toil by turning repeatable ops work into software, templates, and automation.

Responsibilities


Build and own the “golden path” for launching and operating products in production:
  • infrastructure templates
  • CI/CD pipelines
  • environment provisioning
  • secrets management
  • DNS, SSL, and edge configuration
  • rollout and rollback workflows
  • backups and restore testing
  • monitoring, logging, tracing, dashboards, and alerts
Define and enforce a Production Readiness Review for every launch, covering reliability, security, scalability, rollback, observability, and recovery.
Define service-level indicators and service-level objectives for each product, and build alerting tied to customer impact rather than noisy infra events.
Architect and operate reliable cloud infrastructure for multi-product SaaS workloads:
  • autoscaling
  • load balancing
  • caching
  • queues and background jobs
  • database reliability
  • failover and disaster recovery
  • capacity planning and performance tuning
Own runtime and cloud security hardening:
  • IAM and least-privilege access
  • secret rotation and key management
  • dependency and container scanning
  • patching and vulnerability management
  • network boundaries and service-to-service access
  • audit logging
  • WAF/CDN and edge protections
  • secure release controls
Lead incident response for production issues:
  • triage
  • mitigation
  • root cause analysis
  • postmortems
  • follow-through remediation
Reduce operational toil by automating repetitive support, maintenance, and recovery work.
Partner closely with product engineers from design through launch so every new app is deployable through a standard platform, not a one-off setup.
For AI-native products, design runtime guardrails around:
  • model/API credentials
  • provider rate limits
  • graceful degradation during vendor issues
  • latency and cost monitoring
  • fallback behavior for core AI workflows

What we’re looking for

  • 5+ years of hands-on experience in SRE, platform engineering, production engineering, DevSecOps, or an infra-heavy backend role with direct production ownership
  • Strong experience with at least one major cloud platform such as AWS, GCP, or Azure
  • Strong infrastructure-as-code skills with Terraform, OpenTofu, Pulumi, or equivalent
  • Strong CI/CD and release engineering experience
  • Strong observability skills across logs, metrics, traces, dashboards, and alerting
  • Strong security fundamentals across IAM, secrets, network controls, vulnerability management, and secure delivery
  • Experience operating containers and/or serverless systems in production
  • Solid coding and scripting ability in at least one language such as TypeScript, Python, Go, or Bash
  • Experience with PostgreSQL, Redis, queues, background workers, and modern web app infrastructure
  • Experience owning on-call, incidents, postmortems, and recovery processes
  • Comfort working in a fast-moving startup where many products are launched from shared building blocks
  • Comfort reviewing and hardening AI-generated or AI-assisted code and infrastructure changes

Nice to have

  • Experience with multi-tenant SaaS products
  • Experience building internal developer platforms
  • SOC 2, ISO 27001, or security compliance preparation experience
  • Experience with LLM/AI application operations
  • Experience with FinOps or cloud cost optimization
  • Experience supporting a product portfolio rather than a single application

Success in the first 90 days

  • Establish a standard production deployment template for all new products
  • Put centralized monitoring, logging, tracing, and alerting in place
  • Create and enforce a production readiness checklist for launches
  • Define initial SLOs for core products
  • Implement backups and successfully test restore procedures
  • Roll out a baseline security hardening standard across all production apps
  • Create incident response runbooks and escalation paths

Success metrics

  • Time from product-ready codebase to production launch
  • Change failure rate
  • Mean time to detect and mean time to recover
  • Uptime and latency performance against agreed SLOs
  • Number of critical production incidents
  • Backup restore success rate
  • Security findings closed within target time
  • Infrastructure cost per product and per active customer