Redwood Software

Manager, Site Reliability Engineering

Remote US
API Kubernetes Terraform Docker AWS Bash Python
Description

It's fun to work in a company where people truly BELIEVE in what they're doing!

We're committed to bringing passion and customer focus to the business.

For this role, we are considering applicants in the United States or the United Kingdom.

OUR MISSION  

At Redwood Software we unleash human potential. We empower our customers with lights-out automation for their mission-critical business processes.

Redwood Software is the leader in full-stack automation for mission-critical business processes. With the first SaaS-based composable automation platform specifically built for ERP, we believe in the transformative power of automation. Our unparalleled solutions empower organizations to orchestrate, manage and monitor their workflows across any application, service or server, in the cloud or on premises, with confidence and control.

CORE VALUES

One Team. One Redwood

Make Your Own Weather

Obsess over Customer Success

Work the Problem

Be Curious

Own the Outcome

Respect Each Other

YOUR IMPACT

The SRE Manager is responsible for leading the Site Reliability Engineering (SRE) team, owning and optimizing the incident management process, and ensuring the reliability and performance of the company's SaaS products. This role requires strong leadership, excellent communication skills, and the ability to work collaboratively across various departments to achieve organizational goals. The ideal candidate will have a deep understanding of cloud infrastructure, incident response, and customer support.

  • Leadership and Team Management:

    • Lead and manage the SRE team, providing guidance, training, and support.

    • Own and lead the incident management process, ensuring incidents are managed effectively from detection to resolution.

    • Establish and maintain incident management policies and procedures.

    • Act as the primary point of contact for all incident-related activities, ensuring clear communication with stakeholders.

    • Build and manage a global team that scales with growing demand for the SaaS product offering.

  • Incident Response and Resolution:

    • Oversee the day-to-day management of alerts, system checks, and issue escalation.

    • Ensure the team provides 24x7 on-call support for critical SaaS events and emergencies.

    • Coordinate and lead incident response efforts, ensuring timely and effective resolution of incidents.

    • Perform Root Cause Analysis (RCA) and take corrective actions to prevent recurrence.

    • Ensure Mean Time to Resolution (MTTR) targets for escalated tickets are met by implementing effective escalation procedures and monitoring performance.

  • Service Level Management:

    • Define, monitor, and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) to ensure reliability and performance standards are met.

    • Regularly review and analyze service performance data to identify areas for improvement and ensure compliance with SLAs.

  • Process Improvement and Automation:

    • Proactively develop and implement monitoring and alerting systems within the EKS/K8s ecosystem.

    • Enhance infrastructure health by implementing automated checks and remediation scripts.

    • Continuously improve deployment code and automate manual tasks to streamline operations.

  • Collaboration and Communication:

    • Work closely with Support, Customer Success, Migration, and Professional Services teams to ensure exceptional customer service.

    • Maintain clear and detailed documentation of issues, remediation steps, and RCAs.

    • Work closely with management, product architects, and product team leads to highlight product issues impacting our SaaS offering quality, performance, and SLAs.

    • Communicate effectively with customers and internal teams, ensuring transparency and understanding of incident impacts and resolutions.

  • Innovation and Technology Integration:

    • Stay current with new technologies and integrate them into the cloud infrastructure to enhance performance and reliability.

    • Deploy applications to EKS/K8s clusters using Terraform and Helm, and maintain existing infrastructure under Docker Swarm.

YOUR EXPERIENCE

  • Proven experience as an AWS Cloud Engineer with hands-on expertise in EKS, Terraform, and Helm.

  • Strong background in Docker and Docker Swarm.

  • In-depth knowledge of AWS IAM roles, policies, and CloudWatch logs.

  • Proficient in Linux environments and scripting languages such as Bash and Python.

  • Excellent understanding of web technologies, REST APIs, and DevSecOps principles.

  • Experience with monitoring solutions like Grafana and Prometheus.

  • Exceptional oral and written communication skills.

  • Strong customer-facing communication skills, capable of effectively explaining issues and RCAs.

  • Experience in product/application support for SaaS-based products.

  • Understanding of APIs, databases, systems architecture, and design.

  • AWS Certified Solutions Architect.

  • Working knowledge of IaC, CI/CD, and observability.

DESIRED ATTRIBUTES

  • Ability to work independently and collaboratively within a team.

  • Strong problem-solving skills and the ability to troubleshoot issues in production environments.

  • Customer-focused mindset, always considering the impact on customers when planning deployments and updates.

  • Ability to lead and motivate a team, fostering a culture of continuous improvement and excellence.

This role requires a proactive leader who can manage and optimize the incident management process, ensuring the highest level of support and service for our SaaS product offerings.

If you like growth and working with happy, enthusiastic over-achievers, you'll enjoy your career with us!

THE LEGAL BIT

Redwood is an equal opportunity employer. Redwood prohibits unlawful discrimination based on race, colour, religion, sex, gender identity, marital or veteran status, age, national origin, ancestry, citizenship, physical or mental disability, medical condition, genetic information or characteristics (or those of a family member), sexual orientation, pregnancy or any other consideration made unlawful by regional or local laws. We also prohibit discrimination based on a perception that anyone has any of those characteristics or is associated with a person who has or is perceived as having any of those characteristics. All such discrimination is unlawful and is subject to a zero-tolerance policy.

Redwood will comply with all local data protection laws, including GDPR, when handling and processing personal data. All resumes submitted to Redwood will be retained for 6 months (12 months with your consent) after submission for recruitment purposes. Should you wish for us to remove your personal data from our recruitment database, please email us directly at Recruitment@Redwood.com.
