Tech Holding

Senior Site Reliability Engineer

Remote Mexico
Kubernetes Ansible Chef Python Bash GCP
Description

About us:

Working at Tech Holding isn't just a job; it's an opportunity to be a part of something bigger. We are a full-service consulting firm that was founded on the premise of delivering predictable outcomes and high-quality solutions to our clients. Our founders and team members have industry experience and have held senior positions in a wide variety of companies – from emerging startups to large Fortune 50 firms – and we have combined those experiences into a unique approach built on the principles of deep expertise, integrity, transparency, and dependability.

The Role:  

We are seeking a highly skilled and experienced Senior Site Reliability Engineer to join our growing team. You will play a central role in ensuring the reliability, scalability, and performance of our critical infrastructure and applications. Beyond core SRE responsibilities, you will also serve as a key liaison across various teams, fostering collaboration and ensuring seamless operations.

Responsibilities:

Site Reliability Engineering:

  • Proactively identify and mitigate potential issues impacting infrastructure and applications.
  • Partner with development teams to implement best practices for building reliable and scalable systems.
  • Stay up-to-date on the latest SRE trends and technologies.

Monitoring and Observability:

  • Design, implement, and maintain robust monitoring solutions using tools like Prometheus and Grafana.
  • Develop and configure alerts within tools like PagerDuty to ensure timely notification of potential issues.
  • Analyze and troubleshoot issues using collected application and infrastructure metrics.

Incident Management:

  • Lead incident response, ensuring timely resolution and minimizing downtime.
  • Document and communicate incident details effectively to stakeholders.
  • Conduct post-incident reviews to identify root causes and implement preventative measures.

Service Level Agreements (SLAs):

  • Collaborate with product and engineering teams to define clear and measurable SLAs for our SaaS offerings.
  • Establish Service Level Objectives (SLOs) for key metrics based on SLA requirements.
  • Define Service Level Indicators (SLIs) to track progress towards achieving SLOs.
  • Monitor SLO compliance and proactively identify potential SLA breaches.

Automation:

  • Identify opportunities for automation to improve efficiency and reliability.
  • Develop and implement automation scripts using tools like Python or Bash.
  • Automate routine tasks and incident response workflows.

Cross-Team Collaboration:

  • Act as a liaison between SRE, Product, Security, Application Engineering, and Customer Operations teams.
  • Facilitate communication and information sharing across teams to ensure smooth operations.
  • Work collaboratively to define and implement solutions that meet the needs of all stakeholders.

Mentorship and Knowledge Sharing:

  • Mentor and collaborate with junior SREs.
  • Share knowledge and best practices within the team.
  • Contribute to the development and documentation of internal SRE processes.

Required Skills:

  • 5-8 years of experience as a Site Reliability Engineer (SRE) or in a related role.
  • Experience with Google Cloud Platform (GCP).
  • Proven experience with monitoring tools like Prometheus and Grafana.
  • Strong understanding of incident management best practices.
  • Experience with alerting tools like PagerDuty.
  • Experience with scripting languages like Python or Bash for automation.
  • Excellent communication and collaboration skills.
  • Ability to work independently and as part of a team.
  • Strong problem-solving and analytical skills.
  • Passion for building reliable and scalable systems.

Nice to Have:

  • Experience with container orchestration platforms like Kubernetes.
  • Experience with chaos engineering principles.
  • Experience with configuration management tools like Ansible or Chef.

What we offer:

  • Remote Work Opportunities
  • Flexible Work Hours
