Description

Synopsis of the role

Site Reliability Engineering (SRE) combines software and systems engineering to create scalable and highly reliable software systems. SREs are responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their services.

What experience you need

8-10 years of hands-on experience in DevOps engineering, reliability engineering, and production support for large-scale IT systems on cloud platforms such as GCP and AWS
Solid hands-on experience with Kubernetes (GKE, EKS)
Strong scripting skills (Python, Shell, Groovy)
Good command of Linux, cloud networking, and Docker
Ability to read and write CI/CD automation pipelines in Jenkins
Ability to define infrastructure as code using Terraform
Exposure to maintaining databases such as MongoDB and Postgres

What you’ll do

Design, architect, and develop cloud-native solutions on Google Cloud Platform using services such as GKE, Cloud Functions, Cloud SQL, BigQuery, Pub/Sub, Composer, and Dataflow
Build and own infrastructure through Terraform code and maintain a high-quality code base
Work closely with development teams to eliminate repetitive processes through automation (Jenkins, Python, Groovy, gcloud)
Troubleshoot production incidents using tools such as Datadog, Google Cloud Operations suite, Grafana, and ChaosSearch
Participate in the SRE team’s on-call rotations, respond to incidents, and provide expert support in resolving customer-impacting production issues
Plan and implement disaster recovery (DR) for the systems and conduct regular DR tests to ensure business continuity in the event of a disaster
Actively contribute to SRE operational artifacts:
- Engineering documentation
- Standard operating procedures
Perform cloud cost optimization on the resources owned by SRE
Proactively keep up with security scans and reports to maintain a secure system, and regularly patch all cloud resources

What could set you apart

Good exposure to security patching of resources on Google Cloud
Ability to document engineering solutions and share the information across the team
Ability to help with developing standard operating procedures for SRE operations within the company
Willingness to work through official product documentation to build correct and secure systems
Exposure to Vertex AI on Google Cloud is a plus
Exposure to maintaining databases such as MongoDB and Postgres
Availability to work extended hours during production incidents and production changes

Primary Location:

CAN-Toronto-5700 Yonge

Function:

Function - Tech Engineering and Service Ops

Schedule:

Full time

Equifax
