Job Description:
Position Description:
Builds and operates highly resilient platforms in Amazon Web Services (“AWS”) cloud environments. Coordinates systems using Infrastructure as Code tools (IAM, ARM, Terraform, and Chef). Performs reliability engineering throughout the entire Software Development Lifecycle (SDLC) using Python, NodeJS, or Java. Deploys and supports distributed multi-tiered application systems using Kubernetes and Continuous Integration/Continuous Deployment (CI/CD) pipelines. Creates dashboards to capture the latency, availability, error, and saturation (performance) of applications using Splunk, Grafana, Prometheus, Catchpoint, and Datadog. Creates Service-Level Indicator/Service-Level Objective (SLI/SLO) dashboards and automated processes to update changes and create new dashboards. Identifies and resolves application issues using Datadog, Prometheus, and Splunk. Creates, maintains, and tunes monitors using ELK, OpenSearch, and OpenTelemetry. Supports applications hosted in AWS Cloud and Kubernetes. Builds, deploys, automates, and supports application services spanning multiple technology platforms, frameworks, and languages.
Primary Responsibilities:
Provides automated solutions for business and technology operational activities and manual tasks.
Analyzes the observability, resiliency, availability, and performance of applications.
Triages issues, performs deep dives, and executes root cause analysis.
Resolves business and system issues through enhancement initiatives.
Resolves issues as required during critical outages to avoid negative business impact.
Contributes to product architectural solutions, addressing high impact system issues.
Deploys and supports distributed multi-tiered application systems.
Manages the scalability and resiliency of applications.
Ensures daily business operations are not impacted by system issues (trade processing and correction, fund and sweep translation, and cash position and reconciliation).
Consults across the enterprise to plan for and implement enhancements to systems to avoid system outages and ensure seamless implementations.
Establishes end-to-end flow of application systems to quickly identify and resolve critical business issues.
Tests the resiliency of application systems using Chaos Engineering techniques.
Mentors junior team members.
Education and Experience:
Bachelor’s degree (or foreign education equivalent) in Computer Information Systems, Computer Science, Engineering, Information Technology, Information Systems, Mathematics, Physics, or a closely related field and five (5) years of experience as a Principal Site Reliability Engineer (or closely related occupation) maintaining and improving the reliability, performance, and scalability of distributed applications.
Or, alternatively, Master’s degree (or foreign education equivalent) in Computer Information Systems, Computer Science, Engineering, Information Technology, Information Systems, Mathematics, Physics, or a closely related field and three (3) years of experience as a Principal Site Reliability Engineer (or closely related occupation) maintaining and improving the reliability, performance, and scalability of distributed applications.
Skills and Knowledge:
Candidate must also possess:
Demonstrated Expertise (“DE”) performing site reliability engineering to analyze the observability, resiliency, availability, instrumentation, and performance of distributed applications; creating dashboards and monitors to capture the latency, availability, error, and saturation performance of distributed applications using Splunk, Grafana, Prometheus, Catchpoint, Telemetry, and Datadog; and creating SLI/SLO dashboards, monitors, and automated processes to update changes and create new dashboards.
DE developing Kubernetes platforms and automations in public and private cloud environments -- RKS (Rancher), EKS (AWS), and AKS (Azure) -- using Python, shell scripting, Git, Docker, and Kubernetes.
DE automating business and technology operational activities -- Kubernetes cluster rehydration, application recycling, patching, disaster recovery, and ITSM reporting -- using Jenkins Core, uDeploy, RunDeck, Ansible, and AWX.
DE performing triage and root cause analysis (RCA) in a multi-tiered fund accounting application system related to hardware, software, network, applications, and cloud service providers on multiple platforms -- Unix, Windows, and AWS Cloud environments -- using Datadog, Splunk, Grafana, and Kibana.
Category:
Information Technology
Fidelity’s hybrid working model blends the best of both onsite and offsite work experiences. Working onsite is important for our business strategy and our culture. We also value the benefits that working offsite offers associates. Most hybrid roles require associates to work onsite every other week (all business days, M-F) in a Fidelity office.