Pfizer

Lead Site Reliability Engineer (SRE) – Network Operations

Costa Rica
Python Perl Shell
Description

ROLE SUMMARY

Site Reliability Engineering (SRE) is a set of principles and practices that seeks to take a software engineering approach to solving IT Operations problems. In the Digital Command One team, we seek to apply an SRE approach to managing network operations, with a focus on:

  • Taking an analytical approach to ensuring services are running smoothly: identifying and regularly reviewing operational health indicators and taking action to ensure services are delivered to defined service levels.
  • Continuously analyzing service data to identify and execute upon service improvement opportunities. Opportunities will be assessed on their potential to remove manual effort (‘toil’), improve service reliability, and enhance customer experience.
  • Developing, or partnering with other teams to develop, automation that will remove toil from the environment and deliver a more reliable, cost-effective service to the company.

The SRE will take the lead in applying SRE principles for all technologies supported by Network Operations. These technologies include LAN, WAN, Wireless, DDI, Security, and IP Telephony.

ROLE RESPONSIBILITIES

  • Embed core principles and practices of Site Reliability Engineering in the delivery of network services. These services include LAN, WAN, Wireless, DDI, Security, and IP Telephony (IPT).
  • Ensure services are meeting the quality and reliability outcomes expected by customers: identify and regularly review health indicators, taking actions as required.
  • Apply core principles of chaos engineering where appropriate to improve network reliability at Pfizer’s critical sites.
  • Analyze network capacity and performance trends to anticipate future needs and ensure the network can handle growth.
  • Continuously analyze service data to identify improvement opportunities. Opportunities will be assessed on their potential to remove manual effort (‘toil’), improve reliability, and enhance customer experience. Maintain a backlog of prioritized opportunities.
  • Because automation is a key mechanism for removing toil, the Lead SRE will drive automation efforts for the services in question. This will entail both developing automation code directly and partnering with other teams to develop it to agreed standards.
  • Collaborate closely with Product Owners for Network products. This collaboration will include understanding roadmaps and evaluating automation and improvement opportunities.
  • Develop and maintain monitoring artefacts that allow for proactive and pre-emptive monitoring of services, with an objective of avoiding service disruptions. Partner closely with Command Center to integrate the artefacts and associated processes into their day-to-day operations.
  • Assume a leadership role on major outages or planned events that impact the services in question. Collaborate with the Command Center to ensure processes are optimized for management of critical incidents including escalation matrices, troubleshooting plans and impact statements.
  • Lead post-mortems for major events pertaining to the service. Ensure that learnings from service failures are identified and that action plans are developed and executed.

QUALIFICATIONS

  • Bachelor’s degree in Computer Science, Information Technology, or a related field, or equivalent practical experience.
  • 5+ years of relevant experience in enterprise network operations settings.
  • Proficiency in networking protocols such as TCP/IP, DNS, DHCP, BGP, and others.
  • As driving automation efforts is a core responsibility for this role, an understanding of software development concepts and scripting skills (Python, Perl, shell scripting) is required.
  • Understanding of network security principles and practices, including firewalls, VPN, IPS, and encryption.
  • Strong technical aptitude, with demonstrated experience in the network technologies outlined above. Exposure to Site Reliability Engineering principles and practices.
  • Keen data literacy and analytical ability. Must be able to aggregate data from different sources, derive insights and formulate actions accordingly.
  • Proven ability to build and improve processes and workflows. Relentless focus on removing toil (manual effort) through process re-design and automation.
  • Embraces accountability and exhibits an ownership demeanour, be it in leading the response to a major incident or facilitating service improvement. This includes the ability to effectively lead others who are more senior or who are not directly on the same team.
  • Exposure to agile ways of working and aptitudes to be effective in such a paradigm. These include a bias toward action and the discipline to do the day-to-day with excellence (examples: facilitate stand-ups, review health indicators, groom backlog).
  • Poise in high-pressure situations, with a track record of dealing effectively with situations like major outages.
  • Communicates in a succinct, accurate and timely fashion. Ability to effectively communicate technical issues/challenges to business recipients.

PHYSICAL/MENTAL REQUIREMENTS

  • Data Literacy - the ability to analyze, interpret and use data to provide actionable insights.

NON-STANDARD WORK SCHEDULE, TRAVEL OR ENVIRONMENT REQUIREMENTS

  • Occasional travel (less than 5%)
  • After-hours or weekend work may occasionally be required to participate in global meetings or to support major incidents.
 
Work Location Assignment: Flexible

EEO (Equal Employment Opportunity) & Employment Eligibility 

Pfizer is committed to equal opportunity in the terms and conditions of employment for all employees and job applicants without regard to race, color, religion, sex, sexual orientation, age, gender identity or gender expression, national origin, or disability.

Information & Business Tech
