NVIDIA

Service Reliability Operations Engineer

Bengaluru, India
Git Shell Python Ansible
Description

NVIDIA's NGC (NVIDIA GPU Cloud) team is looking for highly motivated System Administrator/DevOps engineers to design, develop, and implement a global, dynamic, state-of-the-art Service Reliability Operations Center (known as Mission Control) to provide extraordinary levels of support for our Cloud products and services.

As a key member of the Mission Control team, you will partner with other key members of our organization, including Site Reliability Engineering, Security Operations Center, DevOps teams, and other datacenter operations partners, to help make our services capable of providing near 100% availability. On the rare occasion that an incident occurs, you will be our front line to decrease the frequency and duration of any issue.

Working in partnership with the development community, the Mission Control team will develop monitors, alarms, and alerts to help make the service more reliable and improve our customer experience. Additionally, you will be closely involved in selecting the technologies that we will use in Mission Control to help monitor, run, and measure the effectiveness of the environment.

What you will be doing:

  • The team will provide its services 24/7 in a follow-the-sun model that spans continents.

  • You will directly report to a manager in Bangalore.

  • Each team member will need to work either a Saturday or Sunday each week. The hours worked may include an early or late start (a schedule of 10 hours per day, 4 days per week) to ensure that the combination of the US and India teams provides 24/7 coverage.

  • The heart of Mission Control will be monitoring and triaging a growing On-prem and CSP (Cloud Service Provider) production compute and storage Datacenter environment.

  • Every Mission Control team member will utilize alerts and alarms to help prevent issues and incidents when possible. You may also work with the developer community to develop and execute predictive support or diagnostic routines.

  • Perform Linux administration, network administration, and security incident monitoring tasks to drive your actions.

  • Mission Control team members will work with developers to learn how the service works, then translate that understanding into runbooks which the entire team will use. As new features and functionality are added, you will also update and evolve the runbooks as needed.

  • Strong communication and interpersonal skills will help keep the team engaged through incident resolution, including initiating the incident management procedure.

What we need to see:

  • BS/BE degree in Computer Science, Electronics, or equivalent experience.

  • Minimum of 3 years' experience administering open system servers in demanding Internet, Cloud, or Telecommunications production environments, in a Linux Systems Administration, DevOps, SRE, or NOC role.

  • Strong problem-solving, analytical, and troubleshooting abilities on Linux Clusters on public or private clouds.

  • Strong Linux administration experience: shell scripting, automation, DNS, DHCP, storage concepts, basic networking, iptables, etc. RHCE or equivalent level of knowledge.

  • Experience scripting in Python and writing Ansible playbooks is preferred, but not required.

  • Knowledge and understanding of application containers, container orchestration systems, and Git workflows.

  • Prior experience analyzing system and network performance using monitoring alerts, data, and graphs.

  • Demonstrated ability to master and maintain complicated environments.

NVIDIA offers highly competitive salaries and a comprehensive benefits package. We have some of the most forward-thinking and talented people in the world working for us and, due to unprecedented growth, our world-class engineering teams are growing fast. If you're a creative and autonomous engineer with real passion for technology, we want to hear from you.

