Production engineering is a discipline that involves designing, building, and maintaining large-scale production systems with high efficiency and availability. It encompasses various areas, including software and systems engineering practices, storage, data management, and services. Production Engineers possess expertise in different domains, such as storage architecture, high-performance distributed storage, data management, systems, networking, coding, database management, capacity planning, and continuous delivery and deployment, as well as open-source cloud-enabling technologies like Kubernetes, containers, and virtualization. Their responsibilities include ensuring reliable, scalable, high-performance storage solutions, optimizing data placement and access patterns, managing large-scale distributed storage systems, and ensuring low-latency data access for high-performance computing (HPC) and AI/ML workloads.
Production Engineers at NVIDIA ensure that our internal and external-facing GPU cloud services deliver the reliability and uptime promised to users, while enabling developers to change the existing system through careful preparation and planning, with an eye on capacity, latency, and performance. This role also requires an approach focused on automating storage operations, improving data access efficiency, and optimizing storage performance. Much of our software development focuses on eliminating manual work through automation, performance tuning, and improving the efficiency of storage and production systems.
What You Will Be Doing:
Design, implement, and support large-scale storage clusters, ensuring scalability, high availability, and data integrity.
Develop and maintain storage monitoring, logging, and alerting systems to ensure proactive detection and resolution of performance issues.
Work with AI/ML workloads to optimize storage architectures for low-latency access, efficient caching, and high-throughput performance.
Improve the lifecycle of storage services, from inception and design to deployment, operation, and continuous optimization.
Support storage services before they launch through activities such as system design consulting, developing automation frameworks, capacity management, and launch reviews.
Maintain storage infrastructure once live by monitoring availability, latency, and system health, using predictive analytics and AI-driven automation.
Optimize storage efficiency through compression, deduplication, tiering strategies, and intelligent workload placement.
Scale storage systems sustainably using AI/ML-driven automation, policy-based tiering, and dynamic data migration techniques.
Ensure data security and compliance by implementing encryption, access controls, and auditing mechanisms for storage systems.
Practice sustainable incident response and blameless postmortems.
Be part of an on-call rotation to support storage and production systems.
What We Need To See:
BS degree or equivalent experience in Computer Science, Storage Systems, or a related technical field (e.g., physics, mathematics), and 5+ years of practical experience.
Experience with high-performance storage solutions, including parallel file systems (Lustre, GPFS), distributed storage (Ceph, MinIO), and enterprise-scale object storage (S3, NetApp, Pure Storage, etc.).
Solid understanding of block, file, and object storage technologies, including their performance characteristics and standard methodologies.
Experience with storage networking protocols such as NFS, SMB, iSCSI, Fibre Channel, RDMA, and NVMe over Fabrics.
Expertise in algorithms, data structures, complexity analysis, software design, and maintaining large-scale Linux-based storage systems.
Experience in one or more of the following: C/C++, Java, Python, Go, Perl, or Ruby for storage automation, monitoring, and performance tuning.
Hands-on experience with infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform for automating storage deployments.
Experience with observability and tracing tools like InfluxDB, Prometheus, and the Elastic stack for monitoring storage system health.
Ways to stand out from the crowd:
Deep understanding of large-scale distributed storage architectures, replication strategies, and erasure coding techniques.
Proven experience in capacity planning, performance tuning, and troubleshooting high-throughput storage systems.
Experience with Git, code review, pipelines, and CI/CD for handling infrastructure as code.
Interest in analyzing and improving distributed storage system performance at scale.
Strong debugging skills with a systematic problem-solving approach to identify complex storage issues.
Experience using or running private and public cloud storage solutions based on Kubernetes, OpenStack, or hybrid cloud architectures.
Ability to design and implement automated storage migration, backup, and disaster recovery strategies.
Thrive in collaborative environments and enjoy working with various teams to optimize storage performance.
Flexible in adapting to different working styles and emerging storage technologies.
At NVIDIA, you’ll be at the forefront of innovative storage technologies, working on high-performance storage solutions that power the next generation of AI, HPC, and cloud computing. NVIDIA leads groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. We have some of the most forward-thinking and hardworking people on the planet working for us. If you're creative, passionate, and self-motivated, we want to hear from you!
The base salary range is 148,000 USD - 287,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.