- Interface with external vendors and internal hardware, mechanical, power, thermal, manufacturing, and software engineers to understand system architecture, and use that understanding to guide and develop hardware fault management for various server products.
- Leverage a deep understanding of RAS (reliability, availability, serviceability) to improve error reporting and error handling mechanisms for better operational quality and cost efficiency.
- Champion engineering and operational excellence, establishing metrics and process for regular assessment and improvement.
- Develop visibility through data visualization and implement systemic solutions to hardware health issues.
- Proactively create experiments and tooling to detect and diagnose hardware/firmware/software health issues.
- Troubleshoot, diagnose, and root-cause system failures, isolating the failing components and failure scenarios while working with internal and external stakeholders.
- Drive discussions with external and internal teams on test specifications and methodologies to continuously improve test quality.
- Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience.
- 5+ years of work experience in one or more domains such as: ASIC development (silicon design, bring-up, or characterization), compute (ARM, x86), AI/ML hardware/software (GPUs, TPUs).
- Knowledge of the architecture and components of one of the following products: server, PC, or laptop.
- Development or debug experience in one or more of the following areas: hardware fault management, error reporting, error handling on hardware products.
- 7+ years of experience with a subset of the following AI systems areas: accelerators (GPU/ASIC), kernel development, performance optimization (e.g., on NVIDIA, AMD, Intel, or other accelerators), computer architecture, HPC communication libraries (e.g., NCCL, MPI), performance enablement, tracing, profiling, and debugging.
- Experience with architecture of disaggregated systems at scale.
- Understanding of hardware development process and how to scope out test plans accordingly.
- Experience troubleshooting problems at the system level, spanning multiple components as well as hardware/firmware/software boundaries.