Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications, without the hassle of managing hundreds of GPUs or TPUs.
Cerebras' current customers include global corporations across multiple industries, national labs, and top-tier healthcare systems. In January, we announced a multi-year, multi-million-dollar partnership with Mayo Clinic, underscoring our commitment to transforming AI applications across various fields. In August, we launched Cerebras Inference, the fastest Generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services.
About The Role
In this role, you will build and lead the function responsible for a world-class monitoring solution for very large-scale AI cluster/supercomputer infrastructure. Such AI clusters have hundreds of wafer-scale accelerator systems, thousands of high-end servers, and several thousand networking ports, including switches, plus network-attached storage, all in a large-scale data center.
You will be the primary engineering owner responsible for building a monitoring solution that tracks every element within the cluster, as well as the key data center metrics that interact with our cluster. Cluster monitoring software involves both a complex data collection/retention layer and a presentation layer that end users interface with directly.
It also involves building the critical telemetry needed to gain insights, along with alerting mechanisms. You will be responsible for building an intuitive solution that simplifies management of the cluster and minimizes the reaction time needed to address incidents on large-scale clusters. Overall, the solution you build will be the primary management tool that cluster operators rely on to triage a variety of incidents on the cluster.
Responsibilities
- Be the primary engineering face and owner of this function.
- Provide strong technical leadership for Cerebras in cluster monitoring.
- Actively interface with users and product owners to gather and understand gaps and pain points in cluster monitoring.
- Develop, maintain, and execute the roadmap for the cluster monitoring software.
- Build an outstanding engineering team to deliver a world-class monitoring product.
Skills And Qualifications
- 3+ years in a demonstrated engineering leadership or management role in distributed systems monitoring.
- Proven track record of delivering products and of launching and deploying distributed solutions to customers.
- Excellent communication, articulation, and collaboration skills, and the ability to act like a stakeholder.
- Ability to make tough decisions backed by data and trade-off analysis.
- Outstanding sense for product and user journeys (including frontend UI sense, backend scalability sense).
- Outstanding roadmap and schedule execution skills under tight timelines and budgets.
- Strong technical background in distributed systems software development (K8s and its ecosystem).
- Strong technical background in building observability/monitoring software (similar to Prometheus/Grafana) is preferred.
- Experience straddling low-level bare metal and high-level services monitoring is preferred.
- Technical experience with bare metal cluster management software and related monitoring is preferred.
- Strong technical experience in computer and cluster networks is preferred.
Why Join Cerebras
People who are serious about software make their own hardware. At Cerebras we have built a breakthrough architecture that is unlocking new opportunities for the AI industry. With dozens of model releases and rapid growth, we’ve reached an inflection point in our business. Members of our team tell us there are five main reasons they joined Cerebras:
- Build a breakthrough AI platform beyond the constraints of the GPU.
- Publish and open source their cutting-edge AI research.
- Work on one of the fastest AI supercomputers in the world.
- Enjoy job stability with startup vitality.
- Thrive in our simple, non-corporate work culture that respects individual beliefs.
Read our blog: Five Reasons to Join Cerebras in 2025.
Apply today and become part of the forefront of groundbreaking advancements in AI!
Cerebras Systems is committed to creating an equal and diverse environment and is proud to be an equal opportunity employer. We celebrate different backgrounds, perspectives, and skills. We believe inclusive teams build better products and companies. We try every day to build a work environment that empowers people to do their best work through continuous learning, growth and support of those around them.