TensorWave

Operations Engineer

Las Vegas, NV
Grafana Datadog Prometheus Linux TCP/IP DNS Python Bash Kubernetes

Department: Operations

Location: Las Vegas, Nevada

Employment Type: Full-Time

About TensorWave

Our mission is simple: deliver seamless, secure, reliable, and resilient AI compute at scale. We've built a versatile cloud platform that eliminates infrastructure barriers, empowering builders to focus on innovation instead of fighting their stack. Because breakthrough AI should move at the speed of ideas, not infrastructure.

 

About the Role

We’re looking for an Operations Engineer to join our team during an exciting phase of growth. In this role, you’ll monitor systems, execute runbooks, and coordinate with on-site teams and engineering when escalation is needed. You’ll work closely with cross-functional partners to support business objectives while upholding our standards for excellence, collaboration, and impact.

What You’ll Do

  • Monitor customer environments in real time across TensorWave data centers using monitoring and observability platforms

  • Track key health indicators including GPU utilization, node availability, network performance, storage health, and Kubernetes cluster status

  • Identify anomalies, degradations, and emerging issues before they escalate into customer-impacting events

  • Maintain situational awareness of active customer workloads, scheduled maintenance windows, and known issues across the fleet

  • Provide regular health summaries and flag trends that may indicate systemic risks to customer environments

  • Serve as the first responder to customer-reported issues and system-generated alerts, performing initial triage and classification

  • Execute established runbooks to diagnose and resolve common infrastructure issues including node failures, connectivity problems, and resource contention

  • Escalate issues to L2 engineering or on-site data center teams with clear, actionable context

  • Maintain accurate incident records including timeline, actions taken, and resolution details in the ticketing system

  • Communicate status updates to internal stakeholders during active incidents, ensuring visibility across operations and customer-facing teams

  • Follow and contribute to operational runbooks and standard operating procedures, identifying gaps or improvements based on real-world incidents

  • Help tune monitoring and alerting by providing feedback on alert quality, false-positive rates, and coverage gaps

  • Document tribal knowledge, recurring issue patterns, and lessons learned to strengthen the team’s operational knowledge base

  • Participate in post-incident reviews, contributing observations from the frontline monitoring and response perspective

  • Support change management processes by monitoring customer environments during planned maintenance and infrastructure changes

  • Coordinate with on-site data center operations teams for hands-on remediation activities that require physical access

 

Who You Are

Required Qualifications

  • 1–3 years of experience in a NOC, operations center, technical support, systems administration, or similar infrastructure operations role

  • Experience monitoring production infrastructure using observability tools (Grafana, Datadog, Prometheus, or similar)

  • Foundational Linux systems administration skills with the ability to navigate systems, read logs, and execute diagnostic commands

  • Basic understanding of networking fundamentals including TCP/IP, DNS, and VLANs

  • Experience following operational runbooks and structured triage procedures in a production environment

  • Strong written communication skills, particularly the ability to write clear incident updates and escalation summaries under time pressure

  • Demonstrated ability to stay calm, prioritize effectively, and work methodically during high-pressure situations

  • Familiarity with ticketing and incident tracking systems (PagerDuty, Jira, ServiceNow, or similar)

  • Willingness to work shift schedules including nights, weekends, and holidays as part of a 24/7 coverage model

Preferred Qualifications

  • Experience in a customer-facing operations role at a cloud provider, managed services provider, or colocation facility

  • Exposure to GPU infrastructure, HPC clusters, or AI/ML compute environments

  • Familiarity with Kubernetes concepts and basic container troubleshooting

  • Scripting ability in Python, Bash, or similar for basic automation and log analysis

  • Experience with high-performance networking concepts (RDMA, InfiniBand, or RoCE)

  • Background working across multiple geographically distributed data center sites

  • Relevant certifications (CompTIA Server+, Linux+, RHCSA, CCNA, or equivalent)

 

What We Offer

  • Stock Options

  • 100% paid Medical, Dental, and Vision insurance for Employees

  • Company Health Savings Account Contributions

  • 100% paid Short-Term and Long-Term Disability Insurance for Employees

  • Life and Voluntary Supplemental Insurance Options

  • Other Insurance Options, such as Pet & Legal Insurance

  • Various Supplementary Health Benefits, such as discounted Virtual Healthcare Appointments and Serious Illness Support

  • Flexible Spending Account

  • 401(k)

  • Employee Assistance Program

  • Flexible PTO

  • Paid Holidays

  • Parental Leave

  • Other In-Office Perks

 

Equal Employment Opportunity

TensorWave is an Equal Opportunity Employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We do not discriminate on the basis of any protected status under applicable law.

 

Reasonable Accommodations

TensorWave provides reasonable accommodations in accordance with applicable laws. If you require accommodation during the hiring process, please contact [email protected].

 

Employment Eligibility

All offers of employment are contingent upon verification of identity and authorization to work in the United States, as required by law.

 

Background Checks

Where permitted by law, employment may be contingent upon the successful completion of a job-related background check.

 

Data Privacy Notice

By submitting an application, you acknowledge that TensorWave may collect, use, and retain your personal information for recruiting and employment-related purposes in accordance with applicable data privacy laws.
