Hark

Infrastructure Compute

San Jose, CA
Kubernetes, Rust, Go, PyTorch, Ray, InfiniBand, RoCE, RDMA
Description

Infrastructure, Large-scale Training

Location: San Jose

Department: AI Infrastructure

About Hark

Hark is an artificial intelligence company building advanced, personalized intelligence: one that is proactive, multimodal, and capable of interacting with the world through speech, text, vision, and persistent memory.

We're pairing that intelligence with next-generation hardware to create a universal interface between humans and machines. While today's AI largely operates through chat boxes and decade-old devices, Hark is focused on what comes next: agentic systems that interact naturally with people and the real world.

To get there, we're developing multimodal models and next-generation AI hardware together, designed from the ground up as a single, unified interface for a new era of intelligent systems.

About the Role

We are looking for a Member of Technical Staff, Infrastructure Compute to lead and manage large-scale GPU computing clusters powering our AI training and deployment workloads. You'll work at the intersection of systems engineering and machine learning infrastructure, owning the reliability, scalability, and efficiency of the compute platform that our research and engineering teams depend on. This is a high-impact, highly technical role suited for someone who thrives in complex distributed systems environments and cares deeply about infrastructure as a product.

Responsibilities

  • Design, implement, and maintain Infrastructure as Code (IaC) best practices to enable repeatable, auditable, and scalable cluster provisioning.
  • Enhance and harden CI/CD deployment pipelines to ensure robust, secure, and low-latency model service delivery across production environments.
  • Own and evolve stable training infrastructure operating at the scale of 10,000+ GPUs, including job scheduling, fault tolerance, and network fabric optimization.
  • Partner closely with ML researchers and engineers to understand compute bottlenecks and translate them into infrastructure improvements.
  • Monitor system health, define SLOs, and lead incident response for critical training and inference workloads.
  • Drive capacity planning, cost efficiency initiatives, and hardware lifecycle management across the GPU fleet.
  • Contribute to internal tooling and platform abstractions that improve developer experience for teams consuming compute resources.

Requirements

  • 5+ years of experience in infrastructure, systems, or platform engineering, with at least 2 years working in ML or HPC environments.
  • Demonstrated experience managing GPU clusters or large-scale distributed compute infrastructure.
  • Strong proficiency in at least one systems or infrastructure programming language.
  • Deep understanding of networking fundamentals relevant to high-throughput training workloads (experience with RDMA, InfiniBand, or RoCE is a plus).
  • Experience with container orchestration, job scheduling, and multi-tenant resource management.
  • Proven track record owning production systems with high reliability requirements.
  • Strong debugging and observability skills across the full infrastructure stack.

Bonus Qualifications

  • Kubernetes (K8s), particularly experience operating large, GPU-aware clusters.
  • Pulumi or similar modern IaC tooling.
  • Rust and/or Go for systems-level tooling and performance-critical services.
  • Familiarity with PyTorch and Ray for understanding workload patterns and integration requirements.

Compensation

The pay offered for this position may vary based on several individual factors, including job-related knowledge, skills, and experience. The total compensation package may also include additional components and benefits depending on the specific role. This information will be shared if an employment offer is extended.
