Sanas

Principal Data Engineer

Palo Alto, CA
Python, SQL, PostgreSQL, Snowflake, Databricks, ClickHouse, Spark, Flink, Ray, AWS, GCP, Airflow, Dagster, Kafka, Machine Learning, AI
Description

Staff+ Data Engineer (ML Infrastructure)

Location: Palo Alto, CA

Department: Science

About the Role

Our models are only as good as the data that trains them. As a Staff Data Engineer, you'll own the infrastructure that takes raw audio — millions of hours across accents, languages, noise conditions, and recording environments — and turns it into clean, reproducible, training-ready data at scale. You'll work directly with AI research scientists and ML engineers to design systems that move fast without breaking the data quality guarantees our models depend on.

Job Description

Data pipeline & lakehouse architecture

  • Design and implement large-scale data pipelines that ingest, transform, validate, and serve high-quality audio and metadata for AI model training, evaluation, and product telemetry.
  • Own the lakehouse architecture — table format choices (Iceberg vs. Delta Lake), partitioning strategies, metadata management, and schema evolution — with a bias toward reproducibility and auditability.
  • Build and maintain batch and streaming pipelines using Spark, Flink, and orchestration tooling (Airflow or Dagster), with a clear-eyed view of when each is the right tool.
  • Extend and maintain feature store infrastructure to serve low-latency, versioned features for both training and real-time inference.

Audio data at scale

  • Develop and maintain pipelines purpose-built for the unique challenges of audio data: large file volumes, time-series feature extraction, speaker and language metadata, and annotation versioning.
  • Build tooling that supports the full audio data lifecycle — from raw ingestion and quality filtering through augmentation, segmentation, and training split generation — with reproducibility guarantees at every stage.
  • Partner with ML engineers and research scientists to design data schemas, sampling strategies, and evaluation datasets that accurately reflect production conditions.
  • Own data pipelines that feed human-in-the-loop annotation workflows — ensuring clean round-trips between raw data, labeling platforms, and training-ready outputs.

Platform reliability & governance

  • Instrument pipelines with observability, data quality checks, lineage tracking, and alerting — so failures surface fast and root causes are traceable.
  • Drive build vs. buy decisions for data quality, observability, and cataloging tooling with a clear framework grounded in Sanas's scale and roadmap.
  • Own disaster recovery design for critical data assets — training datasets, evaluation benchmarks, and model checkpoints.

Technical leadership

  • Set the technical bar for the data engineering team — review designs and code, establish patterns, and document decisions in a way that raises the floor for everyone.
  • Work cross-functionally with AI research, infrastructure, product, and legal to align data architecture with business needs and regulatory requirements.
  • Contribute to hiring — identify strong candidates, conduct technical interviews, and help define what great looks like for data engineering at Sanas.

Qualifications

  • 5+ years of experience in data engineering, ML infrastructure, or data platform roles.
  • Deep expertise building distributed batch and streaming data systems in production.
  • Strong command of data processing frameworks (Spark, Flink, and Ray) and orchestrators (Airflow or Dagster).
  • Hands-on experience with cloud data platforms — Snowflake, Databricks, or ClickHouse — and object storage (S3, GCS) on AWS or GCP.
  • Solid understanding of data lifecycle management: privacy, security, compliance, and reproducibility from ingestion through model training.
  • Proven ability to work directly with ML researchers and engineers to translate model requirements into data infrastructure decisions.

Bonus

  • Direct experience with audio data pipelines — file handling at scale, time-series features, speaker metadata, or audio annotation tooling.
  • Familiarity with ASR, TTS, or speech enhancement model training workflows and the data requirements specific to each.
  • Experience with MLOps tooling — experiment tracking, dataset versioning (DVC, LakeFS), and training pipeline orchestration.

About the Company

Sanas is pioneering the future of human communication. Founded by a team of Stanford researchers and entrepreneurs with deep industry experience, Sanas has developed the world's first real-time speech AI platform capable of accent translation, noise cancellation, speech enhancement, cross-language communication, and more.

Sanas makes conversations clearer, more inclusive, and more effective, removing barriers that prevent people from being understood, regardless of accent, background noise, or native language.

Sanas is currently one of the fastest-growing startups in Silicon Valley, growing from $16M to $50M ARR in 2025. The company's core business is profitable and is on track to end 2026 with more than $120M ARR. Our team combines deep expertise in model innovation and systems engineering with a design-minded product engineering culture to build and ship cutting-edge AI models and experiences entirely in-house.

Sanas is a 180-strong team, established in 2020. In that short span, we've secured over $100 million in funding. Our innovation has been supported by the industry's leading investors, including Insight Partners, Google Ventures, Quadrille Capital, General Catalyst, Quiet Capital, and other influential investors. Our reputation is further solidified by collaborations with numerous Fortune 100 companies. With Sanas, you're not just adopting a product; you're investing in the future of communication.

If you're looking to have a significant role in roadmapping and driving technical direction, if you're looking to deploy big, challenging ideas without much overhead or slowness, if you're looking to leave your mark on an ambitious, generational mission to change how the world thinks about speech + AI, then Sanas is a well-suited place for you.

