Research Engineer - Data
Department: Bits: Research, LLMs, machine learning, infra
Location: Menlo Park
Employment Type: Full-time
About Periodic Labs
The most important scientific discoveries of our time won’t happen in a traditional lab. We’re an AI and physical sciences company building state-of-the-art models to accelerate breakthroughs across materials, energy, and beyond. Backed by world-class investors and growing rapidly, we operate at the pace the frontier requires. Our team brings deep expertise, genuine ownership, and an insatiable drive to push the boundaries of what’s scientifically possible.
About the Role
You will build and drive the data foundation for our research efforts. This means owning data strategy end-to-end: sourcing and procuring external datasets, integrating internally generated experimental data into the training stack, and ensuring the team always has the right data — in the right shape — to train and improve frontier models.
This role sits at the intersection of data engineering, research infrastructure, and strategy. You will work closely with pretraining, midtraining, and RL researchers to understand what data the models need, then build the pipelines and systems to get it there. The work spans collecting and organizing diverse data sources, improving data quality through deduplication and preprocessing, and ensuring that new experimental results are incorporated in a structured, repeatable way that makes them useful for model development.
What You’ll Do
Own data strategy across the training stack — identifying gaps, evaluating new sources, and shaping the overall data roadmap in collaboration with research leads
Source, evaluate, and procure external datasets across scientific domains including chemistry, physics, materials science, mathematics, and lab instrumentation
Build and maintain robust pipelines for ingesting, processing, and versioning large-scale datasets from heterogeneous sources
Design and implement new evaluation datasets and new RL environments to track and improve our key capabilities
Integrate internally generated experimental data — from lab instrumentation, simulations, and model outputs — into the training stack in a structured and repeatable way
Build tooling that makes it easy for researchers to inspect, query, and understand the data that goes into training runs
Stay current with research on data-efficient training, synthetic data generation, and data selection methods — and bring relevant ideas into production
You Will Thrive in This Role If You Have
Experience building large-scale data pipelines for LLM pretraining or midtraining, including web-scale or scientific corpora
Familiarity with dataset versioning, lineage tracking, and reproducibility tooling such as DVC, Delta Lake, or custom solutions
Experience sourcing and evaluating third-party datasets, including licensing considerations and quality assessment
Strong Python engineering skills and comfort building production-quality tooling in a research environment
Experience building evaluation datasets and RL environments
Experience collaborating directly with ML researchers to translate data needs into pipeline requirements and back again
A research-oriented mindset — you run experiments on data, measure outcomes, and iterate with rigor
Especially Strong Candidates May Also Have
Experience curating scientific datasets specifically for domain-adaptive continued pretraining or instruction tuning
Familiarity with synthetic data generation methods, including model-generated data pipelines and quality verification
A background in a physical science or engineering discipline that informs how you think about scientific data quality and structure
Experience with multimodal data — integrating text, structured numerical data, molecular representations, or spectral data into unified training pipelines
Mechanics
Minimum education: Bachelor’s degree or similar experience
Location: Our lab is in Menlo Park. We prefer candidates based in Menlo Park or San Francisco, but can be flexible depending on the role.
Compensation: $250,000-$350,000 + equity
Visa sponsorship: Yes, we sponsor visas and will do everything we can to assist in this process with our legal support.
We’re building a team of the world’s best — the scientists, engineers, and problem-solvers who don’t just follow the frontier, they define it. If you’re driven to bring AI to life in the physical world and make discoveries that have never been made before, you belong here.
