ML Ops Engineer
Team: Engineering
Location: Vauxhall, London
Commitment: Full Time
Workplace Type: Onsite
Key Responsibilities
- Own and extend Circadia’s ML pipeline orchestration using Apache Airflow, including training, evaluation, and deployment workflows.
- Build and maintain automated pipelines for model retraining, validation, and promotion across development, staging, and production environments.
- Implement pipeline monitoring, alerting, and failure recovery to eliminate silent failures and ensure operational reliability.
- Design pipeline architectures that support rapid experimentation while enforcing production-grade reproducibility.
- Deploy and manage ML models on AWS infrastructure (e.g. AWS Batch for batch inference workloads).
- Support deployment of models to edge devices, including Circadia’s clinical monitoring hardware, working with firmware and embedded engineering teams as needed.
- Manage model versioning, promotion, and rollback workflows through the MLflow model registry.
- Evaluate and implement strategies for safe model rollouts (e.g. shadow deployments, canary releases) as the platform matures.
- Maintain and improve the MLflow-based experiment tracking and model registry infrastructure.
- Establish conventions for experiment logging, artifact storage, model metadata, and lineage tracking.
- Enable ML engineers to move from experimentation to production deployment with minimal friction.
- Implement and maintain training data versioning and dataset management practices to ensure reproducibility of model training runs.
- Track dataset lineage, labelling provenance, and feature dependencies alongside model versions.
- Collaborate with ML engineers and data engineers to formalise dataset release and validation workflows.
- Build monitoring systems for model performance in production, including data drift detection, prediction quality tracking, and alerting on degradation.
- Implement operational dashboards for pipeline health, compute utilisation, and deployment status.
- Collaborate with data engineering to ensure upstream data quality and pipeline reliability for ML feature inputs.
- Develop incident response procedures and runbooks for ML system failures.
- Manage and optimise AWS compute resources (Batch, EC2, or similar) used for model training and inference.
- Design infrastructure-as-code solutions for reproducible ML environments.
- Drive cost optimisation across ML compute, storage, and data transfer.
- Support Snowflake integrations for feature generation and training data pipelines.
- Introduce and champion ML engineering best practices including CI/CD for models, automated testing for ML pipelines, and reproducible training workflows.
- Build internal tooling and templates that accelerate the ML development-to-production cycle.
- Document operational processes, architecture decisions, and onboarding materials for the ML platform.
- Participate in architecture discussions and technical planning to ensure ML systems scale with Circadia’s growth.
- Ensure all ML pipelines and infrastructure meet healthcare security and privacy requirements, including HIPAA and SOC 2.
- Apply best practices for handling Protected Health Information (PHI) in training data, model artifacts, and inference outputs.
- Maintain audit trails for model decisions, data access, and deployment history.
Required Qualifications
- 4+ years of experience in MLOps, ML Engineering, DevOps, or a closely related infrastructure role.
- Strong proficiency in Python for ML pipeline development, tooling, and automation.
- Hands-on experience with ML pipeline orchestration tools, particularly Apache Airflow.
- Experience with model registries and experiment tracking platforms (MLflow preferred).
- Experience deploying and operating ML workloads on AWS (Batch, EC2, S3, IAM, CloudWatch).
- Solid understanding of the ML lifecycle: training, evaluation, deployment, monitoring, and retraining.
- Experience with containerisation (Docker) and infrastructure-as-code.
- Proficiency with Git and version control workflows.
- Familiarity with SQL and data warehousing platforms (Snowflake preferred).
- Experience implementing monitoring, logging, and alerting for production systems.
- Strong debugging and incident response skills for complex distributed systems.
Preferred Qualifications
- Experience deploying models to edge or embedded devices.
- Background in healthcare, medical devices, or clinical data systems.
- Familiarity with model serving frameworks (e.g., TorchServe, TF Serving, Triton, or custom solutions).
- Experience with CI/CD systems for ML (e.g., GitHub Actions, Jenkins, or similar).
- Experience with data versioning tools (e.g., DVC, LakeFS, or similar).
- Experience supporting data science or ML research teams in a production context.
- Exposure to HIPAA compliance and healthcare security best practices.
- Experience with distributed compute frameworks (e.g. Apache Spark, Dask) for large-scale data processing.
- Experience with streaming or real-time inference architectures.
What You Bring
- You take ownership of ML infrastructure end-to-end — from training pipelines to production monitoring.
- You care deeply about reliability, reproducibility, and operational excellence in ML systems.
- You have strong opinions (loosely held) on how to build a great ML platform, and you’re eager to put them into practice.
- You are comfortable working in a startup environment where you’ll wear multiple hats and move fast.
- You communicate clearly across engineering, data science, and clinical teams.
- You’re motivated by building technology that directly improves patient care.