Senior AI Infrastructure Engineer (LLMOps/MLOps)
Location: San Jose, CA
Department: AI Research
As a Senior AI Infrastructure Engineer, you will own the design, deployment, and scaling of our AI infrastructure and production pipelines. You’ll bridge the gap between our AI research team and engineering organization, enabling the deployment of advanced LLM and ML models into secure, high-performance production systems.
You will build APIs, automate workflows, optimize GPU clusters, and ensure our models perform reliably in real-world cybersecurity applications. This role is ideal for someone who thrives in a startup environment — hands-on, cross-functional, and driven to build world-class AI systems from the ground up.
Key Responsibilities:
Core (Mission-Critical)
- Own and manage the AI infrastructure stack — GPU clusters, vector databases, and model serving frameworks (vLLM, Triton, Ray, or similar).
- Productionize LLMs and ML models developed by the AI team, deploying them into secure, monitored, and scalable environments.
- Design and maintain REST/gRPC APIs for inference and automation, integrating tightly with the core cybersecurity platform.
- Collaborate closely with AI scientists, backend engineers, and DevOps to streamline deployment workflows and ensure production reliability.
Infrastructure & Reliability
- Build and maintain infrastructure-as-code (IaC) setups using Terraform or Pulumi for reproducible environments.
- Implement observability and monitoring — latency, throughput, model drift, and uptime dashboards with Prometheus / Grafana / OpenTelemetry.
- Automate CI/CD pipelines for model training, validation, and deployment using GitHub Actions, ArgoCD, or similar tools.
- Architect scalable, hybrid AI systems across on-prem and cloud, enabling cost-effective compute scaling and fault tolerance.
Security, Data, and Performance
- Enforce data privacy and compliance across AI pipelines (SOC 2, encryption, access control, VPC isolation).
- Manage data and model artifacts, including versioning, lineage tracking, and storage for models, checkpoints, and embeddings.
- Optimize inference latency, GPU utilization, and throughput, using batching, caching, or quantization techniques.
- Build fallback and failover mechanisms to maintain service reliability in case of model or API failure.
Innovation & Leadership
- Research and integrate emerging LLMOps and MLOps tools (e.g., LangGraph, Vertex AI, Ollama, Triton, Hugging Face TGI).
- Create sandbox environments for AI researchers to experiment safely.
- Lead cost optimization and capacity planning, forecasting GPU and cloud needs.
- Document and maintain runbooks, architecture diagrams, and standard operating procedures.
- Mentor junior engineers and contribute to a culture of operational excellence and continuous improvement.
Qualifications:
Required
- 5+ years of experience in ML Infrastructure, MLOps, or AI Platform Engineering.
- Proven expertise with LLM serving, distributed systems, and GPU orchestration (e.g., Kubernetes, Ray, or vLLM).
- Strong programming skills in Python and experience building APIs (FastAPI, Flask, gRPC).
- Proficiency with cloud platforms (Azure, AWS, or GCP) and IaC tools (Terraform, Pulumi).
- Solid understanding of CI/CD, containerization (Docker), and model registry practices.
- Experience implementing observability, monitoring, and fault-tolerant deployments.
Preferred
- Familiarity with vector databases (FAISS, Pinecone, Weaviate, Qdrant).
- Exposure to security or compliance-focused environments.
- Experience with PyTorch / TensorFlow and MLflow / Weights & Biases.
- Knowledge of distributed training or large-scale inference optimization (DeepSpeed, TensorRT, quantization).
- Prior work at startups or fast-paced R&D-to-production environments.
About the Company
Kai is the AI company rebuilding cybersecurity for the machine-speed era. Founded by second-time founders and trusted by Fortune 500 enterprises, Kai is building a future where security has no categories, no silos, and no human-speed bottlenecks. The Kai Agentic Platform replaces fragmented, human-limited workflows with agentic AI systems that continuously contextualize, assess, reason, and execute security work at the speed of thought, making human defenders superhuman.
Why Kai?
- $125M in Funding: We are well-funded and have the resources to innovate and scale rapidly.
- Proven Early Success with Fortune 500 Customers: We have started partnering with Fortune 500 companies, marking early success and growing trust in our innovative solutions. This highlights the immense potential and reliability of our AI-powered cybersecurity offerings.
- Experienced Leadership: Our founding team consists of second- and third-time entrepreneurs, each with over 25 years of experience in the cybersecurity industry. Their proven expertise and vision drive our ambitious goals, positioning us to lead in AI-powered cybersecurity.
- World-Class Leadership Team: Our Heads of AI, Engineering, and Product bring extensive experience from some of the world’s most influential companies, ensuring top-tier mentorship, direction, and vision.
- Cutting-Edge AI Solutions: Our team leverages the most advanced AI technologies, including Large Language Models (LLMs) and Generative AI.
- Generous Compensation: We offer highly competitive salaries, equity options, and a supportive work environment. Your contributions will be valued and rewarded as we grow together.
- Cybersecurity Knowledge Preferred but Not Required: While experience in cybersecurity is a plus, we are primarily seeking top-tier talent in microservices architecture, software development, and/or DevOps who are passionate about solving complex problems.
