Search Atlas

Platform Reliability Engineer, Agentic AI

Remote Medellin, CO
Kubernetes, EKS, GKE, Terraform, ArgoCD, GitOps, Python, OpenTelemetry, Prometheus, Grafana, Karpenter, KEDA, LLM
Description

The Mission: Building the Autonomous Nervous System

Search Atlas is moving beyond suggestions to full execution.

Our agent, Atlas Brain, handles SEO, AEO, Google Ads, and AI Content Generation autonomously—zero manual intervention.

While Platform Engineers build self-service tools for developers, you ensure those tools enable autonomous AI execution with 99.99% reliability. You're not keeping dashboards alive; you're building the engine that allows an AI Agent to replace manual marketing execution. If the platform is reliable, the agent is unstoppable.



What You Will Do:

Architect the Autonomous Backbone

Design and maintain the Kubernetes-based platform (EKS/GKE) that hosts Atlas Brain and its distributed agentic workers—handling millions of requests across SEO crawling, content generation, and ad optimization pipelines.

Engineer for Zero-Touch

Automate every aspect of infrastructure using Terraform, ArgoCD, and Go/Python. If you have to do it twice, it must be a script. Enable true "zero manual execution" at the infrastructure level.

Scale Agentic Workflows

  • Optimize ML inference pipelines for real-time agent decision-making

  • Architect high-concurrency crawling systems that feed Atlas Brain's intelligence

  • Ensure sub-second latency for agent task execution (SEO, Content, AI Builder)

  • Handle high-frequency data pipelines: real-time bidding, SERP monitoring, content generation at scale

Define Radical Reliability for AI

Establish SLOs/SLIs specifically for AI execution success rates and agent task completion, not just "uptime." Design self-healing systems that resolve failures before they impact autonomous workflows.

Observability for Agent Decisions

Build distributed tracing and monitoring for complex agentic interactions—trace agent decision trees across SEO/AEO/Ads workflows, enabling rapid diagnosis of "why the agent made that choice." Implement OpenTelemetry, Prometheus, and Grafana for full visibility into autonomous execution.

Safety & Guardrails

Implement guardrails and safety controls for autonomous agent execution in marketing contexts—ensuring AI actions align with business rules, budget constraints, and compliance requirements. Design human-in-the-loop escalation paths for edge cases.

Cost & Performance Governance

Proactively optimize cloud spend and resource allocation (Karpenter/KEDA) as we scale to thousands of agencies. Balance performance with cost efficiency for unpredictable AI workloads.


Technical Requirements

Experience: 6+ years in Platform Engineering, SRE, or Infrastructure roles within high-growth SaaS environments—with proven experience supporting AI/ML systems at scale.

Infrastructure as Code: Mastery of Terraform, ArgoCD, and GitOps workflows.

Container Orchestration: Expert-level Kubernetes (EKS/GKE) networking, scaling, security, and multi-tenancy patterns.

MLOps for Agents (Must-Have):

  • Hands-on experience with MLOps pipelines for autonomous agents

  • Model versioning and deployment strategies for continuous agent improvement

  • Prompt management and A/B testing of agent behaviors

  • Guardrails for safe tool execution and decision boundaries

  • Scaling AI inference services (LLMs, embeddings, classification models)

Languages: Proficiency in Python for building custom platform tools and automation.

Observability: Deep expertise in distributed tracing and monitoring for complex, event-driven systems—specifically for debugging AI agent decision chains.

Data-Intensive Systems: Experience with high-frequency data pipelines, web crawling at scale, real-time processing, and low-latency requirements.



Why This Is Different

Unlike traditional SRE roles focused on keeping services up, you're building the infrastructure that enables autonomous AI to execute business-critical marketing tasks. Every millisecond of latency you eliminate, every self-healing mechanism you deploy, directly impacts whether Atlas Brain can truly replace manual agency work.

This is not traditional SRE—you're building the autonomous nervous system for AI execution.



What Success Looks Like

  • Atlas Brain executes millions of marketing tasks daily with <0.1% failure rate

  • Zero infrastructure-related incidents requiring manual intervention during business hours

  • Platform scales from hundreds to thousands of agency clients without reliability degradation

  • Complete observability into agent behavior: "We know not just that the agent acted, but why"


Ready to build the platform that makes autonomous marketing execution a reality?
