Why do you charge job seekers to use EchoJobs?

We prioritize job seekers as our customers, unlike bigger job sites, by charging a small fee to provide them with curated access to the best companies and up-to-date jobs. This focus allows us to deliver a more personalized and effective job search experience.

How many software engineering jobs are on EchoJobs?

We've got over 200,000 jobs from 15,000+ vetted companies. No fake or sleazy jobs here!

So, where do the jobs come from?

We aggregate jobs from 15,000+ companies' career pages, so you can be sure that you're getting the most up-to-date and relevant jobs.

What makes EchoJobs different?

We're the only job board *for* software engineers, *by* software engineers… in case you needed a reminder! We add thousands of new jobs daily and offer powerful search filters just for you. 🛠️

How often are new jobs added?

Every single hour! We add 2,000-3,000 new jobs daily, so you'll always have fresh opportunities. 🚀

How fast can I find a job?

Typically, job searches take 3-6 months. EchoJobs helps you spend more time applying and less time hunting. 🎯

How often should I check EchoJobs?

Check daily! We're always updating with new jobs. Set up job alerts for even quicker access. 📅

Description

AI Diagnostics & Observability Engineer

Department: Engineering

Location: HQ

Compensation: $150K

Employment Type: Full-time

Role Overview

Own and build the full diagnostic, observability, and RCA infrastructure that makes Sage Care’s AI assistant trustworthy and debuggable—in real time and post-call. This engineer builds the visibility layer across telephony, transcription, reasoning, SOP traversal, and tool-calling; creates dashboards for both engineers and live human supervisors; and implements automated triage + notification pipelines that surface issues to the right module owners immediately.

This role sits at the intersection of LLM orchestration, voice pipelines, transcription, SOP engines, and operations, serving as the connective tissue across the stack. Your work enables rapid root-cause analysis, real-time intervention, and continuous improvement of our clinical AI assistants.

Key Responsibilities

Root Cause Analysis, Tracing & Observability

Build automated RCA pipelines to detect and classify failure modes:
- Hallucinations
- Misrouted intents
- Leaked/invalid tool calls (Transfer, SayMessage, Hangup, NOOP)
- Unrecoverable SOP loops
- Broken state transitions
- Telephony dropouts / DTMF issues
Implement event tracing infrastructure capturing every agentic decision across LLM, telephony, and SOP execution.
Compare expected vs. actual SOP behavior using protocol-driven expectations or human-labeled ground truth.
Automatically compute performance, safety, reliability, and coverage metrics.
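
As an illustration of the expected-versus-actual SOP comparison described above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the state names, the `find_divergence` and `has_loop` helpers, and the `max_visits` threshold are hypothetical and do not describe Sage Care's actual tooling.

```python
from collections import Counter
from typing import List, Optional


def find_divergence(expected: List[str], actual: List[str]) -> Optional[int]:
    """Return the index of the first step where the actual path leaves
    the expected path, or None if the two paths are identical."""
    for index, (want, got) in enumerate(zip(expected, actual)):
        if want != got:
            return index
    # One path is a strict prefix of the other: they diverge where the shorter ends.
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    return None


def has_loop(actual: List[str], max_visits: int = 3) -> bool:
    """Flag a possible unrecoverable loop: some state visited more than max_visits times."""
    return any(count > max_visits for count in Counter(actual).values())


# Hypothetical SOP state names, for illustration only.
expected_path = ["greet", "verify_identity", "collect_reason", "schedule", "hangup"]
actual_path = ["greet", "verify_identity", "collect_reason",
               "collect_reason", "collect_reason", "collect_reason"]

print(find_divergence(expected_path, actual_path))  # 3
print(has_loop(actual_path))                        # True
```

A real pipeline would likely compare against a state graph rather than a single linear path, since most SOPs branch; the linear version is only the simplest case.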

Diagnostic Dashboards & Visualization

Build live and post-call dashboards that visualize:
- Full call timeline
- SOP/state machine traversal
- Agent reasoning traces
- Tool invocation history
- Divergence from expected behavior
Design interactive visualizations: heatmaps, decision-path overlays, branching comparisons, and error hotspots.
Build triage dashboards for engineering and operations teams to rapidly understand system health.

Integration with Core AI Modules

Voice + Telephony Integration
- Trace call-level events (dropouts, retries, audio playback issues).
- Detect DTMF misfires and incorrect action routing.
Transcriber Module Integration
- Analyze turn segmentation, word-error-rate drift, boosting performance, and latency.
- Visualize errors in context (audio, transcript, aligned timecodes).
LLM Orchestration Integration
- Audit intent classification accuracy and subgraph routing.
- Trace reasoning sequences, missing tool calls, redundant tool calls, or invalid arguments.
- Validate tool call correctness (maps, SMS, search, internal SOP tools).

Live Monitoring & Human-in-the-Loop

Architect a live SOP state-machine tracer with:
- Real-time transcript overlays
- Current state + next expected state
- Deviation alerts
Build dashboards to monitor 10–15 concurrent calls, highlighting sessions with:
- Loops
- Latency spikes
- Failed tool calls
- Repeated incorrect decisions
Provide human specialists with escalation alerts and context.

Command & Control Interface

Build an intervention console for on-call specialists, enabling:
- “Skip step”
- “Say apology”
- “Escalate to human”
- “Send SMS”
- “Repeat last message”
- Override of SOP steps while maintaining auditability and continuity.

This system must blend seamlessly into existing agent workflows without breaking call integrity.

Failure Classification, Clustering & Pattern Detection

Build clustering systems (via embeddings or metadata) to group systemic failure modes:
- Intent misroutes under noisy audio
- Repeated missing tool calls
- Looped state machine traversal
- Hallucinated follow-ups or invalid summaries
Generate recurring-failure reports to guide engineering improvements.
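
The metadata-based variant of this grouping can be sketched in a few lines of Python. This is a hedged illustration, not the team's method: the event fields (`failure_type`, `condition`), the `recurring_failures` helper, and the `min_count` cutoff are all hypothetical, and an embedding-based approach would replace the exact-match key with a similarity measure.

```python
from collections import Counter
from typing import Dict, List, Tuple


def recurring_failures(
    events: List[Dict[str, str]], min_count: int = 2
) -> List[Tuple[Tuple[str, str], int]]:
    """Group failure events by (failure_type, condition) and keep the groups
    that occur at least min_count times, most frequent first."""
    counts = Counter((e["failure_type"], e["condition"]) for e in events)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


# Hypothetical failure events, for illustration only.
events = [
    {"call_id": "c1", "failure_type": "intent_misroute", "condition": "noisy_audio"},
    {"call_id": "c2", "failure_type": "intent_misroute", "condition": "noisy_audio"},
    {"call_id": "c3", "failure_type": "missing_tool_call", "condition": "clean_audio"},
    {"call_id": "c4", "failure_type": "intent_misroute", "condition": "noisy_audio"},
]

report = recurring_failures(events)
print(report)  # [(('intent_misroute', 'noisy_audio'), 3)]
```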

Auto-Triaging & Notification System (NEW)

Design and implement an automated triage and notification system that:

Detects failure category and severity in real time.
Routes incidents to the correct module owners:
- Telephony
- Transcription
- LLM orchestration
- SOP/decision-tree team
- Platform reliability
Sends structured payloads containing:
- Trace graphs
- Relevant logs
- Transcript segments
- SOP divergence snapshots
- Suggested RCA labels

Notifications may integrate with:
- PagerDuty
- Slack (rich message blocks)
- Jira auto-ticket creation
- Internal incident pipelines

This ensures rapid operational feedback loops and reduces time-to-resolution.
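
One way to picture the routing step is the Python sketch below. It is a sketch under stated assumptions only: the category names, the `OWNERS` table, the severity levels, and the channel-selection rule are hypothetical, and the code builds a plain dictionary rather than calling any real PagerDuty, Slack, or Jira API.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical mapping from failure category to owning team.
OWNERS: Dict[str, str] = {
    "telephony_dropout": "telephony",
    "dtmf_misfire": "telephony",
    "wer_drift": "transcription",
    "invalid_tool_call": "llm-orchestration",
    "hallucination": "llm-orchestration",
    "sop_loop": "sop-decision-tree",
}
DEFAULT_OWNER = "platform-reliability"  # fallback for unrecognized categories


@dataclass
class Incident:
    call_id: str
    category: str
    severity: str  # assumed levels: "low", "high", "critical"
    transcript_segment: str = ""
    suggested_rca_labels: List[str] = field(default_factory=list)


def build_notification(incident: Incident) -> dict:
    """Pick an owner and notification channels, and assemble a structured payload."""
    owner = OWNERS.get(incident.category, DEFAULT_OWNER)
    channels = ["slack"]  # every incident gets a chat message
    if incident.severity in ("high", "critical"):
        channels.append("jira")  # open a ticket for anything serious
    if incident.severity == "critical":
        channels.append("pagerduty")  # page only for critical incidents
    return {
        "owner": owner,
        "channels": channels,
        "call_id": incident.call_id,
        "category": incident.category,
        "severity": incident.severity,
        "transcript_segment": incident.transcript_segment,
        "suggested_rca_labels": incident.suggested_rca_labels,
    }


incident = Incident("call-42", "sop_loop", "critical",
                    transcript_segment="Could you repeat that?",
                    suggested_rca_labels=["looped_state"])
notification = build_notification(incident)
print(notification["owner"], notification["channels"])
```

Keeping the routing table as data rather than branching logic means a new failure category can be assigned an owner without touching the dispatch code.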

Post-Call RCA Pipelines & Analytics

Extend pipelines to automatically generate human-readable failure summaries with:
- Call-level trace graphs
- Tool call sequences
- Transcript context
- Classified failure types
- Suggested root causes
Store snapshots for operational handoff and debugging.

Required Qualifications

Strong backend engineer experienced with diagnostics, observability, and event-driven tracing.
Expert in Python, logging systems, real-time pipelines, and distributed debugging.
Deep familiarity with:
- LLM agents
- LangGraph or state-machine frameworks
- Tool-calling architectures
- Telemetry or tracing frameworks
Comfortable designing both:
- Backend data pipelines
- Frontend dashboards in React, D3, WebSockets, or equivalent.

Preferred Qualifications

Telephony/Voice: SIP, WebRTC, Twilio, audio streaming pipelines.
Clinical operations, call-center workflows, or mission-critical HITL supervision systems.
Observability stacks (Grafana, ELK, OpenTelemetry, Sentry).
Clustering/ML techniques for failure pattern discovery.

Sage Care

There are more than 50,000 engineering jobs:

Subscribe to membership and unlock all jobs

Engineering Jobs

60,000+ jobs from 4,500+ well-funded companies

Updated Daily

New jobs are added every day as companies post them

Refined Search

Use filters such as skill and location to narrow results

Become a member

🥳🥳🥳 452 happy customers and counting...

Overall, over 80% of customers chose to renew their subscriptions after the initial sign-up.

To try it out

For active job seekers

For those who are passively looking

Cancel anytime

