The San Francisco Voice Company

#Why Observability Tools Are Lying to You in Voice AI

#May 26, 2026

Your dashboards are green. Your callers are frustrated.

This is the central problem with applying conventional observability to voice AI. Infrastructure observability tools were designed to measure latency, error rates, and throughput. When a packet drops or a database times out, these metrics catch it. For APIs, they're a reasonable proxy for whether the system is working.

Voice AI isn't an API call with a return value. It's a conversation.

A conversation can succeed by every infrastructure metric (sub-200ms response times, zero timeouts, 100% uptime) and still be a failure. The agent interrupted the caller three times. The tone turned defensive in turn four. The caller said "okay fine" and ended the call. Your dashboard showed nothing.

#What these tools measure vs. what matters

Infrastructure observability answers: "Is the system running?" It measures whether your components are responding and whether requests are completing.

Voice AI observability needs to answer: "Did the conversation work?" These are categorically different questions.

A call "worked" by infrastructure metrics if audio was transmitted, a response was generated, and the session closed cleanly. A call worked for your users if they got what they needed, felt heard, and didn't hang up frustrated. There is no infrastructure metric that captures the distance between those two outcomes.

#The deeper problem

The issue isn't that infrastructure observability is bad at what it does. It's that transcript-and-metric approaches apply the wrong model to voice AI data.

Most teams instrument voice AI the same way they'd instrument a REST endpoint: log the inputs, log the outputs, measure the latency. If the LLM returned a string and the session didn't error, the system is "healthy." This misses most of what decides whether the call went well: how things were said, not just what was said.
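To make the gap concrete, here is a minimal sketch of that REST-style health check. The `CallLog` fields, the `is_healthy` function, and the 200 ms latency budget are hypothetical names and thresholds chosen for illustration, not any particular tool's API.

```python
from dataclasses import dataclass


@dataclass
class CallLog:
    """What REST-style instrumentation typically records (hypothetical fields)."""
    response_latencies_ms: list[float]  # one entry per agent response
    error_count: int                    # timeouts, failed requests, dropped sessions
    session_closed_cleanly: bool
    transcript: list[str]               # text only; no timing or acoustic detail


def is_healthy(call: CallLog, latency_budget_ms: float = 200.0) -> bool:
    """Infrastructure-style check: fast responses, no errors, clean close."""
    return (
        all(ms < latency_budget_ms for ms in call.response_latencies_ms)
        and call.error_count == 0
        and call.session_closed_cleanly
    )


# A call where the caller gave up, as seen by the infrastructure layer.
frustrated_call = CallLog(
    response_latencies_ms=[140.0, 165.0, 152.0, 171.0],
    error_count=0,
    session_closed_cleanly=True,
    transcript=["I need to change my booking.", "I already told you that.", "okay fine"],
)

print(is_healthy(frustrated_call))  # True
```

The check passes because none of the recorded fields can represent an interruption or a change in tone. Under these assumptions, the failure is invisible by construction rather than by misconfiguration.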

Speech carries information that text doesn't. Pace. Pauses. Tone shifts. The question asked politely versus the same question asked through clenched teeth. A caller who says "great, that's very helpful" after a frustrating hold sequence is rarely satisfied. The words say one thing; the acoustics say another. Transcript-based tooling only sees the words.

#What you actually need to index

Voice AI observability needs to operate at the acoustic level, across five dimensions:

  • Tone: sentiment, irony, sarcasm. Whether the words match the meaning.
  • Prosody: pace, pauses, emphasis. The timing and stress of how something was said within a turn.
  • Tension: frustration, escalation. Detectable within a turn and across turns.
  • Rhythm: interruptions, overlaps, cadence. Whether the turn-taking between caller and agent was flowing or broken.
  • Intent: what the caller actually needed, which may differ from what they literally asked.

These aren't features on top of observability. They are the observability layer for voice AI.

When a call fails, the signal is almost never in the transcript alone. It's in the prosody of turn three, the tension escalation between turns five and seven, the rhythm disruption where the agent cut the caller off. These are indexable, queryable, and detectable at scale — but only if your tooling knows to look for them.
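As a rough sketch of what "indexable and queryable" could look like, the example below stores one record per turn and runs two queries: agent turns that start before the caller has finished, and calls where a tension score rises across the caller's turns. The `Turn` schema, the 0-to-1 `tension` score, and the `min_rise` threshold are assumptions made for illustration. The scores are assumed to come from an upstream acoustic model that is not shown here.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One indexed turn of a call (hypothetical schema)."""
    index: int
    speaker: str    # "caller" or "agent"
    start_s: float  # seconds from call start
    end_s: float
    tension: float  # assumed 0..1 score from an upstream acoustic model


def agent_interruptions(turns: list[Turn]) -> list[int]:
    """Indexes of agent turns that begin before the preceding caller turn ends.

    Assumes `turns` is sorted by start time.
    """
    hits = []
    for prev, cur in zip(turns, turns[1:]):
        if prev.speaker == "caller" and cur.speaker == "agent" and cur.start_s < prev.end_s:
            hits.append(cur.index)
    return hits


def tension_escalated(turns: list[Turn], min_rise: float = 0.3) -> bool:
    """True if caller tension rises by at least `min_rise` from first to last caller turn."""
    caller = [t.tension for t in turns if t.speaker == "caller"]
    return len(caller) >= 2 and caller[-1] - caller[0] >= min_rise


call = [
    Turn(0, "caller", 0.0, 4.0, 0.2),
    Turn(1, "agent", 3.5, 8.0, 0.0),    # starts 0.5 s before the caller finishes
    Turn(2, "caller", 8.2, 12.0, 0.5),
    Turn(3, "agent", 12.3, 15.0, 0.0),
    Turn(4, "caller", 15.2, 16.0, 0.8),
]

print(agent_interruptions(call))  # [1]
print(tension_escalated(call))    # True
```

Neither query reads the transcript. Both depend only on timing and acoustic scores, which is the kind of data that text-only logging discards.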

#The consequence of getting this wrong

Teams running voice AI at scale without acoustic observability are in a strange position: they know something is wrong — customers are unhappy, conversion is low, escalations are high — but they can't find it. They review calls manually. They build crude keyword searches. They look at average handle time as a proxy for call quality.

This is expensive and slow, and it doesn't scale. Every new call is another needle in a growing haystack.

The observability gap is the reason voice AI teams spend weeks debugging what should take hours. The infrastructure is fine. The dashboards are lying.