Part III: The Hard Parts

Tackle the problems that make voice agents feel real: interruption, echo, and the distributed state machine governing a full turn.

Why Voice Agents Break: The Gap Between Demo and Reality

Every voice agent demo looks the same: the agent speaks, falls silent, the user responds, the agent processes, speaks again. Clean. Sequential. Satisfying. Then you ship it to real users and within the first hour someone talks over the agent mid-sentence, the agent keeps speaking, a feedback loop causes it to respond to its own voice, and two responses arrive in rapid succession for a single short question. The gap between that demo and a working product is not a gap of polish — it is a gap of architecture. The problems that surface are not edge cases to handle later; they are unavoidable consequences of assumptions baked into the nominal pipeline that simply do not hold in practice.

This lesson names those assumptions, explains why they break, and frames the three core hard parts — interruption, acoustic echo, and distributed state — as engineering challenges with specific mechanical causes.

The Half-Duplex Assumption That Hides in Plain Sight

Most voice pipelines are built around an implicit model that mirrors how push-to-talk radio works: one party transmits while the other listens, then roles switch. In software this shows up as a sequential state machine:

[LISTENING] → [PROCESSING] → [SPEAKING] → [LISTENING] → ...

The pipeline waits for end-of-speech, sends the utterance to the LLM, streams the response to TTS, plays the audio, and only then re-enables the microphone. This is a half-duplex design: each direction of communication is exclusive. It is easy to implement, easy to demo, and wrong for real users.
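
To make the consequence concrete, here is a minimal sketch of that sequential state machine. The class and method names (`HalfDuplexPipeline`, `on_audio_frame`, `advance`) are hypothetical and exist only for illustration; the point is that any audio arriving outside the LISTENING phase has nowhere to go.

```python
from enum import Enum, auto


class Phase(Enum):
    LISTENING = auto()
    PROCESSING = auto()
    SPEAKING = auto()


class HalfDuplexPipeline:
    """Toy model of the sequential pipeline: audio is accepted only while LISTENING."""

    ORDER = [Phase.LISTENING, Phase.PROCESSING, Phase.SPEAKING]

    def __init__(self) -> None:
        self.phase = Phase.LISTENING
        self.captured = []   # frames that reached the recognizer
        self.dropped = 0     # frames that arrived while the mic was gated

    def on_audio_frame(self, frame: bytes) -> None:
        if self.phase is Phase.LISTENING:
            self.captured.append(frame)
        else:
            # The user is talking over the agent, but nothing is listening.
            self.dropped += 1

    def advance(self) -> None:
        """Step through the fixed LISTENING -> PROCESSING -> SPEAKING cycle."""
        index = self.ORDER.index(self.phase)
        self.phase = self.ORDER[(index + 1) % len(self.ORDER)]


pipeline = HalfDuplexPipeline()
pipeline.on_audio_frame(b"user question")     # captured
pipeline.advance()                             # now PROCESSING
pipeline.advance()                             # now SPEAKING
pipeline.on_audio_frame(b"yeah, I got that")   # silently dropped
```

Nothing in this design is buggy in the usual sense. The dropped frame is a direct result of the state machine's shape.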

Real conversation is full-duplex. Humans interrupt constantly — not as a pathological behavior but as a natural conversational signal. A user saying "yeah, no, I got that part" halfway through the agent's explanation is not misbehaving; they are communicating efficiently.

When a half-duplex pipeline encounters a mid-speech user utterance, one of several bad things happens:

  • 🔇 The microphone is disabled during TTS playback, so the user's words are simply never captured.
  • 🔄 The microphone is enabled but the input is ignored, because the logic that gates LLM calls waits for an end-of-speech signal that fires only after the current playback state clears.
  • 📨 The user's words are captured and queued, but processed after the agent finishes — meaning the agent answers the original question AND the interruption sequentially, producing a disjointed exchange.

The fix is not simply "enable the microphone all the time." Enabling the mic during playback without addressing acoustic echo introduces a different class of failure, and enabling interruption without connecting it to a mechanism that halts playback means the signal is received but nothing changes. These dependencies are why the three hard parts are genuinely entangled rather than independent checkboxes.

💡 Mental Model: Think of the half-duplex pipeline as a revolving door with a single compartment. Only one person can pass at a time, and the door must complete its cycle before the next person can enter. Full-duplex conversation is a standard double door — both directions are always passable, but now you need rules for when people walk into each other. The rules are harder to write, but the door is the right shape for the traffic.

Acoustic Echo: The Agent Listening to Itself

Suppose you fix the half-duplex problem. The microphone is now open during playback. You immediately discover the next problem: the microphone picks up the agent's own speaker output.

This is acoustic echo — not a software artifact but a physical fact. Sound from the speaker travels through the air, reflects off walls and surfaces, and arrives at the microphone a few milliseconds later. The agent's TTS output is the "far-end" signal, and the microphone captures it as if it were user speech.

The consequences are specific and severe. VAD measures signal energy and spectral characteristics to determine whether a human is speaking. The agent's TTS output scores well on exactly these features: it sounds like speech because it is speech. A VAD fed a microphone signal from which the echo has not been removed will happily declare that the user is speaking during the agent's response. Worse, the ASR engine transcribes the agent's output and feeds it back to the LLM, which then responds to its own previous statement:

         Agent TTS output: "Your appointment is scheduled for Thursday."
                  |
                  ▼
         [Speaker emits audio]
                  |
               (acoustic path)
                  |
                  ▼
         [Microphone captures echo]
                  |
                  ▼
         VAD: speech detected
                  |
                  ▼
         ASR output: "your appointment is scheduled for thursday"
                  |
                  ▼
         LLM input: [user] "your appointment is scheduled for thursday"
                  |
                  ▼
         LLM output: "Yes, that's correct! Is there anything else...?"

The standard mitigation is Acoustic Echo Cancellation (AEC), which uses the loudspeaker signal as a reference to subtract the estimated echo contribution from the microphone signal. What matters here is the framing: AEC is not an enhancement that improves voice quality, it is a prerequisite for correct behavior in a full-duplex pipeline. Its absence does not degrade quality; it makes the system functionally broken.

Three Concurrent Processes, Zero Shared State

Even with full-duplex input and acoustic echo addressed, a structural problem remains: a single conversational turn is not one process, it is at least three, running concurrently.

  STT Process          LLM Process          TTS/Playback Process
  ──────────────       ──────────────       ──────────────────────
  Streaming audio  →   Streaming tokens →   Queued audio chunks
  Partial results       being generated       being played
  Finalization          Still running?        Still playing?
  event                 Cancelled?            Flushed?

At any given moment, each of these subsystems holds a local view of where the conversation is. None of them automatically knows what the others know. This lack of shared state is the root cause of most subtle voice agent bugs.

Consider a concrete scenario: a user says "actually, stop" halfway through a long agent response. The VAD fires, the STT transcribes the interruption, and the result is sent to the LLM. The LLM receives "actually, stop" and correctly decides to halt: it stops generating tokens. But the TTS engine handed 800ms of audio to the playback buffer 400ms ago, and that audio is already committed. The agent keeps speaking for roughly another 400ms after the LLM has processed the interruption. The interruption was handled correctly by two of the three subsystems; the third, playback, never received the signal because no code path connected the LLM's decision to halt with the playback buffer's flush operation.

The subsystems were not in agreement about a single canonical question: "Is the agent currently speaking?" In a voice agent, exactly one piece of code should be the authoritative answer to that question — and every subsystem should query or update that single source of truth rather than maintaining its own local assumption.

Latency Budgets and the Cost of Recovery Paths

Adding interruption support and recovery paths to a voice pipeline dramatically tightens the latency budget. In the nominal flow the latency components are relatively tractable:

  Nominal latency path:
  ┌─────────────────────────────────────────────────────────────┐
  │  VAD end-of-speech delay  →  STT finalization               │
  │  →  LLM first token        →  TTS first audio chunk         │
  │  →  Playback buffer fill   →  First audible word            │
  └─────────────────────────────────────────────────────────────┘

Once interruption and recovery are added, new latency paths appear. Interruption detection latency must be short enough that the agent feels like it stopped in response to the user — typically under 300ms of perceived delay, a threshold that is difficult to hit when AEC processing, VAD windowing, and network round trips are all in the chain. Playback halt latency adds more: a 200ms audio buffer means the agent will speak for at least 200ms after the interrupt signal is processed, no matter how fast the software responds. Then the recovery path — flushing partial state, discarding in-flight TTS chunks, resetting LLM context — must complete before the new user utterance can be processed.

A pipeline with a 200ms playback buffer, a 150ms VAD window, and a 50ms coordinator dispatch delay has a minimum perceivable interruption response of roughly 400ms from the moment the user starts speaking to the moment audio stops — before the recovery path begins. Interruption and recovery are not additions to a completed pipeline; they are constraints that should shape the architecture from the beginning.
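
The arithmetic behind that figure can be written down directly. This is a back-of-the-envelope sketch that assumes the three delays are strictly sequential and ignores AEC processing and network time; the function name is hypothetical.

```python
def interruption_response_ms(playback_buffer_ms: float,
                             vad_window_ms: float,
                             dispatch_ms: float) -> float:
    """Time from user speech onset to the speaker going quiet.

    Assumes the VAD needs one full window to fire, the coordinator then
    dispatches the halt, and audio already in the playback buffer drains.
    """
    return vad_window_ms + dispatch_ms + playback_buffer_ms


budget = interruption_response_ms(playback_buffer_ms=200,
                                  vad_window_ms=150,
                                  dispatch_ms=50)
print(budget)  # 400

# Shrinking only the playback buffer shows which term dominates.
smaller_buffer = interruption_response_ms(playback_buffer_ms=60,
                                          vad_window_ms=150,
                                          dispatch_ms=50)
print(smaller_buffer)  # 260
```

Because the buffer term is fixed by the audio configuration rather than by code speed, it has to be chosen with the interruption budget in mind.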

📋 The Three Hard Parts at a Glance

| 🔴 Problem | ⚙️ Mechanism | ⚠️ Naive Fix That Fails |
|---|---|---|
| 🎙️ Interruption: half-duplex pipeline drops or delays user speech during agent output | Mic disabled or gated during playback; interrupt signal not connected to playback halt | Enable mic continuously without AEC → echo feedback |
| 🔊 Acoustic Echo: agent's TTS output captured by mic, transcribed, fed to LLM | Physical echo path from speaker to mic; VAD and ASR have no reference signal | Silence detection → fails on reverberation tail |
| 🔄 Distributed State: STT, LLM, and TTS/playback each hold inconsistent views of turn progress | Async events arrive out of order; no single owner of "is agent speaking?" | Timeouts and sleep calls → masks race conditions, does not resolve them |

The sections that follow treat each problem with the depth it requires. The goal here is not to solve the hard parts — it is to make clear that they are hard for mechanical reasons, not incidental ones. That distinction matters because it determines where you look for the fix.


The Acoustic Environment as an Adversary

Before a single word reaches your speech recognizer, the audio signal has already been shaped — and often damaged — by the physical environment it traveled through. Every downstream component — VAD, ASR, barge-in detection — assumes it is receiving reasonably clean near-end speech. When that assumption is violated, the failures are subtle, hard to reproduce, and deeply confusing to debug, because they manifest several layers above where they originate.

What Acoustic Echo Is and How AEC Addresses It

When your voice agent plays audio through a speaker, that audio travels through the air and is picked up by the microphone. Sound travels at roughly 343 meters per second at room temperature. In a typical room, the direct-path delay from speaker to microphone is roughly 1 to 6 milliseconds — but sound also bounces off walls, furniture, and ceilings, producing reverberation: a smeared, decaying copy of the signal that can persist for hundreds of milliseconds after the original sound stops.

Agent speaker output
        │
        ▼
 ┌─────────────┐   direct path (~1–6 ms delay)
 │   SPEAKER   │──────────────────────────────────────────┐
 └─────────────┘                                          │
        │                                                 ▼
        │  reflected paths (walls, surfaces)       ┌──────────┐
        └──────────────────────────────────────────▶    MIC   │
           (longer delays, attenuated, phase-shifted)└──────────┘
                                                          │
                                                          ▼
                                               Mixed signal:
                                               [near-end speech]
                                               + [echo of speaker]
                                               + [reverb tail]
                                               + [background noise]
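
The direct-path figure follows from distance divided by the speed of sound. The sketch below works that out for a few speaker-to-microphone distances and expresses the delay in samples; the 16 kHz sample rate is an assumption chosen for illustration.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # roughly, in air at room temperature


def direct_path_delay_ms(distance_m: float) -> float:
    """Propagation delay over the straight-line speaker-to-mic path."""
    return distance_m / SPEED_OF_SOUND_M_PER_S * 1000.0


def delay_in_samples(distance_m: float, sample_rate_hz: int = 16_000) -> int:
    """The same delay in audio samples, rounded to the nearest sample."""
    return round(distance_m / SPEED_OF_SOUND_M_PER_S * sample_rate_hz)


for distance in (0.3, 1.0, 2.0):
    print(f"{distance} m -> {direct_path_delay_ms(distance):.2f} ms, "
          f"{delay_in_samples(distance)} samples at 16 kHz")
```

Reflected paths are longer than the direct path, so their delays are longer too. This is why an echo canceller has to model far more than the first few milliseconds.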

Acoustic echo cancellation (AEC) addresses this by using the loudspeaker reference signal — the audio sent to the speaker — as a model of what the echo is likely to sound like. The core mechanism is an adaptive filter that maintains an estimate of the acoustic echo path, convolves the loudspeaker reference with that estimate to produce a synthetic echo, and subtracts it from the microphone signal:

Loudspeaker reference signal
        │
        ▼
 ┌──────────────────┐
 │  Adaptive Filter │  ← models the room's echo path
 │  H_est(z)        │
 └──────────────────┘
        │
        │  synthetic echo estimate
        ▼
 ┌──────────────────────────────────────┐
 │  SUBTRACTOR                          │
 │                                      │
 │  mic_signal − echo_estimate          │
 │        = near-end speech (ideally)   │
 └──────────────────────────────────────┘
        ▲
        │
  mic_signal (speech + real echo + noise)

The filter adapts continuously, updating its coefficients as the room's acoustic properties change. This is why AEC is described as adaptive rather than fixed — there is no single coefficient value you can hardcode.
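
One common way to build such an adaptive filter is normalized least mean squares (NLMS). The sketch below is a pure-Python illustration of the idea, not the method any particular AEC library uses. The echo path is a made-up 4-tap filter, the reference is white noise, and there is no near-end speech or background noise, all of which make convergence far easier than in a real room.

```python
import random


def nlms_echo_canceller(reference, mic, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `reference` from `mic`.

    Returns (residual, weights). Plain NLMS with no double-talk handling.
    """
    weights = [0.0] * taps
    history = [0.0] * taps            # most recent reference sample first
    residual = []
    for x, d in zip(reference, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - echo_estimate         # what is left after subtraction
        power = sum(h * h for h in history) + eps
        step = mu * e / power         # normalized step size
        weights = [w + step * h for w, h in zip(weights, history)]
        residual.append(e)
    return residual, weights


# Synthetic "room": a fixed 4-tap FIR echo path (an illustrative assumption;
# real echo paths are much longer and drift over time).
rng = random.Random(0)
room = [0.0, 0.6, 0.3, -0.1]
reference = [rng.gauss(0.0, 1.0) for _ in range(3000)]
mic = [
    sum(room[k] * reference[n - k] for k in range(len(room)) if n - k >= 0)
    for n in range(len(reference))
]

residual, weights = nlms_echo_canceller(reference, mic)
echo_energy = sum(v * v for v in mic[-500:])
residual_energy = sum(v * v for v in residual[-500:])
print(f"echo energy {echo_energy:.3f}, residual energy {residual_energy:.3e}")
```

After convergence the learned `weights` approximate `room`, which is the sense in which the filter "models the echo path". Production cancellers add double-talk handling, much longer filters, and nonlinear residual suppression on top of this core loop.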

🎯 Key Principle: The reference signal must reach the AEC module no later than the corresponding echo reaches it from the microphone. The adaptive filter is causal: it can only model echo that lags the reference, so a reference that shows up after its own echo cannot be cancelled. If the reference is delayed (for example, because it is captured post-processing from the audio output stack rather than pre-buffer), the AEC module is trying to cancel an echo using a misaligned reference. The subtraction will be wrong, and the echo will survive or be worsened. A common failure pattern occurs when developers route the loudspeaker reference through the OS audio mixer, which adds a variable buffering delay of tens of milliseconds. The adaptive filter then converges on a nonsensical echo path estimate, and echo suppression fails entirely, even though all the components appear to be configured correctly.
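
A quick way to check alignment offline is to cross-correlate a recorded reference against the recorded microphone signal and see where the peak lands. The sketch below is a brute-force version on synthetic data. It searches only non-negative lags, so a reference that arrives late relative to its echo would show up as a peak pinned at lag 0 or as no clear peak at all. The 37-sample delay is an arbitrary illustrative value.

```python
import random


def estimate_delay(reference, mic, max_lag):
    """Return the lag (in samples) at which `mic` best matches `reference`.

    Brute-force cross-correlation over lags 0..max_lag.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        score = sum(reference[n - lag] * mic[n] for n in range(lag, len(mic)))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


rng = random.Random(1)
reference = [rng.gauss(0.0, 1.0) for _ in range(2000)]
true_delay = 37   # samples; hypothetical speaker-to-mic delay
mic = [0.0] * true_delay + [0.5 * v for v in reference[:-true_delay]]

estimated = estimate_delay(reference, mic, max_lag=100)
print(estimated)  # 37
```

Running this over several recordings from the same session is informative. If the estimate moves by tens of milliseconds between runs, the reference tap is likely passing through a buffer with variable delay.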

Full-Duplex Means AEC Runs Continuously

A common simplification — and a consequential one — is to treat AEC as something you enable only while TTS audio is playing. This is wrong, and the reason goes back to reverberation.

When TTS playback ends, the room does not immediately go acoustically silent. The reverberation tail — the decaying reflections of the last few hundred milliseconds of audio — persists for a duration determined by the room's RT60 (the time it takes for the sound level to drop by 60 dB). In a typical furnished room, RT60 values of several hundred milliseconds are common. If you disable AEC the moment the TTS audio stream ends, the reverberant tail is still arriving at the microphone unfiltered. The result is garbled transcriptions at the start of a user's turn — precisely when you most need clean audio.

There is a second, subtler reason for continuous AEC. The adaptive filter needs convergence time. If you cold-start the AEC module at the beginning of each TTS playback, it takes tens to hundreds of milliseconds to converge on a useful echo path estimate. Running continuously allows the filter to stay adapted to the room's current acoustic state.

⚠️ Common Mistake: Gating AEC on a tts_playing boolean is one of the most common implementation errors. The correct gating condition, if you must gate at all, is based on whether reverberant energy could plausibly still be present — not on whether a playback flag is set. In practice, the safest approach is no gating at all.
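
If gating is unavoidable, the condition can at least account for the reverberation tail. The sketch below assumes the reverberant level decays linearly in dB (60 dB over RT60) and holds the gate open until a chosen amount of decay has elapsed. `EchoGate` and its methods are hypothetical names, and the RT60 value has to come from measurement or a conservative guess.

```python
from typing import Optional


def hold_time_ms(rt60_ms: float, target_decay_db: float = 60.0) -> float:
    """Time for reverberant energy to fall by target_decay_db,
    assuming a linear decay in dB of 60 dB per RT60."""
    return rt60_ms * target_decay_db / 60.0


class EchoGate:
    """Answers "could echo still be reaching the microphone?"."""

    def __init__(self, rt60_ms: float, target_decay_db: float = 60.0) -> None:
        self.hold_ms = hold_time_ms(rt60_ms, target_decay_db)
        self.playing = False
        self.playback_ended_at_ms: Optional[float] = None

    def on_playback_start(self) -> None:
        self.playing = True
        self.playback_ended_at_ms = None

    def on_playback_end(self, now_ms: float) -> None:
        self.playing = False
        self.playback_ended_at_ms = now_ms

    def echo_possible(self, now_ms: float) -> bool:
        if self.playing:
            return True
        if self.playback_ended_at_ms is None:
            return False          # nothing has been played yet
        # Keep treating the room as "echoing" until the tail has decayed.
        return now_ms - self.playback_ended_at_ms < self.hold_ms


gate = EchoGate(rt60_ms=400)
gate.on_playback_start()
gate.on_playback_end(now_ms=1000)
print(gate.echo_possible(now_ms=1100))  # True: 100 ms into a 400 ms tail
print(gate.echo_possible(now_ms=1500))  # False: tail has decayed
```

This addresses only the reverberation-tail problem. It does nothing for filter convergence, which is the second reason to leave AEC running continuously.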

Noise Suppression Is a Separate Stage

Noise suppression is a distinct processing stage from AEC, and conflating the two leads to misconfigured pipelines:

  • AEC removes correlated signal — audio that entered the microphone because the loudspeaker emitted it. There is a known reference.
  • Noise suppression removes uncorrelated interference: keyboard clicks, HVAC hum, street noise. There is no reference signal; the algorithm estimates the noise floor from the signal itself.

Raw microphone signal
        │
        ▼
 ┌─────────────────┐
 │  AEC            │  removes loudspeaker echo (reference-based)
 └─────────────────┘
        │
        ▼
 ┌─────────────────┐
 │  Noise          │  removes background noise (model-based)
 │  Suppression    │
 └─────────────────┘
        │
        ▼
  Signal to VAD / ASR

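The ordering matters. Noise suppression is a nonlinear, time-varying operation; running it before AEC distorts the echo in the microphone signal so that it no longer looks like a linearly filtered copy of the loudspeaker reference, which is exactly the relationship the adaptive filter models. The conventional ordering is AEC first, then noise suppression, then any further gain control or VAD.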

AEC artifacts — the distortion introduced when echo cancellation is imperfect or too aggressive — can be more damaging to ASR accuracy than the original echo in some scenarios. When the AEC adaptive filter has not converged well, it subtracts the wrong amount of signal and introduces residual distortion into the near-end speech. An ASR model trained on clean microphone speech will have no representation for this distortion and will produce errors. This leads to one of the more counterintuitive failure modes: aggressive AEC can make ASR worse. This is also why the ordering of stages matters: you want the best possible signal entering each stage.

Device-Level vs. Software AEC

Platforms give you options for where AEC is implemented, and the choice has real engineering consequences.

Device-level AEC is implemented either in hardware DSPs on audio chips or in the operating system's audio stack, applied transparently before audio reaches your application. Software AEC is implemented as a library in your application process, operating on raw PCM audio from the microphone and a separately captured loudspeaker reference.

📋 AEC Implementation Layers

| | 🔒 Device / OS AEC | 🔧 Software AEC |
|---|---|---|
| 🎯 Reference tap | Handled internally by driver | You must wire it up correctly |
| ⚙️ Configurability | Low, often a black box | High, tunable parameters |
| 🔄 Cross-platform | Wildly inconsistent | Consistent if you control the library |
| ⚠️ Failure mode | Silent: hard to detect misconfiguration | Visible: misconfigs produce audible artifacts |
| 🧠 Reliability | Strong on target hardware; weak on generic hardware | Predictable across platforms |

The main hazard with device-level AEC is opacity. When it works, it is excellent. When it fails — because the device is used in an unexpected configuration or a driver bug silently disables it — you have no visibility and no recourse.

A practical example: a voice agent running as a browser-based application on a laptop will receive audio processed by the OS audio stack, which may or may not apply AEC depending on the platform, browser, and audio device. On some combinations, AEC is applied twice — once by the OS and once by a software library — which can cause double-cancellation artifacts that are difficult to trace. Before enabling software AEC, verify whether the platform is already applying AEC at the driver level.

💡 Pro Tip: When debugging AEC failures, record the microphone signal, the loudspeaker reference signal, and the AEC output to disk simultaneously with a shared timestamp. The raw microphone recording will always contain the agent's voice when a loudspeaker is in use; what matters is whether it survives cancellation. If you can clearly hear the agent's voice in the AEC output, AEC is either absent or misconfigured. If the AEC output sounds distorted even when the user is speaking clearly, AEC is over-correcting or the reference alignment is off.

Near-End Speech Detection and the AEC Interaction

AEC adaptive filters update their coefficients continuously — that is how they track room changes. But they should not update while the user is talking. If they do, the filter will try to model the user's voice as part of the echo path, corrupting the echo path estimate. This is the role of near-end speech detection (called double-talk detection in the telephony literature): when near-end speech is detected, the adaptive filter coefficients are frozen until it ends.
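
A classic, very simple double-talk detector is the Geigel comparison: declare near-end speech when the microphone sample is large relative to the recent peak of the loudspeaker reference. The sketch below is an illustration under stated assumptions, not a recommendation. The 0.5 threshold presumes the echo path attenuates the loudspeaker signal by at least about 6 dB, and the window length is an arbitrary choice.

```python
from collections import deque


class GeigelDetector:
    """Flags near-end speech when |mic| exceeds a fraction of the recent
    loudspeaker peak. Threshold and window are setup-dependent assumptions."""

    def __init__(self, window: int = 256, threshold: float = 0.5) -> None:
        self.recent_ref = deque(maxlen=window)   # recent |reference| samples
        self.threshold = threshold

    def is_double_talk(self, ref_sample: float, mic_sample: float) -> bool:
        self.recent_ref.append(abs(ref_sample))
        return abs(mic_sample) > self.threshold * max(self.recent_ref)


def should_adapt(detector: GeigelDetector,
                 ref_sample: float, mic_sample: float) -> bool:
    """The adaptive filter updates its coefficients only when this is True."""
    return not detector.is_double_talk(ref_sample, mic_sample)


detector = GeigelDetector()
echo_only = should_adapt(detector, ref_sample=1.0, mic_sample=0.3)
user_talking = should_adapt(detector, ref_sample=0.8, mic_sample=0.9)
print(echo_only, user_talking)  # True False
```

In an NLMS-style loop, the weight update would be skipped whenever `should_adapt` returns False, while the subtraction itself keeps running with the frozen coefficients.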

This creates a circular dependency: AEC needs accurate near-end speech detection to perform well, but the signal presented to VAD for near-end speech detection may itself be contaminated with echo if AEC has not yet converged. Early in a session, or after an acoustic change, this can cause a brief period where both AEC and VAD are operating on unreliable information. The practical resolution is to treat the first few seconds of a session as a calibration window — avoid triggering important conversational state changes during that window.

The Signal Chain as Infrastructure

The key shift in thinking this section aims to produce is treating the audio signal chain — AEC, noise suppression, and their correct ordering and configuration — as infrastructure, not optional enhancement. Every component downstream of the microphone is making implicit assumptions about signal quality. When those assumptions are violated, failure modes are non-local: a misconfigured AEC manifests as bizarre ASR outputs, which manifest as incoherent LLM responses, which look like an LLM bug. Tracing that chain backward requires knowing that the acoustic layer is where to look first.


State Ownership in a Multi-Process Turn

With the acoustic environment addressed, a structural problem remains that no amount of signal processing can fix: a single conversational turn spans multiple concurrent subsystems, each maintaining its own local view of what is happening, and none of them automatically synchronized with the others. That gap between local views and shared reality is the root cause of most of the subtle, hard-to-reproduce bugs that plague voice agents in the field.

The Three Local Views of a Single Turn

Consider what is actually running during a single agent response. You have at minimum three distinct subsystems operating simultaneously:

  • 🎙️ The STT subsystem is consuming audio frames, emitting partial transcripts, and waiting for a VAD event or timeout to finalize a transcript.
  • 🤖 The LLM subsystem has received the finalized transcript, begun generating a response, and is streaming tokens out as they arrive.
  • 🔊 The TTS/playback subsystem is receiving those tokens in chunks, synthesizing audio, and pushing it to a speaker buffer.

Time ──────────────────────────────────────────────────────────────────►

STT     [listening]────[partial]──[partial]──[FINAL]────[listening...]─►

LLM                                          [start]──[streaming...]──►

TTS                                                    [synth]─[play]─►

        ▲                                   ▲          ▲
        │                                   │          │
        User starts                    STT fires   First audio
        speaking                       FINAL event  plays

Each subsystem has its own sense of elapsed time and its own state. The STT process, once it fires a FINAL event, may reset to listening immediately — it does not know or care that the LLM is still generating or that audio has not started playing yet. The failures happen in the gaps between subsystems — in the moments when one subsystem's local view diverges from the true state of the conversation.

The Canonical Question: Is the Agent Currently Speaking?

Among all the state that needs to be shared, one question sits at the top of the hierarchy: "Is the agent currently speaking?" This single value is the linchpin of correct voice agent behavior. Every interruption decision depends on it. Every decision about whether to start processing new audio depends on it.

Ambiguity in this single value is the root cause of three of the most common voice agent bugs:

  1. Double responses. A new user utterance arrives while the agent is still speaking, and because the LLM layer does not know the agent is speaking, it fires another LLM call. The user gets two responses back-to-back.

  2. Missed interruptions. The user speaks while the agent is mid-sentence. The VAD correctly detects speech, the STT correctly transcribes it, the interrupt signal reaches the LLM layer — but because nobody told the TTS/playback subsystem to stop, audio continues to play.

  3. Stale audio playing after a new turn. A buffered TTS chunk from the old response was already queued in the audio driver. It plays out, and the user hears a fragment of the old response followed immediately by the new one.

💡 Real-World Example: A user asks "What time does the store close?" The agent begins answering "The store closes at 9 PM on weekdays, and—" when the user interrupts with "What about Sundays?" The TTS subsystem has already synthesized and buffered "—on weekends it closes at 6 PM." If the buffer flush path is incomplete, the user hears the agent answer their question about Sunday hours even though the system technically processed the interruption. The interruption was handled in one layer and ignored in another.

⚠️ Common Mistake: Treating is_speaking as an output of the TTS subsystem rather than as shared global state. If TTS is the only place that knows it is speaking, no other subsystem can make correct decisions that depend on that fact. The state needs to live somewhere that everything can read it.

Events Arrive Out of Order

The problem is compounded by the asynchronous nature of the event streams these subsystems emit. Network scheduling, OS thread preemption, audio driver buffering, and API response latency all conspire to deliver events in orders that violate intuition:

Actual timeline:
  T=0.0s  User starts speaking (interrupts agent)
  T=0.3s  Agent's queued TTS finishes playing
  T=0.6s  STT finalizes user's interruption utterance

Event arrival order at coordinator:
  T=0.61s  STT_FINAL ("What about Sundays?")   ← arrives first
  T=0.63s  TTS_COMPLETE                          ← arrives second

If your coordinator processes these events naively in arrival order, it sees the STT finalization before it knows TTS completed. A TTS_COMPLETE event can arrive hundreds of milliseconds late relative to the actual moment playback ended — and if the completion handler resets a downstream flag, it may inadvertently affect the new turn's state.

This is a well-understood problem in distributed systems, typically addressed with logical clocks or sequence numbers that let receivers reconstruct intended order. Voice agents running across threads or processes face essentially the same challenge in miniature.
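
As a small illustration using the timeline above, each event can carry the time it occurred as well as the time it arrived, so the receiver can reorder a batch before acting on it. This sketch assumes all subsystems share one monotonic clock, which is plausible on a single host but not across machines. Where clocks cannot be trusted, sequence numbers play the same role. The `Event` fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str            # e.g. "STT_FINAL", "TTS_COMPLETE"
    occurred_at: float   # seconds, stamped by the emitting subsystem
    arrived_at: float    # seconds, stamped when the coordinator received it


def in_occurrence_order(events):
    """Order a batch by when things happened, not when we heard about them."""
    return sorted(events, key=lambda e: e.occurred_at)


arrivals = [
    Event("STT_FINAL", occurred_at=0.6, arrived_at=0.61),
    Event("TTS_COMPLETE", occurred_at=0.3, arrived_at=0.63),
]
ordered = [e.kind for e in in_occurrence_order(arrivals)]
print(ordered)  # ['TTS_COMPLETE', 'STT_FINAL']
```

Reordering only helps within a batch that has already arrived. An event that is still in flight cannot be sorted into place, so the receiver also has to validate events against its current state.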

Centralizing State: The Coordinator Pattern

The architectural response to this problem is to route all state through a single coordinator process — the one place in the system that holds the authoritative answer to "Is the agent currently speaking?" and more broadly, what phase the current turn is in.

This is the same insight that drives single-threaded event loops in asynchronous runtimes, Redux's single store in frontend state management, and database transactions when multiple processes write to shared data. The principle is always the same: concurrent writes to shared state need a single serialization point.

A minimal coordinator owns a state machine that looks roughly like this:

                    ┌──────────────────────────────────────────┐
                    │               COORDINATOR                │
                    │                                          │
       STT events ──┤→  [IDLE]                                 │
      LLM events ───┤       │ STT_FINAL                        │
      TTS events ───┤       ▼                                  │
                    │  [LLM_GENERATING]                        │
                    │       │ first_token / llm_complete       │
                    │       ▼                                  │
                    │  [AGENT_SPEAKING]  ←───────────────┐     │
                    │       │ TTS_COMPLETE + queue_empty │     │
                    │       │ OR interrupt_detected      │     │
                    │       ▼                            │     │
                    │  [FLUSHING]  ──────────────────────┘     │
                    │       │ flush_complete                   │
                    │       ▼                                  │
                    │  [IDLE]                                  │
                    └──────────────────────────────────────────┘

Every event from every subsystem passes through the coordinator before any subsystem acts on it. The coordinator decides whether a given event is valid in the current state: a TTS_COMPLETE that arrives while the coordinator is already in IDLE (because an interruption already flushed) is recognized as stale and discarded. An STT_FINAL that arrives while the coordinator is in AGENT_SPEAKING triggers the interruption path rather than a naive new LLM call.

Because the coordinator is a single process serializing all events, it can attach sequence numbers or timestamps at ingestion time and use its current state — not event arrival order — to decide what is valid. A TTS_COMPLETE carrying a sequence number from a previous turn can be checked against the coordinator's current turn counter and discarded if stale.
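
A minimal sketch of such a coordinator follows. It tracks the four states from the diagram, validates every event against the current state and turn counter, and records the commands it would issue to subsystems. The event kinds, the command strings, and the `pending_utterance` flag are hypothetical names chosen for illustration. A real coordinator would also carry the transcript and run on an event loop.

```python
from dataclasses import dataclass
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LLM_GENERATING = auto()
    AGENT_SPEAKING = auto()
    FLUSHING = auto()


@dataclass(frozen=True)
class Event:
    kind: str         # "STT_FINAL", "FIRST_TOKEN", "TTS_COMPLETE", "FLUSH_COMPLETE"
    turn_id: int = 0  # turn the emitter was serving; unused for STT_FINAL


class Coordinator:
    def __init__(self) -> None:
        self.state = State.IDLE
        self.turn_id = 0
        self.pending_utterance = False
        self.commands = []    # instructions issued to subsystems, in order
        self.discarded = []   # events rejected as stale or invalid

    @property
    def is_speaking(self) -> bool:
        return self.state is State.AGENT_SPEAKING

    def handle(self, event: Event) -> None:
        if event.kind == "STT_FINAL":
            self._on_user_utterance()
        elif event.turn_id != self.turn_id:
            self.discarded.append(event)   # stale: emitted for an earlier turn
        elif event.kind == "FIRST_TOKEN" and self.state is State.LLM_GENERATING:
            self.state = State.AGENT_SPEAKING
        elif event.kind == "TTS_COMPLETE" and self.state is State.AGENT_SPEAKING:
            self._begin_flush(cancel_generation=False)
        elif event.kind == "FLUSH_COMPLETE" and self.state is State.FLUSHING:
            self.state = State.IDLE
            if self.pending_utterance:
                self.pending_utterance = False
                self._start_turn()
        else:
            self.discarded.append(event)   # not valid in the current state

    def _on_user_utterance(self) -> None:
        if self.state is State.IDLE:
            self._start_turn()
            return
        # Interruption path: remember the utterance, stop everything first.
        self.pending_utterance = True
        if self.state is not State.FLUSHING:
            self._begin_flush(cancel_generation=True)

    def _begin_flush(self, cancel_generation: bool) -> None:
        self.state = State.FLUSHING
        if cancel_generation:
            self.commands += ["cancel_llm", "cancel_tts"]
        self.commands.append("flush_playback")

    def _start_turn(self) -> None:
        self.turn_id += 1
        self.state = State.LLM_GENERATING
        self.commands.append(f"start_llm:{self.turn_id}")


c = Coordinator()
c.handle(Event("STT_FINAL"))                   # user asks a question
c.handle(Event("FIRST_TOKEN", turn_id=1))      # agent starts speaking
c.handle(Event("STT_FINAL"))                   # user interrupts
c.handle(Event("FLUSH_COMPLETE", turn_id=1))   # playback confirms the flush
c.handle(Event("TTS_COMPLETE", turn_id=1))     # late event from the old turn
print(c.state, c.commands, c.discarded)
```

The late TTS_COMPLETE is rejected by the turn counter alone, and the second LLM call is not issued until the flush has been confirmed.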

Wrong thinking: "Each subsystem is a black box. I'll let them communicate directly with each other and handle conflicts as they arise."

Correct thinking: "Each subsystem is a black box. All events flow through a coordinator that holds the only authoritative state, and subsystems receive instructions from the coordinator rather than from each other."

💡 Pro Tip: The coordinator's state machine does not need to be complex to be effective. Even a simple implementation tracking four or five states (IDLE, LLM_GENERATING, AGENT_SPEAKING, and FLUSHING from the diagram above, plus an optional INTERRUPTED) eliminates the vast majority of concurrency bugs by making impossible states impossible.

🧠 Mnemonic: Think of the coordinator as the air traffic controller. Aircraft have their own instruments and local knowledge, but they do not make autonomous decisions about who can occupy a runway. A subsystem acting on its own local view is like a plane deciding to land without clearance — correct information locally, catastrophic outcome systemically.

📋 Coordinator Responsibilities

| 🎯 Responsibility | ❌ Without Coordinator | ✅ With Coordinator |
|---|---|---|
| 🔒 is_speaking authority | Each subsystem guesses | Single source of truth |
| 🔄 Event ordering | Arrival order assumed correct | State-based validity check |
| 🗑️ Stale event handling | Processed regardless | Rejected by turn counter |
| ⚡ Interruption propagation | Each subsystem acts independently | Coordinator issues flush command |
| 🐛 Bug localization | Unclear which subsystem is wrong | Check coordinator state at failure time |

Practical Failure Modes: What Goes Wrong and When

Theory clarifies what should happen; failure modes reveal what does happen. The five scenarios below each have a distinct signature — a specific symptom, a specific cause rooted in the subsystems covered earlier, and a specific place to look when diagnosing. Developing a precise vocabulary for these failures is not academic: the difference between a one-hour fix and a three-day investigation is usually whether the engineer on call can name the failure they are looking at.


Failure 1: The Agent Responds to Its Own Voice

What the user experiences: The agent finishes a sentence, pauses, and then — unprompted — continues talking, often repeating or extending what it just said, or responding to fragments of its own speech as if they were user input. In severe cases the agent enters a loop.

What is happening in the pipeline:

Speaker output ──────────────────────────────────────────────────────────┐
                                                                          │ (acoustic path)
Microphone input ◄── [room reverb / no AEC] ◄── agent's own TTS audio ◄──┘
        │
        ▼
     STT engine
        │  transcribes agent speech as user speech
        ▼
     LLM call triggered
        │
        ▼
     More TTS output ──► loop

AEC is either absent entirely or not receiving the loudspeaker reference signal it needs. As a result, the STT engine sees TTS audio as a legitimate human utterance and transcribes it.

Why it is easy to miss during development: Most development happens with headphones or software-only audio routing. Headphones eliminate the acoustic path from speaker to microphone. The failure only surfaces on devices where speaker and microphone share physical space.

⚠️ Common Mistake: Assuming that VAD will save you here. VAD detects that audio is present, not whose audio it is. Without AEC providing a reference, VAD will happily classify the agent's own TTS playback as user speech.

How to confirm the diagnosis: Log the text sent to TTS, the raw microphone signal, and the STT transcript side by side. If the agent's own words appear in the transcript within one to two seconds of playback, AEC is the culprit.
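
The side-by-side comparison can be automated with a rough word-overlap check. The sketch below assumes you keep a log of `(playback_time_seconds, text)` pairs for TTS output; the function name, the two-second window, and the 0.8 overlap threshold are illustrative choices, not values from any particular toolkit.

```python
import re


def _words(text: str) -> list[str]:
    # Lowercase and split into simple word tokens.
    return re.findall(r"[a-z0-9']+", text.lower())


def looks_like_self_echo(transcript: str, transcript_time: float,
                         tts_log: list[tuple[float, str]],
                         window_s: float = 2.0,
                         min_overlap: float = 0.8) -> bool:
    """Return True if most words in `transcript` were spoken by the agent
    during the `window_s` seconds before `transcript_time`."""
    heard = _words(transcript)
    if not heard:
        return False
    recent = set()
    for played_at, text in tts_log:
        if 0.0 <= transcript_time - played_at <= window_s:
            recent.update(_words(text))
    overlap = sum(w in recent for w in heard) / len(heard)
    return overlap >= min_overlap
```

A check like this is a diagnostic aid only. Filtering transcripts with it in production would hide the symptom while leaving the echo path in place.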


Failure 2: The Agent Ignores an Interruption and Finishes Speaking Anyway

What the user experiences: The user starts talking mid-sentence to redirect or correct the agent. The agent acknowledges nothing, finishes its full response, and then (sometimes) reacts to what the user said. This is the voice equivalent of a person who talks past you.

What is happening in the pipeline:

User speaks (t=2.1s) ──► VAD fires ──► interruption event emitted
                                              │
                                              ▼
                                    LLM stream cancelled ✓
                                              │
                                    TTS stream cancelled ✓
                                              │
                                    Playback halted?  ✗  ← gap here
                                              │
                                    Audio buffer drains completely
                                              │
                                    User hears full agent response

The interruption signal propagates partway through the system but stops before reaching the audio playback layer. The LLM call gets cancelled, the TTS generation stops, but no one tells the audio output buffer to flush. This happens because interruption handling typically gets wired incrementally — a developer adds barge-in detection, cancels the LLM call, and marks it working, because from the LLM's perspective it is cancelled. The audio playback layer, often a separate process with its own queue, never receives a flush command.

Consider a pipeline where TTS audio is chunked into 200ms segments and pushed onto a playback queue. At the moment of interruption, the queue holds eight chunks — 1.6 seconds of audio. Even if the TTS stream stops generating new chunks immediately, those eight chunks play out. The user experiences a 1.6-second overrun that feels like the agent ignored them entirely.
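
The arithmetic in this example is easy to reproduce. The following is a minimal sketch of a playback queue with an explicit flush, assuming fixed 200ms chunks as in the scenario above; `PlaybackQueue` and its methods are hypothetical names for illustration.

```python
from collections import deque

CHUNK_MS = 200  # chunk duration taken from the example above


class PlaybackQueue:
    """Holds TTS chunks that have been generated but not yet played."""

    def __init__(self):
        self._chunks = deque()

    def push(self, chunk: bytes) -> None:
        self._chunks.append(chunk)

    def queued_ms(self) -> int:
        # Audio the user will still hear if nothing flushes the queue.
        return len(self._chunks) * CHUNK_MS

    def flush(self) -> int:
        """Drop all queued audio and return how many milliseconds were discarded."""
        dropped = self.queued_ms()
        self._chunks.clear()
        return dropped
```

With eight chunks queued, `queued_ms()` reports 1600. That is the overrun the user hears when cancellation stops at the TTS stream and `flush()` is never called.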

🎯 Key Principle: An interruption is not complete until playback stops, not until the LLM call stops. These are different events owned by different subsystems, and both must be handled.

How to confirm the diagnosis: Add timestamps to three events: (1) VAD fires, (2) LLM stream cancels, (3) last audio sample plays. If the gap between (2) and (3) approaches the duration of audio that was queued at the moment of interruption (1.6 seconds in the example above), the playback flush is missing.
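
Once the three timestamps are logged, the comparison is a pair of subtractions. A small sketch, with hypothetical field names and timestamps in seconds:

```python
def interruption_gaps(vad_fired: float, llm_cancelled: float,
                      last_sample_played: float) -> dict[str, float]:
    """Break the interruption path into its two hops plus the total."""
    return {
        "vad_to_llm_cancel_s": llm_cancelled - vad_fired,
        "llm_cancel_to_silence_s": last_sample_played - llm_cancelled,
        "vad_to_silence_s": last_sample_played - vad_fired,
    }
```

The number the user actually perceives is `vad_to_silence_s`. The middle value is the one that exposes a missing playback flush.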


Failure 3: The Agent Cuts Itself Off Mid-Sentence, Unprompted

What the user experiences: The agent is mid-thought — not a natural pause — and abruptly stops. Sometimes it restarts from the beginning of the same sentence. Sometimes it goes silent and waits for user input that never comes.

What is happening in the pipeline:

TTS generates: "The appointment is scheduled for Thursday—"
                                              |
                             [natural pause between clauses]
                                              |
VAD threshold (too low): "...silence detected, user has finished speaking"
                                              |
                             end-of-speech event fires
                                              |
                      pipeline treats this as: user turn ended → reset state

The VAD end-of-speech threshold is set too aggressively (too short a silence duration, or too sensitive an energy floor), so it fires on the natural inter-phrase pauses present in TTS output. That can only happen when TTS audio reaches the VAD at all: if AEC is incomplete, TTS audio leaks into the microphone path and VAD runs on a mixed signal. The pause between two TTS clauses then looks, to VAD, like the user just finished speaking.

⚠️ Common Mistake: Treating this as a VAD tuning problem when the real fix is AEC. Raising the end-of-speech threshold treats the symptom; removing the echoed TTS signal from the microphone feed treats the cause. Both may be necessary, but the order matters.

How to confirm the diagnosis: Check the VAD event log against the TTS playback log. If end-of-speech events are firing while TTS is actively playing, the VAD is reacting to TTS audio or its echo.
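
Cross-referencing the two logs is a simple interval check. The sketch below assumes end-of-speech events are logged as timestamps and TTS playback as `(start, end)` intervals on the same clock; both formats are assumptions for illustration.

```python
def eos_during_playback(eos_times: list[float],
                        playback_intervals: list[tuple[float, float]]) -> list[float]:
    """Return end-of-speech timestamps that fall inside any TTS playback interval."""
    return [
        t for t in eos_times
        if any(start <= t < end for start, end in playback_intervals)
    ]
```

An empty result does not prove AEC is healthy, but any non-empty result deserves attention: some of those events may come from a real user barge-in, and the rest are the VAD reacting to the agent's own audio.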

🧠 Mnemonic: PAUSE: Playback Audio Undermines Speech-End detection. When the agent's own audio can reach the VAD, every internal pause becomes a false termination trigger.


Failure 4: Double Responses After a Short User Turn

What the user experiences: The user asks a short question and the agent gives the same answer twice in quick succession, or gives two slightly different answers as if it processed the question twice.

What is happening in the pipeline:

User: "What time is it?"
         │
         ▼
  STT partial result (t=0.8s): "what time is"   ──► LLM call A triggered
         │
         ▼
  STT final result  (t=1.1s): "what time is it?" ──► LLM call B triggered

This is the duplicate STT event failure. Streaming STT systems emit partial results (low-latency transcriptions that may still change) and final results (the committed transcription). For short utterances, the partial result is often nearly identical to the final result, and both arrive within a few hundred milliseconds. Without a deduplication guard — state that says "I have already started processing this utterance" — the pipeline fires two LLM calls with nearly identical input.

A conceptual deduplication guard:

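def on_stt_event(state, event):
    # `state` is the shared per-conversation state; `trigger_llm_call` is
    # whatever function dispatches the request to the LLM.
    if (state.current_utterance_id == event.utterance_id
            and state.llm_call_in_flight):
        return  # already handling this utterance; discard the event
    state.current_utterance_id = event.utterance_id
    state.llm_call_in_flight = True
    trigger_llm_call(event.transcript)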

(Production implementations also need to handle cancellation, timeouts, and cases where a final result arrives with a meaningfully different transcript than the partial.)

How to confirm the diagnosis: Log every STT event with its type (partial vs. final), its utterance ID, and whether an LLM call was triggered. Double responses almost always show two LLM calls within a short window with overlapping or identical transcript content.
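
Given such a log, finding the duplicates is a counting exercise. The sketch below assumes each log entry is a dict with `utterance_id`, `type`, and `llm_triggered` keys; that schema is hypothetical and should be adapted to whatever your logging actually records.

```python
from collections import Counter


def duplicate_triggers(stt_log: list[dict]) -> list[str]:
    """Return utterance IDs for which more than one LLM call was triggered."""
    counts = Counter(
        entry["utterance_id"] for entry in stt_log if entry["llm_triggered"]
    )
    return [uid for uid, n in counts.items() if n > 1]
```

Any ID this returns marks a turn where both the partial and the final result, or two finals, were forwarded to the LLM.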


Failure 5: Stale Audio Plays After the Conversation Has Moved On

What the user experiences: The user says something that changes the topic. The agent seems to understand and begins responding — but then a fragment of the previous response surfaces mid-delivery. This is the voice equivalent of a browser loading a page you already navigated away from.

What is happening in the pipeline:

Timeline:
  t=0.0s  Agent begins speaking: "Your flight departs at 9 AM from—"
  t=0.8s  User interrupts: "Wait, I meant the return flight"
  t=0.8s  Interruption detected ──► LLM call A cancelled
                                 ──► TTS stream A cancelled
                                 ──► Playback buffer... partially flushed
  t=0.9s  New LLM call B starts for return flight query
  t=1.5s  Buffered TTS chunk from call A surfaces:
          "...Terminal 2" plays mid-new-response

This is the incomplete flush failure. When an interruption is processed, three distinct cleanup operations must complete: (1) cancel the LLM stream, (2) cancel the TTS stream, (3) flush the playback buffer. The third step is the one most commonly skipped. TTS systems in production often operate with prefetch — audio is generated and buffered ahead of playback to smooth over network latency. A flush operation that only stops new audio from being added to the buffer does not remove audio already there.

The tagging problem makes this subtler: if each TTS chunk is not tagged with a generation ID or turn ID, the playback layer has no way to know which chunks belong to the cancelled call.

Without generation IDs:         With generation IDs:

Buffer: [A1][A2][A3][B1][B2]   Buffer: [A1:gen=1][A2:gen=1][A3:gen=1][B1:gen=2]
         └──────────┘ └───┘            └────────────────────┘ └────────┘
         stale        new              flush all gen=1 chunks  keep gen=2

Race condition: hard to know    Flush is deterministic:
where A ends and B begins       discard everything with gen < current
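
The right-hand side of the diagram can be sketched in a few lines. This is an illustration that assumes an integer generation counter owned by the buffer; `Chunk` and `GenerationBuffer` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Chunk:
    gen: int      # generation (turn) that produced this audio
    audio: bytes


class GenerationBuffer:
    def __init__(self):
        self.current_gen = 1
        self._chunks = deque()

    def push(self, chunk: Chunk) -> bool:
        """Queue a chunk unless it belongs to a cancelled generation."""
        if chunk.gen < self.current_gen:
            return False  # late arrival from a cancelled call
        self._chunks.append(chunk)
        return True

    def interrupt(self) -> int:
        """Start a new generation and discard everything older.
        Returns the number of chunks dropped."""
        self.current_gen += 1
        before = len(self._chunks)
        self._chunks = deque(c for c in self._chunks if c.gen >= self.current_gen)
        return before - len(self._chunks)

    def pending_gens(self) -> list[int]:
        return [c.gen for c in self._chunks]
```

The check in `push` matters as much as the filter in `interrupt`: a chunk from the cancelled call can still arrive after the flush, and without the tag it would be indistinguishable from new audio.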

A TTS prefetch buffer set to three seconds means that at the moment of interruption, up to three seconds of old audio could be queued. Production implementations need flush operations at each stage of the audio pipeline: the TTS decode buffer, the resampling buffer, and the audio device queue.

🎯 Key Principle: Flushing is not a single operation; it is a cascade. Every buffer in the audio path must be addressed in order, from the TTS output stage down to the device-level audio queue, or stale audio will surface from whichever stage was missed.
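
One way to keep the cascade honest is to register every buffering stage in one ordered list and flush them together. A minimal sketch, with hypothetical stage names and queue depths chosen only for illustration:

```python
class Stage:
    """Stand-in for any buffering stage in the audio path."""

    def __init__(self, name: str, queued_ms: int):
        self.name = name
        self.queued_ms = queued_ms

    def flush(self) -> int:
        # Discard everything held by this stage and report how much was dropped.
        dropped, self.queued_ms = self.queued_ms, 0
        return dropped


def flush_cascade(stages: list[Stage]) -> list[tuple[str, int]]:
    """Flush stages in upstream-to-downstream order.
    Returns (stage name, milliseconds dropped) for each stage, for logging."""
    return [(stage.name, stage.flush()) for stage in stages]
```

Logging the per-stage result makes it visible when a newly added buffer was never registered: it simply never appears in the flush report.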

How to confirm the diagnosis: Log each TTS chunk with a generation or turn ID and log when each chunk actually plays. Stale audio will appear as chunks with an old turn ID playing after a new turn ID has started.
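
With that log in place, stale audio can be detected mechanically. The sketch assumes the playback log is a list of `(played_at_seconds, turn_id)` pairs in playback order, with turn IDs that increase over time; both are assumptions about your logging.

```python
def stale_plays(play_log: list[tuple[float, int]]) -> list[tuple[float, int]]:
    """Return log entries whose turn ID is older than a turn already heard."""
    newest = None
    stale = []
    for played_at, turn_id in play_log:
        if newest is not None and turn_id < newest:
            stale.append((played_at, turn_id))
        else:
            newest = turn_id
    return stale
```

Run against a session log, a healthy pipeline returns an empty list. Each entry returned identifies when the leftover audio played and which turn it came from.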


Reading the Pattern Across All Five Failures

| Failure | Immediate Cause | Underlying Pattern |
|---|---|---|
| 🔊 Agent responds to own voice | AEC absent | Signal contamination reaches a downstream processor |
| 🙉 Interruption ignored | Playback flush missing | Event reached one subsystem but not all |
| ✂️ Agent cuts itself off | VAD threshold too aggressive | False event fires due to contaminated signal |
| 🔁 Double responses | No deduplication guard | Same logical event processed twice |
| 👻 Stale audio plays | Incomplete buffer flush | Old state persists into new turn |

Two of the five (the ignored interruption and the stale audio) involve incomplete propagation of an event through the pipeline. Two (the agent responding to its own voice and the agent cutting itself off) involve signal contamination corrupting a downstream processor's input. One (double responses) involves a missing guard against duplicate processing. All five could be prevented or rapidly diagnosed by a single coordinator that owns the canonical answer to "what state is this conversation in right now?"

Wrong thinking: "I'll fix these as they come up in testing."

Correct thinking: "These failures are predictable from the architecture. I can instrument for them proactively — generation IDs on TTS chunks, event logs on STT result types, timestamps on interrupt-to-flush paths — before they surface in production."


Summary and What Comes Next

The Common Root: Concurrent State Without a Single Owner

By the time a voice agent demo breaks in front of a real user, the instinct is to hunt for the specific bug that caused it. That instinct is usually incomplete. The failure was almost certainly a coordination failure: two or three subsystems each operating correctly according to their own local view, while the global state they collectively owned fell into an inconsistent shape that none of them could detect.

The central thesis of this lesson: the hard parts of voice agents are not independent bugs — they share a common root in concurrent state without a single owner. Multiple processes each hold a local view of a shared reality, and no single authority is responsible for keeping those views consistent. Almost every voice agent failure that survives initial testing is a consistency failure, not a logic failure. The individual components are doing what they were coded to do. What breaks is the assumption that their outputs will arrive in an order that reflects reality.

The architectural implication — centralizing state in a single coordinator rather than letting each subsystem track its own assumptions — is the load-bearing decision that makes the rest of the hard parts tractable.

AEC and Noise Suppression Are Infrastructure, Not Features

AEC and noise suppression do not improve the system; they define the preconditions under which the system can function at all. Without AEC, the agent's TTS output leaks into the microphone signal, the VAD cannot distinguish that leakage from user speech, the STT engine transcribes it, and the LLM generates a response to its own previous output. Critically, AEC failure does not produce a noisy signal that downstream components can partially compensate for. It produces a semantically valid but factually incorrect input (a transcription of the agent's own speech) that every downstream component will process without complaint. The system fails silently and confidently.

The practical consequence: treat AEC and noise suppression as infrastructure prerequisites with the same status as network connectivity or audio device initialization. The first integration test worth writing is not "does the agent respond correctly" but "does the transcription remain empty while TTS is playing."
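
A cheap proxy for that test is to measure how loud the post-AEC microphone signal is while TTS plays and nobody is speaking. The sketch below assumes float samples in [-1.0, 1.0] and an energy-based VAD; the -45 dBFS default is a placeholder for illustration, not a recommended value, and a model-based VAD would need a different check.

```python
import math


def rms_dbfs(samples: list[float]) -> float:
    """RMS level of float samples in [-1.0, 1.0], in dB relative to full scale."""
    if not samples:
        return float("-inf")
    mean_sq = sum(s * s for s in samples) / len(samples)
    if mean_sq == 0.0:
        return float("-inf")
    return 10.0 * math.log10(mean_sq)  # equals 20 * log10(rms)


def echo_below_vad_threshold(residual: list[float],
                             vad_threshold_dbfs: float = -45.0) -> bool:
    """True if the post-AEC residual captured during TTS playback is quieter
    than the (assumed) VAD activation level."""
    return rms_dbfs(residual) < vad_threshold_dbfs
```

Capture `residual` on the real target device with its speaker at normal volume. A pass on a headphone setup says nothing, because there is no acoustic path to cancel.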

Diagnosing Failures by Subsystem Ownership

Once you accept that voice agent failures are typically coordination failures, the diagnostic process changes. The question becomes "which subsystem owned the state that was wrong at the moment of failure."

📋 Failure → Likely Owning Subsystem

| 🔍 Symptom | 🔧 First Subsystem to Check | 📌 Coordination Question |
|---|---|---|
| 🔁 Agent responds to its own voice | AEC / audio input | Was echo suppressed before VAD? |
| 🚫 Interruption ignored, agent keeps speaking | Playback ↔ coordinator link | Did interruption signal reach the flush path? |
| ✂️ Agent cuts off mid-sentence unprompted | VAD threshold / AEC leakage | Did TTS audio leak into the mic and trigger end-of-speech? |
| 📨 Double response to one user turn | STT deduplication / LLM dispatch | Were two finalization events both forwarded? |
| 🔊 Stale audio plays after conversation moves on | Playback buffer / flush path | Was the buffer fully cleared on interruption? |

This diagnostic framing also clarifies what good logging looks like. Logging that records what the LLM said is useful for product analysis. Logging that records state transitions — when the coordinator received a VAD event, what the playback status was at that moment, when the STT result was finalized — is what makes coordination failures reproducible and fixable.

What the Child Lessons Each Resolve

The three lessons that follow are each a focused resolution of one slice of the coordination problem introduced here.

Barge-In and Interruption Handling resolves the question of how an interruption signal propagates through the system and what "correctly handled" means end to end — giving the wiring between interrupt detection and playback halt a concrete shape, and establishing what the coordinator must do when it receives an interruption event while in each possible playback state.

The Feedback Loop addresses how the system maintains a clean input signal while simultaneously producing output, and what happens when the boundary between input and output becomes ambiguous. While this lesson introduced what AEC and noise suppression are, the feedback loop lesson addresses how they behave as a pipeline under concurrent load — when the agent is speaking, when the user begins speaking before the agent finishes, and when reverberation tails from the speaker overlap with live user input.

The Playback State Machine is where the architectural recommendation to centralize state becomes concrete. It defines the states a voice agent must track, the events that drive transitions between them, and the invariants that must hold at each state — making "centralize your state" executable by specifying what the state space looks like and what each event must trigger.

Each child lesson is answering a question of the form: given that we have a coordinator, what must it know and do in this specific domain? Barge-in tells it what to do when the user speaks over the agent. The feedback loop tells it what audio conditions it can rely on as stable inputs. The playback state machine tells it what states exist and how to move between them.

Before Moving On

Three applications of the ideas in this lesson are worth carrying forward as active questions:

🔧 Audit your AEC setup first. Before reading the barge-in lesson, verify empirically that AEC is suppressing TTS output below the VAD activation threshold during playback. This single check will either confirm a working foundation or reveal the source of failures that no amount of coordinator logic will fix.

📚 Map your state ownership. Draw — literally — which subsystem currently owns each piece of shared state: who decides "the agent is speaking," who decides "the user has finished speaking," who holds the in-flight LLM response. If any of those is ambiguous or distributed across multiple subsystems, you have located the source of your next hard-to-reproduce bug before it appears in production.

🧠 Build a state transition log before you need it. Instrumentation added after a bug appears is incomplete by definition. Add state transition logging before the system is in production: every VAD event with the current playback state at arrival; every STT finalization with a sequence number; every playback start, pause, and flush with the event that triggered it.

The failures covered in this lesson are not primarily caused by components being poorly written. They are caused by components being written without a shared contract about what the global state is and who owns it. A well-written component in a system without that contract will produce failures that are genuinely hard to locate, because the evidence is distributed across subsystems that each believe they behaved correctly — and they did.