Part IV: Engineering It Well

Make the system robust: gapless browser audio scheduling and a non-blocking async architecture that keeps the event loop responsive.

Why Real-Time Voice Agents Fail in Practice

You've probably heard a voice assistant stutter mid-sentence, repeat itself after a pause, or simply go silent for an uncomfortable beat before responding. If you've tried to build one yourself, you've almost certainly caused these problems — and been baffled by them. The bugs feel intermittent, the stack traces point nowhere useful, and fixing one symptom often introduces another. This isn't bad luck. It's the result of applying web application engineering patterns to a domain with fundamentally different physical constraints. Real-time voice is not a faster version of a chat API. It is a different class of system, and understanding why it breaks is the prerequisite for building one that doesn't.

The two technical challenges this lesson prepares you for are browser audio scheduling — the problem of delivering audio samples to the speaker in a continuous, gapless stream — and async concurrency and orchestration — the problem of keeping every stage of the voice pipeline responsive without blocking the thread that coordinates them.

The Unforgiving Physics of Audio Latency

Human hearing is acutely sensitive to temporal discontinuities in speech. Gaps that are imperceptible in other media — a frame drop in video, a momentary pause in a file download — are immediately jarring in audio. The perceptual threshold is low enough that engineers working on real-time audio systems treat roughly 20 milliseconds as the practical upper bound for any gap before listeners register it as a glitch.

To put that in concrete terms: 20 milliseconds at a sample rate of 44,100 Hz is 882 samples. That is the entire tolerance budget for every scheduling decision your system makes — including the time to receive a network packet, decode it, hand it to the Web Audio API, and get it queued before the speaker demands the next buffer. Video streaming systems routinely buffer several seconds of content ahead of playback precisely because the eye tolerates latency that the ear does not. That strategy is simply unavailable here.
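The arithmetic behind that budget is easy to check directly. A minimal sketch follows; the helper name `samples_in_window` is illustrative, and the 48,000 Hz and 16,000 Hz rates are added here only for comparison.

```python
def samples_in_window(window_ms: float, sample_rate_hz: int) -> int:
    """Number of whole samples that fit in a window of the given length."""
    return int(sample_rate_hz * window_ms / 1000)


# 20 ms tolerance expressed in samples at a few common sample rates.
for rate in (44_100, 48_000, 16_000):
    print(rate, samples_in_window(20, rate))
```

At 44,100 Hz this reproduces the 882-sample figure above; lower capture rates such as 16,000 Hz leave proportionally fewer samples in the same window.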

🎯 Key Principle: In voice systems, the latency budget is not a target to optimize toward — it is a hard constraint that defines which architectural patterns are even eligible for consideration.

💡 Mental Model: Think of your audio pipeline like a bucket brigade fighting a fire. Every person in the chain must pass the bucket fast enough that the person ahead never runs dry. If one link pauses — even briefly — the entire output fails.

This constraint rules out a surprising number of common patterns. Synchronous HTTP calls inside audio callbacks? Out. Blocking queue reads that wait for a result before releasing? Out. The correct patterns require understanding the boundary between the audio clock and the event loop — which we'll establish shortly.

Three Failure Modes You Will Encounter

Most real-time voice bugs cluster into three recognizable failure modes. Being able to name them accurately is half the battle, because each has a distinct cause, a distinct observable symptom, and a distinct fix.

Failure Mode 1: Audio Underrun and Overrun

Audio underrun occurs when the playback engine demands samples that haven't arrived yet — the result is silence, then audio resuming. Audio overrun is the mirror problem: audio data arrives faster than it can be consumed, overfilling the buffer and causing the scheduler to discard samples or lose synchronization.

Underruns typically trace back to scheduling latency — something in your pipeline is too slow to keep the audio output fed. Overruns are rarer but can appear when a TTS service returns audio faster than real time and you naively queue all of it immediately without rate-limiting delivery.

// ❌ Problematic pattern: scheduling audio without clock-anchored delivery
const audioChunks = [];

async function onTTSChunkReceived(chunk) {
  // Assumes chunk is a Blob: arrayBuffer() returns a Promise, so await it first
  const buffer = await audioContext.decodeAudioData(await chunk.arrayBuffer());
  audioChunks.push(buffer);
  if (audioChunks.length === 1) {
    playNext();
  }
}

function playNext() {
  if (audioChunks.length === 0) return;
  const source = audioContext.createBufferSource();
  source.buffer = audioChunks.shift();
  source.connect(audioContext.destination);
  source.start(); // defaults to audioContext.currentTime — not scheduled ahead
  source.onended = playNext; // fires AFTER audio ends — guarantees a gap
}

The problem is subtle but fatal: source.onended fires after the buffer finishes playing, and only then does playNext() start the next buffer at audioContext.currentTime, which by that point is already slightly later than the moment the previous buffer ended. The gap is small but audible, and under any real network jitter, it grows. The correct pattern, scheduling each buffer to start exactly where the previous one ended using the audio clock, is covered in depth ahead.

Failure Mode 2: Event Loop Starvation

JavaScript's runtime uses a single-threaded event loop. If any task takes too long, the tasks behind it — including the ones responsible for forwarding STT results to your LLM client or queuing the next TTS request — simply wait. The most dangerous form comes from synchronous I/O: a blocking network call, a large synchronous JSON parse, a tight loop processing audio frames without yielding.

// ❌ Blocking the event loop with synchronous I/O in a hot path
function processAudioFrame(rawSamples) {
  // Synchronous HTTP — blocks the entire event loop
  const xhr = new XMLHttpRequest();
  xhr.open('POST', '/api/vad', false); // third arg = synchronous ❌
  xhr.send(JSON.stringify({ energy: computeEnergy(rawSamples) }));
  return JSON.parse(xhr.responseText).isSpeech;
}

The event loop halts until the HTTP round-trip completes. In a voice pipeline processing audio frames every 20 ms, every frame either completes in a few milliseconds or everything downstream stalls.
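The same starvation applies to a Python asyncio backend, where it is easy to measure. The sketch below is an illustration under stated assumptions, not part of the lesson's pipeline: it assumes Python 3.9+ for asyncio.to_thread, uses time.sleep as a stand-in for any blocking call, and the helper names are hypothetical.

```python
import asyncio
import time


async def count_ticks_during(work) -> int:
    """Run work() while a 10 ms heartbeat counts how often the loop got a turn."""
    ticks = 0

    async def heartbeat() -> None:
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    hb = asyncio.create_task(heartbeat())
    await asyncio.sleep(0)  # give the heartbeat a chance to start
    await work()
    hb.cancel()
    return ticks


async def blocking_work() -> None:
    time.sleep(0.2)  # blocks the event loop: the heartbeat cannot run


async def offloaded_work() -> None:
    await asyncio.to_thread(time.sleep, 0.2)  # runs in a worker thread instead
```

Running count_ticks_during(blocking_work) yields zero heartbeat ticks, because nothing else is scheduled while the loop is blocked; the offloaded variant lets the heartbeat keep ticking for the whole 200 ms.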

Failure Mode 3: Cascading Delays from Batch Processing

The third failure mode doesn't produce an obvious glitch — latency accumulates quietly across pipeline stages. An STT stage that waits for a complete transcript before emitting to the LLM, an LLM stage that waits for the full completion before beginning TTS, a TTS stage that waits for the full audio file before beginning playback — each wait is individually defensible, but together they transform a system capable of sub-second response into one that routinely takes several seconds.

This is the streaming boundary problem. Stages that could emit partial results incrementally instead behave as batch processors. The pipeline topology is correct, but the data flow contracts are wrong. We'll examine this precisely in the next section.

🤔 Did you know? Human speech perception operates incrementally: listeners begin building a semantic interpretation before the speaker finishes. Voice agents that begin generating a response only after a full transcript is available are fighting a deeply ingrained expectation. Streaming partial results through the pipeline is what closes this gap perceptually.

Two Clocks, One System

This is the most important conceptual point in this section, and the one most often skipped in tutorials covering only the happy path.

Your browser-based voice agent operates under two fundamentally different clocks:

  1. The JavaScript event loop clock — driven by Date.now(), performance.now(), setTimeout, and Promise scheduling. Subject to timer throttling, background-tab clamping, and event loop congestion.

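  2. The Web Audio API clock, accessible via AudioContext.currentTime. Driven by the audio hardware's sample rate, it advances at a fixed, sample-accurate rate while the context is running, regardless of what the JavaScript thread is doing. Timer throttling and event loop congestion do not affect it, although the context can be suspended explicitly (or start suspended under browser autoplay policies), which freezes currentTime.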

// Demonstrating clock divergence
const audioCtx = new AudioContext();

function blockEventLoop(durationMs) {
  const end = performance.now() + durationMs;
  while (performance.now() < end) {} // Spin-wait — blocks the event loop
}

console.log('JS time before block:', performance.now().toFixed(2));
console.log('Audio time before block:', audioCtx.currentTime.toFixed(4));

blockEventLoop(100);

console.log('JS time after block:', performance.now().toFixed(2));
console.log('Audio time after block:', audioCtx.currentTime.toFixed(4));

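// With the context running, both clocks advance by ~100 ms: the audio
// hardware kept going even though JS was frozen. (A context created before
// any user gesture may start suspended, in which case currentTime stays at 0.)
// Audio scheduled relative to JS wall-clock time during this block will be
// late. Audio scheduled against audioCtx.currentTime will not.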

If you schedule audio playback using Date.now() to calculate when to start the next buffer, you will accumulate drift every time the event loop is busy. Over a multi-turn conversation, this drift compounds. If you schedule using audioCtx.currentTime and advance a scheduling cursor by the exact duration of each buffer, consecutive buffers abut exactly on the audio clock, so playback stays gapless regardless of what the event loop is doing, provided each chunk is scheduled before its start time arrives.

Wrong thinking: "I'll use setTimeout to schedule the next audio chunk 200 ms from now, since each chunk is ~200 ms of audio."

Correct thinking: "I'll track a scheduling cursor in audio-clock time. Each new chunk starts at cursor, and I advance cursor by buffer.duration after scheduling. The event loop schedules chunks ahead of time; the audio hardware plays them at the right moment regardless of JS timing."
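The cursor arithmetic can be sketched independently of the browser. The following is a pure-Python model, not Web Audio code: now stands in for audioContext.currentTime, the returned start time is what would be passed to source.start(), and the class name and the 50 ms lookahead are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CursorScheduler:
    """Model of audio-clock scheduling; no real audio is played."""

    lookahead_s: float = 0.05  # assumed safety margin when (re)starting playback
    cursor: float = 0.0        # audio-clock time at which the next chunk starts
    starts: list[float] = field(default_factory=list)

    def schedule(self, now: float, duration_s: float) -> float:
        if self.cursor <= now:
            # Nothing is queued ahead of the clock (first chunk or an underrun):
            # restart slightly in the future so the start time is not missed.
            start = now + self.lookahead_s
        else:
            # Audio is still queued: butt this chunk against the previous one.
            start = self.cursor
        self.starts.append(start)
        self.cursor = start + duration_s
        return start
```

Each chunk's start depends only on the previous chunk's end on the audio clock, so it does not matter exactly when the event loop gets around to calling schedule(), as long as the call happens before the cursor is reached.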

🧠 Mnemonic: Audio clock = Absolute scheduling. Event loop = Event coordination. Never swap A and E.

📋 Quick Reference: The Three Failure Modes

| Failure Mode | Symptom | Root Cause | Fix Direction |
|---|---|---|---|
| Audio Underrun | Silence gaps mid-speech | Scheduler too slow to feed audio clock | Clock-anchored scheduling, lookahead buffering |
| Audio Overrun | Choppy/skipping audio | Data arrives faster than playback | Rate-limited delivery, ring buffer |
| Event Loop Starvation | Dropped STT/TTS responses | Synchronous I/O blocking the loop | Async I/O at every stage |
| Cascading Delay | Multi-second response latency | Batch processing at streaming-capable stages | Streaming data flow, incremental emission |

The Voice Pipeline as a Data Flow Graph

With the failure modes named and the two-clock model clear, the next step is building a precise mental map of how data flows through your system — because the failure modes above don't occur randomly. They cluster at specific structural seams. The tool for seeing those seams is a directed graph of pipeline stages.

The Six Canonical Stages

A typical real-time voice agent has six stages, each consuming data from its predecessor and producing data for its successor:

┌─────────┐     raw PCM      ┌───────────┐    PCM chunks    ┌─────┐
│ Capture │ ───────────────► │ Transport │ ───────────────► │ STT │
└─────────┘   (microphone)   └───────────┘   (WebSocket/    └─────┘
  browser                      (optional)      WebRTC)          │
                                                             text tokens
                                                                 │
                                                                 ▼
┌─────────┐   PCM audio     ┌─────┐    text tokens    ┌─────┐
│Playback │ ◄────────────── │ TTS │ ◄────────────────  │ LLM │
└─────────┘  (audio chunks)  └─────┘   (sentence/      └─────┘
  browser                               word chunks)

Simple to describe, hard to implement correctly — because each arrow represents a contract about chunk size, delivery rate, and error handling that must hold under all conditions.

Producer/Consumer Contracts at Every Edge

Each stage is simultaneously a consumer (with respect to the stage behind it) and a producer (with respect to the stage ahead). A producer/consumer contract specifies three things: the unit of data exchanged, the rate at which data arrives, and what happens when the rate is violated.

Mismatches between producer and consumer rates must be resolved by buffers (the producer writes ahead; the consumer catches up) or backpressure (the consumer signals the producer to slow down). Without one of these mechanisms, you either drop data or accumulate unbounded lag — the mechanics are covered in the next section.

The most common rate mismatch is between STT and LLM. A speech-to-text model may emit a complete transcript only after a silence timeout, while the LLM has its own startup latency before emitting the first token. Neither stage is slow in isolation, but if STT waits for a complete utterance before forwarding anything, and LLM waits for a complete prompt before generating anything, you have introduced two full-turn-duration pauses sequentially. That latency is invisible in a unit test of either stage but devastating in the integrated system.

🎯 Key Principle: Never design a stage to wait for its entire input before producing any output, if the stage can reasonably begin work on a prefix. The pipeline's perceived latency is the sum of the waiting times at every stage, not the sum of the processing times.

Streaming vs. Batch Boundaries

It helps to classify each stage as one of three types:

  • Streaming-in, streaming-out: The stage emits output chunks as it receives input chunks. Example: raw audio capture feeding a WebSocket transport.
  • Batch-in, streaming-out: The stage needs a complete input before starting, but once started emits incrementally. Example: a streaming LLM — it needs the full prompt assembled, but emits tokens one at a time.
  • Streaming-in, batch-out: The stage accumulates until some condition is met, then emits a single result. Example: a VAD-gated STT stage that accumulates audio until silence, then emits a transcript.

The streaming vs. batch boundary is the precise term for any edge that transitions between these types. These boundaries deserve special attention because they introduce minimum latency floors: you cannot emit output from the downstream stage until the upstream stage has emitted at least one unit of output.

The classic place this goes wrong is TTS sentence chunking. A naive implementation sends the entire LLM response to TTS after generation completes. A streaming implementation monitors the token stream for sentence-ending punctuation and dispatches each sentence to TTS as soon as it's complete. That alone can reduce perceived response latency by one to two seconds, because TTS synthesis of sentence one overlaps with LLM generation of sentences two and three.

import asyncio

async def stream_llm_to_tts(
    llm_stream,       # async iterator yielding token strings
    tts_queue,        # asyncio.Queue for TTS sentence chunks
    sentence_enders: str = ".!?",
) -> None:
    """
    Reads tokens from an LLM stream, accumulates them into sentences,
    and dispatches each sentence to the TTS queue as soon as it's complete.
    Turns the LLM's token stream into sentence-sized streaming input for TTS.
    """
    buffer = []
    async for token in llm_stream:
        buffer.append(token)
        if token.strip() and token.strip()[-1] in sentence_enders:
            sentence = "".join(buffer).strip()
            if sentence:
                await tts_queue.put(sentence)  # backpressure: suspends if queue full
            buffer.clear()

    # Flush any trailing text that didn't end with punctuation
    if buffer:
        remainder = "".join(buffer).strip()
        if remainder:
            await tts_queue.put(remainder)

Notice that await tts_queue.put() suspends this coroutine if the TTS consumer is slow — that's backpressure working correctly. A synchronous call here would block the event loop; an unbounded queue would hide a slow TTS stage until memory exhaustion.

⚠️ Common Mistake: Treating LLM token streaming as an implementation detail rather than a pipeline design decision. If your LLM integration waits for response.text (a fully materialized string) instead of iterating over a streaming response, you have silently converted a batch-in/streaming-out stage into batch-in/batch-out, and every downstream stage pays the full generation time as added latency.

Shared State Across Stages

The pipeline stages don't just pass data — they must also share context: session ID, turn history, and cancellation signals.

Wrong thinking: Store shared state in a module-level global dictionary keyed by session ID. All stages read and write directly from it.

Correct thinking: Create a per-session context object at the start of a turn and pass it explicitly through function arguments into each stage.

Globals fail for a concrete reason. When the user interrupts mid-response, two coroutines become active: one draining the TTS queue for the current turn (which should be cancelled), and one beginning STT processing for the new turn. If both read and write to the same global, the new turn's context update can overwrite state the old turn's cleanup logic is still reading — a race condition that produces garbled responses or missed cancellations.

Passing a context object through function arguments makes the scope explicit: each coroutine holds a reference to the context of the turn it was created for, and when that turn ends, the context is discarded. No shared mutable state means no race.

import asyncio
from dataclasses import dataclass

@dataclass
class TurnContext:
    session_id: str
    history: list[dict]          # conversation history for the LLM prompt
    cancel_event: asyncio.Event  # set this to abort all stages for this turn
    tts_queue: asyncio.Queue     # sentences awaiting TTS synthesis

async def run_llm_stage(
    ctx: TurnContext,
    transcript: str,
    llm_client,
) -> None:
    prompt = ctx.history + [{"role": "user", "content": transcript}]
    buffer = []

    async for token in llm_client.stream(prompt):
        if ctx.cancel_event.is_set():
            break
        buffer.append(token)
        if token.strip().endswith(('.', '!', '?')):
            sentence = "".join(buffer).strip()
            await ctx.tts_queue.put(sentence)
            buffer.clear()

    # Flush trailing text, skipping whitespace-only remainders
    remainder = "".join(buffer).strip()
    if remainder and not ctx.cancel_event.is_set():
        await ctx.tts_queue.put(remainder)

async def handle_interruption(ctx: TurnContext) -> None:
    ctx.cancel_event.set()
    while not ctx.tts_queue.empty():
        try:
            ctx.tts_queue.get_nowait()
        except asyncio.QueueEmpty:
            break

The TurnContext dataclass is the explicit contract for shared state. When cancel_event is set, every coroutine holding that context sees it at its next checkpoint. This is cancellation propagation at the right level of abstraction: not a global flag, not a thread kill, but a scoped signal that travels with the turn.

💡 Pro Tip: Make cancel_event an asyncio.Event rather than a boolean flag. Events can be awaited, support multiple waiters without polling, and wake sleeping coroutines immediately.
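Because an event can be awaited, a stage can race a slow call against cancellation instead of polling is_set() between steps. A minimal sketch, assuming a hypothetical helper named run_until_cancelled; it is one possible way to use the event, not the lesson's own implementation.

```python
import asyncio


async def run_until_cancelled(coro, cancel_event: asyncio.Event):
    """Await coro, but abandon it as soon as cancel_event is set.

    Returns the coroutine's result, or None if cancellation won the race.
    """
    work = asyncio.create_task(coro)
    cancel_wait = asyncio.create_task(cancel_event.wait())
    done, _pending = await asyncio.wait(
        {work, cancel_wait}, return_when=asyncio.FIRST_COMPLETED
    )
    if work in done:
        cancel_wait.cancel()  # the work finished first; stop waiting on the event
        return work.result()
    work.cancel()  # stop the in-flight call instead of letting it finish
    return None
```

Wrapping a long TTS or LLM request this way means the stage wakes as soon as the event fires, rather than at its next polling point.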

Drawing the Graph for Your Agent

Before writing code, draw your actual graph and annotate each edge with three properties:

  1. Data unit: what is the atom of data passing over this edge? (e.g., PCMChunk(frames=480, sample_rate=16000), str)
  2. Expected rate: how frequently does data arrive? (e.g., every 30 ms, on silence detection, per sentence)
  3. Buffer policy: what happens when the consumer is slower than the producer?

A concrete annotated graph might look like this:

┌─────────┐ PCMChunk(480 frames)  ┌───────────┐ PCMChunk(480 frames)  ┌─────┐
│ Capture │ every 30 ms           │ Transport │ every 30 ms           │ STT │
│  (VAD)  │ ────────────────────► │           │ ─────────────────────►│     │
└─────────┘                       └───────────┘                        └─────┘
  POLICY: ring buffer,               POLICY: bounded queue,                 │
  overwrite old audio                drop oldest on overflow           str, on silence
  (silence is cheap to lose)         (latency over loss)                    │
                                                                             ▼
┌─────────┐  PCMChunk(variable)  ┌─────┐  sentence str              ┌─────┐
│Playback │ ◄──────────────────  │ TTS │ ◄────────────────────────  │ LLM │
│AudioCtx │                      │     │                            │     │
└─────────┘                      └─────┘                            └─────┘
  POLICY: scheduling queue,        POLICY: bounded queue,             streams tokens
  audio clock as source of truth   cancel on ctx.cancel_event         directly

Notice what this diagram reveals: the Capture → Transport edge and the TTS → Playback edge have fundamentally different overflow policies, and for good reason. Raw microphone audio is continuous and cheap to lose — overwriting the oldest chunk in a ring buffer is acceptable because VAD will detect speech again on the next chunk. But synthesized TTS audio must be delivered in order; dropping a chunk means the user hears a gap.

The graph also reveals where latency accumulates. Any stage with a batch boundary adds the duration of its input assembly to end-to-end latency. For a typical cloud-hosted pipeline — VAD silence timeout (300–500 ms) + STT inference (200–400 ms) + LLM first token (300–600 ms) + TTS first sentence (200–400 ms) + audio scheduling (~20 ms) — the floor is around one to two seconds under ideal conditions. Knowing this before you start building lets you make deliberate trade-offs rather than discovering the latency budget is already spent after integration.
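The floor quoted above is just the sum of the per-stage ranges. A short sketch makes the arithmetic explicit; the dictionary transcribes the figures from the preceding paragraph, and the function name is illustrative.

```python
# Per-stage latency ranges in milliseconds, copied from the estimate above.
STAGES_MS = {
    "VAD silence timeout": (300, 500),
    "STT inference": (200, 400),
    "LLM first token": (300, 600),
    "TTS first sentence": (200, 400),
    "audio scheduling": (20, 20),
}


def latency_range_ms(stages: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Best-case and worst-case end-to-end latency for strictly sequential stages."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


print(latency_range_ms(STAGES_MS))  # (1020, 1920)
```

Keeping a table like this next to the annotated graph makes it obvious which stage to attack first when the total exceeds the target.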

🎯 Key Principle: The graph is a communication tool as much as a design tool. A team that shares a common annotated diagram will catch integration bugs in design review that would otherwise take hours to diagnose in production.


Buffering, Backpressure, and Timing Contracts

The pipeline graph tells you where the seams are. This section gives you the mechanics for getting those seams right — choosing the correct buffer type, establishing backpressure, picking chunk sizes that honor the latency budget, and threading cancellation through the entire graph.

Ring Buffers vs. Queues

The first decision at any pipeline seam is what happens when data arrives faster than it can be consumed. Two structures answer this with opposite philosophies.

A ring buffer has a fixed capacity. When it fills and a new item arrives, the oldest item is silently overwritten. For raw audio capture this is exactly the right behavior — if your processing thread falls slightly behind due to a GC pause, you want the most recent microphone samples, not a growing backlog of stale ones.

A queue (bounded) preserves every item in order. When it fills, the producer must block and wait or drop the item. This is the right contract for discrete semantic events: transcription results, LLM token chunks, TTS audio segments. Dropping a transcription result means silently losing part of what the user said.

Audio capture boundary         Transcription result boundary

  [Mic] --audio samples-->  [Ring Buffer]  -->  [STT processor]
         fixed-rate              overwrites           variable-rate
         stream                  on overflow          consumer

  [STT] --transcripts-->  [Bounded Queue]  -->  [LLM stage]
         variable-rate         blocks or            variable-rate
         producer              drops on full        consumer

Getting this backwards — using a ring buffer for transcription results — produces a subtle, hard-to-reproduce bug where occasional transcription chunks are silently dropped under load.
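The two overflow behaviors can be seen side by side in a few lines of Python. This is a sketch using standard-library types only; the offer helper and the frame names are made up for illustration.

```python
import asyncio
from collections import deque

# Ring buffer: a deque with maxlen silently evicts the oldest item when full.
mic_ring: deque[bytes] = deque(maxlen=3)
for i in range(5):
    mic_ring.append(f"frame-{i}".encode())
# mic_ring now holds frame-2, frame-3, frame-4; frames 0 and 1 were overwritten.


def offer(queue: asyncio.Queue, item) -> bool:
    """Non-blocking put that reports overflow instead of hiding it."""
    try:
        queue.put_nowait(item)
        return True
    except asyncio.QueueFull:
        return False
```

With the deque, losing old audio is the intended behavior. With the bounded queue, the caller is forced to decide what a full queue means, which is exactly the decision a transcription boundary needs.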

⚠️ Common Mistake: Using an unbounded asyncio.Queue() or JavaScript Array as the push target for every pipeline stage. Under a slow network, the LLM stage takes longer to drain its input queue than the STT stage takes to fill it, and memory grows monotonically through the conversation turn. Always set an explicit maxsize and decide your overflow policy before writing the producer.

import asyncio

# Bounded queue for transcription results.
transcript_queue: asyncio.Queue[str] = asyncio.Queue(maxsize=10)

async def stt_producer(transcript_queue: asyncio.Queue[str]) -> None:
    # receive_transcripts() is a placeholder for your STT client's async iterator.
    async for transcript in receive_transcripts():
        # put() never raises QueueFull; it suspends until the consumer makes
        # room, which is the "block producer" overflow policy. To drop instead,
        # use put_nowait() and catch asyncio.QueueFull.
        await transcript_queue.put(transcript)

async def llm_consumer(transcript_queue: asyncio.Queue[str]) -> None:
    while True:
        transcript = await transcript_queue.get()
        await call_llm(transcript)  # yields control while awaiting network
        transcript_queue.task_done()

await transcript_queue.put(transcript) suspends the coroutine rather than spinning, so the event loop stays free to run other tasks — including the audio playback scheduler.

Backpressure: Letting the Consumer Set the Pace

Backpressure is the mechanism by which a slow consumer signals a fast producer to pause. Without it, the producer keeps pushing data, the queue grows, and each newly consumed item is older than the last — meaning the user hears a response that was generated relative to a conversational state from seconds ago. Latency does not stay constant; it increases monotonically across the conversation turn.

🎯 Key Principle: Backpressure does not eliminate the latency mismatch between a fast producer and a slow consumer. It bounds it. Without backpressure the bound is memory. With backpressure the bound is your queue size times your chunk size.

In async Python, asyncio.Queue with a maxsize gives you backpressure for free: await put() suspends the producer until there is space. In JavaScript with the Web Streams API, ReadableStream and WritableStream implement backpressure through their internal queue and the desiredSize property: when desiredSize drops to zero or below, the writer's ready promise stays pending until the consumer catches up, and a well-behaved producer awaits writer.ready before writing more.
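The asyncio side of this is small enough to observe directly. The sketch below pairs a fast producer with a deliberately slow consumer and records the deepest the queue ever gets; the function name, item count, and 1 ms delay are arbitrary choices for illustration.

```python
import asyncio


async def demo_backpressure(n_items: int = 20, maxsize: int = 3) -> int:
    """Fast producer, slow consumer; returns the largest queue depth seen."""
    queue: asyncio.Queue[int] = asyncio.Queue(maxsize=maxsize)
    peak = 0

    async def producer() -> None:
        nonlocal peak
        for i in range(n_items):
            await queue.put(i)  # suspends here whenever the queue is full
            peak = max(peak, queue.qsize())

    async def consumer() -> None:
        for _ in range(n_items):
            await queue.get()
            await asyncio.sleep(0.001)  # simulated slow downstream work

    await asyncio.gather(producer(), consumer())
    return peak
```

However far ahead the producer could run, the recorded peak never exceeds maxsize: the consumer's pace, not the producer's, determines how much work is in flight.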

A common place backpressure gets accidentally bypassed is at the TTS–playback boundary. TTS services often return audio quickly for short sentences, and it is tempting to fire-and-forget each audio chunk into the Web Audio API scheduler without tracking how much audio is already enqueued. During a long LLM response, this can pre-schedule dozens of seconds of audio, which makes instantaneous cancellation very difficult.

Chunk Sizing: The Latency–Overhead Trade-off

Every stage that processes a stream must decide how large a unit of work to process at once. Chunk size is the most direct dial for end-to-end latency, but it has a lower bound below which you pay more cost than you save.

A stage cannot emit output until it has processed at least one full chunk. If the STT stage processes 300 ms of audio per chunk, the LLM stage cannot start until 300 ms has accumulated. So smaller chunks reduce latency. But smaller chunks also mean more function calls, more allocations, more context switches, and a higher risk of underrun — where the playback side runs out of pre-buffered audio because upstream stages could not keep pace.

Worked Example: Choosing Chunk Size for a 500 ms End-to-End Latency Target

| Stage | Variable per-chunk cost | Fixed per-chunk overhead |
|---|---|---|
| Audio capture → STT | network RTT ~150 ms | ~5 ms |
| STT → LLM first token | model latency ~200 ms | ~10 ms |
| LLM first token → TTS | streaming, ~50 ms/sentence | ~5 ms |
| TTS → playback | audio scheduling | ~2 ms |

The variable costs (network RTT, model latency) are not reduced by smaller chunk sizes — they are per-request costs. What you can tune is the audio duration that flows into STT. Sending 100 ms audio chunks instead of 300 ms shaves 200 ms off the time STT waits before having enough audio — but also sends 3× as many requests, tripling fixed overhead and rate-limiting risk.
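Read as a budget, the table leaves little room for the audio chunk itself. The sketch below is a rough calculation under explicit assumptions: the four stages run strictly in sequence, the "~2 ms" in the last row is that stage's fixed overhead, and a single sentence of TTS streaming delay is counted. The dictionary simply transcribes the table.

```python
TARGET_MS = 500

# (variable cost, fixed overhead) per stage in ms, transcribed from the table.
STAGE_COSTS_MS = {
    "capture -> STT": (150, 5),
    "STT -> LLM first token": (200, 10),
    "LLM first token -> TTS": (50, 5),
    "TTS -> playback": (0, 2),
}


def end_to_end_ms(chunk_ms: int) -> int:
    """Latency to first audio: chunk accumulation plus every stage's costs."""
    return chunk_ms + sum(var + fixed for var, fixed in STAGE_COSTS_MS.values())


def max_chunk_ms(target_ms: int = TARGET_MS) -> int:
    """Largest chunk duration that still fits inside the target."""
    return target_ms - end_to_end_ms(0)
```

Under these assumptions a 300 ms chunk gives 722 ms end to end, a 100 ms chunk gives 522 ms, and only chunks of roughly 78 ms or less meet the 500 ms target, which is why the per-request costs, not the chunk size, dominate the budget.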

💡 Real-World Example: A reasonable heuristic for STT streaming is to accumulate audio until either a silence boundary is detected or a maximum chunk duration is reached — not on a fixed timer. Short utterances get processed quickly; long utterances don't accumulate excessive lag.

For playback, the complementary calculation: how much audio must be pre-buffered in the Web Audio scheduler before playback begins to avoid underrun? If your TTS-to-playback network jitter is up to 80 ms, you need at least 80 ms buffered at start time. Buffer less and any jitter spike causes a gap; buffer more and the user hears a longer silence before the first word.

import asyncio
from dataclasses import dataclass

@dataclass
class ChunkPolicy:
    max_duration_ms: int
    min_duration_ms: int
    silence_threshold_ms: int

STT_CHUNK_POLICY = ChunkPolicy(
    max_duration_ms=200,
    min_duration_ms=40,
    silence_threshold_ms=80,
)

async def flush_to_stt(
    audio_buffer: bytearray,
    policy: ChunkPolicy,
    silence_detected: bool,
    transcript_queue: asyncio.Queue[str],
) -> bytearray:
    duration_ms = len(audio_buffer) // 32  # 16kHz, 16-bit mono: 32 bytes/ms

    should_flush = (
        duration_ms >= policy.max_duration_ms
        or (silence_detected and duration_ms >= policy.min_duration_ms)
    )

    if not should_flush:
        return audio_buffer

    transcript = await send_to_stt(bytes(audio_buffer))
    await transcript_queue.put(transcript)
    return bytearray()

Cancellation Propagation: Making Interruption Truly Immediate

When a user interrupts the agent mid-response, the correct behavior is that audio stops within tens of milliseconds. This is one of the behaviors most commonly implemented incorrectly, because cancellation is treated as an afterthought rather than a first-class signal.

By the time the user interrupts, multiple stages have already produced and buffered data:

[LLM] -- tokens --> [TTS] -- audio chunks --> [Playback Queue] --> [Audio Clock]
          ↑                      ↑                    ↑
    buffered in           buffered in TTS        buffered in
    token queue           output queue           Web Audio graph

If the cancellation signal only reaches the LLM stage — stopping new token generation — the TTS stage continues processing already-received tokens, the playback queue continues filling, and the audio clock continues playing. The user hears the agent keep talking for several seconds after the interruption.

Cancellation propagation means the signal must reach every stage at once: the LLM itself and every stage downstream of it.

import asyncio

class TurnContext:
    def __init__(self) -> None:
        self._cancel_event = asyncio.Event()

    def cancel(self) -> None:
        self._cancel_event.set()

    @property
    def cancelled(self) -> bool:
        return self._cancel_event.is_set()


async def tts_stage(
    token_queue: asyncio.Queue[str],
    audio_queue: asyncio.Queue[bytes],
    ctx: TurnContext,
) -> None:
    while not ctx.cancelled:
        try:
            token = await asyncio.wait_for(token_queue.get(), timeout=0.05)
        except asyncio.TimeoutError:
            continue  # re-check ctx.cancelled

        audio_chunk = await synthesize_speech(token)

        if ctx.cancelled:
            return  # don't enqueue audio generated after cancellation

        await audio_queue.put(audio_chunk)


async def handle_interruption(
    ctx: TurnContext,
    audio_queue: asyncio.Queue[bytes],
) -> None:
    ctx.cancel()
    # Drain audio queued before cancellation reached this stage
    while not audio_queue.empty():
        try:
            audio_queue.get_nowait()
        except asyncio.QueueEmpty:
            break
    # Caller must also clear scheduled audio from the Web Audio context
    # (stop() on active AudioBufferSourceNode instances)

Draining the audio queue is necessary but not sufficient: audio already handed off to the browser's Web Audio scheduler must also be cancelled at the Web Audio layer by calling stop() on active AudioBufferSourceNode instances. The audio context has its own timeline and will continue playing scheduled buffers regardless of application logic — as established in the two-clock model.

🧠 Mnemonic: Think of cancellation as a fire alarm, not a power switch. A power switch only cuts the source. A fire alarm must reach every room simultaneously.

💡 Mental Model: For each stage, ask: "If cancellation fires right now, how many milliseconds of output have I already produced that will play regardless?" The sum across all stages is your worst-case cancellation latency. For a well-designed pipeline with small bounded queues, this should be under 200 ms.
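The mental model above amounts to a small sum. A minimal sketch, assuming each stage can report how many items it has already committed downstream and how much playback time each item represents; the `StageInFlight` type and the numbers are illustrative, not measurements:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageInFlight:
    """Output a stage has already produced that will still play after cancel."""
    name: str
    queued_items: int    # items already committed to the next stage
    ms_per_item: float   # playback duration each item represents

    @property
    def committed_ms(self) -> float:
        return self.queued_items * self.ms_per_item


def worst_case_cancel_latency_ms(stages: list[StageInFlight]) -> float:
    # Audio committed at every stage plays out unless it is explicitly cleared.
    return sum(stage.committed_ms for stage in stages)


# Illustrative numbers only.
pipeline = [
    StageInFlight("tts_output_queue", queued_items=1, ms_per_item=100.0),
    StageInFlight("web_audio_scheduled", queued_items=2, ms_per_item=40.0),
]
latency = worst_case_cancel_latency_ms(pipeline)
print(latency)  # 180.0
```

Shrinking either the queue bound or the chunk duration at a stage lowers its term in the sum directly.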

⚠️ Common Mistake: Cancelling only the outermost LLM request and assuming downstream stages will naturally drain. They will — but not before playing several seconds of already-synthesized audio. Cancellation is not garbage collection.

Timing Contract Summary

📋 Quick Reference: Buffer Decisions at Each Pipeline Boundary

Boundary | Buffer Type | Overflow Policy | Latency Contribution
Mic → STT | Ring buffer | Overwrite oldest audio | ~chunk duration (40–200 ms)
STT → LLM | Bounded queue | Block producer | Near zero (queue drain)
LLM → TTS | Bounded queue | Block producer | Near zero (queue drain)
TTS → Playback | Bounded queue + drain on cancel | Block producer; drain on cancel | ~pre-buffer duration (50–150 ms)

Every row is a decision, not a default. Skipping any one and relying on runtime defaults — an unbounded queue, no overflow handling — is how pipelines that work in development fall apart under realistic load.
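The two overflow policies in the table can be sketched with standard-library types. This is a minimal illustration, assuming a `collections.deque` with `maxlen` as the ring buffer and a bounded `asyncio.Queue` for backpressure; the sizes and names are arbitrary:

```python
import asyncio
from collections import deque

# Mic -> STT: ring buffer. A deque with maxlen discards the oldest item
# when a new one is appended to a full buffer.
mic_ring: deque[bytes] = deque(maxlen=3)
for i in range(5):
    mic_ring.append(bytes([i]))
print(list(mic_ring))  # [b'\x02', b'\x03', b'\x04']


async def demo_backpressure() -> list[str]:
    # LLM -> TTS: bounded queue. put() suspends the producer while the
    # queue is full, so a slow consumer slows the producer down.
    tokens: asyncio.Queue[str] = asyncio.Queue(maxsize=2)
    events: list[str] = []

    async def producer() -> None:
        for tok in ["a", "b", "c"]:
            await tokens.put(tok)
            events.append(f"put {tok}")

    async def consumer() -> None:
        await asyncio.sleep(0.01)  # consumer starts late
        for _ in range(3):
            tok = await tokens.get()
            events.append(f"got {tok}")

    await asyncio.gather(producer(), consumer())
    return events


events = asyncio.run(demo_backpressure())
print(events)
```

In the backpressure demo, the third `put` cannot complete until the consumer has removed at least one item, which is exactly the "block producer" behavior the table asks for.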


Common Engineering Mistakes and How to Diagnose Them

Real-time voice agents fail in ways that are disproportionately hard to debug. The symptoms (a half-second stutter, a response that arrives too late, a conversation that gradually becomes more sluggish) rarely point directly at their causes. The root causes almost always come from a small set of architectural mistakes that are easy to make and easy to miss in development, precisely because development conditions are forgiving compared to what users actually experience.

Mistake 1: Blocking the Event Loop

Event loop blocking is the single most common cause of pipeline stalls. It happens when a synchronous operation — a network call, a disk read, a CPU-intensive loop — occupies the thread, preventing any other work from being scheduled.

The failure mode is particularly insidious in audio callbacks. If your callback includes a synchronous HTTP call to an STT API — even one that typically completes in 50 ms — you have introduced an unbounded block. On a good development network this often goes unnoticed. Under a loaded production network, the same call might take 400 ms, causing a complete pipeline stall.

# ❌ WRONG: synchronous HTTP call inside what should be a non-blocking handler
import httpx

def on_audio_chunk(chunk: bytes) -> None:
    # Blocks the thread until HTTP completes — everything else waits
    response = httpx.post(
        "https://api.example.com/transcribe",
        content=chunk,
        timeout=5.0
    )
    handle_transcript(response.json())

# ✅ CORRECT: enqueue the chunk; a separate coroutine handles the I/O
import asyncio
import httpx

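audio_queue: asyncio.Queue[bytes] = asyncio.Queue(maxsize=50)  # bounded; 50 is illustrative

def on_audio_chunk(chunk: bytes) -> None:
    # Assumes the callback runs on the event loop thread
    if audio_queue.full():
        audio_queue.get_nowait()   # overflow policy: drop the oldest chunk
        audio_queue.task_done()    # keep join() accounting balanced
    audio_queue.put_nowait(chunk)  # returns immediately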

async def transcription_worker(client: httpx.AsyncClient) -> None:
    while True:
        chunk = await audio_queue.get()
        response = await client.post(
            "https://api.example.com/transcribe",
            content=chunk,
            timeout=5.0
        )
        handle_transcript(response.json())
        audio_queue.task_done()

Treat the audio callback as a minimal handoff point: it puts data somewhere and returns immediately. A separate coroutine performs the I/O.

Diagnosing Event Loop Blocking

The best tool is a CPU flame graph captured under realistic load. Look for wide, flat stacks corresponding to blocking calls — socket.recv, time.sleep, synchronous HTTP client internals. In Python, asyncio's built-in slow callback logging is a fast first check:

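import asyncio
import logging

logging.basicConfig(level=logging.DEBUG)

async def main() -> None:
    loop = asyncio.get_running_loop()
    loop.slow_callback_duration = 0.05  # log any callback taking longer than 50ms
    await run_pipeline()  # placeholder for your pipeline's entry point

# Slow-callback warnings are only emitted when asyncio debug mode is on
asyncio.run(main(), debug=True)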

If warnings appear immediately, you have synchronous work on the hot path.

Mistake 2: Scheduling Audio with Wall-Clock Time

This mistake can hide in a working system for days. Using Date.now() in JavaScript (or time.time() in Python) to calculate when to schedule the next audio buffer creates a dependency on the OS wall clock rather than the audio subsystem's own clock.

The audio subsystem runs on a hardware clock tied to the output device's sample rate. The wall clock is subject to OS scheduling jitter, NTP corrections, and power management events. Over a short interaction the difference is negligible. Over a multi-turn conversation, drift compounds:

Turn 1:  wall clock ahead by  2ms  → tiny gap at buffer boundary
Turn 2:  wall clock ahead by  5ms  → small audible click
Turn 3:  wall clock ahead by 11ms  → noticeable glitch
Turn 4:  wall clock ahead by 19ms  → jarring gap

The correct approach is to schedule all audio against AudioContext.currentTime and track a playhead position — the audio-context timestamp at which the next buffer should begin — advancing it by the duration of each buffer after scheduling.

// ✅ CORRECT: schedule against AudioContext.currentTime, not Date.now()
class GaplessScheduler {
  constructor(audioContext) {
    this.ctx = audioContext;
    this.playhead = audioContext.currentTime + 0.05; // small initial lookahead
  }

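  schedule(audioBuffer) {
    const source = this.ctx.createBufferSource();
    source.buffer = audioBuffer;
    source.connect(this.ctx.destination);
    // After an underrun the playhead may lie in the past; re-anchor it
    // so late buffers do not start immediately and overlap.
    this.playhead = Math.max(this.playhead, this.ctx.currentTime + 0.05);
    source.start(this.playhead);
    // Advance by exact duration; no wall-clock involvement
    this.playhead += audioBuffer.duration;
  }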
}

🎯 Key Principle: Wall-clock time tells you when a chunk arrived. The audio clock tells you when a chunk should play. These are different numbers, and using the wrong one causes drift.

Mistake 3: Unbounded Turn Context Growth

A voice agent that begins a conversation at 200 ms LLM latency and ends it at 900 ms is almost certainly suffering from context accumulation. Every message appended to conversation history grows the LLM's input token count. Without a truncation strategy, this growth is monotonic — each turn is strictly more expensive than the last.

Token growth across turns (no truncation):

Turn  1: [system prompt] + [user: 12 tokens]               = ~400 tokens
Turn  5: [system prompt] + [4 prior turns] + [user msg]    = ~1,200 tokens
Turn 10: [system prompt] + [9 prior turns] + [user msg]    = ~2,800 tokens
Turn 20: [system prompt] + [19 prior turns] + [user msg]   = ~6,000 tokens

The failure is easy to miss in development because developers typically run short test conversations. Common practical truncation strategies:

  • Sliding window: Keep only the most recent N turns. Fast and predictable; risks dropping context the user referenced earlier.
  • Summarization: Periodically compress older turns into a summary. Preserves semantic context at the cost of a summarization call.
  • Token budget: Enforce a hard ceiling on the history portion and truncate oldest-first. More precise than turn counting because turn length varies.
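The token-budget strategy can be sketched in a few lines. This is a minimal illustration, assuming messages are plain dicts with a `content` field; `rough_token_count` is a crude stand-in (about four characters per token) that you would replace with your provider's tokenizer:

```python
from typing import Callable

Message = dict[str, str]  # {"role": ..., "content": ...}


def rough_token_count(text: str) -> int:
    # Crude stand-in: roughly 4 characters per token.
    return max(1, len(text) // 4)


def truncate_history(
    system_prompt: Message,
    history: list[Message],
    budget_tokens: int,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> list[Message]:
    """Keep the newest messages whose combined size fits the budget."""
    kept: list[Message] = []
    used = 0
    for message in reversed(history):      # walk newest -> oldest
        cost = count_tokens(message["content"])
        if used + cost > budget_tokens:
            break                          # everything older is dropped
        kept.append(message)
        used += cost
    kept.reverse()                         # restore chronological order
    return [system_prompt, *kept]


history = [{"role": "user", "content": "x" * 40} for _ in range(10)]  # 10 tokens each
system = {"role": "system", "content": "be brief"}
trimmed = truncate_history(system, history, budget_tokens=35)
print(len(trimmed))  # 4: system prompt + 3 newest messages
```

The budget here covers only the history portion, so the system prompt is always retained regardless of its size.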

⚠️ Common Mistake: Assuming the LLM provider's context window is large enough that truncation isn't needed. Even with a large context window, inference latency scales with input length — you pay the latency cost long before hitting the token limit.

The diagnostic is straightforward: log the input token count for every LLM call. If it grows monotonically, you have no truncation strategy.

Mistake 4: Processing Silence Without a VAD Gate

Voice Activity Detection (VAD) distinguishes speech from silence and background noise. Omitting a VAD gate means every downstream stage — STT, LLM, TTS — receives and processes frames during silence.

The compute waste is compounded by a more serious problem: spurious responses. If your STT model occasionally misinterprets a fan hum or keyboard click as a low-confidence word, and that word crosses your confidence threshold, it becomes input to the LLM, which produces a response, which plays back to a user who asked nothing.

Without VAD gate:
Microphone → [every frame] → STT → LLM → TTS → Speaker
                ↑
          silence, noise, keyboard clicks — all processed

With VAD gate:
Microphone → VAD → [speech frames only] → STT → LLM → TTS → Speaker
               ↓
          [silence dropped — no wasted compute, no spurious triggers]

VAD implementations range from simple energy-threshold detectors (fast, low overhead, sensitive to environment) to neural classifiers (more robust, modest compute cost). For most production deployments, a neural VAD is worth the overhead — false-positive rates from energy thresholds in real environments (offices, homes, mobile) are high enough to cause visible quality problems.

The VAD gate also enables cleaner end-of-turn detection: defining end-of-turn as "speech activity followed by N milliseconds of silence" is more reliable than fixed timeouts and connects directly to the cancellation propagation covered in the previous section.
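The "speech followed by N milliseconds of silence" rule can be sketched with the simplest kind of detector. This is a minimal illustration using an energy threshold on 16-bit PCM frames; the class name, the 500 RMS threshold, and the 600 ms silence window are assumptions for the example, and a neural VAD would replace `rms_int16` in production:

```python
import math
import struct


def rms_int16(frame: bytes) -> float:
    """Root-mean-square amplitude of a little-endian 16-bit PCM frame."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


class EndOfTurnDetector:
    """Energy-threshold gate plus 'speech then N ms of silence' end-of-turn."""

    def __init__(self, frame_ms: int = 20, silence_ms: int = 600,
                 rms_threshold: float = 500.0) -> None:
        self.frames_needed = silence_ms // frame_ms
        self.rms_threshold = rms_threshold
        self.heard_speech = False
        self.silent_frames = 0

    def push(self, frame: bytes) -> tuple[bool, bool]:
        """Return (forward_to_stt, end_of_turn) for one frame."""
        if rms_int16(frame) >= self.rms_threshold:
            self.heard_speech = True
            self.silent_frames = 0
            return True, False
        if not self.heard_speech:
            return False, False          # leading silence: drop, no trigger
        self.silent_frames += 1
        if self.silent_frames >= self.frames_needed:
            self.heard_speech = False    # reset for the next turn
            self.silent_frames = 0
            return False, True
        return False, False


loud = struct.pack("<320h", *([2000] * 320))   # 20 ms at 16 kHz, constant amplitude
quiet = bytes(640)                             # 20 ms of digital silence
detector = EndOfTurnDetector(frame_ms=20, silence_ms=60)
results = [detector.push(f) for f in [quiet, loud, quiet, quiet, quiet]]
print(results)
```

Silence before any speech never triggers end-of-turn, which is what prevents an idle microphone from starting a response.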

Mistake 5: Over-Buffering and Hiding Latency in Development

Over-buffering is the most deceptive mistake on this list because it makes a broken system appear to work. A developer who adds "just a bit more" buffering to smooth over an occasional glitch has not fixed the underlying problem — they've masked it under conditions where it doesn't matter and guaranteed it will resurface under conditions where it does.

Development (local network, no jitter):
LLM delay: 80ms   [buffer: 400ms]   → gap absorbed, no glitch heard

Production (mobile, 80ms jitter):
LLM delay: 340ms (+80ms jitter)  [buffer: 400ms]   → worst case 420ms, gap NOT absorbed, glitch audible
                                       ↑
                              same buffer, same root cause, now exposed

The correct approach is to test with artificially degraded network conditions from the first integration test:

  • Linux: tc qdisc with the netem queueing discipline can add latency, packet loss, and jitter at the OS level
  • macOS: Network Link Conditioner (in developer tools) provides a GUI equivalent
  • CI environments: Tools like toxiproxy can inject network conditions programmatically

Run your pipeline under three profiles: clean (baseline), moderate jitter (~50 ms added latency, ~1% packet loss), and degraded (~150 ms added, ~5% packet loss). Buffer sizes should be tuned to your latency budget under the moderate profile, not the clean one.

🤔 Did you know? The relationship between buffer size and smoothing benefit is not linear. A buffer that is twice as large does not produce twice the smoothing, but it does produce twice the startup delay and twice the lag before a user's interruption takes effect. Oversized buffers actively damage interactivity even when they successfully prevent glitches.

🎯 Key Principle: Buffer sizes should be determined by the latency budget, not by how large they need to be to paper over a slow stage. A slow stage should be fixed; buffering should absorb expected jitter around a healthy stage.

Diagnostic Workflow for Buffer-Hidden Latency

Progressive buffer reduction:

Buffer: 400ms → no glitches          [stage healthy OR buffer masking]
Buffer: 200ms → no glitches          [stage probably healthy]
Buffer: 100ms → occasional glitches  [true latency floor ~100-150ms]
Buffer:  50ms → frequent glitches    [confirmed: stage needs improvement]
                ↑
          Now fix the stage. Don't restore the 400ms buffer.
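The progressive reduction above can be automated as a small search. This is a rough sketch, assuming a hypothetical `count_glitches(buffer_ms)` hook that runs your pipeline under a fixed network profile and returns the number of audible gaps it observed; the halving schedule mirrors the diagram:

```python
from typing import Callable, Optional


def find_glitch_floor(
    count_glitches: Callable[[int], int],
    start_ms: int = 400,
    min_ms: int = 50,
) -> Optional[int]:
    """Halve the buffer until glitches appear; return that buffer size."""
    buffer_ms = start_ms
    while buffer_ms >= min_ms:
        if count_glitches(buffer_ms) > 0:
            return buffer_ms
        buffer_ms //= 2
    return None  # no glitches even at the smallest buffer tested


# Fake measurement hook for illustration: glitches appear below 200 ms.
def fake_count_glitches(buffer_ms: int) -> int:
    return 0 if buffer_ms >= 200 else 3


floor_ms = find_glitch_floor(fake_count_glitches)
print(floor_ms)  # 100
```

The returned value is where the stage's real latency starts to show; the follow-up is to fix the stage, not to raise the buffer again.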

How These Mistakes Compound

These five mistakes rarely appear in isolation. A missing VAD gate generates more STT calls; if those calls use synchronous I/O, they are more likely to block the event loop under load, which increases the apparent need for buffering, which masks the underlying problems further. Each mistake makes the others harder to diagnose.

The common thread: all five are hidden under favorable conditions and revealed under realistic ones. Event loop blocking is invisible when the network is fast. Clock drift is invisible over a two-turn conversation. Context growth is invisible in short test sessions. Missing VAD is invisible in a quiet office. Over-buffering is invisible on localhost.

📋 Quick Reference: Mistakes, Symptoms, and Diagnostic Tools

Mistake | Symptom | Diagnostic Tool
Blocking event loop | Intermittent full-pipeline stalls | CPU flame graph; asyncio slow callback logging
Wall-clock scheduling | Growing audio drift over conversation | Compare AudioContext.currentTime vs. scheduled playhead
Unbounded context | Monotonically increasing LLM latency | Log input token count per turn
No VAD gate | Spurious responses; wasted STT compute | Count non-speech frames sent to STT
Over-buffering | Latency absent in dev, present in prod | Test under tc qdisc; reduce buffers progressively

Key Takeaways and Engineering Checklist

The five principles below reduce to a single coherent question you can ask about any pipeline stage: Is this stage's contract defined, its I/O non-blocking, its cancellation wired, its timing referenced to the right clock, and its latency measured? If the answer to any of those is no, you have a known risk that will surface at the worst time.

Principle 1: Define Every Stage's Contract Before Writing Code

Before implementation, every stage should have three properties written down: its chunk size, its buffer bound, and its overflow policy. These are the contract between stages, and omitting them is the fastest way to guarantee latency surprises later.

Chunk size determines end-to-end latency contribution — every extra millisecond of frame size is a millisecond added to your minimum achievable latency. Buffer bound caps how much latency can accumulate before the system takes action. Overflow policy is the decision you make when a buffer fills: drop oldest (appropriate for raw audio capture), drop newest, or apply backpressure. There is no universally correct choice — but there is always a correct choice for your stage, and it must be made before you write the queue instantiation.

from dataclasses import dataclass

@dataclass
class StageContract:
    """Documents the contract for one pipeline stage."""
    stage_name: str
    chunk_size_bytes: int
    buffer_max_items: int
    overflow_policy: str  # "drop_oldest" | "drop_newest" | "backpressure"

    def describe_latency_budget(self, sample_rate: int, channels: int, bit_depth: int) -> str:
        bytes_per_sample = bit_depth // 8
        bytes_per_second = sample_rate * channels * bytes_per_sample
        chunk_ms = (self.chunk_size_bytes / bytes_per_second) * 1000
        max_buffer_ms = chunk_ms * self.buffer_max_items
        return (
            f"{self.stage_name}: chunk={chunk_ms:.1f}ms, "
            f"max_buffer={max_buffer_ms:.0f}ms, "
            f"overflow={self.overflow_policy}"
        )

capture_contract = StageContract(
    stage_name="microphone_capture",
    chunk_size_bytes=640,       # 20ms at 16kHz mono 16-bit
    buffer_max_items=10,        # 200ms max accumulation
    overflow_policy="drop_oldest"
)

tts_playback_contract = StageContract(
    stage_name="tts_playback",
    chunk_size_bytes=3200,      # 100ms at 16kHz mono 16-bit
    buffer_max_items=5,         # 500ms lookahead max
    overflow_policy="backpressure"
)

print(capture_contract.describe_latency_budget(16000, 1, 16))
# microphone_capture: chunk=20.0ms, max_buffer=200ms, overflow=drop_oldest

Treat StageContract instances as living specs committed alongside the code they describe. When a latency regression appears, the first question is: which stage's actual behavior diverged from its contract?

Checklist:

  • Every stage has a written chunk size with the arithmetic showing its latency contribution
  • Every queue or ring buffer has a hard maximum item count
  • The overflow policy for each stage is explicitly chosen and matches the semantics of the data

Principle 2: Async Non-Blocking I/O Is an Architectural Constraint

Every synchronous call inside a hot path is a latency landmine. A synchronous HTTP request, a blocking file read, or time.sleep() suspends the entire thread — and in a single-threaded async runtime, that means suspending every other coroutine, including the one delivering audio chunks to your playback buffer.

import asyncio
import httpx

# ❌ Wrong: synchronous HTTP inside an async hot path
def fetch_tts_chunk_blocking(text: str) -> bytes:
    import urllib.request
    with urllib.request.urlopen(f"https://tts-api.example/synthesize?text={text}") as r:
        return r.read()  # suspends the whole thread

# ✅ Correct: async HTTP that yields control back to the event loop
async def fetch_tts_chunk_async(client: httpx.AsyncClient, text: str) -> bytes:
    response = await client.post(
        "https://tts-api.example/synthesize",
        json={"text": text},
        timeout=5.0
    )
    response.raise_for_status()
    return response.content

# ✅ Correct: streaming variant for incremental TTS chunks
async def stream_tts_chunks(client: httpx.AsyncClient, text: str):
    async with client.stream(
        "POST",
        "https://tts-api.example/synthesize",
        json={"text": text},
        timeout=5.0
    ) as response:
        response.raise_for_status()
        async for chunk in response.aiter_bytes(chunk_size=3200):
            yield chunk  # emit each chunk as it arrives

The streaming variant cuts TTS latency to roughly time-to-first-chunk rather than waiting for the full synthesis response.

⚠️ Common Mistake: Wrapping a blocking call in asyncio.to_thread() and considering it solved. Running blocking code in a thread pool prevents event loop starvation but does not eliminate the wall-clock delay — it still contributes to pipeline lag.
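Both halves of that warning can be observed in a small experiment. This is a minimal sketch in which `blocking_synthesis` is a stand-in for a legacy blocking call (here just `time.sleep`), and a heartbeat coroutine counts how often the event loop gets to run while the call is in progress:

```python
import asyncio
import time


def blocking_synthesis() -> bytes:
    time.sleep(0.2)  # stand-in for a legacy blocking library call
    return b"audio"


async def main() -> tuple[float, int]:
    ticks = 0
    done = asyncio.Event()

    async def heartbeat() -> None:
        nonlocal ticks
        while not done.is_set():
            ticks += 1               # only advances if the loop is free
            await asyncio.sleep(0.01)

    beat = asyncio.create_task(heartbeat())
    start = time.monotonic()
    await asyncio.to_thread(blocking_synthesis)
    elapsed = time.monotonic() - start
    done.set()
    await beat
    return elapsed, ticks


elapsed, ticks = asyncio.run(main())
print(f"elapsed={elapsed:.2f}s, heartbeat ticks={ticks}")
```

The heartbeat keeps ticking, so the loop is not starved, yet the caller still waits the full duration of the blocking call. That wait is the cost the latency budget has to account for.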

Checklist:

  • All network calls on the hot path use an async client with explicit timeout configuration
  • No time.sleep(), blocking lock acquisition, or synchronous file reads in any function called from an audio callback or chunk handler
  • Any to_thread() wrappers are documented with the reason and the latency budget they consume

Principle 3: Cancellation Is a First-Class Feature

User interruption is not an edge case — it is one of the most common interactions in any real conversation. Designing cancellation as an afterthought means the happy path will be clean and the interruption path will be a source of subtle bugs.

The design discipline: draw the cancellation signal path at the same time you draw the happy path, not after. Concretely:

  • A TurnContext with a cancel event is created at the start of each agent turn and passed into every stage
  • Each stage checks the cancel event before processing a new chunk
  • The TTS playback stage clears its buffer on cancellation rather than finishing what's queued
  • LLM and TTS streaming connections are explicitly closed on cancellation

import asyncio
from dataclasses import dataclass, field

@dataclass
class TurnContext:
    session_id: str
    turn_id: str
    cancel_event: asyncio.Event = field(default_factory=asyncio.Event)

    def cancel(self):
        self.cancel_event.set()

    @property
    def is_cancelled(self) -> bool:
        return self.cancel_event.is_set()


async def tts_playback_stage(audio_queue: asyncio.Queue, ctx: TurnContext):
    while not ctx.is_cancelled:
        try:
            chunk = await asyncio.wait_for(audio_queue.get(), timeout=0.05)
        except asyncio.TimeoutError:
            continue  # loop condition re-checks ctx.is_cancelled

        if not ctx.is_cancelled:
            await schedule_to_audio_context(chunk)
        audio_queue.task_done()

    # Cancelled: discard everything still queued instead of playing it out
    while not audio_queue.empty():
        audio_queue.get_nowait()
        audio_queue.task_done()


async def handle_user_interruption(current_ctx: TurnContext) -> TurnContext:
    current_ctx.cancel()
    await asyncio.sleep(0)  # yield once; stages see the flag at their next check (within the 50 ms poll)
    return TurnContext(
        session_id=current_ctx.session_id,
        turn_id=generate_turn_id()
    )

💡 Mental Model: Think of a conversation turn as a structured concurrency scope. When the scope is cancelled, everything inside it terminates. The TurnContext is that scope's handle.

Checklist:

  • A cancellation token is created per turn and passed into every stage function signature
  • TTS playback clears its buffer on cancellation; it does not play out queued audio
  • LLM and TTS streaming connections are explicitly closed on cancellation
  • Cancellation is exercised in integration tests with a simulated mid-response interruption

Principle 4: The Audio Clock Is the Source of Truth for Playback Timing

The event loop is excellent for coordination: routing chunks, managing async I/O, handling cancellation. It is not a clock. Using asyncio.sleep(0.02) to schedule 20 ms audio frames will drift. Using Date.now() to decide when to enqueue the next audio buffer will drift.

The audio context maintains a sample-accurate clock that advances independently of the event loop's scheduling jitter. Hand chunks to that clock in advance of when they're needed (a lookahead of one to two chunk durations is typical), let the clock consume them at the hardware-determined rate, and use the clock's currentTime to decide when to schedule the next chunk — never a wall-clock timestamp.

Checklist:

  • All audio scheduling uses the audio context clock (AudioContext.currentTime or equivalent), not system time
  • Playback scheduling includes a fixed lookahead buffer (one to two chunk durations) to absorb event loop jitter
  • Drift monitoring: log the difference between expected and actual audio context position periodically
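The drift-monitoring item can be sketched as a small tracker. This is a minimal illustration, assuming your audio backend exposes its own clock reading (AudioContext.currentTime in the browser, or the stream time of a native audio API); `PlayheadTracker` and its field names are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class PlayheadTracker:
    """Tracks where playback should be, derived from frames scheduled."""
    sample_rate: int
    start_time: float          # audio-clock time at which playback began
    frames_scheduled: int = 0

    def advance(self, frame_count: int) -> None:
        self.frames_scheduled += frame_count

    @property
    def playhead(self) -> float:
        # Integer frame count divided once: no per-chunk rounding accumulates.
        return self.start_time + self.frames_scheduled / self.sample_rate

    def lead_seconds(self, audio_clock_now: float) -> float:
        """How far the schedule runs ahead of the audio clock (negative = underrun)."""
        return self.playhead - audio_clock_now


tracker = PlayheadTracker(sample_rate=16000, start_time=1.0)
for _ in range(50):
    tracker.advance(320)       # fifty 20 ms chunks at 16 kHz
print(tracker.playhead)        # 2.0
print(round(tracker.lead_seconds(1.95), 3))  # 0.05
```

Logging `lead_seconds` periodically shows whether the lookahead is holding steady, shrinking toward an underrun, or growing into excess buffered latency.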

Principle 5: Measure Latency Per Stage from the First Integration Test

Latency regressions compound. A 10 ms regression in STT, 15 ms in the LLM token handler, and 8 ms in the TTS scheduler are each individually below the perceptual threshold — but combined they produce a 33 ms regression that is noticeable and whose root cause is nearly impossible to isolate post-hoc.

Instrument per-stage latency from the first integration test. This is inexpensive — a timestamp at stage ingress and egress is enough — and makes regressions immediately localizable.

import time
from contextlib import asynccontextmanager
from typing import Callable

@asynccontextmanager
async def measure_stage(stage_name: str, emit: Callable[[str, float], None]):
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        emit(stage_name, elapsed_ms)


def log_latency(stage: str, ms: float):
    print(f"[latency] {stage}: {ms:.2f}ms")  # replace with your metrics sink


async def stt_stage(audio_chunk: bytes, ctx) -> str:
    async with measure_stage("stt", log_latency):
        transcript = await call_stt_api(audio_chunk)
    return transcript


async def llm_stage(transcript: str, ctx) -> str:
    async with measure_stage("llm_time_to_first_token", log_latency):
        # Measure only to first token — this is the latency users feel
        first_token = await get_first_llm_token(transcript)
    return first_token

For streaming stages like LLM token generation, the metric that matters most to users is time-to-first-token, not total generation time. Instrumenting total generation time will show a healthy number even when the agent feels slow because the first token is delayed.
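Measuring both numbers on a real stream needs a wrapper around the iterator, not around a single call. This is a minimal sketch in which `fake_token_stream` is a hypothetical stand-in for a streaming LLM client, and `timed_stream` records time-to-first-token separately from total duration:

```python
import asyncio
import time
from typing import AsyncIterator


async def fake_token_stream() -> AsyncIterator[str]:
    # Hypothetical stand-in for a streaming LLM client.
    await asyncio.sleep(0.05)        # delay before the first token
    for tok in ["Hel", "lo", "!"]:
        yield tok
        await asyncio.sleep(0.01)


async def timed_stream(stream: AsyncIterator[str]) -> dict:
    start = time.monotonic()
    first_token_ms = None
    tokens: list[str] = []
    async for tok in stream:
        if first_token_ms is None:
            # Recorded once, when the first token arrives.
            first_token_ms = (time.monotonic() - start) * 1000
        tokens.append(tok)
    total_ms = (time.monotonic() - start) * 1000
    return {"ttft_ms": first_token_ms, "total_ms": total_ms, "text": "".join(tokens)}


result = asyncio.run(timed_stream(fake_token_stream()))
print(result)
```

Emitting both values per turn makes it visible when total time looks healthy while the first token is the part that got slower.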

⚠️ Common Mistake: Adding latency instrumentation only to stages you suspect are slow. Instrument every stage uniformly. The regression will be in the stage you didn't suspect.

Checklist:

  • Every stage emits a latency metric in every integration test run, not just in production
  • For streaming stages, time-to-first-output is measured separately from total duration
  • A latency budget document exists with per-stage target ranges
  • Latency metrics are reviewed on every PR that touches pipeline code

Summary: The Engineering Checklist

📋 Quick Reference: Pipeline Engineering Checklist

# | Principle | Non-Negotiable | Requires Judgment
1 | Stage contracts | Buffer bound and overflow policy must be explicit | Exact chunk size depends on target latency
2 | Async I/O | No blocking calls on any hot path | to_thread() acceptable for legacy libs, with documented cost
3 | Cancellation | Cancel token per turn, buffer cleared on cancel | Cleanup ordering between stages
4 | Audio clock authority | All playback scheduling uses audio context clock | Lookahead buffer size (1–2 chunks is typical)
5 | Per-stage latency | Instrumented from first integration test | Metric granularity (p50 vs. p99 depending on stakes)

These principles have a dependency order in how they fail. An undefined stage contract (Principle 1) makes it impossible to know whether a latency measurement (Principle 5) indicates a regression or expected behavior. A missing cancellation path (Principle 3) can corrupt the latency signal by including cleanup work from a prior turn. Build the contracts and the cancellation path first; instrument latency second; async I/O and audio clock discipline are non-negotiable throughout.

Before this lesson, the challenge of building a real-time voice agent could be summarized as "make the AI respond fast." After it, you understand that "fast" is a structural property of the system, not a tuning knob — the result of defined contracts at every seam, non-blocking I/O throughout, cancellation as a first-class citizen, the audio clock as the scheduling authority, and latency measured where it actually accumulates. The pipeline graph gives you a vocabulary for reasoning about all of these together; the checklist gives you a way to verify them before problems compound in production.

Next Steps

Three practical directions from here:

  1. Apply the contract documentation pattern to an existing pipeline. Write out the StageContract for each stage. If you can't fill in the buffer bound or overflow policy, you've found a latency risk worth addressing before the next production incident.

  2. Add per-stage latency instrumentation to your integration test suite. Even a simple log line per stage captured in CI gives you a regression baseline. The first time a PR causes a stage to double its latency, you'll be glad it's visible before merge.

  3. Write a cancellation integration test. Simulate a user interruption at the midpoint of a synthesized TTS response and verify that no buffered audio plays after the cancellation signal fires. This test will fail on most first attempts — and fixing it will reveal the gaps in your cancellation signal path more clearly than any code review.
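The server-side half of the cancellation test in step 3 can be sketched as follows. This is a minimal illustration that re-declares a small `TurnContext` and TTS stage so the example is self-contained, with `fake_synthesize` as a hypothetical stand-in for the real TTS call. It checks only the application-level queue; verifying that the browser stopped its scheduled buffers needs a separate browser-side test.

```python
import asyncio


class TurnContext:
    def __init__(self) -> None:
        self._cancel_event = asyncio.Event()

    def cancel(self) -> None:
        self._cancel_event.set()

    @property
    def cancelled(self) -> bool:
        return self._cancel_event.is_set()


async def fake_synthesize(token: str) -> bytes:
    await asyncio.sleep(0.02)  # stand-in for real synthesis latency
    return token.encode()


async def tts_stage(tokens: asyncio.Queue, audio: asyncio.Queue, ctx: TurnContext) -> None:
    while not ctx.cancelled:
        try:
            token = await asyncio.wait_for(tokens.get(), timeout=0.05)
        except asyncio.TimeoutError:
            continue
        chunk = await fake_synthesize(token)
        if ctx.cancelled:
            return  # do not enqueue audio produced after cancellation
        await audio.put(chunk)


async def run_cancellation_test() -> int:
    tokens: asyncio.Queue = asyncio.Queue()
    audio: asyncio.Queue = asyncio.Queue(maxsize=100)
    for i in range(20):
        tokens.put_nowait(f"t{i}")
    ctx = TurnContext()
    stage = asyncio.create_task(tts_stage(tokens, audio, ctx))

    await asyncio.sleep(0.1)       # interrupt partway through the response
    ctx.cancel()
    while not audio.empty():
        audio.get_nowait()         # drain what was queued before cancel
    await asyncio.wait_for(stage, timeout=1.0)  # stage must exit promptly
    return audio.qsize()           # anything left here would have played


leftover = asyncio.run(run_cancellation_test())
print(leftover)  # 0
```

If the stage enqueued its in-flight chunk after the drain, `leftover` would be non-zero, which is the failure this test is meant to catch.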