Part II: The Pipeline
Wire together ASR, LLM, and TTS into a working pipeline and learn the key latency trick that makes responses feel instant.
Why Pipelines and Where This One Fits
Think about the last time a voice assistant felt genuinely conversational — not just technically correct, but fast in the way a real person is fast. The reply came before you'd fully registered that you'd finished talking. Now think about the last time one felt sluggish: that pause after you speak, long enough to make you wonder if it heard you, before a voice finally answers. Both experiences are the product of the same basic architecture — the difference is almost entirely in how the stages of that architecture are connected. Understanding those connections is what this lesson is about.
A voice agent pipeline is the chain of transformations that converts raw microphone audio into spoken output from an AI. At its core, the chain has exactly three stages: Automatic Speech Recognition (ASR), which converts audio into text; language model inference (LLM), which generates a text response; and Text-to-Speech synthesis (TTS), which converts that response back into audio. Every architectural decision in this lesson flows from understanding what happens between them — not inside them.
The Three Stages, Briefly
ASR takes a stream of audio samples and produces text. From the pipeline's perspective, ASR is a machine that accepts audio chunks and emits words incrementally as audio arrives — it doesn't wait for you to finish speaking before it starts working.
LLM inference takes a text prompt and generates a text response token by token, not as a block of text all at once. This incremental output is the property the pipeline depends on.
TTS takes text and synthesizes audio waveforms. Modern TTS systems can begin producing audio before the entire response text is available — but the quality of TTS output depends on having enough phrase-level context to handle prosody correctly. This tension between wanting to start early and needing enough context is the central design problem for the LLM-to-TTS handoff.
┌─────────────────────────────────────────────────────────────┐
│ Voice Agent Pipeline │
│ │
│ 🎤 Audio In │
│ │ │
│ ▼ │
│ ┌─────────┐ partial transcripts ┌─────────────────┐ │
│ │ ASR │ ─────────────────────────▶│ LLM │ │
│ └─────────┘ └────────┬────────┘ │
│ │ │
│ token stream │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ TTS │ │
│ └────────┬────────┘ │
│ │ │
│ audio chunks │
│ ▼ │
│ 🔊 Audio Out │
└─────────────────────────────────────────────────────────────┘
Why Latency Compounds — and What to Do About It
Here is the uncomfortable arithmetic of a naïve pipeline. Suppose each stage takes a fixed amount of time to complete before handing off to the next:
- ASR finishes transcribing: 800 ms after the user stops speaking
- LLM generates the full response: 1,200 ms after receiving the transcript
- TTS synthesizes the full audio: 600 ms after receiving the response text
Total time before the user hears anything: 2,600 ms. That's nearly three seconds of silence. In a voice interface, it reads as a broken system.
Now notice where most of that delay lives: in the waiting. The LLM doesn't start until ASR finishes. TTS doesn't start until the LLM finishes. Every stage's full duration adds directly to the time the user stares at silence.
🎯 Key Principle: Latency in a naïve pipeline is the sum of stage durations. The goal of pipeline design is to convert that sum into something closer to the maximum of stage durations — by letting stages run concurrently on partial outputs.
The metric that makes this concrete is time-to-first-audio (TTFA): the duration between the moment the user stops speaking and the moment the agent begins playing audio. Once audio has started playing, human perception of delay is much more forgiving — we're hearing something, the agent is clearly working. The first sound matters far more than the last one.
This distinction reshapes every architectural decision in the pipeline:
- We want ASR to forward text to the LLM as early as possible, not after full transcription.
- We want the LLM to forward tokens to TTS as they arrive, not after full generation.
- We want TTS to begin synthesis as soon as it has a speakable chunk, not after it has the full response.
Each of those goals requires the stages to communicate via streams of partial outputs rather than complete, finished results.
Time ──────────────────────────────────────────────────────▶
Naïve pipeline:
ASR [████████████████████]
LLM [████████████████████████████]
TTS [████████████]
First audio: ──────────────────────────────────────────────────────────▶
Streaming pipeline:
ASR [████████████████████]
LLM [████████████████████████████]
TTS [████] [████] [████]
First audio: ──────────▶
In the optimized pipeline, the total time before first audio is not 2,600 ms. It is closer to: ASR finalization delay after end of speech + LLM first-token latency + time for the LLM to emit the first clause + TTS synthesis time for that clause. With fast components this can come in under 500 ms from end of user speech.
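To make the comparison concrete, here is a small sketch of the arithmetic. The naive figures are the ones from the example above; the streaming figures (150, 200, and 120 ms) are illustrative assumptions, not measurements of any real system.

```python
def naive_ttfa_ms(asr_ms: int, llm_ms: int, tts_ms: int) -> int:
    """Each stage finishes completely before the next starts, so delays add up."""
    return asr_ms + llm_ms + tts_ms


def streaming_ttfa_ms(asr_finalize_ms: int, llm_first_clause_ms: int, tts_first_chunk_ms: int) -> int:
    """Only the work needed to produce the FIRST audible chunk is on the critical path."""
    return asr_finalize_ms + llm_first_clause_ms + tts_first_chunk_ms


naive = naive_ttfa_ms(asr_ms=800, llm_ms=1200, tts_ms=600)

# Assumed values for illustration only.
streaming = streaming_ttfa_ms(
    asr_finalize_ms=150,      # final transcript shortly after end of speech
    llm_first_clause_ms=200,  # first token plus the rest of the first clause
    tts_first_chunk_ms=120,   # synthesis of that first clause only
)

print(f"naive: {naive} ms, streaming: {streaming} ms")
```

Both functions are plain sums; what changes is which durations sit on the critical path to first audio.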
The Three Handoffs
Naming the three stages is the easy part. The harder part is understanding the three handoffs between them, because that is where most pipeline complexity lives.
Handoff 1: ASR → LLM. ASR emits partial transcripts continuously. The pipeline cannot forward every partial transcript to the LLM, because that would trigger redundant inference on incomplete inputs. The handoff requires a mechanism to detect that the user has finished speaking: end-of-utterance (EOU) detection, typically driven by a voice activity detector (VAD).
Handoff 2: LLM → TTS. The LLM emits individual tokens. TTS cannot synthesize individual tokens — prosody depends on phrase-level context. The handoff requires a buffering layer that accumulates tokens until a natural synthesis boundary — typically a sentence or clause ending — then forwards that chunk to TTS.
Handoff 3: Interrupt Signal. When the user speaks again while the agent is talking, the pipeline needs a way to cancel the in-flight LLM generation and TTS playback. Without a cancellation signal path, the agent continues speaking its previous response over the user's new input — a catastrophic conversational failure.
┌──────────────────────────────────────────────────────────┐
│ Three Critical Handoffs │
│ │
│ ASR ─[VAD-gated transcript]──▶ LLM │
│ │ │
│ [clause-buffered token stream] │
│ ▼ │
│ TTS │
│ │
│ New user speech ─[interrupt signal]──▶ Cancel LLM+TTS │
└──────────────────────────────────────────────────────────┘
🧠 Mnemonic: Think of the three handoffs as Gate → Buffer → Cancel. Gate the transcript (VAD), buffer the tokens (clause chunking), cancel on interrupt. These three mechanisms are what make the streaming pipeline safe and responsive, not just fast.
This lesson — across all its sections — is about how the stages connect. The internals of ASR, LLM streaming, and TTS are covered in their respective child lessons. What this lesson owns is the architecture between those stages: the queues, the handoff contracts, the buffering strategies, and the cancellation paths. You can have excellent ASR, an excellent LLM, and excellent TTS and still build a sluggish, broken voice agent if the connections between them are wrong.
📋 Quick Reference: Pipeline Overview
| Stage | Role | Output Unit | Latency Driver |
|---|---|---|---|
| ASR | Audio → Text | Partial transcript | Audio processing speed |
| LLM | Text → Text | Individual tokens | First-token latency |
| TTS | Text → Audio | Audio chunks | Synthesis start threshold |
| Handoff 1 | ASR → LLM | VAD-gated utterance | End-of-utterance accuracy |
| Handoff 2 | LLM → TTS | Clause-buffered text | Buffer accumulation time |
| Handoff 3 | Interrupt → Cancel | Cancellation signal | Detection latency |
Streaming as the Backbone: How Data Flows Between Stages
With the three stages and their handoffs mapped out, the next question is what actually flows across those handoffs — and what constraints govern each transfer. Every stage is simultaneously a producer and a consumer: ASR consumes audio and produces transcript fragments; the LLM consumes a transcript and produces tokens; TTS consumes text and produces audio frames. The data model connecting them is streaming, and understanding it in detail — what each stage actually emits, what "meaningful to send" means at each boundary, and where queuing can silently ruin responsiveness — is what this section covers.
The Shape of Each Stage's Output Stream
ASR does not wait until the user stops talking to produce a transcript. Modern ASR systems emit a continuous stream of partial transcripts: early, tentative guesses progressively revised as more audio arrives. A user saying "What's the weather in San Francisco?" might produce a stream like this:
t=0.3s → "what"
t=0.6s → "what's"
t=0.9s → "what's the"
t=1.1s → "what's the weather"
t=1.4s → "what's the weather in"
t=1.7s → "what's the weather in San"
t=2.0s → "what's the weather in San Francisco"
t=2.1s → [end-of-utterance signal]
Each line is a full re-hypothesis of the entire utterance so far, not just a new word appended. Early guesses can be wrong and get corrected.
The LLM emits a stream of tokens: sub-word units that arrive one or a few at a time as the model generates them auto-regressively. A response like "The weather in San Francisco is currently 58 degrees and foggy." arrives as:
"The" → " weather" → " in" → " San" → " Francisco" → " is" → " currently" → ...
The token stream has no natural pauses — it is a flat sequence with no inherent segmentation into speakable units.
TTS needs a chunk of text as input and synthesizes audio for that chunk. The pipeline must answer: how big should that chunk be? Too small and prosody breaks; too large and TTFA suffers.
ASR Output Stream LLM Token Stream TTS Input Chunks
────────────────── ───────────────── ────────────────
"what" "The" "The weather in
"what's" " weather" San Francisco
"what's the" " in" is currently
"what's the weather" " San" 58 degrees
... (partial) " Francisco" and foggy."
[END_OF_UTTERANCE] " is"
" currently" Each chunk triggers
... one TTS synthesis call
The ASR-to-LLM Handoff: Stability Before Forwarding
ASR partial transcripts are unstable. Forwarding a partial transcript to the LLM the moment it arrives means sending "what's the" to a language model and prompting it to generate a response to an incomplete, possibly inaccurate fragment — wasteful at best, wrong at worst.
The pipeline therefore needs to distinguish between final transcripts (stable, committed) and partial transcripts (tentative, subject to revision). Most ASR systems communicate this through an end-of-utterance (EOU) signal — a marker that indicates the speaker has finished their turn, typically derived from a voice activity detector (VAD) running alongside ASR.
The practical rule: do not forward any transcript to the LLM until an EOU signal has been received. Until then, partial transcripts are buffered or discarded from the LLM's perspective.
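The rule can be sketched as a small gate. This is a minimal illustration, assuming a hypothetical AsrEvent type whose kind is either "partial" or "eou"; real ASR SDKs expose their own event shapes. Note that each partial replaces the previous one, because every partial is a full re-hypothesis of the utterance.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class AsrEvent:
    # Hypothetical event shape, for illustration only.
    kind: str        # "partial" or "eou"
    text: str = ""


def gate_on_eou(events: Iterable[AsrEvent]) -> Iterator[str]:
    """Yield one transcript per turn, and only after the EOU signal arrives."""
    latest = ""
    for event in events:
        if event.kind == "partial":
            latest = event.text  # replace, don't append: partials are re-hypotheses
        elif event.kind == "eou" and latest:
            yield latest
            latest = ""          # reset for the next turn


events = [
    AsrEvent("partial", "what"),
    AsrEvent("partial", "what's the"),
    AsrEvent("partial", "what's the weather in San Francisco"),
    AsrEvent("eou"),
]
forwarded = list(gate_on_eou(events))
print(forwarded)
```

Nothing reaches the LLM for the first three events; the single forwarded transcript is the last hypothesis seen before the EOU signal.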
⚠️ Common Mistake: Some implementations treat a gap in the transcript stream (e.g., no new words for 500 ms) as a proxy for end-of-utterance. This is fragile, because a speaker might pause mid-sentence, and it adds a fixed artificial delay to every turn. A proper EOU signal from a VAD is more reliable.
One nuance worth acknowledging: some advanced pipeline architectures forward high-confidence partial transcripts to the LLM to reduce latency further, using speculative or interruptible LLM inference. That is a legitimate technique but introduces significant complexity around cancellation and correctness. For most implementations, the clean rule — wait for EOU — is the right starting point.
The LLM-to-TTS Handoff: Buffering Tokens into Speakable Chunks
This is where the most subtle and consequential buffering logic lives. The LLM streams tokens one at a time, but TTS cannot meaningfully synthesize a single token — prosody (the rhythm, stress, and intonation of speech) depends on phrase-level context. A TTS model given just "The weather" has no idea whether the sentence continues with "is nice" or "in San Francisco has been terrible." The resulting audio sounds clipped and unnatural.
The component that solves this is an inter-stage buffer (sometimes called a chunking function). It accumulates incoming tokens until it has a unit of text that TTS can render naturally, then emits that unit as a synthesis request.
The buffer detects speakable boundaries by watching for:
- Sentence-ending punctuation: . ? !
- Clause-level punctuation: , ; : (with caveats)
- Token count threshold: a fallback that flushes the buffer after a maximum number of tokens, preventing long parenthetical clauses from blocking synthesis indefinitely
Incoming tokens (LLM output):
"The" " weather" " in" " San" " Francisco" " is" " currently"
" 58" " degrees" " and" " foggy" "." " Tomorrow"
" will" " be" " sunny" "."
Buffer state over time:
After "foggy": buffer = "The weather in San Francisco is currently 58 degrees and foggy"
After ".": → EMIT "The weather in San Francisco is currently 58 degrees and foggy."
buffer = "" ← reset
After second "." → EMIT "Tomorrow will be sunny."
TTS receives two synthesis requests, in order:
1. "The weather in San Francisco is currently 58 degrees and foggy."
2. "Tomorrow will be sunny."
Commas are tempting split points because the buffer would drain faster — producing lower TTFA. The risk is that short comma-separated clauses produce awkward TTS prosody. "The weather," synthesized alone, sounds like an incomplete thought. A practical heuristic: split on commas only when the accumulated buffer already exceeds a minimum token count (roughly 10–15 tokens), ensuring TTS always receives a minimally self-contained phrase.
The tradeoff is explicit: smaller chunks yield lower latency but risk prosody artifacts; larger chunks yield better audio but increase TTFA. A sentence boundary is a reasonable default that balances both concerns for most conversational content.
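The boundary rules above, including the comma heuristic, can be sketched as a synchronous generator. This is one possible shape rather than a definitive implementation; the names chunk_tokens, min_clause_tokens, and max_tokens are assumptions chosen for this example.

```python
from typing import Iterable, Iterator, List

SENTENCE_END = (".", "?", "!")
CLAUSE_END = (",", ";", ":")


def chunk_tokens(
    tokens: Iterable[str],
    min_clause_tokens: int = 10,  # only split on a comma once this many tokens are buffered
    max_tokens: int = 30,         # fallback flush so a long clause cannot block synthesis
) -> Iterator[str]:
    buffer: List[str] = []
    for token in tokens:
        buffer.append(token)
        text = "".join(buffer).rstrip()
        at_sentence_end = text.endswith(SENTENCE_END)
        at_clause_end = text.endswith(CLAUSE_END) and len(buffer) >= min_clause_tokens
        if at_sentence_end or at_clause_end or len(buffer) >= max_tokens:
            yield text.strip()
            buffer.clear()
    # Flush whatever is left when the token stream ends.
    tail = "".join(buffer).strip()
    if tail:
        yield tail


tokens = ["The", " weather", " in", " San", " Francisco", " is", " currently",
          " 58", " degrees", " and", " foggy", ".", " Tomorrow",
          " will", " be", " sunny", "."]
chunks = list(chunk_tokens(tokens))
print(chunks)
```

A short clause such as "The weather," stays in the buffer because it has not reached min_clause_tokens, so TTS never receives it alone.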
Back-Pressure: When TTS Can't Keep Up
Streaming between stages creates a new class of problem: stages don't run at the same speed. The LLM may produce tokens faster than TTS can synthesize and play audio. When this happens, tokens pile up in the inter-stage buffer, and the result is a progressively increasing audio lag — the user hears audio that corresponds to text the LLM generated seconds ago.
LLM generates tokens at ~60 tokens/sec
TTS synthesizes and plays at ~40 tokens/sec equivalent
t=0s: Buffer: []
t=1s: Buffer: [20 tokens queued]
t=2s: Buffer: [40 tokens queued]
t=5s: Buffer: [100 tokens queued] ← about 2.5 s of audio still waiting; playback trails generation and the gap keeps growing
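The growth in this example is simple arithmetic, sketched below. The rates are the illustrative ones used above, and the helper names are assumptions for this example.

```python
def backlog_tokens(gen_rate: float, play_rate: float, seconds: float) -> float:
    """Tokens waiting in the buffer after `seconds` of continuous generation."""
    return max(0.0, (gen_rate - play_rate) * seconds)


def playback_lag_seconds(gen_rate: float, play_rate: float, seconds: float) -> float:
    """How long it would take playback to work through the current backlog."""
    return backlog_tokens(gen_rate, play_rate, seconds) / play_rate


for t in (1, 2, 5):
    print(t, backlog_tokens(60, 40, t), playback_lag_seconds(60, 40, t))
```

The backlog grows linearly for as long as generation outpaces playback, which is why the effect only becomes noticeable on long responses.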
Two complementary strategies handle this:
1. Bound the buffer. Set a maximum size for the inter-stage buffer. When the buffer is full, the pipeline applies back-pressure upstream — the LLM task awaits an available slot before putting a new chunk into the queue. A reasonable default is 2–4 pending synthesis chunks at any time.
2. Drain aggressively on interruption. If the user starts speaking again, there is no value in finishing a long backlogged queue. An interrupt signal should drain the buffer immediately and cancel in-flight TTS tasks.
⚠️ Common Mistake: Developers often test pipelines with short responses, where generation ends before any meaningful backlog can form, so the buffer never grows. The problem surfaces with long responses (explanations, lists, multi-paragraph answers), where the LLM keeps outpacing audio playback long enough for the queue to build up.
🎯 Key Principle: An unbounded buffer turns a fast LLM into a liability — the faster it generates, the more lag it can accumulate if TTS can't keep pace. Bounding the buffer makes the failure mode explicit and manageable.
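A bounded asyncio.Queue gives this back-pressure for free: put() suspends the producer whenever the queue is full. The sketch below uses a fast stand-in producer and a deliberately slow stand-in consumer; the function names, the maxsize of 3, and the sleep duration are assumptions for illustration.

```python
import asyncio
from typing import List, Tuple


async def producer(queue: asyncio.Queue, n_chunks: int) -> None:
    for i in range(n_chunks):
        await queue.put(f"chunk-{i}")  # suspends here while the queue is full
    await queue.put(None)              # sentinel: no more chunks


async def consumer(queue: asyncio.Queue, depths: List[int]) -> int:
    consumed = 0
    while True:
        depths.append(queue.qsize())   # record backlog before taking an item
        chunk = await queue.get()
        if chunk is None:
            return consumed
        consumed += 1
        await asyncio.sleep(0.001)     # stand-in for slow synthesis and playback


async def run_demo(maxsize: int = 3, n_chunks: int = 20) -> Tuple[int, int]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    depths: List[int] = []
    _, consumed = await asyncio.gather(producer(queue, n_chunks), consumer(queue, depths))
    return max(depths), consumed


max_depth, consumed = asyncio.run(run_demo())
print(max_depth, consumed)
```

However fast the producer is, the observed backlog never exceeds maxsize, and every chunk is still delivered in order.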
Contracts, Not Just Cables
A useful way to think about these inter-stage connections is as contracts rather than simple data pipes. Each contract specifies three things:
| | ASR → LLM | LLM → Buffer | Buffer → TTS |
|---|---|---|---|
| Unit emitted | Complete utterance transcript | Token | Sentence/clause chunk |
| Trigger condition | EOU signal from VAD | Each generated token | Punctuation boundary or token threshold |
| Flow control | One transcript per turn | Gated by LLM speed | Bounded queue; back-pressure upstream |
This framing makes the pipeline's behavior predictable and testable in isolation. You can unit-test the chunking function independently by feeding it a sequence of tokens and verifying what chunks it emits and when. You can test the ASR handoff logic by simulating partial and final transcript events without running a real ASR model. The contracts define the seams.
The concrete implementation of these contracts — async queues, producer and consumer tasks, and the cancellation paths that cut across all of them — is what the next section translates into running code.
Wiring the Pipeline: A Concrete Architecture
A voice agent pipeline is only as robust as the connections between its stages. This section walks through a concrete producer-consumer architecture that keeps each stage independent, coordinates the handoffs correctly, and provides the interrupt paths a real conversation requires.
The Core Pattern: One Stage, One Task, One Queue
Each pipeline stage runs as its own async task and stages communicate exclusively through async queues rather than direct function calls. This is the classic producer-consumer pattern applied in three hops.
Microphone
│
▼
┌─────────────┐ asr_queue ┌─────────────┐
│ ASR Task │ ──────────────────────► │ LLM Task │
│ │ (transcript + EOU flag) │ │
└─────────────┘ └─────────────┘
│
llm_queue
│
▼
┌─────────────┐
│ TTS Task │
│ │
└─────────────┘
│
▼
Speaker
The key property this buys you: a momentarily slow stage does not stall the stage upstream of it. If TTS is busy synthesizing a previous sentence, the LLM task keeps producing tokens into llm_queue until that queue reaches its bound, at which point back-pressure takes over. Each task runs at its own pace; the queues absorb timing mismatches.
🎯 Key Principle: Queues between stages carry two things: data payloads (transcripts, tokens, audio chunks) and control signals (end-of-utterance, cancel, flush). Keeping these separate leads to clean state machines; conflating them leads to fragile parsers.
The ASR-to-LLM Handoff: Why VAD Does the Signaling
The most counterintuitive design decision at the ASR-to-LLM boundary is that transcript content alone is not sufficient to decide when to forward to the LLM. A transcript may be syntactically complete — "What's the weather like" — but the user might still be mid-utterance, about to add "in Seattle on Friday?"
The correct signal comes from a Voice Activity Detector (VAD), which detects the presence or absence of human speech. When the VAD observes a silence gap long enough to indicate the user has finished, it emits an EOU signal. Only then does the pipeline forward the accumulated transcript to the LLM.
In practice, the message placed on asr_queue is a small dataclass:
from dataclasses import dataclass
from enum import Enum, auto


class MessageKind(Enum):
    PARTIAL_TRANSCRIPT = auto()
    END_OF_UTTERANCE = auto()
    CANCEL = auto()


@dataclass
class PipelineMessage:
    kind: MessageKind
    payload: str | None = None  # transcript text, or None for signals
The ASR task places PARTIAL_TRANSCRIPT messages as words arrive, then a single END_OF_UTTERANCE message (with the final, stable transcript as payload) when the VAD fires. The LLM task ignores partial transcripts and acts only on END_OF_UTTERANCE:
import asyncio


async def llm_task(asr_queue: asyncio.Queue, llm_queue: asyncio.Queue, cancel_event: asyncio.Event):
    while True:
        msg = await asr_queue.get()
        if msg.kind == MessageKind.CANCEL:
            # Barge-in: mark in-flight work as stale and tell downstream to flush.
            cancel_event.set()
            await llm_queue.put(PipelineMessage(kind=MessageKind.CANCEL))
            continue
        if msg.kind == MessageKind.PARTIAL_TRANSCRIPT:
            # Partials are unstable re-hypotheses of the whole utterance; wait for EOU.
            continue
        if msg.kind == MessageKind.END_OF_UTTERANCE:
            full_text = msg.payload  # final, stable transcript
            cancel_event.clear()
            # call_llm_streaming stands in for your LLM client's streaming call.
            async for token in call_llm_streaming(full_text):
                if cancel_event.is_set():
                    break
                await llm_queue.put(PipelineMessage(kind=MessageKind.PARTIAL_TRANSCRIPT, payload=token))
            if not cancel_event.is_set():
                await llm_queue.put(PipelineMessage(kind=MessageKind.END_OF_UTTERANCE))
⚠️ Common Mistake: Triggering LLM inference on every PARTIAL_TRANSCRIPT event causes redundant, overlapping inference calls — the LLM starts answering "What's the" before the user says "weather." The EOU gate is not optional; it is what separates a pipeline from a race condition.
The LLM-to-TTS Handoff: The Chunking Function
Once the LLM is streaming tokens, the TTS stage faces a dilemma: it needs a meaningful phrase to synthesize with natural prosody, but waiting for a complete response defeats the purpose of streaming. The answer is a chunking function — an accumulation layer that watches two triggers:
- Sentence boundary detection: punctuation characters (. ? ! ;)
- Token count threshold: a fallback that flushes the buffer after a maximum number of tokens
async def chunking_generator(
    llm_queue: asyncio.Queue,
    sentence_endings: str = ".?!;",
    max_tokens: int = 30,
):
    buffer: list[str] = []
    token_count = 0
    while True:
        msg = await llm_queue.get()
        if msg.kind == MessageKind.CANCEL:
            # Drop anything buffered and pass the cancel downstream.
            buffer.clear()
            token_count = 0
            yield PipelineMessage(kind=MessageKind.CANCEL)
            continue
        if msg.kind == MessageKind.END_OF_UTTERANCE:
            # Flush the tail of the response before signalling the end.
            if buffer:
                yield PipelineMessage(
                    kind=MessageKind.PARTIAL_TRANSCRIPT,
                    payload="".join(buffer).strip()
                )
                buffer.clear()
                token_count = 0
            yield PipelineMessage(kind=MessageKind.END_OF_UTTERANCE)
            continue
        if msg.kind == MessageKind.PARTIAL_TRANSCRIPT:
            buffer.append(msg.payload)
            token_count += 1
            joined = "".join(buffer)
            should_flush = (
                any(joined.rstrip().endswith(p) for p in sentence_endings)
                or token_count >= max_tokens
            )
            if should_flush:
                yield PipelineMessage(
                    kind=MessageKind.PARTIAL_TRANSCRIPT,
                    payload=joined.strip()
                )
                buffer.clear()
                token_count = 0
This async generator consumes from llm_queue and yields synthesis-ready chunks. The TTS task iterates over it:
async def tts_task(llm_queue: asyncio.Queue, cancel_event: asyncio.Event):
    async for chunk_msg in chunking_generator(llm_queue):
        if cancel_event.is_set() or chunk_msg.kind == MessageKind.CANCEL:
            await drain_audio_output()
            continue
        if chunk_msg.kind == MessageKind.PARTIAL_TRANSCRIPT and chunk_msg.payload:
            audio_bytes = await synthesize_speech(chunk_msg.payload)
            if not cancel_event.is_set():
                await play_audio(audio_bytes)
💡 Pro Tip: Set max_tokens between 20 and 40. Lower values produce more TTS calls with shorter audio (more network round-trips); higher values delay first audio unnecessarily. Start at 25 and adjust based on your TTS latency budget.
The Cancellation Path: Handling Interruptions
A pipeline without a cancellation path is not a voice agent — it is a lecture. Real conversation requires barge-in capability: when the user starts speaking while the agent is talking, the agent must stop, discard in-flight work, and listen to the new input.
The mechanism is an asyncio.Event called cancel_event. Here is how it propagates:
User speaks (VAD detects)
│
▼
ASR Task emits CANCEL message
│
├──► asr_queue ──► LLM Task checks cancel_event
│ │ (stops token generation)
│ ▼
│ llm_queue ──► chunking_generator
│ │ (flushes buffer, yields CANCEL)
│ ▼
│ TTS Task
│ │ (stops playback)
│
└──► cancel_event.set() (shared across all tasks)
Setting cancel_event causes: the LLM task to break out of its token-generation loop; the chunking generator to pass a CANCEL message downstream and clear its buffer; and the TTS task to stop playback and discard the current audio chunk. After the new utterance arrives, the LLM task calls cancel_event.clear() and all stages resume normal operation.
⚠️ Common Mistake: Placing the cancel_event check only at the start of each task iteration is not sufficient. If synthesize_speech() is an awaitable that takes 300 ms, the TTS task won't notice the cancellation until after it returns. The check must surround every await that represents meaningful work, and long awaits should themselves be interruptible.
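One way to make a long await interruptible is to race it against the cancel event. The helper below is a sketch, not part of the pipeline code above: run_unless_cancelled and fake_synthesize are hypothetical names, and fake_synthesize simply sleeps in place of a real TTS call.

```python
import asyncio


async def run_unless_cancelled(coro, cancel_event: asyncio.Event):
    """Await `coro`, but give up and return None as soon as cancel_event is set."""
    work = asyncio.ensure_future(coro)
    cancel_wait = asyncio.ensure_future(cancel_event.wait())
    done, _ = await asyncio.wait({work, cancel_wait}, return_when=asyncio.FIRST_COMPLETED)
    if work in done:
        cancel_wait.cancel()
        return work.result()
    # The cancel event won the race: stop the in-flight work and discard it.
    work.cancel()
    try:
        await work
    except asyncio.CancelledError:
        pass
    return None


async def fake_synthesize(text: str) -> bytes:
    await asyncio.sleep(0.2)  # stand-in for a slow synthesis request
    return text.encode()


async def demo():
    loop = asyncio.get_running_loop()
    quiet = asyncio.Event()
    completed = await run_unless_cancelled(fake_synthesize("hi"), quiet)

    barge_in = asyncio.Event()
    loop.call_later(0.01, barge_in.set)  # user interrupts 10 ms into synthesis
    interrupted = await run_unless_cancelled(fake_synthesize("hi"), barge_in)
    return completed, interrupted


completed, interrupted = asyncio.run(demo())
print(completed, interrupted)
```

With a wrapper like this, the TTS task reacts to a barge-in within the synthesis call itself rather than only after it returns.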
Putting It Together: A Minimal End-to-End Skeleton
import asyncio


async def run_pipeline():
    asr_queue: asyncio.Queue[PipelineMessage] = asyncio.Queue(maxsize=20)
    llm_queue: asyncio.Queue[PipelineMessage] = asyncio.Queue(maxsize=50)
    cancel_event = asyncio.Event()
    async with asyncio.TaskGroup() as tg:
        tg.create_task(asr_task(asr_queue, cancel_event))
        tg.create_task(llm_task(asr_queue, llm_queue, cancel_event))
        tg.create_task(tts_task(llm_queue, cancel_event))


asyncio.run(run_pipeline())
Three tasks, two queues, one shared cancel event: the entire coordination model fits in a single function. asyncio.TaskGroup (available from Python 3.11) ensures that if any task raises an unhandled exception, all sibling tasks are cancelled, preventing a silent partial pipeline.
(In production you'd also need task restart logic for network errors, a VAD task feeding the ASR task, and discard policies for overload scenarios. The maxsize parameters above are starting points.)
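The skeleton references asr_task without showing it. One possible shape is sketched below under stated assumptions: the events argument is a hypothetical async iterator of (kind, text) tuples standing in for whatever your VAD and ASR SDK actually emit, so the signature has one more parameter than the asr_task(asr_queue, cancel_event) call above. The message types are repeated so the block runs on its own.

```python
import asyncio
from dataclasses import dataclass
from enum import Enum, auto
from typing import AsyncIterator, Optional, Tuple


class MessageKind(Enum):
    PARTIAL_TRANSCRIPT = auto()
    END_OF_UTTERANCE = auto()
    CANCEL = auto()


@dataclass
class PipelineMessage:
    kind: MessageKind
    payload: Optional[str] = None


async def asr_task(
    events: AsyncIterator[Tuple[str, Optional[str]]],  # hypothetical event source
    asr_queue: asyncio.Queue,
    cancel_event: asyncio.Event,
) -> None:
    async for kind, text in events:
        if kind == "speech_start":
            # New user speech: flag barge-in for every stage, then queue a CANCEL.
            cancel_event.set()
            await asr_queue.put(PipelineMessage(MessageKind.CANCEL))
        elif kind == "partial":
            await asr_queue.put(PipelineMessage(MessageKind.PARTIAL_TRANSCRIPT, text))
        elif kind == "final":
            # VAD reported end of utterance; the payload is the stable transcript.
            await asr_queue.put(PipelineMessage(MessageKind.END_OF_UTTERANCE, text))


async def scripted_events():
    # Scripted stand-in for a live microphone, VAD, and ASR stream.
    for event in [("speech_start", None), ("partial", "what's"),
                  ("partial", "what's the weather"), ("final", "what's the weather")]:
        yield event


async def demo():
    asr_queue: asyncio.Queue = asyncio.Queue()
    cancel_event = asyncio.Event()
    await asr_task(scripted_events(), asr_queue, cancel_event)
    kinds = []
    while not asr_queue.empty():
        kinds.append(asr_queue.get_nowait().kind)
    return kinds, cancel_event.is_set()


kinds, cancelled = asyncio.run(demo())
print(kinds, cancelled)
```

Driving the task from a scripted event list, as demo() does, is also how you would test the handoff without a microphone or a real ASR model.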
Testability as a Design Dividend
One underappreciated benefit of this architecture is that each stage is testable in isolation. Because the interface between stages is a queue of PipelineMessage objects, you can test the chunking function by feeding it a sequence of token messages and asserting on what chunks it yields — no microphone, no network call required:
import asyncio
import pytest


@pytest.mark.asyncio  # requires the pytest-asyncio plugin
async def test_chunking_flushes_on_sentence_end():
    llm_queue: asyncio.Queue[PipelineMessage] = asyncio.Queue()
    tokens = ["Hello", ",", " world", "."]
    for t in tokens:
        await llm_queue.put(PipelineMessage(kind=MessageKind.PARTIAL_TRANSCRIPT, payload=t))
    await llm_queue.put(PipelineMessage(kind=MessageKind.END_OF_UTTERANCE))

    results = []
    async for msg in chunking_generator(llm_queue):
        results.append(msg)
        if msg.kind == MessageKind.END_OF_UTTERANCE:
            break

    synthesis_chunks = [m for m in results if m.kind == MessageKind.PARTIAL_TRANSCRIPT]
    assert len(synthesis_chunks) == 1
    assert synthesis_chunks[0].payload == "Hello, world."
This isolation is a direct consequence of the queue-based design — stages expose a clean, observable interface, and the queue is the seam where you insert test inputs.
📋 Quick Reference: Pipeline Components
| Component | Consumes | Produces | Key Signal |
|---|---|---|---|
| ASR Task | Audio frames | PARTIAL_TRANSCRIPT, END_OF_UTTERANCE | VAD silence gate |
| LLM Task | END_OF_UTTERANCE | Token PARTIAL_TRANSCRIPT, END_OF_UTTERANCE | cancel_event |
| Chunking Generator | Token stream | Clause-sized PARTIAL_TRANSCRIPT | Punctuation / token count |
| TTS Task | Clause chunks | Audio playback | cancel_event |
Common Mistakes When Connecting the Stages
Every stage of a voice agent pipeline can be individually correct — the ASR can transcribe accurately, the LLM can reason well, the TTS can produce natural speech — and the integrated pipeline can still feel broken. The failure points almost always live in the connections, not the components themselves. This section works through the five integration mistakes that account for the majority of broken or sluggish pipelines in practice.
Mistake 1: Waiting for the Full LLM Response Before Starting TTS
This is the single most expensive latency error in voice pipeline development, and the most common — largely because it feels natural. When you call an LLM API, collecting the complete response before doing anything with it mirrors how you'd use a database query result.
In a voice agent, that instinct destroys perceived responsiveness. If you wait for the full LLM response before handing anything to TTS, generation time and synthesis time are purely sequential. For a response that takes 2 seconds to generate and 1.5 seconds to synthesize, the user hears nothing for 3.5 seconds after ASR finishes.
The fix is to stream tokens from the LLM into a chunking buffer and begin TTS synthesis as soon as the first complete clause is available. TTS synthesis for the first chunk overlaps with LLM generation of the remaining tokens — the user hears audio while the LLM is still generating the second sentence.
❌ Wrong: llm_response = await llm.complete(prompt) — then start TTS.
✅ Correct: Start TTS as soon as chunk_buffer emits its first sentence-boundary event from the token stream.
💡 Real-World Example: A common symptom is a pipeline that performs well in unit tests (where latency is mocked) but feels laggy in integration. The developer added per-stage streaming but left a single await llm.complete() call — not llm.stream() — at the handoff, accidentally re-serializing the pipeline.
🎯 Key Principle: TTFA is dominated by the longest unbroken wait in the pipeline. Any stage boundary where you collect a complete output before starting the next stage resets that clock.
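The difference between the two handoffs shows up in the order of events. The sketch below uses a fake token stream and records when TTS would be invoked relative to the end of generation; fake_llm_stream, blocking_handoff, and streaming_handoff are hypothetical names for this example, and no real LLM or TTS client is involved.

```python
import asyncio
from typing import List


async def fake_llm_stream():
    # Stand-in for a streaming LLM call: yields tokens one at a time.
    for tok in ["It", " is", " sunny", ".", " Bring", " a", " hat", "."]:
        await asyncio.sleep(0)
        yield tok


async def blocking_handoff(events: List[str]) -> None:
    # Mistake 1: collect the whole response, then hand it to TTS.
    text = "".join([tok async for tok in fake_llm_stream()])
    events.append("llm_done")
    events.append(f"tts:{text}")


async def streaming_handoff(events: List[str]) -> None:
    # Fix: hand each sentence to TTS as soon as its boundary appears.
    buffer = ""
    async for tok in fake_llm_stream():
        buffer += tok
        if buffer.rstrip().endswith((".", "?", "!")):
            events.append(f"tts:{buffer.strip()}")
            buffer = ""
    if buffer.strip():
        events.append(f"tts:{buffer.strip()}")
    events.append("llm_done")


blocking_events: List[str] = []
streaming_events: List[str] = []
asyncio.run(blocking_handoff(blocking_events))
asyncio.run(streaming_handoff(streaming_events))
print(blocking_events)
print(streaming_events)
```

In the blocking version the first TTS call comes after llm_done; in the streaming version the first sentence reaches TTS while the model is still generating the second.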
Mistake 2: Forwarding Every Partial ASR Transcript to the LLM
ASR systems emit a rapid stream of partial, unstable transcripts as audio arrives. A developer wiring ASR output to the LLM might write:
# Anti-pattern
async for transcript in asr.stream():
    response = await llm.generate(transcript.text)  # ❌ fires on every partial
This fires a full LLM inference call on every interim transcript event. For a 4-second utterance, that could mean dozens of inference requests, most immediately superseded by the next partial. The consequences:
- Wasted compute and cost: most requests are superseded by a newer partial before the user has finished speaking.
- Race conditions: an early, short response might complete and begin TTS while the user is still speaking.
- Non-deterministic behavior: network latency variation can cause out-of-order responses.
The correct gate is an EOU signal from a VAD, not from transcript content:
ASR partial events: ["I'd"] ["I'd like"] ["I'd like the"] ["I'd like the weather"]
↓ ↓ ↓ ↓
skip skip skip [VAD: silence]
↓
forward to LLM ✅
🤔 Did you know? Some ASR APIs distinguish between partial and final transcript events in their event type field. Using only final events is a lightweight substitute for a full VAD in controlled environments — but it depends entirely on the ASR provider's internal silence detection, which may have a fixed, non-configurable delay.
Mistake 3: Passing Individual Tokens Directly to TTS
Once you've correctly wired the LLM to stream tokens, the next temptation is to forward each token to TTS immediately. The instinct is valid, but the execution breaks audio quality.
TTS prosody is computed over a span of text, not over individual words or subword tokens. When you pass "The" to a TTS engine, it cannot know whether the sentence is "The dog barked" or "The results were catastrophic." The engine commits to an acoustic rendering before it has the context that determines correct prosody. The result is choppy, clipped audio.
❌ Token-by-token to TTS:
LLM: "The" → TTS | "weather" → TTS | "today" → TTS | "is" → TTS | "sunny" → TTS
[clipped] [clipped] [clipped] [clipped] [clipped]
✅ Clause-buffered to TTS:
LLM: "The" "weather" "today" "is" "sunny." → buffer detects "."
↓
"The weather today is sunny." → TTS
↓
[natural prosody] ✅
The fix is the clause-boundary buffer described in detail in Streaming as the Backbone and implemented in Wiring the Pipeline. The buffer is not trying to collect a complete response — that would reintroduce Mistake 1 — it is looking for the earliest safe boundary where prosody can be computed reliably.
⚠️ Note that this simplified picture works well for most conversational output. It can break down when the LLM produces content with unconventional punctuation (code snippets, lists, markdown) — handling those cases requires content-aware chunking.
Mistake 4: Forgetting to Cancel In-Flight TTS on User Interruption
This mistake doesn't degrade latency; it breaks turn-taking, which is arguably worse. The failure mode: the user begins speaking while the agent is mid-sentence. The pipeline correctly detects new speech, begins ASR on the new utterance, and eventually sends a new prompt to the LLM. But the previous TTS task was never cancelled. It continues generating and playing audio, overlapping with the new response.
Agent speaking: "The weather in New York is currently sixty-two degrees and—"
User interrupts: "Actually, what about Boston?"
❌ Without cancellation:
TTS continues: "—partly cloudy with a chance of rain this afternoon."
New response: "Boston is currently fifty-eight degrees..." (plays over previous TTS)
Result: Two audio streams collide.
✅ With cancellation:
VAD fires → cancel TTS task → drain audio buffer → silence
New response: "Boston is currently fifty-eight degrees..." (clean start)
The fix requires a cancellation signal path wired to every active downstream task when a new utterance begins:
- The VAD fires on new speech.
- The pipeline emits a cancellation event to the current LLM generation task (if active) and to the current TTS task.
- Any audio already queued for playback is drained or discarded.
- The pipeline resets to the ASR-collection state.
🧠 Mnemonic: STOP-DRAIN-RESET. Stop the active tasks. Drain the audio queue. Reset the pipeline state. Skipping any one of these three steps leaves an artifact — a dangling task, residual audio, or an inconsistent state for the next turn.
💡 Pro Tip: Write a test scenario that sends a new VAD signal 200ms into TTS playback and verify that audio output stops and pipeline state is clean. Cancellation bugs are easy to miss in happy-path testing.
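One possible shape for STOP-DRAIN-RESET, together with the 200 ms interrupt scenario from the tip above, is sketched here using plain asyncio task cancellation. The `TurnController` class, the `fake_tts` coroutine, and the chunk sizes are hypothetical names and values introduced for illustration. A real pipeline would also need to stop the audio output device.

```python
import asyncio
from typing import Optional


class TurnController:
    """Tracks the in-flight response so a new utterance can cancel it."""

    def __init__(self) -> None:
        self.audio_queue = asyncio.Queue()  # audio chunks awaiting playback
        self.response_task: Optional[asyncio.Task] = None
        self.state = "listening"

    def start_response(self, coro) -> None:
        self.state = "speaking"
        self.response_task = asyncio.create_task(coro)

    async def on_user_speech(self) -> None:
        # STOP: cancel the active task and wait until it has actually exited.
        task = self.response_task
        if task is not None and not task.done():
            task.cancel()
            try:
                await task
            except asyncio.CancelledError:
                pass
        # DRAIN: discard audio that was queued but not yet played.
        while not self.audio_queue.empty():
            self.audio_queue.get_nowait()
        # RESET: return to the ASR-collection state.
        self.response_task = None
        self.state = "listening"


async def fake_tts(audio_queue: asyncio.Queue) -> None:
    """Stand-in for streaming TTS: emits one audio chunk every 50 ms."""
    while True:
        await audio_queue.put(b"\x00" * 320)
        await asyncio.sleep(0.05)


async def interrupt_scenario() -> tuple:
    ctl = TurnController()
    ctl.start_response(fake_tts(ctl.audio_queue))
    await asyncio.sleep(0.2)  # the user interrupts 200 ms into playback
    task = ctl.response_task
    await ctl.on_user_speech()
    return task.cancelled(), ctl.audio_queue.qsize(), ctl.state
```

Awaiting the cancelled task before draining matters: otherwise the task could enqueue one more chunk after the queue has been emptied.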
Mistake 5: Using Blocking Calls Inside an Async Pipeline
The previous four mistakes are all about what you connect; this one is about how you wire it. An async pipeline achieves concurrency by cooperatively yielding control — every await is a point where the event loop can schedule another task. A blocking call — a synchronous HTTP request, a file read using blocking open(), any third-party SDK that uses requests under the hood — occupies the thread without yielding. When such a call executes inside an async task, it doesn't just slow that task: it freezes the entire event loop.
Async event loop — correct behavior:
Task A (ASR): ──await──────────────────await──
Task B (LLM): ──await──────await──────────
Task C (TTS): ──await──────────────
↑ cooperative switches keep all three progressing
Async event loop — blocking call in Task B:
Task A (ASR): ──await──[FROZEN]──────────────────
Task B (LLM): ──[BLOCKING CALL: 800ms]──────
Task C (TTS): [FROZEN]────────────────
Your "concurrent" pipeline is actually serialized during every blocking call — ASR stops consuming audio, TTS stops generating speech.
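The freeze is easy to observe directly. In the sketch below, a heartbeat task stands in for ASR or TTS work and counts how often it gets scheduled while another coroutine either blocks with `time.sleep` or yields with `asyncio.sleep`. All function names and durations are illustrative assumptions.

```python
import asyncio
import time


async def heartbeat(counter: list) -> None:
    """Stands in for ASR/TTS work: ticks every 10 ms when the loop is free."""
    while True:
        counter[0] += 1
        await asyncio.sleep(0.01)


async def ticks_during(work) -> int:
    """Count heartbeat ticks that occur while `work()` is running."""
    counter = [0]
    beat = asyncio.create_task(heartbeat(counter))
    await asyncio.sleep(0)  # let the heartbeat run its first tick
    start_ticks = counter[0]
    await work()
    ticks = counter[0] - start_ticks
    beat.cancel()
    return ticks


async def blocking_work() -> None:
    time.sleep(0.2)  # blocks the thread; the loop cannot switch tasks


async def yielding_work() -> None:
    await asyncio.sleep(0.2)  # yields; the loop keeps scheduling other tasks


async def compare() -> tuple:
    blocked = await ticks_during(blocking_work)
    free = await ticks_during(yielding_work)
    return blocked, free


if __name__ == "__main__":
    blocked, free = asyncio.run(compare())
    print(f"ticks during blocking call: {blocked}, during awaited call: {free}")
```

The heartbeat records no ticks during the blocking call, because the coroutine never reaches an `await` that would hand control back to the loop.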
Two patterns fix this:
1. Use async-native clients for I/O-bound operations:
import httpx

# Blocking: holds the event loop for the full request (inside an async task)
response = httpx.post(url, json=payload)  # ❌

# Non-blocking: yields to the event loop during I/O
async with httpx.AsyncClient() as client:
    response = await client.post(url, json=payload)  # ✅
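2. Offload blocking synchronous SDK calls to a thread pool:
import asyncio

async def process_audio(audio_bytes: bytes) -> str:
    loop = asyncio.get_running_loop()
    # None selects the loop's default ThreadPoolExecutor
    result = await loop.run_in_executor(
        None,
        sync_asr_sdk.transcribe,  # the synchronous SDK call to offload
        audio_bytes,
    )
    return result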
⚠️ run_in_executor with the default executor (None) runs the call in a thread pool. That keeps the event loop from blocking on I/O, but code that holds the GIL still cannot run in parallel with the event loop thread. For genuinely CPU-bound work (e.g., running a local ASR model), a ProcessPoolExecutor or a separate subprocess is the correct tool. For I/O-bound SDK calls, the common case in a cloud-API pipeline, a thread pool is sufficient.
Putting the Mistakes Together
These five mistakes are independent in cause but often appear together in a newly built pipeline:
┌─────────────────────────────────────┬──────────────────────────┬──────────────────────────┐
│ Mistake │ Primary Impact │ Fix │
├─────────────────────────────────────┼──────────────────────────┼──────────────────────────┤
│ Full LLM wait before TTS │ Time-to-first-audio │ Stream tokens + chunk │
│ Every partial transcript → LLM │ Cost + race conditions │ VAD-gated EOU signal │
│ Individual tokens → TTS │ Audio quality (prosody) │ Clause-boundary buffer │
│ No TTS cancellation on interrupt │ Turn-taking / UX │ STOP-DRAIN-RESET path │
│ Blocking calls in async pipeline │ All-stage concurrency │ Async clients / executor │
└─────────────────────────────────────┴──────────────────────────┴──────────────────────────┘
Mistakes 1 and 5 both harm TTFA, but through different mechanisms: waiting too long at a stage boundary and stalling the event loop, respectively. Mistake 2 harms cost and correctness. Mistake 3 harms audio quality. Mistake 4 harms conversational naturalness. Fixing all five is not optional in a production pipeline: each represents a category of failure that surfaces under real usage.
Summary and What Comes Next
You now have a working mental model of how a real-time voice agent pipeline holds together — not as three independent systems that run in sequence, but as a coordinated data flow where each stage's partial output feeds the next before the previous one finishes.
The One Metric That Governs Everything
Time-to-first-audio (TTFA) is the elapsed time between the moment the user stops speaking and the moment the agent produces its first audible sound. It is the primary design constraint the entire pipeline is organized around. Total response duration matters, but TTFA matters more: human conversational norms tolerate a speaker who takes a while to finish a sentence far better than one who takes a long time to start one.
Streaming is the mechanism that minimizes TTFA. ASR transcribes while the user is still speaking, so the final transcript is ready almost as soon as the utterance ends, and the pipeline does not wait for the LLM to finish before starting TTS. Each stage processes partial output from the previous stage as soon as enough of it arrives to be useful. The result: TTS synthesis starts as soon as the LLM has generated its first clause, while the rest of the response is still being generated.
❌ Wrong thinking: "I'll optimize model latency first, then worry about streaming."
✅ Correct thinking: "Streaming cuts TTFA before any model optimization runs. Without it, model speed gains are hidden behind architectural lag."
The Three Critical Handoffs
If TTFA is the goal, the three handoffs are where that goal is won or lost:
Handoff 1 — VAD-gated transcript to LLM. ASR emits partial transcripts continuously; the pipeline forwards only the final, stable transcript, triggered by an end-of-utterance signal from the VAD. Without this gate, the LLM receives dozens of redundant queries mid-utterance.
Handoff 2 — Clause-buffered token stream to TTS. The LLM produces a stream of tokens; an inter-stage buffer accumulates them until a clause or sentence boundary is detected, then emits a synthesis request. Without this buffer, TTS receives individual tokens and produces choppy, prosodically broken audio.
Handoff 3 — Interrupt signal to cancel in-flight tasks. When the user speaks while the agent is generating or speaking, a cancellation signal originating from the VAD must propagate to cancel active LLM and TTS tasks. Without this path, the agent speaks over the user — a failure that makes natural conversation impossible.
🧠 Mnemonic: VCI — VAD gate, Clause buffer, Interrupt signal. These are the three structural points where the pipeline succeeds or fails at real-time responsiveness.
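As a rough illustration of Handoff 1, the gate below keeps only the most recent transcript and forwards it once, when an end-of-utterance event arrives. The `PipelineEvent` shape and its `kind` values are hypothetical, since real ASR and VAD providers each define their own event formats. The sketch also assumes the transcript is already up to date when the end-of-utterance signal fires.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class PipelineEvent:
    kind: str  # "transcript" (partial or final) or "end_of_utterance"
    text: str = ""


async def vad_gate(events: asyncio.Queue, llm_prompts: asyncio.Queue) -> None:
    """Forward one prompt per utterance; a None event ends the stream."""
    latest = ""
    while (event := await events.get()) is not None:
        if event.kind == "transcript":
            latest = event.text  # overwrite: partials are never forwarded
        elif event.kind == "end_of_utterance" and latest.strip():
            await llm_prompts.put(latest.strip())
            latest = ""


async def demo_gate() -> list:
    events, llm_prompts = asyncio.Queue(), asyncio.Queue()
    for ev in [
        PipelineEvent("transcript", "what's"),
        PipelineEvent("transcript", "what's the weather"),
        PipelineEvent("transcript", "What's the weather in Boston?"),
        PipelineEvent("end_of_utterance"),
        None,
    ]:
        events.put_nowait(ev)
    await vad_gate(events, llm_prompts)
    return [llm_prompts.get_nowait() for _ in range(llm_prompts.qsize())]
```

Three transcript events go in and a single prompt comes out, which is the property that prevents redundant LLM calls mid-utterance.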
The Producer-Consumer Queue Architecture
The structural pattern that makes all three handoffs reliable: each stage runs as an independent async task and communicates exclusively through shared queues. No stage calls another directly. No stage blocks waiting for another to finish.
This architecture provides three concrete advantages:
- Stage independence — each stage can be tested in isolation by pushing items onto its input queue and asserting on its output.
- Backpressure containment — bounded queues make the failure mode of a slow TTS stage explicit and manageable, rather than silently growing audio lag.
- Interrupt safety — because stages communicate only through queues and cancellation tokens, the interrupt signal has a clean implementation path with no dangling callbacks.
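The queue architecture above might look like the following in outline. Each stage is an independent task that touches only its own input and output queues. The stage bodies are placeholders, and the `maxsize=4` bound and the `None` end-of-stream sentinel are assumptions made for this sketch.

```python
import asyncio

SENTINEL = None  # marks end of stream on every queue (an assumed convention)


async def asr_stage(out_q: asyncio.Queue) -> None:
    await out_q.put("what is the weather")  # stand-in for a final transcript
    await out_q.put(SENTINEL)


async def llm_stage(in_q: asyncio.Queue, out_q: asyncio.Queue) -> None:
    while (transcript := await in_q.get()) is not SENTINEL:
        # Stand-in: a real stage would stream tokens through a clause buffer.
        for clause in (f"You asked: {transcript}.", "It is sunny."):
            await out_q.put(clause)  # waits here if the TTS stage falls behind
    await out_q.put(SENTINEL)


async def tts_stage(in_q: asyncio.Queue, spoken: list) -> None:
    while (clause := await in_q.get()) is not SENTINEL:
        await asyncio.sleep(0.01)  # simulated synthesis time
        spoken.append(clause)


async def run_pipeline() -> list:
    transcripts = asyncio.Queue(maxsize=4)  # bounded: backpressure is explicit
    clauses = asyncio.Queue(maxsize=4)
    spoken = []
    await asyncio.gather(
        asr_stage(transcripts),
        llm_stage(transcripts, clauses),
        tts_stage(clauses, spoken),
    )
    return spoken


if __name__ == "__main__":
    print(asyncio.run(run_pipeline()))
```

Because no stage calls another directly, any one of them can be exercised alone by filling its input queue and inspecting its output.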
📋 Core Concepts Reference
| Concept | What It Is | Mechanism | Failure Without It |
|---|---|---|---|
| Time-to-first-audio | Primary latency metric | Streaming between all stages | Agent feels unresponsive regardless of model speed |
| VAD-gated handoff | ASR → LLM trigger | End-of-utterance silence detection | Redundant LLM calls on partial transcripts |
| Clause buffer | LLM → TTS trigger | Punctuation or token-count threshold | Choppy, prosodically broken audio |
| Interrupt signal | Turn-taking control | VAD event cancels in-flight tasks | Agent speaks over user input |
| Producer-consumer queues | Structural pattern | Async tasks + bounded queues | Stage coupling, blocking calls, untestable components |
What the Child Lessons Add
This lesson has described the connections between stages. What it has not done is go inside each stage and explain how to make it perform well. That is the work of the three child lessons, each of which plugs directly into the connection points described here.
ASR covers the internals of speech recognition — streaming transcript APIs and the mechanics of partial versus final transcript events. The connection point: once you understand how your ASR system signals end-of-utterance, you know exactly what to put on the transcript queue to trigger the LLM task.
LLM Streaming covers how to configure and consume a streaming inference response — managing token streams, handling tool calls mid-stream, and controlling generation parameters that affect output speed. The connection point: the LLM task's raw token stream is what your chunking function converts into synthesis requests on the TTS queue.
TTS covers synthesis APIs, voice selection, audio encoding, and streaming playback. The connection points: the clause-buffer input format (what a valid synthesis request looks like) and the interrupt signal handler (how a cancellation token stops in-progress synthesis and flushes the audio output buffer cleanly).
💡 Pro Tip: A practical way to build incrementally is to stub each stage before deepening it. Start with a TTS stage that just prints the text it receives — this lets you verify your clause-buffer logic produces well-formed phrases before connecting a real synthesis API. The queue architecture makes this substitution mechanical: the stub and the real implementation are interchangeable without touching any other stage.
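A stub along the lines of this tip could look like the sketch below. The `TTSEngine` protocol and the `PrintingTTS` class are hypothetical names. The only point is that the stage depends on an interface, so a real synthesis client with the same `synthesize` method could replace the stub without changes elsewhere.

```python
import asyncio
from typing import Protocol


class TTSEngine(Protocol):
    """Assumed interface shared by the stub and any real synthesis client."""

    async def synthesize(self, text: str) -> bytes: ...


class PrintingTTS:
    """Stub engine: prints each phrase and returns no audio."""

    def __init__(self) -> None:
        self.received = []

    async def synthesize(self, text: str) -> bytes:
        print(f"[TTS stub] {text!r}")
        self.received.append(text)
        return b""


async def tts_stage(in_q: asyncio.Queue, engine: TTSEngine) -> None:
    """Consume synthesis requests until a None sentinel arrives."""
    while (text := await in_q.get()) is not None:
        await engine.synthesize(text)


async def check_stub() -> list:
    q = asyncio.Queue()
    for phrase in ["The weather today is sunny.", "Bring a hat.", None]:
        q.put_nowait(phrase)
    stub = PrintingTTS()
    await tts_stage(q, stub)
    return stub.received
```

Asserting on `stub.received` gives a quick check that the upstream clause buffer emits well-formed phrases before any audio API is involved.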
Three Warnings to Carry Forward
⚠️ The pipeline is only as streaming as its slowest non-streaming call. A single blocking HTTP request inside an async task stalls the entire event loop. Every I/O-bound call must be awaited or offloaded to a thread pool. Wrap synchronous SDKs before integrating them — not after discovering the event loop stall in testing.
⚠️ TTFA budgets compound. Each stage consumes some share of your total acceptable latency. Model that budget explicitly before choosing components, and measure it from your own system — vendor benchmarks typically reflect best-case conditions.
⚠️ Interrupt handling is not optional polish. It is a correctness requirement. A pipeline that cannot be interrupted mid-response does not support natural conversation. Retrofitting cancellation into a pipeline designed without it requires restructuring the cancellation model across all three stages simultaneously — build it from the first prototype.
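To make the budget warning concrete, the arithmetic can be written down before any component is chosen. Every number below is a hypothetical placeholder rather than a measurement or a vendor figure, and the stage names are an assumed breakdown of what happens between end of speech and first audio.

```python
# All values are hypothetical placeholders; replace them with your own measurements.
TTFA_TARGET_MS = 800

budget_ms = {
    "end_of_utterance_detection": 300,
    "asr_finalization": 100,
    "llm_time_to_first_token": 200,
    "first_clause_buffering": 100,
    "tts_time_to_first_audio": 150,
    "playback_buffering": 50,
}

total_ms = sum(budget_ms.values())
headroom_ms = TTFA_TARGET_MS - total_ms  # negative means the target is missed

for stage, ms in budget_ms.items():
    print(f"{stage:<28}{ms:>5} ms  ({ms / total_ms:.0%} of total)")
print(f"{'total':<28}{total_ms:>5} ms  (headroom: {headroom_ms} ms)")
```

With these placeholder values the stages sum to 900 ms against an 800 ms target, so the design is 100 ms over budget even though no single stage looks slow in isolation.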