Shape of a Voice Agent
Lesson 1 — Form the correct mental model: the duplex loop, half-duplex vs full-duplex, and the two architectures (compose-your-own vs hosted realtime API).
Why Voice Agents Are Architecturally Different
If you've built a chatbot or wired up a language model to a web interface, you already have intuitions about how AI systems work: a user sends a message, the model processes it, you render the response. The mental model is a request-response cycle — clean, stateless, and forgiving. Latency is a quality-of-life concern. A two-second delay might feel sluggish, but the conversation still works.
Now imagine that same delay in a phone call. Someone asks you a question, and you say nothing for two seconds. The silence doesn't feel like "processing" — it feels like a dropped call, a confused pause, or a social failure. The conversational illusion collapses almost instantly. This is the core insight that separates voice agent architecture from text-based AI system design: in voice, latency is not a performance metric you optimize after the fact. It is a first-class architectural constraint that shapes every decision from the start.
Beyond latency, voice introduces two more challenges that text interfaces simply don't face: figuring out when the user has finished speaking, and streaming synthesized speech in real time rather than waiting for a complete response. Together, these three constraints — latency budget, turn detection, and streaming output — are what force a voice agent into a specific pipeline shape, rather than a simple API call.
The Latency Budget: Why 800ms Is an Architectural Threshold
Human conversation has timing norms that feel natural without us thinking about them. Speakers take turns rapidly — typically with gaps measured in hundreds of milliseconds, not seconds. Gaps beyond roughly a second begin to feel meaningful: they signal hesitation, confusion, or discomfort. Applied to a voice agent, this means the total round-trip time from when a user finishes speaking to when the agent begins responding must stay well below one second to preserve the feeling of natural dialogue.
A practical working threshold is approximately 800ms end-to-end — from the last audio sample of the user's utterance to the first audio sample of the agent's reply. This is not a hard physical law, but it is a useful design anchor: if your architecture structurally cannot hit this budget under normal operating conditions, you will need to rethink the architecture, not just tune parameters.
💡 Mental Model: Think of 800ms as a budget you must allocate across pipeline stages, not a number you tune at the end. Every component — detecting end-of-speech, transcribing audio, generating a response, synthesizing speech — draws from the same account. If any single stage is structurally over-budget, the whole system is over-budget.
A naive implementation might work as follows: record audio until silence is detected, send the entire audio file to a transcription service, wait for the full transcript, send the transcript to a language model, wait for the complete response, send the complete response to a text-to-speech service, wait for the full audio file, and finally play it. Each "wait" is a blocking gap. On fast hardware with fast APIs, this chain might take two to four seconds under favorable conditions — already outside the conversational budget. The architecture that actually fits inside the budget streams, pipelines, and overlaps. Understanding why the naive approach fails is the foundation; the specific pipeline shape it demands is covered in the next section.
Turn Detection: The Problem Text Interfaces Never Have to Solve
When a user types a message in a chat interface, there is an explicit signal for completion: they press Send. The system never has to wonder whether the user is done. Voice has no Send button.
Voice input is a continuous audio stream. The microphone captures a sequence of audio samples — a mix of speech, breath, ambient noise, and silence — and the system must decide, in real time, when the user has finished speaking. This is the turn detection problem, and it is genuinely difficult.
The most obvious approach is silence detection: if audio energy drops below a threshold for some duration, assume the user is done. This is the basis of Voice Activity Detection (VAD), which most systems use as a starting point. VAD works reasonably well for simple cases, but it has structural failure modes:
- Intra-utterance pauses. People pause mid-sentence when thinking. "Can you book me a flight to... um... Boston?" contains a pause that a naive VAD might interpret as end-of-turn.
- Environmental noise. A cough, a breath, a door closing — all produce audio energy events that interact with VAD unpredictably.
- Trailing fragments. A speaker who says "Yeah" or "Okay" after a longer utterance may leave an ambiguous silence that's hard to distinguish from a completed turn.
⚠️ Common Mistake: Assuming that VAD alone solves turn detection. VAD detects audio activity, not conversational completeness. A system that relies purely on silence thresholds will either cut off users mid-thought (threshold too short) or introduce unnecessary delay (threshold too long). Real systems layer additional signals — acoustic endpointing models trained on conversational data, semantic completeness heuristics, or a combination — on top of basic VAD.
The cost of getting turn detection wrong is asymmetric in an instructive way. Cut off too early, and the agent interrupts before the user is done. Wait too long, and the latency budget is spent waiting for a silence threshold to expire. This constraint has no analog in text-based AI systems, and it requires deliberate architectural investment rather than a configuration knob.
Text Interface: Voice Interface:
User types... Audio stream begins...
User types... [system monitoring for end-of-turn]
User presses Send ──────► [VAD detects silence]
[acoustic model confirms]
[endpointing threshold met] ──────►
System processes System processes
Streaming Output: You Cannot Wait for the Full Response
In a text chat interface, the system can afford to wait for the language model to finish generating before rendering any output. Voice has no such tolerance. You cannot wait for the full LLM response before beginning to synthesize and play speech.
Here is the arithmetic. A language model generating a moderate response — say, three to five sentences — might take one to two seconds to produce the complete text. A text-to-speech system then needs additional time to synthesize the audio. If these steps run sequentially, the combined delay easily exceeds the latency budget before a single word has been spoken.
The solution is streaming synthesis: the system begins synthesizing and playing speech as soon as the first few words or tokens arrive from the LLM, rather than waiting for the complete output.
❌ Sequential (breaks latency budget):
LLM ──[generates full response]──► TTS ──[synthesizes full audio]──► Speaker
←── 1-2 seconds ───────────► ←── 0.5-1 second ─────────►
(silent) (still silent)
✅ Streaming (fits latency budget):
LLM ──[token 1]──► TTS ──[chunk 1 audio]──► Speaker (starts playing)
──[token 2]──► TTS ──[chunk 2 audio]──► Speaker
──[token 3]──► TTS ──[chunk 3 audio]──► Speaker
──[ ... ]──► ...
First audio plays while LLM is still generating.
This imposes real constraints on component selection. Not all TTS services support streaming synthesis — some require the full input text before they begin. The natural unit for streaming TTS is often a sentence or clause boundary, not individual tokens: speaking word-by-word at token speed produces unnatural, choppy output. Real streaming pipelines buffer a few tokens until they have a complete syntactic unit, then synthesize and play that chunk while the LLM continues generating the next one.
💡 Pro Tip: When evaluating TTS services, time-to-first-audio (TTFA) — the latency to the first audio byte — is a more relevant metric than throughput or audio quality alone. A service that produces excellent audio but requires the full input text before starting will structurally break your latency budget regardless of how fast it synthesizes.
How the Three Constraints Force a Pipeline Shape
Taken individually, each constraint is a solvable engineering problem. What makes voice agent architecture genuinely distinct is that all three apply simultaneously and interact.
| Constraint | Text system assumption | Voice system requirement |
|---|---|---|
| Latency budget | Multi-second response acceptable | Sub-second end-to-end mandatory |
| Turn detection | User signals completion explicitly | System must infer completion from audio |
| Streaming output | Buffer full response, then render | Stream partial output as it generates |
The interaction is what forces the specific pipeline shape. Streaming output requires starting TTS before the LLM finishes. You can only start the LLM once you have a transcript. You can only start transcription once turn detection fires. These dependencies create a pipeline where each stage begins as soon as it has enough input, rather than waiting for the previous stage to complete. The latency budget then requires minimal buffering at each stage boundary — any stage that buffers its full output before handing off reintroduces the sequential delay you were trying to eliminate.
The result is an architecture that looks less like a series of API calls and more like a streaming graph:
Voice Agent Pipeline (simplified):
Microphone
│
▼
[Audio Capture] ──stream──► [VAD / Turn Detection]
│
(end-of-turn signal)
│
▼
[ASR / Transcription]
│
(streaming tokens)
│
▼
[LLM Inference]
│
(streaming text tokens)
│
▼
[TTS Synthesis] ──stream──► [Audio Playback]
(Each stage begins as soon as it has sufficient input —
not when the previous stage is complete.)
⚠️ Common Mistake: Framing voice agent development as "adding audio I/O to a chatbot." This framing suggests the underlying system architecture can stay the same and audio is a wrapper around it. The three constraints above show why this fails: the latency budget, turn detection requirement, and streaming output need all change how data flows through the system at a structural level.
With these three constraints established, the natural next question is: what does the data flow that satisfies all of them actually look like in practice? That's the territory of the next section, which maps the concrete movement of audio through a voice agent system and explains how the pipeline stages must interlock — and what happens when the user speaks while the agent is already talking.
The Duplex Loop: How a Voice Agent Moves Data
Every voice agent, regardless of how it is assembled, does the same fundamental thing: it moves audio in one direction while simultaneously managing audio in the other. The term duplex comes from telecommunications and describes whether a channel can carry information in both directions. In voice agents, it is not merely a transport property; it is a behavioral contract that determines how the system handles the most natural thing a human does in conversation: speaking while the other party is still talking.
Half-Duplex: Taking Turns
In a half-duplex voice agent, the microphone and speaker are never active at the same time. The system listens, then speaks, then listens again — strict alternation, like a walkie-talkie.
HALF-DUPLEX LOOP
User speaks Agent processes Agent speaks
┌──────────┐ ┌─────────────────┐ ┌──────────┐
│ 🎤 MIC │──────► │ ASR → LLM → TTS │ ───► │ 🔊 PLAY │
│ ACTIVE │ └─────────────────┘ │ ACTIVE │
└──────────┘ └──────────┘
▲ │
│ (mic is OFF during playback) │
└─────────────────────────────────────────────┘
cycle repeats
The appeal is simplicity. Because the microphone is suppressed while audio plays, the system never has to ask: "Is that the user talking, or is it hearing its own speaker output?" There is no acoustic feedback problem, no ambiguity about whose turn it is, and no need to detect whether a user wants to interrupt. The state machine is trivially small.
For a large class of applications — structured forms, appointment booking, simple Q&A over a narrow domain — this is entirely sufficient. But half-duplex has a ceiling. The moment a user tries to interrupt — to correct a misunderstanding mid-sentence, to say "wait, actually" — nothing happens. The agent keeps talking. In usability research, this pattern reliably reads as robotic or dismissive, not because the content is wrong but because the turn-taking rhythm violates what humans expect from a listening party.
💡 Mental Model: Think of half-duplex as a formal debate: each side has the floor, and interrupting is out of order. That works for debate; it fails for the kind of fluid back-and-forth that characterizes helpful conversation.
Full-Duplex: Simultaneous Streams
Full-duplex operation means the microphone and speaker are active at the same time. The system is always listening, even while it is speaking. This is how ordinary telephone calls work.
FULL-DUPLEX LOOP
┌──────────────────────────────────────────────────────┐
│ │
│ 🎤 MIC (always active) │
│ │ │
│ ▼ │
│ ┌─────────┐ ┌─────┐ ┌─────┐ ┌─────────┐ │
│ │ ASR │───►│ LLM │───►│ TTS │───►│ PLAY │ │
│ └─────────┘ └─────┘ └─────┘ └─────────┘ │
│ │ ▲ │
│ │ interrupt detected? │ │
│ └────────────────────────────────────┘ │
│ if yes: discard in-flight audio │
│ cancel pending TTS, re-enter listen │
└──────────────────────────────────────────────────────┘
The immediate consequence is that the user can barge in — speak over the agent mid-sentence and have the system respond to the interruption. Done well, this feels natural. Done poorly, it produces one of the more spectacular failure modes in voice system design.
The Barge-In Problem
Barge-in handling is the clearest practical difference between half and full duplex. When a user starts speaking while the agent is playing audio, the full-duplex system must do several things nearly simultaneously:
Detect the interruption. The system is receiving microphone audio that contains both the user's voice and acoustic bleed from the speaker. A naive voice activity detector will fire on the agent's own playback. Robust interrupt detection requires either acoustic echo cancellation (AEC) — subtracting the known speaker signal from the mic input — or a secondary model that classifies whether captured audio represents user intent versus ambient noise.
Stop playback and abandon in-flight generation. Any audio currently queued for the speaker must be discarded immediately. If the LLM is still producing tokens and TTS is still synthesizing audio from those tokens, that entire chain needs to be cancelled.
Reconstruct context accurately. The conversation history must reflect what the agent actually said before being interrupted — not the full response it was in the process of generating. If the agent was mid-sentence saying "The earliest available slot is Tuesday at 3pm, and we also have—" and the user interrupted after "Tuesday," the context entry should capture only the delivered portion. Logging the full intended response would cause the LLM to believe it communicated information it never actually delivered.
⚠️ Common Mistake: Logging the full intended agent response to conversation history regardless of whether it was interrupted. Systems that skip this step will have an LLM that thinks it explained something thoroughly when it said almost nothing — leading to confused, circular follow-up responses.
The Five-Stage Loop and Why Pipelining Matters
Regardless of duplex mode, a voice agent's processing decomposes into five conceptual stages:
CAPTURE ──► TRANSCRIBE ──► REASON ──► SYNTHESIZE ──► PLAY
(mic) (ASR) (LLM) (TTS) (speaker)
The temptation — especially coming from a text-based API background — is to treat these as sequential blocking steps: capture all the audio, transcribe it all, send the complete transcript to the LLM, wait for the complete response, synthesize it all, then play. This is the correct mental model for a REST API call. It is the wrong mental model here.
🎯 Key Principle: The five stages must be pipelined and overlapping, not sequential and blocking. Each stage begins processing as soon as it has enough input to start.
BLOCKING (sequential) — DO NOT DO THIS
t=0 Capture ████████
t=1 Transcribe ████████
t=2 Reason ████████
t=3 Synthesize ████
t=4 Play ██
Total latency: sum of all stages
PIPELINED (overlapping) — TARGET THIS
t=0 Capture ████████
t=0.5 Transcribe ████████
t=1 Reason ████████████
t=1.2 Synthesize ████
t=1.5 Play ████████
Total latency: longest critical path, not sum
The difference is not marginal. Stacking the full latency of each component sequentially produces end-to-end delay that comfortably exceeds what feels responsive in conversation. Two concrete examples of pipelining in action:
ASR streaming: Rather than waiting for the user to stop speaking and batch-transcribing the complete audio, a streaming ASR system emits partial transcript tokens as speech arrives. By the time the user finishes speaking, the transcript is already substantially complete.
LLM-to-TTS handoff: Rather than waiting for the LLM to generate a complete response and then sending the full text to TTS, the system begins synthesizing speech from the first sentence or clause as soon as it arrives. The LLM is still generating while TTS is already producing audio for the beginning of the response.
In a well-designed full-duplex system, the loop never truly stops. The capture and VAD stages are always running; interrupt detection runs as its own concurrent path during playback:
FULL-DUPLEX STAGE CONCURRENCY
Stage Always-on? Triggered by
─────────────────────────────────────────────────────
Capture ✓ yes hardware clock
VAD ✓ yes capture output
ASR ✓ partial VAD activity
LLM ✗ no end-of-turn signal
TTS ✗ no LLM token stream
Play ✗ no TTS audio chunks
Interrupt det. ✓ yes capture output (during play)
Notice that interrupt detection is not a step that happens after the play stage — it is a separate always-on path that monitors capture continuously and can trigger a cancellation event into the play, TTS, and LLM stages at any moment.
Half-Duplex vs. Full-Duplex: A Concrete Comparison
| 🎙️ Half-Duplex | 🔄 Full-Duplex | |
|---|---|---|
| Mic during playback | ❌ Off | ✅ On |
| User can interrupt | ❌ No | ✅ Yes |
| Echo cancellation needed | ❌ No | ✅ Usually |
| Context management complexity | 🟢 Low | 🔴 High |
| Implementation complexity | 🟢 Lower | 🔴 Higher |
| Conversational naturalness | 🟡 Moderate | 🟢 High |
| Failure mode | Agent ignores user | Context corruption |
The failure mode column deserves emphasis. A half-duplex agent that ignores interruptions is annoying — users understand what is happening and can adapt. A full-duplex agent that handles barge-in poorly is worse in a subtler way: it produces confident-sounding responses based on a corrupted context, meaning the conversation drifts into incoherence without the user having a clear understanding of why.
This asymmetry is worth holding on to: half-duplex fails visibly and understandably; full-duplex fails invisibly and confusingly. For teams building their first voice agent, the lower complexity ceiling of half-duplex is often the right starting point — not because it produces the best experience, but because it produces a debuggable one. Full-duplex becomes worth the investment when interruption handling is a genuine user need for the specific application, not just a theoretical quality improvement.
The duplex loop is the skeleton on which everything else in voice agent design hangs. Whether you are choosing between ASR providers, deciding how to stream LLM output, or debugging why your agent sounds confused after an interruption, the question underneath is almost always: where in this loop did the data take the wrong path, wait too long, or carry the wrong state? The next section addresses how you assemble that loop in practice.
Two Ways to Build: Compose-Your-Own vs. Hosted Realtime API
Once you accept that a voice agent is fundamentally a duplex audio loop with a hard latency budget, the next question becomes structural: how do you actually assemble one? There are two dominant architectural patterns, and they differ not just in implementation detail but in what visibility, flexibility, and risk they hand to you.
The Composed Pipeline: You Own the Wiring
The compose-your-own pattern means wiring together distinct, independently operating services — or local models — into a pipeline you orchestrate yourself:
[Microphone / Audio capture]
│
▼
┌─────────────┐
│ ASR Service │ ← Converts raw audio → text transcript
└──────┬──────┘
│ transcript text
▼
┌─────────────┐
│ LLM Service│ ← Converts transcript → response text
└──────┬──────┘
│ response tokens (streaming)
▼
┌─────────────┐
│ TTS Service│ ← Converts text → audio chunks
└──────┬──────┘
│ audio chunks
▼
[Speaker / Audio playback]
Each box is a separate system. Your orchestration layer handles the plumbing: shuttling audio to ASR, receiving transcripts, passing them to the LLM, streaming tokens to TTS, and playing back audio.
The critical property of this architecture is independent substitutability. If a new ASR model cuts word error rate significantly on your target domain, you swap it in without touching TTS or the LLM. If your LLM cost spikes and you want to evaluate a different provider, you change the orchestration layer's LLM client and nothing else. A customer-support voice agent built for a domain with dense technical vocabulary — part numbers, product codes — can use a domain-tuned ASR model while pairing it with a general-purpose LLM and a TTS system chosen purely for voice quality. In a hosted realtime API, this combination might not be possible.
The cost of this flexibility is integration work and latency accumulation. Every service boundary is a round trip. Audio leaves your system to reach the ASR endpoint, the transcript leaves to reach the LLM endpoint, tokens leave to reach TTS, and audio returns for playback. Even with streaming at each stage, each hop adds network latency and processing overhead that stacks.
⚠️ Common Mistake: Assuming that because each component responds in, say, 200ms, the composed pipeline adds only 200ms total. In reality, you add the latency of each stage: ASR processing time plus network round-trip, LLM time-to-first-token, and TTS time-to-first-audio-chunk. The sum can easily exceed the ~800ms threshold.
Debugging a composed pipeline is, however, relatively tractable. Each service produces its own logs. Each API call has a latency you can measure. If TTS is slow, you can see it. If ASR is producing poor transcripts, you can log them. You also receive separate bills per service, which makes cost attribution straightforward.
The Hosted Realtime API: Single Endpoint, Opaque Interior
The hosted realtime API pattern takes the opposite stance. Instead of wiring together components, you open a single persistent connection — typically a WebSocket or a WebRTC session — send raw audio to one end, and receive audio back from the other. The vendor's infrastructure handles speech recognition, language model reasoning, and speech synthesis internally, without exposing those stages as separable units.
[Microphone / Audio capture]
│
▼
┌───────────────────────────────────┐
│ Hosted Realtime Endpoint │
│ ┌─────┐ ┌─────┐ ┌─────┐ │
│ │ ASR │ -> │ LLM │ -> │ TTS │ │ ← Internal pipeline,
│ └─────┘ └─────┘ └─────┘ │ not visible to you
└───────────────────┬───────────────┘
│ (one WebSocket / WebRTC session)
▼
[Speaker / Audio playback]
Your application code connects to one endpoint, handles one event stream, and manages one session. There is no orchestration layer to write, no component-level retry logic to implement, and no per-service authentication to maintain. For a prototype, this is a significant acceleration.
The tradeoff is opacity. You do not know which ASR model is running, whether it's been updated between your Tuesday and Wednesday tests, or how the vendor's internal pipeline is sequencing stages. You cannot replace the ASR with a domain-tuned alternative or swap the TTS voice for one from a different provider. If the LLM component inside the hosted API starts producing different outputs after an undocumented update, you have no independent transcript to inspect to distinguish an ASR regression from an LLM behavioral change.
🎯 Key Principle: A hosted realtime API gives you a thin, clean interface at the cost of control over the interior. Everything that flows through it — model versions, internal routing, pipeline tuning — is the vendor's decision, not yours.
Latency: Fusion vs. Queuing
The latency story for hosted realtime APIs runs in both directions. On one hand, a vendor can fuse pipeline stages server-side in ways that are structurally impossible when stages are separate services — beginning TTS synthesis from partial LLM outputs without any inter-service network hop. On the other hand, a hosted realtime API adds vendor-side queuing that you cannot observe. When the vendor's infrastructure is under load, your request waits in a queue you have no window into.
Composed pipeline latency profile:
──────────────────────────────────────────────────────
ASR network + processing: ~150–400ms (measurable)
LLM time-to-first-token: ~200–600ms (measurable)
TTS time-to-first-audio: ~100–300ms (measurable)
Hosted realtime API latency profile:
──────────────────────────────────────────────────────
Vendor internal pipeline: varies (NOT measurable by you)
Possible server-side fusion: saves 1–2 round-trips
Possible vendor queue delay: adds unpredictable overhead
→ Potentially lower median; less predictable tail latency
(Actual latency depends heavily on network topology, model size, and server load. The diagram shows the structural difference in observability, not precise timing.)
Cost and Debugging: Two Very Different Surfaces
With a composed pipeline, your cost is itemized by component. You can calculate the approximate cost per conversation minute by summing ASR, LLM, and TTS rates. If costs increase, you can attribute the change to a specific component and substitute just that one.
With a hosted realtime API, billing is typically consolidated — often per audio-minute of session time. The consolidated nature makes total cost simple to reason about, but harder to optimize.
Debugging follows the same split. In a composed pipeline, when something goes wrong, you have artifacts at every stage: the raw audio, the transcript returned by ASR, the full message history sent to the LLM, and the audio produced by TTS. A user complaint that "the agent misunderstood me" is quickly triaged by checking the ASR transcript. With a hosted realtime API, your debugging surface is the event stream the API emits — typically JSON messages indicating session state changes, transcript snippets if provided, and error codes. The internal pipeline is not inspectable.
💡 Mental Model: Think of the difference as transparent plumbing versus a sealed appliance. Composed pipelines are transparent — you can see every pipe and replace any section. A hosted realtime API is a sealed appliance — it does the job reliably under normal conditions, but when something unusual happens, you're limited to what the manufacturer exposes on the outside.
Choosing Between Them
| 🔧 Compose-Your-Own | ⚡ Hosted Realtime API | |
|---|---|---|
| Model selection | Choose independently per stage | Vendor-determined |
| Debugging surface | Per-component logs and artifacts | Single opaque event stream |
| Cost visibility | Itemized per service | Consolidated per session |
| Latency character | Accumulated round-trips; measurable | Potentially lower median; opaque tail |
| Time to first prototype | Higher (integration work) | Lower (single connection) |
| Component substitution | Any component independently | Not available |
| Observability | High | Low |
If you are building a prototype to validate whether a voice agent interaction model works at all, a hosted realtime API is often the faster path. If you are building a production system with requirements around domain-specific ASR accuracy, particular voice personas, cost predictability at scale, or on-premises deployment for data residency — a composed pipeline is the more appropriate foundation.
⚠️ Common Mistake: Assuming that because you're using a hosted realtime API, your application has no architecture decisions to make. You still own conversation state management, error recovery on dropped connections, and interrupt handling logic. This misconception is common enough that it gets its own treatment in the next section.
🤔 Did you know? These two patterns aren't mutually exclusive in a larger system. Some production deployments use a hosted realtime API for the primary happy-path conversation while falling back to a composed pipeline for specific interaction types — such as when a domain-tuned ASR model is needed for a particular menu option.
Where Mental Models Break: Three Common Misconceptions
The three misconceptions covered here share a common origin: they are all reasonable intuitions imported from text-based AI systems that happen to be wrong in a real-time voice context. Each one produces a system that works in a demo, degrades in testing, and fails in production — not because of bugs, but because the underlying mental model was off from the start.
Misconception 1: The Pipeline Is Sequential Steps
The most pervasive mistake is treating the voice pipeline as a waterfall: wait for the user to finish speaking, transcribe the full audio, pass the complete transcript to the LLM, wait for the full LLM response, then hand the complete text to TTS. This is the natural shape for a text chatbot, and it is catastrophically wrong for voice.
As established earlier, voice breaks conversational illusion above roughly 800ms end-to-end. A sequential pipeline stacks delays multiplicatively:
Sequential (waterfall) pipeline:
[User speaks 3s] → [ASR buffers full audio: +300ms]
→ [LLM receives full transcript, generates response: +600ms]
→ [TTS receives full text, synthesizes audio: +400ms]
→ [Audio begins playing]
Perceived latency after speech ends: 1300ms ← already over budget
Each component is a blocking stage: it waits for complete input before producing any output. The correct mental model replaces this with a streaming graph, where each stage begins emitting partial outputs as soon as it has enough input.
The specific smell of this mistake in code:
## ❌ Sequential / buffering anti-pattern
full_transcript = transcriber.transcribe_complete(audio_buffer) # blocks until done
full_response = llm.complete(full_transcript) # blocks until done
audio_out = tts.synthesize(full_response) # blocks until done
play(audio_out)
versus the correct shape:
## ✅ Streaming / pipelined pattern (conceptual pseudocode)
for transcript_chunk in transcriber.stream(audio_stream):
for response_token in llm.stream(transcript_chunk):
for audio_chunk in tts.stream(response_token):
audio_out.write(audio_chunk) # plays while still generating
(In practice you'd also manage backpressure, buffer sizes, and sentence-boundary detection before handing chunks to TTS — the later implementation lessons will make this concrete.)
⚠️ Common Mistake: Buffering the LLM output to "clean it up" before passing to TTS. Even a small cleaning pass that waits for a complete response adds hundreds of milliseconds. Prefer streaming TTS that handles natural text with light post-processing on the audio side rather than text buffering.
Misconception 2: Silence Solves Turn Detection
The second misconception is subtler and harder to debug because it works most of the time. The assumption: when the user goes quiet, they are done speaking. Therefore, detect silence, end the turn, and pass the transcript to the LLM.
Voice Activity Detection (VAD) detects whether audio contains human speech based on audio energy levels — amplitude and spectral characteristics. This is fast, runs in real time, and works well in controlled conditions. The problem is that VAD is not a semantic model. It does not know what a sentence is. It knows that audio energy dropped below a threshold.
Things that fire VAD incorrectly:
- A cough or throat-clear — high-energy non-speech event; VAD fires "speech end" after it, potentially cutting off the user mid-thought
- An inhalation before a new clause — brief silence in the middle of a sentence
- Background noise that cuts out — its sudden absence looks like "speech stopped"
- Filler words with trailing silence — "um..." followed by a pause while the user thinks
- DTMF tones, notification sounds, or a second person in the background
VAD behavior (energy-based):
Audio energy over time:
████▄▄████████▄▄▄▄████████████▄▄▄▄▄▄▄▄
↑↑ ↑↑↑↑ ↑↑↑↑↑↑↑↑
breath pause silence →
fires fires fires turn end
"speech "speech end" "speech end"
end"
What actually happened:
"So the answer is... um... I think it depends on—"
^
Agent interrupts here
Real production systems layer additional approaches on top of energy VAD:
Minimum silence duration tuning adds a configurable hold-off: don't end the turn until silence has persisted for, say, 700ms rather than 200ms. This reduces false triggers from breath pauses at the cost of adding baseline latency to every real turn end.
Model-based endpointing uses a small classifier trained to predict whether a spoken segment is semantically complete — did the user just finish a thought, or are they mid-sentence with a pause? These models consider prosodic features (pitch, rhythm) alongside energy.
Transcript-based confirmation inspects the partial transcript once VAD fires a candidate end-of-turn — does it end with a question mark? Is the grammar closed? This is inherently approximate and language-specific, but in practice it catches many obvious cases.
The key reframe is this: turn detection is a prediction problem, not a detection problem. You are predicting whether the user has finished their communicative intent, not merely detecting the absence of sound. VAD is a necessary signal, but it is not a sufficient answer.
⚠️ Common Mistake: Setting VAD silence thresholds too low (under ~300ms) to feel "responsive" creates a system that constantly interrupts users mid-thought. Setting them too high (over ~1200ms) adds noticeable dead air at every turn transition. The right threshold depends on your deployment context and must be tuned empirically.
Misconception 3: Hosted API Means No Architecture Decisions
The third misconception is the most dangerous because it is partially true. The reasoning goes: if a hosted realtime API handles ASR, LLM, and TTS internally and exposes a single audio-in / audio-out WebSocket, then the hard architectural decisions have been made for you. You just connect and ship.
A hosted API does abstract away the internal pipeline. What it does not abstract away is the conversation-layer logic that sits around the pipeline — and that logic is where the hard problems live in production. Three specific responsibilities remain yours regardless of which hosted API you use.
Conversation State Management
A hosted API processes audio and returns audio. It does not inherently know what persona your agent has, what task it is in the middle of, what the user said three turns ago, or what tools it has access to. Conversation state — the accumulated context that gives each new utterance meaning — is something you must explicitly maintain and inject.
Your responsibility (hosted API):
┌─────────────────────────────────────────────┐
│ Your application layer │
│ │
│ ┌──────────┐ ┌──────────────────────┐ │
│ │ Session │ │ Conversation state │ │
│ │ manager │ │ (history, context, │ │
│ └────┬─────┘ │ task progress) │ │
│ │ └──────────┬───────────┘ │
└───────┼─────────────────────┼──────────────┘
│ │
▼ ▼
┌───────────────────────────────────────────┐
│ Hosted Realtime API │
│ (ASR + LLM + TTS, all internal) │
└───────────────────────────────────────────┘
Concretely, this means deciding when to summarize or prune conversation history to stay within context limits, how to carry task-specific state, and what happens to state when a session times out and resumes.
Error Recovery on Dropped Connections
WebSocket connections drop. Networks are unreliable, especially on mobile clients. When the connection breaks mid-conversation, you need a strategy: does the agent resume from where it left off? Does it greet the user again from scratch? A common concrete failure mode: a connection drops while the agent is mid-sentence. When the client reconnects, the conversation history may show an incomplete agent utterance, confusing the LLM's understanding of where the conversation stands. You need to detect this case, decide whether to replay, summarize, or skip, and update the history accordingly.
⚠️ Common Mistake: Treating a WebSocket disconnect as a fatal error rather than a recoverable event. In mobile environments especially, brief disconnects are routine. A production system must implement exponential-backoff reconnection, idempotent session resumption, and explicit handling for the "what did I miss?" state update on reconnect.
Interrupt Handling Logic
As covered in the duplex loop section, a hosted API may detect that the user has started speaking and stop its own audio output — but it cannot decide what that interruption means for your conversation state.
Consider: the agent is three sentences into a five-sentence response. The user interrupts with "wait, go back to the part about pricing." The hosted API stops speaking. The conversation history still contains the full five-sentence response as if it were delivered. The next user utterance will only make sense if the agent knows where it was interrupted. Interrupt handling requires you to detect when a barge-in occurred, record the approximate point of interruption, update conversation history to reflect partial delivery, and decide whether the agent should acknowledge the interruption or simply respond to the new input.
| 🔒 Hosted API owns | 🔧 You own | |
|---|---|---|
| Signal processing | Audio encoding, VAD, ASR | — |
| Language | LLM inference, TTS synthesis | System prompt, context injection |
| Transport | WebSocket/WebRTC connection | Reconnection logic, session IDs |
| Conversation | Turn-by-turn audio exchange | History management, state pruning |
| Interrupts | Stopping audio output on barge-in | State updates, history correction |
💡 Mental Model: A hosted API is like a skilled translator — it handles the mechanics of converting meaning between audio and language fluently. But it does not know what the conversation is about, what has been agreed, or what to do when the phone gets dropped. Those are your responsibilities as the conversation designer and application developer.
Connecting the Three
These three misconceptions are not independent. They form a cluster rooted in the same underlying error: importing the request-response mental model from text AI into a domain that fundamentally requires a stream-and-state mental model.
In a text chatbot, requests are discrete, turns are unambiguous (the user pressed Send), and conversation state lives neatly in a message array. In a voice agent, data is continuous, turn boundaries are inferred, and state is dynamic and easily corrupted by partial deliveries, dropped connections, or mid-sentence interruptions.
- Buffering is what you do when you think in requests, not streams.
- Trusting silence is what you do when you think in discrete messages, not continuous audio.
- Assuming the hosted API handles architecture is what you do when you forget that conversation state is a cross-cutting concern that no single component owns.
The mental model from the earlier sections — a duplex audio loop with a latency budget, composed of overlapping streaming stages — is the corrective for all three. Keep that model active and these mistakes become detectable before you write the code that embeds them.
Key Takeaways and What Comes Next
You arrived at this lesson with intuitions shaped by text-based AI systems, REST APIs, and request-response thinking. That intuition is useful in many places, but it actively misleads in voice agent design. Here is the consolidated mental model and a map of where each piece becomes concrete in the lessons ahead.
The Mental Model in One View
| Concept | What it means | Why it matters later |
|---|---|---|
| Duplex audio loop | The fundamental shape of every voice agent | Explains why all pipeline decisions exist |
| Latency budget | ~800ms perceptual threshold for response start | Forces streaming at every stage |
| Turn detection | System must infer when user has finished speaking | VAD alone is insufficient; needs heuristics or models |
| Half-duplex | Agent and user take strict turns | Simpler state; right for many production use cases |
| Full-duplex | Simultaneous audio; enables barge-in | Required for natural conversation; adds state complexity |
| Compose-your-own | Wire ASR + LLM + TTS yourself | Control, observability, component-swap freedom |
| Hosted realtime API | Single endpoint, audio in/audio out | Fast prototype; you still own app-level state |
| Sequential batching | Buffering full outputs before passing them forward | Stacks latency; must be replaced with streaming |
The single most durable shift is this: a voice agent is not a chat interface with audio bolted on. It is a duplex audio loop operating under a hard latency budget, where every architectural decision flows from that constraint. The 800ms perceptual threshold isn't a performance target — it is an architectural constraint that shapes every component choice downstream.
Half-duplex and full-duplex are not just implementation details — they represent fundamentally different contracts with the user. Many production voice agents are half-duplex, and for structured data collection, scheduled reminders, and simple Q&A, that is entirely sufficient. The robotic quality of half-duplex only becomes a real liability when users will predictably want to interrupt and the conversation is expected to feel fluid. Choosing between them is not a question of which is better; it is a question of what your use case requires and what complexity your team can maintain.
The compose-your-own vs. hosted API choice is a trade-off between control and observability on one side, and speed-to-prototype on the other. A hosted realtime API does not eliminate architecture decisions — you still own conversation state, error recovery on dropped connections, and interrupt handling logic. The API handles the audio pipeline; it does not handle your application.
The three misconceptions — sequential batching, trusting silence for turn detection, and assuming the hosted API handles everything — are all expressions of the same error: importing request-response thinking into a stream-and-state domain. The reflex to build against is asking, at each design decision: Am I treating this like a text API? Am I stacking latency unnecessarily? Am I assuming the API handles something that I actually own?
The Road Ahead
This lesson has been deliberately abstract — its job is to give you the conceptual skeleton before any code appears.
Implementing the loop is the immediate next step: building the composed pipeline end-to-end from microphone capture through ASR, LLM, and TTS to audio output. The latency budget discussed here becomes measurable — you will instrument the pipeline and observe where milliseconds accumulate. This implementation starts in half-duplex mode, which isolates the data-flow problem from the state-management problem.
Latency and cost trade-offs come once the loop is running. Not all ASR models have the same latency profile. Not all TTS engines produce audio at the same rate. LLM token generation speed varies substantially across deployment configurations. You will make concrete substitutions in the pipeline and measure the results — the mental model you built here is precisely what makes those measurements interpretable.
Full-duplex and state management come later, after the composed pipeline is solid. By that point you will have enough hands-on familiarity with the pipeline to understand why interruption creates a state problem, not just that it does.
The progression is intentional: half-duplex before full-duplex, composed pipeline before hosted API. Each layer adds complexity that only makes sense once the simpler version is understood. Every lesson that follows will assume you are reasoning from the duplex loop model, and the implementation choices will make far more sense because of it.