
VAD: Detecting Speech

Lesson 3 — Use Silero VAD to detect speech boundaries: the start/end event model, threshold and silence-duration knobs, and the pre-roll buffer.

Why VAD Exists: The Speech Detection Problem

Imagine you've just wired up a microphone to an automatic speech recognition system. You hit record, say "Schedule a meeting for Tuesday," and wait for the transcript. What you probably don't think about — until it bites you — is that the microphone was already running for several seconds before you spoke, and it kept running for several more after you finished. In that time it captured your HVAC system cycling on, your chair creaking, and a solid 800 milliseconds of silence. Your ASR system dutifully processed all of it. The transcript came back as a hallucinated fragment, the latency was higher than expected, and the per-second inference bill quietly ticked upward.

This is the problem Voice Activity Detection (VAD) exists to solve. VAD is the gating mechanism at the entrance of any real-time voice pipeline — the component whose sole job is to answer, for every moment in a continuous audio stream, a single binary question: is a human speaking right now? Getting that answer right, fast, and reliably is harder than it looks, and the consequences of getting it wrong cascade through everything downstream.

The Fundamental Mismatch: Continuous Audio, Discrete Events

Audio captured from a microphone is a continuous stream of samples. At 16 kHz, you receive 16,000 amplitude measurements per second, every second, without pause — whether anyone is talking or not. This stream has no natural segment boundaries. It does not label itself.

Downstream components in a voice pipeline — ASR engines, language models, response generators — are designed to operate on utterances: bounded chunks of audio that represent a complete thought or command. Feeding them a raw, unbounded audio stream creates an immediate mismatch:

Raw audio stream (continuous, unlabeled):
────────────────────────────────────────────────────────────────────
 [silence] [noise] [speech: "book a flight"] [silence] [cough] [silence] [speech: "to Paris"]
────────────────────────────────────────────────────────────────────
          ↓  (no VAD: everything passes through)
 ASR receives: silence + noise + speech + silence + cough + silence + speech
 Result: degraded accuracy, wasted compute, unpredictable latency

Without VAD acting as a gatekeeper, ASR must process everything. Silence and background noise aren't neutral input: they actively confuse recognition models that were trained on speech-like audio. Many ASR engines will produce hallucinated words from sustained noise, particularly when the noise contains speech-like spectral content (fans, TVs, nearby conversations).

VAD solves this by converting the continuous stream into a discrete sequence of events:

          ↓  (VAD applied)
 Events emitted:
  t=1.2s → SPEECH_START
  t=1.9s → SPEECH_END   (utterance: "book a flight")
  t=4.1s → SPEECH_START
  t=4.6s → SPEECH_END   (utterance: "to Paris")

ASR only receives labeled speech segments. Silence, noise, cough — discarded.

This start/end event model is the central interface contract that VAD establishes. Upstream components supply raw PCM audio; VAD consumes that audio chunk by chunk and emits events that downstream components can act on cleanly.
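As a minimal sketch of this contract, the snippet below turns a per-chunk speech/non-speech decision into timestamped start and end events. The `VadEvent` type, the `events_from_mask` helper, and the 100 ms chunk duration are illustrative assumptions, not part of any library.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class VadEvent:
    kind: str       # "speech_start" or "speech_end"
    time_s: float   # position in the stream, in seconds


def events_from_mask(mask: Iterable[bool], chunk_s: float) -> Iterator[VadEvent]:
    """Emit an event whenever the per-chunk speech decision changes."""
    speaking = False
    index = -1
    for index, is_speech in enumerate(mask):
        if is_speech and not speaking:
            speaking = True
            yield VadEvent("speech_start", round(index * chunk_s, 3))
        elif not is_speech and speaking:
            speaking = False
            yield VadEvent("speech_end", round(index * chunk_s, 3))
    if speaking:
        # Stream ended mid-utterance: close the segment at the end of the stream.
        yield VadEvent("speech_end", round((index + 1) * chunk_s, 3))


# Hypothetical decisions for ten 100 ms chunks: speech in chunks 2-4 and 7-8.
mask = [False, False, True, True, True, False, False, True, True, False]
events = list(events_from_mask(mask, chunk_s=0.1))
for event in events:
    print(f"t={event.time_s:.1f}s -> {event.kind}")
```

Downstream code only ever sees the four events; the silent and noisy chunks never leave this function.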

💡 Mental Model: Think of VAD as the receptionist at the entrance to a recording studio. Without a receptionist, anyone and anything walks in — delivery people, random noise, the occasional actual musician. The receptionist's job isn't to evaluate the music; it's to ensure that only musicians get through the door. VAD does the same for speech: it doesn't transcribe or understand — it just decides who gets in.

Why Simple Amplitude Thresholding Fails

The obvious first attempt at VAD is also the most intuitive one: measure the energy (amplitude) of the audio signal and call it speech when the energy is high enough. This approach — often called energy-based detection — is trivial to implement and works reasonably well in a quiet room with a close-talking microphone. The problem is that none of those conditions are guaranteed in production.

Three failure modes are particularly damaging:

1. Background noise floors. Office HVAC systems, street noise, mechanical keyboard typing — many noise sources produce consistent energy levels that can meet or exceed a fixed amplitude threshold. A threshold calibrated to catch soft speech will fire continuously on a noisy floor. Raise the threshold to suppress noise, and soft speech gets missed. There is no single threshold value that works across both conditions without a reliable real-time noise estimate.

2. Soft speech and vocal variation. People don't speak at constant volume. Sentence-final words often drop in energy. Speakers on hands-free calls at distance may produce energy levels an order of magnitude lower than a headset user. Energy-based detection systematically clips the ends of utterances and misses quieter speakers.

3. Microphone gain variation. A threshold calibrated for one microphone will be wrong for another. Automatic Gain Control (AGC), present in many consumer audio stacks, dynamically adjusts gain in ways that shift the energy baseline continuously — making a fixed threshold a moving target.

Energy threshold (fixed): ─────────────────── 0.02 ───────────────────

Scenario A (headset, quiet room):
  Speech waveform:    ████████████████████  (energy: 0.05)  → DETECTED ✓
  HVAC noise:         ░░░░░░░░░░░░░░░░░░░░  (energy: 0.005) → SILENT   ✓

Scenario B (laptop mic, office):
  Speech waveform:    ██████████            (energy: 0.018) → MISSED   ✗
  HVAC + keyboard:    ░░░░░░░░░░░░░░░░░░    (energy: 0.025) → DETECTED ✗
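The two scenarios can be reproduced in a few lines. This is a minimal sketch of a fixed-threshold RMS detector; the function names and the synthetic 250 Hz tones are assumptions chosen only to match the energy values in the diagram.

```python
import math
from typing import List, Sequence

ENERGY_THRESHOLD = 0.02  # the fixed threshold from the diagram


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square energy of one chunk of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def energy_vad(samples: Sequence[float], threshold: float = ENERGY_THRESHOLD) -> bool:
    """Naive detector: call it speech whenever RMS energy exceeds the threshold."""
    return rms(samples) > threshold


def tone(rms_level: float, n: int = 512, freq_hz: float = 250.0, rate: int = 16000) -> List[float]:
    """Synthetic tone with the requested RMS (512 samples = 8 full periods)."""
    peak = rms_level * math.sqrt(2)
    return [peak * math.sin(2 * math.pi * freq_hz * i / rate) for i in range(n)]


scenarios = {
    "A: headset speech": 0.05,
    "A: HVAC noise": 0.005,
    "B: laptop-mic speech": 0.018,
    "B: HVAC + keyboard": 0.025,
}
results = {name: energy_vad(tone(level)) for name, level in scenarios.items()}
for name, detected in results.items():
    print(f"{name:22s} -> {'DETECTED' if detected else 'silent'}")
```

The detector is right in scenario A and wrong twice in scenario B, because the only thing it can see is level.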

VAD models address this by learning speech-specific features rather than relying on raw amplitude. A trained neural network learns that human speech has characteristic spectral structure — formants, voiced/unvoiced patterns, temporal modulation — that distinguishes it from stationary noise, impulsive noise, and music, even when their energy levels overlap. A team deploying a voice assistant in a call center environment discovered that their energy-based VAD was triggering almost continuously during hold music playback, which sits in the same energy band as soft speech. Switching to a neural VAD model resolved the spurious triggers immediately — the model had learned that hold music's steady-state spectral envelope doesn't match voiced speech patterns, despite similar RMS values.

VAD's Position in the Pipeline

VAD sits at the entry point of the voice pipeline — the first processing stage that every audio sample passes through. This position has an important architectural consequence: VAD's output latency and error rate set a hard ceiling on the performance of every component downstream.

Audio Source
     │
     ▼
┌─────────┐
│   VAD   │  ← Entry point. Errors here propagate everywhere.
└────┬────┘
     │  Speech events (start / end)
     ▼
┌─────────┐
│   ASR   │  ← Can only transcribe what VAD passes through.
└────┬────┘
     │  Transcript
     ▼
┌─────────┐
│   LLM   │
└────┬────┘
     │  Response
     ▼
  TTS / Output

If VAD adds 200 ms of latency before emitting a speech-start event, those 200 ms are permanently added to the end-to-end response time — no amount of optimization downstream can recover them. This makes VAD a first-class engineering concern: the choice of model, the configuration of its parameters, and the correctness of its integration determine the responsiveness ceiling for the entire system.

The Two Failure Modes That Matter

VAD errors come in two directions with meaningfully different consequences.

False negatives (missed speech start) occur when a speech-start event fires later than the actual onset of speech. The concrete consequence is clipped words: the first phonemes or syllables of an utterance are discarded before VAD opens the gate.

Actual speech:   |"Hey, can you schedule..."
VAD detection:        ↑ SPEECH_START fires here (200ms late)
ASR receives:         "can you schedule..."
User experience: First word "Hey" is lost.

For wake-word-based systems or any pipeline where the first word carries semantic weight, this failure mode is particularly damaging. One mechanism for partially mitigating it is the pre-roll buffer — a technique for retaining a short window of audio captured just before the speech-start event fires. This is covered in its dedicated child lesson.

False positives (noise classified as speech) occur when VAD fires a speech-start event in response to something that isn't speech — a door slam, background TV, a burst of keyboard noise. The consequence is a spurious ASR invocation: the pipeline wakes up, buffers audio, and sends a segment of non-speech to the transcription engine.

The costs compound quickly: every spurious invocation burns CPU/GPU cycles, the language model may receive spurious input and generate an unwanted response, and a voice agent that occasionally responds to ambient noise erodes user trust rapidly even if the absolute error rate is low.

📋 VAD Failure Modes

                    False Negative                    False Positive
─────────────────────────────────────────────────────────────────────────────────────
What happens        Speech onset missed or delayed    Noise triggers speech-start
Immediate effect    Leading audio discarded           Spurious ASR invocation
User impact         Clipped words, missed commands    Agent responds to ambient noise
Tuning direction    Lower detection threshold         Raise detection threshold

With this framing in place, the requirements for a production-grade VAD become concrete: distinguish speech from silence and noise across varied acoustic environments, emit speech-start events with minimal latency, remain configurable to trade off false negative rate against false positive rate, and operate in real time on fixed-size chunks without accumulated latency. Energy-based detection fails the first requirement. A well-designed neural model — such as Silero VAD — is engineered to meet all four.

How Silero VAD Works: Architecture and the Start/End Event Model

With the problem clearly framed, the next question is how a neural model solves it. Silero VAD answers the speech-presence question by running a compact neural network over short audio chunks and converting the resulting probability stream into the two discrete events — speech-start and speech-end — that the rest of your pipeline acts on. Understanding exactly how that conversion works, and why the model is stateful, is the foundation for integrating it correctly.

The Neural Network: Small by Design

Silero VAD is an open-source voice activity detection model built around a recurrent neural architecture. Its defining characteristic is that it is intentionally small — designed to run in real time on commodity hardware, including devices without a GPU, without introducing meaningful latency. Compactness is a first-class design goal: a VAD that can't keep up with the audio stream is useless regardless of its accuracy.

The model accepts audio as raw PCM samples. This lesson uses 16 kHz throughout (Silero VAD also supports 8 kHz input, which is not covered here). Audio captured at another rate and passed in as if it were 16 kHz will be processed without error but will produce unreliable probability scores, because the model has no way to detect the mismatch. (Handling the resampling step correctly is covered in the next section.)

The input unit is a chunk: a fixed-length window of audio samples. At 16 kHz, the standard chunk size is 512 samples, corresponding to exactly 32 milliseconds of audio (512 / 16,000 s). For each chunk the model produces exactly one output: a probability score between 0.0 and 1.0 representing the likelihood that speech is present.

Audio stream (16 kHz PCM)
│
├─── chunk 0 (512 samples / ~32 ms) ──► model ──► p = 0.03  (silence)
├─── chunk 1 (512 samples / ~32 ms) ──► model ──► p = 0.11  (silence)
├─── chunk 2 (512 samples / ~32 ms) ──► model ──► p = 0.71  (speech rising)
├─── chunk 3 (512 samples / ~32 ms) ──► model ──► p = 0.94  (speech)
├─── chunk 4 (512 samples / ~32 ms) ──► model ──► p = 0.97  (speech)
├─── chunk 5 (512 samples / ~32 ms) ──► model ──► p = 0.88  (speech)
├─── chunk 6 (512 samples / ~32 ms) ──► model ──► p = 0.21  (trailing off)
└─── chunk 7 (512 samples / ~32 ms) ──► model ──► p = 0.04  (silence)

The probability score alone is not an event — it is a raw signal. Converting it into the start/end events your pipeline needs requires a state machine layered on top.

Why the Model Is Stateful

Here is a subtlety that trips up many first integrations: Silero VAD is not stateless. Between calls, the model retains a recurrent hidden state — internal activations that carry information about what the model has heard so far. This is what allows it to distinguish, for example, a brief pause mid-sentence from the actual end of speech. The recurrent state gives the model a form of short-term memory that a purely feedforward architecture would lack.

The practical consequence is strict: chunks must be fed to the model in order, without gaps, and with no chunks duplicated. If you skip a chunk, the model's hidden state drifts out of sync with reality. If you feed chunks out of order, the probability scores will be unreliable in ways that are difficult to debug — the scores won't look obviously wrong, they'll just be subtly miscalibrated.

🎯 Key Principle: Silero VAD's recurrent state is a resource that must be managed with the same care as an open file handle or a database connection. It is valid only for one continuous audio session. When that session ends, the state must be explicitly reset before a new session begins.

From Probability Stream to Events: The State Machine

The state machine has two primary states: SILENCE and SPEECH. Transitions between them are governed by two parameters:

  • start_threshold — a probability value (typically around 0.5). When the model's per-chunk probability exceeds this value for a configurable number of consecutive chunks, the state machine transitions from SILENCE to SPEECH and fires a speech-start event.
  • end_threshold — a probability value (typically lower than start_threshold, around 0.35). When the probability drops below this value and stays below it for a configurable minimum silence duration, the state machine transitions from SPEECH to SILENCE and fires a speech-end event.

The asymmetry between start and end thresholds is intentional and important. Using a higher threshold to enter the SPEECH state makes the detector resistant to brief noise spikes. Using a lower threshold to exit it means the detector won't prematurely cut off a sentence just because a speaker paused briefly. This is the classic hysteresis pattern from control systems: make it harder to flip a state than to stay in it.

Probability stream over time:

  1.0 ┤                  ╭──────────╮
  0.9 ┤               ╭──╯          ╰──╮
  0.7 ┤             ╭─╯                ╰─╮
  0.5 ┤─ ─ ─ ─ ─ ─ ╯  start_threshold   ╰─ ─ ─ ─
  0.4 ┤                                          ╰──╮
  0.35┤─ ─ ─ ─ ─ ─ ─ ─ ─ ─ end_threshold─ ─ ─ ─ ─ ─╰─
  0.1 ┤                                               \_____
  0.0 ┤
       ──────────────────────────────────────────────────►  time

             ▲                               ▲
             │                               │
        SPEECH-START                    SPEECH-END
         event fires                    event fires
         (threshold                  (below end_threshold
         crossed and                  for min_silence_ms)
         held briefly)

The minimum silence duration parameter adds a time gate to the end transition. Even if the probability drops below end_threshold on a single chunk, the state machine won't fire a speech-end event until it has remained below the threshold for the configured duration — for example, 300 ms. This prevents a natural mid-sentence pause from being misread as the end of an utterance.

💡 Real-World Example: A speaker says: "I'd like to book a flight — [0.4s pause] — to Tokyo." Without a minimum silence duration, the 400 ms pause might fire a speech-end event, causing the pipeline to dispatch "I'd like to book a flight" to the ASR engine and start a new utterance with "to Tokyo." With a minimum silence duration of 500 ms, the pause is absorbed and the full sentence is treated as one utterance.

The Two-Event Interface Contract

The start/end event model is not just an implementation detail — it is an interface contract between Silero VAD and every component downstream.

                     ┌─────────────────────────────────┐
   Raw PCM stream    │                                 │   speech-start event
  ─────────────────► │        Silero VAD               │ ──────────────────────►
   (16 kHz, fixed    │   (neural net + state machine)  │   speech-end event
    chunk size)      │                                 │ ──────────────────────►
                     └─────────────────────────────────┘

  UPSTREAM CONTRACT:                    DOWNSTREAM CONTRACT:
  • Fixed-size, ordered PCM chunks      • React to speech-start: begin buffering
  • 16 kHz sample rate                  • React to speech-end: dispatch buffer
  • Explicit state reset between        • Do NOT treat probability score as ASR
    sessions                              confidence

Downstream consumers react to two event types:

  • 🟢 speech_start — speech has been detected; begin accumulating audio.
  • 🔴 speech_end — the utterance has concluded; the accumulated audio is ready for transcription.

Notice what is not in this contract: Silero VAD does not tell you what was said, how many words the utterance contained, or whether the audio is intelligible. Its sole claim is about presence — is a human speaking, or not? Treating the probability score as a proxy for audio quality or ASR confidence is a category error that leads to subtle bugs downstream.

The Tuning Knobs: A Preview

Three parameters directly control the state machine's behavior:

Parameter                  Controls                                                      Typical starting point
──────────────────────────────────────────────────────────────────────────────────────────────────────────────
start_threshold            Confidence required before declaring speech started           ~0.5
end_threshold              How far probability must fall before silence is recognized    ~0.35
min_silence_duration_ms    How long silence must persist before speech-end fires         ~300–500 ms

Lowering start_threshold makes the detector more sensitive but more prone to false starts. Raising min_silence_duration_ms makes end detection more robust but increases latency between when the speaker finishes and when the ASR call is made. There is also a fourth consideration these parameters don't directly address: audio captured before the speech-start event fires. By the time the state machine crosses the start threshold, the speaker has already been speaking for at least one or two chunks. That leading audio — the first consonant of the first word — is at risk of being lost. This is the problem the pre-roll buffer solves, covered in its dedicated child lesson.

To make the state machine concrete, here is a short worked trace:

Settings: start_threshold = 0.5, end_threshold = 0.35, min_silence_duration_ms = 300, chunk size = 32 ms, consecutive chunks needed to start = 2.

Chunk  Probability  State     Action
──────────────────────────────────────────────────────────────
  0      0.04       SILENCE   nothing
  1      0.08       SILENCE   nothing
  2      0.61       SILENCE   p > start_threshold, count = 1
  3      0.89       SILENCE   p > start_threshold, count = 2
                              → TRANSITION TO SPEECH
                              → FIRE speech_start event  🟢
  4      0.95       SPEECH    accumulate audio
  5      0.91       SPEECH    accumulate audio
  6      0.88       SPEECH    accumulate audio
  7      0.29       SPEECH    p < end_threshold, silence timer starts
  8      0.22       SPEECH    silence timer = 32 ms (< 300 ms)
  9      0.18       SPEECH    silence timer = 64 ms (< 300 ms)
  10     0.71       SPEECH    p > end_threshold, silence timer RESET
  11     0.88       SPEECH    speaker resumed mid-pause
  12     0.91       SPEECH    accumulate audio
  13     0.17       SPEECH    p < end_threshold, silence timer starts
  14     0.12       SPEECH    silence timer = 32 ms
  ...
  23     0.03       SPEECH    silence timer = 320 ms (≥ 300 ms)
                              → TRANSITION TO SILENCE
                              → FIRE speech_end event    🔴

Chunk 10 shows the silence timer being reset because the speaker paused and then resumed — exactly the behavior min_silence_duration_ms is designed to produce. Chunks 2–3 show the consecutive-chunk confirmation requirement: a single high-probability chunk isn't enough; the model must see sustained evidence before committing to a start event.
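The trace can be replayed with a small state machine. This is a sketch under stated assumptions, not Silero's own implementation: the `HysteresisVad` class and its field names are hypothetical, and the probabilities for the elided chunks 15 to 22 are filled with an assumed value of 0.05.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HysteresisVad:
    start_threshold: float = 0.5
    end_threshold: float = 0.35
    min_silence_ms: int = 300
    chunk_ms: int = 32
    start_chunks: int = 2                 # consecutive chunks needed to confirm a start
    speaking: bool = False
    _start_count: int = 0
    _silence_from: Optional[int] = None   # index of the first chunk in the current silence run

    def update(self, index: int, prob: float) -> Optional[str]:
        """Feed one per-chunk probability; return an event name or None."""
        if not self.speaking:
            self._start_count = self._start_count + 1 if prob > self.start_threshold else 0
            if self._start_count >= self.start_chunks:
                self.speaking, self._start_count = True, 0
                return "speech_start"
            return None
        if prob >= self.end_threshold:
            self._silence_from = None      # speaker resumed: reset the silence timer
            return None
        if self._silence_from is None:
            self._silence_from = index     # timer starts at 0 ms on this chunk
        elapsed_ms = (index - self._silence_from) * self.chunk_ms
        if elapsed_ms >= self.min_silence_ms:
            self.speaking, self._silence_from = False, None
            return "speech_end"
        return None


# Chunks 0-14 and 23 come from the trace; chunks 15-22 are an assumed 0.05.
probs = [0.04, 0.08, 0.61, 0.89, 0.95, 0.91, 0.88, 0.29, 0.22, 0.18,
         0.71, 0.88, 0.91, 0.17, 0.12] + [0.05] * 8 + [0.03]

vad = HysteresisVad()
events = []
for i, p in enumerate(probs):
    event = vad.update(i, p)
    if event:
        events.append((i, event))
print(events)  # [(3, 'speech_start'), (23, 'speech_end')]
```

The short dip at chunks 7 to 9 never produces an event, because the timer is cleared at chunk 10 before it reaches 300 ms.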

Integrating Silero VAD into a Streaming Audio Pipeline

Knowing what Silero VAD does conceptually is one thing; wiring it into a live audio loop without subtle bugs is another. The integration looks deceptively simple — read a chunk, call a model, check a number — but three categories of mistakes are nearly invisible at the code level and catastrophic at runtime: feeding audio at the wrong sample rate, managing stateful resets incorrectly, and misunderstanding the contract between VAD events and downstream buffers.

Step One: Ensuring the Correct Sample Rate

Silero VAD's neural network was trained on audio sampled at 16 kHz. The model's internal temporal assumptions, its learned feature detectors, and the chunk-size-to-duration relationship all depend on this specific rate. Feeding audio at a different rate is the single most common integration error, and it is pernicious because the model does not raise an exception when the sample rate is wrong.

If your audio source delivers 44.1 kHz (a common default for browser microphones) and you pass those samples directly to Silero VAD labeled as 16 kHz, the model happily returns a probability score; it just isn't a valid one. A 32 ms window at 44.1 kHz contains about 1,411 samples; the same duration at 16 kHz contains 512. A 512-sample chunk cut from the 44.1 kHz stream therefore covers only about 11.6 ms of real time, but the model interprets it as 32 ms. From the model's point of view the audio is stretched by a factor of roughly 2.76, which moves speech features away from what it learned and corrupts both the probability output and the hidden state. The failure mode is silent: you'll see erratic probability scores and wonder if your thresholds are wrong, when the real problem is upstream.
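The arithmetic behind the mismatch is easy to check. A small sketch, with hypothetical helper names:

```python
SOURCE_RATE = 44_100   # rate the device delivers
VAD_RATE = 16_000      # rate the model assumes


def samples_in(duration_ms: float, rate_hz: int) -> float:
    """Number of samples that span duration_ms at rate_hz."""
    return rate_hz * duration_ms / 1000.0


def duration_ms_of(num_samples: int, rate_hz: int) -> float:
    """Real-time duration covered by num_samples at rate_hz."""
    return 1000.0 * num_samples / rate_hz


# Same window length, very different sample counts.
for window_ms in (30, 32):
    print(window_ms, "ms:", samples_in(window_ms, SOURCE_RATE), "vs", samples_in(window_ms, VAD_RATE))

# A 512-sample chunk cut from the 44.1 kHz stream but treated as 16 kHz audio:
real_ms = duration_ms_of(512, SOURCE_RATE)     # what the chunk actually covers
assumed_ms = duration_ms_of(512, VAD_RATE)     # what the model believes it covers
stretch = SOURCE_RATE / VAD_RATE               # apparent slow-down factor
print(f"real {real_ms:.2f} ms, assumed {assumed_ms:.0f} ms, stretch x{stretch:.2f}")
```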

Resampling must happen before chunking. In Python, torchaudio.functional.resample or librosa.resample are standard tools:

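import torch
import torchaudio.functional as F

SOURCE_RATE = 44100   # whatever your audio device delivers
TARGET_RATE = 16000   # what Silero VAD requires
CHUNK_SIZE  = 512     # samples per model call at 16 kHz
CHUNK_MS    = 1000 * CHUNK_SIZE / TARGET_RATE  # = 32 ms

def resample_chunk(raw_chunk: torch.Tensor) -> torch.Tensor:
    """Resample a 1-D float tensor from SOURCE_RATE to TARGET_RATE.

    The output length is generally not CHUNK_SIZE, so re-chunk the
    resampled stream into CHUNK_SIZE-sample pieces before calling the model.
    """
    return F.resample(raw_chunk.unsqueeze(0), SOURCE_RATE, TARGET_RATE).squeeze(0)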

💡 Pro Tip: Make the target sample rate and chunk size named constants (for example VAD_SAMPLE_RATE = 16000 and CHUNK_SIZE = 512) and reference them everywhere chunk sizes or durations are computed. This prevents the subtle bug where one part of the code hard-codes a sample count and another derives it from a value that has since changed.

The Core Integration Loop

Once audio is at the correct sample rate, the integration follows a consistent four-step cycle:

┌─────────────────────────────────────────────────────────────┐
│                  STREAMING AUDIO LOOP                       │
│                                                             │
│  ┌──────────┐    resample     ┌──────────┐                  │
│  │  Audio   │ ─────────────► │  16 kHz  │                  │
│  │  Source  │                │  Chunk   │                  │
│  └──────────┘                └────┬─────┘                  │
│                                   │                         │
│                                   ▼                         │
│                          ┌────────────────┐                 │
│                          │  Silero VAD    │                 │
│                          │  model(chunk)  │                 │
│                          └───────┬────────┘                 │
│                                  │  probability [0.0–1.0]   │
│                                  ▼                          │
│                          ┌────────────────┐                 │
│                          │  State Machine │                 │
│                          │  (SILENT /     │                 │
│                          │   SPEAKING)    │                 │
│                          └───┬───────┬────┘                 │
│                              │       │                      │
│                    speech_end│       │speech_start          │
│                              ▼       ▼                      │
│                        ┌─────────────────────┐             │
│                        │  Downstream Consumer│             │
│                        │  (ASR buffer, etc.) │             │
│                        └─────────────────────┘             │
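└─────────────────────────────────────────────────────────────┘

import math

import torch

VAD_SAMPLE_RATE    = 16000
CHUNK_SIZE         = 512                                    # samples per model call at 16 kHz
CHUNK_MS           = 1000 * CHUNK_SIZE / VAD_SAMPLE_RATE    # 32 ms
START_THRESHOLD    = 0.5
END_THRESHOLD      = 0.35
MIN_SILENCE_MS     = 400
MIN_SILENCE_CHUNKS = math.ceil(MIN_SILENCE_MS / CHUNK_MS)   # 13 chunks (416 ms)

class VADIntegrator:
    def __init__(self, model, reset_fn):
        self.model         = model      # callable: model(chunk, sample_rate) -> probability tensor
        self.reset_fn      = reset_fn   # clears the model's recurrent state
        self.speaking      = False
        self.silence_count = 0
        self.pcm_buffer    = []

    def process_chunk(self, chunk_16k: torch.Tensor):
        """
        chunk_16k: float32 tensor of shape (CHUNK_SIZE,) at 16 kHz.
        Returns an event dict or None.
        """
        prob = self.model(chunk_16k, VAD_SAMPLE_RATE).item()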

        if not self.speaking:
            if prob >= START_THRESHOLD:
                self.speaking      = True
                self.silence_count = 0
                self.pcm_buffer    = [chunk_16k]
                return {"event": "speech_start", "prob": prob}
            return None

        else:
            self.pcm_buffer.append(chunk_16k)

            if prob < END_THRESHOLD:
                self.silence_count += 1
                if self.silence_count >= MIN_SILENCE_CHUNKS:
                    audio              = torch.cat(self.pcm_buffer)
                    self.speaking      = False
                    self.silence_count = 0
                    self.pcm_buffer    = []
                    return {"event": "speech_end", "audio": audio}
            else:
                self.silence_count = 0

            return None

    def reset_session(self):
        """Call this between independent call sessions."""
        self.reset_fn()
        self.speaking      = False
        self.silence_count = 0
        self.pcm_buffer    = []

Several design decisions are worth naming explicitly.

The start and end thresholds differ. START_THRESHOLD is 0.5; END_THRESHOLD is 0.35. The gap creates hysteresis: once in the SPEAKING state, probability must fall below 0.35, well under the 0.5 needed to enter it, before the silence counter starts. The specific values here are illustrative starting points; the Threshold Tuning child lesson covers systematic calibration.

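Buffering begins on speech_start. The pcm_buffer starts collecting raw PCM at the moment the start event fires. This means any audio that arrived before the probability crossed START_THRESHOLD is not included; the trade-off and mechanism for capturing that leading audio is the topic of the Pre-roll Buffer child lesson. Note also that this simplified integrator fires speech_start on the first chunk at or above START_THRESHOLD, without the consecutive-chunk confirmation described earlier.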

The speech_end event carries the buffered audio. torch.cat(self.pcm_buffer) assembles all accumulated chunks into a single contiguous tensor. This is the payload the downstream ASR consumer receives — it does not need to know anything about the chunk-by-chunk processing that produced it.
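To give a feel for how the missing leading audio can be recovered, here is a minimal pre-roll sketch. It is an assumption-laden preview rather than the child lesson's implementation: the `PreRollBuffer` class, the three-chunk depth, and the string stand-ins for audio chunks are all hypothetical.

```python
from collections import deque
from typing import Any, Deque, List

PRE_ROLL_CHUNKS = 3   # assumption: keep about 96 ms of history at 32 ms per chunk


class PreRollBuffer:
    """Remembers the most recent chunks seen while no speech was detected."""

    def __init__(self, max_chunks: int = PRE_ROLL_CHUNKS) -> None:
        # A bounded deque silently evicts the oldest chunk when full.
        self._recent: Deque[Any] = deque(maxlen=max_chunks)

    def push(self, chunk: Any) -> None:
        self._recent.append(chunk)

    def drain(self) -> List[Any]:
        """Return the retained chunks in arrival order and empty the buffer."""
        chunks = list(self._recent)
        self._recent.clear()
        return chunks


# Strings stand in for audio chunks; speech_start fires on "c4".
pre_roll = PreRollBuffer()
utterance: List[str] = []
stream = [("c0", False), ("c1", False), ("c2", False), ("c3", False), ("c4", True)]
for name, starts_speech in stream:
    if starts_speech:
        utterance = pre_roll.drain() + [name]   # prepend the retained history
    else:
        pre_roll.push(name)
print(utterance)  # ['c1', 'c2', 'c3', 'c4']
```

In the integrator above, the equivalent change would be to seed pcm_buffer with the drained chunks instead of only the triggering chunk.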

🎯 Key Principle: VAD's output is not a transcript. It is a labeled segment of raw audio. The speech_end event says "here is audio that contains speech" — not "here is what was said." Keeping this boundary explicit in your code prevents a common architectural confusion where developers expect richer output from the VAD layer.

State Management: The Reset Contract

The model's recurrent hidden state carries context across chunk boundaries within a session — a feature. The problem arises at session boundaries. If reset_session is not called between independent sessions, the hidden state from Session A's final chunks is still active when Session B's first chunk arrives.

WITHOUT RESET:

  Call A audio ──► [hidden state] ──► Call B audio
                       ↑_____________↑
                   State "remembers" Call A
                   → Corrupt probability for Call B's first chunks

WITH RESET:

  Call A audio ──► [hidden state] ──► RESET ──► [zero state] ──► Call B audio
                                                    ↑
                                         Clean slate for Call B

Call reset_session whenever you transition between independent audio contexts: when a new caller connects, when a WebSocket session ends, or when explicitly restarting the VAD. You do not need to reset between individual utterances within a single session — the state machine handles intra-session transitions through the speaking flag and silence_count automatically. The recurrent hidden state, however, is the model's internal memory, and only reset_fn() (which calls model.reset_states() in Silero's API) clears it.

⚠️ Common Mistake: Calling reset_session after every utterance, reasoning that a clean state is always safer. Within a session, the hidden state carries useful context across short silences — it helps the model distinguish a speaker pausing mid-sentence from one who has genuinely finished. Resetting after every utterance throws away that context and can increase false negatives at the start of consecutive utterances.
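One way to make the reset contract hard to get wrong is to tie it to the session's lifetime. The sketch below assumes a model object exposing `reset_states()`; the `vad_session` context manager and the `StubRecurrentModel` stand-in are hypothetical and only mimic the idea of carried state.

```python
from contextlib import contextmanager


class StubRecurrentModel:
    """Stand-in for a stateful VAD model; it does NOT imitate Silero's scoring."""

    def __init__(self) -> None:
        self.chunks_seen = 0          # plays the role of the hidden state

    def __call__(self, chunk) -> float:
        self.chunks_seen += 1
        return 0.0

    def reset_states(self) -> None:
        self.chunks_seen = 0


@contextmanager
def vad_session(model):
    """Reset once when a session begins and once when it ends, never in between."""
    model.reset_states()
    try:
        yield model
    finally:
        model.reset_states()


model = StubRecurrentModel()

with vad_session(model) as vad:       # Call A: several utterances, no resets
    for chunk in range(5):
        vad(chunk)
    seen_a = model.chunks_seen

with vad_session(model) as vad:       # Call B starts from a clean state
    for chunk in range(3):
        vad(chunk)
    seen_b = model.chunks_seen

print(seen_a, seen_b, model.chunks_seen)  # 5 3 0
```

Without the reset on entry, the second session would have started with Call A's state still in place.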

The Data Contract: What Downstream Consumers Receive

Event             When It Fires                              Payload                                 Consumer Action
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
🎙️ speech_start   Probability exceeds START_THRESHOLD        Event type, timestamp (optional)        Prepare ASR connection; note start time
🏁 speech_end     Silence duration exceeds MIN_SILENCE_MS    Event type, raw PCM tensor, duration    Send audio to ASR; log segment metadata

The speech_start event is intentionally lightweight — its primary purpose is to signal that the system should prepare. The speech_end event carries the full audio payload: a contiguous float32 PCM tensor at 16 kHz representing the entire detected utterance.

Correct thinking: The VAD integrator owns the buffer. On speech_end, it packages the audio and hands it off as a complete segment. The ASR consumer is decoupled from the buffering mechanism. This decoupling means you can unit test the VAD integrator by feeding it synthetic audio tensors and asserting events fire at the right times, without involving any ASR component, and swap out ASR providers without touching the VAD code.
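The consumer side of that decoupling can be sketched as a handler that knows only the event dictionaries. Everything here is illustrative: `make_consumer` is a hypothetical helper, the fake transcriber stands in for a real ASR client, and plain lists stand in for PCM tensors.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def make_consumer(
    transcribe: Callable[[Sequence[float]], str],
) -> Tuple[Callable[[Optional[dict]], None], List[str]]:
    """Build a handler for VAD events plus the list it appends transcripts to."""
    transcripts: List[str] = []

    def handle(event: Optional[dict]) -> None:
        if event is None:
            return                                   # no state change on this chunk
        if event["event"] == "speech_start":
            return                                   # e.g. warm up the ASR connection
        if event["event"] == "speech_end":
            transcripts.append(transcribe(event["audio"]))

    return handle, transcripts


def fake_asr(audio: Sequence[float]) -> str:
    """Placeholder transcriber: reports only how much audio it received."""
    return f"<{len(audio)} samples>"


handle, transcripts = make_consumer(fake_asr)
scripted_events = [
    None,
    {"event": "speech_start", "prob": 0.9},
    None,
    {"event": "speech_end", "audio": [0.0] * 1024},
]
for event in scripted_events:
    handle(event)
print(transcripts)  # ['<1024 samples>']
```

Swapping ASR providers then means passing a different `transcribe` callable; the VAD side does not change.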

Common Mistakes When Setting Up VAD

VAD integration looks straightforward on paper: feed audio in, get speech events out. In practice, most developers hit the same set of failure modes — and many of them are quiet. The pipeline keeps running, but probability scores drift, utterances get clipped, or detection becomes erratic in ways that only surface under realistic audio conditions.

Mistake 1: Feeding Variable-Length or Overlapping Chunks

The most fundamental integration error is treating the VAD model like a stateless classifier that accepts arbitrary input lengths. As established earlier, the model maintains a recurrent hidden state between calls, which means each call is expected to represent exactly one fixed-duration window of audio, fed in strict sequence with no gaps and no overlaps.

When you feed variable-length chunks, the model's internal time axis desynchronizes from real time. The damage is subtle: scores become inconsistent across runs on identical audio, thresholds that were carefully calibrated stop working, and borderline audio shows high variance where it should be stable.

Overlapping chunks are equally damaging. If you slide a window forward by 10 ms each call while feeding 30 ms windows, the model processes the same audio frames multiple times and advances its hidden state at the wrong rate.

Correct chunk delivery:

Time ──────────────────────────────────────────────────────────►
       [  chunk 1  ][  chunk 2  ][  chunk 3  ][  chunk 4  ]
        30 ms        30 ms        30 ms        30 ms
        contiguous, no overlap, no gap

Broken: variable lengths
       [  chunk 1  ][chunk 2][     chunk 3     ][chunk 4]
        30 ms        15 ms    45 ms              30 ms
        ← model's timing assumptions violated

Broken: overlapping chunks
       [  chunk 1  ]
            [  chunk 2  ]
                 [  chunk 3  ]
        ← frames re-processed; hidden state corrupted

The fix is to enforce a strict chunking contract at the boundary between your audio source and the model. Maintain a ring buffer or accumulator that only releases audio in exactly the expected chunk size — for Silero VAD at 16 kHz, typically 512 samples (32 ms). Never call the model with a partial chunk and never overlap.

💡 Pro Tip: Write a thin wrapper that raises an assertion error if the input tensor length ever deviates from the expected chunk size. This turns a silent misbehavior into a loud, immediate failure during development.
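One way to enforce the chunking contract is sketched below, assuming 512-sample chunks at 16 kHz. `ChunkAccumulator` and `checked_call` are hypothetical names, and `model` is any callable that takes one chunk. The guard raises `ValueError` rather than using `assert`, since assertions are skipped when Python runs with `-O`.

```python
from typing import Callable, List

CHUNK_SAMPLES = 512  # 32 ms at 16 kHz


class ChunkAccumulator:
    """Accepts sample blocks of any length; releases only exact-size chunks."""

    def __init__(self, chunk_samples: int = CHUNK_SAMPLES) -> None:
        self.chunk_samples = chunk_samples
        self._pending: List[float] = []

    def push(self, samples: List[float]) -> List[List[float]]:
        """Add samples and return every complete chunk now available."""
        self._pending.extend(samples)
        chunks: List[List[float]] = []
        while len(self._pending) >= self.chunk_samples:
            chunks.append(self._pending[: self.chunk_samples])
            del self._pending[: self.chunk_samples]
        return chunks  # leftover samples wait for the next push

    @property
    def pending(self) -> int:
        return len(self._pending)


def checked_call(
    model: Callable[[List[float]], float],
    chunk: List[float],
    chunk_samples: int = CHUNK_SAMPLES,
) -> float:
    """Fail loudly if a caller bypasses the accumulator."""
    if len(chunk) != chunk_samples:
        raise ValueError(f"expected {chunk_samples} samples, got {len(chunk)}")
    return model(chunk)
```

In use, every block from the audio source goes through `push`, and each returned chunk goes to the model through `checked_call`, so a partial chunk can never reach the model unnoticed.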

Mistake 2: Using a Single Threshold for Both Speech-Start and Speech-End

A natural first pass picks a probability cutoff — say 0.5 — and treats anything above it as speech, anything below it as silence. This produces a specific failure mode called threshold oscillation on any audio containing borderline content: a speaker trailing off, a breath, background TV, or a soft voiced consonant.

Real speech probability traces are not bimodal. They hover in the middle range — 0.4 to 0.6 — during transitions, and with a single threshold at 0.5, the state machine flips back and forth on every chunk during those transitions.

Oscillation with a single threshold (0.5):

Probability
1.0 ┤                  ████
    │             █████    ████
0.5 ┤─────────────────────────────── ← single threshold
    │        ████             ████
0.0 ┤████████                     ████████
    └──────────────────────────────────────► Time

State:  SIL  SIL  SPK SPK SPK SIL SPK SPK SIL SIL
                            ↑↑↑↑↑↑↑↑↑
                    rapid toggling in the middle zone

Hysteresis gap (start=0.5, end=0.35):

0.5 ┤─────────────────────────────── ← start threshold
    │                      ↕ gap
0.35┤─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ← end threshold

State:  SIL  SIL  SPEECH  SPEECH  SPEECH  SIL
                  ↑ start                ↑ end
                  fires at 0.5           fires at 0.35

The end threshold should always be strictly lower than the start threshold. The size of the gap determines how strongly the state machine resists toggling. A larger gap (e.g., 0.5 start / 0.25 end) makes the system more stable but slower to register genuinely brief pauses. A smaller gap (e.g., 0.5 start / 0.45 end) is more responsive but more prone to oscillation on noisy audio. The Threshold Tuning child lesson covers calibrating this gap for specific acoustic environments.
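The two-threshold logic can be sketched as a small state machine over a probability trace. This is an illustrative sketch, not Silero's own iterator: the function name, the 96 ms silence default, and the 32 ms chunk duration are assumptions. It also fires speech_start on the first qualifying chunk and omits any multi-chunk start confirmation.

```python
from typing import Iterable, List, Tuple

CHUNK_MS = 32  # assumption: one 512-sample chunk at 16 kHz


def detect_events(
    probs: Iterable[float],
    start_threshold: float = 0.5,
    end_threshold: float = 0.35,
    min_silence_ms: int = 96,
) -> List[Tuple[str, int]]:
    """Return (event_name, chunk_index) pairs for a probability trace."""
    if not end_threshold < start_threshold:
        raise ValueError("end threshold must be strictly below start threshold")
    events: List[Tuple[str, int]] = []
    speaking = False
    silence_ms = 0
    for i, p in enumerate(probs):
        if not speaking:
            if p >= start_threshold:
                speaking = True
                silence_ms = 0
                events.append(("speech_start", i))
        elif p < end_threshold:
            # Count silence only once the score drops below the END threshold.
            silence_ms += CHUNK_MS
            if silence_ms >= min_silence_ms:
                speaking = False
                events.append(("speech_end", i))
        else:
            # Scores in the dead band between the thresholds keep the
            # segment alive and restart the silence timer.
            silence_ms = 0
    return events
```

On a trace that wobbles around 0.5 before dropping off, the default settings yield a single start/end pair, while a near-zero gap with a one-chunk silence window toggles on every wobble.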

Mistake 3: Discarding Audio Before the Speech-Start Event

The speech-start event fires after the model has accumulated enough consecutive high-probability chunks to be confident that speech has begun. That confirmation delay is intentional — it suppresses false positives — but it has a structural cost: by the time the start event fires, the speaker has already produced the first several tens of milliseconds of their utterance.

Capture starting at the start event (broken):

Actual speech:  [p][l][e][a][s][e][ ][t][e][l][l][ ][m][e]...
                 ↑
                 speaker starts here

VAD chunks:     |  0.1 |  0.2 |  0.4 |  0.6 |  0.7 |  0.8 |...
                                       ↑
                                       start event fires here

Capture starts at start event:
                               [a][s][e][ ][t][e][l][l][ ][m][e]
                               ← "ple" is gone forever

The mechanism for recovering this audio is the pre-roll buffer — a circular buffer that continuously retains the last N milliseconds of audio regardless of VAD state, so that when a start event fires, the pipeline can look back in time and prepend the buffered audio to the capture. If your ASR is consistently clipping the first syllable of utterances, you are almost certainly missing a pre-roll buffer. Lowering the start threshold as a workaround merely shifts where detection fires and introduces more false positives as a side effect; the pre-roll buffer is the correct structural solution.
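A minimal sketch of the idea follows, using `collections.deque` with `maxlen` as the circular buffer. The class name, method names, and the 256 ms default are hypothetical. The Pre-roll Buffer child lesson covers a fuller implementation.

```python
from collections import deque
from typing import Deque, List

CHUNK_MS = 32  # assumption: one 512-sample chunk at 16 kHz


class PreRollCapture:
    """Keeps the most recent chunks so a capture can begin in the past."""

    def __init__(self, pre_roll_ms: int = 256) -> None:
        max_chunks = max(1, pre_roll_ms // CHUNK_MS)
        self._ring: Deque[List[float]] = deque(maxlen=max_chunks)
        self._capture: List[List[float]] = []
        self._capturing = False

    def feed(self, chunk: List[float]) -> None:
        """Call once per chunk, before acting on that chunk's VAD result."""
        if self._capturing:
            self._capture.append(chunk)
        else:
            self._ring.append(chunk)  # the oldest chunk drops off when full

    def on_speech_start(self) -> None:
        # Seed the capture with whatever the ring still holds, which
        # includes the chunk that triggered the start event.
        self._capture = list(self._ring)
        self._ring.clear()
        self._capturing = True

    def on_speech_end(self) -> List[float]:
        """Return the utterance, pre-roll included, as one flat sample list."""
        self._capturing = False
        samples = [s for chunk in self._capture for s in chunk]
        self._capture = []
        return samples
```

The ring costs a fixed, small amount of memory regardless of how long the stream stays silent, which is why it can run unconditionally.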

Mistake 4: Forgetting to Reset VAD State Between Sessions

As established in the integration section, failing to call model.reset_states() between independent sessions causes the hidden state from the previous session to shape probability scores at the start of the new one.

A common place this goes unnoticed is in testing: developers run a batch of test audio files through the model in a loop without resetting state between files. File 1 produces clean results. File 2 produces subtly different results for the same audio content because File 1's state contaminated the initial conditions. The variance looks like model nondeterminism when it is actually missing resets.

🧠 Mnemonic: Think of VAD state like a whiteboard. Between independent sessions, erase the board. Running a new session on an un-erased board means solving a new problem while still looking at the old work.
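The batch-testing pitfall can be reproduced with a toy model. `FakeStatefulVad` below is a hypothetical stand-in, not Silero's network: it only smooths its input with a running state, which is enough to show how carried-over state changes the scores for the next file. The `reset_states()` method name mirrors the call mentioned above.

```python
from typing import List, Sequence


class FakeStatefulVad:
    """Hypothetical recurrent 'model': an exponential moving average."""

    def __init__(self) -> None:
        self._state = 0.0

    def __call__(self, chunk_energy: float) -> float:
        # Each output depends on the input AND on everything seen before.
        self._state = 0.5 * self._state + 0.5 * chunk_energy
        return self._state

    def reset_states(self) -> None:
        self._state = 0.0


def score_files(
    model: FakeStatefulVad,
    files: Sequence[Sequence[float]],
    reset: bool,
) -> List[List[float]]:
    """Score each file in turn, optionally resetting state between files."""
    results: List[List[float]] = []
    for chunks in files:
        if reset:
            model.reset_states()  # erase the whiteboard before each file
        results.append([model(c) for c in chunks])
    return results
```

Scoring the same quiet file before and after a loud one gives identical results with `reset=True` and different results with `reset=False`, which is the contamination that can be mistaken for nondeterminism.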

Mistake 5: Treating the VAD Probability Score as an ASR Confidence Score

The VAD probability score measures one thing: the likelihood that speech is present in the current audio chunk. It says nothing about whether that speech is intelligible, whether the signal-to-noise ratio is sufficient for accurate transcription, or whether the acoustic conditions will produce a reliable ASR output.

Two audio streams, both flagged as speech by VAD:

Stream A: VAD = 0.92  │  SNR = 2 dB   │  ASR WER = 45%
           (loud music in background, speech present but buried)

Stream B: VAD = 0.71  │  SNR = 28 dB  │  ASR WER = 4%
           (soft-spoken in quiet room, borderline VAD confidence)

VAD score does NOT predict ASR quality.

If you need a signal about ASR suitability, look at ASR-specific quality metrics: word-level confidence scores from the ASR output, acoustic model log-likelihoods, or an explicit noise classifier trained to predict transcription quality. VAD is not a substitute for any of these.

Quick Reference: The Five Mistakes

| Mistake | Symptom | Fix |
|---|---|---|
| Variable/overlapping chunks | Inconsistent scores; calibrated thresholds stop working | Fixed-size, contiguous chunks via an accumulator buffer |
| Single threshold | Rapid speech-start/end toggling on borderline audio | Set end threshold strictly lower than start threshold |
| No pre-roll buffer | First syllable of every utterance clipped in ASR output | Implement a pre-roll buffer (see child lesson) |
| No state reset | False start events on session open; erratic latency | Call model.reset_states() at session boundaries |
| VAD score ≠ ASR confidence | High VAD scores produce poor transcription on noisy audio | Use ASR-specific quality signals for routing decisions |

🤔 Did you know? The first and fourth mistakes tend to cluster: teams that skip fixed-chunk enforcement often also skip state resets, because both require explicit bookkeeping at session boundaries that feels like "boilerplate" during early prototyping. When VAD behavior seems erratic, checking both simultaneously is usually faster than isolating them one at a time.

Key Takeaways and What Comes Next

You started this lesson facing a deceptively simple problem: a microphone produces an unbroken stream of PCM samples, but ASR engines, language models, and response generators all expect discrete, bounded utterances. VAD closes that gap. By this point the mechanism should feel concrete: not just that VAD works, but how it does so, what can go wrong, and where the remaining rough edges live.

The Four Ideas This Lesson Built

1. VAD is an event translator, not a classifier. VAD's fundamental job is to convert a continuous audio stream into a discrete pair of events: speech-start and speech-end. It does this by running a small neural network over fixed-size chunks to produce a per-chunk probability score, then applying threshold logic to translate those raw scores into named events. VAD does not transcribe, does not measure audio quality, and does not determine intelligibility. It answers one question per chunk: is there speech here right now?

2. Silero VAD is stateful, and that statefulness has real consequences. The recurrent hidden state is a feature: it lets the model reason about context across chunk boundaries, distinguishing a mid-sentence pause from a genuine utterance end. But it creates a hard dependency on how you feed the model. Chunks must arrive in order, each chunk must be exactly the expected fixed size, and the state must be explicitly reset between independent sessions. Violate any of these three constraints and probability scores become unreliable in ways that are difficult to debug.

3. Three levers control detection behavior — and they interact.

| Parameter | What it controls |
|---|---|
| Start threshold | Minimum probability required to enter the speech-active state |
| End threshold | Probability below which a chunk counts as silence while in the speech-active state |
| Minimum silence duration | How long probability must stay below the end threshold before speech-end fires |

The end threshold must be set lower than the start threshold to introduce hysteresis — a deliberate dead band that prevents oscillation on borderline audio. The right values are not universal; they depend on your microphone, ambient noise environment, and user speech patterns. The Threshold Tuning child lesson covers systematic calibration methodology.

4. Audio before the speech-start event is not automatically preserved. The speech-start event fires only after enough consecutive chunks have scored above the start threshold. By then, the speaker has already produced the first few tens of milliseconds of their utterance. The mechanism for reaching back and capturing that leading audio (a circular pre-roll buffer prepended to the capture at speech-start) is covered in the Pre-roll Buffer child lesson.

Integration Essentials

| Concept | What to remember | Where it breaks |
|---|---|---|
| 🎯 Event model | VAD emits speech-start and speech-end; pipeline buffers between them | Missed end event → buffer grows unbounded |
| 🔒 Fixed chunk size | Chunks must match the model's expected window exactly | Variable-length chunks silently corrupt probability scores |
| 🔒 In-order delivery | Chunks must arrive in sequence with no gaps | Out-of-order chunks corrupt recurrent state |
| 🔧 State reset | Reset between every independent session | Stale state causes phantom detections at session open |
| 🔧 Threshold hysteresis | End threshold lower than start threshold | Equal thresholds cause start/end oscillation |
| 📚 Pre-roll buffer | Leading audio is lost without a ring buffer prepended at speech-start | First phonemes clipped; degraded ASR on utterance onsets |
| 🧠 Sample rate | Input must be resampled to 16 kHz before chunking | Wrong sample rate produces garbage scores with no error raised |

What You Can Now Do

Wire up a basic detection loop. You have the integration pattern: resample to 16 kHz, feed fixed-size chunks in order, check each probability score against your thresholds, emit start and end events. Get this loop running and emitting events before moving to the child lessons.

Diagnose detection problems systematically. The table above is a checklist. Phantom detections at session open → check state reset. Oscillating start/end events → check threshold hysteresis. Clipped first syllables → check for a pre-roll buffer. Garbage probability scores → verify sample rate.

Proceed to the child lessons with the right questions. You now know which parameters control detection behavior and why they interact — the Threshold Tuning lesson can go straight to methodology. You know why leading audio is lost and what mechanism fixes it — the Pre-roll Buffer lesson can go straight to implementation.

🧠 Mnemonic (FORCE): The five things you must get right to keep VAD working correctly: Fixed chunk size, Order guaranteed, Reset between sessions, Correct sample rate, End threshold lower than start. If any one of the five is wrong, no amount of downstream tuning will save you.