Audio Fundamentals

Lesson 2 — Stop treating audio as a black box: PCM, sample rates, bit depth, and every int16↔float32 conversion in the system.

Last generated May 27, 2026 UTC

Why Audio Is Not Just Data: The Physical Reality Behind PCM

Imagine you've wired together a microphone capture library, a speech recognition API, and a text-to-speech synthesizer. You run your voice agent for the first time. The model transcribes nothing, or produces garbled nonsense, or the synthesized audio plays back as a high-pitched squeal at double speed. You check your API keys — fine. You check the network — fine. You add logging everywhere and the bytes are flowing. The system looks correct from the outside, and yet it is completely broken. This is the specific, infuriating category of bug that appears when you treat audio as an opaque blob of bytes rather than a structured physical signal encoded according to precise rules.

Sound as a Physical Phenomenon

Sound is a continuous disturbance propagating through a medium — in voice applications, that medium is air. When you speak, your vocal cords set air molecules into motion, creating regions of slightly higher and lower pressure that radiate outward as a wave. A microphone's diaphragm physically deflects in response to these pressure changes, and that deflection is converted into a continuously varying electrical voltage. At this stage, the signal is analog: it has a value at every instant in time, forming a smooth, unbroken curve.

Pressure
  |
  |    /\        /\      /\
  |   /  \      /  \    /  \
  |--/----\----/----\--/----\--> Time
  |        \  /      \/
  |         \/
  |
  Continuous analog waveform — infinite resolution in both time and amplitude

Computers cannot store or process infinite-resolution data. To get audio into a digital system, you must perform two acts of deliberate, controlled approximation: sampling (discretizing time) and quantization (discretizing amplitude).

Sampling means measuring the waveform's value at regular time intervals, producing a sequence of discrete snapshots. Quantization means rounding each snapshot to the nearest representable number given a fixed set of available levels. Together, they convert the smooth analog curve into a finite sequence of integers or floating-point numbers.

Pressure
  |
32767 |    *              *         *
      |         *    *         *
    0 |--*------------------*-----------*--> Sample index
      |              *
-32768|    *
  |
  Quantized samples — each * is one int16 value at one point in time

🎯 Key Principle: Digitization is not lossless: you trade fidelity for a finite, storable representation. The quality of that trade is controlled by two parameters: how often you sample (the sample rate) and how many levels you can represent per sample (the bit depth). Every PCM buffer in your system is the product of specific choices made at capture time.
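The two approximations can be sketched in a few lines of numpy. This is a minimal illustration, not a capture implementation: the 440 Hz tone, the 0.5 amplitude, and the 10 ms duration are arbitrary assumptions, and a float64 array stands in for the "analog" signal.

```python
import numpy as np

SAMPLE_RATE = 16_000                 # samples per second (assumed for illustration)
TONE_HZ = 440.0                      # hypothetical test tone
num_samples = SAMPLE_RATE // 100     # 10 ms of audio -> 160 samples

# Sampling: evaluate the waveform only at multiples of 1 / SAMPLE_RATE.
t = np.arange(num_samples) / SAMPLE_RATE
analog_values = 0.5 * np.sin(2 * np.pi * TONE_HZ * t)   # float64 stand-in for the analog signal

# Quantization: round each snapshot to the nearest int16 level.
quantized = np.round(analog_values * 32767).astype(np.int16)

# The rounding error is the amplitude information that digitization discards.
reconstructed = quantized / 32767
max_error = np.max(np.abs(reconstructed - analog_values))

print(len(quantized), quantized.dtype)
print(f"max quantization error: {max_error:.2e}")
```

The error is bounded by half a quantization step (0.5 / 32767), which is the sense in which bit depth sets amplitude resolution.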

What PCM Actually Is

PCM — Pulse-Code Modulation — is the name for the data layout that results from this process. It is not a file format, not a codec, and not a compression scheme. PCM is simply the uncompressed, sample-by-sample record of quantized pressure values, stored sequentially in memory. If you have one second of audio at 16,000 samples per second with 16-bit resolution, PCM is the 32,000-byte region of memory where each successive pair of bytes encodes one pressure snapshot as a signed 16-bit integer.

PCM buffer layout (16-bit signed integer, mono, first 8 bytes shown):

Byte:  [ 0x1A 0x04 ] [ 0xE2 0x09 ] [ 0x55 0x0F ] [ 0x99 0x14 ]
         Sample 0      Sample 1      Sample 2      Sample 3
         (little-endian int16 values representing pressure at t=0, t=1/16000s, ...)

There is no header inside a raw PCM stream telling you the sample rate, bit depth, or number of channels. Those parameters are metadata — they live outside the buffer, in your code's assumptions or in a wrapper format like WAV. This is a deliberate simplicity that makes PCM the ideal interchange format for real-time systems.

💡 Mental Model: Think of a PCM buffer like a photograph's raw pixel data. The RGB byte array doesn't know its own width or height. You can interpret the same byte array as a 100×100 image or a 50×200 image, and both produce coherent but different pictures. Audio PCM works the same way: the same bytes can be "read" at 8 kHz or 16 kHz, and the system will dutifully play or process them — just at half or double speed, producing bugs that sound wrong but throw no exceptions.
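A small sketch makes the "same bytes, different reading" point concrete. The helper name `implied_duration_s` and the silent test buffer are hypothetical; the only thing shown is that the byte array itself cannot tell you which interpretation is right.

```python
import numpy as np

# Hypothetical buffer: 1 second of 16 kHz mono int16 silence (32,000 bytes).
raw_bytes = np.zeros(16_000, dtype=np.int16).tobytes()

def implied_duration_s(buf: bytes, dtype, sample_rate: int, channels: int = 1) -> float:
    """Duration the buffer would have IF the assumed format were true."""
    samples = np.frombuffer(buf, dtype=dtype)
    return len(samples) / channels / sample_rate

print(implied_duration_s(raw_bytes, np.int16, 16_000))               # the true format
print(implied_duration_s(raw_bytes, np.int16, 8_000))                # wrong rate
print(implied_duration_s(raw_bytes, np.float32, 16_000))             # wrong dtype
print(implied_duration_s(raw_bytes, np.int16, 16_000, channels=2))   # wrong channel count
```

Every call succeeds without an exception; only the first one describes the audio that was actually recorded.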

PCM as the Common Currency of a Voice Pipeline

PCM is the medium that every component in a voice agent speaks. Consider the path audio travels:

┌─────────────────────────────────────────────────────────────────┐
│                   Voice Agent Audio Pipeline                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  [Microphone]                                                    │
│       │  PCM bytes (device native rate, int16)                  │
│       ▼                                                          │
│  [OS Audio API]                                                  │
│       │  PCM bytes (possibly resampled)                         │
│       ▼                                                          │
│  [Your Application Code]                                         │
│       │  PCM buffer (numpy array or bytes object)               │
│       ▼                                                          │
│  [Speech Recognition Model]                                      │
│       │  Transcript text                                         │
│       ▼                                                          │
│  [LLM / Agent Logic]                                             │
│       │  Response text                                           │
│       ▼                                                          │
│  [Text-to-Speech Synthesizer]                                    │
│       │  PCM bytes (synthesizer's native format)                │
│       ▼                                                          │
│  [Playback API / Browser / Speaker]                              │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Your application code is responsible for ensuring that the PCM handed from one component to the next is in the exact format that component expects: correct sample rate, correct bit depth, correct channel count, correct numeric type. This is a chain of precise handshakes, not an automatic flow. None of these components will politely tell you "I received 44.1 kHz audio but I expected 16 kHz." The speech recognizer will just produce garbage transcriptions, because to it the audio sounds like a recording played at the wrong speed.

💡 Real-World Example: A developer builds a voice agent that works perfectly on their laptop but produces near-zero transcription accuracy when deployed to a cloud VM. The root cause: the laptop's audio device returned 44,100 Hz PCM, but the VM's audio layer returned 48,000 Hz — both values the OS chose silently based on hardware capability. The speech recognition model expected 16,000 Hz. The bytes were flowing correctly; the format was wrong. Without knowing what PCM is, there is no vocabulary for diagnosing this. With it, the fix is a three-line resampling call.

The Cost of Treating Audio as an Opaque Blob

❌ Wrong thinking: "Audio is bytes, and bytes are bytes. I'll pass the data through and let the libraries handle the details."

✅ Correct thinking: "Audio is PCM with a specific sample rate, bit depth, channel count, and numeric type. Every boundary in my pipeline is a potential format mismatch. I'll verify the format at each handoff rather than assuming compatibility."

Knowing what PCM looks like at the byte level is what enables meaningful inspection. A PCM buffer is not magic — it is a numpy array of int16 or float32 values that you can print, plot, and reason about:

import numpy as np

# Load raw PCM bytes captured from a microphone (16-bit signed, mono, 16 kHz)
raw_bytes = b'...'  # from your audio capture callback

# Interpret as signed 16-bit integers
samples = np.frombuffer(raw_bytes, dtype=np.int16)

# Immediate sanity checks
print(f"Sample count: {len(samples)}")        # e.g., 1600 for 100ms at 16kHz
print(f"Min: {samples.min()}, Max: {samples.max()}")  # should be well within [-32768, 32767]
print(f"Mean: {samples.mean():.2f}")            # should be close to 0.0 for speech

🧠 Mnemonic: Think of a PCM buffer as a contract in bytes: the sample rate says how often measurements were taken, the bit depth says how precisely, and the channel count says how many microphones were listening. Break any term of the contract and the receiving end cannot interpret the signal correctly — not because software is fragile, but because the physical meaning of the data depends on all three terms simultaneously.

Sample Rate and Bit Depth: The Two Numbers That Define Your Audio

With the physical picture of PCM established, we can now be precise about the two parameters that govern every format decision in the pipeline: sample rate and bit depth. Get these right — or more precisely, keep them consistent across every handoff — and audio flows cleanly. Get them wrong and the failure modes range from subtle degraded recognition to catastrophic silence.

Sample Rate: Slicing Time Into Snapshots

Sample rate is how many pressure snapshots you take per second. A sample rate of 16,000 Hz means you capture 16,000 discrete amplitude values every second. The unit is Hz (Hertz); you will also see kHz (kilohertz) in documentation: 16 kHz = 16,000 Hz.

The critical constraint governing sample rate is the Nyquist theorem: to faithfully reconstruct a frequency component in a signal, you must sample at more than twice that frequency. Equivalently, a given sample rate can faithfully represent only frequencies below half its value, a limit called the Nyquist frequency.

Sample rate:    16,000 Hz
Nyquist limit:   8,000 Hz  ← highest frequency faithfully captured

Sample rate:    44,100 Hz
Nyquist limit:  22,050 Hz  ← covers full human hearing range (~20 Hz–20 kHz)

When a frequency exceeds the Nyquist limit, the sampler cannot distinguish it from a lower frequency — a phenomenon called aliasing. In practice, hardware anti-aliasing filters remove frequencies above the Nyquist limit before sampling. But the conceptual rule matters: once you choose a sample rate, you have permanently discarded all information above its Nyquist frequency. You cannot recover it later in software.
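Aliasing is easy to reproduce numerically. The sketch below assumes a hypothetical 10 kHz tone sampled at 16 kHz with no anti-aliasing filter in front, then asks the FFT which frequency the samples appear to contain.

```python
import numpy as np

SAMPLE_RATE = 16_000   # Nyquist limit is 8,000 Hz
TONE_HZ = 10_000       # hypothetical tone above the Nyquist limit

# One second of samples, taken with no anti-aliasing filter.
n = np.arange(SAMPLE_RATE)
samples = np.cos(2 * np.pi * TONE_HZ * n / SAMPLE_RATE)

# With exactly one second of data, rfft bin k corresponds to k Hz.
spectrum = np.abs(np.fft.rfft(samples))
apparent_hz = int(np.argmax(spectrum))

print(apparent_hz)  # 6000, i.e. SAMPLE_RATE - TONE_HZ
```

The sampled data is indistinguishable from a genuine 6 kHz tone, which is why the filtering has to happen before the samples are taken.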

🎯 Key Principle: Lowering the sample rate hard-cuts all frequency content above the new Nyquist limit. Going from 44.1 kHz to 8 kHz removes every frequency component above 4 kHz, which is why telephone audio sounds the way it does.

The Sample Rates You Will Actually Encounter

Rate	Common Context	Nyquist Limit
8 kHz	PSTN telephony, legacy VoIP	4 kHz
16 kHz	Modern ASR models (almost universal)	8 kHz
22–24 kHz	Some TTS synthesis outputs	11–12 kHz
44.1 kHz	CD standard, common browser `AudioContext` rate (the default follows the output device)	22.05 kHz
48 kHz	Professional audio, WebRTC, many OS defaults	24 kHz

The gap between 16 kHz (what ASR models expect) and 44.1/48 kHz (what a browser or OS audio API delivers by default) is the single most common source of pipeline bugs. When you capture audio at 48 kHz and feed it to a 16 kHz model without resampling, the model interprets the extra samples as extra time, hearing your audio at one-third its actual speed. The transcript comes out garbled or empty.

🤔 Did you know? The telephone network standardized on 8,000 samples per second decades ago, because the roughly 300–3,400 Hz voice band it preserves carries enough of speech to remain intelligible. Most modern speech recognition models expect 16,000 samples per second, which recovers higher-frequency sibilants (sounds like s and sh) and significantly improves word error rates. This single parameter difference between telephony and modern ML pipelines is a perennial source of mismatch bugs. The 44.1 kHz standard, meanwhile, originated from a practical constraint rather than any perceptual optimum: early digital audio was stored on video tape, so engineers needed a sample rate that fit neatly into both NTSC and PAL video frame structures.

Bit Depth: Slicing Amplitude Into Levels

Sample rate controls resolution in time. Bit depth controls resolution in amplitude — it determines how many discrete loudness levels a single sample can represent.

With N bits, you get 2^N distinct levels. The two values you will encounter in voice work are:

16-bit integer (int16): 2^16 = 65,536 levels, covering values from −32,768 to +32,767. The dominant format for audio I/O in voice pipelines.
32-bit float (float32): Typically normalized to [−1.0, +1.0]. Used internally for processing.

Amplitude representation comparison:

16-bit int16:
 ─────────────────────────────────────
 -32768                    0         +32767
   |________________________|__________|
   65,536 discrete steps

32-bit float32 (normalized):
 ─────────────────────────────────────
  -1.0                     0          +1.0
   |________________________|__________|
   much finer resolution within [-1,1]
   (plus dynamic range far outside [-1,1])

Dynamic range — the ratio between the loudest and softest representable signal — scales with bit depth. Each additional bit adds approximately 6 dB of dynamic range. 16-bit audio therefore provides roughly 96 dB of dynamic range (16 × 6 = 96), which is more than adequate for voice. Human speech in a typical environment spans roughly 40–50 dB.

🧠 Mnemonic: Every bit buys you 6 dB. 16 bits = 96 dB. 24 bits = 144 dB.
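The 6 dB rule of thumb comes from 20·log10(2) ≈ 6.02. A quick sketch (the function name is illustrative) shows how close the mnemonic is to the theoretical figure:

```python
import math

def dynamic_range_db(bits: int) -> float:
    """Ratio between full scale and one quantization step, in dB."""
    return 20 * math.log10(2 ** bits)

for bits in (8, 16, 24):
    print(bits, round(dynamic_range_db(bits), 1))
```

For 16 bits this gives about 96.3 dB, so the "16 × 6 = 96" shortcut is accurate to within half a decibel.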

What Bit Depth Actually Does — and Doesn't Do

❌ Wrong thinking: "32-bit audio sounds better than 16-bit audio for voice."

✅ Correct thinking: "32-bit float reduces rounding error during mathematical operations, but a human listener cannot perceive any difference between 16-bit and 32-bit playback of a voice recording."

Bit depth matters in two distinct contexts:

Playback fidelity. For final output to a listener, 16-bit is sufficient for voice. The quantization noise introduced by 16-bit sampling is well below the threshold of human perception for speech content.

Processing headroom. When you apply mathematical operations to samples — multiplying by a gain factor, mixing two streams, applying a filter — you accumulate quantization error at each step. In int16 arithmetic, repeated operations can cause clipping or gradual precision loss. float32 avoids clipping by extending the representable range. This is why processing chains internally use float32 even when the input and output are int16.

💡 Real-World Example: Consider a gain normalization step. You read a quiet int16 recording, multiply each sample by 4.7 to bring it to a target loudness, then write back to int16. If the result is stored as int16 without clamping, every sample whose magnitude exceeds about 6,971 (32,767 / 4.7) overflows and wraps. If you first cast to float32, multiply, clamp, then cast back, the operation is safe. The int16 input and output are identical in format; the float32 step is invisible to the pipeline's consumers but essential to correctness.
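The cast, multiply, clamp, cast-back sequence might look like the following. This is a sketch under stated assumptions: `apply_gain_int16` is a hypothetical helper name, and rounding before the final cast is one reasonable choice among several.

```python
import numpy as np

def apply_gain_int16(samples: np.ndarray, gain: float) -> np.ndarray:
    """Scale int16 samples by `gain`, clamping instead of wrapping."""
    as_float = samples.astype(np.float32) * gain           # do the math in float32
    clamped = np.clip(np.round(as_float), -32768, 32767)   # stay inside the int16 range
    return clamped.astype(np.int16)

# Hypothetical samples: the last two exceed 32,767 / 4.7 and would otherwise overflow.
quiet = np.array([100, -2000, 6000, 8000, -8000], dtype=np.int16)
print(apply_gain_int16(quiet, 4.7))
```

The two loud samples come out pinned at 32767 and -32768. That is audible as clipping, but far less destructive than a wrapped value with the wrong sign.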

The Size Formula: Making PCM Predictable

One of the most useful properties of PCM is that its byte size is completely deterministic. Given four values, you can compute the exact size of any PCM buffer:

bytes = duration_seconds × sample_rate × (bit_depth / 8) × channels

Worked Examples

1 second of 16 kHz mono int16:

1 s × 16,000 samples/s × 2 bytes/sample × 1 channel = 32,000 bytes

100 ms of 44.1 kHz stereo int16:

0.1 s × 44,100 × 2 bytes × 2 channels = 17,640 bytes

500 ms of 48 kHz mono float32:

0.5 s × 48,000 × 4 bytes × 1 channel = 96,000 bytes
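The formula translates directly into a helper you can keep next to your pipeline code. The name `pcm_byte_count` is an assumption for illustration; rounding to whole samples first is a design choice that avoids fractional byte counts from floating-point durations.

```python
def pcm_byte_count(duration_s: float, sample_rate: int, bit_depth: int, channels: int) -> int:
    """Exact size of a PCM buffer: whole samples x bytes per sample x channels."""
    samples_per_channel = round(duration_s * sample_rate)
    bytes_per_sample = bit_depth // 8
    return samples_per_channel * bytes_per_sample * channels

print(pcm_byte_count(1.0, 16_000, 16, 1))   # 1 s of 16 kHz mono int16
print(pcm_byte_count(0.1, 44_100, 16, 2))   # 100 ms of 44.1 kHz stereo int16
print(pcm_byte_count(0.5, 48_000, 32, 1))   # 500 ms of 48 kHz mono float32
```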

This formula is not just a sizing tool — it is a debugging instrument. When a buffer arrives at a pipeline stage, you can immediately check whether its byte length matches the expected duration:

Expected: 1.0 s of 16 kHz mono int16 = 32,000 bytes
Received: 16,000 bytes

Diagnosis candidates:
  - Source was captured at 8 kHz, not 16 kHz (half as many samples
    per second)
  - Source is 8-bit PCM, not int16 (1 byte per sample instead of 2)
  - Only 500 ms of audio was captured
  - Buffer was truncated mid-write

The formula also lets you reason about latency. If your speech recognition model processes audio in 20 ms chunks at 16 kHz mono int16, each chunk is exactly 0.020 × 16,000 × 2 × 1 = 640 bytes. A buffer of 639 bytes is not a complete frame.
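One way to act on this is to slice incoming bytes into complete frames and carry any remainder forward. The sketch below assumes the 20 ms / 16 kHz / mono / int16 framing from the paragraph above; `split_frames` is a hypothetical helper, not part of any audio library.

```python
FRAME_MS = 20
SAMPLE_RATE = 16_000
BYTES_PER_SAMPLE = 2   # int16
CHANNELS = 1
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * BYTES_PER_SAMPLE * CHANNELS  # 640

def split_frames(buffer: bytes) -> tuple[list[bytes], bytes]:
    """Return (complete frames, leftover bytes to prepend to the next read)."""
    usable = len(buffer) - len(buffer) % FRAME_BYTES
    frames = [buffer[i:i + FRAME_BYTES] for i in range(0, usable, FRAME_BYTES)]
    return frames, buffer[usable:]

# Hypothetical read of 1,500 bytes: two full frames plus 220 bytes left over.
frames, leftover = split_frames(bytes(1500))
print(len(frames), len(leftover))
```

Keeping the leftover rather than padding or dropping it means no samples are lost or invented at frame boundaries.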

Putting the Two Numbers Together

In practice, voice pipelines converge on a small set of canonical format pairs:

Format	Typical Owner	Bytes/Second (Mono)
8 kHz / int16	Legacy telephony ingest	16,000
16 kHz / int16	ASR model input (near-universal)	32,000
22.05–24 kHz / float32	TTS synthesis output	88,200–96,000
44.1 kHz / int16	Browser playback default	88,200
48 kHz / int16	WebRTC, OS audio API output	96,000

The 16 kHz / int16 row is the one you will spend the most time working toward. Almost every speech recognition system treats it as the canonical input contract. Everything upstream tends to deliver audio at higher sample rates, and everything you build will need to downsample to meet it. With the size formula in hand and a clear picture of what each parameter physically means, you can inspect any PCM buffer and immediately know whether it is what the next stage expects.

Anatomy of a Real-Time Voice Pipeline: Where the Numbers Flow

A real-time voice agent is a series of format contracts. Every component — microphone driver, speech recognition model, text-to-speech engine, playback device — speaks PCM, but none speaks exactly the same dialect. At each handoff the format must change: the sample rate shifts, the channel count collapses, the numeric type flips. When those conversions are done correctly and explicitly, the pipeline is invisible. When any single handoff goes wrong, the root cause is buried in a buffer of raw bytes.

The Pipeline at a Glance

┌─────────────────────────────────────────────────────────────────────┐
│                    REAL-TIME VOICE AGENT PIPELINE                   │
└─────────────────────────────────────────────────────────────────────┘

  ┌──────────────┐   raw PCM bytes    ┌──────────────────────────────┐
  │  Microphone  │ ─────────────────► │   OS Audio API / Driver      │
  └──────────────┘                    └──────────────┬───────────────┘
                                                     │
                                   Format: int16, native rate,
                                   1 or 2 channels, ~10–20ms frames
                                                     │
                              ╔══════════════════════▼══════╗
                              ║   HANDOFF POINT A           ║
                              ║   Resample + Downmix        ║
                              ╚══════════════════════╤══════╝
                                                     │
                                   Format: int16, 16 kHz, mono
                                                     │
                                      ┌──────────────▼───────────────┐
                                      │   Speech Recognition Model   │
                                      └──────────────┬───────────────┘
                                                     │  text transcript
                                      ┌──────────────▼───────────────┐
                                      │   Language Model / Logic     │
                                      └──────────────┬───────────────┘
                                                     │  text response
                                      ┌──────────────▼───────────────┐
                                      │   Text-to-Speech Engine      │
                                      └──────────────┬───────────────┘
                                                     │
                                   Format: float32 OR int16,
                                   22–24 kHz (or 44.1/48 kHz),
                                   mono or stereo
                                                     │
                              ╔══════════════════════▼══════╗
                              ║   HANDOFF POINT B           ║
                              ║   Convert + Resample        ║
                              ╚══════════════════════╤══════╝
                                                     │
                                   Format: int16, device native rate
                                                     │
                                      ┌──────────────▼───────────────┐
                                      │   OS Audio API (Playback)    │
                                      └──────────────┬───────────────┘
                                                     │
                                      ┌──────────────▼───────────────┐
                                      │   Speaker (hardware)         │
                                      └──────────────────────────────┘

The two handoff points — A and B — are where the majority of bugs in real-time voice agents originate. Every other part of the pipeline operates on a stable, internally consistent format.

Stage 1: Microphone Capture — What the OS Hands You

When your application opens an audio input stream, the OS returns a callback or buffer of raw bytes at regular intervals. The format is determined by hardware and driver negotiation — not by your voice agent:

Numeric type: Almost always 16-bit signed integers (int16).
Sample rate: The device's native sample rate, typically 44,100 Hz or 48,000 Hz — not 16 kHz, even if that is what you eventually need.
Channel layout: Consumer microphones are often stereo with samples interleaved — [L0, R0, L1, R1, L2, R2, ...].
Frame size: The OS delivers audio in frames — chunks of a fixed number of samples per channel, commonly 10–20 ms.

Nothing in this raw buffer is ready to hand to a speech recognition model. The format is wrong in at least two dimensions simultaneously.

⚠️ Common Mistake: Opening a stream with samplerate=16000, channels=1 does not guarantee you will receive 16 kHz mono audio. If the underlying hardware only supports 48 kHz natively, the driver may silently resample — or return 48 kHz regardless. You must check the actual stream parameters after opening.

Stage 2: Handoff Point A — Conforming to the Speech Model's Contract

Speech recognition models have a well-established input contract: mono, 16 kHz, 16-bit signed integer PCM. Deviating from it does not cause an error message; it causes the model to process the wrong data quietly. To go from raw capture to model input, your code must perform up to three independent transformations:

Downmixing: Stereo to Mono

import numpy as np

def stereo_to_mono(pcm_bytes: bytes) -> np.ndarray:
    """Convert interleaved int16 stereo PCM bytes to mono int16 array."""
    stereo = np.frombuffer(pcm_bytes, dtype=np.int16)
    stereo = stereo.reshape(-1, 2)
    # Average channels; .mean() promotes to float64 internally, avoiding overflow
    mono = stereo.mean(axis=1).astype(np.int16)
    return mono

⚠️ Common Mistake: Writing (stereo[:, 0] + stereo[:, 1]) // 2 keeps the arithmetic in int16, so the sum of two loud samples can wrap around before the division ever happens. The .mean() path above is the safer idiom.
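The wraparound is easy to see with a few hand-picked values. The sample values below are hypothetical; the widened int32 version is shown as one alternative to .mean() when you want to stay in integer arithmetic.

```python
import numpy as np

left = np.array([30000, -30000, 1000], dtype=np.int16)
right = np.array([30000, -30000, 3000], dtype=np.int16)

# int16 + int16 stays int16: 60000 does not fit, so it wraps before the division.
wrapped = (left + right) // 2

# Widening to int32 first gives the sum room to exist.
safe = ((left.astype(np.int32) + right.astype(np.int32)) // 2).astype(np.int16)

print(wrapped)  # first two entries have the wrong sign and magnitude
print(safe)
```

NumPy raises no error or warning for the array case, so the only symptom is distorted audio on loud passages.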

Resampling: Native Rate to 16 kHz

Resampling converts a stream at one rate to a stream at a different rate, requiring interpolation and anti-aliasing filtering. Use a library that handles this correctly:

import numpy as np
from math import gcd
from scipy.signal import resample_poly

def resample_audio(samples: np.ndarray, orig_rate: int, target_rate: int) -> np.ndarray:
    """Resample a mono int16 array from orig_rate to target_rate."""
    divisor = gcd(orig_rate, target_rate)
    up = target_rate // divisor
    down = orig_rate // divisor
    float_samples = samples.astype(np.float32)
    resampled = resample_poly(float_samples, up, down)
    return np.clip(resampled, -32768, 32767).astype(np.int16)

For 48,000 Hz → 16,000 Hz, up=1 and down=3. For 44,100 Hz → 16,000 Hz, the greatest common divisor is 100, so up=160 and down=441.

⚠️ Common Mistake: Naively skipping samples to "downsample" — taking every third sample to go from 48 kHz to 16 kHz — is only valid if no frequencies above 8 kHz are present. Real microphone input contains high-frequency noise that aliases into the speech band when you skip samples without filtering. resample_poly handles this filtering internally.

Stage 3: Speech Recognition and Language Model — Text, Not Bytes

Once the conforming audio buffer reaches the ASR model, the model's output is text. The language model processes that text and produces a text response. Neither stage touches audio at all. This matters architecturally: the ASR model's inference time and the language model's generation time contribute to end-to-end delay but do not introduce format conversion issues. The format problems are symmetric — they appear at Handoff A and Handoff B.

Stage 4: TTS Output — A Format You Don't Control

TTS engines are the least predictable part of the pipeline from a format perspective. Common TTS output formats include:

Sample rate: 22,050 Hz or 24,000 Hz for neural TTS models; some output 44,100 or 48,000 Hz.
Numeric type: Some engines return float32 in [-1.0, 1.0]; others return int16; some return encoded audio (MP3 or Opus) that must be decoded before you can treat it as PCM.
Channel count: Usually mono, but not universally.

You cannot assume TTS output is in any particular format without verifying it at runtime. Write the format assertion before writing any playback code.
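A format assertion for decoded TTS output could be as small as the sketch below. Everything here is an assumption for illustration: the function name, the idea that your TTS client reports its sample rate as metadata (the array itself cannot reveal it), and the (frames, channels) layout for multi-channel arrays.

```python
import numpy as np

def check_tts_format(samples: np.ndarray, reported_rate: int, *,
                     expected_dtype, expected_rate: int, expected_channels: int) -> None:
    """Raise ValueError if decoded TTS audio does not match the expected format."""
    if samples.dtype != np.dtype(expected_dtype):
        raise ValueError(f"dtype {samples.dtype}, expected {np.dtype(expected_dtype)}")
    if reported_rate != expected_rate:
        raise ValueError(f"sample rate {reported_rate}, expected {expected_rate}")
    # Assumed layout: 1-D for mono, (frames, channels) for multi-channel.
    channels = 1 if samples.ndim == 1 else samples.shape[1]
    if channels != expected_channels:
        raise ValueError(f"{channels} channel(s), expected {expected_channels}")

# Hypothetical engine output: 10 ms of float32 mono at 24 kHz.
tts_chunk = np.zeros(240, dtype=np.float32)
check_tts_format(tts_chunk, 24_000,
                 expected_dtype=np.float32, expected_rate=24_000, expected_channels=1)
```

Calling this once per synthesized chunk turns a silent format drift into an immediate, descriptive exception.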

Stage 5: Handoff Point B — Conforming to the Playback Device's Contract

Handoff B involves the inverse operations of Handoff A: upsampling from the TTS rate to the device rate, type conversion if TTS output was float32 and playback expects int16, and channel duplication if needed.

import numpy as np
from scipy.signal import resample_poly
from math import gcd

def tts_to_playback(tts_samples: np.ndarray,
                    tts_rate: int,
                    device_rate: int,
                    device_channels: int) -> bytes:
    """
    Convert float32 mono TTS output to int16 PCM bytes
    at the device's native rate and channel count.
    """
    # 1. Resample to device rate
    divisor = gcd(tts_rate, device_rate)
    up = device_rate // divisor
    down = tts_rate // divisor
    resampled = resample_poly(tts_samples.astype(np.float64), up, down)

    # 2. Convert float32 [-1.0, 1.0] to int16
    clipped = np.clip(resampled, -1.0, 1.0)
    as_int16 = (clipped * 32767).astype(np.int16)

    # 3. Duplicate to stereo if needed (interleaved)
    if device_channels == 2:
        as_int16 = np.column_stack([as_int16, as_int16]).reshape(-1)

    return as_int16.tobytes()

The np.clip before the integer conversion is not optional. If TTS samples exceed [-1.0, 1.0] (which neural models occasionally produce during unusual prosody), skipping the clip hands astype(np.int16) values outside the int16 range. NumPy does not clamp them; the result is undefined and in practice usually wraps, producing loud crackles. The factor 32767 rather than 32768 is intentional: the int16 range is asymmetric ([-32768, 32767]), so scaling +1.0 by 32768 would land one step past the largest representable value, while 32767 maps positive full scale exactly.
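Both directions of the int16↔float32 conversion fit in a pair of small helpers. This is a sketch of one common convention (divide by 32768 going up, multiply by 32767 going down); the helper names are hypothetical, and rounding instead of truncating on the way down is an assumption that differs slightly from the function above.

```python
import numpy as np

def int16_to_float32(samples: np.ndarray) -> np.ndarray:
    """Map int16 [-32768, 32767] onto float32 [-1.0, 1.0)."""
    return samples.astype(np.float32) / 32768.0

def float32_to_int16(samples: np.ndarray) -> np.ndarray:
    """Clamp to [-1.0, 1.0], scale by 32767, and round to int16."""
    clipped = np.clip(samples, -1.0, 1.0)
    return np.round(clipped * 32767.0).astype(np.int16)

# Hypothetical TTS samples, two of them outside the nominal range.
out_of_range = np.array([-1.5, -1.0, 0.0, 0.25, 1.0, 1.5], dtype=np.float32)
print(float32_to_int16(out_of_range))
```

With these two scale factors a round trip is not bit-exact: it can move a sample by one quantization step, which is inaudible but worth knowing if you compare buffers byte for byte.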

Frame Size and Its Relationship to Latency

Every buffering boundary in the pipeline adds latency. A frame is the number of samples each stage processes as an atomic unit. At capture, the OS delivers frames of typically 10–20 ms. A 20 ms frame at 16 kHz contains 320 samples.

End-to-end latency is roughly the sum of all buffering delays:

Capture buffer:      20 ms
ASR accumulation:   200 ms
ASR inference:       80 ms
LLM generation:     300 ms
TTS synthesis:      150 ms
Playback buffer:     20 ms
────────────────────────────
Total:              ~770 ms

(These numbers are illustrative — actual values depend on hardware, model size, and implementation.)

Smaller frame sizes reduce latency but increase per-frame processing overhead. Larger frame sizes reduce overhead but add latency. For a conversational voice agent, the typical target is to keep capture-to-transcript latency under 300–400 ms total, which requires deliberate frame size choices at every stage.
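The budget arithmetic and the frame-size trade-off can be tabulated directly. The figures are the illustrative ones from the budget above, and the dictionary keys and helper name are assumptions made for this sketch.

```python
SAMPLE_RATE = 16_000  # ASR-side rate used throughout this lesson

# Illustrative budget in milliseconds, copied from the figures above.
budget_ms = {
    "capture_buffer": 20,
    "asr_accumulation": 200,
    "asr_inference": 80,
    "llm_generation": 300,
    "tts_synthesis": 150,
    "playback_buffer": 20,
}

total_ms = sum(budget_ms.values())
capture_to_transcript_ms = sum(
    budget_ms[k] for k in ("capture_buffer", "asr_accumulation", "asr_inference")
)

def samples_per_frame(frame_ms: int, sample_rate: int = SAMPLE_RATE) -> int:
    """Number of samples in one frame of the given duration."""
    return sample_rate * frame_ms // 1000

print(total_ms, capture_to_transcript_ms)
for frame_ms in (10, 20, 40):
    # Shorter frames add less buffering delay but mean more frames per second.
    print(frame_ms, samples_per_frame(frame_ms), 1000 // frame_ms)
```

With these numbers the capture-to-transcript portion is 300 ms, right at the lower edge of the target, so any growth in the ASR accumulation window has to be paid for elsewhere.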

Tracking Boundaries Explicitly

The single most effective practice for avoiding format bugs is to name every format boundary in code — not in comments, but in function signatures, variable names, and assertion points:

Variables holding capture-format audio are named distinctly from ASR-format audio (raw_capture_bytes vs. asr_input_samples).
Each conversion function declares its input and output format in its docstring or type hints.
Assertion functions fire at every handoff point, catching mismatches at the boundary rather than inside the model.

Stage	Dtype	Sample Rate	Channels	Notes
OS capture output	int16	Device native (44.1/48 kHz)	1 or 2	May vary by hardware
After Handoff A	int16	16,000 Hz	1 (mono)	ASR contract
ASR output	—	—	—	Text, not PCM
TTS output	float32 or int16	22–24 kHz typical	Usually 1	Verify per engine
After Handoff B	int16	Device native	Device native	Playback contract
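One way to put these boundaries into code is a small format descriptor plus an assertion helper. This is a sketch, not a prescribed design: `AudioFormat`, `ASR_INPUT`, and `assert_format` are hypothetical names, and only the two dtypes discussed in this lesson are supported.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AudioFormat:
    dtype: str          # "int16" or "float32"
    sample_rate: int    # Hz
    channels: int

    @property
    def bytes_per_second(self) -> int:
        width = {"int16": 2, "float32": 4}[self.dtype]
        return self.sample_rate * width * self.channels

# The ASR contract from the table above.
ASR_INPUT = AudioFormat("int16", 16_000, 1)

def assert_format(actual: AudioFormat, expected: AudioFormat, boundary: str) -> None:
    """Fail loudly at the handoff instead of quietly inside the model."""
    if actual != expected:
        raise ValueError(f"{boundary}: expected {expected}, got {actual}")

# Hypothetical capture format that has not yet been through Handoff A.
capture = AudioFormat("int16", 48_000, 2)
try:
    assert_format(capture, ASR_INPUT, "Handoff A")
except ValueError as err:
    print(err)
```

Passing an `AudioFormat` alongside each buffer makes the metadata that raw PCM lacks travel with the bytes.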

Practical Inspection: Reading and Verifying Raw Audio

With the pipeline anatomy clear, the next skill is being able to look directly at PCM data and confirm it matches what you expect. Every bug in a voice pipeline has a physical manifestation; the moment you read a buffer correctly into a numpy array and check its actual values, most format problems become immediately obvious.

Reading Raw PCM Into NumPy

A raw PCM file contains nothing but sample values — no headers, no metadata. The dtype argument is not optional — it tells numpy how to parse the byte stream:

import numpy as np

# Reading 16-bit signed integer PCM (the format most speech models expect)
samples_int16 = np.fromfile("recording.raw", dtype=np.int16)

# Reading 32-bit float PCM (common from TTS engines and some audio APIs)
samples_float32 = np.fromfile("synth_output.raw", dtype=np.float32)

Once you have an array, your first checks should always be the same:

print("Shape:  ", samples_int16.shape)   # total sample count
print("dtype:  ", samples_int16.dtype)   # confirms the read type
print("Min:    ", samples_int16.min())
print("Max:    ", samples_int16.max())
print("Mean:   ", samples_int16.mean())  # should be near 0 for DC-balanced audio
# Widen before abs(): np.abs(-32768) overflows back to -32768 in int16
print("Abs max:", np.abs(samples_int16.astype(np.int32)).max())

Each number tells you something specific:

Shape tells you the total sample count. For 2 seconds of mono 16 kHz audio, you expect exactly 32,000 samples. A different count indicates truncation, wrong channel count, or a dtype mismatch.
Min and max tell you whether audio is active and whether it has clipped. If max is exactly 32,767 across many consecutive samples, the waveform is clipping. If both are near zero, the buffer is silent.
Mean should be close to zero. A mean far from zero (e.g., 2,000 for int16) indicates a DC offset that will cause problems downstream.

Diagnostic checklist from four array statistics:

  np.int16 speech — healthy signal
  ┌─────────────────────────────────────────────────────┐
  │  shape:  (32000,)  ← 2s × 16000 Hz × 1 channel    │
  │  min:    -21453    ← good dynamic range             │
  │  max:     19872    ← no clipping                   │
  │  mean:      -3.2   ← near-zero (DC balanced)       │
  └─────────────────────────────────────────────────────┘

  np.int16 speech — clipped signal
  ┌─────────────────────────────────────────────────────┐
  │  min:    -32768    ← hit the floor                 │
  │  max:     32767    ← hit the ceiling               │
  └─────────────────────────────────────────────────────┘

  np.int16 read on float32 bytes (wrong dtype)
  ┌─────────────────────────────────────────────────────┐
  │  shape:  (64000,)  ← 2× expected (4 bytes vs 2)   │
  └─────────────────────────────────────────────────────┘

💡 Pro Tip: The shape check is often the fastest bug detector. If you expect N samples and get 2N, you are almost certainly reading float32 bytes as int16 (or stereo as mono). The math is exact because PCM byte layout is deterministic.
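The shape arithmetic is simple enough to automate. As a rough sketch (the function names and the wording of the diagnoses are assumptions made for illustration), you can compare the count you got against the count you expected and name the most likely cause:

```python
def expected_samples(duration_s: float, sample_rate: int, channels: int = 1) -> int:
    """Number of sample values expected for a clip of the given duration."""
    return round(duration_s * sample_rate * channels)

def diagnose_count(actual: int, expected: int) -> str:
    """Map an actual/expected sample-count ratio to its most likely cause."""
    if actual == expected:
        return "ok"
    if actual == 2 * expected:
        return "2x: float32 bytes read as int16, or stereo read as mono"
    if 2 * actual == expected:
        return "0.5x: int16 bytes read as float32"
    if actual < expected:
        return "short: possible truncation"
    return "unexpected count: check sample rate, channels, and dtype"
```

For the 2-second, 16 kHz mono example above, `expected_samples(2.0, 16000)` is 32,000, and an observed count of 64,000 lands in the 2x branch.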

Deliberate Misplay: Building Sample-Rate Intuition

Understanding sample rate mismatches intellectually is useful; hearing them is indispensable.

import numpy as np
import sounddevice as sd

## Load a 16 kHz mono recording
samples = np.fromfile("speech_16khz.raw", dtype=np.int16).astype(np.float32) / 32768.0

## Play at correct rate
print("Playing at correct rate (16 kHz)...")
sd.play(samples, samplerate=16000)
sd.wait()

## Lie to the playback API: declare half the real rate
print("Playing at wrong rate (8 kHz — slowed, lower pitch)...")
sd.play(samples, samplerate=8000)
sd.wait()

## Declare double the real rate
print("Playing at wrong rate (32 kHz — faster, higher pitch)...")
sd.play(samples, samplerate=32000)
sd.wait()

When you play 16 kHz audio declared as 8 kHz, the playback device stretches each sample over twice the expected time — the audio comes out at half speed and a full octave lower. Declared as 32 kHz, it plays twice as fast and an octave too high. The playback device does not know your audio was recorded at 16 kHz; it simply plays N samples per second at whatever rate you declare.

Once you have heard what a 2× rate mismatch sounds like, you will recognize the symptom immediately when a speech recognition model receives audio at the wrong rate. The model is effectively hearing speech at half or double speed, and its acoustic models are not built for that.
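The relationship between the real rate, the declared rate, and what you hear is plain arithmetic. A small sketch, assuming the declared rate is the only thing wrong (the helper name `mismatch_effect` is made up for this example):

```python
import math

def mismatch_effect(true_rate: int, declared_rate: int) -> tuple:
    """Return (speed_factor, pitch_shift_in_octaves) for audio recorded at
    true_rate but played or interpreted at declared_rate."""
    speed = declared_rate / true_rate
    return speed, math.log2(speed)

speed, octaves = mismatch_effect(16000, 8000)
print(f"speed x{speed}, pitch shift {octaves:+.1f} octaves")
```

The same arithmetic covers less obvious mismatches: audio produced at 22,050 Hz and interpreted as 16 kHz runs at about 0.73× speed, which is slower and lower pitched but not by a clean octave, so it is harder to recognize by ear.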

Plotting the Waveform: Seeing What Your Ears Miss

Certain problems are easier to see than to hear. Plotting a 50–200 ms segment reveals structure that statistics alone cannot:

import numpy as np
import matplotlib.pyplot as plt

SAMPLE_RATE = 16000
samples = np.fromfile("speech_16khz.raw", dtype=np.int16)

## Extract a 100ms window starting at 0.5 seconds
start_s, duration_s = 0.5, 0.1
start_idx = int(start_s * SAMPLE_RATE)
end_idx = int((start_s + duration_s) * SAMPLE_RATE)
window = samples[start_idx:end_idx]

time_ms = np.linspace(start_s * 1000, (start_s + duration_s) * 1000, len(window))

plt.figure(figsize=(12, 3))
plt.plot(time_ms, window, linewidth=0.5)
plt.axhline(y=32767,  color='red', linestyle='--', linewidth=0.8, label='int16 ceiling')
plt.axhline(y=-32768, color='red', linestyle='--', linewidth=0.8, label='int16 floor')
plt.xlabel("Time (ms)")
plt.ylabel("Amplitude (int16)")
plt.title("100ms PCM Window — Inspect for Clipping, Silence, or Anomalies")
plt.legend()
plt.tight_layout()
plt.show()

What to look for:

Clipping — the waveform appears to have its peaks sheared flat, hugging the red lines. This is irreversible distortion.
Unexpected silence — a flat line at zero in the middle of a speech segment, indicating a dropped chunk or incorrectly zeroed region.
Channel duplication — two interleaved waveforms (visible in even vs. odd indices) suggest stereo audio read as mono.

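## Check for accidental stereo-as-mono
ch0 = samples[0::2]  # even-indexed samples (left channel if stereo)
ch1 = samples[1::2]  # odd-indexed samples (right channel if stereo)
n = min(len(ch0), len(ch1), 500)

## Require a non-silent window: leading zeros would match trivially
if n > 0 and np.array_equal(ch0[:n], ch1[:n]) and np.any(ch0[:n] != 0):
    print("Even/odd samples identical: likely a mono source duplicated into stereo")
else:
    print("No exact duplication: true mono, distinct stereo channels, or a silent window")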

The Format-Assertion Function: Your Pipeline's First Line of Defense

The techniques above are diagnostic tools for development. The most durable investment is a format-assertion function called at every handoff point, turning silent corruption into loud, fast failures:

import numpy as np

PIPELINE_SAMPLE_RATE = 16000
PIPELINE_DTYPE       = np.int16
PIPELINE_CHANNELS    = 1
INT16_ABS_MAX        = 32768

def assert_pcm_format(
    samples: np.ndarray,
    label: str,
    expected_dtype: type    = PIPELINE_DTYPE,
    expected_channels: int  = PIPELINE_CHANNELS,
    clipping_threshold: float = 0.999,
) -> None:
    """
    Assert that a numpy PCM array matches expected format at a pipeline boundary.
    Raises ValueError with a descriptive message on any violation.
    """
    errors = []

    if samples.dtype != expected_dtype:
        errors.append(f"dtype mismatch: got {samples.dtype}, expected {expected_dtype.__name__}")

    if samples.ndim == 1:
        actual_channels = 1
    elif samples.ndim == 2:
        actual_channels = samples.shape[1]  # (frames, channels) layout, as in the stereo examples
    else:
        errors.append(f"unexpected array dimensions: {samples.ndim}")
        actual_channels = None

    if actual_channels is not None and actual_channels != expected_channels:
        errors.append(f"channel mismatch: got {actual_channels}, expected {expected_channels}")

    if samples.dtype == np.int16:
        abs_max = np.abs(samples.astype(np.int32)).max()  # upcast: abs(-32768) overflows in int16
        if abs_max > INT16_ABS_MAX * clipping_threshold:
            errors.append(f"possible clipping: abs max = {abs_max}")
        if abs_max < 10:
            errors.append(f"near-silent buffer: abs max = {abs_max}")
    elif samples.dtype == np.float32:
        abs_max = np.abs(samples).max()
        if abs_max > 1.0:
            errors.append(f"float32 values outside [-1.0, 1.0]: abs max = {abs_max:.4f}")

    if errors:
        raise ValueError(f"[PCM assertion failed at '{label}']\n  " + "\n  ".join(errors))

Note a deliberate limitation: sample rate cannot be read from a raw numpy array — it is metadata that lives outside the samples. A lightweight wrapper makes it a first-class attribute:

from dataclasses import dataclass

@dataclass
class PCMBuffer:
    samples: np.ndarray
    sample_rate: int

def assert_pcm_buffer(buf: PCMBuffer, label: str) -> None:
    if buf.sample_rate != PIPELINE_SAMPLE_RATE:
        raise ValueError(
            f"[PCM assertion failed at '{label}'] "
            f"sample rate: got {buf.sample_rate}, expected {PIPELINE_SAMPLE_RATE}"
        )
    assert_pcm_format(buf.samples, label)

Call it at every pipeline boundary:

def transcribe(buf: PCMBuffer) -> str:
    assert_pcm_buffer(buf, label="transcribe:input")
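    # ... feed buf.samples to the model (call omitted; it depends on your ASR library)
    transcript = ""  # placeholder so the sketch stays runnable
    return transcript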

def synthesize(text: str) -> PCMBuffer:
    raw = tts_engine.synthesize(text)
    result = PCMBuffer(samples=raw, sample_rate=tts_engine.sample_rate)
    assert_pcm_buffer(result, label="synthesize:output")
    return result

💡 Real-World Example: A common integration bug is a TTS engine that produces float32 audio at 22,050 Hz, passed directly to a speech model expecting int16 at 16 kHz. Without assertions, the model receives what looks like a valid array but is interpreted as a much longer sequence of incorrectly scaled samples. The output is garbage, and no exception is raised. With a format assertion at the transcribe:input boundary, the failure is immediate and names the exact violation.

🎯 Key Principle: Format assertions are not just development tooling; they are load-bearing invariants. An assertion that fires at transcribe:input tells you to look at the TTS output or conversion layer. An uncaught violation that reaches the model produces a silent behavioral failure that could take hours to trace.

Common Mistakes and What They Sound Like

Every mistake in a PCM audio pipeline eventually becomes audible — but the gap between the error and the symptom is often wide enough that developers spend hours hunting in the wrong place. Each of the five mistakes below has a precise mechanical cause, and once you know the cause-symptom pairing, diagnosis becomes fast.

Mistake 1: Reinterpreting Bytes Instead of Converting

The mistake: You have a buffer of int16 samples and call np.frombuffer(data, dtype=np.float32) — or receive float32 audio and reinterpret it as int16 — without performing a numeric conversion between the two representations.

A conversion maps values from one numeric range to another (e.g., dividing int16 samples by 32768.0 to land in [-1.0, 1.0]). A reinterpretation just tells numpy to read the same bytes under a different type assumption.

import numpy as np

int16_samples = np.array([1000, -500, 20000, -18000], dtype=np.int16)
raw_bytes = int16_samples.tobytes()

## ❌ Reinterpretation — two int16 values share the same 4 bytes as one float32
wrong = np.frombuffer(raw_bytes, dtype=np.float32)
print(wrong)  # little-endian: approx. [-4.65e+37, -3.36e-04], which is not audio

## ✅ Correct conversion
correct = int16_samples.astype(np.float32) / 32768.0
print(correct)  # [ 0.0305, -0.0153,  0.6104, -0.5493] — valid normalized audio

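What you hear: Reinterpreting int16 bytes as float32 produces values of wildly varying magnitude: some vanishingly small, some near 1e38, some NaN or infinity. Played back, that typically means bursts of full-scale noise once the out-of-range values are clipped, or NaN-poisoned math further downstream. The reverse (float32 bytes read as int16) produces full-amplitude noise: loud static or a single harsh crack. Neither case raises an exception.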

What to do instead: Normalize explicitly. int16 → float32: divide by 32768.0. float32 → int16: multiply by 32767.0, clip to [-32768, 32767], then cast.
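The two conversions described above fit in a pair of small helpers. This is a sketch under stated assumptions, not a canonical implementation: the function names are illustrative, and rounding to the nearest integer before the cast is an added choice (a bare cast would truncate toward zero).

```python
import numpy as np

def int16_to_float32(x: np.ndarray) -> np.ndarray:
    """Convert int16 PCM to float32 in [-1.0, 1.0) by dividing by 32768."""
    if x.dtype != np.int16:
        raise TypeError(f"expected int16, got {x.dtype}")
    return x.astype(np.float32) / 32768.0

def float32_to_int16(x: np.ndarray) -> np.ndarray:
    """Convert float32 PCM to int16: scale by 32767, round, clip, then cast."""
    if x.dtype != np.float32:
        raise TypeError(f"expected float32, got {x.dtype}")
    scaled = np.rint(x * np.float32(32767.0))
    return np.clip(scaled, -32768, 32767).astype(np.int16)
```

Because the two directions use different scale factors (32768 one way, 32767 the other), a round trip is not always bit-exact: a sample can come back off by one. That is inaudible, but it matters if you compare buffers for equality in tests. The dtype checks exist so that passing the wrong array type fails loudly instead of scaling twice.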

Mistake 2: Passing Stereo Audio to a Mono Speech Model

The mistake: Your capture device returns stereo audio (two interleaved channels), and you pass the raw buffer directly to a speech recognition model that expects mono input.

Interleaved stereo PCM: [L0, R0, L1, R1, L2, R2, ...]. When handed to a mono model without downmixing, the model reads twice as many samples as it should — a one-second clip at 16 kHz stereo contains 32,000 sample values, which the model interprets as two seconds of audio.

import numpy as np

## 0.5 seconds of stereo int16 at 16 kHz
stereo = np.random.randint(-1000, 1000, size=(8000, 2), dtype=np.int16)

## ❌ Wrong: flatten and pass raw — model sees 16000 samples (1.0 sec)
wrong_mono = stereo.flatten()

## ✅ Correct: average the two channels
correct_mono = stereo.mean(axis=1).astype(np.int16)  # shape: (8000,)

What you hear (or what the model produces): Speech transcribed at roughly half the expected word rate, words that trail off and are cut, or entirely wrong words because phoneme timing is stretched. The model doesn't crash — it transcribes something, just not what was spoken.

What to do instead: Downmix before the model boundary. Averaging both channels works well for voice. If you know your microphone only captures meaningful signal on one channel, selecting it directly (stereo[:, 0]) is equally valid.
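Capture APIs that hand you raw bytes give you the interleaved layout as a flat array, not a tidy (frames, channels) matrix. A minimal sketch of de-interleaving and downmixing in that case, assuming int16 input and a known channel count (the function name is hypothetical):

```python
import numpy as np

def downmix_interleaved(flat: np.ndarray, channels: int = 2) -> np.ndarray:
    """Turn a flat interleaved buffer [L0, R0, L1, R1, ...] into mono int16."""
    if flat.size % channels != 0:
        raise ValueError("buffer does not contain a whole number of frames")
    frames = flat.reshape(-1, channels)            # shape: (frames, channels)
    total = frames.astype(np.int32).sum(axis=1)    # widen before summing
    return (total // channels).astype(np.int16)    # integer average per frame
```

The explicit size check is the useful part: a flat buffer whose length is not a multiple of the channel count is already corrupt, and reshaping it silently would hide that.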

Mistake 3: Trusting the Requested Sample Rate Without Verifying

The mistake: You ask your OS audio API for a 16 kHz stream, receive audio, and assume it is 16 kHz.

This breaks on hardware that doesn't natively support 16 kHz. Audio drivers commonly operate at a fixed internal rate (44.1 or 48 kHz) and may silently resample, or return audio at the hardware's native rate regardless of your request.

import sounddevice as sd

## After opening a stream, read back the actual parameters:
with sd.RawInputStream(samplerate=16000, channels=1, dtype='int16') as stream:
    # stream.samplerate reflects what was actually configured
    print(f"Stream configured at: {stream.samplerate} Hz")

What you hear: A sample rate mismatch sounds like a pitch and speed shift. For speech recognition, the symptom is subtler: the model may transcribe with low confidence, return fragmented output, or produce timing-sensitive errors.

What to do instead: After opening any audio stream, read back the actual configured rate from the stream object. If it differs from your target, apply explicit resampling using scipy.signal.resample_poly or a similar library before passing audio downstream. The byte-size formula gives you a cross-check: if the buffer you receive doesn't match the expected byte count for your declared rate and duration, something is wrong.
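The byte-count cross-check mentioned above can be written as two one-line functions. This sketch assumes you know how much wall-clock time a buffer is supposed to cover (for example, a 20 ms callback interval); the names are illustrative:

```python
def expected_bytes(duration_s: float, sample_rate: int,
                   channels: int = 1, bytes_per_sample: int = 2) -> int:
    """Bytes a buffer should contain for the declared format and duration."""
    return round(duration_s * sample_rate * channels * bytes_per_sample)

def implied_rate(num_bytes: int, duration_s: float,
                 channels: int = 1, bytes_per_sample: int = 2) -> float:
    """Sample rate implied by a buffer's size, given how long it took to arrive."""
    return num_bytes / (duration_s * channels * bytes_per_sample)
```

A 20 ms mono int16 buffer at 16 kHz should be 640 bytes. If your callback delivers 1,920 bytes every 20 ms instead, the implied rate is 48 kHz, which points at a driver running at its native rate regardless of the request.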

Mistake 4: Doing Sample Arithmetic on Lists or Bytes

The mistake: You have audio data as a Python list or raw bytes and perform arithmetic directly on it — summing chunks, scaling amplitude, downmixing — without converting to a numpy array first.

## ❌ Python list arithmetic — no overflow, but silent range violation
samples_list = [30000, 28000, -30000, -28000]
scaled = [s * 2 for s in samples_list]  # [60000, 56000, -60000, -56000]
## Values exceed int16 range — no error raised here
## But: np.array(scaled).astype(np.int16) silently wraps: 60000 → -5536
## (np.array(scaled, dtype=np.int16) wraps on older NumPy and raises OverflowError on NumPy 2.x)

## ✅ numpy with explicit upcasting
samples_np = np.array([30000, 28000, -30000, -28000], dtype=np.int16)
scaled_np = (samples_np.astype(np.int32) * 2).clip(-32768, 32767).astype(np.int16)

⚠️ Note: numpy's own int16 arithmetic also overflows silently — the difference is that with numpy you declare the dtype upfront and can deliberately upcast before scaling. With Python lists, the type is invisible until serialization.

What you hear: Wrapped integer values produce random-amplitude pops, inverted transients, and audio that has the right rhythm but completely wrong amplitude shape. Because only some samples exceed the range (the louder parts of speech), the corruption sounds like intermittent distortion that comes and goes with speech volume.

What to do instead: Convert to numpy as close to the data source as possible:

audio = np.frombuffer(raw_bytes, dtype=np.int16).copy()
## .copy() is important: frombuffer returns a read-only view

Do all arithmetic in a wider type (int32 or float32) and only cast back to int16 at the final output boundary, after clipping.
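Summing two chunks is the other common place this bites. A short sketch of the widen, operate, clip, narrow pattern applied to mixing, assuming both inputs are mono int16 (the helper name `mix_int16` is made up for this example):

```python
import numpy as np

def mix_int16(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Sum two int16 buffers without wraparound; the shorter length wins."""
    n = min(len(a), len(b))
    total = a[:n].astype(np.int32) + b[:n].astype(np.int32)  # widen before adding
    return np.clip(total, -32768, 32767).astype(np.int16)    # saturate, then narrow
```

Saturating at the rails still distorts loud passages, but it distorts them the way clipping does rather than flipping the sign of the sample. A plain `a + b` on int16 arrays wraps instead, so 30000 + 10000 comes out negative.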

Mistake 5: Concatenating Chunks Without Tracking Frame Boundaries

The mistake: Real-time audio arrives in chunks. You concatenate them and pass the buffer to a model at regular intervals without verifying the buffer contains a whole number of frames.

A frame is one sample per channel. For mono 16-bit audio, each frame is exactly 2 bytes. If you slice at a byte offset that falls in the middle of a 2-byte frame, every subsequent sample in that chunk is off by one byte.

Chunk boundary misalignment (int16 = 2 bytes per sample):

... [byte 28] [byte 29]   [byte 30] | [byte 31]   [byte 32] ...
    ←─── sample 14 ───→   ←──── sample 15 ────→   ←─ sample 16
                                    ^
                 Chunk split here: sample 15 is split across chunks
                 Chunk 2 starts mid-sample, so every subsequent sample is wrong

import numpy as np

def safe_append(buffer: np.ndarray, chunk_bytes: bytes, dtype=np.int16) -> np.ndarray:
    """Append only whole frames from chunk_bytes to buffer."""
    bytes_per_frame = np.dtype(dtype).itemsize  # 2 for int16
    usable = len(chunk_bytes) - (len(chunk_bytes) % bytes_per_frame)
    # In production: store remainder bytes for the next call
    new_samples = np.frombuffer(chunk_bytes[:usable], dtype=dtype)
    return np.concatenate([buffer, new_samples])

What you hear: Periodic clicks or static artifacts at the rate of your chunk interval. With some codecs, a misaligned buffer causes the entire chunk to fail silently and be dropped, producing gaps that sound like a poor network connection even on localhost.

What to do instead: Track a remainder buffer of unconsumed bytes between callbacks. Each time you receive a new chunk, prepend the remainder, compute how many whole frames fit, pass those to the model, and store the leftover bytes for the next call.
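The remainder-tracking approach described above might look like the following. This is a sketch, not a drop-in component: the class name `FrameAligner` is hypothetical, and it assumes a single producer calling `feed` in arrival order.

```python
class FrameAligner:
    """Carry leftover bytes between chunks so only whole frames are released."""

    def __init__(self, bytes_per_frame: int = 2):
        # bytes_per_frame = channels * bytes per sample (2 for mono int16)
        self.bytes_per_frame = bytes_per_frame
        self._remainder = b""

    def feed(self, chunk: bytes) -> bytes:
        """Prepend the stored remainder, return whole frames, keep the rest."""
        data = self._remainder + chunk
        usable = len(data) - (len(data) % self.bytes_per_frame)
        self._remainder = data[usable:]
        return data[:usable]

    @property
    def pending(self) -> int:
        """Number of bytes currently held back, waiting for the next chunk."""
        return len(self._remainder)
```

The bytes returned by `feed` are always safe to pass to `np.frombuffer(..., dtype=np.int16)`. Feeding it a 641-byte chunk followed by a 639-byte chunk yields two aligned 640-byte outputs with nothing lost, which makes it easy to test against deliberately odd-sized input.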

Summary

You've now seen the full anatomy of a PCM audio pipeline — from the physics of pressure waves through sample rates and bit depth, the format handoffs between pipeline stages, inspection techniques, and the specific mistakes that corrupt audio silently.

The five mistakes and their fixes at a glance:

Mistake	Symptom	Fix
Reinterpret int16 bytes as float32	Silence or loud static	Divide by 32768.0 to convert; don't rely on dtype alone
Pass stereo to mono model	Half-speed transcription, word drops	Average channels: `stereo.mean(axis=1)`
Trust requested sample rate	Pitch shift, garbled recognition	Read `stream.samplerate` back after opening
Arithmetic on lists or bytes	Intermittent distortion on loud audio	Convert to numpy early; upcast before scaling
Concatenate without frame alignment	Periodic clicks; dropped chunks	Track remainder bytes; only pass whole frames

Three principles to carry forward:

None of these mistakes raise exceptions. The pipeline will run, produce output, and appear to work — the only signal is audible corruption or degraded model performance. Testing with real audio, not just code execution, is essential.
Symptoms are often displaced from causes. A sample rate mismatch introduced at capture will corrupt transcription quality at inference — two pipeline stages away. Trace backward from the symptom to the earliest format boundary that could be wrong.
Format assertions are load-bearing, not optional. A small assertion function checking dtype, shape, sample rate, and value range at each handoff turns silent corruption into loud, fast failures during development.

🎯 Key Principle: Every format boundary in a voice pipeline is a potential corruption site. Explicit conversion, explicit verification, and explicit frame tracking are not defensive programming — they are the minimum viable discipline for working with PCM.

Where to go next:

Build a format-asserting wrapper around your audio capture callback that logs dtype, sample rate, channel count, and min/max on every buffer — and run it against your actual hardware before writing any model integration code.
Stress-test your chunk concatenation logic by deliberately feeding it odd-sized buffers (e.g., 641 bytes when you expect 640) and verifying that remainder tracking produces clean output.
Once you have clean, verified PCM flowing through your pipeline, the next frontier is latency: understanding how frame size and buffering strategy interact to determine end-to-end delay in a real-time voice agent.
