
YouTube / Netflix Streaming

Handle video upload, transcoding pipelines, CDN delivery, and adaptive bitrate streaming at scale.

Why Streaming System Design Is a Must-Know Interview Skill

Think about the last time you hit play on a video and it just... worked. No buffering wheel of doom. No pixelated mess. Just crisp, instant playback — whether you were on a fiber connection in Seoul or a spotty café WiFi in Buenos Aires. You probably didn't think twice about it. That seamlessness is a lie, in the best possible way. Behind those few milliseconds between your tap and that first frame is one of the most sophisticated distributed systems ever built by human engineers. And in your next FAANG interview, you might be asked to design it from scratch on a whiteboard.

Video streaming system design questions have become a rite of passage at top-tier tech companies. YouTube, Netflix, TikTok, Twitch, Disney+ — interviewers reach for these systems not to torture you, but because they are the proving grounds for every concept that matters in distributed systems at once: scale, latency, consistency, fault tolerance, cost optimization, and real-time data flow. If you can design a streaming system thoughtfully, you can design almost anything.

Why FAANG and Top Tech Companies Love This Question

System design interviews at companies like Google, Meta, Amazon, Apple, Netflix, and Microsoft are specifically engineered to reveal how you think under pressure about ambiguous, open-ended problems. The interviewer is not looking for a single correct answer — they are evaluating your engineering judgment. Video streaming questions are ideal for this purpose for several reasons.

First, they are universally relatable. Every engineer has used YouTube or Netflix, which means the problem statement needs almost no explanation. The interviewer can skip the domain briefing and immediately probe your depth. Second, these systems touch virtually every major distributed systems concept: horizontal scaling, content delivery networks (CDNs), message queues, blob storage, video transcoding pipelines, database sharding, and cache eviction strategies. A single question can naturally branch into half a dozen rich technical discussions depending on where you take it.

Third, and perhaps most importantly, streaming systems have genuinely hard trade-offs. Should you pre-generate every video resolution, or transcode lazily on demand? Should thumbnails be served from the same CDN as video chunks, or a separate image CDN? How do you handle a viral video that suddenly gets 50 million concurrent viewers? These are not trick questions — they are real decisions engineers at YouTube and Netflix make every week, and your ability to reason through them reveals your seniority.

🤔 Did you know? Netflix delivers over 15 petabytes of video data every single day. During peak evening hours in North America, Netflix alone accounts for roughly 15% of all downstream internet traffic. Designing a system that scales to that level requires fundamentally different thinking than a typical web application.

The Scale Challenge: Numbers That Should Make You Rethink Everything

Let's anchor the conversation in real numbers, because scale is not just a buzzword — it changes which solutions are even possible.

YouTube reports that users upload more than 500 hours of video every single minute. That is not a typo. Five hundred hours. Per minute. If you wanted to watch everything uploaded in a single day, it would take you over 80 years. Meanwhile, the platform serves over 2 billion logged-in users per month and handles 1 billion hours of video watched per day. Netflix has over 260 million subscribers across 190 countries, with a content library that has grown to tens of thousands of titles, each available in multiple resolutions, languages, and subtitle formats.
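Back-of-the-envelope estimation is itself an interview skill, so it is worth internalizing the arithmetic behind those headline numbers. A quick sanity check:

```python
# Sanity-checking the upload-scale numbers quoted above
UPLOAD_HOURS_PER_MINUTE = 500

hours_uploaded_per_day = UPLOAD_HOURS_PER_MINUTE * 60 * 24
print(hours_uploaded_per_day)        # 720000 hours of new video per day

# Watching one day's uploads nonstop, 24 hours a day, 365 days a year:
years_to_watch = hours_uploaded_per_day / (24 * 365)
print(round(years_to_watch, 1))      # 82.2 years
```

Being able to produce the "over 80 years" figure from first principles in a few seconds is exactly the kind of estimation interviewers probe for.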

What does this mean architecturally? It means that the solutions you reach for in a typical application — a single database, a monolithic server, uploading files directly to your web server — will not just be slow. They will be physically impossible. The laws of physics and the limits of individual machines impose hard ceilings, and the only way through is horizontal scaling, geographic distribution, and intelligent data partitioning.

📋 Quick Reference Card: Scale Comparison

                      🖥️ Typical Web App       📹 YouTube-Scale System
🔢 Users              Thousands                Billions
📦 Storage            Gigabytes                Exabytes
⏱️ Latency Target     Seconds acceptable       Sub-second time-to-first-frame
🌍 Geography          Single region            Global (190+ countries)
🔄 Uploads            Occasional               500 hours/minute
🗄️ Database           Single relational DB     Sharded, polyglot persistence

The latency expectation is particularly brutal. Users will tolerate a slight delay loading a news article. They will not tolerate a streaming video that takes three seconds to start. Research from Akamai found that viewers begin abandoning a video once startup delay exceeds about two seconds, with each additional second of delay driving roughly 6% more abandonment. For a platform with 2 billion users, that is tens of millions of lost views per percentage point of degradation.

🎯 Key Principle: At streaming scale, performance is not a feature — it is the product. A system that is functionally correct but slow enough to cause buffering is a failed system from the user's perspective, regardless of how elegant the architecture is.

Streaming Systems vs. CRUD Applications: A Fundamentally Different Beast

Most engineers learn system design through the lens of CRUD applications — Create, Read, Update, Delete. A social media post, an e-commerce product listing, a user profile. These systems are well-understood: a database stores records, a web server reads and writes them, a cache sits in front to reduce load. Straightforward.

Streaming systems break almost every assumption that makes CRUD design intuitive.

Data size and immutability. In a CRUD app, your payload might be a few kilobytes of JSON. In a streaming system, a single 4K movie is 50–100 GB of raw data. More importantly, once a video is uploaded and processed, it is essentially immutable. You do not update individual frames. This changes your storage strategy entirely — you optimize for write-once, read-many patterns, which opens the door to object storage systems like Amazon S3 rather than traditional databases.

Read vs. write asymmetry. A typical social app might have a 10:1 read-to-write ratio. A streaming platform might have a 10,000:1 ratio or higher. A video uploaded once can be watched a billion times. This asymmetry means your read path and your write path are effectively separate systems with entirely different scaling requirements and failure modes.

Sequential access patterns. Database queries jump around randomly. Video playback is fundamentally sequential — users watch frames in order. This allows for pre-fetching and chunked delivery strategies that would be useless in a CRUD context. Your CDN can predict with high confidence what bytes a user will need next.
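That predictability can be made concrete. Here is a minimal sketch of a prefetch planner (the function name, parameters, and defaults are illustrative, not taken from any real player): given where playback currently is and how much buffer is already built, it decides which upcoming segments to request.

```python
def segments_to_prefetch(current_index, buffered_seconds,
                         segment_seconds=6, target_buffer_seconds=30,
                         total_segments=1000):
    """Return indices of the upcoming segments worth fetching now.

    Playback is sequential, so the next-needed bytes are always the
    segments immediately after the current one. All names and defaults
    here are illustrative assumptions.
    """
    deficit = max(0, target_buffer_seconds - buffered_seconds)
    count = -(-deficit // segment_seconds)   # ceiling division
    first = current_index + 1
    return list(range(first, min(first + count, total_segments)))

# 12s buffered against a 30s target: fetch the next three 6-second segments
print(segments_to_prefetch(current_index=10, buffered_seconds=12))  # [11, 12, 13]
```

A database cache cannot plan ahead like this because it cannot know the next query; a video CDN can, because the next request is almost always the next segment.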

Media processing as a first-class concern. In a CRUD app, you store what the user gives you. In a streaming platform, the raw video a user uploads is just the beginning. Before anyone can watch it, it must be transcoded into multiple formats (H.264, H.265/HEVC, AV1), multiple resolutions (360p, 720p, 1080p, 4K), and multiple bitrates. This processing pipeline is itself a complex distributed system that must be fault-tolerant, scalable, and reasonably fast.

❌ Wrong thinking: "I'll store the uploaded video in my database and serve it directly from my web servers."

✅ Correct thinking: "Video is a first-class data type requiring specialized object storage, a separate processing pipeline, and edge delivery infrastructure completely decoupled from my application servers."

💡 Mental Model: Think of a streaming system as having two completely separate worlds: the cold path (upload, process, store — happens once, can tolerate latency) and the hot path (retrieve, deliver, play — happens billions of times, must be fast). Design each world independently, then connect them.

The Major Components You Must Reason About

Before your interview even begins, you should have a mental map of the major subsystems in a streaming architecture. The rest of this lesson will go deep on each one, but let's preview the landscape now so you know the terrain.

┌─────────────────────────────────────────────────────────────────┐
│                    STREAMING SYSTEM OVERVIEW                    │
└─────────────────────────────────────────────────────────────────┘

  USER UPLOAD              PROCESSING               STORAGE
 ┌──────────┐         ┌──────────────────┐      ┌──────────────┐
 │  Client  │──────▶ │  Upload Service  │────▶ │  Raw Blob    │
 │  Browser │         │  (chunked, TUS)  │      │  Storage     │
 └──────────┘         └──────────────────┘      │  (S3/GCS)    │
                               │                 └──────┬───────┘
                               ▼                        │
                      ┌────────────────┐               ▼
                      │  Message Queue │      ┌──────────────────┐
                      │  (Kafka/SQS)  │────▶ │  Transcoding     │
                      └────────────────┘      │  Workers         │
                                              │  (FFmpeg-based)  │
                                              └──────┬───────────┘
                                                     │
                                              Multiple formats
                                              720p, 1080p, 4K
                                                     │
                                                     ▼
  USER PLAYBACK             DELIVERY             PROCESSED
 ┌──────────┐         ┌──────────────────┐      ┌──────────────┐
 │  Player  │◀─────── │   CDN Edge       │◀──── │  Storage     │
 │  (ABR)   │         │   Nodes          │      │  (segments)  │
 └──────────┘         │   (CloudFront    │      └──────────────┘
   Adaptive           │    Akamai, etc.) │
   Bitrate            └──────────────────┘
   Streaming

🔧 The Upload Pipeline is where raw video enters your system. This is more complex than it sounds — raw video files are enormous, uploads can fail midway, and clients may be on unreliable connections. You need resumable upload protocols, chunked transfer, and immediate acknowledgment.

📦 Storage splits into at least two tiers: raw storage (what the user uploaded, kept for potential reprocessing) and processed storage (the transcoded segments ready for delivery). Both live in object storage systems designed for massive files, not relational databases.

🎬 The Transcoding Pipeline converts raw video into every format and quality level your platform supports. This is computationally expensive, parallelizable, and a natural fit for worker queues and distributed job processing.

🌍 The CDN (Content Delivery Network) is arguably the most important component for playback performance. It is a globally distributed network of servers that caches your video segments close to users. Without a CDN, every viewer in Tokyo would be pulling bytes from a data center in Virginia.

📶 Adaptive Bitrate (ABR) Streaming is the client-side intelligence that switches video quality dynamically based on current network conditions. This is why your video starts in lower quality and then sharpens — the player is measuring your bandwidth and selecting the best available stream.

Framing Requirements in the First Few Minutes: Your Most Critical Skill

Here is a truth that separates strong candidates from great ones: the first five minutes of a system design interview determine more of your score than any diagram you draw. Why? Because requirements clarification demonstrates senior engineering judgment. You are showing the interviewer that you know a solution cannot exist without understanding the problem.

For a streaming system specifically, you need to nail down several dimensions before you touch the whiteboard.

Functional Requirements: What Must the System Do?

Do not assume. Ask. For a YouTube-like system, the minimum clarifying questions should be:

  • 🎯 Are we designing upload and playback, or just one side?
  • 🎯 Do we need live streaming or only pre-recorded video on demand?
  • 🎯 What is the expected maximum video duration and file size?
  • 🎯 Do we need search and recommendations, or just video delivery?
  • 🎯 Is content moderation in scope?

Each answer changes your architecture substantially. Live streaming requires entirely different infrastructure than video-on-demand — it introduces real-time ingest, near-zero latency requirements, and no opportunity for pre-processing.

Non-Functional Requirements: What Are the Quality Bars?

This is where you get to show off your distributed systems vocabulary:

  • 📊 Scale targets: How many daily active users? How many uploads per day? How many concurrent viewers?
  • ⏱️ Latency: What is acceptable time-to-first-frame? Under 200ms? Under 2 seconds?
  • 📈 Availability: 99.9%? 99.99%? The difference between these numbers drives significant architectural complexity.
  • 🌍 Geographic distribution: Global or regional?
  • 💾 Storage duration: Are videos stored forever, or is there a retention policy?
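The availability bullet deserves numbers. The jump from 99.9% to 99.99% sounds incremental, but it cuts the permitted downtime by a factor of ten, which is why it drives so much architectural complexity:

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600

def allowed_downtime_minutes(availability):
    """Minutes of downtime per year permitted at a given availability level."""
    return MINUTES_PER_YEAR * (1 - availability)

print(round(allowed_downtime_minutes(0.999)))    # 526 minutes, about 8.8 hours/year
print(round(allowed_downtime_minutes(0.9999)))   # 53 minutes/year
```

At 99.9% you can survive a bad deployment and a manual rollback; at 99.99% you need automated failover, because a human cannot reliably react inside the budget.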

Here is a simple code snippet showing how you might model the core entities and their relationships in a streaming system — a useful mental exercise before drawing your architecture:

# Core data models for a streaming system
# These inform your storage, API, and pipeline design

from dataclasses import dataclass
from enum import Enum
from typing import List, Optional
from datetime import datetime

class VideoStatus(Enum):
    UPLOADING = "uploading"          # Raw file being received
    PROCESSING = "processing"        # Transcoding in progress
    READY = "ready"                  # Available for playback
    FAILED = "failed"                # Processing error

class VideoQuality(Enum):
    P360 = "360p"
    P720 = "720p"
    P1080 = "1080p"
    P2160 = "4k"                    # 2160p

@dataclass
class VideoMetadata:
    video_id: str                    # UUID, primary key in metadata DB
    user_id: str                     # Uploader
    title: str
    description: str
    duration_seconds: int
    status: VideoStatus
    created_at: datetime
    raw_storage_path: str           # e.g., s3://raw-bucket/video_id/original.mp4
    thumbnail_url: Optional[str]

@dataclass  
class ProcessedVideo:
    video_id: str
    quality: VideoQuality
    bitrate_kbps: int
    codec: str                       # e.g., "h264", "hevc", "av1"
    manifest_url: str               # HLS or DASH manifest location
    segment_base_url: str           # CDN URL prefix for video chunks
    total_segments: int

# A video in a production system has MANY ProcessedVideo rows:
# One per quality level × codec combination
# e.g., 4 qualities × 2 codecs = 8 rows per uploaded video

This data model exercise — even sketched informally — helps you and your interviewer agree on scope before you dive into architecture. Notice how VideoStatus immediately surfaces the idea of an asynchronous processing pipeline: the video is not immediately available after upload.
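One way to make that lifecycle explicit is to spell out which status transitions are legal. This sketch restates the enum from above and adds an assumed retry edge from FAILED back to PROCESSING (the original model does not specify retry semantics, so that edge is an assumption):

```python
from enum import Enum

class VideoStatus(Enum):
    UPLOADING = "uploading"
    PROCESSING = "processing"
    READY = "ready"
    FAILED = "failed"

# Legal lifecycle transitions. FAILED -> PROCESSING models a retry,
# an assumption layered on top of the data model above.
ALLOWED_TRANSITIONS = {
    VideoStatus.UPLOADING:  {VideoStatus.PROCESSING, VideoStatus.FAILED},
    VideoStatus.PROCESSING: {VideoStatus.READY, VideoStatus.FAILED},
    VideoStatus.READY:      set(),
    VideoStatus.FAILED:     {VideoStatus.PROCESSING},
}

def can_transition(current, new):
    return new in ALLOWED_TRANSITIONS[current]

print(can_transition(VideoStatus.UPLOADING, VideoStatus.PROCESSING))  # True
print(can_transition(VideoStatus.READY, VideoStatus.UPLOADING))       # False
```

Enumerating transitions like this in an interview shows you understand the video is a state machine moving through an asynchronous pipeline, not a row that is simply "created".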

Now consider how you would structure a basic API surface for the client:

# REST API endpoints for a streaming platform
# Scoping these early tells your interviewer you think end-to-end

"""
Upload Flow:
  POST   /api/v1/videos/initiate          - Get upload URL + video_id
  PUT    /api/v1/videos/{id}/upload       - Upload raw file (resumable)
  GET    /api/v1/videos/{id}/status       - Poll processing status

Playback Flow:
  GET    /api/v1/videos/{id}              - Get metadata + manifest URL
  GET    /api/v1/videos/{id}/stream       - Redirect to CDN manifest

  # The actual video segments are served directly from CDN:
  # GET https://cdn.example.com/videos/{id}/720p/segment_0001.ts
  # GET https://cdn.example.com/videos/{id}/720p/segment_0002.ts
  # ... (never touches your application servers!)

Search & Discovery (if in scope):
  GET    /api/v1/search?q={query}&page={n}
  GET    /api/v1/feed                     - Personalized recommendations
"""

# Key insight: The application server is ONLY in the critical path
# for metadata and authentication. Video bytes flow through CDN exclusively.
# This separation is fundamental to achieving streaming scale.

Notice the critical comment at the bottom of that snippet. One of the most important architectural decisions in any streaming system is that application servers must never touch video bytes during playback. Your Node.js or Python API server simply cannot handle the volume. This realization — that the playback path bypasses your application layer entirely — is a distinction that immediately signals to your interviewer that you understand streaming architecture at a deep level.
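In practice, the /stream endpoint's job reduces to resolving metadata and handing back a CDN location, typically via an HTTP 302 redirect. A minimal sketch, using the hypothetical cdn.example.com host and URL layout from the API listing above:

```python
CDN_BASE = "https://cdn.example.com"   # hypothetical CDN hostname

def manifest_redirect_url(video_id, quality="720p"):
    """Build the CDN manifest URL that /stream redirects (HTTP 302) to.

    Every subsequent segment request then goes straight to the CDN edge,
    never through this application server.
    """
    return f"{CDN_BASE}/videos/{video_id}/{quality}/index.m3u8"

print(manifest_redirect_url("abc123"))
# https://cdn.example.com/videos/abc123/720p/index.m3u8
```

The application server does a metadata lookup, an auth check, and a string format; the petabytes flow elsewhere.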

// Client-side adaptive bitrate player pseudocode
// Shows how the player makes quality decisions

class AdaptiveBitratePlayer {
  constructor(manifestUrl) {
    this.manifestUrl = manifestUrl;
    this.availableQualities = [];  // Fetched from HLS/DASH manifest
    this.currentQuality = null;
    this.bufferSeconds = 0;
    this.bandwidthEstimate = 0; // bits per second
  }

  async selectQuality() {
    // If the buffer is nearly empty, prioritize uninterrupted playback:
    // drop straight to the lowest quality and rebuild the buffer
    if (this.bufferSeconds < 5) {
      return this.lowestQuality();
    }

    // Estimate bandwidth from the most recent segment download
    // (this.lastSegment is updated by the segment-fetch loop, not shown here)
    const lastSegmentBits = this.lastSegment.byteSize * 8;
    const lastSegmentLoadSeconds = this.lastSegment.loadTimeMs / 1000;
    this.bandwidthEstimate = lastSegmentBits / lastSegmentLoadSeconds;

    // Use 80% of the estimate as a safety margin against network jitter
    const safeBandwidth = this.bandwidthEstimate * 0.8;

    // Pick the highest quality whose bitrate fits within the safe bandwidth
    const target = this.availableQualities
      .filter(q => q.bitrateKbps * 1000 <= safeBandwidth)
      .sort((a, b) => b.bitrateKbps - a.bitrateKbps)[0];

    return target || this.lowestQuality();
  }

  lowestQuality() {
    return [...this.availableQualities]
      .sort((a, b) => a.bitrateKbps - b.bitrateKbps)[0];
  }
}
// This logic runs continuously, making decisions every few seconds.
// The result: smooth playback that degrades gracefully on poor networks.

This pseudocode illustrates that adaptive bitrate is not magic — it is a continuous feedback loop between measured network conditions and quality selection. Understanding this loop helps you design the CDN and segment sizes appropriately on the server side.

💡 Pro Tip: In your interview, explicitly state your assumptions when you make them. Instead of just drawing a CDN box, say: "I'm assuming we need global delivery with under 100ms latency to end users, which means we'll need a CDN with edge nodes in at least North America, Europe, and Asia-Pacific. Let me explain how that affects the architecture." This narrates your thinking and invites the interviewer to redirect you if assumptions are wrong.

Setting Yourself Up for the Rest of the Interview

By the time you finish your requirements clarification, you should have established a shared understanding of scope that allows you to make every subsequent architectural decision with confidence. You are not guessing — you are engineering.

Here is the mental checklist to carry into any streaming system design question:

🧠 Separate the upload path from the playback path — they scale differently and have different failure modes

📚 Think in data types, not features — video bytes, metadata, and thumbnails each belong in fundamentally different storage systems

🔧 Identify the asynchronous seams — transcoding happens in a queue, not inline with the upload request; design accordingly

🎯 CDN is not optional — it is the foundation of playback performance; any design without it fails the scale requirement

🔒 Non-functional requirements drive architecture — latency, availability, and geographic distribution targets should appear in your diagram, not just your conversation

⚠️ Common Mistake: Spending the first fifteen minutes of a streaming design interview designing the user authentication system or the recommendation algorithm in detail. Both are real components, but they are not what makes streaming hard. Keep your focus on the video pipeline — upload, process, store, deliver. Authentication is a footnote. Video delivery at petabyte scale is the entire story.

🧠 Mnemonic: Remember "UPSET" to recall the core pipeline components:

  • Upload service
  • Processing / transcoding
  • Storage (raw and processed)
  • Edge delivery (CDN)
  • Transport (adaptive bitrate to client)

The remainder of this lesson will take you deep into each of these components, showing you not just what they are, but why each design choice exists and how to articulate trade-offs in real-time under interview pressure. By the end, you will not just recognize a streaming architecture — you will be able to construct and defend one from first principles.

The systems that feel like magic to users are always deeply rational to the engineers who built them. Time to pull back the curtain.

Core Architecture: Video Upload, Processing, and Storage

Before a single frame reaches a viewer's screen, a video must survive a complex gauntlet: upload, validation, transcoding, packaging, and storage across globally distributed infrastructure. This pipeline is one of the richest areas of system design, touching distributed queues, blob storage, encoding theory, and database schema decisions all at once. Understanding it end-to-end is what separates candidates who draw boxes on a whiteboard from those who can justify every arrow connecting them.

The Upload Problem: Why Naive Approaches Fail

Imagine a user uploading a raw 4K video file. Uncompressed, that file can easily exceed 50 GB. Even a compressed H.264 file at high bitrate might be 10–20 GB. If you design the upload as a single HTTP POST to your application server, you have created at least three failure modes before the file even lands: network interruptions force a complete restart, your application servers become a bottleneck holding giant files in memory, and you have no way to parallelize processing until the entire upload completes.

The industry solution is chunked uploading combined with resumable uploads. The client splits the file into fixed-size chunks — typically 5 MB to 16 MB — and uploads each chunk independently. If the connection drops after chunk 47 of 200, the client can resume from chunk 48 rather than starting over. YouTube's actual upload API uses this pattern, and AWS S3's Multipart Upload API is the canonical cloud implementation.

To avoid routing upload traffic through your application servers entirely, you use presigned URLs. Instead of the client uploading to your backend, your backend generates a time-limited, cryptographically signed URL that grants the client direct write access to a specific object in blob storage (S3, GCS). The client then uploads each chunk directly to blob storage, keeping your application servers completely out of the data path.

Client                  API Server              Blob Storage (S3/GCS)
  |                         |                          |
  |-- POST /upload/initiate->|                          |
  |                         |-- CreateMultipartUpload->|
  |                         |<-- uploadId ------------ |
  |<-- { uploadId, urls } --|
  |                         |                          |
  |-- PUT chunk 1 ---------------------------------------->|
  |-- PUT chunk 2 ---------------------------------------->|
  |-- PUT chunk N ---------------------------------------->|
  |                         |                          |
  |-- POST /upload/complete->|                          |
  |                         |-- CompleteMultipart ----->|
  |                         |<-- final object URL ----- |
  |<-- { videoId, status } -|

The API server's role is thin: issue credentials, track upload state, and trigger downstream processing when all chunks are confirmed. This is a critical architectural boundary to articulate in interviews.

💡 Pro Tip: In an interview, explicitly mention that the presigned URL approach removes your application servers from the upload data path. This demonstrates you understand the difference between the control plane (your API server deciding what is allowed) and the data plane (the actual bytes moving to storage).

The metadata you store during initiation matters. You need to record the uploadId, the expected total chunks, which chunks have been confirmed, the video owner, and an expiry time for incomplete uploads. A simple Redis hash keyed on uploadId works well for this transient state, with a TTL of 24–48 hours after which incomplete uploads are cleaned up by a background job.

# Upload initiation handler (Python/Flask example)
import boto3
import uuid
import redis
import json
from flask import Flask, request, jsonify

app = Flask(__name__)
s3 = boto3.client('s3')
redis_client = redis.Redis(host='localhost', port=6379, db=0)

BUCKET = 'raw-video-uploads'
CHUNK_SIZE_MB = 8
URL_EXPIRY_SECONDS = 3600  # 1 hour per presigned URL

@app.route('/upload/initiate', methods=['POST'])
def initiate_upload():
    data = request.json
    filename = data['filename']
    total_chunks = data['total_chunks']
    user_id = data['user_id']

    video_id = str(uuid.uuid4())
    object_key = f'raw/{user_id}/{video_id}/original'

    # Start S3 multipart upload session
    response = s3.create_multipart_upload(
        Bucket=BUCKET,
        Key=object_key,
        ContentType='video/mp4'
    )
    upload_id = response['UploadId']

    # Generate presigned URLs for each chunk (PartNumber is 1-indexed)
    presigned_urls = []
    for part_num in range(1, total_chunks + 1):
        url = s3.generate_presigned_url(
            'upload_part',
            Params={
                'Bucket': BUCKET,
                'Key': object_key,
                'UploadId': upload_id,
                'PartNumber': part_num
            },
            ExpiresIn=URL_EXPIRY_SECONDS
        )
        presigned_urls.append({'part_number': part_num, 'url': url})

    # Persist upload session state in Redis with 48-hour TTL
    session = {
        'video_id': video_id,
        'user_id': user_id,
        'upload_id': upload_id,
        'object_key': object_key,
        'total_chunks': total_chunks,
        'completed_parts': []  # Will be populated as client confirms ETags
    }
    redis_client.setex(
        f'upload:{video_id}',
        172800,  # 48 hours in seconds
        json.dumps(session)
    )

    return jsonify({
        'video_id': video_id,
        'upload_id': upload_id,
        'presigned_urls': presigned_urls
    })

This handler does exactly what was described: it creates an S3 multipart session, generates one presigned URL per chunk, and saves upload state to Redis. The client uploads chunks directly to S3 using those URLs, collecting ETag headers from each response. When all chunks are uploaded, the client calls a /upload/complete endpoint, passing the ETags so your server can call CompleteMultipartUpload on S3 and then enqueue a transcoding job.
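The client side of that flow needs to split the file into the agreed part sizes before it can request the right number of presigned URLs. A small helper, assuming 8 MB parts to match CHUNK_SIZE_MB above, that maps a file size to 1-indexed part numbers and byte ranges:

```python
def chunk_ranges(total_bytes, chunk_mb=8):
    """Split a file of total_bytes into (part_number, start, end) tuples.

    part_number is 1-indexed to match S3 multipart semantics; start and
    end are inclusive byte offsets suitable for slicing the local file.
    """
    chunk = chunk_mb * 1024 * 1024
    parts = []
    start, part_num = 0, 1
    while start < total_bytes:
        end = min(start + chunk, total_bytes) - 1
        parts.append((part_num, start, end))
        part_num += 1
        start = end + 1
    return parts

# A 20 MB file with 8 MB chunks yields two full parts and one 4 MB tail
for part in chunk_ranges(20 * 1024 * 1024):
    print(part)
# (1, 0, 8388607)
# (2, 8388608, 16777215)
# (3, 16777216, 20971519)
```

On resume after a dropped connection, the client reuses this same deterministic plan, skips the part numbers whose ETags it already holds, and uploads only the remainder.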

The Transcoding Pipeline: From Raw File to Streamable Formats

Once raw video lands in blob storage, it is useless to end users in its original form. A raw .mov from a professional camera uses a codec (like ProRes) that browsers cannot decode. Even if the codec were compatible, serving a single 4K file to every user regardless of their connection speed would be catastrophic. The transcoding pipeline solves both problems.

Transcoding is the process of converting a video from one encoding to another. In the context of streaming, this means producing multiple output renditions from a single source: 1080p, 720p, 480p, 360p, and so on, each in a format browsers and mobile devices can actually play.

The dominant streaming formats are HLS (HTTP Live Streaming), developed by Apple, and MPEG-DASH (Dynamic Adaptive Streaming over HTTP). Both work by splitting video into small segments (typically 2–10 seconds each) and generating a manifest file that lists all available segments and their URLs. The player downloads the manifest, then fetches segments one by one, choosing the appropriate quality tier based on available bandwidth.

Raw Video (S3)
      |
      v
[Transcoding Job Queue]
      |
      +-----> Worker 1: encode 1080p --> segment_0001.ts, segment_0002.ts... + manifest.m3u8
      +-----> Worker 2: encode  720p --> segment_0001.ts, segment_0002.ts... + manifest.m3u8
      +-----> Worker 3: encode  480p --> segment_0001.ts, segment_0002.ts... + manifest.m3u8
      +-----> Worker 4: encode  360p --> segment_0001.ts, segment_0002.ts... + manifest.m3u8
      |
      v
[All renditions complete]
      |
      v
[Master Manifest written to S3]
      |
      v
[Metadata DB updated: status = READY]
      |
      v
[Notification Service: video available]

The key architectural insight here is parallelism. Each resolution can be encoded independently, so you can fan out the work across multiple worker instances simultaneously. A single 10-minute video with 6-second segments produces roughly 100 segments per rendition, around 400 output files across four resolutions, and doing this sequentially would be prohibitively slow.

Distributed worker queues are the standard pattern. You publish one transcoding job per resolution (or per segment for more granular parallelism) to a message queue such as AWS SQS, RabbitMQ, or Kafka. Workers pull jobs, call FFmpeg or a managed transcoding service (AWS Elemental MediaConvert, Google Transcoder API), write outputs to blob storage, and acknowledge the message. A job coordinator — either a separate service or a state machine (AWS Step Functions is a common choice here) — waits for all renditions to complete before marking the video as ready.
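The coordinator's core bookkeeping is simple enough to sketch in memory. It consumes the RENDITION_COMPLETE events the workers emit and reports when every expected rendition has landed. This is a minimal sketch: a real coordinator would persist this state in a database or a Step Functions execution, since a coordinator crash must not lose progress.

```python
class TranscodeCoordinator:
    """Tracks rendition completion for one video.

    In-memory sketch only; production state belongs in a durable store.
    """

    def __init__(self, video_id, expected_resolutions):
        self.video_id = video_id
        self.expected = set(expected_resolutions)
        self.completed = set()

    def handle_event(self, event):
        """Record a completion event; return True once all renditions exist."""
        if (event.get('event') == 'RENDITION_COMPLETE'
                and event.get('video_id') == self.video_id):
            self.completed.add(event['resolution'])
        return self.is_ready()

    def is_ready(self):
        return self.completed == self.expected

coord = TranscodeCoordinator('vid-1', ['1080p', '720p', '480p', '360p'])
for res in ['720p', '1080p', '480p']:
    coord.handle_event({'event': 'RENDITION_COMPLETE',
                        'video_id': 'vid-1', 'resolution': res})
print(coord.is_ready())   # False: 360p is still pending
print(coord.handle_event({'event': 'RENDITION_COMPLETE',
                          'video_id': 'vid-1', 'resolution': '360p'}))  # True
```

Once is_ready flips to True, the coordinator writes the master manifest and updates the metadata DB to READY, exactly as in the pipeline diagram above.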

⚠️ Common Mistake: Designing transcoding as a single synchronous HTTP call. Encoding a 1-hour video can take 20+ minutes. Synchronous HTTP will time out, and you lose all retry and parallelism benefits. Always push transcoding into an async queue.

🎯 Key Principle: Each stage of the pipeline should be idempotent. If a worker crashes after encoding 720p but before writing the output to S3, re-running the same job should produce the same output without side effects. This is achieved by writing output to deterministic paths (based on videoId and resolution) and checking for existence before writing.
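That existence check can be factored into a single guard. The sketch below injects a storage object exposing exists/put so the pattern is visible in isolation; the InMemoryStore stand-in is illustrative, and against S3 the exists call would be a head_object request:

```python
class InMemoryStore:
    """Stand-in for blob storage, exposing the same exists/put shape."""
    def __init__(self):
        self.objects = {}
    def exists(self, key):
        return key in self.objects
    def put(self, key, data):
        self.objects[key] = data

def write_if_absent(store, key, produce):
    """Idempotent write: skip the expensive produce step on re-runs."""
    if store.exists(key):
        return False
    store.put(key, produce())
    return True

store = InMemoryStore()
# Deterministic path derived from videoId + resolution, as the principle requires
key = 'processed/vid-1/720p/index.m3u8'
print(write_if_absent(store, key, lambda: b'#EXTM3U ...'))  # True: first run writes
print(write_if_absent(store, key, lambda: b'#EXTM3U ...'))  # False: retry is a no-op
```

Because the key is derived purely from videoId and resolution, a crashed-and-retried job always targets the same path, so the retry either completes missing work or does nothing.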

# Transcoding worker (Python example using SQS and FFmpeg)
import boto3
import subprocess
import json
import os

sqs = boto3.client('sqs', region_name='us-east-1')
s3 = boto3.client('s3')

QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789/transcode-jobs'
RAW_BUCKET = 'raw-video-uploads'
PROCESSED_BUCKET = 'processed-video-segments'

RESO_CONFIGS = {
    '1080p': {'width': 1920, 'height': 1080, 'bitrate': '5000k'},
    '720p':  {'width': 1280, 'height': 720,  'bitrate': '2500k'},
    '480p':  {'width': 854,  'height': 480,  'bitrate': '1000k'},
    '360p':  {'width': 640,  'height': 360,  'bitrate': '500k'},
}

def process_job(job):
    video_id = job['video_id']
    resolution = job['resolution']
    raw_key = job['raw_s3_key']
    config = RESO_CONFIGS[resolution]

    # Download raw file to local tmp
    local_raw = f'/tmp/{video_id}_raw.mp4'
    s3.download_file(RAW_BUCKET, raw_key, local_raw)

    # Output path for HLS segments
    output_dir = f'/tmp/{video_id}_{resolution}'
    os.makedirs(output_dir, exist_ok=True)
    output_manifest = f'{output_dir}/index.m3u8'

    # Run FFmpeg to produce HLS segments at target resolution
    cmd = [
        'ffmpeg', '-i', local_raw,
        '-vf', f"scale={config['width']}:{config['height']}",
        '-b:v', config['bitrate'],
        '-codec:v', 'libx264', '-codec:a', 'aac',
        '-hls_time', '6',               # 6-second segments
        '-hls_playlist_type', 'vod',
        '-hls_segment_filename', f'{output_dir}/seg_%04d.ts',
        output_manifest
    ]
    subprocess.run(cmd, check=True)  # Raises on non-zero exit

    # Upload all produced files to processed bucket
    for fname in os.listdir(output_dir):
        local_path = os.path.join(output_dir, fname)
        s3_key = f'processed/{video_id}/{resolution}/{fname}'
        s3.upload_file(local_path, PROCESSED_BUCKET, s3_key)
        print(f'Uploaded {s3_key}')

    # Emit completion event (could also publish to SNS/another SQS queue)
    completion_event = {
        'event': 'RENDITION_COMPLETE',
        'video_id': video_id,
        'resolution': resolution,
        'manifest_key': f'processed/{video_id}/{resolution}/index.m3u8'
    }
    print(f'Emitting event: {json.dumps(completion_event)}')
    # In production: sqs.send_message(QueueUrl=COMPLETION_QUEUE_URL, MessageBody=json.dumps(completion_event))
    return completion_event

def poll_queue():
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20  # Long polling reduces empty receives
        )
        messages = resp.get('Messages', [])
        if not messages:
            continue

        msg = messages[0]
        job = json.loads(msg['Body'])
        try:
            process_job(job)
            # Delete message only after successful processing
            sqs.delete_message(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=msg['ReceiptHandle']
            )
        except Exception as e:
            print(f'Job failed: {e} — message will return to queue after visibility timeout')
            # Do NOT delete message; SQS will redeliver after visibility timeout expires

if __name__ == '__main__':
    poll_queue()

This worker demonstrates several important patterns: long-polling to reduce idle API calls, deleting the SQS message only after successful completion (so failures automatically retry), and emitting a typed completion event that a coordinator can consume to track overall job progress.
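The coordinator's bookkeeping reduces to a small amount of state. A minimal in-memory sketch, with illustrative class and field names (a production coordinator would persist this state in the database and consume events from a queue or Step Functions):

```python
class RenditionCoordinator:
    """Tracks RENDITION_COMPLETE events and reports when all renditions
    for a video have finished. In-memory sketch only: real deployments
    persist completion state so a coordinator crash loses nothing."""

    def __init__(self, expected_resolutions):
        self.expected = set(expected_resolutions)
        self.completed = {}  # video_id -> set of finished resolutions

    def handle_event(self, event):
        if event.get('event') != 'RENDITION_COMPLETE':
            return None
        done = self.completed.setdefault(event['video_id'], set())
        done.add(event['resolution'])
        if done == self.expected:
            # All renditions finished: the caller would now write the
            # master manifest and flip videos.status to READY.
            return {'video_id': event['video_id'], 'status': 'READY'}
        return None
```

Note that `handle_event` is naturally idempotent: redelivered duplicate events just re-add a resolution already in the set.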

Blob Storage Design: Organizing Raw and Processed Video

Blob storage (S3, GCS, Azure Blob Storage) is the right tool for video files because it is designed for large, immutable objects with high throughput reads. You should design a clear separation between raw uploads and processed outputs, because they have different access patterns, retention policies, and cost profiles.

A sensible S3 bucket layout:

raw-video-uploads/
  raw/{user_id}/{video_id}/original.mp4      # Short-lived; delete after processing

processed-video-segments/
  processed/{video_id}/1080p/index.m3u8
  processed/{video_id}/1080p/seg_0001.ts
  processed/{video_id}/1080p/seg_0002.ts
  processed/{video_id}/720p/index.m3u8
  processed/{video_id}/720p/seg_0001.ts
  processed/{video_id}/master.m3u8           # Master manifest pointing to all resolutions
  thumbnails/{video_id}/thumb_0.jpg

The master manifest is a top-level HLS playlist that lists all available renditions and their bandwidth requirements. The player downloads this single file and then chooses which resolution playlist to follow based on current network conditions.
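Generating the master manifest from the list of completed renditions is mechanical. A minimal sketch (real master playlists also declare CODECS strings and audio renditions, omitted here for brevity):

```python
def build_master_manifest(renditions):
    """Render a minimal HLS master playlist from
    (name, 'WIDTHxHEIGHT', bandwidth_bps) tuples, assuming each
    rendition's playlist lives at <name>/index.m3u8."""
    lines = ['#EXTM3U', '#EXT-X-VERSION:3']
    for name, dimensions, bandwidth in renditions:
        lines.append(f'#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},RESOLUTION={dimensions}')
        lines.append(f'{name}/index.m3u8')
    return '\n'.join(lines) + '\n'
```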

💡 Real-World Example: Netflix runs separate S3 buckets per content type and per geographic region to reduce cross-region data transfer costs. Raw uploads land in a single region; encoded outputs are replicated to regional buckets that feed regional CDN origin servers.

For versioning, consider that re-encoding a video (for quality fixes or new format support) should not overwrite existing segments that CDNs may have cached. A common pattern is to include an encoding version in the S3 key: processed/{video_id}/v2/1080p/seg_0001.ts. The metadata database stores the active version, and the master manifest is regenerated to point to the new paths.

⚠️ Common Mistake: Storing video files in a relational database as BLOBs. Even for small files, this destroys database performance and is architecturally incorrect. Databases store references to video objects; blob storage stores the bytes.

Metadata Service and Database Schema

The metadata service is the brain of the system. It tracks everything about a video that is not the video bytes themselves: title, description, owner, upload status, processing state, available renditions, thumbnail URLs, view counts, and playback position per user.

This is where the relational versus NoSQL trade-off becomes concrete. Consider the different access patterns:

| Data | Access Pattern | Best Fit |
|---|---|---|
| 📋 Video metadata (title, owner, status) | Lookup by video_id; consistent updates | Relational (PostgreSQL) |
| 📋 User watch history | Append-heavy; per-user queries | NoSQL (DynamoDB, Cassandra) |
| 📋 Playback state (resume position) | High-frequency writes per user-video pair | Redis (cache) + async flush to DB |
| 📋 View counts | Extremely high write frequency | Redis counter + periodic batch write |
| 📋 Video search (title, tags) | Full-text, faceted queries | Elasticsearch |

For the core video record, a relational schema works well because the data is structured and you want transactional consistency when a video transitions from PROCESSING to READY:

-- Core relational schema (PostgreSQL)

CREATE TABLE users (
    user_id     UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    username    VARCHAR(64) UNIQUE NOT NULL,
    email       VARCHAR(255) UNIQUE NOT NULL,
    created_at  TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE videos (
    video_id        UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    owner_id        UUID NOT NULL REFERENCES users(user_id),
    title           VARCHAR(255) NOT NULL,
    description     TEXT,
    status          VARCHAR(32) NOT NULL DEFAULT 'UPLOADING',
    -- Status lifecycle: UPLOADING -> PROCESSING -> READY | FAILED
    raw_s3_key      TEXT,         -- Path to original upload; nullable after cleanup
    master_manifest TEXT,         -- S3 key of master.m3u8 when READY
    duration_secs   INTEGER,      -- Populated after transcoding completes
    encoding_version INTEGER DEFAULT 1,
    created_at      TIMESTAMPTZ DEFAULT now(),
    published_at    TIMESTAMPTZ   -- NULL until owner explicitly publishes
);

-- Stores one row per resolution per video
CREATE TABLE video_renditions (
    rendition_id    SERIAL PRIMARY KEY,
    video_id        UUID NOT NULL REFERENCES videos(video_id),
    resolution      VARCHAR(16) NOT NULL,  -- '1080p', '720p', etc.
    manifest_key    TEXT NOT NULL,
    bitrate_kbps    INTEGER,
    completed_at    TIMESTAMPTZ,
    UNIQUE(video_id, resolution)
);

-- Per-user playback position; high write volume, consider caching layer
CREATE TABLE playback_state (
    user_id         UUID NOT NULL REFERENCES users(user_id),
    video_id        UUID NOT NULL REFERENCES videos(video_id),
    position_secs   INTEGER NOT NULL DEFAULT 0,
    updated_at      TIMESTAMPTZ DEFAULT now(),
    PRIMARY KEY (user_id, video_id)
);

The status column on the videos table is the single source of truth for where a video is in the pipeline. Workers update it atomically using conditional updates (UPDATE videos SET status = 'READY' WHERE status = 'PROCESSING' AND video_id = $1) to prevent race conditions when multiple workers might complete simultaneously.

🧠 Mnemonic: Think of the pipeline stages as U-P-R-F: Uploading → Processing → Ready → (optionally) Failed. Every video lives in exactly one of these states at any moment.
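The lifecycle can be written down as an explicit transition table. This sketch mirrors the compare-and-set semantics of the conditional UPDATE in plain Python (names are illustrative):

```python
# Allowed transitions in the U-P-R-F lifecycle. Anything else is rejected,
# mirroring the conditional UPDATE workers run against the videos table.
VALID_TRANSITIONS = {
    'UPLOADING': {'PROCESSING', 'FAILED'},
    'PROCESSING': {'READY', 'FAILED'},
    'READY': set(),    # Terminal
    'FAILED': set(),   # Terminal
}

def try_transition(current_status, new_status):
    """Compare-and-set semantics: returns the new status on success, or the
    unchanged current status if the transition is illegal (e.g. a stale
    worker trying to move READY back to PROCESSING)."""
    if new_status in VALID_TRANSITIONS.get(current_status, set()):
        return new_status
    return current_status
```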

For playback state specifically, writing each position update directly to PostgreSQL (one write every few seconds per active viewer) would generate enormous write load. The standard pattern is to write the current position to Redis (O(1), in-memory) and flush to the database asynchronously every 30–60 seconds or when the user pauses or leaves. This is a classic write-behind cache pattern.
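A minimal sketch of the write-behind pattern, with the Redis layer modeled as an in-memory dict and the flush triggered inline (in production the flush would run on a timer or on pause/exit events):

```python
import time

class WriteBehindPlaybackCache:
    """Write-behind sketch: positions land in an in-memory map instantly
    and are flushed to the database in batches. Redis plus a background
    flusher would play these two roles in a real system."""

    def __init__(self, db_writer, flush_interval_secs=30):
        self.db_writer = db_writer       # callable taking {(user, video): pos}
        self.flush_interval = flush_interval_secs
        self.dirty = {}                  # (user_id, video_id) -> position
        self.last_flush = time.monotonic()

    def record_position(self, user_id, video_id, position_secs):
        # O(1) in-memory write on every heartbeat from the player
        self.dirty[(user_id, video_id)] = position_secs
        if time.monotonic() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        # One batched DB write covering many player heartbeats
        if self.dirty:
            self.db_writer(dict(self.dirty))
            self.dirty.clear()
        self.last_flush = time.monotonic()
```

The key property: repeated updates to the same (user, video) pair collapse into a single row in the next batch, which is exactly where the write-load savings come from. The trade-off is durability: positions recorded since the last flush are lost if the cache node dies.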

Putting It Together: The Full Upload-to-Ready Flow

With all components defined, the complete flow from a user pressing "Upload" to the video appearing as watchable looks like this:

1. Client requests upload session  --> API Server issues presigned URLs + stores session in Redis
2. Client uploads chunks directly  --> S3 (raw-video-uploads bucket)
3. Client confirms completion      --> API Server calls CompleteMultipartUpload on S3
4. API Server creates DB record    --> videos table (status=PROCESSING) + enqueues jobs
5. Workers pick up jobs            --> Pull from SQS, run FFmpeg, write segments to S3
6. Each rendition completes        --> Worker emits RENDITION_COMPLETE event
7. Coordinator tracks completions  --> When all renditions done, writes master manifest
8. Coordinator updates DB          --> videos.status = READY, master_manifest = <key>
9. Notification sent               --> Push notification / webhook to uploader
10. Playback request               --> API returns master manifest URL; CDN serves segments

💡 Pro Tip: In an interview, walking through this numbered sequence shows your interviewer that you understand causality in distributed systems — that each step has a trigger, a side effect, and a failure mode. For bonus points, mention what happens if step 5 fails: the SQS visibility timeout expires, another worker picks up the job, and idempotent segment naming means no duplication occurs.

🎯 Key Principle: The pipeline is designed around async, event-driven handoffs at every boundary. No service waits synchronously for another. This is what makes the system scalable — you can add more workers to the transcoding pool without touching any other component.

📋 Quick Reference Card:

| 🔧 Component | 📚 Technology | 🎯 Responsibility |
|---|---|---|
| 🔧 Upload coordinator | Flask/Express + Redis | Manage multipart sessions, issue presigned URLs |
| 📦 Raw storage | S3 / GCS | Hold original uploads temporarily |
| 📬 Job queue | SQS / RabbitMQ / Kafka | Distribute transcoding work to workers |
| ⚙️ Transcoding workers | EC2 / ECS + FFmpeg | Encode video into HLS/DASH renditions |
| 📦 Processed storage | S3 / GCS | Store segments and manifests permanently |
| 🗃️ Metadata DB | PostgreSQL | Video records, status, rendition info |
| ⚡ Playback state cache | Redis | Resume positions, view counts |
| 🔍 Search index | Elasticsearch | Title/tag-based video discovery |

The architecture in this section is the backbone on which every other part of the system sits. CDN delivery (covered next) can only work once processed segments exist in blob storage with proper manifests. Adaptive bitrate streaming can only function because the transcoding pipeline produced multiple quality tiers. Getting this core pipeline right — and being able to explain why each component exists and what failure looks like at each step — is what distinguishes a strong streaming system design candidate from the rest.

Global Delivery at Scale: CDN, Adaptive Bitrate, and Caching

Once a video has been uploaded, transcoded into multiple quality levels, and stored in distributed object storage, the hardest problem is not over — it is just beginning. Getting that video to hundreds of millions of simultaneous viewers across six continents, on devices ranging from a 4K smart TV to a 3G feature phone in rural Indonesia, with startup delay measured in milliseconds and rebuffering kept near zero, is one of the most sophisticated engineering challenges in modern computing. This section breaks down exactly how platforms like YouTube and Netflix solve it, and how you should explain it in an interview.

How a Content Delivery Network Works

The naive approach to video delivery would be to put all video files on a cluster of servers in one data center and let users download from there. This fails immediately for two reasons: latency (a user in São Paulo fetching from a Virginia data center adds 100–150 ms of round-trip time before a single byte arrives) and bandwidth cost (one origin cluster cannot handle terabits per second of egress). The solution is a Content Delivery Network (CDN).

A CDN is a geographically distributed network of servers, called edge nodes or Points of Presence (PoPs), strategically placed in data centers around the world. When a user in Tokyo requests a video segment, the CDN routes that request to a Tokyo PoP rather than to the origin server in the US. The edge node either serves the content from its local cache or fetches it from the origin once and caches it for all subsequent requests.

User Request Flow through CDN

[User in Tokyo]
       |
       | DNS lookup: video.cdn.example.com
       |
  [CDN DNS / Anycast]
       |
       | Routes to geographically nearest PoP
       |
  [Tokyo Edge PoP]
    /        \
  HIT        MISS
   |            |
[Cached]   [Fetch from Regional Cache (Singapore)]
   |            |
[Serve]    HIT  |  MISS
                |       |
           [Cached]  [Fetch from Origin (US)]
                |       |
           [Serve]  [Cache + Serve]

The routing mechanism typically uses Anycast DNS, where the same IP address is advertised from multiple locations worldwide. BGP (Border Gateway Protocol) routing naturally directs network traffic to the topologically closest server announcing that IP. More sophisticated CDNs also factor in real-time metrics like server load and latency measurements when making routing decisions.

💡 Real-World Example: Netflix operates its own CDN called Open Connect, with appliances physically installed inside ISP data centers. When a Comcast subscriber in Chicago streams Stranger Things, the data often never leaves Comcast's own network — Netflix pre-populates the ISP's Open Connect appliance with popular content during off-peak hours. This is why Netflix can claim sub-10ms delivery times to many users.

🎯 Key Principle: A CDN's primary job is to push content as close to the end user as possible, reducing both latency and the load on your origin infrastructure.

Adaptive Bitrate Streaming

Even with a globally distributed CDN, you face a second problem: network conditions vary wildly. A user on a home fiber connection can receive 50 Mbps, while the same user steps onto a crowded subway train and drops to 500 Kbps. If you lock a video to one quality level, you either waste bandwidth on users who could watch 4K, or you force constant buffering on users with limited bandwidth.

The solution is Adaptive Bitrate Streaming (ABR), a technique where the video is encoded at multiple quality levels (bitrates) and the client dynamically selects which quality segment to download next based on current network conditions.

How ABR Works: Manifests and Segments

ABR relies on two key primitives. First, video is divided into short segments — typically 2 to 10 seconds each. Second, a manifest file (also called a playlist) describes all available quality levels and the URLs to their segments.

The two dominant ABR protocols are HLS (HTTP Live Streaming), developed by Apple, and MPEG-DASH (Dynamic Adaptive Streaming over HTTP), an open standard. Both use the same conceptual model: a manifest points to quality-specific sub-playlists, which list individual segment URLs.

Here is a simplified HLS master manifest illustrating how quality levels are declared:

## Simplified HLS Master Manifest (master.m3u8)
## This file is the entry point the client fetches first.
## It lists all available renditions (quality levels).

#EXTM3U
#EXT-X-VERSION:3

## Low quality: 360p at 400 Kbps — for slow mobile connections
#EXT-X-STREAM-INF:BANDWIDTH=400000,RESOLUTION=640x360,CODECS="avc1.42c01e,mp4a.40.2"
https://cdn.example.com/video/abc123/360p/playlist.m3u8

## Medium quality: 720p at 1.5 Mbps — for average broadband
#EXT-X-STREAM-INF:BANDWIDTH=1500000,RESOLUTION=1280x720,CODECS="avc1.4d401f,mp4a.40.2"
https://cdn.example.com/video/abc123/720p/playlist.m3u8

## High quality: 1080p at 4 Mbps — for good broadband
#EXT-X-STREAM-INF:BANDWIDTH=4000000,RESOLUTION=1920x1080,CODECS="avc1.640028,mp4a.40.2"
https://cdn.example.com/video/abc123/1080p/playlist.m3u8

## Ultra quality: 4K at 15 Mbps — for fiber/fast WiFi
#EXT-X-STREAM-INF:BANDWIDTH=15000000,RESOLUTION=3840x2160,CODECS="hev1.1.6.L150.90,mp4a.40.2"
https://cdn.example.com/video/abc123/4k/playlist.m3u8

Each quality-specific playlist then lists the individual segment files:

## Quality-specific playlist (720p/playlist.m3u8)
## Each segment is ~6 seconds of video at 720p

#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:6
#EXT-X-MEDIA-SEQUENCE:0

#EXTINF:6.0,
seg_000.ts
#EXTINF:6.0,
seg_001.ts
#EXTINF:6.0,
seg_002.ts
## ... continues for entire video duration
#EXT-X-ENDLIST

The client downloads the master manifest once, then uses an ABR algorithm to decide which quality playlist to use and which segment to fetch next.
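Extracting those quality options from the master manifest is straightforward. Below is a deliberately naive sketch that handles only the attributes shown above; real players use a full HLS parser (the comma split would mishandle quoted CODECS lists, though BANDWIDTH still parses because it appears first):

```python
def parse_master_manifest(text):
    """Extract (bandwidth_bps, url) pairs from an HLS master playlist.
    Minimal sketch: assumes each #EXT-X-STREAM-INF line is immediately
    followed by its variant playlist URL."""
    variants = []
    pending_bandwidth = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith('#EXT-X-STREAM-INF:'):
            # Naive attribute split; good enough to pull out BANDWIDTH
            for attr in line.split(':', 1)[1].split(','):
                if attr.startswith('BANDWIDTH='):
                    pending_bandwidth = int(attr.split('=')[1])
        elif line and not line.startswith('#') and pending_bandwidth is not None:
            variants.append((pending_bandwidth, line))
            pending_bandwidth = None
    return variants
```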

The Client-Side ABR Algorithm

The intelligence of ABR lives in the client. Below is a simplified Python representation of a throughput-based ABR algorithm — the kind you might sketch out conceptually in an interview:

import time
from collections import deque

## Quality levels available, ordered lowest to highest
QUALITY_LEVELS = [
    {"name": "360p",  "bitrate": 400_000},    # 400 Kbps
    {"name": "720p",  "bitrate": 1_500_000},  # 1.5 Mbps
    {"name": "1080p", "bitrate": 4_000_000},  # 4 Mbps
    {"name": "4k",    "bitrate": 15_000_000}, # 15 Mbps
]

SEGMENT_DURATION_SECONDS = 6
SAFETY_FACTOR = 0.8  # Only use 80% of measured bandwidth to avoid rebuffering
BUFFER_TARGET_SECONDS = 30  # We want 30s of video buffered ahead

class ABRController:
    def __init__(self):
        # Rolling window of recent throughput measurements (in bps)
        self.throughput_samples = deque(maxlen=5)
        self.current_quality_index = 0
        self.buffer_level_seconds = 0

    def record_segment_download(self, bytes_downloaded, download_time_seconds):
        """Call this after each segment download completes."""
        measured_throughput = (bytes_downloaded * 8) / download_time_seconds
        self.throughput_samples.append(measured_throughput)
        # Update buffer: one segment worth of video added, time passed subtracted
        self.buffer_level_seconds += SEGMENT_DURATION_SECONDS - download_time_seconds

    def estimate_bandwidth(self):
        """Use the harmonic mean of recent samples — more conservative than average."""
        if not self.throughput_samples:
            return QUALITY_LEVELS[0]["bitrate"]  # Start conservative
        n = len(self.throughput_samples)
        harmonic_mean = n / sum(1 / s for s in self.throughput_samples)
        return harmonic_mean * SAFETY_FACTOR

    def select_next_quality(self):
        """
        Core ABR decision: pick the highest quality whose bitrate
        fits within our estimated available bandwidth.
        """
        available_bandwidth = self.estimate_bandwidth()

        # If buffer is critically low, drop to lowest quality immediately
        if self.buffer_level_seconds < 5:
            print("⚠️  Buffer critical — forcing lowest quality")
            self.current_quality_index = 0
            return QUALITY_LEVELS[0]

        # Find the highest quality that fits in available bandwidth
        selected = QUALITY_LEVELS[0]  # Default to lowest
        for level in QUALITY_LEVELS:
            if level["bitrate"] <= available_bandwidth:
                selected = level

        # Avoid thrashing: only step up one level at a time
        target_index = QUALITY_LEVELS.index(selected)
        if target_index > self.current_quality_index:
            target_index = self.current_quality_index + 1  # Step up gradually
        # But allow immediate step down to avoid rebuffering
        self.current_quality_index = target_index

        print(f"Bandwidth: {available_bandwidth/1_000_000:.2f} Mbps → "
              f"Selected: {QUALITY_LEVELS[self.current_quality_index]['name']}")
        return QUALITY_LEVELS[self.current_quality_index]


## --- Simulation ---
controller = ABRController()

## Simulate varying network conditions across segment downloads
download_events = [
    (450_000, 6.0),   # Slow start: 600 Kbps measured, 6s segment took 6s
    (1_350_000, 5.4), # Improving: ~2 Mbps
    (3_600_000, 5.4), # Good connection: ~5.3 Mbps
    (1_800_000, 5.7), # Dropped: ~2.5 Mbps (subway tunnel)
    (450_000, 8.0),   # Very bad: heavy congestion, took longer than segment duration!
]

for bytes_dl, time_taken in download_events:
    controller.record_segment_download(bytes_dl, time_taken)
    next_quality = controller.select_next_quality()
    print(f"  → Buffer level: {controller.buffer_level_seconds:.1f}s\n")

Two design decisions in this code deserve attention. First, the algorithm uses the harmonic mean of recent throughput samples rather than the arithmetic mean — this is more conservative and appropriate for bandwidth estimation because it gives more weight to low values. Second, the algorithm steps up quality gradually (one level at a time) but allows immediate downgrade when the buffer is endangered. This asymmetry is deliberate: unnecessary quality drops are annoying to users, but rebuffering is catastrophic.
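A quick worked example shows why the harmonic mean is the conservative choice when one sample is a transient burst:

```python
# Two throughput samples: a brief 8 Mbps burst and a 1 Mbps slump.
samples = [8_000_000, 1_000_000]

arithmetic = sum(samples) / len(samples)               # 4.5 Mbps: optimistic
harmonic = len(samples) / sum(1 / s for s in samples)  # ~1.78 Mbps: cautious
```

Against the quality ladder above, the arithmetic estimate would greenlight 1080p (4 Mbps) while the harmonic estimate caps the client at 720p (1.5 Mbps), which is far safer if the 1 Mbps reading reflects the actual sustained conditions.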

💡 Pro Tip: In an interview, mentioning that real ABR algorithms (like Netflix's BOLA or the throughput-based algorithm in the DASH.js reference player) must balance three competing goals — maximize quality, minimize rebuffering, and minimize quality switches — immediately signals deep understanding. These goals are often in direct conflict.

Cache Hierarchy Design

A CDN is not a single layer — it is a hierarchy of caches, each with different sizes, costs, and miss penalties. Designing this hierarchy correctly is central to achieving both low latency and cost efficiency at scale.

Cache Hierarchy (Three-Tier Model)

┌─────────────────────────────────────────────────────┐
│                    END USERS                        │
└──────────────────────┬──────────────────────────────┘
                       │ Request
                       ▼
┌─────────────────────────────────────────────────────┐
│              TIER 1: EDGE CACHE (PoP)               │
│  • Hundreds of locations worldwide                  │
│  • Small storage (1–100 TB per node)                │
│  • Stores only the most popular content             │
│  • TTL: minutes to hours for hot content            │
└──────────────────────┬──────────────────────────────┘
                       │ Cache MISS
                       ▼
┌─────────────────────────────────────────────────────┐
│           TIER 2: REGIONAL CACHE                    │
│  • ~20–50 regional hubs globally                   │
│  • Larger storage (1–10 PB per hub)                 │
│  • Stores popular + moderately popular content      │
│  • TTL: hours to days                              │
└──────────────────────┬──────────────────────────────┘
                       │ Cache MISS
                       ▼
┌─────────────────────────────────────────────────────┐
│              TIER 3: ORIGIN STORAGE                 │
│  • 1–3 locations (e.g., S3, GCS)                   │
│  • Unlimited storage (petabytes to exabytes)        │
│  • Stores ALL content permanently                   │
│  • TTL: indefinite (source of truth)               │
└─────────────────────────────────────────────────────┘

Cache TTL (Time-To-Live) controls how long a segment stays cached before the node considers it stale and re-validates with the upstream. For video segments, the right TTL strategy depends on content type:

| Content Type | Example | Recommended TTL |
|---|---|---|
| Completed video segments | seg_042.ts | Very long (days–years). Segments are immutable once created. |
| Live stream segments | live_seg_current.ts | Very short (2–6 seconds). Content changes constantly. |
| Manifests for VOD | master.m3u8 | Medium (minutes). Rarely changes but may be updated. |
| Manifests for live | live.m3u8 | Very short (same as segment duration). Updates every segment. |

🎯 Key Principle: Video segments are immutable once encoded and stored. A segment file named seg_042.ts will never change. This makes them ideal candidates for long-lived caching and even cache-busting by filename rather than by TTL expiry.
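This immutability is typically expressed as Cache-Control response headers chosen at upload time. A sketch of one plausible policy (the specific max-age values are illustrative, not prescriptive):

```python
def cache_headers(key):
    """Pick HTTP caching metadata by object type: immutable .ts segments
    are cached effectively forever, manifests get short TTLs."""
    if key.endswith('.ts'):
        # One year + 'immutable': edge nodes never need to revalidate
        return {'CacheControl': 'public, max-age=31536000, immutable',
                'ContentType': 'video/mp2t'}
    if key.endswith('.m3u8'):
        # Manifests may be regenerated (new renditions, re-encodes)
        return {'CacheControl': 'public, max-age=60',
                'ContentType': 'application/vnd.apple.mpegurl'}
    return {'CacheControl': 'public, max-age=3600'}
```

In an S3-backed origin, this dict would be passed as `ExtraArgs` to the upload call so the CDN honors the headers on every fetch.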

Cache Invalidation

Cache invalidation — purging stale content from CDN edge nodes — is notoriously difficult: Phil Karlton famously quipped that the two hard things in computer science are cache invalidation and naming things. For video platforms, common invalidation scenarios include:

  • 🔧 A video was incorrectly encoded and must be replaced with a corrected version
  • 🔧 A piece of content must be removed immediately (copyright claim, legal takedown)
  • 🔧 A manifest was updated to add a new quality tier

Most CDNs provide an invalidation API that sends purge signals to all edge nodes. However, propagation is not instantaneous — it can take seconds to minutes to reach every node globally. For segments (which are immutable by design), invalidation is rare. For manifests, a common pattern is to use versioned URLs (master_v2.m3u8) so a new manifest is treated as entirely new content rather than requiring invalidation of the old one.

⚠️ Common Mistake: Setting video segment TTLs too short "to be safe." If a 6-second segment has a 60-second TTL, every edge node in the world is constantly re-fetching content that will never change, generating massive unnecessary load on regional caches and the origin. Always set segment TTLs to the maximum value you can — days or even weeks.

The Hotspot Problem: Viral Videos

Even a perfectly designed CDN hierarchy faces one extreme scenario: the hotspot problem (sometimes called the thundering herd). Imagine a major live sports event ends and millions of viewers immediately search for the highlight clip. That single video goes from zero requests per second to millions of requests per second in the span of seconds, before any CDN edge node has had the chance to cache it.

In this scenario, every edge node receives a cache miss simultaneously and fires a request upstream to the regional cache. If the regional cache also misses (because the content is brand new), every regional cache fires a request to the origin simultaneously. This cache miss stampede can overwhelm the origin server and cascade into a complete outage.

Hotspot / Thundering Herd Scenario

  1M users request video at t=0
        |
  ┌─────┴──────┐
  │  Edge PoPs │  ← All 500+ edge nodes: CACHE MISS simultaneously
  └─────┬──────┘
        │ 500 parallel requests
  ┌─────┴──────────┐
  │ Regional Cache │  ← Also MISS (new content)
  └─────┬──────────┘
        │ 20 parallel requests
  ┌─────┴──────┐
  │   ORIGIN   │  ← 💥 Overwhelmed by request storm
  └────────────┘

Several techniques mitigate the hotspot problem:

1. Request Coalescing (also called Request Collapsing) When multiple requests arrive at an edge node for the same uncached content simultaneously, the CDN software holds all requests in a queue and fires only a single upstream request. When the response arrives, it is fanned out to all waiting requesters and simultaneously stored in the edge cache. Netflix calls this "request collapsing" and it is a fundamental feature of their Open Connect appliance software.

2. Origin Shield An origin shield is a designated regional cache node that acts as the sole gateway to the origin. Instead of all 20 regional nodes potentially hitting the origin, all regional nodes that experience a miss first check the origin shield. The origin shield then makes at most one request to the actual origin. This collapses what could be 20 parallel origin requests into one.

3. Proactive Pre-warming For predictable hotspots (a scheduled product launch, a series finale, a sports event), platform engineers can pre-warm the cache by pushing the content to edge nodes before the event begins, rather than waiting for organic requests to populate the cache. Netflix pre-positions content onto ISP-embedded Open Connect appliances days in advance based on their recommendation engine predicting what millions of subscribers will want to watch.

4. Staggered Cache Expiry (Cache Jitter) If every edge node caches a segment with the same TTL, they will all expire simultaneously, causing another stampede at expiry time. Adding a small random jitter to TTL values (e.g., TTL = base_ttl + random(0, 300 seconds)) spreads expirations over time and eliminates the synchronized expiry thundering herd.
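Request coalescing (technique 1) can be sketched with a lock and an in-flight table: the first thread to miss becomes the designated fetcher, and every concurrent requester for the same key waits on its result instead of hitting the upstream. This is an illustrative in-process model, not CDN software:

```python
import threading

class CoalescingCache:
    """One upstream fetch per key, no matter how many concurrent misses."""

    def __init__(self, fetch_fn):
        self.fetch_fn = fetch_fn
        self.cache = {}
        self.inflight = {}   # key -> Event signalling fetch completion
        self.lock = threading.Lock()
        self.origin_fetches = 0

    def get(self, key):
        with self.lock:
            if key in self.cache:
                return self.cache[key]          # HIT: serve locally
            if key in self.inflight:
                event, is_fetcher = self.inflight[key], False
            else:
                event, is_fetcher = threading.Event(), True
                self.inflight[key] = event      # This thread will fetch
        if not is_fetcher:
            event.wait()                        # Coalesce: wait for fetcher
            return self.cache[key]
        value = self.fetch_fn(key)              # Single upstream request
        with self.lock:
            self.cache[key] = value
            self.origin_fetches += 1
            del self.inflight[key]
        event.set()                             # Fan out to all waiters
        return value
```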

💡 Mental Model: Think of an origin shield the way you think of a receptionist. Without one, every visitor to a building walks directly to the executive's office. With one, all visitors talk to the receptionist, who consolidates duplicate requests before disturbing the executive.

🤔 Did you know? During the COVID-19 pandemic lockdowns in March 2020, Netflix, YouTube, and other streaming services voluntarily reduced video bitrates in Europe by 25% at the request of regulators, to prevent internet infrastructure from collapsing under unprecedented load. This is one of the most dramatic real-world demonstrations of how much bandwidth video streaming consumes at scale.

Putting It All Together: The Request Journey

To cement these concepts, trace a single video segment request from click to playback:

  1. User presses play on a video. The video player (whether a browser, mobile app, or smart TV app) makes an HTTPS request for the master manifest URL.
  2. DNS resolution routes the manifest request to the nearest CDN PoP via Anycast DNS.
  3. Edge node checks its cache. If the manifest is cached (HIT), it responds immediately. If not (MISS), it fetches from the regional cache or origin and caches the result.
  4. Player parses the manifest and identifies all available quality levels. The ABR algorithm picks a starting quality (usually conservative, like 480p) to minimize initial buffering.
  5. Player requests the first segment at the chosen quality. The URL points to the CDN. The edge node serves from cache if possible.
  6. Player measures download time for the first segment, updates its bandwidth estimate, and selects the quality for the next segment using the ABR algorithm.
  7. This continues for every subsequent segment, with quality automatically adjusting up or down based on real-time conditions — all without the user doing anything.

📋 Quick Reference Card: CDN Concepts Summary

| 🔧 Concept | 📚 What It Solves | 🎯 Key Design Decision |
|---|---|---|
| 🌍 CDN Edge PoP | Latency from geographic distance | Anycast DNS routing to nearest node |
| 📄 ABR Manifest | One quality doesn't fit all connections | Client-side algorithm selects quality per segment |
| 🗃️ Cache TTL | Balance freshness vs. origin load | Long TTL for immutable segments; short for live manifests |
| 🛡️ Origin Shield | Thundering herd on origin | Single gateway node per region to origin |
| 🔄 Request Coalescing | Multiple misses for same content | Hold and fan-out: one upstream request per unique content |
| ⏱️ Cache Jitter | Synchronized expiry stampedes | Randomize TTL within a range to spread expirations |
| 🔥 Cache Pre-warming | Hotspots for predictable events | Proactively push content before demand peaks |

⚠️ Common Mistake: When designing this system in an interview, candidates often describe CDN as a simple "cache in front of the origin." This undersells the architecture significantly. Be specific: name the three-tier hierarchy (edge, regional, origin shield), mention request coalescing, and explain ABR with manifests and segments. The depth of your vocabulary signals experience.

🧠 Mnemonic: To remember the three tiers and their flow: ERO — Edge, Regional, Origin. Cache misses flow ERO (from near to far). Content flows back in the opposite direction, O-R-E, like ore being mined from the source and refined closer to the surface.

The combination of CDN edge delivery, adaptive bitrate streaming, and a well-designed cache hierarchy is what makes it physically possible for 2 billion YouTube users to watch video simultaneously without the system collapsing. Each piece addresses a specific failure mode: CDN solves geographic latency and origin bandwidth, ABR solves heterogeneous network conditions, and the cache hierarchy with request coalescing and origin shields solves the thundering herd. In the next section, we will tie these delivery mechanisms together with the upload and processing pipeline you learned about earlier, walking through a complete end-to-end interview design session.

Practical System Design Walkthrough: End-to-End Example

A system design interview is not a quiz — it is a conversation. The interviewer wants to watch you think, prioritize, and make defensible decisions under uncertainty. This section treats the YouTube-like system as a live design session, walking through every stage an experienced candidate would cover, from writing requirements on a whiteboard to justifying why a particular database engine beats its competitors for a specific access pattern. Read it as if you are in the room.

Step 1 — Nail the Requirements Before Drawing Anything

The single most costly mistake in a design interview is reaching for the marker too early. Before any box is drawn, spend four to six minutes clarifying functional requirements (what the system does) and non-functional requirements (how well it does it).

Functional Requirements

For a YouTube-like product the core features are:

  • 🎯 Users can upload videos of arbitrary length and format.
  • 🎯 The platform transcodes uploaded videos into multiple resolutions (360p, 720p, 1080p, 4K) and formats (H.264/AAC in MP4, VP9 in WebM).
  • 🎯 Users can stream on-demand videos with adaptive bitrate playback.
  • 🎯 A recommendation feed surfaces personalized content on the home screen.
  • 🎯 Basic metadata operations: titles, descriptions, likes, view counts.

Scope out live streaming for now; we will revisit it as a divergence point later.

Non-Functional Requirements
Requirement Target
🔒 Upload throughput 500 hours of video uploaded per minute (YouTube's real figure)
⚡ Playback start latency < 2 seconds on a healthy connection
📡 Availability 99.99% (≈ 52 minutes downtime/year)
🌍 Global reach Serve users on every continent with < 50 ms edge latency
🔄 Consistency View counts and likes can be eventually consistent; video availability must be strongly consistent after upload confirmation

💡 Pro Tip: Distinguishing which data needs strong consistency versus eventual consistency is an immediate signal to interviewers that you understand CAP theorem trade-offs in practice.

Step 2 — Capacity Estimation from First Principles

Capacity estimation is a back-of-the-envelope exercise that turns requirements into concrete numbers, which then drive architectural choices. Do it out loud so the interviewer can follow your reasoning.

Storage Estimation

Start with uploads:

  • 500 hours of raw video per minute = 30,000 hours per hour = 720,000 hours per day.
  • Assume average raw bitrate of 10 Mbps (a reasonable 1080p source file).
  • Raw storage per day: 720,000 hrs × 3,600 s/hr × 10 Mbps / 8 = 720,000 × 3,600 × 1.25 MB = ~3.24 PB/day.

After transcoding into five quality levels, each averaging 40% of the original file size, total stored video grows to roughly 2× the raw footprint (5 × 0.4 = 2):

  • Encoded storage per day ≈ 6.5 PB.
  • Over five years: 6.5 PB × 365 × 5 ≈ 11.9 EB.

This immediately tells you that object storage (Amazon S3, Google Cloud Storage) is the only viable tier — no traditional filesystem survives at exabyte scale.
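
The arithmetic above is easy to sanity-check in a few lines (variable names are ours; the figures match the estimate):

```python
# Back-of-the-envelope storage estimate for the upload pipeline.
hours_per_day = 500 * 60 * 24                  # 500 hrs/min -> 720,000 hrs/day
raw_mb_per_sec = 10 / 8                        # 10 Mbps source -> 1.25 MB/s
raw_pb_per_day = hours_per_day * 3600 * raw_mb_per_sec / 1e9   # MB -> PB
encoded_pb_per_day = raw_pb_per_day * 2        # five renditions at ~40% each
five_year_eb = encoded_pb_per_day * 365 * 5 / 1000             # PB -> EB

print(round(raw_pb_per_day, 2))    # 3.24 (PB/day)
print(round(five_year_eb, 1))      # 11.8 (EB over five years)
```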

Bandwidth Estimation

For reads (playback), YouTube serves roughly 1 billion hours watched per day:

  • Average stream bitrate for a mixed-resolution audience: ~2 Mbps.
  • Average bandwidth: 1B hrs/day × 3,600 s/hr × 2 Mbps ÷ 86,400 s/day ≈ 83 Tbps sustained, with peaks 2–3× higher.
  • No single origin cluster can serve 83 Tbps. This single number justifies the entire CDN layer before you have drawn a single box.

Server Count Estimation

For transcoding workers:

  • A single CPU-bound worker transcodes approximately 1 hour of video per hour of wall-clock time at 1080p (a rough but safe assumption).
  • We need to process 30,000 hours per hour → 30,000 workers at minimum.
  • With GPU-accelerated NVENC encoding, throughput increases ~10×, dropping the fleet to roughly 3,000 GPU instances.
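
Both the egress and fleet figures fall out of the same arithmetic (a quick sanity check with our own variable names, not production sizing):

```python
# Sustained egress: 1B watch-hours/day at ~2 Mbps average bitrate.
watch_seconds_per_day = 1e9 * 3600
avg_tbps = watch_seconds_per_day * 2 / 86_400 / 1e6   # Mbit/s -> Tbit/s
print(round(avg_tbps, 1))          # 83.3 (Tbps sustained)

# Transcoding fleet: 500 hrs uploaded per minute, 1x realtime per CPU worker.
upload_hours_per_hour = 500 * 60   # 30,000 hours of video arriving per hour
cpu_workers = upload_hours_per_hour            # 30,000 CPU workers
gpu_workers = cpu_workers // 10                # ~10x NVENC speedup -> 3,000
```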

🤔 Did you know? YouTube uses a mix of commodity CPU workers for long-tail content and dedicated GPU pipelines for trending videos that need fast availability, optimizing cost without sacrificing time-to-availability on high-demand uploads.

Step 3 — Drawing the Full Architecture

Now you earn the marker. A well-structured diagram has three horizontal bands: client tier, platform tier, and data tier. Walk through each component as you draw it.

┌─────────────────────────────────────────────────────────────────────┐
│  CLIENT TIER                                                        │
│  ┌────────────┐   ┌────────────┐   ┌──────────────────────────┐   │
│  │ Web App    │   │ Mobile App │   │ Smart TV / Embed Player  │   │
│  └─────┬──────┘   └─────┬──────┘   └────────────┬─────────────┘   │
└────────┼────────────────┼───────────────────────┼─────────────────┘
         │                │                       │
         └────────────────▼───────────────────────┘
                          │ HTTPS
┌─────────────────────────▼───────────────────────────────────────────┐
│  PLATFORM TIER                                                      │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │  API GATEWAY  (auth, rate limiting, routing)                  │  │
│  └───┬─────────────────┬───────────────────────┬───────────────┘  │
│      │                 │                       │                   │
│  ┌───▼──────┐   ┌──────▼──────┐   ┌───────────▼──────────────┐   │
│  │ Upload   │   │  Playback   │   │  Recommendation Service  │   │
│  │ Service  │   │  Service    │   │  (ML ranking, feed gen)  │   │
│  └───┬──────┘   └──────┬──────┘   └──────────────────────────┘   │
│      │                 │                                           │
│  ┌───▼──────┐   ┌──────▼──────┐                                   │
│  │ Message  │   │  CDN Origin │──────────────► CDN Edge Nodes     │
│  │ Queue    │   │  (S3/GCS)   │               (Akamai / CloudFront│
│  └───┬──────┘   └─────────────┘                /Fastly)           │
│      │                                                             │
│  ┌───▼──────────────────────────────┐                             │
│  │  Transcoding Worker Pool         │                             │
│  │  (CPU + GPU auto-scaled fleet)   │                             │
│  └───┬──────────────────────────────┘                             │
│      │ writes encoded segments                                     │
└──────┼─────────────────────────────────────────────────────────────┘
       │
┌──────▼─────────────────────────────────────────────────────────────┐
│  DATA TIER                                                         │
│  ┌───────────────┐  ┌──────────────┐  ┌────────────────────────┐  │
│  │ Metadata DB   │  │  Object      │  │  Search Index          │  │
│  │ (PostgreSQL + │  │  Storage     │  │  (Elasticsearch)       │  │
│  │  read replica)│  │  (S3 / GCS)  │  └────────────────────────┘  │
│  └───────────────┘  └──────────────┘                              │
│  ┌───────────────┐  ┌──────────────┐                              │
│  │ Redis Cache   │  │  Analytics   │                              │
│  │ (view counts, │  │  Warehouse   │                              │
│  │  session data)│  │  (BigQuery / │                              │
│  └───────────────┘  │  Redshift)   │                              │
│                     └──────────────┘                              │
└────────────────────────────────────────────────────────────────────┘

Talk through each component as you draw:

  • API Gateway centralizes auth (JWT validation), rate limiting, and protocol translation. It is the single ingress point, which simplifies client logic and isolates backend services.
  • Upload Service accepts the raw file in chunks (multipart upload), writes to a staging bucket, and publishes a job to the Message Queue (Kafka or SQS). It immediately returns a job ID to the client — the upload is accepted, not yet processed.
  • Transcoding Worker Pool consumes jobs from the queue, pulls raw video from staging, produces encoded segments, and writes them to the permanent object store. Workers are stateless and auto-scaled based on queue depth.
  • Metadata DB (PostgreSQL with read replicas) stores video titles, owners, status, segment manifests (HLS .m3u8 / DASH .mpd locations), and view counts.
  • Redis Cache sits in front of the metadata DB to absorb the massive read load on hot videos — view counts, thumbnails, and manifest pointers are perfect cache candidates.
  • Recommendation Service is intentionally separate. It reads from the analytics warehouse asynchronously and writes ranked feed lists back to Redis. Its latency tolerance is high (seconds, not milliseconds), and decoupling it prevents an ML pipeline hiccup from taking down playback.
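
The "auto-scaled based on queue depth" policy for the worker pool can be made concrete. A simple target-drain-time controller, with names and thresholds that are purely illustrative, might look like:

```python
import math

def desired_worker_count(queue_depth: int,
                         avg_job_hours: float,
                         target_drain_hours: float = 0.5,
                         min_workers: int = 100,
                         max_workers: int = 30_000) -> int:
    """Size the transcoding fleet so the current backlog drains within
    the target window, assuming 1x-realtime workers, clamped to limits."""
    backlog_hours = queue_depth * avg_job_hours
    needed = math.ceil(backlog_hours / target_drain_hours)
    return max(min_workers, min(needed, max_workers))
```

At 6,000 queued 10-minute jobs (1,000 backlog-hours), this asks for 2,000 workers to drain the queue in 30 minutes. The clamp matters: without `max_workers`, a viral upload spike could scale the fleet past your budget or quota.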

Step 4 — API Contract Design

Defining explicit API contracts during an interview demonstrates backend maturity. Below are the three most important endpoints for the upload and playback flows.

Video Upload Initiation
## POST /v1/videos/upload/initiate
## Called before any bytes are transferred. Returns a pre-signed upload URL
## and a jobId for status polling.

from fastapi import FastAPI, Depends, HTTPException
from pydantic import BaseModel
import uuid, boto3

# Assumes `get_current_user` (auth dependency) and `db` (async DB pool)
# are wired up elsewhere in the service.
app = FastAPI()

class UploadInitiateRequest(BaseModel):
    filename: str          # e.g., "my_vacation.mp4"
    file_size_bytes: int   # used to validate against max upload size
    content_type: str      # e.g., "video/mp4"

class UploadInitiateResponse(BaseModel):
    job_id: str            # UUID the client uses for status polling
    upload_url: str        # pre-signed S3 URL; client PUTs directly here
    expires_at: str        # ISO-8601 timestamp; URL expires after 1 hour

@app.post("/v1/videos/upload/initiate", response_model=UploadInitiateResponse)
async def initiate_upload(req: UploadInitiateRequest, user_id: str = Depends(get_current_user)):
    MAX_BYTES = 50 * 1024 ** 3  # 50 GB hard limit
    if req.file_size_bytes > MAX_BYTES:
        raise HTTPException(status_code=413, detail="File exceeds maximum allowed size")

    job_id = str(uuid.uuid4())
    s3 = boto3.client("s3")

    # Pre-signed URL lets the client upload directly to S3,
    # bypassing our servers and saving egress cost.
    upload_url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": "raw-uploads", "Key": f"{user_id}/{job_id}/{req.filename}"},
        ExpiresIn=3600,
    )

    # Persist job record with PENDING status
    await db.execute(
        "INSERT INTO upload_jobs (job_id, user_id, status, created_at) VALUES ($1, $2, 'PENDING', NOW())",
        job_id, user_id
    )

    return UploadInitiateResponse(job_id=job_id, upload_url=upload_url, expires_at="...")

The critical design decision here is the pre-signed URL pattern: the client uploads bytes directly to object storage rather than streaming them through your application servers. This eliminates a massive bandwidth bottleneck and reduces upload latency by one network hop.

Status Polling
## GET /v1/videos/upload/{job_id}/status
## Client polls this endpoint after completing the S3 PUT.
## Returns processing stage and, when complete, the permanent video ID.

from enum import Enum

class JobStatus(str, Enum):
    PENDING     = "PENDING"      # queued, not yet picked up
    PROCESSING  = "PROCESSING"   # transcoding in progress
    READY       = "READY"        # all quality levels available
    FAILED      = "FAILED"       # non-recoverable error

class StatusResponse(BaseModel):
    job_id: str
    status: JobStatus
    progress_pct: int | None     # 0-100 during PROCESSING
    video_id: str | None         # populated only when status == READY
    error_message: str | None    # populated only when status == FAILED

@app.get("/v1/videos/upload/{job_id}/status", response_model=StatusResponse)
async def get_upload_status(job_id: str, user_id: str = Depends(get_current_user)):
    row = await db.fetchrow(
        "SELECT * FROM upload_jobs WHERE job_id = $1 AND user_id = $2",
        job_id, user_id
    )
    if not row:
        raise HTTPException(status_code=404, detail="Job not found")

    return StatusResponse(
        job_id=job_id,
        status=row["status"],
        progress_pct=row["progress_pct"],
        video_id=row["video_id"] if row["status"] == JobStatus.READY else None,
        error_message=row["error_message"]
    )

⚠️ Common Mistake: Do not use WebSockets for upload status unless the client explicitly needs real-time push. Polling on a 2–5 second interval is simpler to implement, easier to reason about under load, and perfectly adequate for a process that takes minutes.
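
Client-side, that polling loop is only a few lines. A sketch, where the `fetch_status` callable stands in for the GET request above:

```python
import time

def poll_until_done(fetch_status, interval_s: float = 3.0,
                    timeout_s: float = 3600) -> dict:
    """Poll the status endpoint every few seconds until the job reaches
    a terminal state (READY or FAILED) or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status["status"] in ("READY", "FAILED"):
            return status
        time.sleep(interval_s)
    raise TimeoutError("transcoding did not finish within the timeout")
```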

Playback URL Retrieval
## GET /v1/videos/{video_id}/playback
## Returns the manifest URL (HLS or DASH) that the video player
## uses to begin adaptive bitrate streaming via CDN.

class PlaybackResponse(BaseModel):
    video_id: str
    manifest_url: str   # CDN-hosted .m3u8 (HLS) or .mpd (DASH)
    thumbnail_url: str
    duration_seconds: int
    available_resolutions: list[str]  # e.g., ["360p","720p","1080p"]

@app.get("/v1/videos/{video_id}/playback", response_model=PlaybackResponse)
async def get_playback_url(video_id: str):
    # `redis`, `db_replica`, and `json` come from the service's shared setup.
    # Try Redis cache first (TTL = 5 minutes for manifest pointers)
    cached = await redis.get(f"playback:{video_id}")
    if cached:
        return PlaybackResponse(**json.loads(cached))

    # Cache miss — query metadata DB read replica
    row = await db_replica.fetchrow(
        "SELECT * FROM videos WHERE video_id = $1 AND status = 'READY'", video_id
    )
    if not row:
        raise HTTPException(status_code=404, detail="Video not available")

    # CDN base URL prepended to the stored manifest path
    CDN_BASE = "https://cdn.example.com"
    response = PlaybackResponse(
        video_id=video_id,
        manifest_url=f"{CDN_BASE}/{row['manifest_path']}",
        thumbnail_url=f"{CDN_BASE}/{row['thumbnail_path']}",
        duration_seconds=row["duration_seconds"],
        available_resolutions=row["resolutions"]
    )

    await redis.setex(f"playback:{video_id}", 300, json.dumps(response.dict()))
    return response

The manifest URL points to a CDN edge node, not your origin. The player never contacts your application server again for segment data — all subsequent byte transfers happen over the CDN, achieving the sub-50ms edge latency required by our non-functional requirements.
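
The manifest itself is plain text. A minimal HLS master playlist for the three renditions listed in the response above might look like this (bitrates and paths are illustrative):

```
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
360p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=3000000,RESOLUTION=1280x720
720p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=6000000,RESOLUTION=1920x1080
1080p/playlist.m3u8
```

The player reads this once, then picks a rendition playlist per its ABR algorithm and fetches segments from the CDN edge.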

Step 5 — Live Streaming vs. On-Demand: Where the Design Diverges

Once you have the on-demand design on the board, a strong interviewer will push: "Now how would you handle live streaming?" This is where you demonstrate depth.

On-Demand Characteristics

On-demand video has pre-encoded, immutable segments sitting in object storage. The CDN aggressively caches them — a segment requested by 10,000 concurrent viewers is fetched from origin once and served from edge 9,999 times. This is the ideal CDN use case.

Live Streaming Characteristics

Live streaming changes three fundamental properties:

  1. Segments are created in real time — new 2–4 second .ts or .m4s segments arrive continuously. The CDN cache TTL must be near zero for the live manifest (.m3u8 playlist), otherwise viewers see stale segment lists and buffering.
  2. No global pre-encoding — the broadcaster's encoder (OBS, FFmpeg, a mobile app) pushes an RTMP or SRT stream to an ingest edge node. A lightweight transcoding process converts this to HLS/DASH segments while the stream is in flight.
  3. Concurrent viewer spikes are unpredictable — a major live event (sports final, product launch) can produce 100× the per-video concurrent viewers of on-demand content.

ON-DEMAND FLOW:
User Upload → Transcode → Object Store → CDN Cache → Viewer
             (async,       (immutable,    (long TTL,
              minutes)      permanent)     hours/days)

LIVE STREAMING FLOW:
Broadcaster → RTMP Ingest → Real-Time     → HLS/DASH  → CDN Edge  → Viewer
              Edge Node     Transcoder       Segments    (near-zero
                            (seconds of      + Live      TTL on
                             latency)        Manifest    manifest)

The architectural divergences to call out explicitly:

  • 🔧 Ingest infrastructure: live streams require dedicated RTMP/SRT ingest servers at the edge, not the same upload service used for files.
  • 🔧 Segment storage: live segments go into a short-lived hot tier (e.g., a Redis-backed ring buffer or a dedicated live segment store) rather than permanent object storage. After the stream ends, segments can be archived to S3 for VOD replay.
  • 🔧 Manifest serving: the live .m3u8 is generated dynamically by the ingest system and served with Cache-Control: max-age=2. The CDN respects this, meaning every manifest request hits the origin — this is the correct trade-off for liveness.
  • 🔧 Fan-out at scale: for viral live events, a CDN shield (mid-tier caching layer between origin and edge) absorbs repeated segment requests. Without it, 1 million concurrent viewers each fetching a new segment every 2 seconds would generate 500,000 origin requests per second.

💡 Real-World Example: Amazon IVS (Interactive Video Service) uses exactly this architecture — regional ingest endpoints, GPU-accelerated real-time transcoding, and CloudFront with aggressive shield caching. Twitch, which Amazon acquired, pioneered many of these patterns at scale before IVS productized them.

🎯 Key Principle: In any system design, the read/write ratio and mutability of data determine your caching strategy. On-demand video is immutable and read-heavy (perfect for long-lived CDN caching). Live segments are ephemeral and write-heavy (require near-zero TTLs and a ring buffer, not permanent storage).
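
That principle translates directly into per-object Cache-Control headers. An illustrative policy function (the exact TTL values are our assumptions, consistent with the trade-offs discussed above):

```python
def cache_control_for(path: str, is_live: bool) -> str:
    """Pick a Cache-Control header based on the mutability of the object."""
    if path.endswith((".ts", ".m4s", ".mp4")):
        # Encoded segments are immutable once published: cache for a year.
        return "public, max-age=31536000, immutable"
    if path.endswith((".m3u8", ".mpd")):
        # Live manifests change every segment; VOD manifests rarely do.
        return "public, max-age=2" if is_live else "public, max-age=3600"
    return "public, max-age=300"
```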

Step 6 — Justifying Trade-Offs Out Loud

The architecture is on the board. Now the interviewer asks the question that separates senior from junior candidates: "Why did you choose that?"

📋 Quick Reference Card: Key Trade-Off Decisions

Decision Chosen Approach Rejected Alternative Justification
🔒 Upload routing Pre-signed S3 URL Proxy through app servers Eliminates server bandwidth bottleneck; S3 can absorb burst uploads natively
⚡ Metadata DB PostgreSQL + read replicas MongoDB Relational model fits video metadata; ACID guarantees needed for upload status transitions
📡 Queue Kafka RabbitMQ Kafka's log retention allows transcoding jobs to be replayed on worker failure without re-upload
🌍 View counts Redis counter + async flush DB increment on every view At 1B views/day, synchronous DB writes would saturate connections; eventual consistency is acceptable
🔄 Recommendation Async, separate service Inline with feed request ML pipeline latency is variable; async pre-computation keeps feed API at < 100 ms P99

⚠️ Common Mistake: Saying "I'd use microservices" without explaining the boundary of each service or the communication pattern between them. Granularity is not a virtue in itself — each service boundary should correspond to an independent scaling requirement or fault isolation need.

Bringing It All Together

A complete design session for a YouTube-like system follows a clear arc: requirements first, numbers second, architecture third, APIs fourth, trade-offs throughout. The capacity estimation numbers — petabytes of daily storage, 83 terabits per second of egress, thousands of transcoding workers — are not trivia; they are the forcing functions that make every architectural decision feel inevitable rather than arbitrary.

When you articulate that pre-signed URLs exist because 83 Tbps cannot flow through application servers, or that Redis view counters exist because 1 billion daily increments would destroy a relational database, you are not reciting patterns — you are demonstrating that you understand why distributed systems look the way they do.

🧠 Mnemonic: "RACED"Requirements, Architecture, Capacity, Endpoints, Divergence. Use it as a checklist of the five stages; in the session itself, run the capacity numbers before you draw the architecture.

💡 Remember: The interviewer is not looking for a perfect system. They are looking for a candidate who makes explicit decisions, acknowledges trade-offs honestly, and can adapt the design when constraints change — exactly as real engineers do every day.

Common Mistakes and Pitfalls in Streaming System Interviews

Even candidates who have studied distributed systems deeply can stumble in a streaming system design interview — not because they lack knowledge, but because video streaming has a handful of domain-specific nuances that don't surface in generic system design prep. Interviewers at companies like Google, Meta, and Netflix have seen the same errors repeated across hundreds of sessions. Understanding these patterns before you walk into the room is the difference between a confident recovery and a silent spiral.

This section dissects the five most damaging mistakes candidates make, explains why each one signals a gap in production thinking, and shows you exactly how to course-correct mid-interview so the conversation stays on track.


Mistake 1: Skipping Transcoding and Assuming One Format Rules All ⚠️

❌ Wrong thinking: "The user uploads an MP4, we store it, and clients download it. Done."

✅ Correct thinking: "The uploaded file is raw source material. Before any client can watch it, we must transform it into multiple formats, resolutions, and bitrates optimized for every device and network condition."

This is the most common first-pass mistake and the one that most immediately signals inexperience with video infrastructure. When a candidate says "we store the uploaded file and serve it directly," they are implicitly assuming:

  • 🔧 Every client device supports the same codec (H.264? HEVC? AV1?)
  • 🔧 Every network connection is fast enough for the source file's bitrate
  • 🔧 Mobile, desktop, and smart TV clients all render video identically

None of these assumptions hold in production. A raw video uploaded by a creator could be a 4K ProRes file at several gigabytes per minute. Streaming that directly to a viewer on a 3G mobile connection is physically impossible.

Transcoding is the process of converting a source video into multiple output renditions — combinations of resolution (360p, 720p, 1080p, 4K), codec (H.264, HEVC, VP9, AV1), and bitrate — that clients can switch between dynamically. The output of transcoding is a manifest file (in HLS, this is a .m3u8 playlist; in DASH, an .mpd file) that lists all available renditions and their segment URLs.

## Simplified representation of what a transcoding job looks like
## In production this would be submitted to a job queue (e.g., AWS MediaConvert)

import json

def build_transcoding_job(source_s3_uri: str, video_id: str) -> dict:
    """
    Constructs the configuration for a multi-rendition transcoding job.
    Each output target represents one quality tier clients can stream.
    """
    return {
        "job_id": f"transcode-{video_id}",
        "source": source_s3_uri,
        "outputs": [
            {
                "label": "360p",
                "resolution": "640x360",
                "codec": "H.264",
                "bitrate_kbps": 800,
                "destination": f"s3://processed-videos/{video_id}/360p/"
            },
            {
                "label": "720p",
                "resolution": "1280x720",
                "codec": "H.264",
                "bitrate_kbps": 3000,
                "destination": f"s3://processed-videos/{video_id}/720p/"
            },
            {
                "label": "1080p",
                "resolution": "1920x1080",
                "codec": "H.264",
                "bitrate_kbps": 6000,
                "destination": f"s3://processed-videos/{video_id}/1080p/"
            }
        ],
        "manifest_output": f"s3://processed-videos/{video_id}/master.m3u8"
    }

job = build_transcoding_job("s3://raw-uploads/video-abc123.mov", "abc123")
print(json.dumps(job, indent=2))

This code illustrates that transcoding is a job — a structured configuration submitted to a processing system — not a one-liner. In an interview, when you draw your architecture diagram, always include a Transcoding Service box between the upload storage bucket and the processed video storage bucket.

💡 Pro Tip: If you realize mid-interview that you skipped transcoding, say: "I want to go back and add a critical component I glossed over — the transcoding pipeline. Without it, we have no actual streamable content." Interviewers reward self-correction far more than they penalize the initial omission.


Mistake 2: A Single Monolithic Upload Endpoint Without Resumability ⚠️

❌ Wrong thinking: "The client sends a POST request with the video file, our server receives it, saves it to S3, and responds 200 OK."

✅ Correct thinking: "Uploads are long-running, network-fragile operations. We need chunked, resumable uploads so a 2 GB file upload doesn't have to restart from zero after a brief network drop."

Video files are enormous. A 10-minute 4K video can easily exceed 4–8 GB. A standard HTTP POST is a single TCP stream — if the connection drops at 95% completion, the entire upload fails and restarts. At scale, this creates both a terrible user experience and massive wasted bandwidth on your infrastructure.

The correct design uses multipart uploads (AWS S3's native multipart upload API, for example) combined with a resumable upload protocol. The flow looks like this:

Client                     Upload Service              Object Storage (S3)
  |                              |                           |
  |-- POST /uploads/initiate --> |                           |
  |                              |-- InitiateMultipartUpload |
  |                              |<-- UploadId --------------|  
  |<-- { upload_id, chunk_urls } |                           |
  |                              |                           |
  |-- PUT chunk_url[0] (0-5MB) ->|------------------------>  |
  |-- PUT chunk_url[1] (5-10MB)->|------------------------>  |
  |   ... (parallel uploads) ... |                           |
  |-- POST /uploads/complete --> |                           |
  |                              |-- CompleteMultipartUpload |
  |                              |<-- Final S3 URL ----------|
  |<-- { video_id, status } -----|                           |

With this design, each chunk (typically 5–25 MB) is uploaded independently. If a chunk fails, only that chunk is retried. The client tracks which chunks succeeded locally (or the server stores this state), enabling true resumability even after a browser refresh or app restart.

import boto3
from dataclasses import dataclass
from typing import List

@dataclass
class UploadSession:
    upload_id: str
    video_id: str
    total_parts: int
    completed_parts: List[dict]  # [{"PartNumber": 1, "ETag": "..."}]

def initiate_chunked_upload(bucket: str, video_id: str) -> UploadSession:
    """
    Starts a multipart upload session. The UploadId is the resumability token —
    store it client-side so uploads can resume after network interruptions.
    """
    s3 = boto3.client('s3')
    key = f"raw-uploads/{video_id}/source"
    
    response = s3.create_multipart_upload(
        Bucket=bucket,
        Key=key,
        ContentType='video/mp4',
        Metadata={'video_id': video_id}
    )
    
    return UploadSession(
        upload_id=response['UploadId'],  # This is your resumability token
        video_id=video_id,
        total_parts=0,
        completed_parts=[]
    )

def complete_upload(bucket: str, key: str, session: UploadSession) -> str:
    """Assembles all uploaded parts into the final object in S3."""
    s3 = boto3.client('s3')
    response = s3.complete_multipart_upload(
        Bucket=bucket,
        Key=key,
        UploadId=session.upload_id,
        MultipartUpload={'Parts': session.completed_parts}
    )
    return response['Location']  # Final S3 URL, triggers transcoding pipeline

🎯 Key Principle: The upload service should be stateless with respect to progress tracking. The client holds its upload ID and completed chunk list. If the upload service crashes and restarts, the client can resume by querying GET /uploads/{upload_id}/status to discover which chunks are already stored.
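
Server side, that discovery maps onto S3's list_parts call. A sketch (the `s3` parameter is any boto3 S3 client; the helper name is ours):

```python
def completed_parts(s3, bucket: str, key: str, upload_id: str) -> list:
    """Ask S3 which parts of a multipart upload already succeeded,
    returned in the shape complete_multipart_upload expects, so the
    client can skip re-uploading them on resume."""
    resp = s3.list_parts(Bucket=bucket, Key=key, UploadId=upload_id)
    return [{"PartNumber": p["PartNumber"], "ETag": p["ETag"]}
            for p in resp.get("Parts", [])]
```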

🤔 Did you know? YouTube's upload infrastructure processes over 500 hours of video every single minute. Without chunked uploads and parallel processing pipelines, this throughput would be physically impossible to achieve reliably.


Mistake 3: Neglecting the CDN Layer and Routing Traffic to Origin ⚠️

❌ Wrong thinking: "Clients request video segments directly from our S3 bucket or video servers."

✅ Correct thinking: "Origin servers exist to populate the CDN. Clients should almost never talk directly to origin storage for playback."

This mistake has a deceptively simple fix in the diagram — add a CDN layer — but candidates frequently either omit it entirely or add it as an afterthought without explaining why it is load-bearing for the entire system.

Consider what happens without a Content Delivery Network (CDN): a popular video gets posted, 2 million users start watching simultaneously, and every one of them is downloading 2-second segments from the same S3 bucket in us-east-1. You've just created a thundering herd problem that will either bankrupt you in egress fees or trigger S3 rate limiting, or both.

A CDN solves this by maintaining edge nodes geographically distributed near end users. The first viewer in Tokyo who requests segment 720p/segment_00042.ts causes the Tokyo edge node to fetch it from origin and cache it locally. The next 50,000 viewers in Tokyo receive that same cached segment from the edge node — origin sees exactly one request.

                    ┌─────────────────────────────────────┐
                    │           Origin Storage             │
                    │         (S3 / Object Store)          │
                    └──────────────┬──────────────────────┘
                                   │ (cold cache misses only)
               ┌───────────────────┼───────────────────┐
               │                   │                   │
        ┌──────▼──────┐    ┌───────▼─────┐    ┌───────▼─────┐
        │  CDN Edge   │    │  CDN Edge   │    │  CDN Edge   │
        │  New York   │    │    Tokyo    │    │   London    │
        └──────┬──────┘    └──────┬──────┘    └──────┬──────┘
               │                  │                  │
        ┌──────▼──────┐    ┌──────▼──────┐    ┌──────▼──────┐
        │   Viewers   │    │   Viewers   │    │   Viewers   │
        │    (US)     │    │   (Asia)    │    │  (Europe)   │
        └─────────────┘    └─────────────┘    └─────────────┘

Beyond caching, the CDN also handles TCP connection termination near the user, dramatically reducing latency from the sheer physics of packet travel time. A viewer in Singapore fetching from a Singapore CDN edge node experiences single-digit millisecond connection setup versus hundreds of milliseconds to a US-based origin.

💡 Real-World Example: Netflix built its own CDN called Open Connect. Netflix ships physical server appliances directly to ISPs and installs them inside the ISP's data centers. When a Netflix viewer in a Comcast region streams a popular show, those video bytes may never leave Comcast's own network — they come from an Open Connect appliance sitting in Comcast's facility. This is CDN thinking taken to its logical extreme.

⚠️ Common Mistake: Candidates sometimes say "we can just use CloudFront" without explaining cache TTL strategy. The interviewer will probe: how long do you cache video segments? Segments for published videos should have very long TTLs (days or weeks) since their content never changes. Manifest files (.m3u8) for live streams must have very short TTLs (2–5 seconds) since they update continuously with new segment URLs.


Mistake 4: Ignoring Read Scalability for the Metadata Service ⚠️

❌ Wrong thinking: "We have a PostgreSQL database that stores video metadata — title, description, view count, uploader. Clients query it directly when loading a video page."

✅ Correct thinking: "The metadata database is a read-heavy hotspot. We need read replicas, caching layers, and potentially separate read/write path services."

This mistake is subtle because candidates correctly identify that video content goes through CDN, but then route all metadata queries directly to a primary database. Video metadata — title, creator, like count, description — is fetched every single time any viewer opens a video. For a video with 10 million concurrent viewers, that's 10 million simultaneous reads against one database.

The fix requires layering multiple read-scaling strategies:

Read replicas — PostgreSQL, MySQL, and Aurora all support synchronous or asynchronous replication to read-only followers. Your video detail service routes all SELECT queries to replicas, reserving the primary for writes.

In-memory caching — A Redis or Memcached layer in front of the database caches the result of common metadata queries. Cache key: video_meta:{video_id}. TTL: 60 seconds to a few minutes. This deflects the overwhelming majority of reads.

Denormalization for hot paths — The video watch page needs creator name, avatar URL, video title, and description. Joining three normalized tables on every request at scale is expensive. Precompute and store a denormalized read model:

┌─────────────────────────────────────────────────────┐
│              Read Path for Video Metadata            │
└─────────────────────────────────────────────────────┘

Client Request: GET /api/v1/videos/{id}
        │
        ▼
┌───────────────┐    Cache HIT     ┌─────────────────┐
│  API Gateway  │ ───────────────► │   Redis Cache   │ ──► Return metadata
│               │                  │  (video_meta:X) │
└───────┬───────┘                  └─────────────────┘
        │ Cache MISS
        ▼
┌───────────────┐                  ┌─────────────────┐
│  Metadata     │ ───────────────► │  Read Replica   │
│  Service      │                  │  (PostgreSQL)   │
└───────────────┘                  └────────┬────────┘
        │                                   ▲
        │ Populate cache                    │ Async replication
        ▼                                   │
  (Redis Cache)                    ┌────────┴────────┐
                                   │   Primary DB    │
                                   └─────────────────┘

🎯 Key Principle: The cache hit rate is the most important metric for metadata service health. For a popular video, you expect 99%+ cache hits. If cache hit rate drops, every miss cascades directly to your database under maximum load — the worst possible time for increased database pressure.

💡 Mental Model: Think of the primary database as a vault and the Redis cache as a teller window. You don't make every customer walk into the vault for every transaction — the teller handles the common requests from their drawer, and only goes to the vault for rare operations.
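The teller-window pattern above is usually implemented as cache-aside: check the cache, fall back to a replica on a miss, then repopulate the cache. A minimal sketch with the Redis client and replica query abstracted behind callables so the flow is visible (all names here are illustrative):

```python
import json
from typing import Callable, Optional

def get_video_metadata(
    video_id: str,
    cache_get: Callable[[str], Optional[str]],
    cache_set: Callable[[str, str, int], None],
    db_fetch: Callable[[str], dict],
    ttl_seconds: int = 120,
) -> dict:
    """Cache-aside read path: try the cache, fall back to a read replica,
    then repopulate the cache so the next reader gets a hit."""
    key = f"video_meta:{video_id}"
    cached = cache_get(key)
    if cached is not None:
        return json.loads(cached)            # cache hit: database untouched
    row = db_fetch(video_id)                 # cache miss: one replica query
    cache_set(key, json.dumps(row), ttl_seconds)
    return row

# Demo with in-memory stand-ins for Redis and the replica
store: dict = {}
db_calls: list = []

def fake_db_fetch(vid: str) -> dict:
    db_calls.append(vid)
    return {"id": vid, "title": "Demo", "creator": "alice"}

cache_set = lambda k, v, ttl: store.update({k: v})
get_video_metadata("v1", store.get, cache_set, fake_db_fetch)
get_video_metadata("v1", store.get, cache_set, fake_db_fetch)
print(len(db_calls))   # 1: the second read was served from cache
```

In production the callables would be a Redis `GET`/`SET EX` pair and a replica `SELECT`, but the shape of the logic is exactly this.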


Mistake 5: Failing to Address Consistency Trade-offs ⚠️

❌ Wrong thinking: "All data should be strongly consistent so users always see accurate information."

✅ Correct thinking: "Different data types have different consistency requirements. Conflating them leads to either incorrect behavior in critical flows or unnecessary bottlenecks in non-critical ones."

This is the mistake that separates senior-level candidates from the rest. Consistency is not a binary switch — it's a spectrum of choices you make deliberately based on the consequences of inconsistency.

Consider two types of writes in a streaming platform:

View count increments — When a user watches a video, we want to increment the view count. Does the viewer care if the displayed count is off by a few thousand? Almost certainly not. YouTube itself acknowledges that view counts are approximate and periodically reconciled. The correct architecture here uses eventual consistency: clients emit view events to a Kafka topic, a stream processor (Flink, Spark Streaming) aggregates them in windows, and the aggregated count is periodically written back to the database.

Payment processing — If a creator earns revenue based on view counts, or a viewer pays for a rental, the monetary calculation must be strongly consistent. You cannot have two nodes disagree about whether a payment succeeded, or a creator's revenue dashboard diverge from their actual payout. This demands transactions with ACID guarantees, potentially distributed transactions with two-phase commit or saga patterns.

## Eventual consistency: View count via event stream
## This runs in a stream processor (e.g., Apache Flink)
## View events arrive in bursts; we count them in tumbling windows

from collections import defaultdict
from typing import Iterator
import time

def aggregate_view_events(events: Iterator[dict], window_seconds: int = 60) -> dict:
    """
    Aggregates view count events into per-video totals over a time window.
    These totals are periodically flushed to the database — NOT on every event.
    This is intentionally eventually consistent: the displayed count lags reality
    by up to window_seconds, which is acceptable for a non-critical metric.
    """
    counts = defaultdict(int)
    window_start = time.time()
    
    for event in events:
        video_id = event['video_id']
        counts[video_id] += 1
        
        # Flush window to DB when window expires
        if time.time() - window_start >= window_seconds:
            flush_to_database(dict(counts))  # One DB flush per window, not per event
            counts.clear()
            window_start = time.time()
    
    return dict(counts)

def flush_to_database(view_increments: dict) -> None:
    """
    Uses atomic INCREMENT operations so concurrent flushes
    don't overwrite each other — they accumulate correctly.
    Each video_id gets its count incremented, not set.
    Assumes `db` is a module-level DB-API connection or pool.
    """
    for video_id, increment in view_increments.items():
        db.execute(
            "UPDATE videos SET view_count = view_count + %s WHERE id = %s",
            (increment, video_id)
        )

The contrast with a payment flow is stark. In a payment service, you would never use an event queue as the authoritative record of whether a charge succeeded. You'd use a database transaction with a lock, confirm the payment provider's response synchronously, and write a reconciliation record — all within a single ACID transaction.
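For contrast with the eventually consistent view counter, here is a minimal sketch of the strongly consistent payment write. SQLite in-memory stands in for the real ACID store, and a unique idempotency key makes client retries safe (table and key names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (idempotency_key TEXT PRIMARY KEY, "
    "amount_cents INTEGER, status TEXT)"
)

def record_payment(conn: sqlite3.Connection,
                   idempotency_key: str,
                   amount_cents: int) -> bool:
    """Strongly consistent write: a single ACID transaction, made idempotent
    by a unique key so a retried request cannot double-charge."""
    try:
        with conn:  # BEGIN ... COMMIT, rolled back automatically on error
            conn.execute(
                "INSERT INTO payments (idempotency_key, amount_cents, status) "
                "VALUES (?, ?, 'confirmed')",
                (idempotency_key, amount_cents),
            )
        return True
    except sqlite3.IntegrityError:
        return False   # duplicate key: the charge was already recorded

print(record_payment(conn, "charge-123", 499))   # True
print(record_payment(conn, "charge-123", 499))   # False: retry is a safe no-op
```

Note what is different from the view counter: the write is synchronous, the uniqueness constraint is enforced by the database, and the caller gets a definitive answer before returning to the user.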

📋 Quick Reference Card: Consistency by Data Type

📊 Data Type 🎯 Consistency Level 🔧 Implementation
👁️ View counts Eventual Kafka → stream aggregator → batch DB write
👍 Like counts Eventual Redis counter → periodic DB sync
💳 Payment records Strong ACID transaction, synchronous confirmation
🔐 Account access rights Strong Consistent read from primary DB
📝 Video metadata Weak (cached) Cache with TTL, stale reads acceptable
📡 Live stream viewer count Approximate HyperLogLog in Redis

🧠 Mnemonic: "Money = Must, Metrics = Maybe" — Any data that touches money or access control demands strong consistency. Metrics, counts, and analytics can tolerate eventual consistency.


Recovering Mid-Interview: A Framework for Self-Correction

Knowing the mistakes is half the battle. The other half is knowing how to recover gracefully when you realize you've made one mid-session. Interviewers aren't expecting perfection — they're evaluating how you think under pressure.

Here's a recovery framework you can use verbatim:

🔧 For a missing component (like transcoding or CDN):

"Actually, I want to revise what I said earlier. I described serving video directly from the upload store, but that's missing a critical stage. Let me add the transcoding pipeline between raw upload storage and the delivery layer..."

🔧 For a consistency oversight:

"I realize I applied the same consistency model to everything. Let me be more precise — for view counts, eventual consistency via event streaming is correct and actually preferable for throughput. But for the payment service, I'd require ACID transactions. Those are fundamentally different requirements."

🔧 For a scalability gap:

"I designed the metadata service as a single database with direct client access. That breaks under load. Let me add a Redis caching layer and separate the read path from the write path with read replicas."

💡 Pro Tip: Drawing a revised diagram box and saying "let me update this" is always more impressive than defending an incorrect design. Interviewers model your on-call behavior: can you identify problems in a running system and articulate fixes? Mid-interview self-correction is a live demonstration of exactly that skill.


Putting It All Together: The Pre-Interview Mental Checklist

Before you cap your pen at the end of a streaming system design session, run through this checklist mentally:

🎯 Transcoding: Is there a transcoding service between raw upload storage and the playback delivery path? Does it produce multiple renditions?

🎯 Upload Resumability: Is the upload endpoint chunked and stateless? Can a client resume after failure without restarting from zero?

🎯 CDN Layer: Are video segments served from edge nodes, not origin? Have you discussed cache TTL differences between VOD segments and live manifests?

🎯 Metadata Read Scaling: Is there a caching layer (Redis) in front of the metadata database? Are read replicas handling SELECT queries?

🎯 Consistency Intentionality: Have you explicitly stated which data is eventually consistent and which demands strong consistency, and why?

If you can answer yes to all five with a coherent justification for each, you have addressed the most damaging failure modes in streaming system design interviews. The goal is not to memorize a diagram — it's to internalize why each component exists, so you can reconstruct and defend the architecture from first principles in any variation of the question.

Key Takeaways and Interview Cheat Sheet for Streaming Design

You started this lesson without a mental model for how a video gets from a creator's laptop to a viewer's phone on the other side of the planet. You now have one — and more importantly, you have the vocabulary, the trade-off reasoning, and the structural instincts to explain it clearly under interview pressure. This final section locks in what you've learned, gives you a battle-tested reference card to review the night before an interview, and equips you with the senior-level talking points that separate candidates who "know the components" from candidates who get offers.


The Five Pillars of a Streaming System

Every streaming system interview, regardless of how the prompt is phrased, ultimately resolves into five distinct concerns. Think of these as load-bearing columns — remove any one of them and the architecture collapses.

┌─────────────────────────────────────────────────────────────────┐
│                 STREAMING SYSTEM: FIVE PILLARS                  │
├─────────────┬──────────────┬──────────────┬──────────┬──────────┤
│   UPLOAD    │ TRANSCODING  │   STORAGE    │   CDN    │METADATA  │
│  PIPELINE   │   ENGINE     │    LAYER     │DELIVERY  │  MGMT    │
│             │              │              │          │          │
│ - Chunked   │ - Parallel   │ - Object     │ - Edge   │ - Video  │
│   upload    │   workers    │   store      │   nodes  │   index  │
│ - Resume    │ - Format     │   (S3/GCS)   │ - ABR    │ - User   │
│   support   │   variants   │ - Warm/cold  │ - Cache  │   data   │
│ - Virus     │ - Thumbnail  │   tiering    │   TTLs   │ - Watch  │
│   scan      │   gen        │              │          │   history│
└─────────────┴──────────────┴──────────────┴──────────┴──────────┘

🧠 Mnemonic: "UTSCD"Upload, Transcoding, Storage, CDN, Data (metadata). Say "Uploads Turn Slowly into Content Delivered" to remember the order, which also mirrors the actual data flow.

The reason this five-pillar framing is so powerful in an interview is that it gives you an instant agenda. When the interviewer says "Design YouTube," you can respond: "I'd like to walk through the five core subsystems — the upload pipeline, transcoding, storage, CDN delivery, and metadata management. I'll cover each one and the trade-offs between them. Does that structure work for you?" That single sentence signals that you have a plan, you won't miss anything important, and you're collaborative. It sets the tone for the entire conversation.

💡 Pro Tip: Treat each pillar as an independent microservice boundary in your diagram. This makes it natural to discuss independent scaling, failure isolation, and team ownership — all topics senior engineers care about.


Decision Framework: Choosing Your Database Layer

One of the most commonly fumbled questions in streaming interviews is: "What database would you use and why?" Candidates either pick one database for everything (wrong) or list several without explaining the reasoning (unconvincing). Here is a repeatable decision framework.

Data Type Access Pattern Recommended Store Reasoning
🎬 Video metadata (title, description, tags) Read-heavy, keyword search Elasticsearch + PostgreSQL Full-text search + relational integrity
👤 User profiles and account data Point lookups by user ID PostgreSQL (or MySQL) ACID guarantees, join support for billing
📜 Watch history and view counts High write throughput, eventual consistency OK Cassandra or DynamoDB Wide-column, tunable consistency, horizontal scale
💬 Comments and likes Read-heavy at top level, fan-out on write Redis (hot) + PostgreSQL (cold) Cache recent comments, persist to relational
🔑 Session tokens and rate limiting Microsecond reads, TTL-based expiry Redis In-memory, built-in TTL, O(1) lookups

🎯 Key Principle: The right database is determined by three questions: (1) What is the read/write ratio? (2) Does this data require strong consistency or is eventual consistency acceptable? (3) How will this data be queried — by primary key, by range, or by full-text search?

❌ Wrong thinking: "I'll use PostgreSQL for everything because it can handle it."

✅ Correct thinking: "Watch history has millions of writes per second and only needs eventual consistency, so Cassandra with a partition key of (user_id, month) and a clustering key on watched_at gives us time-range queries at scale without locking."

Here is what that Cassandra schema looks like in practice:

-- Cassandra schema for watch history
-- Partition by user + month to bound partition size
-- Clustering by watched_at DESC to serve "recently watched" queries efficiently

CREATE TABLE watch_history (
    user_id     UUID,
    month       TEXT,          -- e.g., '2024-06' keeps partitions time-bounded
    watched_at  TIMESTAMP,
    video_id    UUID,
    duration_s  INT,           -- seconds watched (for resume point)
    completed   BOOLEAN,
    PRIMARY KEY ((user_id, month), watched_at, video_id)
) WITH CLUSTERING ORDER BY (watched_at DESC)
   AND default_time_to_live = 15552000;  -- auto-expire after 6 months

This schema handles three real product requirements: resuming a video (fetch latest duration_s for a video_id), showing "continue watching" (recent entries per user), and not accumulating unbounded data (TTL auto-expires old rows). Explaining why your schema maps to product requirements is exactly the kind of thinking that earns senior-level evaluations.


The Three Non-Functional Requirements You Cannot Skip

Every system design interview has an implicit contract: you are expected to address non-functional requirements (NFRs) without being asked. Candidates who only talk about features and components are designing applications. Candidates who also address NFRs are designing systems. The three you must cover in a streaming interview are latency, availability, and scalability.

Latency

For streaming systems, latency has two completely different meanings depending on context, and conflating them is a classic mistake.

  • Playback start latency (time-to-first-frame): Target under 2 seconds. Achieved via CDN edge caching, pre-buffered segments, and serving the first video chunk from the geographically closest edge node.
  • Upload processing latency (time-to-available): Target varies — live streams need seconds, VOD can tolerate minutes. Achieved via priority queues, where live stream jobs preempt batch transcoding jobs.

⚠️ Common Mistake: Treating all latency as the same problem. Always clarify which latency you're discussing and name your target SLA.
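The priority-queue idea behind upload-processing latency can be sketched as a two-level heap, where live jobs always dequeue before batch VOD jobs (job names are illustrative):

```python
import heapq
import itertools

_seq = itertools.count()   # tie-breaker preserves FIFO order within a priority
queue: list = []

def submit_job(job_id: str, is_live: bool) -> None:
    priority = 0 if is_live else 1   # live jobs sort ahead of batch VOD jobs
    heapq.heappush(queue, (priority, next(_seq), job_id))

submit_job("vod-cat-video", is_live=False)
submit_job("vod-lecture", is_live=False)
submit_job("live-concert", is_live=True)

print(heapq.heappop(queue)[2])   # live-concert
print(heapq.heappop(queue)[2])   # vod-cat-video
```

In a real system this would be two queues (or priority lanes in SQS/Kafka consumer groups) rather than one in-process heap, but the scheduling invariant is the same: a live segment never waits behind a batch job.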

Availability

For a platform like YouTube or Netflix, the availability target is typically 99.99% ("four nines"), which allows roughly 52 minutes of downtime per year. Hitting this requires:

  • 🔧 Multi-region active-active or active-passive deployment — At minimum, a secondary region that can accept traffic within seconds of a primary region failure.
  • 🔧 Health checks and circuit breakers on every service boundary — If the recommendation service is degraded, video playback must continue unaffected.
  • 🔧 Graceful degradation — If the comments service goes down, the player still works. If the metadata service is slow, serve stale cache rather than blocking.
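The circuit-breaker behavior in the bullets above can be sketched in a few lines. This is a deliberately minimal illustration with made-up thresholds; production systems typically use a hardened library rather than hand-rolling this:

```python
import time

class CircuitBreaker:
    """Minimal illustrative breaker: open after N consecutive failures,
    serve a fallback while open, probe the dependency again after a cooldown."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                return fallback()        # open: fail fast, serve degraded data
            self.opened_at = None        # half-open: probe the dependency again
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()

breaker = CircuitBreaker(max_failures=2)

def broken_comments_service():
    raise RuntimeError("comments service down")

def stale_fallback():
    return {"comments": [], "degraded": True}

for _ in range(3):
    response = breaker.call(broken_comments_service, stale_fallback)
print(response["degraded"])   # True, and the watch page still renders
```

The point to make in an interview: the breaker converts a slow, failing dependency into a fast, predictable degraded response, which is exactly what "the player still works when comments are down" requires.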

Scalability

You must demonstrate that your design scales horizontally at every layer. Stateless services scale by adding instances behind a load balancer. Stateful services (databases, caches) scale via sharding, replication, and read replicas.
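Sharding stateful stores is usually paired with consistent hashing, so that adding or removing a shard remaps only a fraction of keys instead of reshuffling everything. A toy sketch with virtual nodes (shard names and the vnode count are illustrative):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring with virtual nodes. Each physical shard owns
    many points on the ring, which evens out the key distribution."""

    def __init__(self, nodes, vnodes: int = 100):
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash owns the key
        idx = bisect.bisect(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
shard = ring.node_for("video-1f2e3d")
print(shard in {"shard-a", "shard-b", "shard-c"})   # True
print(ring.node_for("video-1f2e3d") == shard)       # True: mapping is stable
```

Managed stores (DynamoDB, Cassandra) do this internally, but being able to explain the mechanism is what makes "we'll shard on video_id" a credible answer.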

💡 Real-World Example: Netflix uses a microservices architecture where each service can be independently scaled. During peak hours (typically 8–10 PM in each time zone), they scale transcoding workers and CDN capacity dynamically. Their CDN, Open Connect, pre-positions popular content on edge appliances before peak traffic hits.


Quick Estimation Formulas for the First Five Minutes

Back-of-the-envelope estimation is a skill interviewers use to test your engineering intuition. The goal is not precision — it is to demonstrate that your design is grounded in reality. Here are the formulas you should commit to memory and be able to apply in under five minutes.

📋 Quick Reference Card: Estimation Cheat Sheet

📐 Metric 🔢 Formula 🎯 Example (YouTube scale)
Daily storage added hours_uploaded_per_day × 3600 × bitrate_Mbps / 8 720K h/day × 3600s × 10Mbps / 8 ≈ 3.2 PB/day raw
Storage with transcoding raw_storage × transcoding_multiplier (≈3–5×) 3.2 PB × 4 ≈ 13 PB/day
Bandwidth for delivery concurrent_viewers × avg_bitrate_Mbps 10M × 4Mbps = 40Tbps peak
CDN cache hit savings total_bandwidth × cache_hit_rate 40Tbps × 0.95 = 38Tbps served from edge
Metadata DB size/year videos_per_day × metadata_bytes × 365 1M × 2KB × 365 = ~730GB/year

Here is how you perform this calculation live in an interview, narrating as you go:

## Back-of-envelope: YouTube-scale storage estimation
## Run this mentally (or on a whiteboard) in under 3 minutes

## --- INPUTS (ask the interviewer or make reasonable assumptions) ---
hours_uploaded_per_minute = 500   # ~500 hours of video uploaded every minute
raw_bitrate_mbps = 10             # typical 1080p upload bitrate
transcoding_factor = 4            # 360p + 480p + 720p + 1080p variants
retention_years = 10              # content kept for 10 years

## --- DAILY RAW STORAGE ---
hours_uploaded_per_day = hours_uploaded_per_minute * 60 * 24    # 720,000 h/day
seconds_uploaded_per_day = hours_uploaded_per_day * 3600        # ~2.6 billion s/day
bits_per_day = seconds_uploaded_per_day * raw_bitrate_mbps * 1_000_000
bytes_per_day = bits_per_day / 8
pb_per_day_raw = bytes_per_day / 1e15

print(f"Raw storage per day: {pb_per_day_raw:.1f} PB")
## Output: Raw storage per day: 3.2 PB

## --- WITH TRANSCODING (multiple formats) ---
pb_per_day_total = pb_per_day_raw * transcoding_factor
print(f"Total storage per day (all formats): {pb_per_day_total:.0f} PB")
## Output: Total storage per day (all formats): 13 PB

## --- 10-YEAR FOOTPRINT ---
eb_total = (pb_per_day_total * 365 * retention_years) / 1000
print(f"10-year storage footprint: {eb_total:.0f} EB")
## Output: 10-year storage footprint: 47 EB

## This is why YouTube uses cold storage tiering —
## only ~5% of content drives ~95% of views.
## The long tail lives in cheaper, slower object storage.

When you walk through this calculation aloud, you demonstrate engineering judgment: you're thinking about tiering, you know that transcoding multiplies your storage requirement, and you're aware that a 10-year retention policy has enormous infrastructure implications. That is far more impressive than quoting a memorized number.


Senior-Level Talking Points That Change How You're Evaluated

Most candidates can name the components of a streaming system. Fewer can explain how those components fail and how they recover. The three talking points below are reliable signals of senior-level thinking. Work at least two of them naturally into your interview response.

Sharding Strategy

When you say "we'll shard the database," interviewers expect you to answer the follow-up: "On what key?" For streaming systems:

  • Video metadata: Shard on video_id (UUID). This distributes writes evenly and makes per-video reads a direct key lookup. Avoid sharding on creator_id — popular creators would create hot shards.
  • Watch history: Shard on user_id. Each user's history lives on one shard, making user-specific queries efficient.
  • View counts: Do NOT shard naively — use a counter aggregation pattern. Accept writes into Redis (with atomic increments), batch-flush to a database every 30 seconds. This prevents the database from becoming a write bottleneck during a viral video event.

## Redis-based view count aggregation pattern
## Prevents database write bottlenecks during viral events

import redis
import time

r = redis.Redis(host='redis-primary', port=6379, decode_responses=True)

def increment_view_count(video_id: str, viewer_id: str) -> int | None:
    """
    Increments view count in Redis.
    A background job flushes Redis counts to the DB every 30s.
    Deduplication window: one view per viewer per 30-min window.
    """
    dedup_key = f"viewed:{video_id}:{viewer_id}:{int(time.time()) // 1800}"
    
    # Only count if viewer hasn't watched in the last 30-min window
    if r.set(dedup_key, 1, ex=1800, nx=True):  # nx=True means "only set if not exists"
        count_key = f"views:pending:{video_id}"
        new_count = r.incr(count_key)           # atomic increment, O(1)
        return new_count
    return None  # duplicate view, not counted


def flush_pending_counts_to_db(db_connection) -> None:
    """
    Background job: runs every 30 seconds.
    Scans for pending view count keys and flushes to the database.
    """
    cursor = 0
    while True:
        cursor, keys = r.scan(cursor, match='views:pending:*', count=100)
        for key in keys:
            video_id = key.split(':')[2]
            count = r.getdel(key)              # atomic get-and-delete
            if count:
                # Upsert into DB: increment existing count
                db_connection.execute(
                    "UPDATE videos SET view_count = view_count + %s WHERE id = %s",
                    (int(count), video_id)
                )
        if cursor == 0:
            break

This pattern — buffer in Redis, flush to DB asynchronously — is used across systems at scale to handle sudden traffic spikes without causing database contention. Describing it in an interview shows you understand the gap between textbook design and production reality.

Failover Design

A complete design includes what happens when things go wrong. For streaming:

  • 🔧 CDN node failure: The client retries with a backup CDN endpoint. Video players should have a fallback URL list baked into the manifest file (HLS supports this natively through redundant variant streams: duplicate #EXT-X-STREAM-INF entries that point to backup URLs).
  • 🔧 Transcoding worker crash: Jobs are assigned with a visibility timeout in the message queue (SQS, Kafka). If a worker dies mid-job, the message becomes visible again after the timeout and another worker picks it up. Idempotency is critical — re-processing a job must not produce duplicate output.
  • 🔧 Regional database failure: Use a read replica in a secondary availability zone promoted to primary via automated failover (RDS Multi-AZ, for example). Target RTO (recovery time objective) under 60 seconds.
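The idempotency requirement for redelivered transcoding jobs can be sketched like this, using an in-memory set as a stand-in for a durable completion record (in production: a DB table or an object-store existence check):

```python
completed_jobs = set()   # stand-in for a durable "job completed" record

def handle_transcode_message(job_id: str, transcode) -> str:
    """Idempotent worker: if the queue redelivers a job (visibility timeout
    expired while another worker was still running), reprocessing is a no-op."""
    if job_id in completed_jobs:
        return "skipped"         # renditions already produced, drop the message
    transcode(job_id)            # writes renditions to a deterministic output path
    completed_jobs.add(job_id)
    return "done"

renditions = []
print(handle_transcode_message("job-42", renditions.append))   # done
print(handle_transcode_message("job-42", renditions.append))   # skipped
print(len(renditions))                                         # 1
```

The deterministic output path matters as much as the completion record: even if two workers race, they overwrite the same object rather than creating duplicates.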

🎯 Key Principle: Every service should be designed with the assumption that its dependencies will fail. Failover is not an afterthought — it is a first-class architectural constraint.

Observability: Metrics, Logs, and Alerts

No system design is complete without a monitoring strategy. Interviewers rarely bring this up, which means volunteering it instantly marks you as someone who has operated systems in production.

📋 Quick Reference Card: Observability Signals for Streaming

📊 Metric ⚡ Alert Threshold 🔧 Remediation
🎬 Time-to-first-frame (p95) > 3 seconds Scale CDN edge capacity; check origin latency
📡 Buffering ratio > 1% of sessions Investigate ABR algorithm; check segment availability
🏗️ Transcoding queue depth > 10,000 jobs Auto-scale worker fleet
💾 CDN cache hit rate < 90% Review TTL configuration; check cache eviction policy
❌ Playback error rate > 0.1% Alert on-call; check segment URL integrity

💡 Pro Tip: When you mention observability in an interview, be specific. Don't say "we'd monitor the system" — say "we'd instrument the video player to emit a playback_start event with the time-to-first-frame duration, and we'd alert if the p95 exceeds 3 seconds across any CDN region."
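To make that p95 alert concrete, here is a minimal nearest-rank percentile check of the kind a monitoring pipeline would evaluate; the 3-second threshold comes from the table above, and the implementation is illustrative:

```python
def p95(samples: list) -> float:
    """Nearest-rank 95th percentile: good enough for a whiteboard discussion.
    Real monitoring systems use streaming sketches (t-digest, HDR histograms)."""
    ordered = sorted(samples)
    idx = max(0, int(0.95 * len(ordered)) - 1)
    return ordered[idx]

# 90 fast sessions, 10 slow ones: the tail crosses the 3s SLA
# even though the average session is well under it
ttff_seconds = [0.8] * 90 + [4.0] * 10
should_alert = p95(ttff_seconds) > 3.0
print(should_alert)   # True
```

This example also illustrates why the alert is on p95 rather than the mean: the average here is about 1.1 seconds, which would never fire, while 10% of viewers are having a bad experience.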


What You Now Understand That You Didn't Before

Before this lesson, "design YouTube" probably felt like an impossible open-ended question with no clear structure. You might have known that CDNs exist, or that video needs to be encoded, but the connections between components were fuzzy, the trade-off reasoning was absent, and the senior-level vocabulary was missing.

Here is what has changed:

  • 🧠 You can decompose any streaming system into five pillars and use that decomposition as a live roadmap in an interview.
  • 📚 You understand why each component exists — not just what it does, but what problem it solves and what breaks if you remove it.
  • 🔧 You can justify database choices using a repeatable three-question framework instead of guessing or defaulting to a single technology.
  • 🎯 You can estimate storage and bandwidth within five minutes using formulas grounded in real assumptions.
  • 🔒 You know the senior-level talking points — sharding strategy, failover design, and observability — and you can introduce them naturally rather than waiting to be prompted.

⚠️ Critical Point to Remember: The difference between a good answer and a great answer in a system design interview is almost never about knowing more components. It is about demonstrating judgment — knowing which components to prioritize, which trade-offs to acknowledge, and which risks to design around. The candidate who says "here are 15 services I would use" and the candidate who says "here are 5 services, here is why I chose each one, here is how each one fails, and here is how we'd know if something was wrong" are evaluated completely differently.

⚠️ Critical Point to Remember: Always make your assumptions explicit. When you say "I'll assume 500 million daily active users and a 20:1 read/write ratio," you are not guessing — you are demonstrating that you know which variables drive the design. Interviewers expect and respect stated assumptions.


Practical Next Steps

This lesson gives you the mental model. Here is how to turn that into interview-ready fluency:

  1. Draw the architecture from memory. Close this document and sketch the five-pillar diagram on paper — upload pipeline, transcoding queue, object storage, CDN edge network, metadata services. Time yourself. If it takes more than 8 minutes, practice again. In a real interview, you need to sketch this while talking, which means the layout must be automatic.

  2. Practice the estimation out loud. Pick a number of daily active users (start with 100 million, then scale to 1 billion), and walk through the storage and bandwidth calculation speaking every step aloud. This feels awkward at first. It is essential. Estimation in interviews is a verbal exercise, not a written one.

  3. Read the Netflix Tech Blog and the YouTube Engineering Blog. Real post-mortems and architectural decisions from these teams will give you concrete examples to cite. When an interviewer asks "how would you handle a CDN node failure?" and you can say "Netflix Open Connect actually handles this by..." — that is the moment you separate yourself from every other candidate. Search specifically for posts on adaptive bitrate streaming, transcoding pipelines, and CDN edge architecture.

🤔 Did you know? Netflix serves approximately 15% of all global internet downstream traffic at peak hours. Understanding how they achieve this — through a combination of custom CDN hardware, pre-positioned content, and adaptive bitrate algorithms — is a genuine competitive advantage in any streaming design interview.

The architecture of streaming systems is not magic. It is a set of well-understood problems — bandwidth, latency, fault tolerance, scale — solved with deliberate engineering decisions. You now have the map. Go build the confidence to navigate it.