
Design URL Shortener & Rate Limiter

Tackle foundational interview questions covering hashing, storage, and traffic throttling strategies.

Why These Systems Are Interview Staples

You've probably used a shortened URL today without thinking twice about it. Maybe you clicked a bit.ly link in a tweet, or scanned a QR code that quietly redirected you through some invisible intermediary before landing you on the right page. Behind that effortless experience is a deceptively simple engineering challenge — one that has become a cornerstone of technical interviews at companies like Google, Meta, Amazon, and Stripe. Ask yourself: what does it actually take to shorten a URL reliably for a billion users? And what stops those billion users from hammering your system into oblivion? That second question leads us directly to rate limiters — the unsung guardians of every web-scale service you've ever used.

These two systems — URL shorteners and rate limiters — show up in system design interviews more than almost any other problem. This is not a coincidence. Interviewers are deliberate about their choices, and these two problems earn their place on the whiteboard for very specific reasons. Understanding why they appear so frequently is itself a form of interview preparation, because it tells you exactly what skills are being evaluated and where you should focus your energy.

The Power of a Bounded Problem

One of the most underappreciated qualities of a great interview problem is that it must be bounded in scope while remaining unbounded in depth. A question like "design the internet" is too open-ended to be useful — there's no natural stopping point, no way to assess whether a candidate has genuinely solved anything. On the other end of the spectrum, "write a function to reverse a string" is too narrow — it tests only one skill and resolves in minutes.

URL shorteners and rate limiters live in a sweet spot. Each system can be sketched at a high level in fifteen minutes, yet each hides layers of genuine complexity that can fill an entire hour of nuanced conversation. A URL shortener is, on the surface, just a mapping from a long string to a short one. But underneath that simplicity lurks a tangle of questions:

🧠 How do you generate short codes that are unique at scale without becoming a coordination bottleneck?

📚 How do you store hundreds of billions of URL mappings cost-effectively?

🔧 How do you serve redirects with sub-millisecond latency to users on the other side of the world?

🎯 What happens when the same long URL is submitted by a million different users — do you deduplicate?

🔒 How do you handle expiration, analytics, and abuse prevention without sacrificing performance?

Rate limiters open an equally rich can of worms. The concept seems obvious — just count requests and block them when they exceed a threshold. But the implementation questions cascade quickly: which algorithm do you use? Where does the state live? How do you enforce limits consistently across dozens of distributed servers? What happens when your rate limit store goes down?

💡 Pro Tip: When an interviewer picks a "simple" problem, they're rarely interested in the simple version of it. They're handing you an on-ramp to a highway of deeper questions. The best candidates recognize this and start driving.

What Interviewers Are Actually Measuring

Let's be direct about what's happening on the other side of the table. When an interviewer asks you to design a URL shortener, they're running a mental checklist. They want to see how you navigate trade-offs — not whether you know the "right" answer, because for most system design questions, there isn't one.

The canonical tension in distributed systems is captured by the CAP theorem: a distributed system can guarantee at most two of Consistency, Availability, and Partition tolerance simultaneously. URL shorteners and rate limiters force you to confront this tension in concrete, practical ways.

Consider a rate limiter deployed across ten application servers. If each server tracks request counts locally, you get blazing-fast performance but lose consistency — a user could send ten requests to each of the ten servers and bypass a limit of ten requests total. If you centralize the count in a shared store like Redis, you get consistency but introduce a network hop on every single request and a potential single point of failure. Neither answer is wrong. Both involve trade-offs. What the interviewer wants to see is whether you understand the trade-off and can articulate why you'd choose one approach given a specific set of requirements.

# Simplified illustration of the local-vs-centralized rate limiter trade-off

# Option 1: Local in-memory counter (fast, but inconsistent across servers)
class LocalRateLimiter:
    def __init__(self, max_requests: int, window_seconds: int):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.counts = {}  # user_id -> (count, window_start_time)

    def is_allowed(self, user_id: str, current_time: float) -> bool:
        if user_id not in self.counts:
            self.counts[user_id] = (1, current_time)
            return True
        
        count, window_start = self.counts[user_id]
        
        # Reset window if expired
        if current_time - window_start > self.window_seconds:
            self.counts[user_id] = (1, current_time)
            return True
        
        # Check against limit
        if count < self.max_requests:
            self.counts[user_id] = (count + 1, window_start)
            return True
        
        return False  # Rate limit exceeded

# ⚠️ Problem: With 10 servers using this approach, a user could
# make max_requests * 10 calls before being blocked globally.
# This is fast but NOT globally consistent.

This snippet isn't meant to be production-ready — it's meant to make a point viscerally. The moment you write this code in an interview context and explain its limitation, you've demonstrated exactly the kind of trade-off thinking that separates strong candidates from weak ones.

🎯 Key Principle: In system design interviews, articulating trade-offs clearly is more valuable than proposing the "perfect" solution. There is no perfect solution — only solutions that are better or worse for a given set of constraints.

Scalability Thinking: From One Server to One Million Requests Per Second

Another reason these problems are interview favorites is that they naturally stress-test your ability to think across multiple orders of magnitude. A URL shortener that handles 100 requests per day is trivially easy to build — a SQLite database and a Flask app would do it. A URL shortener that handles 100,000 requests per second is an entirely different beast, and the distance between those two points is where most of the interesting engineering happens.

Interviewers use this scalability dimension deliberately. They'll often start with a simple version of the problem and then add constraints: "Now imagine this needs to handle 10 million daily active users." Or: "Your service needs to be available in Europe, Asia, and South America with under 100ms latency." These follow-up constraints reveal whether a candidate is designing a real system or just sketching an architecture diagram they memorized from a blog post.

# A naive URL shortener — perfectly fine for 100 users, catastrophic at scale
import hashlib
import sqlite3

class NaiveURLShortener:
    def __init__(self, db_path: str = "urls.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS urls "
            "(short_code TEXT PRIMARY KEY, long_url TEXT)"
        )
    
    def shorten(self, long_url: str) -> str:
        # MD5 hash truncated to 6 characters — collision-prone at scale!
        short_code = hashlib.md5(long_url.encode()).hexdigest()[:6]
        
        try:
            self.conn.execute(
                "INSERT INTO urls VALUES (?, ?)", (short_code, long_url)
            )
            self.conn.commit()
        except sqlite3.IntegrityError:
            pass  # Code already exists — but is it the SAME long_url? 🤔
        
        return f"https://short.ly/{short_code}"
    
    def resolve(self, short_code: str) -> str | None:
        cursor = self.conn.execute(
            "SELECT long_url FROM urls WHERE short_code = ?", (short_code,)
        )
        row = cursor.fetchone()
        return row[0] if row else None

# ⚠️ Problems at scale:
# 1. Single SQLite file — cannot scale horizontally
# 2. MD5 truncated to 6 hex chars gives only ~16.7M possible codes —
#    a 50% collision chance after just ~5,000 entries (birthday problem)
# 3. No caching — every resolve hits the database
# 4. No geographic distribution — latency degrades for distant users

When you look at this code, you should feel a small alarm going off. It's not that this implementation is wrong — for a weekend project with a handful of users, it's perfectly reasonable. The alarm is your scalability instinct activating, and cultivating that instinct is exactly what this lesson is designed to do.

🔧 The questions this code raises are the same questions a senior engineer at Bitly or TinyURL had to answer in real life:

  • How do we generate codes without collisions at 10 billion entries?
  • How do we serve reads without touching the database on every request?
  • How do we distribute writes across multiple database nodes?
  • How do we handle a single URL that suddenly goes viral and receives 500,000 redirects per minute?

🤔 Did you know? At its peak, Bitly was processing over 600 million clicks per month — roughly 230 redirects every second, 24 hours a day. The engineering decisions baked into their system had to hold up under that load without flinching.

Building Intuition That Transfers

Here's what makes URL shorteners and rate limiters especially valuable as learning vehicles: the design decisions you make for these systems appear in virtually every distributed system you'll ever work on.

The read-heavy caching strategy you develop for URL resolution — where you keep hot short codes in memory and only fall back to the database on a cache miss — is the same strategy used by content delivery networks, DNS resolvers, and social media feeds. The token bucket algorithm you implement for rate limiting is the same mechanism that cloud providers use to throttle API calls and that payment processors use to prevent fraud. The consistent hashing technique that lets you distribute URL records across database shards without catastrophic rebalancing is the backbone of distributed caches like Memcached and data stores like Cassandra.
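
To make the token bucket concrete, here is a minimal single-process sketch (an illustrative simplification: a distributed deployment would keep the bucket state in Redis and refill atomically):

import time

class TokenBucket:
    """Simplified token bucket: `capacity` is the burst size,
    `refill_rate` is the sustained allowance in tokens per second."""
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity            # start full: allows an initial burst
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1              # spend one token for this request
            return True
        return False

bucket = TokenBucket(capacity=5, refill_rate=1.0)  # burst of 5, 1 req/sec sustained
results = [bucket.allow() for _ in range(7)]
print(results)  # first 5 allowed immediately, then throttled until refill

Unlike the fixed-window counter, the token bucket permits short bursts while still enforcing a long-run average rate, which is why API gateways favor it.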

💡 Mental Model: Think of URL shorteners and rate limiters as flight simulators for distributed systems thinking. They're safe, controlled environments where you can practice the maneuvers you'll need in the wild — horizontal scaling, fault tolerance, consistency trade-offs — without the stakes of a production outage.

This transfer of intuition is not hypothetical. Engineers who deeply understand these two systems consistently perform better when asked to design recommendation engines, notification services, payment APIs, and real-time analytics pipelines. The patterns repeat. The vocabulary carries over. The instincts sharpen.

📋 Quick Reference Card: What These Systems Test

| 🎯 Skill Area | 🔧 How URL Shortener Tests It | 🔒 How Rate Limiter Tests It |
| --- | --- | --- |
| 📚 Data Modeling | Mapping scheme design, key generation strategies | Counter storage, time window representation |
| 🧠 Scalability | Read/write ratio optimization, sharding strategies | Distributed state, coordination overhead |
| 🔧 Caching | Cache-aside patterns for redirect hot paths | In-memory counters vs. external stores |
| 🎯 Trade-off Analysis | Collision probability vs. code length | Precision vs. performance in counting |
| 🔒 Fault Tolerance | Database replication, failover for redirects | Graceful degradation when limiter store fails |

A Preview of the Journey Ahead

This lesson is your foundation. It's where we establish the shared vocabulary and mental models that every subsequent deep-dive lesson will build upon. Here's what you can expect as the lesson unfolds:

Foundational System Design Concepts at Play — Before we can design anything, we need a common language. We'll cover the building blocks that both systems rely on: databases, caches, load balancers, message queues, and the consistency models that govern how they interact. If you've heard terms like eventual consistency, idempotency, or horizontal scaling and nodded along without full confidence, that section will give you solid ground to stand on.

Estimating Scale: Back-of-the-Envelope Calculations — One of the most frequently tested and most frequently fumbled interview skills is capacity estimation. How many servers do you need? How much storage? What's the read-to-write ratio? We'll work through these calculations step by step for both URL shorteners and rate limiters, giving you a repeatable framework you can apply to any system design problem.

High-Level Architecture Patterns for Web-Scale Services — Here we zoom out and look at the blueprints. What does the full system diagram look like? How do the components connect? We'll explore the architectural patterns — like read replicas, CDN integration, and sidecar proxies — that make these systems work at global scale.

Common Pitfalls When Approaching These Design Problems — This is where we get honest about the mistakes candidates make. Not surface-level mistakes like "forgetting to mention caching," but the deeper conceptual errors: designing for the wrong bottleneck, over-engineering the wrong component, or failing to clarify requirements before diving into solutions.

🧠 Mnemonic: FEAPC — Fundamentals, Estimation, Architecture, Pitfalls, Consolidation. That's the shape of this lesson. Each layer builds on the one before it.

⚠️ Common Mistake:

Mistake 1: Jumping straight into architecture diagrams without establishing requirements. Candidates often start drawing boxes before asking critical questions like "How many users do we expect?" or "Does the short code need to be human-readable?" The best system designers ask before they answer. ⚠️

❌ Wrong thinking: "I know the URL shortener problem — let me jump to the solution I memorized."

✅ Correct thinking: "Let me understand the specific constraints and requirements first, because those constraints will drive my design decisions in ways a generic solution won't address."

Why This Moment Matters

If you're preparing for system design interviews, the instinct might be to jump straight to the specific designs — to memorize the URL shortener solution and the rate limiter solution and call it done. Resist that instinct. The engineers who perform best in these interviews aren't the ones who've memorized the most solutions. They're the ones who've internalized the thinking process deeply enough to reason about systems they've never seen before.

URL shorteners and rate limiters are your gymnasium. They're where you build the mental muscle that will carry you through every system design question you'll face — at the interview table and throughout your engineering career. The hours you invest in truly understanding these two systems will pay dividends far beyond anything you'd get from a surface-level survey of a dozen different designs.

💡 Remember: Every production system you'll ever build is, at its core, a collection of the same patterns showing up in different costumes. Learn the patterns with URL shorteners and rate limiters, and you'll recognize them everywhere.

Let's build that foundation together — starting with the concepts that make all of this possible.

Foundational System Design Concepts at Play

Before diving into the specifics of URL shorteners and rate limiters, you need a shared vocabulary — a set of building blocks that appear again and again in system design interviews. These aren't abstract theoretical concepts. They're the practical levers engineers pull every day to make systems faster, more reliable, and easier to scale. Understanding them deeply means you can reason through any design problem, not just the two we focus on here.

Think of this section as building your mental toolkit. Each concept is a tool. The more fluent you become with each tool, the faster and more confidently you can sketch out a design under interview pressure.


Hashing and Encoding: The Art of Generating Short, Unique Identifiers

Hashing is the process of mapping an input of arbitrary size to a fixed-size output using a deterministic function. When you type https://www.example.com/very/long/path?with=query&params=everywhere into a URL shortener, something needs to transform that unwieldy string into something like bit.ly/x7kPq2. That transformation relies on hashing and encoding strategies.

There are two primary approaches:

1. Hash-then-encode: Run the long URL through a cryptographic or non-cryptographic hash function (like MD5 or SHA-256), then encode a portion of the resulting bytes into a URL-safe character set. Base62 — which uses [a-z, A-Z, 0-9] — is the most common encoding for this purpose.

2. Auto-increment with encoding: Maintain a global counter in your database. Each new URL gets the next integer ID, which is then Base62-encoded into the short code.
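
A minimal sketch of option 2, with an in-process counter standing in for a database sequence (a real system would use an auto-increment column or a dedicated ticket server):

import itertools

CHARS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def to_base62(number: int) -> str:
    """Convert a non-negative integer to a Base62 string."""
    if number == 0:
        return CHARS[0]
    out = []
    while number > 0:
        out.append(CHARS[number % 62])
        number //= 62
    return ''.join(reversed(out))

# Stand-in for a database auto-increment sequence (illustrative assumption)
next_id = itertools.count(start=1_000_000)  # start high so codes aren't one char

def shorten(long_url: str) -> str:
    url_id = next(next_id)     # globally unique by construction
    return to_base62(url_id)   # no collisions possible, ever

The trade-off: collisions are impossible, but the counter is a coordination point at scale, and sequential codes are guessable, which leaks creation volume and invites enumeration.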

Let's look at a concrete Base62 encoding implementation:

import hashlib

CHARACTERS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
BASE = len(CHARACTERS)  # 62

def encode_base62(number: int) -> str:
    """Convert a non-negative integer to a Base62 string."""
    if number == 0:
        return CHARACTERS[0]
    result = []
    while number > 0:
        result.append(CHARACTERS[number % BASE])
        number //= BASE
    return ''.join(reversed(result))

def hash_url(long_url: str, length: int = 7) -> str:
    """Hash a URL with MD5 and take the first `length` Base62 characters."""
    digest = hashlib.md5(long_url.encode()).hexdigest()  # 32 hex chars
    # Convert the first 12 hex chars (48 bits) to an integer, then Base62-encode.
    # 48 bits yield up to 9 Base62 characters, so truncating to `length` = 7
    # always has enough material to work with.
    number = int(digest[:12], 16)
    return encode_base62(number)[:length]

# Example usage
long_url = "https://www.example.com/very/long/path?with=query&params=everywhere"
short_code = hash_url(long_url)
print(f"Short code: {short_code}")  # e.g., "dK3mPqR"

This snippet shows the hash-then-encode approach. Notice the [:length] slice — you're deliberately truncating the output to keep the short code compact. This introduces a critical risk:

⚠️ Common Mistake: Truncating a hash output dramatically increases collision probability — two different long URLs producing the same short code. A 7-character Base62 string gives you 62^7 ≈ 3.5 trillion combinations, which sounds enormous, but the birthday paradox means collisions become likely well before you exhaust that space. Always include a collision-detection strategy: check if the code exists, and if so, append a salt or increment a suffix.
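
To put numbers behind that warning, the standard birthday approximation P(collision) ≈ 1 − e^(−n²/2N) estimates how quickly collisions arrive after n inserts into a space of N possible codes:

import math

def collision_probability(n: int, space: int) -> float:
    """Birthday approximation: P(at least one collision) after n inserts."""
    return 1 - math.exp(-n * n / (2 * space))

N = 62 ** 7  # ~3.5 trillion possible 7-character Base62 codes

for n in (100_000, 1_000_000, 2_200_000, 10_000_000):
    print(f"{n:>10,} URLs -> {collision_probability(n, N):.1%} collision chance")

# The 50% mark arrives near 1.177 * sqrt(N) ≈ 2.2 million URLs --
# a tiny fraction of the 3.5 trillion available codes.

Two million URLs is nothing for a popular service, which is why collision handling cannot be an afterthought.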

Uniqueness guarantees matter because a collision in a URL shortener is catastrophic — a user could be redirected to the wrong website entirely. In a rate limiter, the analogous concept is key generation for tracking request counts per user or IP, where key collisions lead to incorrect throttling decisions.

🎯 Key Principle: For any system where identifiers must be globally unique, build collision detection into your design from day one — not as an afterthought.
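
One hedged sketch of that principle: hash, check, and retry with a salt until the code is free. A plain dict stands in for the database table here; the names are illustrative, not a real API:

import hashlib

# In-memory stand-in for the short_code -> long_url table
url_table: dict[str, str] = {}

def base62_from_hex(hex_str: str, length: int) -> str:
    """Encode a hex string's integer value as exactly `length` Base62 chars."""
    chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
    number = int(hex_str, 16)
    out = []
    while number and len(out) < length:
        out.append(chars[number % 62])
        number //= 62
    return ''.join(out).rjust(length, chars[0])

def shorten_with_retry(long_url: str, length: int = 7, max_attempts: int = 5) -> str:
    for attempt in range(max_attempts):
        # Salt the input on each retry so a collision produces a fresh code
        salted = f"{long_url}#{attempt}" if attempt else long_url
        digest = hashlib.md5(salted.encode()).hexdigest()
        code = base62_from_hex(digest[:12], length)
        existing = url_table.get(code)
        if existing is None:
            url_table[code] = long_url   # claim the code
            return code
        if existing == long_url:
            return code                  # same URL already shortened: reuse
        # else: genuine collision -- loop and retry with a new salt
    raise RuntimeError("Could not find a free short code")

Note that the check-then-insert above is not atomic; a real system would claim the code with a conditional write (a SQL unique constraint or a DynamoDB conditional put) so two concurrent requests cannot both win the same code.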


Storage Layer Decisions: Picking the Right Database for the Job

Not all databases are created equal, and choosing the wrong one is one of the most common architectural mistakes in interviews. The decision framework comes down to your system's read/write patterns, consistency requirements, and the shape of your data.

SQL vs. NoSQL vs. In-Memory

Relational databases (SQL) like PostgreSQL or MySQL give you ACID guarantees, flexible querying with joins, and strong consistency. They're excellent when your data has clear relationships — for example, a URL shortener that associates short codes with user accounts, expiration dates, and click analytics.

NoSQL databases like Cassandra, DynamoDB, or MongoDB sacrifice some consistency guarantees in exchange for horizontal scalability and flexible schemas. A URL shortener at massive scale (think billions of entries) often reaches for DynamoDB or Cassandra because the primary access pattern is a simple key-value lookup: short_code → long_url. No joins needed.

In-memory stores like Redis or Memcached are not primary storage — they're acceleration layers. They keep hot data in RAM, delivering sub-millisecond reads. A rate limiter almost always relies heavily on Redis because tracking request counts requires extremely fast, atomic increments.

Here's a mental model to anchor the decision:

┌─────────────────────────────────────────────────────────┐
│              Storage Layer Decision Tree                 │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Do you need complex queries / joins?                   │
│       YES ──────────────────────► SQL (Postgres, MySQL) │
│       NO                                                │
│        │                                                │
│  Is access pattern key-value at massive scale?          │
│       YES ──────────────────────► NoSQL (DynamoDB,      │
│       NO                          Cassandra)            │
│        │                                                │
│  Do you need sub-ms latency on hot data?                │
│       YES ──────────────────────► In-Memory (Redis)     │
│                                   [NOT primary storage] │
└─────────────────────────────────────────────────────────┘

💡 Real-World Example: Twitter's URL shortener (t.co) uses a combination approach: a NoSQL store for the raw short_code → URL mapping (billions of entries, simple lookups) and a separate analytics pipeline for tracking click counts and geographic data. Mixing storage technologies for different access patterns is entirely normal at scale.

🤔 Did you know? Redis stores all data in RAM by default but also supports persistence via snapshotting (RDB) and append-only files (AOF). With persistence enabled, a server restart need not wipe out your rate limit state — though depending on the configured fsync policy, the most recent writes can still be lost.

For a rate limiter, the read/write pattern is particularly demanding: every incoming request triggers both a read (what's the current count?) and a write (increment the count). This must happen atomically — meaning no two concurrent requests can both read the same stale value and both decide they're under the limit. Redis's INCR command is atomic by design, making it the go-to choice.

import redis
import time

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

def is_rate_limited(user_id: str, limit: int = 100, window_seconds: int = 60) -> bool:
    """
    Fixed-window rate limiter using Redis INCR.
    Returns True if the user has exceeded the rate limit.
    """
    # Key is unique per user per time window (e.g., per minute)
    current_window = int(time.time() // window_seconds)
    key = f"rate_limit:{user_id}:{current_window}"

    # INCR atomically increments and returns the new value
    current_count = r.incr(key)

    # Set expiry only on the first request in this window
    if current_count == 1:
        r.expire(key, window_seconds * 2)  # TTL with buffer

    return current_count > limit

# Usage
if is_rate_limited("user_42", limit=100, window_seconds=60):
    print("429 Too Many Requests")
else:
    print("200 OK - process request")

This snippet demonstrates the atomic increment pattern. The INCR + EXPIRE combo is a classic Redis idiom for rate limiting. Notice the window_seconds * 2 buffer on the TTL — this prevents the key from expiring exactly when the window flips, which could cause brief under-counting.


Horizontal vs. Vertical Scaling: Why Stateless Services Win

Vertical scaling means giving your existing server more resources — more CPU cores, more RAM, faster disks. It's simple but has a hard ceiling: there's only so much hardware you can cram into one machine, and the cost grows non-linearly as you approach the limits.

Horizontal scaling means adding more servers and distributing load across them. It's theoretically unbounded, but it introduces a new constraint: your service must be stateless for horizontal scaling to work cleanly.

A stateless service is one where any instance can handle any request without needing to know about previous requests. No session data, no in-memory counters, no local caches that differ between instances. All shared state lives in an external system — a database, a cache cluster, a message queue.

❌ Stateful Architecture (hard to scale horizontally)

  Request ──► Server A (has local counter: 47 requests)
  Request ──► Server B (has local counter: 12 requests)
  Request ──► Server C (has local counter: 83 requests)
  
  ☠️ Problem: No server knows the true total. Rate limit is
     enforced per-server, not per-user across the system.

✅ Stateless Architecture (scales horizontally)

  Request ──► Server A ──┐
  Request ──► Server B ──┼──► Redis Cluster (true count: 142)
  Request ──► Server C ──┘
  
  ✓ Any server reads/writes the same Redis counter.
    Rate limit is enforced globally, correctly.

This distinction is crucial for both of our target systems. A URL shortener's redirect service is naturally stateless — each instance just looks up a short code in the database and redirects. A rate limiter is stateless at the application layer precisely because all mutable state (the counters) lives in Redis.

💡 Mental Model: Think of stateless services like vending machines. Any vending machine dispenses the same product regardless of what the previous customer bought. The inventory database (an external store) tracks what's been sold. The machine itself has no memory.

🎯 Key Principle: Design your application tier to be stateless. Push all shared, mutable state to dedicated external services (databases, caches). This single design decision is what makes horizontal scaling tractable.


Caching Layers: Slashing Latency and Protecting Your Database

A cache is a high-speed storage layer that sits in front of slower, canonical storage. When a request arrives, the system checks the cache first. If the data is there (a cache hit), it returns immediately without touching the database. If not (a cache miss), it fetches from the database and stores the result in the cache for future requests.

The performance implications are staggering:

📋 Quick Reference Card: Typical Latency by Storage Tier

| 🔧 Storage Type | ⏱️ Typical Latency | 📊 Use Case |
| --- | --- | --- |
| 🧠 CPU L1/L2 cache | ~1 nanosecond | Hardware-level |
| 💾 RAM access | ~100 nanoseconds | In-process hot data |
| 💾 Redis (network round trip) | ~100 microseconds | Hot data caching |
| 💿 SSD database read | ~1 millisecond | Warm data |
| 🌐 Cross-network DB call | ~10 milliseconds | Standard queries |
| 🐢 HDD / cold storage | ~10+ milliseconds | Archival |

For a URL shortener, the cache hit rate can be extremely high because URL access follows a power law distribution — a small number of popular URLs account for the vast majority of traffic. Caching the top 10-20% of URLs by access frequency might yield a 90%+ cache hit rate, meaning 90% of redirects never touch the database at all.
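
The cache-aside read path behind those numbers can be sketched with in-memory dicts standing in for Redis and the database (illustrative stand-ins only):

# Stand-ins for real infrastructure (illustrative only)
redis_cache: dict[str, str] = {}        # hot short_code -> long_url entries
database: dict[str, str] = {"x7kPq2": "https://www.example.com/long/path"}

cache_hits = 0
cache_misses = 0

def resolve(short_code: str) -> str | None:
    """Cache-aside lookup: try the cache first, fall back to the database."""
    global cache_hits, cache_misses
    cached = redis_cache.get(short_code)
    if cached is not None:
        cache_hits += 1
        return cached                    # cache HIT: database never touched
    cache_misses += 1
    long_url = database.get(short_code)  # cache MISS: hit the canonical store
    if long_url is not None:
        redis_cache[short_code] = long_url  # populate for future requests
    return long_url

resolve("x7kPq2")   # miss: fetched from the database, now cached
resolve("x7kPq2")   # hit: served straight from the cache

With a power-law access pattern, most requests repeat recent short codes, so the second branch above dominates and the database sees only a trickle of traffic.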

For a rate limiter, Redis itself is the primary data store for counters — it's not a cache in front of something slower. This is an important distinction: Redis can play two different roles depending on the system.

Cache eviction policies matter when your cache fills up. The two most common:

  • LRU (Least Recently Used): Evict the item not accessed for the longest time. Good for URL shorteners.
  • LFU (Least Frequently Used): Evict the item accessed least often overall. Better when you want to protect consistently popular items.
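
A toy LRU cache makes the eviction mechanics visible (Python's collections.OrderedDict handles the bookkeeping; real caches like Redis use an approximated LRU for efficiency):

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data: OrderedDict[str, str] = OrderedDict()

    def get(self, key: str) -> str | None:
        if key not in self.data:
            return None
        self.data.move_to_end(key)        # mark as most recently used
        return self.data[key]

    def put(self, key: str, value: str) -> None:
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

cache = LRUCache(capacity=2)
cache.put("abc123", "https://example.com/1")
cache.put("def456", "https://example.com/2")
cache.get("abc123")                        # touch: abc123 is now "recent"
cache.put("ghi789", "https://example.com/3")  # evicts def456, not abc123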

⚠️ Common Mistake: Treating your cache as a guaranteed store. Caches can evict entries at any time. If your URL shortener's redirect logic requires the short code to resolve, you must always fall back to the database on a cache miss — never assume the cache is the source of truth.

🧠 Mnemonic: C.A.C.H.E. — Check cache first, Avoid DB when possible, Consider eviction policy, Handle misses gracefully, Expire stale entries with TTLs.


CAP Theorem: The Fundamental Trade-off of Distributed Systems

No discussion of distributed system design is complete without the CAP theorem, formulated by Eric Brewer. It states that a distributed data store can guarantee at most two of the following three properties simultaneously:

  • C — Consistency: Every read receives the most recent write or an error. All nodes see the same data at the same time.
  • A — Availability: Every request receives a response (not necessarily the most recent data). The system stays operational.
  • P — Partition Tolerance: The system continues operating even when network partitions cause some nodes to lose communication with others.

                     Consistency
                          ▲
                         / \
                   CA   /   \   CP
       (single-node    /     \    (HBase,
        SQL)          /       \    ZooKeeper)
                     /         \
                    / [C + A + P \
                   /  impossible] \
                  ▼───────────────▼
       Availability ◄──── AP ────► Partition Tolerance
                (Cassandra, DynamoDB)

In practice, network partitions are unavoidable in any distributed system. A cable gets cut, a router fails, a data center loses connectivity. This means the real choice is between CP (consistency + partition tolerance) and AP (availability + partition tolerance) systems.

How does this apply to our two systems?

URL Shortener — slight inconsistency is acceptable: If a newly created short URL isn't immediately visible on all nodes for a few hundred milliseconds, that's fine. The user just created it; they'll wait a second and try again. An AP system like DynamoDB (with eventual consistency) is perfectly suited. Crucially, this works because mappings are written once and never updated: a replica might briefly not know about a brand-new short code, but no node will ever serve the wrong long URL for an existing one.

Rate Limiter — inconsistency can be exploited: If your rate limiter allows 100 requests per minute and you have 5 nodes with stale local counters, a user could potentially send 100 requests to each node — 500 total — before any node's count reaches the limit. This is a security-critical inconsistency. You need strong coordination, which is why centralized Redis (with replication) or a CP system is preferred for rate limiting.

💡 Pro Tip: Interviewers love when candidates explicitly acknowledge CAP trade-offs. Don't just name-drop the theorem — explain which properties you're trading away, why that's acceptable for the specific use case, and what failure modes that trade-off introduces. That's the mark of senior-level thinking.

Wrong thinking: "I'll use Cassandra because it's NoSQL and fast."

Correct thinking: "I'll use Cassandra (AP system) for the URL mapping store because eventual consistency is acceptable for redirect lookups, and I need the horizontal write scalability. For the rate limiter counters, I'll use Redis with a single primary to maintain strong consistency where it matters for security."

🤔 Did you know? The CAP theorem was later refined by the PACELC theorem, which adds that even without partitions, there's a trade-off between latency and consistency. Systems like DynamoDB let you tune consistency per-read operation, giving you per-request control over this spectrum.


Putting the Building Blocks Together

These five concepts — hashing/encoding, storage selection, horizontal scaling, caching, and CAP trade-offs — don't exist in isolation. They're deeply interconnected. Your choice of storage layer influences your caching strategy. Your consistency requirements (derived from CAP reasoning) influence whether you can scale horizontally without coordination overhead. Your hashing strategy influences your collision rate and therefore your storage access patterns.

Here's how they all connect in a simplified URL shortener read path:

User Browser
     │
     │  GET /x7kPq2
     ▼
┌─────────────┐
│  Load       │  ◄── Stateless app tier (horizontal scaling)
│  Balancer   │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  App Server │  Step 1: Check Redis cache
│  (Stateless)│  ────────────────────────────► Redis Cache
└──────┬──────┘                                   │
       │                                    Cache HIT? ──► 301 Redirect
       │  Cache MISS?                             │
       ▼                                          │
┌─────────────┐                                   │
│  Database   │  ◄── DynamoDB/Cassandra (AP)       │
│  (NoSQL)    │  short_code → long_url lookup      │
└──────┬──────┘                                   │
       │  Found? Store in Redis + 301 Redirect ───┘
       │  Not Found? 404 Not Found
       ▼
   Response

Every box in that diagram represents a concept from this section. The load balancer enables horizontal scaling of the stateless app tier. Redis provides the caching layer. DynamoDB reflects an AP storage choice. The short_code itself was generated using Base62 encoding.
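That Base62 step can be sketched in a few lines, assuming the common scheme of encoding an auto-incrementing numeric ID (the alphabet ordering here is a convention, not a standard):

```python
import string

# 62-character alphabet: digits, then lowercase, then uppercase
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def base62_encode(n: int) -> str:
    """Encode a non-negative integer (e.g., an auto-increment row ID)
    into a compact Base62 short code."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

def base62_decode(code: str) -> int:
    """Invert base62_encode: map a short code back to its numeric ID."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n

# 62^7 ≈ 3.5 trillion codes fit in 7 characters, comfortably more than
# the tens of billions of records a multi-year retention implies.
code = base62_encode(31_500_000_000)
print(code, len(code))                        # a 6-character code
print(base62_decode(code) == 31_500_000_000)  # True
```

Note the capacity math: 62⁶ ≈ 57 billion already covers the 5-year estimate from this lesson, and a 7th character buys a 62× safety margin.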

When you walk into a system design interview, your job is to reason about these layers explicitly, justify each choice, and acknowledge the trade-offs. That fluency — not memorizing specific answers — is what separates strong candidates from exceptional ones.

🎯 Key Principle: System design interviews test your reasoning process, not your ability to recall a fixed architecture. Use these foundational concepts as your reasoning framework, apply them to the constraints given, and always explain the why behind your choices.

Estimating Scale: Back-of-the-Envelope Calculations

Before a single line of architecture gets drawn on the whiteboard, the best system designers do something that separates junior from senior engineers: they pull out a mental calculator and stress-test their assumptions with numbers. Interviewers are not just watching you sketch boxes and arrows — they want to see whether you can reason about scale. Can you tell them roughly how much storage you'll need in five years? Can you estimate how many database reads per second your service will handle at peak? This section teaches you exactly that discipline, using URL shorteners and rate limiters as the training ground.

Back-of-the-envelope calculation is the practice of making rapid, order-of-magnitude estimates using simplified math and reasonable assumptions. The goal is not precision — it's ballpark correctness that informs architectural decisions. Getting within a factor of 2 or 3 is usually sufficient. What matters is the reasoning chain, not the final digit.

🎯 Key Principle: Every number you produce in an estimation should trace back to an explicit assumption. State your assumptions out loud. Interviewers reward this transparency because it shows structured thinking.

The Read-to-Write Ratio: Why It Matters Enormously

The single most important characteristic of a URL shortener from a scaling perspective is that it is overwhelmingly read-heavy. Think about how the system gets used in the real world. A marketing campaign produces one shortened link — that's one write. But that link gets included in emails, tweets, text messages, and web pages. It might be clicked 50,000 times over its lifetime — those are 50,000 reads. A viral link could easily see 1,000,000 reads from a single write.

The conventional teaching assumption in interviews is a read-to-write ratio of at least 100:1, meaning for every URL created, there are at least 100 redirect lookups. In practice, for popular services like Bitly or TinyURL, the real ratio is far higher, but 100:1 is a conservative and defensible starting point.

Why does this ratio matter so profoundly for your architecture? Because it tells you where to invest:

  • 📚 Read path needs to be blazing fast — this is the path that serves users clicking links, so latency here directly impacts user experience. You'll want aggressive caching here.
  • 🔧 Write path can tolerate more latency — creating a short URL is a deliberate action; users expect it to take a moment.
  • 🎯 Read replicas become more valuable than write-optimized storage when reads dominate so heavily.
  • 🧠 Cache hit rates can be extremely high because a small set of popular URLs accounts for most of the traffic and fits comfortably in memory.
Write Path (rare)
  User ──► API Server ──► Primary DB (write)

Read Path (100x more frequent)
  User ──► API Server ──► Cache (Redis) ──► Read Replica (on miss)
                          [~90% cache hit]  [~10% DB reads]

A rate limiter, by contrast, is a read-modify-write workload — every incoming request requires reading the current count, incrementing it, and writing it back. This changes the architecture calculus significantly, which is why we study both systems together.

💡 Real-World Example: Bitly reported processing over 600 million clicks per month at their earlier scale — roughly 230 clicks per second on average, with peaks far higher. If their write rate was even 10 million new URLs per month, that's a 60:1 ratio. Real production systems confirm the read-heavy intuition.

Calculating Storage Requirements

Storage estimation follows a straightforward formula once you decompose what you're actually storing. For a URL shortener, each record contains several fields:

Original URL: The average length of a URL in the wild is roughly 100 characters. Long URLs from e-commerce sites or analytics platforms can exceed 500 characters, but short blog posts or homepages bring the average down. Using 100 bytes per URL is a reasonable assumption (UTF-8 encoding means 1 byte per ASCII character for standard URLs).

Short code: A 6–8 character alphanumeric code. At 7 characters, that's 7 bytes.

Metadata overhead: Creation timestamp (8 bytes), expiration timestamp (8 bytes), user ID (8 bytes), click count (8 bytes), and miscellaneous flags — call it roughly 50–100 bytes of metadata.

Putting it together, each record costs approximately:

Storage per URL record:
  Original URL:  ~100 bytes
  Short code:      7 bytes
  Metadata:       ~80 bytes
  ─────────────────────────
  Total:         ~200 bytes per record (round up to 500 bytes with indexes and overhead)

Now apply a retention policy — the decision about how long to keep URL mappings alive. If your service retains URLs for 5 years and you write 200 new URLs per second (a reasonable assumption for a mid-scale service), the total record count is:

200 writes/sec × 86,400 sec/day × 365 days/year × 5 years
= 200 × 86,400 × 1,825
≈ 31.5 billion records

At 500 bytes each:

31.5 billion × 500 bytes ≈ 15.75 TB of storage

Fifteen terabytes over five years is entirely manageable with modern distributed databases. This calculation immediately tells you that storage is not your bottleneck — throughput and latency are. That insight shapes every subsequent architectural decision.

⚠️ Common Mistake: Forgetting to account for indexes. A database index on the short code (needed for fast lookups) and on the original URL (needed for deduplication) can easily double your storage requirement. Always add a multiplier of 1.5–2× for index overhead.

## url_shortener_estimation.py
## A simple capacity estimation model — run this to make numbers concrete

def estimate_url_shortener_capacity(
    writes_per_second: float = 200,
    read_to_write_ratio: int = 100,
    bytes_per_record: int = 500,
    retention_years: int = 5,
    cache_hit_rate: float = 0.90,
    avg_url_bytes: int = 100
):
    """
    Back-of-the-envelope capacity estimator for a URL shortener service.
    All numbers are approximations — the goal is order-of-magnitude correctness.
    """
    seconds_per_year = 365 * 24 * 3600  # ~31.5 million
    
    # ── Throughput ────────────────────────────────────────────────────────────
    reads_per_second = writes_per_second * read_to_write_ratio
    peak_reads_per_second = reads_per_second * 5  # assume 5× peak multiplier
    
    # ── Storage ───────────────────────────────────────────────────────────────
    total_records = writes_per_second * seconds_per_year * retention_years
    storage_bytes = total_records * bytes_per_record
    storage_tb = storage_bytes / 1e12  # decimal terabytes (10^12 bytes)
    
    # Account for index overhead (1.5× multiplier)
    storage_with_indexes_tb = storage_tb * 1.5
    
    # ── Bandwidth ─────────────────────────────────────────────────────────────
    # Each redirect response: short code lookup → original URL returned
    # Outbound: avg_url_bytes per redirect (the original URL in a 302 response)
    outbound_bandwidth_mbps = (reads_per_second * avg_url_bytes * 8) / 1_000_000
    
    # ── Memory (Cache Sizing) ─────────────────────────────────────────────────
    # Cache only the "hot set" of URLs (Pareto principle):
    # assume the top 20% of URLs receive roughly 80% of all traffic
    hot_url_fraction = 0.20
    cache_entries = total_records * hot_url_fraction
    cache_memory_gb = (cache_entries * bytes_per_record) / 1e9  # decimal GB (10^9 bytes)
    
    # ── Results ───────────────────────────────────────────────────────────────
    print("=" * 55)
    print(" URL Shortener Capacity Estimation")
    print("=" * 55)
    print(f"\n📝 WRITES")
    print(f"   Writes per second:          {writes_per_second:>10,.0f} req/s")
    print(f"\n📖 READS")
    print(f"   Reads per second (avg):     {reads_per_second:>10,.0f} req/s")
    print(f"   Reads per second (peak 5x): {peak_reads_per_second:>10,.0f} req/s")
    print(f"\n💾 STORAGE (over {retention_years} years)")
    print(f"   Total records:              {total_records:>10,.0f}")
    print(f"   Raw storage:                {storage_tb:>10.2f} TB")
    print(f"   Storage + indexes (1.5x):   {storage_with_indexes_tb:>10.2f} TB")
    print(f"\n🌐 BANDWIDTH")
    print(f"   Outbound (avg):             {outbound_bandwidth_mbps:>10.2f} Mbps")
    print(f"\n🧠 CACHE (Redis)")
    print(f"   Hot URL entries (top 20%):  {cache_entries:>10,.0f}")
    print(f"   Memory needed:              {cache_memory_gb:>10.2f} GB")
    print("=" * 55)


if __name__ == "__main__":
    # Run with default assumptions
    estimate_url_shortener_capacity()
    
    print("\n--- High-traffic scenario (e.g., Bitly scale) ---\n")
    estimate_url_shortener_capacity(
        writes_per_second=500,
        read_to_write_ratio=200,
        retention_years=10
    )

Run this script and you'll get concrete numbers printed to your terminal. The act of translating assumptions into code forces you to be explicit about every variable — exactly the discipline interviews reward.

💡 Pro Tip: In an actual interview, you won't write Python — but mentally running through these same steps in a structured order (throughput → storage → bandwidth → memory) demonstrates the same rigor. Practice the mental flow until it becomes automatic.

Estimating Throughput in Requests Per Second

Throughput is the number of requests a system can handle per unit of time, typically expressed in requests per second (RPS). Estimating it correctly is the bridge between abstract scale requirements and concrete infrastructure decisions.

The anchor number to establish first is daily active users (DAU) or monthly active users (MAU). For a hypothetical mid-scale URL shortener, let's assume 10 million DAU. Now ask: what fraction of them create URLs versus just click links?

Assumptions:
  DAU:                      10,000,000 users
  % who create URLs/day:    1% → 100,000 URL creations/day
  % who click URLs/day:     50% → 5,000,000 clicks/day
  Read-to-write ratio:      5,000,000 / 100,000 = 50:1 (conservative)

Translating to RPS:
  Write RPS (avg): 100,000 / 86,400 ≈ 1.2 writes/sec
  Read RPS (avg):  5,000,000 / 86,400 ≈ 58 reads/sec

Peak traffic (assume a 10× peak-to-average multiplier):
  Write RPS (peak): 1.2 × 10 ≈ 12 writes/sec
  Read RPS (peak):  58 × 10 ≈ 580 reads/sec
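The arithmetic above can be reproduced as a quick sanity check (the DAU, percentage, and peak figures are the assumptions stated in this section):

```python
# Throughput from DAU: the figures below are the stated assumptions.
DAU = 10_000_000
CREATE_FRACTION = 0.01    # 1% of users create a URL each day
CLICK_FRACTION = 0.50     # 50% of users click a URL each day
PEAK_MULTIPLIER = 10
SECONDS_PER_DAY = 86_400

writes_per_day = DAU * CREATE_FRACTION   # 100,000 creations/day
reads_per_day = DAU * CLICK_FRACTION     # 5,000,000 clicks/day

write_rps = writes_per_day / SECONDS_PER_DAY
read_rps = reads_per_day / SECONDS_PER_DAY

print(f"Avg:  {write_rps:.1f} writes/s, {read_rps:.0f} reads/s")
print(f"Peak: {write_rps * PEAK_MULTIPLIER:.0f} writes/s, "
      f"{read_rps * PEAK_MULTIPLIER:.0f} reads/s")
print(f"Read-to-write ratio: {reads_per_day / writes_per_day:.0f}:1")
```

Swap in different assumptions (e.g., 5% creators, 2× DAU) and the whole chain of numbers updates — which is exactly the point of making assumptions explicit.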

This is a small service. Scale it up to 100 million DAU and peak read RPS grows tenfold to around 5,800 — now you need a proper caching layer. At 1 billion DAU (Google-scale), peak reads approach 58,000 RPS, which calls for a distributed cache and a large fleet of application servers.

🤔 Did you know? A single modern Redis instance can handle approximately 100,000–200,000 operations per second. At 58,000 peak RPS you are already consuming a large fraction of one instance's capacity — add replication for redundancy and headroom for traffic spikes, and you need a small Redis cluster. One number cascades into your entire caching infrastructure design.

The peak-to-average multiplier is a critical concept. Traffic is never uniformly distributed. Social media links get clicked in bursts immediately after posting. Black Friday sales send e-commerce short links into overdrive. A conservative multiplier of 5–10× average is standard in interviews. Using this multiplier ensures your architecture can handle the real world, not just steady-state averages.

Bandwidth Estimation

Bandwidth measures the volume of data transferred per unit of time, expressed in megabits per second (Mbps) or gigabits per second (Gbps). For a URL shortener, bandwidth estimation is relatively simple because the payloads are tiny — but it's still worth calculating to avoid surprises.

For incoming writes:

Payload: ~500 bytes (short code request + original URL + metadata)
At 12 writes/sec (peak):  12 × 500 bytes = 6,000 bytes/sec ≈ 0.05 Mbps

For outgoing reads (redirect responses):

Payload: ~200 bytes (HTTP 301/302 redirect with Location header)
At 580 reads/sec (peak): 580 × 200 bytes = 116,000 bytes/sec ≈ 0.9 Mbps
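These two calculations can be folded into a tiny helper (the payload sizes and peak RPS are the assumptions above):

```python
def mbps(rps: float, payload_bytes: int) -> float:
    """Convert requests/sec × bytes/request into megabits per second."""
    return rps * payload_bytes * 8 / 1_000_000

# Peak figures and payload sizes are the assumptions from this section
inbound = mbps(12, 500)     # create-URL requests
outbound = mbps(580, 200)   # redirect responses

print(f"Inbound  (writes): {inbound:.2f} Mbps")   # 0.05 Mbps
print(f"Outbound (reads):  {outbound:.2f} Mbps")  # 0.93 Mbps
```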

Bandwidth is not a bottleneck for URL shorteners at moderate scale. Compare this to a video streaming service that might need 10 Gbps of outbound bandwidth — the difference illustrates why system design requires domain-specific estimation.

For rate limiters, the bandwidth calculation is different. A rate limiter itself doesn't serve content — it sits in front of other services and makes pass/block decisions. The throughput concern is latency per decision, not bandwidth. You want each rate-limit check to add less than 1 millisecond of overhead. This design constraint (latency budget, not bandwidth) shapes the architecture toward in-memory solutions like Redis rather than disk-backed databases.

Memory Estimation and the Caching Tier

Memory estimation for a URL shortener centers on sizing the cache. The fundamental insight is the Pareto principle (also known as the 80/20 rule): roughly 20% of URLs account for 80% of traffic. You don't need to cache everything — just the hot set.

Total records (5-year horizon): ~31.5 billion
Top 20% (hot set):              ~6.3 billion entries
Memory per entry:               ~500 bytes
Raw cache memory:               6.3B × 500 = ~3.15 TB

Three terabytes of RAM is expensive. But here's where you apply a second insight: the hot set isn't 20% of all-time records — it's 20% of recently active records. A URL created three years ago and never clicked again doesn't belong in cache. If you scope to the last 30 days of active URLs:

Last-30-day writes: 200 writes/sec × 86,400 × 30 ≈ 518 million URLs
Top 20% active:     ~100 million URLs
Memory needed:      100M × 500 bytes = 50 GB

Fifty gigabytes is entirely cacheable on a small Redis cluster. This is a powerful result: a time-scoped hot set dramatically reduces your cache footprint.

## cache_sizing.py
## Demonstrates how retention window affects cache memory requirements

def cache_memory_gb(
    writes_per_second: float,
    retention_days: int,
    hot_fraction: float = 0.20,
    bytes_per_entry: int = 500
) -> float:
    """
    Estimate cache memory for the 'hot set' of URLs.
    
    Parameters:
    -----------
    writes_per_second : float
        Average URL creation rate
    retention_days : int
        How many days of URLs to keep in the active cache window
    hot_fraction : float
        Fraction of entries in the window that are actually "hot" (default: top 20%)
    bytes_per_entry : int
        Memory cost per cached URL record in bytes
    
    Returns:
    --------
    float: Estimated cache memory in gigabytes
    """
    seconds_in_window = retention_days * 24 * 3600
    total_urls_in_window = writes_per_second * seconds_in_window
    hot_urls = total_urls_in_window * hot_fraction
    memory_bytes = hot_urls * bytes_per_entry
    memory_gb = memory_bytes / 1e9  # decimal gigabytes (10^9 bytes)
    return memory_gb


## Compare different retention windows
windows = [1, 7, 30, 90, 365]
writes_per_sec = 200

print(f"Cache sizing at {writes_per_sec} writes/sec (top 20% of URLs)")
print("-" * 50)
print(f"{'Retention Window':<20} {'Cache Memory':>15}")
print("-" * 50)

for days in windows:
    mem = cache_memory_gb(writes_per_sec, days)
    label = f"{days} day{'s' if days > 1 else ''}"
    # Choose the most readable unit
    if mem < 1:
        display = f"{mem * 1000:.1f} MB"
    else:
        display = f"{mem:.1f} GB"
    print(f"{label:<20} {display:>15}")

This script makes a key insight tangible: the difference between caching 1 day vs. 1 year of URLs is roughly a 365× difference in memory. Choosing a 30-day active window is a deliberate, data-driven architectural decision — not an arbitrary one.

⚠️ Common Mistake: Candidates often estimate memory for the entire dataset rather than the working set. If you tell an interviewer you need 15 TB of RAM, they'll correctly question your judgment. Always scope cache estimates to the active, hot portion of your data.

Putting It All Together: An Estimation Template

The most reliable way to perform back-of-the-envelope calculations in an interview is to follow a consistent template. Here's one that works for virtually any web-scale service:

┌─────────────────────────────────────────────────────────┐
│            ESTIMATION TEMPLATE                          │
├─────────────────────────────────────────────────────────┤
│  STEP 1: Traffic (Throughput)                           │
│    a. Start with DAU or MAU                             │
│    b. Estimate action frequency per user                │
│    c. Compute average RPS                               │
│    d. Apply peak multiplier (5–10×)                     │
│    e. Calculate read-to-write split                     │
├─────────────────────────────────────────────────────────┤
│  STEP 2: Storage                                        │
│    a. Define data schema, estimate bytes per record     │
│    b. Multiply by total record count (write RPS × time) │
│    c. Apply index overhead multiplier (1.5–2×)          │
│    d. Apply replication factor (typically 3×)           │
├─────────────────────────────────────────────────────────┤
│  STEP 3: Bandwidth                                      │
│    a. Inbound: write RPS × payload size                 │
│    b. Outbound: read RPS × response size                │
│    c. Express in Mbps or Gbps                           │
├─────────────────────────────────────────────────────────┤
│  STEP 4: Memory (Cache)                                 │
│    a. Define active/hot window (e.g., 30 days)          │
│    b. Apply hot fraction (20% rule)                     │
│    c. Multiply by bytes per entry                       │
│    d. Size Redis cluster accordingly                    │
└─────────────────────────────────────────────────────────┘

📋 Quick Reference Card: Estimation Rules of Thumb

📊 Metric 🔢 Rule of Thumb
🕐 Seconds per day ~86,400
🕐 Seconds per year ~31.5 million
💾 Bytes in 1 GB 10⁹ bytes
💾 Bytes in 1 TB 10¹² bytes
⚡ Single Redis ops/sec ~100K–200K
🔄 Peak traffic multiplier 5–10× average
📚 Read-to-write ratio (URL shortener) 100:1
🔥 Hot set fraction (Pareto) Top 20% of URLs
🏗️ Storage index overhead 1.5–2× raw data
🔁 Replication factor 3× (standard)

🧠 Mnemonic: Remember T-S-B-M: Throughput first, then Storage, then Bandwidth, then Memory. This order mirrors the natural dependency chain: you need to know how many requests per second before you can derive everything else.

Why This Skill Is Non-Negotiable in Interviews

Interviewers at top technology companies use capacity estimation as a signal filter — it quickly separates candidates who have designed real systems from those who have only read about them. When you instinctively ask "what's our DAU?" before sketching any architecture, you signal that you understand that design choices are driven by numbers, not preferences.

Wrong thinking: "Let me just add a cache in front of the database and we'll be fine."

Correct thinking: "Given 58,000 peak reads per second and a 90% cache hit rate, we need to handle 5,800 database reads per second from our replica pool. At 10ms per query, a single replica can handle roughly 100 QPS under standard conditions — so we'd need approximately 60 read replicas, or we need to revisit our caching strategy to push that hit rate higher."

The second answer demonstrates that you can turn estimation into infrastructure decisions. That is the skill that gets you hired.

💡 Mental Model: Think of back-of-the-envelope estimation the way a structural engineer thinks before designing a bridge. They don't start sketching cable arrangements — they first calculate the expected load, material tolerances, and safety margins. Your load is traffic. Your materials are servers, databases, and caches. Estimation is your structural analysis.

As you move into the architecture section of this lesson, you'll see how every structural decision — whether to shard the database, how many cache nodes to deploy, whether to use consistent hashing — traces directly back to the numbers you establish in this estimation phase. The estimates are not a preamble to the design. They are the foundation of it.

High-Level Architecture Patterns for Web-Scale Services

Before you can design a URL shortener or a rate limiter with confidence, you need a shared blueprint — a mental scaffolding for how web-scale services are actually structured. Interviewers aren't just looking for the right answer; they're watching whether you can navigate a system holistically, moving fluently from the client's initial request all the way down to the database and back. This section builds that scaffold, piece by piece.

The Three-Tier Architecture: Your Default Starting Point

Almost every web-scale service you'll be asked to design in an interview — URL shorteners and rate limiters very much included — can be initially sketched as a three-tier architecture. These three tiers are the API layer, the application/business logic layer, and the storage layer. Think of them as the public face, the brain, and the memory of your system, respectively.

The API layer (sometimes called the presentation layer or gateway layer) is the edge of your system. It receives raw HTTP requests from clients — browsers, mobile apps, third-party services — and hands them off in a structured way. It handles concerns like authentication, request validation, TLS termination, and rate limiting enforcement. For a URL shortener, this is where GET /abc123 arrives. For a rate limiter, this is where each incoming request is intercepted and checked against a policy.

The application/business logic layer is where decisions happen. This layer implements your core algorithms. In a URL shortener, it encodes and decodes short codes, checks for collisions, and resolves redirects. In a rate limiter, it runs the token bucket or sliding window algorithm, queries counters, and makes the allow/deny decision. This layer is typically stateless, meaning any instance of it can handle any request — a property that becomes critical once you start scaling horizontally.
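As a concrete illustration of the logic this layer runs, here is a minimal token bucket. It is a single-process simulation with a caller-supplied clock, not a production limiter (no locking, no shared state):

```python
class TokenBucket:
    """Minimal token bucket simulation: tokens refill at a fixed rate up
    to a cap; each request spends one token or is denied. The caller
    supplies the clock (e.g., time.monotonic() in real use)."""

    def __init__(self, capacity: float, refill_per_second: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_rate = refill_per_second
        self.tokens = capacity       # start full: allows an initial burst
        self.last_refill = now

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity
        elapsed = now - self.last_refill
        self.last_refill = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=5, refill_per_second=1)
print([bucket.allow(now=0) for _ in range(6)])  # 5-token burst, then a denial
print(bucket.allow(now=2))  # two seconds later, tokens have refilled: True
```

Because the bucket starts full, a client can burst up to `capacity` requests at once while the long-run rate is still bounded by `refill_per_second` — the property that distinguishes token buckets from fixed windows.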

The storage layer persists everything that needs to survive a server restart. This includes mapping tables (short code → original URL), user data, rate limit counters, and analytics. The storage layer introduces the hardest design tradeoffs: consistency vs. availability, read performance vs. write performance, cost vs. durability.

┌─────────────────────────────────────────────────────────┐
│                        CLIENT                           │
│              (Browser / Mobile / API consumer)          │
└──────────────────────────┬──────────────────────────────┘
                           │ HTTP Request
                           ▼
┌─────────────────────────────────────────────────────────┐
│                   TIER 1: API LAYER                     │
│         Load Balancer → API Gateway → Auth/Validation   │
└──────────────────────────┬──────────────────────────────┘
                           │ Structured Request Object
                           ▼
┌─────────────────────────────────────────────────────────┐
│           TIER 2: APPLICATION / BUSINESS LOGIC          │
│    Stateless App Servers (URL encoding, rate checking)  │
│    In-memory caches (Redis) for hot data                │
└──────────────────────────┬──────────────────────────────┘
                           │ Read/Write
                           ▼
┌─────────────────────────────────────────────────────────┐
│                  TIER 3: STORAGE LAYER                  │
│    Primary DB (MySQL / PostgreSQL / Cassandra)          │
│    Replicas, Shards, Blob Storage, Analytics Store      │
└─────────────────────────────────────────────────────────┘

💡 Mental Model: Think of the three tiers like a restaurant. The API layer is the front-of-house staff taking orders. The application layer is the kitchen deciding how to cook each dish. The storage layer is the pantry and refrigerator holding all the raw ingredients. Each tier has a distinct role and can be scaled independently.

Load Balancers: Traffic Distribution and Fault Tolerance

Once you have multiple instances of your application layer (which you will, at any meaningful scale), you need a way to distribute incoming requests across them. This is the job of a load balancer. A load balancer sits in front of your fleet of app servers and acts as a single entry point for clients, hiding the complexity of the server pool behind one IP address or hostname.

Load balancers use several strategies to route traffic. Round-robin distributes requests sequentially across servers — server 1 gets request 1, server 2 gets request 2, and so on. Least connections routes each new request to whichever server currently has the fewest active connections, which is helpful when requests vary significantly in processing time. IP hash (or sticky sessions) routes all requests from a given client IP to the same server, which matters when session state is stored locally — though well-designed stateless backends should avoid needing this.

Beyond traffic distribution, load balancers provide fault tolerance through health checks. They periodically probe each backend server (e.g., GET /health) and automatically remove unresponsive servers from the pool. When a downed server recovers, it's quietly re-added. This gives you high availability (HA) — the system keeps serving requests even when individual nodes fail.
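Round-robin with health filtering can be sketched in a few lines (the server names and manual health flags are illustrative; real balancers update health state from periodic probes):

```python
import itertools

class RoundRobinBalancer:
    """Round-robin over healthy backends only. Health state would be
    updated by periodic GET /health probes; here it is set manually."""

    def __init__(self, servers):
        self.healthy = {s: True for s in servers}
        self._cycle = itertools.cycle(servers)

    def mark(self, server: str, is_healthy: bool) -> None:
        self.healthy[server] = is_healthy

    def next_server(self) -> str:
        # Skip unhealthy nodes; give up after one full pass over the pool
        for _ in range(len(self.healthy)):
            server = next(self._cycle)
            if self.healthy[server]:
                return server
        raise RuntimeError("no healthy backends available")

lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
print([lb.next_server() for _ in range(4)])  # app-1, app-2, app-3, app-1
lb.mark("app-2", False)                      # health check fails
print([lb.next_server() for _ in range(3)])  # app-2 is skipped
```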

⚠️ Common Mistake: Many candidates draw a single load balancer in their architecture diagram and call it a day. But a single load balancer is itself a single point of failure (SPOF). In production systems, load balancers are deployed in active-passive or active-active pairs. Always mention this when discussing HA in an interview.

🎯 Key Principle: Load balancers decouple the number of clients from the number of application servers, allowing you to scale the app tier horizontally without changing anything the client sees.

Consistent Hashing: Distributing Data Without Chaos

Now imagine your storage layer isn't a single database but a cluster of cache nodes or database shards. When a request arrives, how do you decide which node holds (or should hold) the relevant data? The naive answer is modular hashing: node = hash(key) % N, where N is the number of nodes. This works until you add or remove a node — at which point N changes, and nearly every key maps to a different node, invalidating your entire cache or forcing a massive data migration. At web scale, this is catastrophic.

Consistent hashing solves this elegantly. Imagine arranging all possible hash values on a circle (a hash ring) from 0 to 2³² − 1. Each node is assigned a position on the ring by hashing its identifier (e.g., its IP address). When you need to store or retrieve a key, you hash the key, find its position on the ring, and walk clockwise until you hit the first node. That node is responsible for the key.

          Node A (hash: 12)
               *
          ___/   \___
  (0) ───/           \──── (45)
       \               /
        \             /
    Node D *         * Node B
    (hash:105)    (hash: 67)
          \       /
           \     /
        Node C (hash: 89)

Key "abc123" hashes to 70 → walks clockwise → lands on Node C
Key "xyz789" hashes to 20 → walks clockwise → lands on Node B (at 67)

When you add a new node, it only takes over a fraction of keys from its clockwise neighbor — all other nodes are unaffected. When you remove a node, its keys are redistributed to the next clockwise node. In both cases, the disruption is proportional to 1/N of the keyspace rather than the entire keyspace.

🔧 Virtual nodes (vnodes) are an important refinement. With only one position per physical node, the ring distribution can be uneven — one node might get far more keys than another depending on hash values. By assigning each physical node multiple positions on the ring (e.g., 150 virtual nodes each), the distribution becomes much more uniform. This is how systems like Apache Cassandra and Amazon DynamoDB implement consistent hashing in practice.
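A minimal hash ring with virtual nodes makes the 1/N property concrete. This sketch uses MD5 purely for its even spread; the node names and vnode count are illustrative:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Map a string to a point on the ring (MD5 for an even spread)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Consistent hash ring with virtual nodes. Each physical node owns
    `vnodes` points on the ring; a key belongs to the first node point
    clockwise from the key's hash."""

    def __init__(self, nodes, vnodes: int = 150):
        self.ring = []  # sorted list of (point, node)
        for node in nodes:
            self.add_node(node, vnodes)

    def add_node(self, node: str, vnodes: int = 150) -> None:
        for i in range(vnodes):
            bisect.insort(self.ring, (_hash(f"{node}#{i}"), node))

    def get_node(self, key: str) -> str:
        point = _hash(key)
        idx = bisect.bisect(self.ring, (point, ""))
        return self.ring[idx % len(self.ring)][1]  # wrap around at the top

# Adding a node moves only ~1/N of keys -- the property described above
keys = [f"url:{i}" for i in range(10_000)]
ring = HashRing(["cache-a", "cache-b", "cache-c"])
before = {k: ring.get_node(k) for k in keys}

ring.add_node("cache-d")
moved = sum(1 for k in keys if ring.get_node(k) != before[k])
print(f"keys moved: {moved} of {len(keys)} (~{moved / len(keys):.0%})")
```

Every key that moves lands on the new node; the other three nodes keep serving their remaining keys uninterrupted. With naive `hash(key) % N`, adding a fourth node would have remapped roughly 3 out of every 4 keys instead.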

💡 Real-World Example: Redis Cluster uses a variant of this concept with 16,384 hash slots distributed across nodes. When you add a Redis node to the cluster, only the slots it takes over are migrated — the rest of the cluster keeps running without interruption.

Database Replication and Sharding

Your storage layer faces two distinct scaling challenges: read volume (too many reads overwhelming a single node) and write volume + data size (too much data to fit on one machine, too many writes for one node to handle). Replication and sharding address these challenges in complementary ways.

Replication: Scaling Reads and Adding Redundancy

Database replication involves maintaining multiple copies of your data across different nodes. In the most common pattern, leader-follower replication (also called primary-replica), one node is designated the leader and accepts all writes. Changes are then propagated asynchronously or synchronously to one or more follower nodes, which serve read traffic.

For a URL shortener, this maps perfectly to the workload: writes (creating a new short URL) are relatively rare, but reads (resolving a short code to its destination) are extremely frequent. By routing reads to followers and writes to the leader, you can scale read capacity almost linearly by adding more replicas.

⚠️ Common Mistake: Asynchronous replication introduces replication lag — a window of time during which followers might serve stale data. In a URL shortener, this means a brand-new short URL might return a 404 if resolved within milliseconds of creation. Always acknowledge this tradeoff in your interview and propose mitigations (e.g., read-your-own-writes consistency, or routing writes and their immediate reads to the leader).
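One common mitigation can be sketched as a simple routing rule: reads from a user who has written recently go to the leader. The grace window and class below are illustrative, not a library API:

```python
class ReadRouter:
    """Read-your-own-writes routing: after a user writes, send that
    user's reads to the leader until replication has had time to catch
    up. A fixed grace window is the simplest policy; real systems can
    track replica log positions instead."""

    def __init__(self, grace_seconds: float = 1.0):
        self.grace = grace_seconds
        self.last_write_at = {}  # user_id -> timestamp of last write

    def record_write(self, user_id: str, now: float) -> None:
        self.last_write_at[user_id] = now

    def choose(self, user_id: str, now: float) -> str:
        last = self.last_write_at.get(user_id)
        if last is not None and now - last < self.grace:
            return "leader"   # a follower may still be lagging
        return "replica"      # safe to serve from a follower

router = ReadRouter(grace_seconds=1.0)
router.record_write("alice", now=100.0)
print(router.choose("alice", now=100.2))  # leader  (just wrote)
print(router.choose("alice", now=102.0))  # replica (lag window passed)
print(router.choose("bob", now=100.2))    # replica (no recent write)
```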

Sharding: Scaling Writes and Storage Capacity

Sharding (also called horizontal partitioning) divides your dataset into smaller subsets (shards), each stored on a different node. Together, all shards hold the complete dataset, but no single node holds all of it.

The choice of shard key is crucial and often the most intellectually interesting part of a storage design discussion. For a URL shortener, you might shard by the hash of the short code. For a rate limiter, you might shard by user ID or API key. A good shard key distributes data evenly (avoiding hot spots) and aligns with your most common access patterns (minimizing cross-shard queries).

Strategy                    | Best For                      | Tradeoff
🔀 Hash-based sharding      | Even distribution             | Range queries require hitting all shards
📊 Range-based sharding     | Range queries (e.g., by date) | Uneven distribution if data isn't uniform
📍 Directory-based sharding | Flexible reassignment         | Directory becomes a bottleneck/SPOF
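One way to realize the "shard by the hash of the short code" strategy is a small consistent-hash ring with virtual nodes. `ConsistentHashRing` and its parameters are our own illustrative names, a minimal sketch rather than a production implementation:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes (vnodes).

    Each key maps to the first shard clockwise on the ring; vnodes
    smooth out the distribution when shards are added or removed.
    """

    def __init__(self, shards, vnodes=100):
        self._ring = []  # sorted list of (hash_position, shard_name)
        for shard in shards:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{shard}#vnode{i}"), shard))
        self._ring.sort()
        self._positions = [pos for pos, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        # A non-cryptographic hash would also do; we need spread, not security
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def get_shard_for_key(self, key: str) -> str:
        idx = bisect.bisect_right(self._positions, self._hash(key))
        # Wrap around past the largest position on the ring
        return self._ring[idx % len(self._ring)][1]
```

With `ring = ConsistentHashRing(["db-0", "db-1", "db-2"])`, the same short code always lands on the same shard, and removing one shard only remaps the keys that lived on it.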

Putting It All Together: The Request Lifecycle

With these architectural components defined, let's trace a complete request through the system. Consider a user clicking a short link like https://short.ly/abc123. Here is the journey that request takes:

Client
  │
  │  GET https://short.ly/abc123
  ▼
DNS Resolution → Load Balancer IP
  │
  ▼
Load Balancer
  │  Health-checked routing (Round Robin)
  ▼
API Gateway / App Server Instance #3
  │  1. Parse short code: "abc123"
  │  2. Check rate limit (Redis call)
  │  3. Cache lookup (Redis): HIT? → return 301 redirect immediately
  │                           MISS? → continue to DB
  ▼
Database Shard (determined by consistent hash of "abc123")
  │  SELECT original_url FROM urls WHERE short_code = 'abc123'
  ▼
App Server
  │  4. Write result to cache (Redis) with TTL
  │  5. Return HTTP 301 Redirect to original URL
  ▼
Client
  │  Follows redirect to original URL
  ▼
Destination Website

Now let's ground this flow in pseudocode. The following snippet represents the core logic inside an app server handling a redirect request:

## URL Shortener: Request lifecycle for resolving a short code
## Assumes: redis_client, db_client, rate_limiter are initialized

def handle_redirect(request):
    short_code = extract_short_code(request.path)  # e.g., "abc123"
    client_ip = request.headers.get("X-Forwarded-For")
    # Note: X-Forwarded-For may contain a comma-separated proxy chain;
    # in production, extract and validate the actual client hop.

    # Step 1: Rate limiting check (runs at API layer)
    if not rate_limiter.allow(client_ip, limit=100, window_seconds=60):
        return HTTPResponse(status=429, body="Too Many Requests")

    # Step 2: Check the cache first (fast path)
    cached_url = redis_client.get(f"url:{short_code}")
    if cached_url:
        return HTTPResponse(status=301, headers={"Location": cached_url})

    # Step 3: Cache miss — query the database (slow path)
    # Consistent hashing determines which DB shard to query
    shard = get_shard_for_key(short_code)  # returns shard connection
    row = shard.query(
        "SELECT original_url, is_active FROM urls WHERE short_code = %s",
        (short_code,)
    )

    if not row or not row["is_active"]:
        return HTTPResponse(status=404, body="Short URL not found")

    original_url = row["original_url"]

    # Step 4: Populate cache to speed up future requests
    # TTL of 3600 seconds (1 hour) balances freshness and cache hit rate
    redis_client.setex(f"url:{short_code}", 3600, original_url)

    # Step 5: Return redirect response
    return HTTPResponse(status=301, headers={"Location": original_url})

This pseudocode shows the cache-aside pattern (also called lazy loading): data is loaded into the cache only on a cache miss. This is the dominant caching strategy for URL shorteners because it's simple, avoids caching data that is never accessed, and degrades gracefully if the cache goes down (requests just hit the database more frequently).

Next, let's look at how the rate limiter component within this flow might be implemented at a high level:

## Rate Limiter: Fixed window counter algorithm
## Uses Redis atomic operations for thread-safe counting

import time

class FixedWindowRateLimiter:
    def __init__(self, redis_client):
        self.redis = redis_client

    def allow(self, identifier: str, limit: int, window_seconds: int) -> bool:
        """
        Returns True if the request should be allowed, False if rate limited.
        identifier: unique key for this rate limit subject (e.g., IP, user_id, api_key)
        limit: maximum number of requests allowed per window
        window_seconds: length of the time window in seconds
        """
        # Build a key that encodes the current time window
        # e.g., "ratelimit:192.168.1.1:1700000040"
        # (window_start is always a multiple of window_seconds)
        window_start = int(time.time() // window_seconds) * window_seconds
        redis_key = f"ratelimit:{identifier}:{window_start}"

        # Atomically increment the counter for this window
        # INCR returns the new value after incrementing
        current_count = self.redis.incr(redis_key)

        if current_count == 1:
            # First request in this window — set expiry so Redis cleans up old keys
            self.redis.expire(redis_key, window_seconds * 2)

        if current_count > limit:
            # Optionally: set Retry-After header based on window_start + window_seconds
            return False  # ← DENY this request

        return True  # ← ALLOW this request

Notice how both snippets lean on Redis heavily. Redis is the connective tissue of most high-throughput web services — it serves as a cache, a rate limit counter store, and a fast key-value store. Its single-threaded, atomic command execution (e.g., INCR) makes it ideal for operations that must be race-condition-free across many concurrent app server instances.

💡 Pro Tip: In an interview, when you say "check the cache" or "increment the counter," follow up with why you're using Redis specifically. Mentioning atomicity, sub-millisecond latency, and built-in TTL support shows you understand the tool, not just the pattern.

🧠 Mnemonic: Remember ALReS for the request lifecycle layers: API Gateway → Load Balancer routing → Redis cache check → Storage (database). When walking through a design, trace every request through ALReS before discussing optimizations.

How These Patterns Interact

The beauty of the three-tier architecture is that each tier scales independently using the tools appropriate to its role. The API layer scales via more load balancer capacity and more app server instances. The application layer stays stateless so horizontal scaling is trivial. The storage layer scales via replication (for reads) and sharding (for writes and capacity), with consistent hashing making node additions and removals seamless.

For a URL shortener, this means:

  • 🔧 API layer handles TLS, authentication, and request routing
  • 🧠 App layer runs encoding/decoding logic and cache lookups
  • 📚 Storage layer uses read replicas for URL resolution (read-heavy) and shards for URL creation (write distribution)

For a rate limiter embedded in that same architecture:

  • 🔧 API layer is where rate limit decisions are enforced (before requests reach business logic)
  • 🧠 App layer runs the rate limit algorithm using Redis counters
  • 🎯 Storage layer uses Redis Cluster (consistent hashing built in) for distributing counters

📋 Quick Reference Card: Architecture Patterns Cheat Sheet

Component             | 🎯 Primary Role           | 🔧 Key Technology            | ⚠️ Watch Out For
🌐 Load Balancer      | Traffic distribution, HA  | Nginx, AWS ALB, HAProxy      | Single point of failure if not paired
🔀 Consistent Hashing | Even key distribution     | Custom impl, Redis Cluster   | Need vnodes for uniform distribution
📖 Replication        | Read scaling, redundancy  | MySQL replicas, Cassandra    | Replication lag on async setups
🗄️ Sharding           | Write scaling, capacity   | Vitess, Cassandra, DynamoDB  | Hot spots from bad shard keys
⚡ Caching            | Latency reduction         | Redis, Memcached             | Cache invalidation complexity

Why This Architecture Foundation Matters in Interviews

When an interviewer asks you to design a URL shortener or rate limiter, they aren't really asking about those specific systems. They're asking whether you can apply a universal set of architectural principles to a concrete problem under pressure. By anchoring your design in the three-tier architecture, explicitly reasoning about your load balancer's role in fault tolerance, choosing shard keys thoughtfully, and tracing a complete request lifecycle before diving into optimizations, you signal exactly the kind of structured, holistic thinking that distinguishes senior engineers.

Wrong thinking: "I'll jump straight to the database schema and the encoding algorithm — that's the interesting part."

Correct thinking: "Let me first sketch the three tiers, identify where state lives, explain how traffic gets routed, and then drill into the specific algorithms — that way the interviewer can follow my reasoning at every level."

The architectural patterns covered here — three-tier layering, load-balanced stateless app servers, consistent hashing for distribution, replication for read scaling, sharding for write scaling — aren't specific to URL shorteners or rate limiters. They are the vocabulary of distributed systems design. Master them here, and they'll serve you in every system design problem you encounter, from designing a Twitter feed to a distributed job scheduler.

In the next section, we'll examine the most common pitfalls candidates fall into when tackling these specific problems, so you can walk into your interview knowing exactly what not to do.

Common Pitfalls When Approaching These Design Problems

Even experienced developers stumble in system design interviews — not because they lack technical knowledge, but because they fall into predictable traps that interviewers have seen hundreds of times. When it comes to URL shorteners and rate limiters specifically, the pitfalls are remarkably consistent. Understanding them before you walk into the room is one of the highest-leverage things you can do to improve your performance. This section dissects the five most common mistakes, explains why they happen, and gives you concrete strategies to avoid each one.


Mistake 1: Jumping Into Implementation Before Clarifying Requirements ⚠️

The single most common mistake candidates make is hearing "design a URL shortener" and immediately launching into database schema, hash function choices, and redirect logic. This is backwards. An interviewer who watches you do this sees someone who writes code before understanding the problem — a red flag in any engineering context, but especially in system design.

Functional requirements define what the system does. Non-functional requirements define how well it does it. Both are essential before drawing a single box on the whiteboard.

Consider how wildly different two URL shorteners can be:

  • A personal bookmarking tool might need to handle 100 URLs/day, retain links forever, and never needs analytics.
  • A marketing campaign platform might need 10 million URL creations/day, link expiration after 30 days, click tracking, geographic analytics, and 99.99% uptime SLAs.

These are the same type of system but require completely different architecture decisions. If you don't ask, you're designing a phantom system.

❌ Wrong thinking: "I know what a URL shortener is — I'll just start designing the canonical version."

✅ Correct thinking: "Let me ask five clarifying questions before I draw anything, because the answers will fundamentally change my design."

Questions Worth Asking Before You Design

For a URL shortener:

  • What is the expected read-to-write ratio? (Typically 100:1, but confirm)
  • Should short URLs expire? If so, who controls the TTL — the user or the system?
  • Do we need custom aliases (e.g., /promo2024)?
  • Is click analytics a requirement?
  • What's the target latency for redirects?

For a rate limiter:

  • Are we limiting per user, per IP, per API key, or some combination?
  • Should limits be fixed (100 req/min) or adaptive?
  • What happens when the limit is exceeded — hard block, soft throttle, or queuing?
  • Is this for internal services or public-facing APIs?
  • Do we need rate limiting to be globally consistent across data centers, or is per-region acceptable?

💡 Pro Tip: Interviewers often intentionally leave requirements vague to see if you'll ask. Staying silent and "making assumptions" without stating them is not the same as clarifying. State every assumption you make out loud, and invite the interviewer to correct you.



Mistake 2: Ignoring Failure Scenarios ⚠️

Candidates love drawing the happy path. The data flows in, gets processed, gets stored, gets returned — smooth and clean. Interviewers love asking "what happens when X fails?" and watching the candidate's expression change. Failure handling is not a bonus round; it's a core part of the design.

Three failure scenarios come up most frequently in these two systems:

Cache Stampedes

A cache stampede (also called a thundering herd) occurs when a cached item expires and many concurrent requests all simultaneously miss the cache, hammer the database, and race to repopulate the same key. In a URL shortener under high traffic, a popular short URL expiring from Redis could trigger thousands of simultaneous database reads.

Time 0ms:  cached entry for /abc123 expires
Time 1ms:  1,000 requests arrive simultaneously
Time 1ms:  all 1,000 see a cache miss
Time 1ms:  all 1,000 query the database
Time 50ms: database becomes overloaded
Time 80ms: some DB queries time out
Time 90ms: cascading failures begin

The fix is a pattern called probabilistic early expiration or a simpler mutex/lock-based repopulation where only one request rebuilds the cache while others wait or receive a slightly stale value.

import time

import redis

cache = redis.Redis()
# Note: a threading.Lock would only serialize requests within one process.
# Across many app servers, the Redis SET NX below acts as the distributed lock.

def get_long_url(short_code: str, retries: int = 3) -> str:
    # Attempt cache lookup first
    cached = cache.get(short_code)
    if cached:
        return cached.decode()

    # Cache miss: use a distributed lock to prevent stampede
    lock_key = f"lock:{short_code}"
    # SET NX EX: set only if not exists, with 5-second expiry
    acquired = cache.set(lock_key, "1", nx=True, ex=5)

    if acquired:
        try:
            # This process won the lock: fetch from DB and repopulate
            # (db is assumed to be a database client defined elsewhere)
            long_url = db.query("SELECT long_url FROM urls WHERE short_code = %s", short_code)
            cache.setex(short_code, 3600, long_url)  # Cache for 1 hour
            return long_url
        finally:
            cache.delete(lock_key)  # Always release the lock
    else:
        # Another process is repopulating: wait briefly, then retry a
        # bounded number of times so a stuck lock can't recurse forever
        if retries == 0:
            raise TimeoutError(f"Cache never warmed for {short_code}")
        time.sleep(0.05)
        return get_long_url(short_code, retries - 1)

This snippet demonstrates a lock-based cache repopulation strategy. The SET NX EX Redis command atomically acquires a lock only if it doesn't already exist, ensuring only one process hits the database. All other concurrent requests retry after a short sleep, finding the cache warm on the second attempt.
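The alternative mentioned earlier, probabilistic early expiration (sometimes called XFetch), is even simpler to sketch as a pure decision function. The function and parameter names below are our own; the `rand` parameter is injectable purely so the behavior can be tested deterministically:

```python
import math
import random

def should_refresh_early(ttl_remaining: float, recompute_cost: float,
                         beta: float = 1.0, rand: float = None) -> bool:
    """Decide whether to recompute a cached value before it expires.

    ttl_remaining:  seconds until the cached value actually expires
    recompute_cost: roughly how long the recompute (DB query) takes, in seconds
    beta:           aggressiveness; > 1 refreshes earlier, < 1 later
    rand:           injectable for testing; defaults to random.random()

    The closer a key is to expiry (and the costlier the recompute), the
    more likely any single request is to refresh it, so the refresh load
    spreads across requests instead of stampeding at expiry time.
    """
    if rand is None:
        rand = random.random()
    # -log(rand) is an exponentially distributed jitter multiplier
    return recompute_cost * beta * -math.log(rand) >= ttl_remaining
```

A request that draws `True` recomputes the value and resets the TTL; everyone else keeps serving the still-valid cached entry, so no lock is needed at all.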

Single Points of Failure

A single point of failure (SPOF) is any component whose failure brings the whole system down. In URL shortener designs, candidates frequently draw a single Redis node or a single application server without discussing replication or failover. In rate limiter designs, a centralized counter store is a classic SPOF.

⚠️ Common Mistake: Drawing a Redis node in the middle of your architecture without mentioning Redis Sentinel, Redis Cluster, or a replica read strategy. If an interviewer asks "what happens if Redis goes down?" and you haven't thought about it, the design feels incomplete.

🎯 Key Principle: Every stateful component in your architecture should prompt you to ask: "What is the failure mode here, and what is the recovery path?"

Clock Skew in Distributed Rate Limiters

Clock skew refers to the difference in time reported by clocks on different servers. In a distributed rate limiter using a sliding window algorithm, you need timestamps to be accurate. If Server A thinks it's 12:00:00.000 and Server B thinks it's 12:00:00.300, the window boundaries are inconsistent. A user might get 100 requests through on Server A and another 100 through on Server B within the same actual second.

The pragmatic solution is to use Redis as the single source of time for window calculations, or to use a token bucket algorithm that is inherently more tolerant of minor timing inconsistencies because it operates on rates rather than absolute window boundaries.
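A token bucket of the kind described above can be sketched as a small class with an injectable clock (the injection is our addition, for testability). Tokens drip in at a fixed rate and each request spends one, so minor clock skew only shifts the refill slightly instead of moving a hard window boundary:

```python
import time

class TokenBucket:
    """Sketch of a single-process token-bucket rate limiter.

    capacity:     maximum burst size (tokens the bucket can hold)
    refill_rate:  tokens added per second
    clock:        injectable time source (defaults to time.monotonic)
    """

    def __init__(self, capacity: float, refill_rate: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock
        self.tokens = capacity
        self.last_refill = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Add tokens for the time elapsed since the last check, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Notice there are no window boundaries anywhere: the only time arithmetic is a relative elapsed-time delta, which is exactly why skewed wall clocks hurt this algorithm far less than they hurt window-based ones.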



Mistake 3: Underestimating the Redirect Flow's Latency Impact ⚠️

Candidates routinely treat the redirect as an afterthought — "the user hits the short URL, we look up the long URL, we redirect them." But the redirect is the critical path for users. Every millisecond of latency here is felt directly. In a marketing campaign scenario, a slow redirect can measurably reduce conversion rates.

Let's trace what actually happens in a naive implementation:

User Browser          App Server           Database
    |                     |                    |
    |--- GET /abc123 ----->|                    |
    |                     |--- SELECT long_url->|
    |                     |<-- long_url --------|  (~10-30ms DB round trip)
    |<-- 302 Redirect -----|                    |
    |                                           |
    |--- GET https://very-long-original-url.com/page ----> (final destination)

The database query adds 10–30ms of latency before the user even starts loading the destination page. At scale, this also means the database is hammered by read traffic for every single redirect.

The optimized flow uses an in-memory cache layer (Redis) as the primary lookup:

User Browser          App Server           Redis          Database
    |                     |                  |                |
    |--- GET /abc123 ----->|                  |                |
    |                     |--- GET abc123 -->|                |
    |                     |<-- long_url -----|  (~0.5-2ms)    |
    |<-- 302 Redirect -----|                  |                |

Cache hit latency in Redis is typically under 1ms on a local network. This is a 20-30x improvement over a database query. Mentioning this comparison explicitly in your interview demonstrates that you think in terms of performance tradeoffs, not just functional correctness.

Beyond caching, there's a subtler choice: 301 vs. 302 redirects.

  • A 301 (Permanent Redirect) tells browsers and CDNs to cache the redirect. Future visits skip your servers entirely — great for latency, terrible for analytics since you lose visibility into clicks.
  • A 302 (Temporary Redirect) routes every click through your server, enabling click counting and analytics, but at the cost of latency and server load.

⚠️ Common Mistake: Defaulting to 301 without discussing the analytics tradeoff. If the requirements include click tracking (and they often will in an interview), 302 is the correct choice — but you should articulate why.

from flask import Flask, redirect, abort
import redis

# Assumes `db` is a DB-API connection with a dict-style cursor and
# `track_click` is a Celery task, both defined elsewhere.
app = Flask(__name__)
cache = redis.Redis(host='localhost', port=6379, db=0)

@app.route('/<short_code>')
def redirect_url(short_code: str):
    # Step 1: Check cache first (fast path, ~1ms)
    long_url = cache.get(short_code)

    if long_url:
        long_url = long_url.decode('utf-8')
    else:
        # Step 2: Cache miss — fall back to database (slow path, ~20ms)
        result = db.execute(
            "SELECT long_url FROM urls WHERE short_code = %s", (short_code,)
        ).fetchone()

        if not result:
            abort(404)  # Short URL does not exist

        long_url = result['long_url']
        # Repopulate cache with 1-hour TTL
        cache.setex(short_code, 3600, long_url)

    # Step 3: Async click tracking (non-blocking, doesn't add latency)
    track_click.delay(short_code)  # Celery async task

    # Use 302 to preserve analytics visibility
    return redirect(long_url, code=302)

This Flask snippet illustrates the cache-aside pattern for the redirect flow. Notice that click tracking is dispatched as an asynchronous task — it does not block the redirect response. This is a critical design detail: you never want analytics logic sitting in the critical path of a user redirect.


Mistake 4: Treating Rate Limiting as a Local Problem ⚠️

Many candidates design a perfectly reasonable rate limiter that works beautifully on a single server — and then forget that their system will run on dozens or hundreds of servers in production.

Here's the problem in concrete terms. Suppose you allow 100 requests per minute per user. With a local in-memory counter:

Server A:  user_123 → 60 requests  ✅ (under limit)
Server B:  user_123 → 60 requests  ✅ (under limit)
Server C:  user_123 → 60 requests  ✅ (under limit)

Actual total: 180 requests — 80% over the limit! ❌

This is the distributed rate limiting problem. Load balancers distribute traffic across servers, and without a shared state store, each server applies limits independently. The user effectively gets limit × server_count requests through.

The solution requires centralized state, typically in Redis. Every counter increment is a Redis operation rather than an in-memory operation. But this introduces a new tradeoff: network overhead. Every request now requires a Redis round-trip (~1ms) that didn't exist before.

import redis
import time

cache = redis.Redis(host='redis-cluster', port=6379)

def is_rate_limited(user_id: str, limit: int = 100, window_seconds: int = 60) -> bool:
    """
    Sliding window rate limiter using Redis sorted sets.
    Each request is stored as a member with its timestamp as the score.
    Members outside the window are trimmed on each check.
    """
    now = time.time()
    window_start = now - window_seconds
    key = f"rate_limit:{user_id}"

    pipe = cache.pipeline()  # Use pipeline to batch Redis commands atomically

    # Remove all entries outside the current window
    pipe.zremrangebyscore(key, 0, window_start)

    # Add the current request with its timestamp as the score.
    # (In production, use a unique member, e.g. f"{now}:{uuid4()}", so two
    # requests with identical timestamps don't collapse into one entry.)
    pipe.zadd(key, {str(now): now})

    # Count requests in the current window
    pipe.zcard(key)

    # Set key expiry so Redis cleans up inactive users automatically
    pipe.expire(key, window_seconds)

    results = pipe.execute()
    request_count = results[2]  # zcard result

    return request_count > limit  # True means the request should be blocked

This implements a sliding window log using a Redis sorted set. The timestamp serves as both the member and the score, allowing efficient range queries. Using a pipeline batches all four Redis commands into a single round-trip, minimizing network overhead.

The important nuance to raise in an interview: this centralized approach introduces coordination overhead and a new dependency. If Redis becomes unavailable, what does your rate limiter do? Options include:

  • 🔒 Fail closed: Block all requests (safe but damaging to availability)
  • 🔓 Fail open: Allow all requests (risky but preserves user experience)
  • 🔧 Graceful degradation: Fall back to local in-memory limiting (approximate but functional)

💡 Real-World Example: Stripe's rate limiter uses Redis with a fail-open strategy, accepting that a Redis outage creates a brief window of unrestricted traffic rather than blocking legitimate customers entirely. This reflects a business decision — downtime is worse than a temporary limit bypass for their use case.

🤔 Did you know? Some large-scale systems use a token bucket with periodic synchronization — each server maintains a local token bucket but periodically syncs with Redis to adjust for global consumption. This reduces Redis call frequency by 10-100x while still providing reasonably accurate global limits.
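That periodic-synchronization idea can be sketched with a plain dict standing in for Redis. `SyncedLocalLimiter` and `sync_every` are invented names for illustration; a real deployment would reconcile with an atomic Redis `INCRBY` on a timer rather than a request-count trigger:

```python
class SyncedLocalLimiter:
    """Sketch of a locally-counted limiter that periodically reconciles
    with a shared store. `shared` is a plain dict standing in for Redis."""

    def __init__(self, shared: dict, limit: int, sync_every: int = 10):
        self.shared = shared
        self.limit = limit
        self.sync_every = sync_every
        self.local_count = 0    # requests this server allowed since last sync
        self.global_count = 0   # last known global total from the shared store

    def allow(self, key: str) -> bool:
        if self.local_count >= self.sync_every:
            # Push our local batch to the shared store, pull the global total.
            # With real Redis this would be a single atomic INCRBY.
            self.shared[key] = self.shared.get(key, 0) + self.local_count
            self.global_count = self.shared[key]
            self.local_count = 0
        # Decide using the (possibly slightly stale) global view
        if self.global_count + self.local_count >= self.limit:
            return False
        self.local_count += 1
        return True
```

The accuracy tradeoff is explicit: each server can overshoot the global limit by at most one unsynchronized batch, in exchange for cutting shared-store round-trips by a factor of `sync_every`.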



Mistake 5: Neglecting Data Expiry, Cleanup, and Storage Growth ⚠️

System design interviews frequently focus on the ingestion side of a system — how data gets in, how it gets read. Candidates rarely think about what happens to data over time. For long-running systems, storage growth is a slow-moving existential threat.

Consider a URL shortener operating at modest scale:

URL creations per day:     100,000
Average record size:       ~500 bytes (short code + long URL + metadata)
Storage per day:           ~50 MB
Storage per year:          ~18 GB
Storage after 10 years:    ~180 GB

That's manageable. But at web scale (100M URL creations/day), you're looking at roughly 50 GB per day, about 18 TB per year, for URL records alone. Without a cleanup strategy, this becomes a serious operational problem.

Three concepts deserve attention here:

TTL-Based Expiry

TTL (Time-to-Live) is an expiry duration attached to each record. Redis supports TTL natively on every key. For databases, you'll need either an expiry timestamp column with a background cleanup job, or a time-partitioned table structure.

URLs Table
+-----------+-------------------+-------------------+-----------+
| short_code| long_url          | created_at        | expires_at|
+-----------+-------------------+-------------------+-----------+
| abc123    | https://example.. | 2024-01-01 10:00  | 2024-02-01|
| xyz789    | https://other...  | 2024-01-15 09:30  | NULL      | ← never expires
+-----------+-------------------+-------------------+-----------+

The expires_at column being NULL for some records indicates those are permanent links. A background job runs nightly to DELETE rows where expires_at < NOW().
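The nightly job described above can be sketched end-to-end. Here an in-memory SQLite table stands in for the production database, and the DELETE runs in bounded batches (via a LIMITed subquery) so the job never holds a long lock; table and function names are ours:

```python
import sqlite3

def purge_expired(conn, now: str, batch_size: int = 1000) -> int:
    """Delete expired URL rows in bounded batches; returns rows removed.

    expires_at IS NULL means the link never expires, so those rows
    are left untouched.
    """
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM urls WHERE rowid IN ("
            "  SELECT rowid FROM urls"
            "  WHERE expires_at IS NOT NULL AND expires_at < ?"
            "  LIMIT ?)",
            (now, batch_size),
        )
        conn.commit()
        if cur.rowcount == 0:
            return total
        total += cur.rowcount

# Demo against an in-memory stand-in for the real table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (short_code TEXT, long_url TEXT, expires_at TEXT)")
conn.executemany("INSERT INTO urls VALUES (?, ?, ?)", [
    ("abc123", "https://example.com", "2024-02-01"),   # expired
    ("xyz789", "https://other.example", None),         # permanent
    ("old111", "https://gone.example", "2023-12-31"),  # expired
])
removed = purge_expired(conn, now="2024-06-01")
```

Batching matters at scale: a single unbounded `DELETE` over millions of expired rows can lock the table and stall live redirect traffic, which is exactly the kind of operational detail worth naming in an interview.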

⚠️ Common Mistake: Relying entirely on application-level expiry checks ("if the URL is expired, return 404") without actually deleting expired rows. This is the soft delete trap — your table keeps growing, query performance degrades, and you've solved nothing.

Cascading Cleanup

Deletion is rarely isolated. When a URL record is deleted:

  • The cache entry in Redis should also be invalidated
  • Click analytics associated with that short code may need to be archived or deleted
  • The short code itself may need to be returned to a pool for reuse

Cascading cleanup means thinking through every downstream data artifact when a primary record is deleted. Missing this in an interview suggests you haven't thought about the system's full lifecycle.
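A cascading delete can be sketched as one function that walks every downstream artifact listed above. The cache, analytics store, and code pool here are plain in-memory stand-ins (dicts and a set) for Redis and real tables; the function name is ours:

```python
def delete_short_url(short_code: str, db: dict, cache: dict,
                     analytics: dict, code_pool: set) -> bool:
    """Delete a short URL and every artifact that references it.

    db, cache, and analytics are dict stand-ins for the real stores;
    code_pool collects codes eligible for reuse. Returns False if the
    code didn't exist.
    """
    if short_code not in db:
        return False
    # 1. Remove the primary record
    del db[short_code]
    # 2. Invalidate the cache entry so stale redirects stop immediately
    cache.pop(f"url:{short_code}", None)
    # 3. Archive-then-delete click analytics (here: just delete)
    analytics.pop(short_code, None)
    # 4. Return the code to the reuse pool
    code_pool.add(short_code)
    return True
```

The ordering is deliberate: the primary record goes first so no new cache entries can be created for a half-deleted URL while the downstream cleanup runs.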

Storage Tiering for Rate Limiter Data

Rate limiters generate ephemeral data — request timestamps, token counts, window states. This data is only valuable for the duration of the rate-limiting window (seconds to minutes). Redis TTL handles this automatically if you set it correctly, but candidates frequently forget to set expiry on rate-limiting keys at all.

## ❌ Wrong: No expiry set — keys accumulate in Redis forever
cache.incr(f"rate:{user_id}")

## ✅ Correct: Set expiry equal to window duration
pipe = cache.pipeline()
pipe.incr(f"rate:{user_id}")
pipe.expire(f"rate:{user_id}", 60)  # Expire after 60 seconds
pipe.execute()

The fix is trivially simple — but only if you've thought about it. A Redis instance managing millions of users will accumulate gigabytes of stale rate-limiting keys if TTLs are omitted.

📋 Quick Reference Card: Data Lifecycle Checklist

Component          | Storage        | ⏱️ Expiry Strategy     | 🔧 Cleanup Mechanism  | 📈 Growth Risk
🔗 Short URLs      | Database rows  | expires_at column     | Nightly batch DELETE  | Medium
⚡ URL Cache       | Redis keys     | SETEX with TTL        | Automatic eviction    | Low
📊 Click Analytics | Time-series DB | Archive after 90 days | Partition pruning     | High
🚦 Rate Limit Keys | Redis keys     | EXPIRE on write       | Automatic eviction    | Low
🔑 Short Code Pool | In-memory/DB   | Reclaim on URL delete | Explicit reclamation  | Medium

🧠 Mnemonic: Think CLEAR when designing data lifecycle:

  • Create: How does data enter the system?
  • Live: How long does data stay valid?
  • Expire: What triggers or schedules expiry?
  • Archive: Does old data need to be preserved somewhere cheaper?
  • Reclaim: Are there finite resources (like short codes) that need to be returned to a pool?

Putting It All Together: The Interview Mental Checklist

Before you start drawing architecture diagrams in any system design interview, run through this mental checklist:

Before designing:

  • 🎯 Did I ask about functional and non-functional requirements?
  • 🎯 Do I know the scale, latency targets, and consistency requirements?
  • 🎯 Did I state all assumptions explicitly?

While designing:

  • 🔧 Is every component that stores state replicated?
  • 🔧 What is the failure mode of each component?
  • 🔧 Am I thinking globally, not just for a single server?

After the happy path:

  • 🧠 What happens under load (cache stampedes, hot keys)?
  • 🧠 What happens when dependencies fail (Redis down, DB slow)?
  • 🧠 What does this system look like after 3 years of data accumulation?

The candidates who stand out in system design interviews are not the ones who know the most algorithms or memorize the most architecture patterns. They're the ones who demonstrate systematic thinking — who treat design as a conversation about tradeoffs rather than a recitation of solutions. The pitfalls covered in this section are, at their core, all symptoms of the same root cause: rushing to answer before fully understanding the problem. Slow down, ask questions, and let the constraints shape the design. That's the skill interviewers are actually evaluating.


Key Takeaways and What to Expect Next

You have just covered a lot of ground. Before this lesson, a system design question about a URL shortener or a rate limiter might have felt like an open-ended puzzle with no clear starting point. Now you have a structured mental framework, a shared vocabulary, and a set of repeatable techniques that apply far beyond these two systems. This final section pulls everything together, hands you a practical checklist you can use in any interview, and points the way toward the deeper dives that follow.


What You Now Understand That You Didn't Before

Let's be explicit about the knowledge delta — the gap between where you started and where you are now — because naming it will help you retain it.

You came in knowing that both URL shorteners and rate limiters are real services. You're leaving with an understanding of why they are interview staples: they are small enough to design end-to-end in 45 minutes, rich enough to reveal how a candidate thinks about scale, trade-offs, and failure modes, and representative enough to test skills that transfer to much larger systems.

More concretely, you now understand:

🧠 Hashing as a first-class design tool. You know the difference between cryptographic hashing (MD5, SHA-256) and non-cryptographic hashing (MurmurHash, xxHash), and you know when each is appropriate. For a URL shortener, you reach for a fast non-cryptographic hash or a base-62 encoding of an auto-incremented ID. For a rate limiter, you use hashing to distribute key lookups across nodes.
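The base-62 approach is compact enough to sketch in full: encode the row's auto-incremented ID over the alphabet 0-9, a-z, A-Z. The alphabet ordering below is one common choice, not the only one:

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_base62(n: int) -> str:
    """Encode a non-negative integer ID as a base-62 short code."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

def decode_base62(code: str) -> int:
    """Invert encode_base62: map a short code back to its integer ID."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n
```

Seven base-62 characters cover 62^7, roughly 3.5 trillion IDs, which is why short codes in the wild rarely need to be longer than that.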

📚 Caching as a latency weapon. You can articulate the difference between a read-through cache and a write-through cache, explain why LRU eviction makes sense for URL redirect lookups, and identify the risk of cache stampede on a cold start or after a cache flush.

🔧 Storage choice as a first trade-off conversation. You can compare relational databases (ACID guarantees, complex queries), key-value stores (low-latency point lookups), and in-memory stores (sub-millisecond reads, volatile by default) and match each to the access patterns of a given system.

🎯 Scalability patterns as a toolkit, not a checklist. Horizontal scaling, consistent hashing, read replicas, sharding by key range versus hash, asynchronous write-back — you understand what problem each pattern solves and what new problem it introduces.

🔒 Back-of-the-envelope estimation as a communication tool. You can turn vague requirements into concrete numbers — QPS, storage per year, memory budget — and use those numbers to justify architectural decisions rather than making them seem arbitrary.


The Shared Design Primitives: A Consolidated View

One of the most powerful insights from this lesson is that URL shorteners and rate limiters, despite solving completely different user-facing problems, share the same underlying design vocabulary. Understanding this overlap is what lets an experienced engineer approach a novel system quickly.

          URL Shortener                  Rate Limiter
          ─────────────                  ────────────
  Hashing  base-62(ID) → short key   hash(user_id+endpoint) → counter key
  Caching  short→long URL in Redis   token count / window in Redis
  Storage  PostgreSQL or DynamoDB     Redis (primary) + DB (audit log)
  Scale    read-heavy, cache-first    write-heavy, atomic increments
  Failure  serve stale redirect       fail open OR fail closed
  Metric   redirect latency (p99)     false-positive block rate

When you see this side by side, a clear pattern emerges: the tools are the same; the knobs are tuned differently. This is the mental model that lets senior engineers discuss unfamiliar systems confidently. They are not memorizing solutions. They are applying a portable toolkit.

💡 Mental Model: Think of system design like cooking. A URL shortener and a rate limiter use the same kitchen — hashing, caching, storage, replication — but follow different recipes. Learning the kitchen is more valuable than memorizing a single dish.


Quick-Reference Checklist for Any System Design Interview

The following checklist distills the structured approach described throughout this lesson. Print it, memorize it, practice it until it becomes instinctive. The goal is not to follow it rigidly but to internalize it so thoroughly that you never freeze in an interview.

📋 Quick Reference Card: System Design Interview Answer Structure

  1. 🎯 Clarify Requirements (3–5 min): functional (what it does) vs. non-functional (scale, latency, consistency)
  2. 🔧 Estimate Scale (3–5 min): QPS, storage, bandwidth; pick the dominant constraint
  3. 🗺️ Define the API (2–3 min): inputs, outputs, error cases
  4. 🏗️ High-Level Design (5–7 min): draw the happy path end-to-end with 3–5 components
  5. 🔍 Deep Dive (10–15 min): pick 1–2 hard sub-problems; go deep on the interviewer's interest
  6. ⚖️ Trade-offs (ongoing): for every major decision, name the alternative and why you didn't pick it
  7. 🚨 Failure Modes (3–5 min): what breaks under load? What breaks when a node dies?
  8. 📈 Bottlenecks (2–3 min): where is the next scaling wall? How would you break through it?

⚠️ The single most common mistake candidates make is skipping the requirements clarification step and jumping straight to drawing boxes. An interviewer asking you to design a rate limiter might have a simple per-user API quota in mind, or they might be thinking about a distributed DDoS mitigation system. These require very different architectures. Always clarify first.


A Reusable Code Pattern: The Design Skeleton

Just as you might open a coding interview with a function stub before filling in the logic, you can open a system design answer with a mental skeleton. Here is that skeleton expressed as pseudocode — not because you would write this in an interview, but because seeing it written out helps you internalize the structure.

## System Design Mental Skeleton — apply to any problem

class SystemDesign:
    def clarify_requirements(self, problem_statement):
        """
        Ask before assuming.
        Return: functional_reqs, non_functional_reqs, out_of_scope
        """
        functional = ["what the system must DO"]
        non_functional = ["latency targets", "availability SLA", "scale"]
        out_of_scope = ["what you are explicitly NOT designing"]
        return functional, non_functional, out_of_scope

    def estimate_scale(self, non_functional_reqs):
        """
        Turn qualitative scale into numbers.
        Return: qps, storage_per_year, memory_budget
        """
        qps = self.peak_writes + self.peak_reads
        storage = self.record_size_bytes * self.records_per_day * 365
        # Identify the DOMINANT constraint — usually QPS or storage
        return qps, storage

    def design_api(self, functional_reqs):
        """
        Define the contract before the implementation.
        Inputs → Outputs → Errors
        """
        pass

    def high_level_design(self):
        """
        Client → Load Balancer → Service → Cache → DB
        Draw the happy path. Label each arrow with protocol.
        """
        pass

    def deep_dive(self, hard_sub_problems):
        """
        Pick the 1-2 hardest sub-problems and go deep.
        For URL shortener: ID generation strategy
        For rate limiter: distributed counter consistency
        """
        pass

    def articulate_trade_offs(self, every_major_decision):
        """
        For each decision: name the alternative, explain why you chose differently.
        Never present a design without acknowledging its weaknesses.
        """
        for decision in every_major_decision:
            chosen = decision.choice
            alternative = decision.what_else_you_considered
            reason = decision.why_not_the_alternative

This skeleton is deliberately abstract. Its value is as a checklist you can run in your head at the start of every design session, whether in an interview or on the job.


What to Expect in the URL Shortener Deep Dive

The dedicated URL shortener lesson builds directly on the foundations you have established here. You already know the high-level shape of the problem. The next lesson goes three levels deeper on each of the hard sub-problems.

ID Generation and the Short Key Problem

This is the algorithmic heart of the URL shortener. You will learn the full spectrum of strategies:

  • Auto-increment + base-62 encoding — simple, sequential, but exposes enumeration risk and creates a single-point-of-failure at the database sequence generator.
  • Pre-generated key pool — a background worker generates short keys in advance and stores them in a "ready" table; the application pops one atomically at write time. This decouples key generation from the write path.
  • Distributed ID generators — Snowflake-style IDs that embed a timestamp, a worker ID, and a sequence number, enabling decentralized generation without coordination.
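To make the third strategy concrete, here is a minimal sketch of a Snowflake-style generator. The bit layout (41-bit millisecond timestamp, 10-bit worker ID, 12-bit sequence) and the custom epoch are illustrative assumptions; real implementations vary in their widths and clock handling:

```python
import threading
import time

# Illustrative layout: timestamp (ms) | 10-bit worker ID | 12-bit sequence
EPOCH_MS = 1_600_000_000_000  # assumed custom epoch
WORKER_BITS = 10
SEQUENCE_BITS = 12

class SnowflakeLike:
    """Sketch of a coordination-free distributed ID generator."""

    def __init__(self, worker_id: int):
        assert 0 <= worker_id < (1 << WORKER_BITS)
        self.worker_id = worker_id
        self.sequence = 0
        self.last_ts = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            ts = int(time.time() * 1000) - EPOCH_MS
            if ts == self.last_ts:
                # Same millisecond: bump the per-worker sequence number
                self.sequence = (self.sequence + 1) & ((1 << SEQUENCE_BITS) - 1)
                if self.sequence == 0:
                    # Sequence exhausted; spin until the next millisecond
                    while ts <= self.last_ts:
                        ts = int(time.time() * 1000) - EPOCH_MS
            else:
                self.sequence = 0
            self.last_ts = ts
            return (
                (ts << (WORKER_BITS + SEQUENCE_BITS))
                | (self.worker_id << SEQUENCE_BITS)
                | self.sequence
            )

gen = SnowflakeLike(worker_id=7)
ids = [gen.next_id() for _ in range(5)]
# IDs are strictly increasing, yet no central coordinator was involved
```

Because the timestamp occupies the high bits, IDs sort roughly by creation time; because the worker ID is embedded, two nodes can never collide without talking to each other.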

You will see working code for base-62 encoding and decoding, which is the most frequently asked implementation detail in URL shortener interviews.

import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
## '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
## Length: 62 characters

def encode_base62(number: int) -> str:
    """Convert a positive integer to a base-62 string (short key)."""
    if number == 0:
        return ALPHABET[0]
    result = []
    while number > 0:
        number, remainder = divmod(number, 62)
        result.append(ALPHABET[remainder])
    return ''.join(reversed(result))

def decode_base62(short_key: str) -> int:
    """Convert a base-62 string back to its original integer ID."""
    result = 0
    for char in short_key:
        result = result * 62 + ALPHABET.index(char)
    return result

## Example: database row with ID 125,000,000
## encodes to a 5-character short key
row_id = 125_000_000
short = encode_base62(row_id)   # '8sud2'
original = decode_base62(short) # 125000000

print(f"ID {row_id} → short key '{short}' → decoded {original}")
## With 6 characters: 62^6 = ~56 billion unique keys

This snippet shows exactly what most interviewers expect when they ask "how do you generate the short code?" Being able to write this cleanly — and explain that 6 characters gives you 56 billion unique keys — demonstrates both coding ability and quantitative reasoning.
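That capacity claim generalizes: the minimum key length for any target capacity is a one-line logarithm. A small helper (illustrative, not part of the lesson's reference code):

```python
import math

def min_key_length(num_urls: int, alphabet_size: int = 62) -> int:
    """Smallest key length L such that alphabet_size**L >= num_urls."""
    return math.ceil(math.log(num_urls) / math.log(alphabet_size))

print(min_key_length(56_000_000_000))  # 6 -> 62^6 ≈ 56.8 billion keys
print(min_key_length(100_000_000))     # 5 -> a year of heavy traffic still fits
```

Being able to produce this number on demand ("we need 56 billion keys, so 6 characters suffice") is the quantitative reasoning interviewers are listening for.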

Redirect Mechanics and HTTP Status Codes

You will learn why the choice between 301 (Moved Permanently) and 302 (Found, conventionally used for temporary redirects) is not arbitrary. A 301 tells browsers they may cache the redirect indefinitely, reducing load on your servers but preventing you from tracking click-through analytics. A 302 forces every redirect back through your server, enabling analytics at the cost of latency and server load. This is a genuine trade-off that interviewers love to probe.
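The trade-off fits in a few lines of code. The sketch below uses an in-memory dict and a plain (status, headers) tuple as stand-ins for a real store and web framework; the mapping and key are illustrative:

```python
# Illustrative redirect handler: the status code choice IS the design decision.
URL_TABLE = {"8sud2": "https://example.com/very/long/path"}  # stand-in for the store

def redirect(short_key: str, analytics_enabled: bool) -> tuple[int, dict]:
    """Return (status_code, headers) for a redirect response."""
    long_url = URL_TABLE.get(short_key)
    if long_url is None:
        return 404, {}
    if analytics_enabled:
        # 302: browsers re-request every time, so every click reaches our
        # servers and can be counted -- at the cost of latency and load.
        return 302, {"Location": long_url}
    # 301: browsers may cache the mapping indefinitely, offloading traffic
    # from us -- but repeat clicks become invisible to analytics.
    return 301, {"Location": long_url}

print(redirect("8sud2", analytics_enabled=True))   # 302 response
print(redirect("8sud2", analytics_enabled=False))  # 301 response
```

Notice that the business requirement (do we need click analytics?) fully determines the status code; stating that dependency explicitly is what turns a factual answer into a design answer.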

Analytics and the Write Amplification Problem

Every click on a short URL is a potential analytics event: timestamp, referrer, user-agent, geolocation. Writing all of this synchronously on the critical redirect path would unacceptably increase latency. The lesson covers the pattern of publishing events to a message queue (Kafka, SQS) and processing them asynchronously — decoupling the user-facing latency from the analytics pipeline throughput.

🎯 Key Principle: In high-traffic systems, never let analytics writes block user-facing reads. Decouple with a queue.
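A minimal in-process sketch of that decoupling, using Python's standard-library queue and a background thread in place of Kafka or SQS (the URLs and keys are illustrative):

```python
import queue
import threading
import time

click_events: "queue.Queue" = queue.Queue()
processed = []

def redirect_path(short_key: str) -> str:
    """User-facing hot path: enqueue the analytics event and return immediately."""
    click_events.put({"key": short_key, "ts": time.time()})  # non-blocking
    return f"https://long-url-for/{short_key}"  # illustrative lookup

def analytics_worker():
    """Background consumer: processes events off the critical path."""
    while True:
        event = click_events.get()
        if event is None:  # shutdown sentinel
            break
        processed.append(event)  # stand-in for writing to the analytics store
        click_events.task_done()

worker = threading.Thread(target=analytics_worker, daemon=True)
worker.start()

for key in ["8sud2", "abc12", "8sud2"]:
    redirect_path(key)  # each call returns without waiting on analytics

click_events.join()  # in production the queue is durable and external
print(len(processed))  # 3 events processed, zero added redirect latency
```

The redirect path's only analytics cost is one enqueue; everything expensive (geolocation lookup, user-agent parsing, database writes) happens on the consumer side, where backlog hurts freshness but never user-facing latency.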


What to Expect in the Rate Limiter Deep Dive

The dedicated rate limiter lesson takes the algorithmic complexity one step further, because the correctness of a rate limiter depends on subtle mathematical properties of each algorithm.

The Algorithm Spectrum

You will study five algorithms in depth, understanding not just how each works but when to choose each one:

  • Token Bucket — tokens accumulate at a fixed rate up to a maximum bucket size. Allows controlled bursting. Simple to implement with a Redis DECRBY and a timestamp. Preferred when you want to allow short spikes.
  • Leaky Bucket — requests flow in at any rate but are processed at a fixed rate, like water draining through a hole. Smooths out bursts entirely. Preferred when downstream systems need uniform input.
  • Fixed Window Counter — count requests in a fixed time window (e.g., the current minute). Cheap to implement, but suffers from the boundary spike problem: a user can make 2× the allowed requests by hitting the boundary between two windows.
  • Sliding Window Log — store a timestamp for every request in the last window. Perfectly accurate but memory-intensive at high QPS.
  • Sliding Window Counter — a weighted hybrid that approximates the sliding window using two fixed-window counters. Accurate enough for most use cases at a fraction of the memory cost.
The token bucket is the variant most worth being able to write from memory, because it maps cleanly onto Redis:

import time
import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

def is_allowed_token_bucket(
    user_id: str,
    max_tokens: int = 10,
    refill_rate: float = 1.0,  # tokens per second
) -> bool:
    """
    Token bucket rate limiter using Redis.
    Returns True if the request is allowed, False if rate-limited.
    """
    key = f"rate_limit:token_bucket:{user_id}"
    now = time.time()

    # Lua script ensures atomic read-modify-write — critical for correctness
    lua_script = """
    local key = KEYS[1]
    local max_tokens = tonumber(ARGV[1])
    local refill_rate = tonumber(ARGV[2])
    local now = tonumber(ARGV[3])

    local data = redis.call('HMGET', key, 'tokens', 'last_refill')
    local tokens = tonumber(data[1]) or max_tokens
    local last_refill = tonumber(data[2]) or now

    -- Add tokens based on elapsed time
    local elapsed = now - last_refill
    tokens = math.min(max_tokens, tokens + elapsed * refill_rate)

    if tokens >= 1 then
        tokens = tokens - 1  -- consume one token
        redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
        redis.call('EXPIRE', key, 3600)  -- auto-expire idle keys
        return 1  -- allowed
    else
        redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
        return 0  -- rate limited
    end
    """
    result = r.eval(lua_script, 1, key, max_tokens, refill_rate, now)
    return result == 1

This snippet demonstrates three interview-critical concepts at once: the token bucket algorithm, the use of Redis for fast shared state, and the critical role of Lua scripting to guarantee atomic read-modify-write operations. Without atomicity, two simultaneous requests could both read a token count of 1, both decrement it, and both be allowed — breaking the rate limit guarantee.

Distributed Rate Limiting: The Hard Part

The lesson covers what happens when your rate limiter runs on multiple nodes. A local in-memory counter is perfectly accurate but only sees the traffic hitting that one node. A centralized Redis counter is accurate across nodes but introduces network latency on every request and creates a single point of failure. The lesson explores eventual consistency approaches — where each node keeps a local counter and periodically syncs with a central store — and explains the trade-off: you may briefly over-allow requests near the limit, but you gain resilience and reduced latency.

🤔 Did you know? Cloudflare's public engineering write-up on their rate limiter revealed they use a sliding window counter approximation that is accurate to within 0.003% of the true sliding window, a margin of error so small it is imperceptible in practice, achieved at dramatically lower memory cost.
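That weighted approximation is simple enough to sketch in a few lines. The weighting below follows the commonly described scheme, which assumes the previous window's requests were spread evenly across it; the limits and counts in the example are illustrative:

```python
def sliding_window_allowed(
    prev_window_count: int,
    curr_window_count: int,
    elapsed_fraction: float,  # how far into the current window we are, 0.0-1.0
    limit: int,
) -> bool:
    """
    Sliding window counter: weight the previous fixed window by how much of
    it still overlaps the sliding window, then add the current window's count.
    """
    weighted = prev_window_count * (1.0 - elapsed_fraction) + curr_window_count
    return weighted < limit

# Limit: 100 requests/minute. 30 seconds into the current minute, the
# previous minute's 84 requests count at weight 0.5.
print(sliding_window_allowed(84, 36, 0.5, 100))  # 84*0.5 + 36 = 78  -> True
print(sliding_window_allowed(84, 70, 0.5, 100))  # 84*0.5 + 70 = 112 -> False
```

Only two counters per key are stored, regardless of traffic volume, which is the entire memory advantage over the sliding window log's one-timestamp-per-request bookkeeping.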


The Practice Imperative: Speak Your Design Out Loud

Here is a truth that most interview preparation resources underemphasize: reading about system design and doing system design interviews are completely different cognitive activities. Reading activates recognition. Interviews require production — generating structured thoughts, verbalizing trade-offs, drawing diagrams, and responding to pushback in real time.

The most effective practice is to close this lesson, open a blank whiteboard (physical or digital), set a 45-minute timer, and design a URL shortener as if an interviewer is watching. Say everything out loud. "I'm going to start by clarifying requirements. I want to know whether analytics is in scope, what the expected QPS is, and whether we need custom aliases." Then keep talking.

💡 Pro Tip: Record yourself on your phone. Watch it back. You will immediately notice the moments where you went silent, jumped to a conclusion without justification, or skipped the trade-off discussion. Those gaps are exactly what interviewers notice.

❌ Wrong thinking: "I'll read more resources and practice interviews after I feel ready." ✅ Correct thinking: "I'll practice out loud now, while the material is fresh, and use the discomfort to identify gaps."

A structured solo practice session might look like:

  1. Minute 0–5: Clarify requirements (talk to yourself as if the interviewer is there)
  2. Minute 5–10: Estimate scale — QPS, storage, memory — and write the numbers down
  3. Minute 10–15: Define the API endpoints with inputs and outputs
  4. Minute 15–25: Draw the high-level architecture and label each component
  5. Minute 25–35: Deep dive on the hardest sub-problem (ID generation or algorithm choice)
  6. Minute 35–40: Articulate trade-offs for your top 3 decisions
  7. Minute 40–45: Identify failure modes and scaling bottlenecks

🧠 Mnemonic: READ-TF: Requirements, Estimate scale, API, Design (high-level, then deep dive), Trade-offs, Failures. Run this sequence in every practice session until it becomes automatic.


Summary: The Three Shifts This Lesson Creates

Let's close by naming three concrete shifts in how you now think about system design problems — because articulating them will help cement them.

Shift 1: From answers to frameworks. Before this lesson, you might have tried to memorize "the answer" to a URL shortener question. Now you know there is no single answer — there are trade-off-laden decisions, and the interviewer is evaluating your reasoning process, not your memorized solution.

Shift 2: From components to constraints. Before, you might have described a system as "a load balancer, a few application servers, and a database." Now you lead with the dominant constraint — read-heavy versus write-heavy, latency-sensitive versus throughput-optimized — and let that constraint drive the component choices.

Shift 3: From fear of the unknown to curiosity about trade-offs. When an interviewer says "but what if Redis goes down?" it can feel like an attack. Now you recognize it as an invitation to demonstrate that you have thought about failure modes. The answer is not "it won't go down" — it is "here's how we handle it, and here's the trade-off that approach introduces."

⚠️ Final critical point to remember: In a system design interview, a well-reasoned wrong answer beats a confidently stated right answer with no justification. Interviewers are hiring engineers who will make thousands of design decisions they can defend. Show your reasoning at every step.


Practical Next Steps

🔧 Next Step 1: Do the sub-topic lessons. The URL Shortener lesson and the Rate Limiter lesson that follow this one are structured to build directly on what you have learned here. Approach them not as isolated topics but as concrete instantiations of the shared framework you now carry.

📚 Next Step 2: Build a toy version of each system. Nothing cements understanding like implementation. Spend two hours building a URL shortener in your preferred language — even without a proper database, just enough to see the base-62 encoding, the redirect, and the cache lookup work end-to-end. Then build a token bucket rate limiter middleware for a simple web server. The experience of debugging the atomicity issue will make it unforgettable.

🎯 Next Step 3: Find a practice partner. The most effective preparation involves another person playing the role of interviewer — asking "why not X?" and "what happens when Y fails?" If you don't have a colleague, structured mock interview platforms or community forums can serve the same function. The goal is to practice producing explanations under social pressure, which is the actual condition of a real interview.

You now have the vocabulary, the mental models, the estimation techniques, the architectural patterns, and the pitfall awareness to approach URL shortener and rate limiter questions with genuine confidence. The rest is practice.