Core Building Blocks
Learn the essential infrastructure components used repeatedly in system design solutions.
Why Building Blocks Are the Foundation of System Design
Imagine you're handed a blank whiteboard and asked to design Twitter from scratch. Millions of users. Billions of tweets. Real-time feeds. Where do you even begin? If you've ever felt that surge of panic — or quietly wondered what interviewers are actually looking for when they ask these questions — you're in exactly the right place. The secret that separates strong candidates from great ones isn't knowing every detail of every system. It's knowing the building blocks — the proven, reusable components that engineers have used to solve the same categories of problems for decades.
System design interviews aren't pop quizzes about trivia. They're conversations about how you think. And the way you demonstrate strong thinking is by reaching into a mental toolkit of well-understood components — load balancers, caches, CDNs, API gateways, message queues, databases, and more — and assembling them thoughtfully into a coherent architecture. This section explains why that toolkit exists, why interviewers care about it so deeply, and how you can start building your own fluency with it.
What System Design Interviews Are Really Testing
Before we dive into the components themselves, it's worth pausing on a question that most preparation guides skip entirely: what is a system design interview actually measuring?
It is not measuring whether you've memorized the internal architecture of Amazon DynamoDB. It is not measuring whether you know the exact default timeout on an Nginx proxy. What it is measuring is your engineering judgment — your ability to look at a set of requirements and reason clearly about which tools to reach for, which trade-offs matter, and how components interact under load or failure.
🎯 Key Principle: System design interviews assess your ability to compose well-known components into cohesive architectures, not your ability to invent new ones from scratch.
This distinction matters enormously. It means that the best preparation is not learning obscure facts — it's developing a clear, confident mental model of what each building block does, when to use it, and what you give up when you choose it. When a senior engineer at a top company designs a new feature, they don't reinvent caching or load balancing. They pull those solved problems off the shelf and focus their creative energy on the parts of the problem that are genuinely new.
💡 Real-World Example: When Netflix needed to handle streaming video to hundreds of millions of users worldwide, their engineers didn't build a novel data delivery mechanism. They reached for Content Delivery Networks (CDNs) — a well-understood building block — and combined them with smart caching strategies and adaptive bitrate streaming. The innovation was in how they composed known components, not in the components themselves.
Building Blocks Are Battle-Tested Primitives
The term primitive is borrowed from programming languages, where primitives are the simplest, most fundamental values: integers, booleans, strings. In system design, building blocks play the same role at the architectural level. They are the atoms from which more complex systems are constructed.
What makes a component earn the status of a building block? Three qualities:
- 🧠 Generality — It solves a problem that recurs across many different systems and domains.
- 📚 Proven reliability — It has been deployed at scale, battle-tested in production, and has well-understood failure modes.
- 🔧 Composability — It can be combined with other building blocks without requiring deep knowledge of their internals.
Consider a cache. Whether you're designing a social media platform, an e-commerce checkout system, a ride-sharing app, or an online learning platform, you will almost certainly need to reduce repeated expensive computations or database reads by storing results temporarily closer to the requester. The cache building block is the same across all of these contexts. The configuration changes. The data model changes. But the concept — store frequently accessed data in fast, temporary storage — is universal.
🤔 Did you know? The concept of caching dates back to the 1960s, when CPU designers began using small, fast memory regions to reduce access times to slower main memory. The same principle that speeds up a CPU cache is what powers Redis serving millions of web requests per second today.
Here's a simple illustration of how a cache sits in a request path. This is the kind of diagram you might sketch on a whiteboard during an interview:
Client Request
│
▼
┌─────────────┐
│ API Server │
└──────┬──────┘
│
▼
┌─────────────┐  Cache HIT    ┌─────────────┐
│   Cache     │──────────────►│   Return    │
│   (Redis)   │               │ Cached Val  │
└──────┬──────┘               └─────────────┘
│ Cache MISS
▼
┌─────────────┐
│ Database │
│ (Postgres) │
└──────┬──────┘
│
▼
Store in Cache
+ Return to Client
This flow — check cache, fall through to database on miss, store result, return response — is something you'll recognize in virtually every backend system you encounter. That's the power of a building block. Learn it once, and you'll see it everywhere.
Trade-Offs Matter More Than API Knowledge
Here's a mistake many candidates make when preparing for system design interviews: they spend their study time memorizing how to configure specific technologies rather than understanding why those technologies exist and what they cost you.
❌ Wrong thinking: "I need to memorize Redis commands so I can talk confidently about caching."
✅ Correct thinking: "I need to understand why caching improves performance, what data is safe to cache, how long to keep it, and what happens when cached data goes stale."
An interviewer at Google or Stripe doesn't care whether you know the syntax for SET key value EX 3600 in Redis. They care whether you can reason about cache invalidation — one of the famously hard problems in computer science — and whether you understand the trade-off between serving stale data (which is fast) and always fetching fresh data (which is slow but accurate).
💡 Mental Model: Think of every building block as a contract with two sides. On one side is what it gives you: speed, scalability, fault tolerance, simplified code. On the other side is what it costs you: operational complexity, potential for stale data, additional failure modes, latency overhead. Your job in a system design interview is to show you understand both sides of that contract.
Let's make this concrete with a small pseudocode example showing the trade-off you navigate when adding a cache:
# WITHOUT caching: every request hits the database
def get_user_profile(user_id):
    # This is slow — could take 20-100ms per query
    return database.query("SELECT * FROM users WHERE id = %s", user_id)

# WITH caching: fast for repeated reads, but requires a strategy for stale data
CACHE_TTL_SECONDS = 300  # 5 minutes — a judgment call with real trade-offs

def get_user_profile_cached(user_id):
    cache_key = f"user:profile:{user_id}"

    # Check cache first (sub-millisecond)
    cached = cache.get(cache_key)
    if cached:
        return cached  # Fast path — no DB hit

    # Cache miss: fetch from DB and populate cache
    profile = database.query("SELECT * FROM users WHERE id = %s", user_id)
    cache.set(cache_key, profile, ttl=CACHE_TTL_SECONDS)

    # Trade-off: user profile changes won't be visible for up to 5 minutes
    # Is that acceptable? Depends on your system requirements!
    return profile
Notice that the second version isn't just "better" — it introduces a real trade-off. If a user updates their display name, their old name might be served from cache for up to 5 minutes. Is that acceptable? It depends on the product. In a tweet feed? Probably fine. In a billing system showing payment information? Absolutely not. This is the kind of reasoning interviewers want to hear.
⚠️ Common Mistake 1: Treating building blocks as inherently good choices. Every building block adds complexity. Only reach for one when the trade-off is clearly worth it for your specific requirements.
A Mental Model That Scales to Unfamiliar Systems
One of the most powerful benefits of deeply understanding building blocks is that it gives you a framework for reasoning about systems you've never seen before. This is the skill that senior engineers develop over years of experience — and it's something you can accelerate deliberately.
Here's the core idea: almost every large-scale system is a composition of a small set of recurring building blocks, connected by well-understood patterns. When you encounter a new system, your job is to identify which building blocks are in use, why they were chosen, and how they interact.
🧠 Mnemonic — "CLAD-MQ": The six most common building block categories you'll encounter are Caches, Load Balancers, API Gateways, Databases, Message Queues, and CDNs (Content Delivery Networks), with the C in CLAD-MQ pulling double duty for Caches and CDNs. Commit CLAD-MQ to memory and you'll have an instant starting vocabulary for any system design conversation.
Let's see what this looks like in practice. Suppose you're asked in an interview to design a system you've never thought about before — say, a real-time collaborative document editor (think Google Docs). You've never built one. You might not even have thought deeply about how it works. But with a strong building block vocabulary, you can immediately start structuring your thinking:
Collaborative Editor — Mental Model Walkthrough
┌────────────────────────────────────────────────────────┐
│ Where do I route user traffic? │
│ → Load Balancer (distribute across app servers) │
├────────────────────────────────────────────────────────┤
│ How do I handle auth and routing logic centrally? │
│ → API Gateway (auth, rate limiting, routing) │
├────────────────────────────────────────────────────────┤
│ Where do I store document content? │
│ → Database (likely document store like MongoDB) │
├────────────────────────────────────────────────────────┤
│ How do I broadcast edits to all active collaborators? │
│ → Message Queue / Pub-Sub (Kafka or Redis Pub/Sub) │
├────────────────────────────────────────────────────────┤
│ How do I serve static assets (fonts, icons, JS)? │
│ → CDN (cache static assets at edge nodes) │
├────────────────────────────────────────────────────────┤
│ How do I store recent document state for fast access? │
│ → Cache (Redis, avoid DB reads for active docs) │
└────────────────────────────────────────────────────────┘
You haven't designed Google Docs — but you've started designing a collaborative editor, in real time, from first principles, using a handful of recognizable building blocks. That is exactly what a strong system design interview performance looks like. The building block vocabulary gives you traction on an unfamiliar slope.
💡 Pro Tip: When you get a system design prompt you've never seen before, your first move should be to mentally walk through your building block checklist. "Do I need to distribute load? Do I need to cache? Do I need async processing? Do I need global content delivery?" This structured approach signals maturity and prevents the blank-whiteboard panic.
How This Lesson Is Structured — and How It Connects
This lesson is designed to give you a complete, practical foundation in system design building blocks. Here's how the six sections work together:
📋 Quick Reference Card: Lesson Structure
| 🔧 Section | 📚 What You'll Learn |
|---|---|
| 🎯 Section 1 (this section) | Why building blocks matter and how to think about them |
| 🧠 Section 2 | The core building blocks — what each one is and does |
| 📈 Section 3 | How building blocks enable scalability and reliability |
| 🏗️ Section 4 | Composing building blocks in realistic design scenarios |
| ⚠️ Section 5 | Common pitfalls and misconceptions to avoid |
| 📋 Section 6 | Cheat sheet and interview quick-reference |
It's also worth understanding how this lesson connects to the broader roadmap. The building blocks you'll learn here don't exist in isolation — they interact deeply with databases (how you store and retrieve data at scale) and messaging systems (how components communicate asynchronously). When you design a system that needs to fan out a notification to millions of users, you're combining your knowledge of message queues (a building block) with your understanding of database write patterns. The building blocks are the vocabulary; databases and messaging are where that vocabulary gets put to its most sophisticated use.
Here's a simplified view of how the concepts connect across the roadmap:
Core Building Blocks
│
├──────────────────────────────────┐
│ │
▼ ▼
┌───────────────┐ ┌────────────────────┐
│ Databases │ │ Messaging Systems │
│ │ │ │
│ • SQL vs NoSQL│◄────────────────│ • Queues │
│ • Sharding │ (cache sits │ • Pub/Sub │
│ • Replication │ in front) │ • Event Streaming │
└───────────────┘ └────────────────────┘
│ │
└──────────────┬───────────────────┘
│
▼
Complete System Architecture
(URL Shortener, Twitter, etc.)
The arrows in this diagram reflect a key insight: building blocks are often the glue between databases and messaging systems. A cache reduces load on your database. A message queue decouples your API servers from your database writers. An API gateway routes requests to the right service before they ever touch a database. Understanding building blocks first gives you the context you need to understand why databases and messaging systems are designed the way they are.
The Mindset Shift That Changes Everything
Before we move into the details of individual building blocks, there's one final mindset shift worth making explicit — because it will color everything else in this lesson.
Most developers, early in their careers, think about system design in terms of technologies: "I'll use Redis for caching, PostgreSQL for the database, and Nginx for the proxy." This is technology-first thinking, and while it's not wrong, it puts the cart before the horse.
Strong system designers think in terms of problems and properties first, technologies second:
- What property does this part of the system need? (Low latency reads? High write throughput? Eventual consistency? Strong durability?)
- Which building block provides that property? (A cache for low-latency reads; a message queue for high write throughput; a CDN for globally fast static delivery.)
- Which specific technology implements that building block best for my constraints? (Redis vs. Memcached for caching; Kafka vs. RabbitMQ for queuing; Cloudflare vs. Fastly for CDN.)
💡 Mental Model: Think of building blocks as roles in a play. Redis, Memcached, and even an in-process hash map can all play the role of "cache." Kafka, RabbitMQ, and SQS can all play the role of "message queue." When you're designing a system, cast the roles first, then decide who plays them based on your specific production requirements.
This three-step approach — property → building block → technology — is precisely what interviewers see when a candidate is operating at the senior engineer level. It shows that you're not just pattern-matching to familiar tools; you're actually reasoning from requirements.
🎯 Key Principle: Know what problem each building block solves before you know which technology implements it. The abstraction is more durable than any specific tool.
With that foundation in place, you're ready to meet the building blocks themselves. In the next section, we'll survey each core component — load balancers, caches, CDNs, API gateways, proxies, and more — exploring what each one does, when to reach for it, and the trade-offs it introduces. By the end of that section, you'll have a complete mental inventory to draw from whenever a whiteboard appears in front of you.
The Core Building Blocks: What They Are and What They Do
Before you can design systems at scale, you need a working vocabulary — a toolkit of proven components whose behavior, trade-offs, and appropriate use cases you can reason about confidently. Just as a civil engineer doesn't reinvent the beam or the arch for every bridge, a software engineer draws on a set of battle-tested building blocks when composing distributed systems. In this section we'll walk through the five most important ones: load balancers, caches, CDNs, API gateways, and proxies. By the end, you'll be able to say not just what each component is, but why it exists and when to reach for it.
Load Balancers: Distributing the Weight
Imagine a single cashier at a grocery store. On a quiet Tuesday morning, one cashier is plenty. But on the day before Thanksgiving, a single cashier creates a catastrophic bottleneck. The solution is to open more checkout lanes — and then to direct customers to the shortest one. That is precisely what a load balancer does in a distributed system.
A load balancer sits in front of a pool of servers and distributes incoming requests across them. This serves two critical purposes: it prevents any single server from becoming overwhelmed (horizontal scaling), and it provides fault tolerance by routing traffic away from servers that have failed or become unhealthy.
Client Requests
│
▼
┌─────────────┐
│ Load Balancer│
└──────┬──────┘
│
┌────┴────┐
│ │
▼ ▼
┌───┐ ┌───┐ ┌───┐
│ S1│ │ S2│ │ S3│
└───┘ └───┘ └───┘
App Servers (backend pool)
Load balancers make routing decisions using a load balancing algorithm. The three most important to know are:
Round-Robin
Requests are sent to each server in sequence: S1, S2, S3, S1, S2, S3... This is simple and works well when all servers are equally capable and request processing time is roughly uniform. It fails when requests vary wildly in cost — a slow database-heavy request and a fast cached response are treated the same way.
Least Connections
The load balancer routes each new request to whichever server currently has the fewest active connections. This adapts better to variable request duration. If S2 is handling five long-running uploads and S3 is idle, new connections go to S3.
Consistent Hashing
Consistent hashing maps both servers and requests to positions on a conceptual ring. A request is routed to the closest server clockwise on the ring. The elegant property of this approach is that when a server is added or removed, only a fraction of requests need to be remapped — crucial for stateful systems like caches, where you want the same key to reliably land on the same node.
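To make the remapping property concrete, here's a minimal Python sketch of a hash ring. The MD5-based hash, the node names, and the omission of virtual nodes are all illustrative simplifications; production rings typically add virtual nodes for smoother key distribution.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Map a string to a position on the ring (0 .. 2^32 - 1)
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

class ConsistentHashRing:
    """Minimal consistent hashing ring (no virtual nodes, for clarity)."""

    def __init__(self, nodes=()):
        self.ring = []  # sorted list of (position, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str):
        bisect.insort(self.ring, (_hash(node), node))

    def remove_node(self, node: str):
        self.ring.remove((_hash(node), node))

    def get_node(self, key: str) -> str:
        # Walk clockwise: pick the first node at or after the key's position
        pos = _hash(key)
        idx = bisect.bisect_left(self.ring, (pos,))
        return self.ring[idx % len(self.ring)][1]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
before = {k: ring.get_node(k) for k in ("user:1", "user:2", "user:3", "user:4")}
ring.remove_node("cache-b")
after = {k: ring.get_node(k) for k in before}
# Only keys that lived on cache-b get remapped; every other key keeps
# its original node — the property that makes this scheme cache-friendly.
```

Compare this with a naive `hash(key) % num_servers` scheme, where removing one server remaps almost every key and effectively empties the entire cache at once.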
💡 Real-World Example: Netflix uses multiple layers of load balancing. At the DNS level, requests are routed to a regional cluster. Within that cluster, a software load balancer (Netflix's own Zuul) routes traffic to individual microservices. This layered approach lets them handle hundreds of millions of streams simultaneously.
⚠️ Common Mistake: Assuming a load balancer solves all scaling problems. A load balancer distributes traffic, but if your database is a single node being hammered by all your app servers, you've just moved the bottleneck downstream. Load balancing is one piece of the puzzle, not the whole picture.
Load balancers can be hardware-based (physical appliances like F5) or software-based (HAProxy, NGINX, AWS ALB/NLB). In modern cloud architectures, software load balancers are nearly universal due to their flexibility and cost.
Caches: Trading Space for Speed
Fetching data from a database involves disk I/O, network round-trips, query parsing, and execution — all of which are slow relative to reading from memory. A cache short-circuits this process by storing a copy of frequently-requested data in fast, in-memory storage so subsequent requests can be served without touching the database at all.
🎯 Key Principle: Caching works because of locality of reference — in most real-world systems, a small percentage of data is responsible for the vast majority of reads. The popular tweet, the trending product page, the home city weather — these are read thousands of times more often than the long tail of content.
The two dominant in-memory cache stores are Redis and Memcached. Redis is richer, supporting data structures like sorted sets, pub/sub messaging, and persistence. Memcached is simpler and sometimes faster for pure key-value caching. In practice, Redis has become the default choice for most new systems.
import redis
import json

# Connect to Redis cache
cache = redis.Redis(host='localhost', port=6379, db=0)

# db is assumed to be an existing database client (e.g., a psycopg2 wrapper)
def get_user_profile(user_id: str) -> dict:
    cache_key = f"user:profile:{user_id}"

    # 1. Try the cache first
    cached = cache.get(cache_key)
    if cached:
        print("Cache HIT")
        return json.loads(cached)  # Deserialize and return immediately

    # 2. Cache MISS — fetch from database
    print("Cache MISS — querying database")
    profile = db.query("SELECT * FROM users WHERE id = %s", user_id)

    # 3. Store in cache with a 5-minute TTL (time-to-live)
    cache.setex(cache_key, 300, json.dumps(profile))
    return profile
This pattern — check cache, fall back to database, populate cache on miss — is called the cache-aside (or lazy loading) pattern. It's the most common caching strategy because it only loads data that's actually requested, keeping the cache lean.
The TTL (time-to-live) is critical: it controls how long a cached entry is considered fresh. A TTL that's too short means the cache rarely helps. A TTL that's too long risks serving stale data — showing a user an outdated balance, an old product price, or deleted content.
⚠️ Common Mistake: Forgetting cache invalidation. When data changes in the database, stale cache entries must be explicitly removed or updated. Phil Karlton famously said, "There are only two hard things in Computer Science: cache invalidation and naming things." This is a warning, not a joke.
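One common remedy is to invalidate on write: update the database, then delete the cache entry so the next read repopulates it with fresh data. Here's a minimal sketch using plain dictionaries as stand-ins for Redis and the database; all names and seed data are illustrative.

```python
# Illustrative in-memory stand-ins for a real cache and database
cache = {}
db = {"user:42": {"name": "Ada"}}

def get_user(user_id: str) -> dict:
    key = f"user:{user_id}"
    if key in cache:              # cache hit: serve from memory
        return cache[key]
    profile = db[key]             # cache miss: read from the "database"
    cache[key] = profile          # populate for subsequent reads (cache-aside)
    return profile

def update_user(user_id: str, fields: dict):
    key = f"user:{user_id}"
    db[key] = {**db[key], **fields}   # 1. write to the database first
    cache.pop(key, None)              # 2. then invalidate the cache entry
    # The next read falls through to the DB and re-caches the fresh value.

profile = get_user("42")              # first read: miss, populates cache
update_user("42", {"name": "Grace"})  # write invalidates the stale entry
fresh = get_user("42")                # re-reads the DB, no stale "Ada" served
```

Note the ordering: writing the database before deleting the cache entry avoids a window where a reader could re-cache the old value after the new one was supposedly live.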
🧠 Mnemonic: Think of a cache as a sticky note on your monitor. It's faster to glance at the note than to search your filing cabinet. But if the underlying facts change, the note becomes wrong until you update it.
Beyond cache-aside, two other strategies appear frequently in interviews:
- Write-through cache: Every database write also updates the cache. Data is always fresh, but every write incurs the cache update cost.
- Write-behind (write-back) cache: Writes go to the cache first, and the database is updated asynchronously. Faster writes, but risk of data loss if the cache fails before the write persists.
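The behavioral difference between these two write strategies can be sketched with in-memory stand-ins: a dict for the cache and database, and a deque as the deferred-write queue. Everything here is illustrative.

```python
from collections import deque

cache, db = {}, {}
pending_writes = deque()  # used only by the write-behind variant

def write_through(key, value):
    db[key] = value      # synchronous database write...
    cache[key] = value   # ...and the cache is updated in the same operation

def write_behind(key, value):
    cache[key] = value                    # fast: only the cache is touched now
    pending_writes.append((key, value))   # the database write is deferred

def flush_pending():
    # Normally run by a background worker. Anything still queued here is
    # lost if the cache node dies before this runs — that's the trade-off.
    while pending_writes:
        key, value = pending_writes.popleft()
        db[key] = value

write_behind("session:9", {"cart": ["sku-1"]})
# The write is visible in the cache immediately, but it is not durable:
# it only reaches the database once flush_pending() runs.
```

The sketch makes the risk tangible: between `write_behind` and `flush_pending`, the database and cache disagree, and a cache failure in that window loses the write.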
CDNs: Bringing Content Closer to Users
Latency is physics. A request from a user in Sydney to a server in Virginia must travel roughly 16,000 kilometers. At the speed of light, that alone takes over 50 milliseconds each way — before accounting for network hops, routing, and processing. A Content Delivery Network (CDN) solves this by caching content at edge nodes distributed geographically around the world, so the user's request never has to travel to the origin server.
User in Tokyo User in London
│ │
▼ ▼
┌──────────┐ ┌──────────┐
│ Tokyo PoP│ │London PoP│ ← CDN Edge Nodes (Points of Presence)
└────┬─────┘ └────┬─────┘
│ (cache miss only) │
└──────────┬───────────┘
▼
┌─────────────┐
│ Origin Server│ ← Only consulted on cache miss
│ (Virginia) │
└─────────────┘
CDNs are purpose-built for static assets: images, videos, CSS files, JavaScript bundles, and downloadable files. These assets don't change per-request and can safely be cached at the edge for hours or days. When a user in Tokyo requests logo.png, the CDN's Tokyo edge node responds directly — the request never crosses the Pacific.
💡 Real-World Example: Cloudflare operates over 300 edge locations worldwide. When a website behind Cloudflare serves a static image, that image is cached in dozens of cities simultaneously. A user in São Paulo gets the image from a nearby Brazilian edge node in milliseconds, not from a server in New York.
Modern CDNs like Cloudflare, Akamai, and AWS CloudFront have expanded beyond static assets. They now support edge computing — running lightweight functions at the CDN edge to handle dynamic logic. But in a system design interview, the core mental model remains: CDNs offload static delivery and reduce latency through geographic distribution.
🤔 Did you know? Netflix encodes each video in dozens of quality levels and stores copies across CDN edge nodes globally. Over 90% of Netflix traffic is served directly from CDN nodes, never touching Netflix's origin servers.
The trade-off to understand: CDNs introduce a cache invalidation challenge at global scale. If you push a new version of your JavaScript bundle, you need to either bust the cache (by changing the filename, e.g., app.v2.js) or issue a purge command to invalidate the old cached version across all edge nodes.
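The filename approach is typically automated at build time by embedding a hash of the file's content in its name. A small sketch of the idea follows; the 8-character digest length is an arbitrary choice.

```python
import hashlib

def fingerprinted_name(filename: str, content: bytes) -> str:
    """Derive a cache-busting filename from a hash of the file's content."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    stem, _, ext = filename.rpartition(".")
    return f"{stem}.{digest}.{ext}"

v1 = fingerprinted_name("app.js", b"console.log('v1');")
v2 = fingerprinted_name("app.js", b"console.log('v2');")
# Any change to the bundle's bytes produces a different URL, so every CDN
# edge node treats the new build as a brand-new object — no global purge
# command needed, and old clients keep working off the old URL.
```

This is exactly why production asset URLs tend to look like app.3fa8b2c1.js: the hash changes whenever the content does, and never otherwise.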
API Gateways: The Front Door of Your System
As systems grow into collections of microservices, a new problem emerges: clients shouldn't need to know about every individual service, its location, its authentication mechanism, or its protocol. An API gateway acts as the single entry point for all client traffic — a smart front door that handles the cross-cutting concerns that every service would otherwise have to implement independently.
Clients (Web, Mobile, Third-party)
│
▼
┌─────────────────┐
│ API Gateway │
│ ┌───────────┐ │
│ │ Auth │ │ ← JWT validation, OAuth
│ │ Rate Limit│ │ ← 1000 req/min per user
│ │ Routing │ │ ← /users → User Service
│ │ Transform│ │ ← REST → gRPC
│ └───────────┘ │
└────────┬────────┘
┌────────────┼────────────┐
▼ ▼ ▼
┌────────────┐ ┌──────────┐ ┌──────────┐
│User Service│ │OrderSvc │ │PaymentSvc│
└────────────┘ └──────────┘ └──────────┘
An API gateway typically handles:
- 🔒 Authentication & Authorization: Validates tokens (JWT, OAuth) before requests reach backend services, so services don't each implement auth independently.
- 🎯 Rate Limiting: Enforces usage quotas (e.g., 100 requests per minute per API key), protecting backend services from abuse.
- 🔧 Routing: Maps incoming paths and methods to the appropriate microservice (GET /orders/123 → Order Service).
- 📚 Protocol Translation: Converts between client-facing protocols (REST/HTTP) and internal protocols (gRPC, Thrift) — clients don't need to know about internal communication protocols.
- 🧠 Request/Response Transformation: Aggregates responses from multiple services into a single response (the Backend for Frontend pattern).
// Simplified API Gateway route configuration (Express.js style pseudocode)
const gateway = express();

// Middleware: Authenticate all requests
gateway.use(validateJWT);

// Middleware: Rate limiting — 100 requests per minute per IP
gateway.use(rateLimit({ windowMs: 60_000, max: 100 }));

// Route: /api/users/* → User Service (internal gRPC)
gateway.all('/api/users/*', async (req, res) => {
  const grpcResponse = await userServiceClient.handleRequest({
    method: req.method,
    path: req.path,
    body: req.body,
    userId: req.user.id // Injected by JWT middleware
  });
  res.json(grpcResponse); // Translate gRPC response back to JSON/REST
});

// Route: /api/orders/* → Order Service
gateway.all('/api/orders/*', proxyTo('http://order-service:8080'));
This example shows how an API gateway centralizes authentication and rate limiting, then routes requests to appropriate backend services — including protocol translation from REST to gRPC.
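Rate limiting of this kind is commonly implemented with a token bucket algorithm. Here's a self-contained Python sketch; the capacity and refill rate are illustrative, and a production gateway would keep one bucket per client or API key (often in Redis so all gateway instances share state).

```python
import time

class TokenBucket:
    """Sketch of a token bucket rate limiter (e.g., ~5 requests/minute)."""

    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = float(capacity)       # start full
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens based on elapsed time, capped at bucket capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over quota: the gateway would respond with HTTP 429

bucket = TokenBucket(capacity=5, refill_per_sec=5 / 60)  # ~5 req/min
results = [bucket.allow() for _ in range(6)]
# The first 5 requests drain the bucket; the 6th is rejected until
# enough time passes for tokens to refill.
```

The appeal of the token bucket over a fixed per-minute counter is that it tolerates short bursts (up to `capacity`) while still enforcing the long-run average rate.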
💡 Pro Tip: In interviews, when you add an API gateway to your design, explicitly name the responsibilities you're assigning to it. Don't just say "there's an API gateway here." Say "the API gateway handles authentication, rate limiting, and routes traffic to our three microservices." This shows you understand its role, not just its existence.
⚠️ Common Mistake: Treating the API gateway as a service that can do heavy business logic. API gateways are infrastructure — they should be thin, fast, and focused on cross-cutting concerns. Business logic belongs in your services. An overloaded gateway becomes a bottleneck and a maintenance nightmare.
Popular API gateway solutions include Kong, AWS API Gateway, NGINX, and Apigee. In cloud-native environments, service meshes like Istio sometimes absorb some gateway responsibilities at the infrastructure level.
Proxies: Forward and Reverse
The word "proxy" means acting on behalf of another — and in networking, proxies do exactly that. The confusion arises because there are two very different kinds, and they serve opposite directions of traffic.
Forward Proxies: Acting for the Client
A forward proxy sits between clients and the internet, acting on behalf of the client. The client configures its traffic to go through the proxy, which then makes requests to external servers. The external server sees the proxy's IP, not the client's.
[Clients in corporate network]
│
▼
┌──────────────┐
│ Forward Proxy│ ← Acts on behalf of clients
└──────┬───────┘
│
▼
[Internet / External Servers]
Forward proxies are used to:
- Control outbound traffic in corporate environments (blocking social media, enforcing content policies)
- Anonymize client identity (VPNs are a form of forward proxy)
- Cache outbound requests to reduce bandwidth usage
Reverse Proxies: Acting for the Server
A reverse proxy sits in front of servers, acting on behalf of the server. Clients send requests to the proxy, believing they're talking to the server directly. The proxy forwards requests to backend servers and returns responses.
[External Clients / Internet]
│
▼
┌──────────────┐
│ Reverse Proxy│ ← Acts on behalf of servers
└──────┬───────┘
│
┌────┴────┐
▼ ▼
┌───┐ ┌───┐
│ S1│ │ S2│ ← Backend servers (hidden from clients)
└───┘ └───┘
Reverse proxies are ubiquitous in production systems. They:
- Protect backend servers by hiding their identities and absorbing direct internet exposure
- Terminate SSL/TLS, offloading cryptographic overhead from application servers
- Cache responses, reducing load on backend servers
- Compress responses before delivery to clients
- Provide load balancing (in fact, many load balancers are implemented as reverse proxies)
💡 Mental Model: Think of a forward proxy as a travel agent — you tell the agent where you want to go, and they handle the interaction with the airline. Think of a reverse proxy as a hotel front desk — guests interact with the desk, not directly with housekeeping, the kitchen, or maintenance. The desk routes requests internally and presents a unified face to guests.
🤔 Did you know? NGINX is simultaneously one of the most popular web servers, load balancers, and reverse proxies in the world. The same tool serves all three roles in different configurations. This is why understanding the conceptual distinction matters more than memorizing tool names.
❌ Wrong thinking: "A reverse proxy and a load balancer are different things."
✅ Correct thinking: Load balancing is a function that a reverse proxy can perform. A reverse proxy may also do SSL termination, caching, and compression. A load balancer in the traditional sense is often a reverse proxy with load balancing as its primary focus.
Putting It Together: A Quick Reference
Before we move on to how these building blocks combine to enable scalability and reliability, here's a consolidated view of each component's primary role and key trade-off:
📋 Quick Reference Card: Core Building Blocks
| Component | 🎯 Primary Role | ⚠️ Key Trade-off |
|---|---|---|
| 🔧 Load Balancer | Distribute traffic across servers | Adds a network hop; can itself be a SPOF if not redundant |
| 🧠 Cache | Reduce latency by serving data from memory | Stale data risk; cache invalidation complexity |
| 🌐 CDN | Deliver static assets from nearby edge nodes | Cache invalidation at global scale; not suited for dynamic content |
| 🔒 API Gateway | Single entry point for routing, auth, rate limiting | Can become a bottleneck; must be highly available |
| 📚 Reverse Proxy | Protect and represent backend services | Added complexity; SSL termination point |
Each of these components has a well-defined job. They are composable — in a real system, you'll use several of them together. A user request might pass through a CDN edge node, then a load balancer, then an API gateway, before hitting an application server that reads from a cache. Understanding what each layer contributes — and what problem it was designed to solve — is the foundation of all system design reasoning.
With these five building blocks clearly in view, we're ready to explore how they work together to achieve the core goals of system design: scalability, reliability, and fault tolerance — which is exactly where we're headed in the next section.
How Building Blocks Enable Scalability and Reliability
Understanding individual building blocks is only half the battle. The deeper insight — the one that separates strong system design candidates from great ones — is understanding why those building blocks exist and what engineering problems they are designed to solve. Every load balancer, every cache cluster, every replicated database is an answer to the same underlying questions: How do we handle more traffic? How do we survive failures? How do we keep the system running when things go wrong?
This section explores the core principles — horizontal scaling, replication, redundancy, fault tolerance, and the algorithms that power them — that give building blocks their purpose. Along the way, you'll see how these principles manifest in real architectural decisions and in code.
Scaling Up vs. Scaling Out
When a system starts struggling under load, the first instinct is often to throw better hardware at it: more RAM, faster CPUs, larger disks. This is called vertical scaling (or "scaling up"), and it has a ceiling. Machines have maximum specs, hardware upgrades require downtime, and a single powerful machine is still a single point of failure.
Horizontal scaling (or "scaling out") takes the opposite approach: instead of making one machine bigger, you add more machines of roughly the same size. This model is the architectural backbone of nearly every large-scale distributed system in use today.
Vertical Scaling (Scale Up):
[Small Server] → [Bigger Server] → [Biggest Server] → ❌ Ceiling
Horizontal Scaling (Scale Out):
[Server] → [Server][Server] → [Server][Server][Server][Server] → ♾️ (near-unlimited)
The key enabler of horizontal scaling is the load balancer. Without something to distribute incoming requests across multiple servers, adding more machines would be invisible to clients — they'd still all hammer the same endpoint. A load balancer sits in front of your server pool and routes each request to an available instance, making the entire cluster appear as a single service to the outside world.
🎯 Key Principle: Horizontal scaling only works if your application is stateless — meaning no individual server stores session state that other servers can't access. If a user's session lives on Server A and their next request goes to Server B, they might appear logged out. This is why session data is typically moved to a shared store (like Redis) when horizontal scaling is introduced.
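To make the stateless-server rule concrete, here is a minimal sketch, with a plain dict standing in for a shared store like Redis. All class and variable names here are illustrative, not from any framework; the point is that because session state lives outside the servers, either server can handle the next request:

```python
import uuid

# Stand-in for a shared session store such as Redis.
# In production this would be a networked store reachable by every server.
shared_sessions = {}

class AppServer:
    """A stateless app server: all session state lives in the shared store."""
    def __init__(self, name, session_store):
        self.name = name
        self.sessions = session_store  # shared, not per-server

    def login(self, user):
        token = str(uuid.uuid4())
        self.sessions[token] = {"user": user}
        return token

    def whoami(self, token):
        session = self.sessions.get(token)
        return session["user"] if session else None

# Two servers behind a load balancer share the same store
server_a = AppServer("A", shared_sessions)
server_b = AppServer("B", shared_sessions)

token = server_a.login("alice")   # request 1 lands on Server A
print(server_b.whoami(token))     # request 2 lands on Server B → "alice"
```

If each server kept its own private `sessions` dict instead, the second request would find no session and the user would appear logged out.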
Replication: Eliminating Single Points of Failure
Horizontal scaling solves the capacity problem. Replication solves the availability problem. A single point of failure (SPOF) is any component whose failure brings down the entire system. Replication eliminates SPOFs by copying data or services across multiple nodes so that no single failure is catastrophic.
There are two primary replication models:
- 🔧 Leader-follower (primary-replica) replication: One node accepts all writes and propagates changes to one or more read-only replicas. Reads can be distributed across replicas to reduce load on the leader.
- 🔧 Multi-leader (or leaderless) replication: Multiple nodes can accept writes, which introduces complexity around conflict resolution but enables higher write availability and geographic distribution.
Leader-Follower Replication:
┌───────────────┐
│ Application │
└──────┬────────┘
│
┌───────▼────────┐
│ Load Balancer │
└──┬─────────────┘
│
Writes │ Reads
┌─────▼──────┐ ┌──────────────┐ ┌──────────────┐
│ Leader │──►│ Replica 1 │ │ Replica 2 │
│ (Primary) │──►│ (Follower) │ │ (Follower) │
└────────────┘ └──────────────┘ └──────────────┘
Replication stream →
If the leader crashes, one of the replicas can be promoted to leader — a process called failover. Modern databases like PostgreSQL, MySQL, and MongoDB have tooling to automate this. But it introduces subtle dangers:
⚠️ Common Mistake: Assuming replication is synchronous by default. Most systems use asynchronous replication for performance, which means a follower may lag behind the leader. If a leader fails and a follower is promoted before catching up, recently committed writes may be lost. This is a real-world trade-off between availability and durability.
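A tiny simulation makes the danger concrete. This sketch is illustrative only (real replication streams involve log sequence numbers, acknowledgments, and much more): a follower is promoted before it has caught up, and the leader's most recent committed write is lost.

```python
class Leader:
    """Accepts writes; replication to followers is asynchronous."""
    def __init__(self):
        self.log = []          # committed writes
        self.replicated = 0    # index up to which followers have caught up

    def write(self, value):
        self.log.append(value)

    def replicate(self, follower, batch=1):
        # Ship at most `batch` pending writes to the follower.
        pending = self.log[self.replicated:self.replicated + batch]
        follower.log.extend(pending)
        self.replicated += len(pending)

class Follower:
    def __init__(self):
        self.log = []

leader, follower = Leader(), Follower()
for v in ["w1", "w2", "w3"]:
    leader.write(v)             # all three writes are committed on the leader
leader.replicate(follower, batch=2)   # but only w1, w2 have shipped so far

# Leader crashes here; the lagging follower is promoted anyway.
promoted_log = follower.log
lost = leader.log[len(promoted_log):]
print(f"Promoted leader has: {promoted_log}")   # ['w1', 'w2']
print(f"Writes lost in failover: {lost}")       # ['w3']
```

Synchronous replication avoids this loss by not acknowledging a write until a follower has it, at the cost of higher write latency and reduced availability when replicas are slow.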
Consistent Hashing: The Algorithm Behind Scalable Distribution
Imagine you have a cache cluster with 4 nodes and 1 million keys distributed across them using simple modulo hashing: node = hash(key) % 4. This works fine — until you add a 5th node. Now hash(key) % 5 maps almost every key to a different node than before. Suddenly, nearly 80% of your cache is invalidated and every request falls through to the database. Under heavy traffic, this can cause a thundering herd problem that takes down your system.
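You can verify the scale of the damage in a few lines of Python. This quick sketch (using MD5 purely as a stable hash for illustration) counts how many keys change nodes when a fifth node is added under modulo placement:

```python
from hashlib import md5

def node_for(key, num_nodes):
    """Naive modulo placement: hash(key) % num_nodes."""
    h = int(md5(key.encode()).hexdigest(), 16)
    return h % num_nodes

keys = [f"user:{i}" for i in range(100_000)]
before = {k: node_for(k, 4) for k in keys}   # 4-node cluster
after = {k: node_for(k, 5) for k in keys}    # add a 5th node

remapped = sum(1 for k in keys if before[k] != after[k])
print(f"Keys remapped: {remapped / len(keys):.0%}")  # ~80%
```

The result follows from the arithmetic: a key stays put only when `hash % 4 == hash % 5`, which holds for just 4 of every 20 hash values, so roughly 80% of keys move.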
Consistent hashing solves this elegantly. Instead of mapping keys directly to nodes by count, it arranges both keys and nodes on a conceptual ring from 0 to 2^32. Each key is assigned to the first node clockwise from its position on the ring. When a node is added or removed, only the keys that were assigned to that node's section of the ring need to be remapped — typically 1/N of all keys, where N is the number of nodes.
Consistent Hash Ring (4 nodes):
0
┌──────┐
Node D │ │ Node A
│ │
270 ────────┤ ├──────── 90
│ │
Node C │ │ Node B
└──────┘
180
Key "user:42" hashes to position 105 → assigned to Node B (next clockwise)
Key "session:7" hashes to position 220 → assigned to Node C (next clockwise)
Add Node E at position 130:
- Only keys between Node A (90°) and Node E (130°) are remapped: they move from Node B to Node E
- All other keys are unaffected
Let's simulate this in Python to make it concrete:
import hashlib
import bisect
class ConsistentHashRing:
"""
A consistent hash ring that distributes keys across nodes.
Virtual nodes (replicas) improve balance across the ring.
"""
def __init__(self, virtual_nodes=150):
self.virtual_nodes = virtual_nodes # more virtual nodes = better balance
self.ring = {} # hash position → node name
self.sorted_keys = [] # sorted list of hash positions
def _hash(self, key: str) -> int:
"""Produce a consistent integer hash for any string key."""
return int(hashlib.md5(key.encode()).hexdigest(), 16)
def add_node(self, node: str):
"""Add a node (and its virtual copies) to the ring."""
for i in range(self.virtual_nodes):
virtual_key = f"{node}:vnode:{i}"
position = self._hash(virtual_key)
self.ring[position] = node
bisect.insort(self.sorted_keys, position) # keep list sorted
print(f"Added node '{node}' with {self.virtual_nodes} virtual nodes")
def remove_node(self, node: str):
"""Remove a node and its virtual copies from the ring."""
for i in range(self.virtual_nodes):
virtual_key = f"{node}:vnode:{i}"
position = self._hash(virtual_key)
del self.ring[position]
self.sorted_keys.remove(position)
print(f"Removed node '{node}'")
def get_node(self, key: str) -> str:
"""Find which node is responsible for a given key."""
if not self.ring:
raise Exception("No nodes in the ring")
position = self._hash(key)
        # Find the first node position >= the key's position (clockwise)
        idx = bisect.bisect_left(self.sorted_keys, position)
if idx == len(self.sorted_keys):
idx = 0 # wrap around the ring
return self.ring[self.sorted_keys[idx]]
## --- Demonstration ---
ring = ConsistentHashRing(virtual_nodes=150)
ring.add_node("cache-1")
ring.add_node("cache-2")
ring.add_node("cache-3")
## Sample 10 keys and record their assigned nodes
test_keys = [f"user:{i}" for i in range(10)]
before = {key: ring.get_node(key) for key in test_keys}
print("\nBefore adding cache-4:")
for key, node in before.items():
print(f" {key} → {node}")
## Now add a 4th node and see how few keys are remapped
ring.add_node("cache-4")
after = {key: ring.get_node(key) for key in test_keys}
remapped = [k for k in test_keys if before[k] != after[k]]
print(f"\nAfter adding cache-4:")
print(f" Keys remapped: {len(remapped)} / {len(test_keys)}")
print(f" Remapped keys: {remapped}")
This code creates a hash ring with three cache nodes, maps 10 sample keys, then adds a fourth node and counts how many keys moved. With consistent hashing, you'll typically see only 2–3 of the 10 keys remapped — compared to almost all of them with naive modulo hashing. The virtual nodes parameter is critical: more virtual nodes means each physical node is represented at more positions on the ring, leading to more even key distribution.
💡 Real-World Example: Amazon's DynamoDB, Apache Cassandra, and Memcached all use consistent hashing (or close variants) to distribute data across nodes. When an interview asks "how would you scale your cache layer?", mentioning consistent hashing demonstrates architectural depth.
Health Checks and Automatic Failover
Adding machines and replicating data only helps if the system can detect failures and respond to them automatically. This is where health checks and automatic failover come in — and they are built into almost every modern building block.
A health check is a periodic probe that a component (like a load balancer or service mesh) sends to a downstream node to verify it is alive and capable of handling requests. There are two primary types:
- 📚 Active health checks: The load balancer itself sends a request (typically an HTTP GET to /health) to each server on a schedule. If a server fails to respond within a timeout window a certain number of times, it is marked unhealthy and removed from the rotation.
- 📚 Passive health checks: The load balancer watches real traffic. If responses from a server are consistently slow or returning 5xx errors, it starts routing traffic elsewhere.
Load Balancer Health Check Flow:
┌──────────────────────────────────────────────┐
│ Load Balancer │
│ │
│ Every 10s: GET /health → each backend │
└──────┬──────────────┬────────────────┬───────┘
│ │ │
┌────▼────┐ ┌────▼────┐ ┌─────▼───┐
│ Server A│ │ Server B│ │ Server C│
│ ✅ OK │ │ ✅ OK │ │ ❌ DOWN│
└─────────┘ └─────────┘ └─────────┘
│ │
Receives traffic Receives traffic
Server C removed from pool
Alert sent to ops team
Traffic redistributed
Automatic failover is the broader mechanism that kicks in when a critical component fails. In a database context, this means a replica is automatically promoted to leader when the primary goes down. In a microservices context, it might mean traffic is rerouted to a backup region.
Here's a simplified simulation of health-check-based routing logic:
import random
import time
class SimpleLoadBalancer:
"""
Simulates a round-robin load balancer with health check logic.
In production, health checks would run in a background thread.
"""
def __init__(self, servers: list):
# Each server starts healthy
self.servers = {s: {"healthy": True, "failures": 0} for s in servers}
self.round_robin_idx = 0
self.failure_threshold = 3 # mark unhealthy after 3 consecutive failures
def health_check(self, server: str) -> bool:
"""Simulate a health probe — 20% chance a server is unavailable."""
is_up = random.random() > 0.2
state = self.servers[server]
if is_up:
state["failures"] = 0 # reset on success
state["healthy"] = True
else:
state["failures"] += 1
if state["failures"] >= self.failure_threshold:
state["healthy"] = False
print(f" ⚠️ Server '{server}' marked UNHEALTHY after {state['failures']} failures")
return state["healthy"]
def run_health_checks(self):
"""Probe all servers (would normally run on a timer)."""
for server in self.servers:
self.health_check(server)
def get_next_server(self) -> str:
"""Round-robin selection, skipping unhealthy servers."""
healthy = [s for s, info in self.servers.items() if info["healthy"]]
if not healthy:
raise Exception("No healthy servers available!")
server = healthy[self.round_robin_idx % len(healthy)]
self.round_robin_idx += 1
return server
## --- Demonstration ---
lb = SimpleLoadBalancer(["web-1", "web-2", "web-3"])
for round_num in range(5):
print(f"\n--- Round {round_num + 1}: Running health checks ---")
lb.run_health_checks()
healthy = [s for s, info in lb.servers.items() if info["healthy"]]
print(f" Healthy servers: {healthy}")
server = lb.get_next_server()
print(f" Routing request to: {server}")
This simulation shows the core logic behind any production load balancer: periodically probe servers, track consecutive failures, and exclude unhealthy nodes from the routing pool. The failure_threshold parameter is important — a single blip shouldn't remove a server, but sustained failures definitely should.
🤔 Did you know? NGINX uses a concept called upstream blocks with configurable max_fails and fail_timeout parameters to implement exactly this kind of passive health checking. HAProxy supports both active and passive checks and can even check application-level metrics, not just TCP connectivity.
Redundancy: Designing for "When," Not "If"
Redundancy is the practice of duplicating critical components so that a backup is ready to take over when the primary fails. It's closely related to replication but broader in scope — redundancy applies to hardware (redundant power supplies), networking (multiple ISP connections), geographic regions (multi-region deployments), and software services alike.
The key mental model shift redundancy requires is this:
❌ Wrong thinking: "Our servers rarely fail, so we don't need redundancy." ✅ Correct thinking: "At scale, hardware failure is not a matter of if but when. Design for the failure."
🎯 Key Principle: Redundancy and replication are proactive investments. You pay for them constantly (in hardware, operational complexity, and cost) so that when failures occur — and they will — the system keeps running without user impact.
A mature redundancy architecture typically has no single points of failure at any layer:
Redundant Multi-Layer Architecture:
Internet
│
┌───────▼────────┐
│ DNS (with │ ← Multiple DNS providers
│ failover IPs) │
└───────┬────────┘
│
┌───────▼────────┐
│ Load Balancer │ ← Active/Passive LB pair (VRRP/keepalived)
│ (Primary) │
└───────┬────────┘
┌───────▼────────┐
│ Load Balancer │ ← Standby, takes over if primary fails
│ (Standby) │
└───────┬────────┘
│
┌────────┴─────────┐
▼ ▼
[App Server 1] [App Server 2] ← Stateless; any can serve any request
│ │
└───────┬────────┘
▼
┌───────────────────────┐
│ DB Leader (Primary) │ ← Handles all writes
└───────────┬───────────┘
│ Async replication
┌───────────▼───────────┐
│ DB Replica (Standby) │ ← Auto-promoted on leader failure
└───────────────────────┘
💡 Mental Model: Think of redundancy like a spare tire. You don't use it every day, and carrying it has a cost (weight, space). But the moment you get a flat on a highway, you're deeply grateful it's there. The cost of not having it is always higher than the cost of carrying it.
🧠 Mnemonic: "SRRF" — Scalability (horizontal), Replication, Redundancy, Failover. These are the four pillars that building blocks are engineered to support. When evaluating any building block in an interview, ask yourself which of these four properties it contributes to.
Fault Tolerance and Graceful Degradation
Fault tolerance is the system's ability to continue operating correctly — or at least partially — when one or more of its components fail. A fault-tolerant system doesn't just survive failures; it handles them gracefully, often without the end user noticing anything is wrong.
A key concept in fault tolerance is graceful degradation: when a component fails, the system reduces functionality rather than crashing entirely. For example:
- 🎯 A recommendation engine goes down → the app still loads, just without personalized recommendations
- 🎯 A payment processor is slow → the checkout page shows a spinner and retries rather than returning a 500 error
- 🎯 A cache cluster is unavailable → queries fall through to the database (slower, but correct)
Building blocks support fault tolerance through several mechanisms:
| Mechanism | Building Block | Effect |
|---|---|---|
| Circuit Breaker | API Gateway / Service Mesh | Stops sending requests to a failing service to prevent cascade failures |
| Retry with Backoff | API Gateway / Client SDKs | Retries transient failures with increasing delay |
| Bulkhead | Load Balancer / Container Orchestration | Isolates failures to one pool, protecting others |
| Timeout | Every network call | Prevents indefinite blocking when a downstream service hangs |
⚠️ Common Mistake: Treating timeouts as optional configuration. Every network call in a distributed system must have a timeout. Without timeouts, a slow downstream service can exhaust your thread pool as connections pile up waiting indefinitely, causing a cascading failure that takes down the entire service — not just the one having trouble.
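To illustrate the principle independently of any particular client library, here is a sketch that bounds a slow call with a deadline by running it on a worker thread. In real services you would set the timeout on the network client itself (socket, HTTP client, or database driver); the function and timing values below are illustrative assumptions:

```python
import concurrent.futures
import time

def slow_downstream_call():
    """Stands in for a downstream service that hangs."""
    time.sleep(2)
    return "response"

# Run the call on a worker thread and bound how long we wait for it.
pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
future = pool.submit(slow_downstream_call)
try:
    result = future.result(timeout=0.2)   # give up after 200 ms
except concurrent.futures.TimeoutError:
    result = "fallback"                   # degrade gracefully instead of blocking
pool.shutdown(wait=False)

print(result)  # "fallback"
```

Without the `timeout` argument, the caller would block for the full two seconds; multiply that by a thread pool's worth of concurrent requests and you have the cascading failure described above.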
💡 Pro Tip: In system design interviews, explicitly mentioning circuit breakers and graceful degradation when discussing fault tolerance signals senior-level thinking. Most candidates describe what happens when things work; the best candidates describe what happens when things break.
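If you want to describe a circuit breaker precisely in an interview, it helps to have its state machine in mind. Here is a minimal, illustrative sketch (not a production implementation; real deployments use battle-tested libraries) of the closed → open → half-open cycle:

```python
import time

class CircuitBreaker:
    """
    Minimal circuit breaker sketch.
    CLOSED: calls pass through.  OPEN: calls fail fast without touching
    the downstream.  HALF-OPEN: after a cooldown, one trial call is allowed.
    """
    def __init__(self, failure_threshold=3, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: half-open, allow one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the circuit again
        return result

# --- Demonstration ---
breaker = CircuitBreaker(failure_threshold=2, cooldown_seconds=30)

def flaky():
    raise ConnectionError("downstream unavailable")

for _ in range(2):               # two consecutive failures trip the breaker
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass

try:
    breaker.call(flaky)          # now fails fast, without calling downstream
except RuntimeError as e:
    print(e)                     # "circuit open: failing fast"
```

The fail-fast behavior is the whole point: while the circuit is open, the failing service gets breathing room to recover instead of being hammered by retries.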
Tying It All Together
The principles covered in this section — horizontal scaling, replication, consistent hashing, health checks, redundancy, and fault tolerance — are not independent ideas. They are deeply intertwined, each one reinforcing the others:
- Horizontal scaling requires load balancers to be effective
- Load balancers require health checks to route traffic correctly
- Scaling your cache requires consistent hashing to avoid cache stampedes
- Replication requires failover logic to be meaningful
- Redundancy requires monitoring and alerting to know when to activate
- Fault tolerance requires all of the above working in concert
When you approach a system design problem in an interview, you're not choosing one of these strategies. You're layering them. A well-designed URL shortener doesn't just have a database — it has a replicated database behind a load balancer with health checks, a cache layer using consistent hashing to avoid thundering herds, and circuit breakers protecting the application from downstream timeouts.
📋 Quick Reference Card: Scalability & Reliability Principles
| Principle | 🎯 Goal | 🔧 Key Building Block | ⚠️ Watch Out For |
|---|---|---|---|
| 🔄 Horizontal Scaling | Handle more traffic | Load Balancer | Stateful sessions |
| 📋 Replication | Eliminate data SPOFs | Database Replicas | Replication lag |
| 🔑 Consistent Hashing | Scale caches smoothly | Cache Cluster | Uneven distribution without vnodes |
| 💓 Health Checks | Route to live servers | Load Balancer / Proxy | Aggressive thresholds causing flapping |
| 🔁 Redundancy | Survive hardware failure | All critical components | Cost vs. criticality balance |
| 🛡️ Fault Tolerance | Degrade gracefully | API Gateway / Circuit Breaker | Missing timeouts |
With these principles firmly in mind, the next section builds on them by composing multiple building blocks into realistic system designs — showing not just what each block does, but why and when you'd reach for it.
Practical Patterns: Composing Building Blocks in Real Architectures
Understanding individual building blocks is necessary, but it is not sufficient. The real skill — the one that separates strong system design candidates from average ones — is knowing how to compose those blocks together to solve a specific problem. Just as a skilled architect chooses bricks, beams, and glass not randomly but according to the forces each must bear, a software architect arranges caches, queues, load balancers, and databases according to the nature of the traffic each must handle.
This section walks through two realistic scenarios — a URL shortener and a social media feed — and shows exactly how building blocks are assembled, why each component is placed where it is, and what failure modes each placement guards against. Along the way, we will look at CDN integration for media-heavy workloads and annotated pseudocode for a multi-region reverse proxy.
Scenario 1: Designing a URL Shortener
A URL shortener converts a long URL like https://www.example.com/some/very/long/path?query=param into a compact alias like https://sho.rt/aB3kZ. When a user visits the short URL, the system looks up the alias and redirects them. This problem is deceptively simple, which is exactly why interviewers love it — a correct solution requires every major building block working in concert.
The Request Path, End to End
Let's walk through the architecture from the moment a user clicks a short link.
User Browser
│
▼
┌─────────────┐
│ CDN / Edge │ (optional, for static assets)
└──────┬──────┘
│
▼
┌─────────────────┐
│ API Gateway │ Rate limiting, auth, routing
└────────┬────────┘
│
▼
┌─────────────────┐
│ Load Balancer │ Distributes across app servers
└────────┬────────┘
│
┌────┴────┐
▼ ▼
┌────────┐ ┌────────┐
│ App 1 │ │ App 2 │ Application servers (stateless)
└───┬────┘ └───┬────┘
│ │
└─────┬─────┘
│
▼
┌──────────────┐
│ Cache Layer │ Redis / Memcached
└──────┬───────┘
│ miss
▼
┌──────────────┐
│ Primary DB │ PostgreSQL / DynamoDB
└──────────────┘
Every component in this chain earns its place. The API gateway is the single entry point, and it enforces rate limiting so that a bot cannot hammer your redirect endpoint millions of times per second and exhaust your database. The load balancer distributes incoming redirect requests across multiple stateless application servers, allowing horizontal scaling — add more servers during peak traffic, remove them when demand drops.
The cache layer is the most important component for this specific system. URL shorteners are overwhelmingly read-heavy: once you shorten a URL, it may be clicked thousands of times, but the underlying mapping rarely changes. Placing a distributed cache like Redis between the application servers and the database means that a popular short code like aB3kZ is fetched from the database exactly once, stored in memory, and served from the cache for every subsequent request until the entry expires.
🎯 Key Principle: Match the component to the access pattern. A read-heavy system with a small, hot dataset is the ideal use case for an in-memory cache. If 90% of your traffic is for the top 1% of URLs, a cache hit rate above 99% is achievable.
Cache Integration in Code
Here is how a redirect handler implements a cache-aside (lazy-loading) pattern — the most common caching strategy for this problem:
import redis
import psycopg2
from flask import Flask, redirect, abort
app = Flask(__name__)
cache = redis.Redis(host='cache-cluster', port=6379, decode_responses=True)
db = psycopg2.connect(dsn="postgresql://user:pass@primary-db/urls")
CACHE_TTL_SECONDS = 3600 # Entries live in cache for 1 hour
@app.route('/<string:short_code>')
def redirect_url(short_code):
# Step 1: Check cache first (O(1) lookup, sub-millisecond latency)
cached_url = cache.get(f"url:{short_code}")
if cached_url:
return redirect(cached_url, code=301) # 301 = permanent redirect
# Step 2: Cache miss — fall through to the database
cursor = db.cursor()
cursor.execute(
"SELECT original_url FROM urls WHERE short_code = %s",
(short_code,)
)
row = cursor.fetchone()
if row is None:
abort(404) # Short code does not exist
original_url = row[0]
# Step 3: Populate the cache so future requests are served from memory
cache.setex(f"url:{short_code}", CACHE_TTL_SECONDS, original_url)
return redirect(original_url, code=301)
This code embodies the cache-aside pattern: the application is responsible for reading from and writing to the cache. On a cache miss, it fetches from the database and then writes the result back to the cache. The TTL (time-to-live) of 3600 seconds ensures stale entries are eventually evicted, which matters if someone updates a shortened URL's destination.
⚠️ Common Mistake: A 301 Permanent Redirect tells the user's browser to cache the redirect indefinitely. This is great for performance but catastrophic if you ever need to change a URL's destination: the browser will never ask your server again. In production, a 302 Temporary Redirect gives you more control at the cost of slightly higher server load.
Scenario 2: The Social Media Feed — A Write-Heavy Challenge
A social media feed presents a fundamentally different challenge. When a user with 10 million followers posts a photo, your system must fan out that event — making it appear in 10 million individual feeds — potentially within seconds. This is a write-heavy system, or more precisely, a system that must handle write amplification at enormous scale.
Naively writing directly to a relational database for every feed update would create an immediate bottleneck. The solution requires careful placement of message queues and thoughtful replication strategies.
User Posts Photo
│
▼
┌─────────────┐
│ API Gateway │
└──────┬──────┘
│
▼
┌─────────────────┐
│ Post Service │ Validates, stores media metadata
└────────┬────────┘
│ Publishes event
▼
┌─────────────────────┐
│ Message Queue │ Kafka / SQS (decouples producer from consumers)
└──────────┬──────────┘
│ Consumed by Fan-out Workers
┌────┴─────────────┐
▼ ▼
┌──────────┐ ┌──────────┐
│ Worker 1 │ ... │ Worker N │ Fan-out workers (horizontally scaled)
└────┬─────┘ └────┬─────┘
│ │
▼ ▼
┌─────────────────────────────┐
│ Feed Cache (Redis) │ Pre-computed feed lists per user
└─────────────────────────────┘
│
▼ (cold cache or deep history)
┌─────────────────────────────┐
│ Feed Database (Cassandra) │ Replicated, append-optimized
└─────────────────────────────┘
The message queue is the critical insertion point here. Rather than having the Post Service synchronously write to 10 million rows, it publishes a single event to the queue and returns success to the user immediately. The queue decouples the producer (the Post Service) from the consumers (fan-out workers), so the producer is never blocked by downstream processing time.
🎯 Key Principle: Use a queue whenever you need to absorb write spikes, decouple services, or convert synchronous operations into asynchronous ones. Queues trade latency for throughput — a message may not be processed instantly, but your system won't collapse under load.
The fan-out workers consume messages from the queue and update each follower's pre-computed feed. Cassandra is well-suited here because it is optimized for append-heavy, wide-column data patterns and replicates data across nodes automatically. The replication factor — typically 3 in production — means that even if two nodes fail simultaneously, reads and writes continue without interruption.
💡 Real-World Example: Twitter historically used this fan-out-on-write model for most users but switched to fan-out-on-read for celebrities with massive follower counts (like Beyoncé or Barack Obama), because pre-computing 100 million feed entries on a single post was simply too expensive. Real systems often use hybrid approaches.
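The fan-out-on-write path can be sketched with in-memory stand-ins for the queue and feed cache. Everything here is hypothetical (a `deque` for Kafka/SQS, a dict for Redis, a hard-coded follower graph); the point is the decoupling between the producer and the workers:

```python
from collections import defaultdict, deque

# In-memory stand-ins for the real components.
event_queue = deque()                 # the message queue (Kafka / SQS)
feed_cache = defaultdict(list)        # follower_id -> list of post ids (Redis)
followers_of = {"celebrity": ["alice", "bob", "carol"]}  # follower graph

def post_service_publish(author, post_id):
    """Producer: enqueue one event and return to the user immediately."""
    event_queue.append({"author": author, "post_id": post_id})

def fanout_worker_step():
    """Consumer: pop one event and write it into every follower's feed."""
    event = event_queue.popleft()
    for follower in followers_of[event["author"]]:
        feed_cache[follower].insert(0, event["post_id"])  # newest first

post_service_publish("celebrity", "post-1")   # user posts; API returns instantly
fanout_worker_step()                          # worker fans out asynchronously
print(feed_cache["alice"])                    # ['post-1']
```

Note that the producer does a single O(1) enqueue regardless of follower count; the O(followers) work happens later, on horizontally scaled workers, which is exactly the write-amplification problem the queue absorbs.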
CDN Integration for User-Uploaded Media
Back to our social media feed: when a user uploads a photo, it needs to be served — potentially to millions of viewers worldwide — without destroying the origin server. This is where Content Delivery Networks (CDNs) enter the picture.
A CDN is a geographically distributed network of edge servers that cache content close to end users. Instead of every viewer's request traveling to a data center in Virginia, a user in Tokyo fetches the photo from a CDN node in Tokyo. The round-trip time drops from ~150ms to ~5ms, and the origin server serves only the first request from each geographic region.
┌─────────┐     ┌───────────────────┐     ┌─────────────────┐
│  User   │────▶│ CDN Edge (Tokyo)  │     │  Origin Server  │
│ (Tokyo) │     │    Cache HIT ✓    │     │   (Virginia)    │
└─────────┘     └───────────────────┘     └─────────────────┘
                 Returns cached            (not contacted
                 image instantly            on cache hit)

┌─────────┐     ┌───────────────────┐     ┌─────────────────┐
│  User   │────▶│ CDN Edge (Paris)  │────▶│  Origin Server  │
│ (Paris) │     │   Cache MISS ✗    │     │   (Virginia)    │
└─────────┘     └───────────────────┘     └─────────────────┘
                 Fetches from origin,      Serves original;
                 caches for next user      CDN caches it
The integration pattern is straightforward: when a user uploads a photo, the Post Service stores it in object storage (like AWS S3) and records the CDN URL (not the origin URL) in the database. Every subsequent reference to that image points to the CDN, so the origin is contacted only during the first access from each edge location.
⚠️ Common Mistake: Storing origin URLs in your database and then transforming them to CDN URLs at render time introduces an extra transformation step and creates inconsistencies if the CDN URL pattern ever changes. Store the CDN URL from the moment of upload.
🤔 Did you know? CDNs also protect against DDoS attacks. Because edge nodes absorb traffic before it reaches your origin, a volumetric attack is distributed across hundreds of edge locations rather than concentrated on your single origin server.
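The store-the-CDN-URL-at-upload pattern might look like the following sketch, with a dict standing in for object storage, a list standing in for the database, and a hypothetical CDN hostname:

```python
import uuid

CDN_BASE = "https://cdn.example-media.net"   # hypothetical CDN hostname

def handle_photo_upload(object_store, db, photo_bytes):
    """Store the object, then record the CDN URL (not the origin URL)."""
    object_key = f"photos/{uuid.uuid4()}.jpg"
    object_store[object_key] = photo_bytes          # stand-in for an S3 put
    cdn_url = f"{CDN_BASE}/{object_key}"
    db.append({"key": object_key, "url": cdn_url})  # persist the CDN URL directly
    return cdn_url

object_store, db = {}, []
url = handle_photo_upload(object_store, db, b"<jpeg bytes>")
print(url.startswith(CDN_BASE))  # True: every later read points at the edge
```

Because the database row already holds the edge-facing URL, no render-time rewriting is needed, which is precisely the inconsistency the common mistake above warns about.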
Annotated Pseudocode: Reverse Proxy Routing for Multi-Region Deployment
For large-scale systems serving a global audience, a single data center is both a performance bottleneck and a reliability risk. A multi-region deployment places application servers and databases in multiple geographic regions, and a reverse proxy at the edge routes each incoming request to the nearest healthy region.
Here is annotated pseudocode showing how a reverse proxy might implement intelligent routing and response caching across regions:
## nginx.conf — Reverse proxy configuration for multi-region deployment
## Define upstream server groups for each region
upstream region_us_east {
server app-us-east-1a.internal:8080 weight=5; # Primary: higher weight
server app-us-east-1b.internal:8080 weight=5;
server app-us-east-1c.internal:8080 weight=1 backup; # Backup: only used if primaries fail
}
upstream region_eu_west {
server app-eu-west-1a.internal:8080 weight=5;
server app-eu-west-1b.internal:8080 weight=5;
}
## Configure the shared proxy cache (stored on the proxy's disk)
proxy_cache_path /var/cache/nginx
levels=1:2
keys_zone=url_cache:100m # 100 MB shared memory for cache keys
max_size=10g # Max 10 GB of cached content on disk
inactive=60m # Evict entries not accessed in 60 minutes
use_temp_path=off;
server {
listen 443 ssl;
server_name api.example.com;
location /redirect/ {
# Route to nearest region using GeoIP module
# $geoip2_data_country_code is set by the ngx_http_geoip2_module
set $target_upstream region_us_east; # Default to US East
if ($geoip2_data_country_code ~* "^(GB|FR|DE|NL|BE)$") {
set $target_upstream region_eu_west; # Route European users to EU
}
proxy_pass http://$target_upstream;
# Cache responses: the cache key includes the URI only (not cookies/auth headers)
# This ensures that /redirect/aB3kZ always maps to the same cached response
proxy_cache url_cache;
proxy_cache_key "$request_uri"; # Cache keyed on path only
proxy_cache_valid 200 301 10m; # Cache successful redirects for 10 minutes
proxy_cache_valid 404 1m; # Cache not-founds briefly (to prevent DB hammering)
proxy_cache_use_stale error timeout; # Serve stale content if upstream is briefly down
# Add cache status header for debugging (shows HIT, MISS, or BYPASS)
add_header X-Cache-Status $upstream_cache_status;
}
location /health {
# Health check endpoint — never cached, always hits the upstream
proxy_cache off;
proxy_pass http://region_us_east;
}
}
Let's unpack what each part of this configuration accomplishes. The upstream blocks define pools of application servers. The weight directive tells the load balancer how to distribute traffic — a server with weight=5 receives five times more requests than one with weight=1. The backup flag means a server is only used when all non-backup servers are unavailable, acting as a hot standby.
The proxy_cache_path directive creates a shared cache on the proxy's disk. Unlike a CDN (which caches at geographically distributed edge nodes), this cache is co-located with the reverse proxy and primarily benefits requests arriving at the same data center. The keys_zone parameter allocates shared memory for cache metadata — nginx uses this in-memory index to check for cache hits without touching the disk.
The routing logic inside location /redirect/ uses GeoIP data to select the appropriate upstream pool. This is a simplified version of latency-based routing: users in Europe are served by the EU cluster, reducing their request round-trip time from ~100ms (transatlantic) to ~20ms (intra-continental).
💡 Pro Tip: The proxy_cache_use_stale error timeout directive is a resilience pattern worth remembering. If the upstream application server returns an error or times out, nginx will serve the last-known-good cached response rather than propagating the failure to the user. This won't work forever, but it buys time for the upstream to recover from transient failures.
🧠 Mnemonic: Think of the reverse proxy as a smart receptionist: it greets every visitor, looks up their request in the filing cabinet (cache), routes them to the right department (upstream region) if necessary, and files a copy of the response for the next person who asks the same question.
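The serve-stale pattern can be sketched outside nginx too. The class below is a deliberately simplified, hypothetical model of the behavior (not nginx's implementation): an expired entry is kept around and served only when the upstream fetch fails.

```python
import time

class StaleOnErrorCache:
    """Simplified model of the serve-stale-on-error resilience pattern.
    Expired entries are retained and served when the upstream fails."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, fetched_at)

    def get(self, key, fetch_upstream):
        entry = self._store.get(key)
        if entry is not None and time.time() - entry[1] < self.ttl:
            return entry[0], "HIT"
        try:
            value = fetch_upstream(key)
        except Exception:
            if entry is not None:      # upstream down, but we have a stale copy
                return entry[0], "STALE"
            raise                      # nothing cached: the failure propagates
        self._store[key] = (value, time.time())
        return value, "MISS"

cache = StaleOnErrorCache(ttl_seconds=600)
lookup_table = {"aB3kZ": "https://example.com/long-url"}

value, status = cache.get("aB3kZ", lookup_table.__getitem__)  # MISS: fills cache
value, status = cache.get("aB3kZ", lookup_table.__getitem__)  # HIT

def upstream_down(key):
    raise ConnectionError("upstream timed out")

cache._store["aB3kZ"] = (value, 0)                # force the entry to expire
value, status = cache.get("aB3kZ", upstream_down)
assert status == "STALE"                          # a stale answer beats an error page
```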
Putting It All Together: A Pattern Language for Composition
Across these scenarios, several recurring patterns emerge. Recognizing these patterns by name — and knowing when to apply them — is what makes a system design answer sound authoritative.
📋 Quick Reference Card: Composition Patterns
| 🔧 Pattern | 📚 When to Apply | ⚠️ Trade-off |
|---|---|---|
| 🗄️ Cache-Aside | Read-heavy, cacheable responses | Risk of stale data; app manages cache |
| 📨 Fan-Out via Queue | Write-heavy, high-follower broadcast | Eventual consistency; added latency |
| 🌍 CDN for Static Assets | Global audience, large media files | Cache invalidation complexity |
| 🔀 Geo-Based Routing | Multi-region, latency-sensitive | Regional failover logic needed |
| 🛡️ API Gateway Throttling | Public APIs, bot protection | Added hop; must tune rate limits |
| 📋 Hot Standby Upstream | High availability, zero-downtime deploys | Idle server costs capacity |
❌ Wrong thinking: "I'll add caching everywhere to make everything faster."
✅ Correct thinking: "I'll add a cache in front of this specific component because this data is read 1000x more often than it is written, and the cache hit rate will be high enough to justify the consistency complexity."
Each building block you add is a deliberate trade-off. A cache trades consistency for speed. A queue trades synchronous confirmation for throughput. A CDN trades centralized control for geographic reach. The best system designers — and the best interview candidates — articulate why each component is present, not just what it does.
💡 Remember: In a system design interview, your verbal narration matters as much as the diagram. When you draw a cache, say: "I'm placing a Redis cache here because the URL lookup is idempotent and read-heavy. A cache hit rate of roughly 95% would reduce database load by 20x, which matters because our primary database is the hardest component to scale horizontally."
That kind of reasoning — component choice justified by access pattern, quantified by rough estimates, connected to scaling constraints — is the hallmark of an experienced architect, and it is exactly what your interviewer is listening for.
Common Pitfalls and Misconceptions When Using Building Blocks
Knowing what a load balancer, cache, or CDN is represents only half the battle in a system design interview. The other half — the part that separates strong candidates from exceptional ones — is knowing when not to use them, how they can fail, and what misconceptions trip up even experienced engineers. Interviewers are not just testing your vocabulary; they are probing your judgment. This section walks through the most common errors candidates make when reasoning about core building blocks, explains why each mistake happens, and offers the corrected mental model you should carry into every design conversation.
Pitfall 1: Over-Engineering Before Understanding Requirements
Over-engineering is the single most common mistake in system design interviews, and it almost always stems from the same root cause: reaching for well-known components before establishing what the system actually needs to handle.
Imagine you are asked to design a simple URL shortener for an internal company tool used by 200 employees. A nervous candidate might immediately sketch out a CDN, an API gateway, three layers of caching, and a Kafka event stream — all before asking a single clarifying question. The interviewer watches this unfold and wonders: does this person know how to reason about scale, or do they just know buzzwords?
❌ Wrong thinking: "I'll add a CDN and API gateway up front because Netflix uses them, and it shows I know advanced concepts."
✅ Correct thinking: "Let me first establish request volume, data size, geographic distribution, and read/write ratios — then I'll justify each component I add."
The practical test for every building block you consider adding is this: "What specific problem does this solve for the requirements I've been given?" A CDN solves the problem of latency for geographically dispersed users fetching static or cacheable content. If your system serves 500 requests per day from a single office, a CDN introduces operational overhead and cost with zero measurable benefit.
## Think of your design like a function with a cost/benefit signature
def should_add_component(component: str, requirements: dict) -> bool:
    """
    Before adding any building block, evaluate it against actual requirements.
    """
    traffic_volume = requirements.get("requests_per_second", 0)
    geographic_spread = requirements.get("global_users", False)
    read_heavy = requirements.get("read_write_ratio", 1) > 5  # 5:1 reads to writes

    if component == "CDN":
        # CDN is justified when content is static/cacheable AND users are geographically distributed
        return geographic_spread and requirements.get("static_assets", False)
    if component == "API_GATEWAY":
        # API gateway is justified when you have multiple downstream services to route to
        return len(requirements.get("backend_services", [])) > 1
    if component == "CACHE":
        # Cache is justified when read volume is high and data has a reasonable TTL
        return read_heavy and traffic_volume > 1000

    return False  # Default: don't add it until justified
This pseudocode captures the essential discipline: treat each building block as a solution in search of a specific, articulated problem. In an interview, narrating this reasoning out loud — "I'm considering a CDN here because you mentioned users in Asia and Europe accessing the same image assets" — demonstrates far more sophistication than silently drawing boxes on a whiteboard.
💡 Pro Tip: Use the phrase "I'd add this if..." when you're unsure whether a component is warranted. It signals conditional thinking rather than reflexive complexity.
Pitfall 2: Assuming Cached Data Is Always Fresh
Cache invalidation is famously one of the two hard problems in computer science (the other being naming things). Candidates frequently introduce a cache into their design without specifying how stale data gets removed or updated. This is not a minor implementation detail — it is a correctness problem that can cause real harm in production systems.
Consider a social media profile system. You cache a user's profile data with a 1-hour TTL. The user updates their email address. For up to 60 minutes, every service reading from the cache will see the old email. Depending on the use case, this might be acceptable (display name) or catastrophic (billing email, security alerts).
⚠️ Common Mistake: Drawing a cache in your architecture diagram and then moving on, as if the cache automatically stays consistent with your database.
There are three dominant cache invalidation strategies you must be able to articulate and trade off:
- 🔧 TTL-based expiration: Data expires after a fixed time window. Simple to implement, but data can be stale for up to the full TTL duration.
- 🔧 Write-through invalidation: When a write happens to the database, the corresponding cache entry is explicitly deleted or updated. Stronger consistency, but adds complexity and write latency.
- 🔧 Event-driven invalidation: A message queue or change-data-capture mechanism notifies the cache layer of updates. Most robust, but operationally heavyweight.
import time

class CacheWithInvalidation:
    """
    Demonstrates write-through cache invalidation.
    On any write to the source of truth, the corresponding cache key is evicted.
    """
    def __init__(self, ttl_seconds: int = 3600):
        self._store = {}  # { key: (value, expiry_timestamp) }
        self.ttl = ttl_seconds

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None  # Cache miss
        value, expiry = entry
        if time.time() > expiry:
            del self._store[key]  # TTL expired, evict it
            return None
        return value

    def set(self, key: str, value):
        expiry = time.time() + self.ttl
        self._store[key] = (value, expiry)

    def invalidate(self, key: str):
        """Call this whenever the underlying database record is updated."""
        if key in self._store:
            del self._store[key]
            print(f"Cache invalidated for key: {key}")

## Usage pattern: write-through invalidation
cache = CacheWithInvalidation(ttl_seconds=3600)

def update_user_email(user_id: str, new_email: str, db):
    # Step 1: Update the source of truth
    db.execute("UPDATE users SET email = ? WHERE id = ?", (new_email, user_id))
    # Step 2: Immediately invalidate the stale cache entry
    cache.invalidate(f"user:{user_id}")
    # Next read will be a cache miss, fetching fresh data from DB
In an interview, the moment you draw a cache, the interviewer is mentally waiting for you to address invalidation. Get ahead of it: say explicitly which strategy you're choosing and why, and acknowledge the trade-off.
🎯 Key Principle: A cache without an invalidation strategy is a consistency liability, not a performance feature.
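The third strategy from the list above, event-driven invalidation, can be sketched with an in-process stand-in for the message bus. In production this role would be played by Kafka, Redis pub/sub, or a change-data-capture tool such as Debezium; everything below is an illustrative toy:

```python
class InvalidationBus:
    """Minimal in-process stand-in for a message queue / CDC pipeline."""
    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, event):
        for callback in self._subscribers:
            callback(event)

bus = InvalidationBus()
cache = {"user:42": {"email": "old@example.com"}}

# The cache layer subscribes to change events and evicts affected keys.
# With a real bus, every cache node across the fleet would subscribe.
bus.subscribe(lambda event: cache.pop(event["key"], None))

def update_user_email(user_id, new_email):
    # 1. Write to the source of truth (the DB write is elided in this sketch)
    # 2. Emit a change event; every subscribed cache evicts the key
    bus.publish({"key": f"user:{user_id}", "field": "email", "value": new_email})

update_user_email(42, "new@example.com")
assert "user:42" not in cache   # the next read misses and refetches fresh data
```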
Pitfall 3: Treating Load Balancers as Infinitely Scalable
A load balancer distributes incoming traffic across multiple backend servers, and most candidates understand this. What many miss is that the load balancer itself is a component in the request path — it can become a single point of failure (SPOF) or a throughput bottleneck if you don't account for its own scaling and redundancy.
Here is the flawed mental model, visualized:
[Flawed Design]

        Internet Traffic
               |
        [Load Balancer]   <-- Single instance. What happens when this fails?
         /     |     \
      [S1]   [S2]   [S3]
If the load balancer goes down, your three backend servers become completely unreachable, regardless of how healthy they are. A production-grade design addresses this with active-passive or active-active load balancer pairs, often coordinated using a protocol like VRRP (Virtual Router Redundancy Protocol) or managed by cloud providers through anycast DNS and health-checked routing.
[Corrected Design]

           Internet Traffic
                  |
           [DNS / Anycast]
            /           \
   [LB Primary]   [LB Secondary]   <-- Active-passive failover
            \           /
     [Backend Pool: S1, S2, S3]
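The failover decision in the corrected design can be modeled in a few lines. This is a deliberately simplified sketch of what VRRP or health-checked DNS routing accomplishes, not a real implementation:

```python
def route_request(balancers, is_healthy):
    """Return the first healthy load balancer in priority order.
    A real deployment would use VRRP or health-checked DNS for this;
    the function only models the decision."""
    for lb in balancers:
        if is_healthy(lb):
            return lb
    raise RuntimeError("all load balancers are down")

balancers = ["lb-primary", "lb-secondary"]

# Normal operation: everything flows through the primary
assert route_request(balancers, lambda lb: True) == "lb-primary"

# Primary fails its health check: traffic fails over to the secondary
assert route_request(balancers, lambda lb: lb != "lb-primary") == "lb-secondary"
```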
Beyond the SPOF problem, load balancers have throughput ceilings. A single software load balancer (like HAProxy or Nginx) on modest hardware might handle 50,000–100,000 requests per second. If your design is meant to handle millions of requests per second — as is the case for systems like Twitter's API or a major e-commerce checkout flow — you need to plan for horizontal scaling of the load balancing tier itself, often using DNS-level load balancing in front of multiple load balancer instances.
⚠️ Common Mistake: Adding load balancers to solve backend bottlenecks without considering whether the load balancer tier needs its own redundancy and scaling strategy.
💡 Real-World Example: AWS Elastic Load Balancer handles this transparently by scaling its underlying infrastructure automatically. But in an on-premises or interview scenario where you're designing from first principles, you must reason through it yourself.
🤔 Did you know? Google uses a system called Maglev as its software network load balancer, capable of handling millions of packets per second per machine, with consistent hashing to ensure session stickiness even as the load balancer cluster scales.
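Consistent hashing, the technique Maglev relies on for stickiness, is worth being able to sketch. Below is a toy hash ring (illustrative only; Maglev actually uses a precomputed lookup table, and other systems use rendezvous hashing). The key property it demonstrates: removing one backend only remaps the keys that hashed to it.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring with virtual nodes."""
    def __init__(self, nodes, vnodes=100):
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def lookup(self, key):
        # Walk clockwise to the first virtual node at or past the key's hash
        i = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[i][1]

ring = ConsistentHashRing(["lb-a", "lb-b", "lb-c"])
before = {f"conn-{i}": ring.lookup(f"conn-{i}") for i in range(1000)}

smaller = ConsistentHashRing(["lb-a", "lb-b"])  # lb-c removed from the cluster
moved = sum(1 for k, node in before.items()
            if node != "lb-c" and smaller.lookup(k) != node)

# Connections that weren't on lb-c stay exactly where they were —
# that is the session-stickiness win as the cluster scales up or down.
assert moved == 0
```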
Pitfall 4: Conflating Reverse Proxies with Load Balancers
This is one of the most persistent conceptual confusions in system design, and it is understandable — tools like Nginx and HAProxy can perform both functions, which blurs the line. But in a design interview, using the terms interchangeably reveals a gap in your understanding of why each component exists.
A reverse proxy sits in front of one or more servers and acts as an intermediary on behalf of those servers. Its defining characteristic is that it hides the identity and implementation details of backend servers from clients. Reverse proxies are responsible for:
- 🔒 SSL/TLS termination
- 📚 Request buffering and response caching
- 🔧 Compression and header manipulation
- 🎯 Routing requests to the appropriate backend service
A load balancer, by contrast, is specifically concerned with distributing traffic across multiple instances of the same service to achieve horizontal scale and fault tolerance. Its defining characteristic is the routing algorithm (round-robin, least-connections, IP hash, etc.).
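The routing algorithms named above can each be expressed in a line or two of Python. These are toy versions that ignore health checks and concurrency; they exist only to make the selection logic tangible:

```python
import hashlib
import itertools

servers = ["s1", "s2", "s3"]

# Round-robin: rotate through servers regardless of load
rr = itertools.cycle(servers)
assert [next(rr) for _ in range(4)] == ["s1", "s2", "s3", "s1"]

# Least-connections: route to the server with the fewest active connections
active_connections = {"s1": 12, "s2": 3, "s3": 7}
assert min(active_connections, key=active_connections.get) == "s2"

# IP hash: the same client IP always lands on the same server (stickiness)
def ip_hash(client_ip):
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

assert ip_hash("203.0.113.9") == ip_hash("203.0.113.9")
```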
The confusion arises because a reverse proxy can distribute traffic across backends — making it look like a load balancer — and an application-layer (L7) load balancer is, in implementation, typically a reverse proxy. But the conceptual intent differs:
| Component | 🔒 Primary Purpose | 📚 Secondary Capabilities |
|---|---|---|
| Reverse Proxy | 🔧 Shield backends, transform requests | 🎯 Basic routing, caching, SSL termination |
| Load Balancer | 🎯 Distribute load for scale/redundancy | 🔧 Health checks, session persistence |
❌ Wrong thinking: "I'll put a load balancer in front of my monolith to handle SSL termination." (SSL termination is a reverse proxy concern, not inherently a load balancing concern.)
✅ Correct thinking: "I'll use a reverse proxy for SSL termination, request routing, and response caching. Behind it, a load balancer distributes traffic across my three application server instances."
💡 Mental Model: Think of a reverse proxy as a translator and gatekeeper, and a load balancer as a traffic cop. In small systems, the same tool (Nginx) wears both hats. In large systems, these responsibilities are separated across dedicated infrastructure.
Pitfall 5: Ignoring Latency Introduced by Each Building Block Hop
Every component you add to a request path introduces latency. This seems obvious when stated directly, but candidates frequently draw architecture diagrams with five or six sequential hops — DNS resolver → CDN → API gateway → reverse proxy → load balancer → application server → cache → database — without stopping to reason about the cumulative effect on end-to-end response time.
The concept here is request path depth: the number of network hops, serialization/deserialization cycles, and queuing delays a request must traverse before a response is returned.
Typical Latency Budget for a Web Request (approximate, varies by infrastructure):
Component                        Added Latency
---------------------------------------------
DNS Resolution                   ~1-50ms (cached: ~1ms)
TCP Handshake (per hop)          ~0.5-5ms
TLS Termination                  ~1-10ms
CDN Edge Node                    ~1-5ms (on cache HIT)
API Gateway                      ~1-10ms
Load Balancer                    ~0.1-1ms
Application Server (fast op)     ~5-50ms
Cache Read (Redis, local)        ~0.1-2ms
Database Read (indexed)          ~1-10ms
---------------------------------------------
Cumulative (full path):          ~10-143ms just in infrastructure overhead
The implication is that adding an API gateway to a simple two-service architecture might add 5–10ms of overhead per request — which is perfectly acceptable if it buys you authentication, rate limiting, and routing. But if you stack five layers of intermediary services for a read-heavy endpoint that needs sub-10ms latency, you have architected yourself into an impossible constraint.
## Illustrating latency accumulation across hops
class RequestPath:
    def __init__(self):
        self.hops = []
        self.total_latency_ms = 0

    def add_hop(self, component: str, latency_ms: float, justification: str):
        self.hops.append({
            "component": component,
            "latency_ms": latency_ms,
            "justification": justification
        })
        self.total_latency_ms += latency_ms

    def evaluate(self, budget_ms: float):
        print(f"\nRequest Path Analysis (Budget: {budget_ms}ms)")
        print("-" * 50)
        for hop in self.hops:
            print(f"  + {hop['component']:20s} {hop['latency_ms']:5.1f}ms ({hop['justification']})")
        print(f"    {'TOTAL':20s} {self.total_latency_ms:5.1f}ms")
        if self.total_latency_ms > budget_ms:
            print(f"  ⚠️ OVER BUDGET by {self.total_latency_ms - budget_ms:.1f}ms")
        else:
            print(f"  ✅ Within budget ({budget_ms - self.total_latency_ms:.1f}ms headroom)")

## Modeling an over-engineered path
path = RequestPath()
path.add_hop("CDN", 3.0, "static assets - justified")
path.add_hop("API Gateway", 8.0, "auth + routing - justified for microservices")
path.add_hop("Load Balancer", 0.5, "traffic distribution - justified")
path.add_hop("Service Mesh", 6.0, "observability - unjustified for MVP")
path.add_hop("App Server", 15.0, "core business logic")
path.add_hop("Cache", 1.0, "read optimization - justified")
path.evaluate(budget_ms=25.0)
## Output shows the service mesh pushed us over the latency budget
Running this analysis mentally during an interview is a sign of mature engineering thinking. When you add a component, say: "This adds roughly X milliseconds to the request path — given our latency target of Y milliseconds, this is acceptable because it gives us [specific benefit]."
⚠️ Common Mistake: Assuming that because each individual component is fast, the cumulative path is also fast. Latency is additive. Ten 5ms hops equal 50ms of overhead before a single line of business logic executes.
🧠 Mnemonic: EACH HOP COSTS — Every Added Component Has Hops, Overhead, and Propagation Costs That Scale.
💡 Real-World Example: Facebook's engineering team has famously constrained internal RPC (remote procedure call) latency to single-digit milliseconds for core feed operations, explicitly tracking the number of service calls in the critical path and enforcing limits through architectural review.
Putting It All Together: The Discipline of Justified Design
All five pitfalls share a common thread: they stem from pattern-matching to familiar components rather than reasoning from requirements to solutions. The corrected approach in every case follows the same structure:
- Identify the specific problem the component solves (not the general problem it can solve)
- Quantify the trade-off — what does this component cost in latency, complexity, and operational burden?
- Confirm the requirement justifies the cost — does the problem exist at the scale you're designing for?
- Articulate the failure modes — how does this component fail, and what is your mitigation?
📋 Quick Reference Card: Pitfall Checklist
| 🔒 Pitfall | ❌ Wrong Approach | ✅ Corrected Approach |
|---|---|---|
| 🔧 Over-engineering | Add CDN/API gateway reflexively | Justify each component against stated requirements |
| 📚 Cache staleness | Assume cache stays fresh | Explicitly design TTL, write-through, or event-based invalidation |
| 🎯 LB as SPOF | Single load balancer instance | Plan for active-passive or active-active LB redundancy |
| 🔒 Proxy/LB conflation | Use terms interchangeably | Distinguish purpose: gatekeeper (proxy) vs. traffic cop (LB) |
| 🧠 Latency blindness | Add hops without cost accounting | Track cumulative latency against end-to-end budget |
Carrying this discipline into your interview does more than prevent mistakes — it demonstrates that you think like a production engineer, not just a systems architect on paper. Interviewers remember candidates who push back on their own initial designs, catch their own trade-offs, and articulate why each component earns its place in the system.
🎯 Key Principle: In system design, the components you choose not to include are just as important as the ones you do. Every justified omission is evidence of judgment.
Key Takeaways and Interview Cheat Sheet
You have now traveled through the full landscape of core building blocks — from understanding why interviewers test them, to surveying what each one does, to seeing how they compose into real architectures, to learning the pitfalls that sink otherwise strong candidates. This final section is your consolidation point: a place to lock in what you have learned, internalize the decision-making framework, and walk into any system design interview with a clear mental model you can draw on under pressure.
The goal of this section is not to introduce new concepts. It is to make everything you have already absorbed retrievable — the kind of fluent, confident recall that lets you say, "I'd put a CDN in front of that because latency for static assets is the bottleneck, and the trade-off is cache invalidation complexity, which we can manage by versioning asset URLs."
The Master Cheat Sheet: Every Building Block at a Glance
The table below is designed to be your pre-interview review card. Each row captures the what, the why, and — most critically — the trade-off conversation you should be prepared to have.
📋 Quick Reference Card: Core Building Blocks
| 🧱 Building Block | 🎯 Primary Purpose | ⚖️ Key Trade-Off | 🗣️ Interview Talking Point |
|---|---|---|---|
| 🔀 Load Balancer | Distribute traffic across multiple servers | Adds a single point of failure if not itself redundant; requires session affinity consideration | "I'd add a load balancer to enable horizontal scaling and remove any single server as a bottleneck — but I'd run two in active-passive to avoid making it the new SPOF." |
| ⚡ Cache | Reduce latency by serving repeated reads from memory | Stale data risk; cache invalidation is hard; memory cost | "A cache buys us sub-millisecond reads, but I need to define a TTL and an invalidation strategy — write-through if consistency matters, write-behind if throughput does." |
| 🌍 CDN | Serve static and cacheable content from edge locations close to users | Adds cost; cache invalidation at the edge is slow; not suitable for personalized or dynamic content | "For static assets I'd offload to a CDN to cut latency globally, using versioned URLs so we can invalidate instantly when assets change." |
| 🚪 API Gateway | Unified entry point for client requests — handles routing, auth, rate limiting | Becomes a bottleneck if not scaled; adds a hop of latency; complex configuration | "An API gateway lets me centralize cross-cutting concerns — auth, rate limiting, logging — so microservices don't each reinvent that logic." |
| 🔁 Reverse Proxy | Sit in front of servers to handle SSL termination, compression, and routing | Adds latency; another component to manage and scale | "I'd use a reverse proxy for SSL offloading so backend services don't each need to handle TLS overhead." |
| 🔭 Forward Proxy | Control and monitor outbound traffic from internal clients | Visibility only into what passes through it; clients must be configured to use it | "A forward proxy gives us egress control — useful for compliance environments where outbound traffic must be audited." |
| 🗄️ Database (coming next) | Persist and query structured or unstructured data | Consistency vs. availability trade-off (CAP theorem); read/write scalability requires sharding or replication | "I'll cover this in depth — the short answer is: choose the database whose consistency and query model match the access patterns, not just the one you know best." |
| 📨 Message Queue (coming next) | Decouple producers from consumers; enable async processing | Added latency; at-least-once vs. exactly-once delivery; ordering guarantees vary | "A queue decouples the write path from processing, which gives us backpressure handling and retry logic for free." |
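The "versioned URLs" tactic from the CDN row deserves a concrete sketch, since it comes up constantly in interviews. The helper below is hypothetical; real build pipelines (webpack, Rails' asset pipeline) do this automatically by baking a content hash into each asset filename:

```python
import hashlib

def versioned_url(cdn_base, path, content):
    """Bake a content hash into the asset URL. When the file changes,
    the URL changes, so CDN edges treat it as a brand-new object and
    no slow edge purge is required. (Hypothetical helper for illustration.)"""
    digest = hashlib.sha256(content).hexdigest()[:8]
    name, dot, ext = path.rpartition(".")
    return f"{cdn_base}/{name}-{digest}{dot}{ext}"

old = versioned_url("https://cdn.example.com", "img/logo.png", b"v1 bytes")
new = versioned_url("https://cdn.example.com", "img/logo.png", b"v2 bytes")

# Changed content produces a new URL: "invalidation" is instant, because
# clients simply start requesting an object the edge has never seen.
assert old != new
assert old.startswith("https://cdn.example.com/img/logo-") and old.endswith(".png")
```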
The Most Important Skill: Justifying Your Choices
Every experienced interviewer will tell you the same thing: the candidate who lists components loses to the candidate who reasons about components. Naming a CDN is table stakes. Explaining why this system, at this scale, with these access patterns, benefits from a CDN — and what you give up — is what separates a hire from a no-hire.
🎯 Key Principle: In a system design interview, your job is not to draw a correct diagram. Your job is to have a defensible conversation about trade-offs.
Here is the mental framework to internalize. Before adding any building block to your design, answer three questions:
- What problem does this solve specifically? (Not "it adds scalability" — what kind of scalability, for which bottleneck?)
- What does it cost? (Latency? Money? Operational complexity? Consistency risk?)
- Is that cost worth it at this scale? (A cache is overkill for a system with 10 RPS; it's essential at 100,000 RPS.)
The following pseudocode captures this reasoning loop as a decision algorithm you can mentally run during an interview:
## Mental decision loop for adding any building block
def should_add_component(component, system_context):
    """
    Run this thought process before adding any component to your design.
    Returns a justified decision, not just a yes/no.
    """
    problem_solved = identify_specific_bottleneck(component, system_context)
    # e.g., "Cache solves: DB read latency at 50k RPS for product catalog"

    cost_introduced = enumerate_trade_offs(component)
    # e.g., [
    #     "Stale data risk (TTL must be tuned)",
    #     "Cache invalidation complexity on writes",
    #     "Memory cost (~$X/GB in Redis)"
    # ]

    scale_justification = is_cost_worth_it(problem_solved, cost_introduced, system_context.scale)
    # e.g., "At 50k RPS, DB cannot handle read load alone — cache ROI is high"

    if scale_justification:
        return f"Add {component}: {problem_solved}. Accept trade-offs: {cost_introduced}."
    else:
        return f"Skip {component} for now: over-engineering at this scale."

## Example output for a cache decision:
##   "Add Cache: reduces DB read latency for hot product catalog at 50k RPS.
##    Accept trade-offs: ['stale data (TTL=60s acceptable)', 'invalidation on price updates']."
This is not code you would write in an interview. It is a thinking structure dressed in code. The act of writing it here makes it concrete and memorable.
💡 Pro Tip: When you say "I'd add a cache here," immediately follow with "because..." and then "the trade-off is..." This two-beat pattern signals to interviewers that you think in trade-offs, not just components.
Every Building Block Has a Cost — Always Justify It
One of the most clarifying mental models you can carry into any system design conversation is this: there are no free components. Every building block you add to a system introduces at least one of three costs:
- 🕐 Latency — another network hop, another serialization/deserialization cycle, another queue to drain
- 💰 Cost — another service to run, license, or pay a cloud provider for
- 🔧 Complexity — another failure mode, another configuration to maintain, another team member who needs to understand it
⚠️ Common Mistake: Adding components to look thorough. Candidates sometimes load their diagrams with caches, queues, CDNs, and API gateways for a system that handles 500 users per day. This signals a lack of engineering judgment, not expertise.
✅ Correct thinking: The right architecture is the simplest one that meets the requirements. Start minimal. Add complexity only when you can name the specific problem it solves.
Here is a concrete example of justifying a component addition versus over-engineering:
## Scenario: Designing a URL shortener for ~1,000 redirects/day

## ❌ OVER-ENGINEERED: Adding components that aren't justified
over_engineered_design = {
    "components": [
        "Load Balancer (2x active-passive)",
        "API Gateway with rate limiting",
        "Redis Cache cluster (3 nodes)",
        "CDN for redirect responses",
        "Message Queue for analytics",
        "Multiple application servers",
    ],
    "daily_requests": 1_000,
    "problem": "All this infrastructure costs $500/month and adds"
               " 5+ failure points for a system a single server handles trivially.",
}

## ✅ JUSTIFIED: Right-sized for the actual requirements
justified_design = {
    "components": [
        "Single application server",   # Handles 1k req/day with ease
        "Single database",             # Persistence for short URL mappings
        "Simple in-process cache",     # Cache hot URLs in memory, no Redis needed yet
    ],
    "daily_requests": 1_000,
    "scaling_trigger": "Add Redis cache when DB read latency becomes measurable."
                       " Add load balancer when single server CPU > 70% sustained.",
}

## The key is knowing WHEN to add each component, not just THAT it exists.
print("Build for today. Design for tomorrow. Don't pay for next year.")
This principle — build for today, design for tomorrow — is what distinguishes engineers who deliver value from engineers who build monuments to their own knowledge.
🤔 Did you know? Netflix's original architecture in 2007 was a monolith running on a few servers. The microservices and distributed system complexity came after they hit the scaling walls — not before. The building blocks were added because they were needed, not because they were fashionable.
What Comes Next: Databases and Message Queues
This lesson deliberately stopped at the network and infrastructure layer of building blocks — load balancers, caches, CDNs, proxies, and API gateways. The two building blocks you will interact with most in real system design interviews — databases and message queues — are covered in their own dedicated child lessons, and for good reason: each one contains enough depth to fill a lesson of its own.
Here is a preview of what those lessons will cover, and how they connect to what you have already learned:
┌─────────────────────────────────────────────────────────────────┐
│ BUILDING BLOCKS HIERARCHY │
├─────────────────────────────────────────────────────────────────┤
│ │
│ THIS LESSON (Infrastructure Layer) │
│ ┌─────────────┐ ┌────────────┐ ┌──────────┐ ┌───────────┐ │
│ │ Load │ │ Cache │ │ CDN │ │ API │ │
│ │ Balancer │ │ (Redis) │ │(CloudFront│ │ Gateway │ │
│ └─────────────┘ └────────────┘ └──────────┘ └───────────┘ │
│ │
│ NEXT LESSONS (Data & Async Layer) │
│ ┌─────────────────────────────┐ ┌──────────────────────────┐ │
│ │ DATABASES │ │ MESSAGE QUEUES │ │
│ │ • SQL vs NoSQL │ │ • Kafka, RabbitMQ, SQS │ │
│ │ • Replication & Sharding │ │ • At-least-once delivery │ │
│ │ • CAP Theorem │ │ • Consumer groups │ │
│ │ • ACID vs BASE │ │ • Dead letter queues │ │
│ └─────────────────────────────┘ └──────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
The infrastructure building blocks you now understand are the delivery layer — they get requests to your system efficiently and reliably. Databases and message queues are the processing and persistence layer — they determine what your system does with those requests once they arrive. Together, they form the complete toolkit for designing any production distributed system.
💡 Mental Model: Think of the infrastructure layer as the roads and traffic lights of a city. Databases and message queues are the warehouses and postal system. You need both, and they serve entirely different purposes.
Practice Exercise: Sketch a Three-Tier Web Application
Knowledge without application decays quickly. The single most effective thing you can do right now — before your next interview — is to sit down with a blank sheet of paper (or a whiteboard tool) and complete the following exercise.
The Exercise:
Sketch the architecture of a three-tier web application for an e-commerce product catalog service with the following requirements:
- Serves 100,000 product page requests per day globally
- Product data changes infrequently (a few hundred updates per day)
- Must handle traffic spikes during sales events (10x normal load)
- Needs to support mobile and web clients
Your sketch must include at least four building blocks from this lesson, and for each one you must be able to answer:
- Why did you add it?
- What problem does it solve at this scale?
- What trade-off does it introduce?
- At what scale would you reconsider or upgrade it?
Here is a starting template in pseudocode to seed your thinking:
## Three-Tier Architecture Skeleton — Fill in the reasoning
TIER 1 — PRESENTATION / EDGE LAYER
Component: _______________
Why here: _______________
Trade-off: _______________
Component: _______________
Why here: _______________
Trade-off: _______________
TIER 2 — APPLICATION LAYER
Component: _______________
Why here: _______________
Trade-off: _______________
Component: _______________
Why here: _______________
Trade-off: _______________
TIER 3 — DATA LAYER
Component: [Database — will be covered in next lesson]
Placeholder note: "SQL product DB with read replica
— justification pending database lesson"
## Example answer for ONE component to model your thinking:
## TIER 1 — CDN
## Why: 100k/day ≈ 1.16 RPS average; a 10x sales spike is ~11.6 RPS.
## Product images are static. CDN absorbs image traffic globally.
## Trade-off: Cache invalidation when product images change.
## Mitigated by versioning image URLs (product-img-v2.jpg).
## Scale reconsider: none at this scale — CDNs scale elastically
## and cost grows roughly in proportion to traffic served.
💡 Real-World Example: This exact exercise mirrors what happens in the first 10 minutes of a real system design interview at companies like Google, Amazon, and Meta. The interviewer gives you a vague scenario and watches how you decompose it into components with justified reasoning. Practicing this motion with simple systems builds the muscle memory you need for complex ones.
Final Recap: What You Now Understand
Before this lesson, you may have known the names of some building blocks. After this lesson, you understand them as a system — each one exists in relationship to others, each one solves specific classes of problems, and each one asks you to make an explicit trade-off.
Here is the distilled set of things you now know that you did not before:
🧠 You understand the role of each building block — not just what it is, but what problem in a distributed system it exists to solve.
📚 You understand how building blocks compose — a CDN, a load balancer, an API gateway, and a cache are not independent decisions. They form a pipeline, and the order and interaction between them matters.
🔧 You understand that every component has a cost — latency, money, or operational complexity — and that adding a component without naming its cost is a red flag in an interview.
🎯 You understand the correct interview posture — justify choices with scale and access patterns, name trade-offs explicitly, and demonstrate that you can reason about when not to add a component.
🔒 You understand what comes next — databases and message queues are the next layer of depth, and you are now equipped with the mental model to fit them into the architecture you already understand.
⚠️ Final critical point to remember: The candidate who says "I'd add a cache" gets a follow-up question. The candidate who says "I'd add a write-through Redis cache in front of the product database because read volume is 20x write volume, accepting the cost of ~50ms added latency on writes and the complexity of TTL management" closes the loop themselves. Always close the loop. ⚠️
🧠 Mnemonic: "Why, Cost, Scale" — three words to run through every time you add a component. Why does it help? What does it cost? Does the scale justify that cost?
Practical Next Steps
- 📋 Print or bookmark the cheat sheet table at the top of this section. Review it the day before any system design interview.
- 🔧 Complete the three-tier architecture exercise on paper or a whiteboard before your next interview. Time yourself: you should be able to sketch and justify a four-component architecture in under 10 minutes.
- 📚 Continue to the database and message queue lessons — they build directly on the framework you have established here. Pay particular attention to how databases interact with caches (cache-aside, write-through, write-behind patterns) and how message queues interact with application servers (fan-out, competing consumers patterns).
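As a preview of the competing-consumers pattern mentioned in that last step, here is a toy sketch using Python's standard library in place of a real broker like RabbitMQ or SQS. The worker names and message format are invented for illustration:

```python
import queue
import threading

# Competing consumers: several workers pull from ONE shared queue, so each
# message is processed exactly once by whichever worker grabs it first.
tasks = queue.Queue()
processed = []               # (worker_name, message) pairs
lock = threading.Lock()      # protects the shared results list

def worker(name: str) -> None:
    while True:
        try:
            msg = tasks.get(timeout=0.1)  # drain until the queue is empty
        except queue.Empty:
            return
        with lock:
            processed.append((name, msg))
        tasks.task_done()

for i in range(5):
    tasks.put(f"order-{i}")

threads = [threading.Thread(target=worker, args=(f"w{n}",)) for n in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert len(processed) == 5   # every message handled exactly once
```

The key property to notice: adding a second worker doubled throughput without duplicating work, because the queue hands each message to exactly one consumer. That is the mechanism behind scaling application servers off a message queue.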