Databases & Storage
Compare SQL vs NoSQL and understand when to use each type of storage system.
Why Storage Decisions Make or Break System Design
Imagine you're six months into building the next great social platform. Users are signing up. Engagement is climbing. Then one Tuesday morning, everything grinds to a halt. Pages time out. The database CPU pegs at 100%. Your on-call engineer is staring at a query that used to run in 2ms and now takes 45 seconds. The root cause? A storage decision made in week one that nobody questioned. Sound familiar? Whether you've lived this nightmare or heard it from a colleague, storage failures are devastatingly common, and almost always preventable. This section lays the groundwork for everything else in this lesson.
The truth is, database and storage decisions are the single most consequential architectural choices you'll make in any system. They touch every other layer: your API latency, your infrastructure cost, your ability to scale horizontally, your disaster recovery posture. Yet in system design interviews, candidates routinely treat storage as an afterthought, something to sketch at the bottom of the whiteboard after they've "solved" the interesting parts. This section is here to reframe that instinct entirely.
When Storage Decisions Go Wrong: Real-World Catastrophes
The best way to understand why storage matters is to look at what happens when smart engineers get it wrong at scale. These aren't hypothetical horror stories; they're documented postmortems from companies you use every day.
Netflix and the Relational Database Wall
In the late 2000s, Netflix was running on a traditional relational database (Oracle) for nearly everything: user accounts, streaming metadata, viewing history, recommendations. This was a perfectly reasonable choice when they had hundreds of thousands of subscribers. But as they crossed into tens of millions of users and began expanding internationally, a fundamental truth became impossible to ignore: relational databases scale vertically, and vertical scaling has a ceiling.
Netflix engineers described the moment they hit that ceiling as "the database wall." A single node, no matter how much RAM or how many CPU cores you throw at it, simply cannot handle millions of concurrent reads and writes indefinitely. Their solution was a multi-year migration to Apache Cassandra, a distributed NoSQL database designed for massive write throughput and horizontal scaling. The migration was painful, expensive, and time-consuming. The lesson? The storage technology that gets you to 100,000 users might actively prevent you from reaching 10 million.
💡 Real-World Example: Netflix now runs one of the largest Cassandra deployments in the world, handling millions of operations per second across multiple regions. Their engineering blog describes using Cassandra specifically because it offers tunable consistency: you can trade strict accuracy for speed depending on what each use case demands. Viewing history doesn't need to be perfectly consistent in real time. Billing data absolutely does.
Slack's Database Sharding Crisis
Slack's growth story is remarkable: from zero to millions of daily active users in a few years. But that growth exposed a critical flaw in their initial storage architecture. Slack originally used a single MySQL database with a straightforward sharding strategy: splitting data by workspace (team) ID. This worked beautifully until workspaces themselves started growing to tens of thousands of users.
The problem was that their sharding scheme assumed workspaces were roughly equal in size. Enterprise customers blew that assumption apart. A single Fortune 500 company joining Slack could suddenly represent more data and query volume than thousands of small workspaces combined. Individual shards became hot spots: overloaded nodes while others sat idle. Slack's engineering post on their "Flannel" caching architecture describes the elaborate machinery they had to build just to compensate for this original storage design decision.
💡 Real-World Example: The fix required not just re-sharding, but introducing a Vitess layer (a MySQL clustering system originally built at YouTube) and a sophisticated caching tier in front of the database. What could have been avoided with better initial capacity modeling became a multi-year infrastructure project.
Discord and the PostgreSQL-to-Cassandra Migration
Discord's story is particularly instructive because they published a detailed engineering write-up. Their messages table in PostgreSQL grew to hundreds of millions of rows. Queries on recent messages were fast, but anything that reached back into older data became progressively slower. PostgreSQL's B-tree indexes work well for balanced access distributions, but chat data is anything but balanced: most queries target the most recent messages, creating a hotspot at the "top" of every index.
Discord migrated to Cassandra specifically because its LSM-tree storage engine is optimized for sequential writes and time-ordered data access patterns. The lesson here isn't that PostgreSQL is bad; it's that every storage technology has a data model and access pattern it's optimized for, and mismatching them at scale is catastrophic.
The Cascading Impact: How One Storage Choice Affects Everything
Storage decisions don't exist in isolation. They ripple outward and affect every other dimension of your system in ways that are often invisible until you're already in trouble.
                    Storage Decision
                           │
       ┌─────────────┬─────┴─────┬─────────────┐
       ▼             ▼           ▼             ▼
    Latency     Availability    Cost      Scalability
    (query      (replication  (instance  (horizontal
    response     strategy,     size,      vs vertical,
    time)        failover)     storage    sharding
                               tier)      strategy)
                           │
                           ▼
                  Developer Velocity
                 (schema migrations,
                  query complexity,
                  operational burden)
Latency is the most immediate casualty of a poor storage choice. A query that scans millions of rows because no index exists, or a cache miss that forces a round trip to a database on the other side of the world, can turn a 5ms API response into a 2-second timeout. At scale, even a 50ms increase in database query time can cascade through a microservices architecture and result in user-facing latency of seconds.
Availability is subtler but more dangerous. Your storage layer's replication strategy determines what happens when a disk fails, a data center loses power, or a network partition occurs. A system with a single primary database and no replicas has zero tolerance for hardware failure. A distributed database with configurable quorum reads gives you fine-grained control over the tradeoff between consistency and availability, which is exactly the tradeoff the CAP theorem describes.
Cost compounds over time in ways that bite unexpectedly. Storing everything in a premium SSD-backed relational database when 90% of your data is rarely accessed is like storing your winter coats in a temperature-controlled vault. Storage tiering (moving cold data to cheaper object storage like S3 while keeping hot data in fast NVMe-backed databases) can reduce storage costs by 60-80% for read-heavy applications with clear data lifecycle patterns.
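To see why tiering moves the needle, here is a back-of-the-envelope cost sketch. The per-GB prices are assumed round numbers for illustration, not actual cloud list prices:

```python
# Illustrative cost sketch for storage tiering.
# Both prices are ASSUMED round numbers, not real cloud list prices.
PRICE_SSD_DB_PER_GB = 0.25   # premium database storage, $/GB-month (assumed)
PRICE_OBJECT_PER_GB = 0.02   # object storage, $/GB-month (assumed)

total_gb = 10_000            # a 10 TB dataset
hot_fraction = 0.10          # 10% of the data is frequently accessed

# Option A: everything lives in the premium database tier
all_in_db = total_gb * PRICE_SSD_DB_PER_GB

# Option B: hot data stays in the database, cold data moves to object storage
tiered = (total_gb * hot_fraction * PRICE_SSD_DB_PER_GB
          + total_gb * (1 - hot_fraction) * PRICE_OBJECT_PER_GB)

savings = 1 - tiered / all_in_db
print(f'All in database: ${all_in_db:,.0f}/month')
print(f'Tiered:          ${tiered:,.0f}/month')
print(f'Savings:         {savings:.0%}')
```

With these assumed prices, tiering cuts the bill by roughly 80%, which is exactly the kind of quick estimate worth doing out loud in an interview.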
🎯 Key Principle: Every storage system optimizes for some combination of read speed, write speed, storage efficiency, and consistency guarantees. Understanding these tradeoffs, not memorizing database names, is what separates senior engineers from junior ones in system design interviews.
The Storage Landscape: A Mental Map
Before diving into specific technologies, you need a clear mental model of the major categories of storage and what distinguishes them. Think of this as your map of the territory.
Structured Data: The World of Schemas
Structured data is data with a predefined schema: a rigid contract about what fields exist, what types they are, and how they relate to each other. Relational databases like PostgreSQL, MySQL, and SQLite are the canonical home for structured data. They shine when your data has complex relationships, when you need ACID transactions (Atomicity, Consistency, Isolation, Durability), and when your query patterns are varied and unpredictable.
-- A schema that enforces structure: every order MUST have a user and product
CREATE TABLE orders (
id BIGSERIAL PRIMARY KEY,
user_id BIGINT NOT NULL REFERENCES users(id),
product_id BIGINT NOT NULL REFERENCES products(id),
quantity INTEGER NOT NULL CHECK (quantity > 0),
total_cents INTEGER NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- The database ENFORCES these constraints at write time
-- Trying to insert an order with a non-existent user_id will FAIL
-- This is the power of structured storage: guaranteed data integrity
This schema enforces referential integrity at the database level. You cannot have an orphaned order pointing to a deleted user; the database prevents it. That guarantee is enormously valuable, but it comes at a cost: schema changes require migrations, and those migrations become increasingly painful as your dataset grows.
Unstructured and Semi-Structured Data: Flexibility at Scale
Unstructured data (images, videos, audio files, documents) doesn't fit into rows and columns. It lives in object storage systems like Amazon S3, Google Cloud Storage, or Azure Blob Storage. These systems are optimized for storing and retrieving large binary objects with high durability (S3 advertises 11 nines of durability) and essentially unlimited scale.
Semi-structured data sits between these extremes. JSON documents, log files, and event streams have some internal structure but don't conform to a fixed schema. Document databases like MongoDB and column-family stores like Cassandra are purpose-built for this. They allow schema evolution without migrations: you can add new fields to some documents without affecting others.
// MongoDB document - no fixed schema required
// This user has a 'preferences' field; another might not have it at all
const userDocument = {
_id: ObjectId("507f1f77bcf86cd799439011"),
username: "alice_dev",
email: "alice@example.com",
// This nested object can vary per user - no schema migration needed
preferences: {
theme: "dark",
notifications: { email: true, push: false },
timezone: "America/New_York"
},
// Arrays of any size without schema changes
tags: ["engineer", "early-adopter"]
};
// Querying is flexible but lacks JOIN semantics
// Finding users with dark theme requires a full collection scan
// unless you explicitly create an index on preferences.theme
db.users.createIndex({ "preferences.theme": 1 });
In-Memory Data: Speed Above All Else
In-memory data stores like Redis and Memcached keep data entirely in RAM, which makes them orders of magnitude faster than disk-based databases. A Redis GET operation typically completes in under 1ms. A PostgreSQL query touching disk might take 10-100ms. That difference, invisible at small scale, becomes a chasm when you're handling 100,000 requests per second.
In-memory stores are primarily used as caching layers: storing the results of expensive database queries so subsequent requests don't need to hit the database at all. They're also used for session storage, rate limiting counters, leaderboards, and pub/sub messaging.
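To make the caching pattern concrete, here is a minimal sketch of cache-aside reads with TTL expiry in plain Python. It only illustrates the logic; in production you would use Redis itself, where per-key expiry is built in:

```python
import time

class TTLCache:
    """Toy in-memory cache with per-key expiry (the idea behind Redis TTLs)."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:   # expired: treat as a miss
            del self._store[key]
            return None
        return value

def get_user_profile(cache, db, user_id):
    """Cache-aside: try the cache first, fall back to the 'database'."""
    cached = cache.get(user_id)
    if cached is not None:
        return cached, 'cache hit'
    profile = db[user_id]                    # expensive query in a real system
    cache.set(user_id, profile, ttl_seconds=60)
    return profile, 'cache miss'

cache = TTLCache()
db = {'u1': {'name': 'alice'}}
print(get_user_profile(cache, db, 'u1'))  # first call: miss, populates cache
print(get_user_profile(cache, db, 'u1'))  # second call: hit, no database trip
```

The TTL is what keeps a cache from serving arbitrarily stale data: entries silently expire, and the next read repopulates them from the source of truth.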
🧠 Mnemonic: Think of storage in terms of a library. Relational databases are the organized card catalog: perfectly indexed, great for complex lookups, but slow to reorganize. Document stores are filing cabinets: flexible, easy to add new folders, but harder to cross-reference. Object storage is the warehouse: infinite space for big items, but you need to know exactly where you put something. In-memory caches are your desk: tiny, fast, and you lose everything when you go home.
What Interviewers Are Actually Evaluating
Here's a reality that most candidates don't grasp until after a failed interview: when an interviewer asks about databases and storage, they're almost never testing whether you know which database to pick. They're testing something far more fundamental.
📋 Quick Reference Card: What Interviewers Evaluate on Storage Questions
| 🔍 Dimension | 🎯 What They Want to See | ❌ What Fails Candidates |
|---|---|---|
| 🧠 Tradeoff Reasoning | Can you articulate WHY you chose this technology? | "I'd use PostgreSQL" with no justification |
| 📈 Scalability Thinking | Do you understand the limits of your choice? | Assuming one database scales infinitely |
| 📊 Data Modeling | Does your schema match your access patterns? | Normalizing data that will be read together |
| 💡 Estimation | Can you calculate how much storage you need? | Vague answers like "probably a lot" |
| 🔧 Failure Awareness | What happens when this storage layer fails? | Not discussing replication or failover |
💡 Pro Tip: The single most powerful phrase you can use in a storage discussion is: "It depends on the access patterns." Then explain what you mean. If an interviewer lights up when you say it, that's because most candidates never get there.
❌ Wrong thinking: "I'll just use MySQL for everything, it's reliable and I know it well."
✅ Correct thinking: "For user profile data with complex relational queries, I'd use PostgreSQL. For the activity feed, which needs high write throughput and time-ordered retrieval, I'd consider Cassandra. For caching the most recent feed items, Redis with a TTL-based expiration strategy."
⚠️ Common Mistake: Picking a technology by name without explaining the access patterns that justify it. Interviewers hear "I'd use MongoDB" dozens of times a week. They almost never hear a candidate say, "The document model fits here because each product listing has a highly variable attribute schema: some products have 3 attributes, others have 50, and schema migrations on a 200-million-row table would be operationally risky."
The best candidates treat storage selection as a hypothesis that they validate against requirements: expected read/write ratio, data volume, consistency requirements, query complexity, and operational constraints. Every one of those factors can tip the balance toward a different storage technology.
🤔 Did you know? According to multiple engineering interview coaches, storage and database design questions appear in over 85% of senior software engineer system design interviews. Companies like Google, Meta, and Amazon specifically train interviewers to probe deeper when candidates give surface-level database answers, expecting at least three rounds of follow-up questions before a candidate's understanding is fully assessed.
Your Roadmap: What's Coming in This Lesson
This section is the first of six in this lesson on Databases & Storage. Understanding where each piece fits will help you build knowledge incrementally rather than treating each topic as isolated.
Lesson Roadmap: Databases & Storage

[1] Why Storage Decisions Matter ← You are here
         │
         ▼
[2] Core Storage Concepts
    (ACID, CAP, indexes, replication: the vocabulary
    you need to speak fluently in every other section)
         │
         ▼
[3] Estimating Storage Requirements
    (Capacity math: how to calculate storage needs
    live in an interview without freezing)
         │
         ▼
[4] Choosing the Right Storage Layer
    (Decision frameworks: relational vs. NoSQL vs. cache,
    when each wins and why)
         │
         ▼
[5] Common Storage Pitfalls
    (The mistakes that eliminate candidates, and
    how to avoid them)
         │
         ▼
[6] Key Takeaways & Interview Cheat Sheet
    (Your portable reference before the interview)
Each section builds on the previous. Section 2 gives you the vocabulary. Section 3 gives you the math. Section 4 gives you the decision framework. By the time you reach the specialized child lessons on Relational Databases, NoSQL, and Caching, you'll arrive with a foundation that lets you understand why each technology exists, not just what it does.
🎯 Key Principle: System design interviews reward engineers who think in tradeoffs, not engineers who memorize technology stacks. Storage is the domain where tradeoffs are most consequential, most visible, and most frequently tested. Every decision you make (relational or document, strong or eventual consistency, cache or no cache) is a bet you're placing about how your users will behave and how your system will grow. The rest of this lesson will teach you how to make those bets intelligently.
You now understand the stakes. Poor storage decisions have derailed companies with millions of users and hundreds of engineers. The right decisions, made with clear reasoning about access patterns, scale, and consistency requirements, are what separate systems that survive growth from systems that collapse under it. More importantly, they're what separate candidates who pass system design interviews from those who don't.
Let's build the foundation you need. In the next section, we'll establish the core vocabulary and mental models (ACID, CAP theorem, indexing strategies, replication) that underpin every storage conversation you'll ever have in an interview room.
Core Storage Concepts Every Engineer Must Know
Before you can make intelligent storage decisions in a system design interview, you need a shared vocabulary and a set of mental models that apply across every database, cache, and file system you'll ever encounter. Think of this section as building your storage intuition: the instinctive understanding that lets you hear a requirement and immediately feel which trade-offs are in play. These concepts are not tied to any single product; they are the physics of data storage, and they govern everything from SQLite on a Raspberry Pi to a petabyte-scale distributed warehouse at Netflix.
Persistence vs. Volatility: What Survives a Crash?
The most fundamental question you can ask about any storage system is: what happens to my data when the power goes out? This question separates volatile storage from persistent storage, and getting it wrong in a design is a silent killer.
Volatile storage holds data only as long as the system is running. RAM is the canonical example. It is blindingly fast (modern DRAM can serve reads in nanoseconds), but the moment a process crashes or a machine reboots, everything in memory is gone. In-memory caches like Redis (in its default configuration) and Memcached are volatile in this sense. This is perfectly acceptable when you're caching computed results you can reconstruct, but catastrophic if you treat the cache as your system of record.
Persistent storage survives power loss because it writes data to a durable medium: spinning hard disks, solid-state drives, or network-attached storage. Relational databases, document stores, and object storage systems all provide persistence by design.
The mechanism databases use to make writes durable is called a Write-Ahead Log (WAL) or commit log. Before a database modifies any data on disk, it first appends a description of the change to a sequential log file. If the system crashes mid-write, the database replays the log on restart and recovers to a consistent state. This is why sequential writes are at the heart of nearly every storage engine β appending to a log is one of the fastest operations a disk can do.
Client Write Request
         │
         ▼
┌─────────────────────┐
│  Write-Ahead Log    │  ← Written FIRST (sequential, fast)
│  [entry appended]   │
└─────────────────────┘
         │
         ▼
┌─────────────────────┐
│  Data Files         │  ← Updated AFTER log is safe
│  (B-tree pages)     │
└─────────────────────┘
         │
         ▼
Acknowledge to Client
💡 Mental Model: Think of the WAL like a pilot's flight recorder. It's not the destination, it's the record of every action taken. If something goes wrong, you replay the record to reconstruct where you were.
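The replay idea fits in a few lines of Python. This is a toy model of the mechanism, not any real database's WAL format:

```python
class ToyKVStore:
    """Toy write-ahead-logged key-value store: log first, apply second."""
    def __init__(self):
        self.log = []    # durable in a real system (appended, then fsynced)
        self.data = {}   # the "data files"

    def put(self, key, value):
        self.log.append((key, value))  # step 1: append to the WAL
        self.data[key] = value         # step 2: update the data files

    @classmethod
    def recover(cls, log):
        """After a crash, rebuild state by replaying the surviving log."""
        store = cls()
        for key, value in log:
            store.put(key, value)
        return store

store = ToyKVStore()
store.put('balance:alice', 100)
store.put('balance:bob', 250)

# Simulate a crash where the data files are lost but the log survived:
recovered = ToyKVStore.recover(store.log)
assert recovered.data == {'balance:alice': 100, 'balance:bob': 250}
```

Because the log is append-only, the "write" path is a fast sequential operation, and recovery is just a deterministic replay from the last known-good point.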
⚠️ Common Mistake: Assuming that writing to a database automatically means the data is durable. Many databases buffer writes in memory and flush to disk asynchronously. If your application can't afford to lose even a single write (financial transactions, audit logs), you must explicitly configure fsync behavior and understand your database's durability guarantees.
Durability guarantees exist on a spectrum. A database might offer:
- 🔒 Synchronous writes: Every acknowledged write is on durable storage. Safest, slowest.
- ⚡ Asynchronous writes: Writes are buffered and flushed periodically. Fastest, risk of data loss on crash.
- 🔧 Configurable durability: Systems like PostgreSQL let you tune synchronous_commit per transaction.
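SQLite offers a convenient place to see a durability knob up close: its synchronous pragma ranges from OFF (fastest, least safe) to EXTRA. A short sketch:

```python
import sqlite3

conn = sqlite3.connect(':memory:')

# PRAGMA synchronous controls how aggressively SQLite fsyncs before
# acknowledging a commit: 0=OFF, 1=NORMAL, 2=FULL, 3=EXTRA.
# (On an in-memory database the fsync itself is moot, but the setting
# is still readable; on a file-backed database it changes durability.)
conn.execute('PRAGMA synchronous = FULL')
mode = conn.execute('PRAGMA synchronous').fetchone()[0]
print(f'synchronous mode: {mode}')  # 2 means FULL
```

The same trade-off appears in PostgreSQL's synchronous_commit and in the fsync policies of most databases: the faster settings simply move risk from latency to durability.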
ACID Properties: The Contract of Reliable Transactions
When a storage system promises reliable transactions, it's making a four-part promise captured by the acronym ACID: Atomicity, Consistency, Isolation, and Durability. Understanding each property with a concrete example is essential, because interviewers will ask you to reason about what breaks when systems lack these guarantees.
Atomicity
Atomicity means a transaction is all-or-nothing. Either every operation in the transaction succeeds and commits, or none of them do. There is no partial state.
Imagine a bank transfer: subtract $100 from Alice's account and add $100 to Bob's account. Without atomicity, if the system crashes after the debit but before the credit, Alice loses $100 and Bob receives nothing. The money vanishes. Atomicity guarantees that a crash rolls back the entire transaction: Alice keeps her $100 and no transfer occurs.
-- Without atomicity, a crash between these two statements destroys data
BEGIN TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE user_id = 'alice';
-- ⚠️ CRASH HERE without atomicity = Alice loses $100 forever
UPDATE accounts SET balance = balance + 100 WHERE user_id = 'bob';
COMMIT;
-- With atomicity: crash causes full rollback, both accounts unchanged
Consistency
Consistency means a transaction brings the database from one valid state to another valid state, always respecting defined rules (constraints, cascades, triggers). If Alice's account has a minimum balance rule of $0 and she only has $50, a transaction trying to debit $100 will be rejected: the database enforces the rule and refuses to leave data in an invalid state.
🎯 Key Principle: Consistency is partly a database guarantee and partly an application guarantee. The database enforces schema constraints, but it's the application's job to define what "valid" means for the business domain.
Isolation
Isolation means concurrent transactions behave as if they executed serially (one after the other) even though they're running simultaneously. Without isolation, two users updating the same row simultaneously can produce corrupt results (the lost update problem).
Databases implement isolation through isolation levels that trade correctness for performance:
| Isolation Level | Dirty Read | Non-Repeatable Read | Phantom Read |
|---|---|---|---|
| Read Uncommitted | ❌ Possible | ❌ Possible | ❌ Possible |
| Read Committed | ✅ Prevented | ❌ Possible | ❌ Possible |
| Repeatable Read | ✅ Prevented | ✅ Prevented | ❌ Possible |
| Serializable | ✅ Prevented | ✅ Prevented | ✅ Prevented |
Most production applications use Read Committed as a pragmatic balance. Full Serializable isolation is the safest but significantly reduces throughput.
Durability
Durability means once a transaction commits, it stays committed β even if the system crashes a millisecond later. This is where the WAL we discussed earlier does its job. The commit is not acknowledged to the client until the log entry is safely on durable storage.
🧠 Mnemonic: All or nothing, Consistent always, Isolated from others, Durable forever. A-C-I-D keeps your data clean.
Indexes, B-Trees, and the Read vs. Write Trade-Off
Every storage decision you make involves a fundamental tension: optimizing for reads makes writes slower, and optimizing for writes makes reads slower. This is not a flaw in specific databases; it is a mathematical reality of how information is stored. Understanding indexes is the clearest window into this trade-off.
Without an index, finding a row in a table requires a full table scan: reading every row until you find the one you want. For a table with 10 million rows, that means potentially 10 million comparisons. An index is a separate data structure that stores a sorted copy of specific column values alongside pointers to the original rows, enabling fast lookups.
The dominant index structure in traditional databases is the B-tree (Balanced Tree). A B-tree organizes data into a hierarchy of nodes. Each internal node stores sorted keys and pointers to child nodes. Leaf nodes store the actual data or row pointers. The "balanced" property means every leaf is at the same depth, guaranteeing that any lookup takes O(log n) time regardless of which key you're searching for.
              [ 30 | 70 ]                    ← Root node
             /     |     \
      [10|20]   [40|60]   [80|90]            ← Internal nodes
      /  |  \   /  |  \     /  \
[5][15][25] [35][50][65]  [75]  [95]         ← Leaf nodes (data here)

Searching for key=50:
  1. Root: 50 > 30 and 50 < 70 → go to middle child
  2. Internal [40|60]: 50 > 40 and 50 < 60 → go to middle child
  3. Leaf [50]: FOUND. 3 comparisons for any key in this tree.
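The "sorted keys, halve the search space" behavior is easy to demonstrate with a plain binary search, which is essentially what happens inside each B-tree node when picking a child pointer:

```python
def count_comparisons(sorted_keys, target):
    """Binary search over sorted keys, counting comparisons (O(log n))."""
    lo, hi, comparisons = 0, len(sorted_keys), 0
    while lo < hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if sorted_keys[mid] < target:
            lo = mid + 1       # target is in the upper half
        else:
            hi = mid           # target is in the lower half (or at mid)
    found = lo < len(sorted_keys) and sorted_keys[lo] == target
    return found, comparisons

keys = list(range(0, 10_000_000, 2))   # 5 million sorted keys
found, comps = count_comparisons(keys, 4_242_424)
print(found, comps)   # found in roughly 23 comparisons, not 5 million
```

Log-time lookup is why an index turns a multi-second scan into a sub-millisecond read; the B-tree just arranges this search to minimize disk page reads as well as comparisons.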
The write trade-off is real and significant. When you insert a new row, the database must:
- Write the row to the data file
- Update every index on that table: find the right position, insert the key, potentially rebalance the tree
⚠️ Common Mistake: Adding indexes indiscriminately to "make the database faster." Each index you add slows down every INSERT, UPDATE, and DELETE on that table. A table with 15 indexes will have dramatically slower write throughput than a table with 2 well-chosen indexes.
A competing index structure worth knowing is the LSM-tree (Log-Structured Merge Tree), used by databases like Cassandra, RocksDB, and LevelDB. LSM-trees flip the trade-off: they batch writes sequentially in memory (a memtable) and periodically flush sorted runs to disk (SSTables). Reads become slightly more expensive (potentially checking multiple SSTables), but writes are extremely fast because they're always sequential. This makes LSM-tree databases excellent choices for write-heavy workloads like time-series data or event streams.
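A toy sketch of the LSM idea (an in-memory memtable that flushes sorted, immutable runs, with reads checking the newest data first) makes the write/read trade-off visible. This is a simplification; real LSM engines add bloom filters, compaction, and tombstones:

```python
import bisect

class ToyLSM:
    """Toy LSM-tree: buffer writes in a memtable, flush sorted runs (SSTables)."""
    def __init__(self, memtable_limit=3):
        self.memtable = {}
        self.sstables = []           # flushed runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value   # fast: in-memory, no disk seek
        if len(self.memtable) >= self.memtable_limit:
            # Flush: write one sorted, immutable run to "disk"
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:               # check newest data first
            return self.memtable[key]
        for run in reversed(self.sstables):    # then each run, newest first
            keys = [k for k, _ in run]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return run[i][1]
        return None                            # not found in any run

db = ToyLSM()
for i in range(7):
    db.put(f'event:{i}', i * 10)
print(db.get('event:2'), db.get('event:6'))  # 20 60
```

Notice the asymmetry: every write touches only memory (plus an eventual sequential flush), while a read may have to probe several runs. That is the LSM trade-off in miniature.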
# Demonstrating the index trade-off in practice with SQLite
import sqlite3
import time

conn = sqlite3.connect(':memory:')
c = conn.cursor()

# Create a table WITHOUT an index
c.execute('CREATE TABLE events (id INTEGER, user_id INTEGER, event_type TEXT)')

# Insert 100,000 rows
data = [(i, i % 10000, 'click') for i in range(100_000)]
c.executemany('INSERT INTO events VALUES (?, ?, ?)', data)

# Time a query WITHOUT an index
start = time.time()
c.execute('SELECT COUNT(*) FROM events WHERE user_id = 5000').fetchone()
no_index_time = time.time() - start

# Now add an index
c.execute('CREATE INDEX idx_user_id ON events(user_id)')

# Time the same query WITH the index
start = time.time()
c.execute('SELECT COUNT(*) FROM events WHERE user_id = 5000').fetchone()
with_index_time = time.time() - start

print(f'Without index: {no_index_time:.4f}s')   # ~0.02s: full scan
print(f'With index:    {with_index_time:.4f}s')  # ~0.0001s: O(log n)
# The index makes this read dramatically faster, at the cost of slower writes
💡 Pro Tip: In a system design interview, whenever you propose an index, immediately mention the write overhead. Interviewers want to see that you understand the trade-off, not just that indexes make queries faster.
Data Models: Choosing the Right Shape for Your Data
Different data has different natural shapes, and forcing the wrong shape onto data is a common source of complexity and performance problems. Here is a practical overview of the five data models you'll encounter most often:
Tabular (Relational)
Data is organized into tables of rows and columns with a predefined schema. Relationships between entities are expressed through foreign keys and joins. Best for: structured data with complex relationships, strong consistency requirements, and ad-hoc query needs. Examples: PostgreSQL, MySQL, SQLite.
Document
Data is stored as self-contained documents (typically JSON or BSON). A document can contain nested objects and arrays, making it natural for hierarchical data. No joins are required because related data is often embedded in the same document. Best for: content management, user profiles, catalogs, and other cases where you read the full entity together and the schema varies across records. Examples: MongoDB, Couchbase.
Key-Value
The simplest model: every piece of data is a value accessed by a unique key. No query language, no schema, no relationships. What you lose in expressiveness you gain in raw speed and simplicity. Best for: session storage, caching, feature flags, or any scenario where you know exactly which key you want. Examples: Redis, DynamoDB (in its simplest usage mode).
Graph
Data is stored as nodes (entities) and edges (relationships between entities), each with properties. Graph databases excel when the relationships themselves are the data and queries traverse many hops. Best for: social networks (friends of friends), fraud detection (chains of related accounts), recommendation engines. Examples: Neo4j, Amazon Neptune.
Time-Series
Data is indexed primarily by timestamp and optimized for append-heavy, chronological workloads with range queries over time windows. Best for: metrics, monitoring, IoT sensor data, financial tick data. Examples: InfluxDB, TimescaleDB, Prometheus.
📋 Quick Reference Card:
| 📦 Model | 🔧 Best For | ⚠️ Avoid When |
|---|---|---|
| 🗄️ Relational | Complex queries, joins, ACID transactions | Schema changes are very frequent |
| 📄 Document | Hierarchical, schema-flexible entities | Data has deep cross-document relationships |
| 🔑 Key-Value | Cache, sessions, simple lookups | You need to query by non-key attributes |
| 🕸️ Graph | Relationship traversal, network analysis | Data is not naturally graph-shaped |
| ⏱️ Time-Series | Metrics, events ordered by time | Non-temporal, relational queries dominate |
Replication and Partitioning: Distributing Data Across Machines
No single machine is fast enough, large enough, or reliable enough for a production system at scale. Distributing data across multiple machines requires two fundamental techniques: replication and partitioning (also called sharding).
Replication
Replication means keeping copies of the same data on multiple machines. The primary goals are fault tolerance (if one machine dies, another has the data) and read scalability (multiple machines can serve read requests in parallel).
The most common replication topology is single-leader replication (also called master-replica or primary-replica):
           Client Writes
                 │
                 ▼
        ┌─────────────────┐
        │ Leader/Primary  │  ← All writes go here
        └─────────────────┘
            │         │
       Replication  Replication
       (sync/async) (sync/async)
            │         │
            ▼         ▼
      ┌───────────┐ ┌───────────┐
      │ Replica 1 │ │ Replica 2 │  ← Reads can be served here
      └───────────┘ └───────────┘

      Client Reads ← Any Replica
The critical trade-off in replication is synchronous vs. asynchronous replication:
- Synchronous: The leader waits for at least one replica to confirm the write before acknowledging the client. Stronger durability, but higher write latency.
- Asynchronous: The leader acknowledges immediately and replicates in the background. Lower latency, but if the leader crashes before replication completes, acknowledged writes can be lost.
⚠️ Common Mistake: Reading your own writes from a replica. If you write to the leader and immediately read from a replica that hasn't replicated yet, you'll see stale data. This is called replication lag, and it's the source of bugs that are notoriously hard to reproduce. Applications that can't tolerate this must always read from the leader for data they just wrote.
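A tiny simulation shows how asynchronous replication produces exactly this stale-read window (a toy model, with "replication" reduced to an explicit step):

```python
class ReplicatedStore:
    """Toy single-leader setup with asynchronous replication."""
    def __init__(self):
        self.leader = {}
        self.replica = {}
        self.pending = []            # writes not yet shipped to the replica

    def write(self, key, value):
        self.leader[key] = value     # acknowledged immediately (async mode)
        self.pending.append((key, value))

    def replicate(self):
        """Background process shipping buffered writes to the replica."""
        for key, value in self.pending:
            self.replica[key] = value
        self.pending = []

store = ReplicatedStore()
store.write('profile:alice', 'new bio')

# Read-your-writes violation: the replica hasn't caught up yet.
print(store.replica.get('profile:alice'))   # None  (stale read!)
print(store.leader.get('profile:alice'))    # 'new bio'

store.replicate()                            # the lag eventually resolves
print(store.replica.get('profile:alice'))   # 'new bio'
```

The fix in real systems is exactly what the text says: route a user's reads of their own freshly written data to the leader, or use a database feature that guarantees read-your-writes consistency.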
Multi-leader and leaderless replication exist for higher availability at the cost of more complex consistency guarantees; that's territory you'll explore when studying distributed consistency models.
Partitioning (Sharding)
Where replication duplicates data, partitioning divides data so each machine is responsible for a distinct subset. When your dataset is too large for one machine's disk, or your write throughput exceeds what one machine can handle, you partition.
The two dominant partitioning strategies are:
Range partitioning: assign contiguous ranges of a key to each partition. Simple to implement and supports efficient range queries ("give me all events from January"). The danger is hot spots: if your key is a timestamp and most recent data is queried most, one partition bears all the load.
Hash partitioning: apply a hash function to the key and assign data to partitions based on the hash value. This distributes data evenly and eliminates hot spots, but destroys the ability to do efficient range queries.
# Demonstrating hash partitioning logic
import hashlib

def get_partition(key: str, num_partitions: int) -> int:
    """Determine which partition a key belongs to."""
    # Use a stable hash: Python's built-in hash() is randomized per process.
    # Real systems use consistent hashing; this shows the core idea.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_partitions

# Simulate distributing 10 users across 3 partitions
num_partitions = 3
users = ['alice', 'bob', 'charlie', 'diana', 'eve',
         'frank', 'grace', 'henry', 'iris', 'jack']
for user in users:
    partition = get_partition(user, num_partitions)
    print(f'User {user:8s} → Partition {partition}')

# Each user lands on one of partitions 0-2; the hash
# ensures a roughly even spread across partitions
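For contrast with the hash approach, range partitioning can be sketched with a sorted list of boundary keys and a binary search; the quarterly boundaries below are purely illustrative:

```python
from bisect import bisect_right

# Partition i holds keys below boundaries[i]; keys past the last
# boundary fall into the final partition (4 partitions by quarter)
boundaries = ["2024-04-01", "2024-07-01", "2024-10-01"]

def range_partition(key: str) -> int:
    """Return the index of the partition whose range contains the key."""
    return bisect_right(boundaries, key)

print(range_partition("2024-02-15"))  # 0
print(range_partition("2024-08-30"))  # 2
print(range_partition("2024-12-01"))  # 3  <- hot spot if recent data dominates
```

Notice that a range scan ("all events in Q3") touches exactly one partition here, which is the property hash partitioning gives up.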
A key complication of partitioning is cross-partition queries. If a single query needs data from multiple partitions (a JOIN across shards), the database must fan out the query to all relevant partitions and aggregate the results. This is expensive and is why many large-scale systems denormalize data to avoid cross-shard joins.
🎯 Key Principle: Replication and partitioning are almost always used together. A production distributed database typically has N partitions, each with M replicas. You get the write scalability of partitioning and the fault tolerance of replication simultaneously.
💡 Real-World Example: Cassandra uses consistent hash-based partitioning with configurable replication factors. A cluster of 6 nodes with replication factor 3 means each piece of data lives on 3 different nodes. If 2 nodes fail simultaneously, the data is still available. Every node is both a leader for some partitions and a replica for others, a fully peer-to-peer topology.
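The Cassandra-style placement described above can be sketched as a toy hash ring: a key's replicas are the next N distinct nodes clockwise from its ring position. Real clusters add virtual nodes and rack awareness; this sketch shows only the core placement rule, with illustrative node names:

```python
import hashlib
from bisect import bisect_right

def ring_position(name: str) -> int:
    # Stable position on a 2^32-slot hash ring
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % (2 ** 32)

def replicas_for_key(key: str, nodes: list, rf: int = 3) -> list:
    """Walk clockwise from the key's position, taking the next rf nodes."""
    ring = sorted((ring_position(n), n) for n in nodes)
    positions = [pos for pos, _ in ring]
    start = bisect_right(positions, ring_position(key))  # first node past the key
    return [ring[(start + i) % len(ring)][1] for i in range(rf)]

nodes = [f"node{i}" for i in range(6)]
owners = replicas_for_key("user:alice", nodes, rf=3)
print(owners)  # three distinct nodes responsible for this key
```

With 6 nodes and rf=3, any single key survives the loss of 2 nodes, matching the example above.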
Putting It All Together
These concepts do not exist in isolation; they form an interconnected system of trade-offs. When you choose a persistent store with strong ACID guarantees, you're accepting lower throughput because the WAL and synchronous replication add latency. When you add indexes to speed up reads, you slow down writes. When you partition for scale, you complicate cross-partition queries. When you replicate asynchronously for speed, you risk serving stale reads.
❌ Wrong thinking: "I'll just pick the most powerful database and it will handle everything." ✅ Correct thinking: "What are my read patterns? Write patterns? Consistency requirements? Failure tolerance needs? Each answer changes which trade-offs are acceptable."
This is exactly the reasoning interviewers want to see. Not a memorized answer about which database is "best," but a structured walk through trade-offs using the vocabulary and mental models you've just built. Every section that follows will add tools to this foundation, but the foundation itself is what separates engineers who confidently navigate storage decisions from those who guess.
How to Estimate Storage Requirements in System Design
Every system design interview eventually arrives at the same uncomfortable moment: the interviewer asks, "How much storage will this system need?" Candidates who freeze, guess wildly, or skip the question entirely signal a critical gap in their engineering intuition. Candidates who walk through a structured estimation process (even if their final number is off by a factor of two) demonstrate the systematic thinking that separates senior engineers from juniors. This section gives you that structured process, grounded in real numbers and practiced through realistic examples.
Why Estimation Matters Before You Pick a Database
Storage estimation isn't just a warm-up exercise. The numbers you derive fundamentally shape every downstream decision. A system storing 1 GB of data per day has radically different needs than one storing 1 TB per day, even if the application logic looks identical. The volume determines whether you can use a single relational database or need a distributed storage cluster, whether you can afford full-text indexes on every column or must be surgical about indexing, and whether replication overhead is a minor footnote or a major cost driver.
🎯 Key Principle: Estimate storage before choosing a database engine. The scale you're designing for often eliminates entire categories of technology before you've written a single schema.
The good news is that you don't need perfect precision. Back-of-the-envelope estimation (a term borrowed from the practice of scribbling quick calculations on whatever paper is nearby) is explicitly about order-of-magnitude accuracy. Being within a factor of two or three is entirely acceptable. What matters is that your reasoning is sound and your assumptions are stated clearly.
The Three-Variable Framework
Almost every storage estimation problem reduces to three variables multiplied together:
Total Storage = (Bytes per Record) × (Records per Day) × (Retention Period in Days)
Then you layer on multipliers for replication, indexing, and blob data. Let's unpack each variable.
Bytes per record means the average size of a single unit of data: one tweet, one user profile, one URL mapping. You estimate this by listing the fields in your schema and summing their sizes. A 64-bit integer is 8 bytes. A UUID is 16 bytes. A timestamp is 8 bytes. A short string of 140 characters is roughly 140 bytes in UTF-8 for ASCII text. This is not an exercise in exactness; it's an exercise in not being wildly wrong.
Records per day comes from your user and activity assumptions. How many daily active users does the system have? How many actions does each user take per day on average? Multiply these together and you have your write throughput, which directly yields records per day.
Retention period is how long you keep data before archiving or deleting it. A compliance-sensitive financial system might keep data for seven years. A social media feed cache might expire after 30 days. Knowing retention turns a daily rate into a total volume.
🧠 Mnemonic: BRR (Bytes per record, Records per day, Retention). Whenever you start a storage estimate, write BRR at the top of your scratch pad.
Worked Example: Estimating Storage for a Twitter-Like System
Let's put this framework into practice with a concrete scenario. Imagine you're designing a social media platform similar to Twitter, and the interviewer tells you: 100 million daily active users (DAU).
Step 1: Establish Activity Assumptions
Start by asking or stating how active those users are. A reasonable assumption:
- Each user reads about 50 posts per day (scrolling the feed)
- Each user writes about 2 posts per day
- This gives us a read-to-write ratio of 25:1, a heavily read-dominant system
Daily writes: 100M users × 2 posts/user = 200 million posts per day
Step 2: Estimate Bytes per Record
Sketch a minimal schema for a post (tweet):
| Field | Type | Size |
|---|---|---|
| post_id | BIGINT | 8 bytes |
| user_id | BIGINT | 8 bytes |
| created_at | TIMESTAMP | 8 bytes |
| content | VARCHAR(280) | ~200 bytes avg |
| like_count | INT | 4 bytes |
| repost_count | INT | 4 bytes |
| Total | | ~232 bytes |
Rounding up for overhead, encoding, and row metadata: approximately 250 bytes per post record.
Step 3: Calculate Raw Daily Storage
Daily raw storage = 200M posts × 250 bytes = 50 GB/day
Step 4: Apply Retention
If you retain posts for 5 years (1,825 days):
Total raw storage = 50 GB/day × 1,825 days ≈ 91 TB
Round to ~100 TB of raw post data.
Step 5: Account for Replication, Indexing, and Media
This is where most candidates stop too early. Raw data is only part of the story.
Replication overhead: Distributed databases typically maintain 3 replicas for fault tolerance. Multiply raw storage by 3.
Replicated storage = 100 TB × 3 = 300 TB
Indexing storage costs: Indexes on frequently queried columns (user_id, created_at) can add 20-50% overhead. Use 30% as a rule of thumb.
With indexes = 300 TB × 1.3 ≈ 390 TB
Blob vs. metadata separation: Posts with images or videos require separate blob storage (like Amazon S3 or Google Cloud Storage). Assume 20% of posts include a 2 MB image on average:
Image posts per day = 200M × 0.20 = 40M
Blob storage per day = 40M × 2 MB = 80 TB/day
Blob storage over 5 years = 80 TB × 1,825 ≈ 146 PB
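The five steps above reduce to a few lines of arithmetic (decimal units, matching the prose). Note that the prose rounds 91 TB up to 100 TB before applying the multipliers, which is why it reports ~390 TB where the unrounded math gives ~356 TB:

```python
daily_posts = 100_000_000 * 2                # Step 1: 200M posts/day
daily_raw = daily_posts * 250                # Steps 2-3: 50 GB/day
total_raw = daily_raw * 5 * 365              # Step 4: 91.25 TB raw
with_indexes = total_raw * 3 * 1.3           # Step 5: 3x replicas + 30% indexes
blob_total = daily_posts * 0.20 * 2_000_000 * 5 * 365  # 2 MB images, 20% of posts

print(f"{daily_raw / 1e9:.0f} GB/day raw")    # 50 GB/day raw
print(f"{total_raw / 1e12:.2f} TB raw")       # 91.25 TB raw
print(f"{with_indexes / 1e12:.0f} TB with replication + indexes")  # 356
print(f"{blob_total / 1e15:.0f} PB of blobs") # 146 PB of blobs
```

Either figure is fine in an interview; what matters is stating which roundings you applied.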
⚠️ Common Mistake: Forgetting to separate blob storage from structured metadata storage. Images and videos dwarf your database row data. A system that looks manageable at the database layer can require petabytes of object storage, a completely different infrastructure tier.
Your final summary to an interviewer would look like this:
┌──────────────────────────────────────────────────┐
│            Storage Estimate Summary              │
├──────────────────────────┬───────────────────────┤
│ Structured post data     │ ~390 TB (replicated)  │
│ User profile data        │ ~10 TB (small)        │
│ Media blob storage       │ ~146 PB               │
│ Cache layer (hot data)   │ ~5 TB (Redis/Memcache)│
└──────────────────────────┴───────────────────────┘
This immediately signals that your blob storage strategy is the dominant engineering challenge, not your database schema.
Read-to-Write Ratio Analysis
The read-to-write ratio you calculate during estimation isn't just a number; it's a compass pointing toward the right storage technology.
In our Twitter example, the 25:1 read-to-write ratio tells us the system is read-heavy. This has direct architectural consequences:
Read-Heavy System (>10:1)
├── Optimize reads at the cost of write complexity
├── Consider denormalization (store redundant data to avoid joins)
├── Read replicas make sense (route reads to replica nodes)
├── Caching layer has high hit rate = high ROI
└── Column-family stores (Cassandra) or document stores (MongoDB)
    work well for fan-out reads
Write-Heavy System (<3:1)
├── Optimize write throughput
├── Avoid expensive indexes (each index adds write overhead)
├── Consider append-only log structures (LSM trees)
├── Time-series databases (InfluxDB, TimescaleDB) built for this
└── Event streaming (Kafka) to buffer write spikes
Balanced System (3:1 to 10:1)
├── Standard RDBMS often appropriate
├── Careful index selection matters
└── CQRS pattern may apply at larger scales
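The cutoffs in this breakdown can be captured in a toy classifier, useful for drilling the categories (the thresholds are the ones stated above, not industry standards):

```python
def classify_workload(reads_per_day: float, writes_per_day: float) -> str:
    """Classify a workload by its read-to-write ratio."""
    ratio = reads_per_day / writes_per_day
    if ratio > 10:
        return "read-heavy"   # optimize reads: replicas, caching, denormalization
    if ratio < 3:
        return "write-heavy"  # optimize writes: LSM trees, minimal indexes
    return "balanced"         # standard RDBMS often appropriate

# Twitter-like example: 50 reads vs 2 writes per user per day -> 25:1
print(classify_workload(50, 2))  # read-heavy
```

Running the Twitter numbers through it confirms the 25:1 system lands squarely in the read-heavy branch.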
💡 Real-World Example: Cassandra, used by Instagram and Netflix, is designed around write-heavy and read-heavy workloads alike, but its LSM-tree storage engine shines when writes dominate because it batches writes to sequential disk locations, avoiding the random I/O that kills write performance in B-tree databases.
❌ Wrong thinking: "I'll add indexes on every column so reads are fast." ✅ Correct thinking: "Each additional index increases write latency and storage overhead by 10-30%. I'll index only the columns I query frequently, based on my read pattern analysis."
A Python Script for Storage Estimation
During an interview, you'll sketch this math verbally or on a whiteboard. But building a quick estimation script helps you internalize the numbers and practice the discipline of stating assumptions explicitly. Here's a reusable estimator:
# storage_estimator.py
# A back-of-the-envelope storage calculator for system design

def estimate_storage(
    bytes_per_record: int,
    records_per_day: int,
    retention_days: int,
    replication_factor: int = 3,
    index_overhead: float = 0.30,
    blob_fraction: float = 0.0,
    blob_size_bytes: int = 0
) -> dict:
    """
    Estimates total storage requirements for a system.

    Args:
        bytes_per_record: Average size of one data record in bytes
        records_per_day: Number of new records written per day
        retention_days: How long data is kept before deletion/archive
        replication_factor: Number of data copies (default: 3)
        index_overhead: Fraction of additional storage for indexes (default: 30%)
        blob_fraction: Fraction of records that include a blob (e.g., image)
        blob_size_bytes: Average size of each blob in bytes

    Returns:
        Dictionary with a human-readable breakdown (GB/TB/PB strings)
    """
    # Step 1: Raw structured data
    daily_raw_bytes = bytes_per_record * records_per_day
    total_raw_bytes = daily_raw_bytes * retention_days

    # Step 2: Apply replication
    replicated_bytes = total_raw_bytes * replication_factor

    # Step 3: Apply index overhead
    with_indexes = replicated_bytes * (1 + index_overhead)

    # Step 4: Calculate blob storage separately
    blob_records_per_day = records_per_day * blob_fraction
    total_blob_bytes = blob_records_per_day * blob_size_bytes * retention_days

    # Convert to human-readable units
    gb = 1024 ** 3
    tb = 1024 ** 4
    pb = 1024 ** 5

    def format_bytes(b):
        if b >= pb:
            return f"{b / pb:.1f} PB"
        elif b >= tb:
            return f"{b / tb:.1f} TB"
        else:
            return f"{b / gb:.1f} GB"

    return {
        "daily_raw": format_bytes(daily_raw_bytes),
        "total_raw": format_bytes(total_raw_bytes),
        "with_replication": format_bytes(replicated_bytes),
        "with_indexes": format_bytes(with_indexes),
        "blob_storage": format_bytes(total_blob_bytes),
        "total_estimate": format_bytes(with_indexes + total_blob_bytes)
    }

# Twitter-like system example
result = estimate_storage(
    bytes_per_record=250,              # ~250 bytes per post record
    records_per_day=200_000_000,       # 200M posts per day
    retention_days=365 * 5,            # 5-year retention
    replication_factor=3,
    index_overhead=0.30,
    blob_fraction=0.20,                # 20% of posts have images
    blob_size_bytes=2 * 1024 * 1024    # 2 MB average image size
)

print("=== Twitter-like System Storage Estimate ===")
for key, value in result.items():
    print(f" {key.replace('_', ' ').title():<25} {value}")
Running this script produces the following output (the script uses binary units, so the figures sit slightly below the decimal-unit estimates above):
=== Twitter-like System Storage Estimate ===
 Daily Raw                 46.6 GB
 Total Raw                 83.0 TB
 With Replication          249.0 TB
 With Indexes              323.7 TB
 Blob Storage              136.0 PB
 Total Estimate            136.3 PB
The script makes every assumption explicit as a named parameter, exactly the discipline you want to demonstrate in an interview.
Schema Sketch and Size Accounting
Let's look at a simpler but equally instructive example: a URL shortener like bit.ly. This is a classic interview question where the data model is compact and the math is clean.
# url_shortener_schema.py
# Schema definition with per-field byte costs annotated

# Each shortened URL mapping requires:
schema = {
    "short_code": {"type": "CHAR(7)", "bytes": 7},        # 7-char alphanumeric code
    "long_url": {"type": "VARCHAR(2048)", "bytes": 200},  # avg URL ~200 chars
    "user_id": {"type": "BIGINT", "bytes": 8},            # creator's user ID
    "created_at": {"type": "TIMESTAMP", "bytes": 8},      # creation time
    "expires_at": {"type": "TIMESTAMP", "bytes": 8},      # expiry (nullable)
    "click_count": {"type": "BIGINT", "bytes": 8},        # total clicks cached
}

bytes_per_row = sum(field["bytes"] for field in schema.values())
print(f"Bytes per URL record: {bytes_per_row}")  # Output: 239 bytes

# Estimation: 100M new URLs created per day, kept for 5 years
daily_new_urls = 100_000_000
retention_days = 365 * 5

raw_total = bytes_per_row * daily_new_urls * retention_days
print(f"Raw total: {raw_total / (1024**4):.1f} TB")  # ~39.7 TB

# With 3x replication and 30% index overhead
final = raw_total * 3 * 1.30
print(f"Final estimate: {final / (1024**4):.1f} TB")  # ~154.7 TB
This kind of annotated schema sketch, even when done verbally in an interview, shows the interviewer that you think carefully about what data you're actually storing. Notice how explicitly naming each field and its size forces you to question whether you really need every field.
💡 Pro Tip: In an interview, sketch your schema before estimating storage. The act of drawing columns and types naturally surfaces questions like "Do I really need to store click_count here, or can I compute it from a separate analytics table?" These are the kinds of design insights interviewers want to see.
Accounting for the Pieces Candidates Miss
Beyond the core BRR framework, experienced engineers account for several additional storage layers that often surprise candidates:
🔧 Write-Ahead Logs (WAL): Databases like PostgreSQL write every change to a transaction log before applying it to the main data files. WAL can consume 1-2× your daily write volume in temporary storage. This matters when sizing disk on your database nodes.
🔧 Compaction and Garbage: LSM-tree databases (Cassandra, RocksDB) write data in SSTables and periodically merge them in a process called compaction. During compaction, temporary disk usage can spike to 2× normal. Provision with headroom.
🔧 Audit Logs and Change Data Capture: Many enterprise systems must retain a full audit trail of every data modification. If you delete a record in your primary store, the audit log preserves the deleted data. This can easily double your storage footprint in compliance-heavy domains.
🔧 Backup Storage: Production databases require periodic backups stored separately, often in object storage. Full weekly backups plus daily incrementals add another 30-100% of your primary storage footprint.
Full Storage Budget Breakdown:
┌──────────────────────────────────────────────────────┐
│ Primary Database Nodes                               │
│   Raw data           ████████████████  100 TB        │
│   Replication (3x)   ████████████████  300 TB        │
│   Indexes (30%)      ████████████████   90 TB        │
│   WAL / temp         ████████████████   20 TB        │
│   Compaction buffer  ████████████████   60 TB        │
├──────────────────────────────────────────────────────┤
│ Blob / Object Storage                                │
│   Media files        ████████████████  146 PB        │
├──────────────────────────────────────────────────────┤
│ Backup Storage                                       │
│   DB backups         ████████████████   60 TB        │
│   Blob backups       ████████████████   73 PB        │
└──────────────────────────────────────────────────────┘
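As a sanity check, the database-node portion of this budget can be folded into one helper; the fractions are the rule-of-thumb values from this section, not universal constants:

```python
def db_node_budget(raw_tb: float, rf: int = 3, index_frac: float = 0.30,
                   wal_frac: float = 0.20, compaction_frac: float = 0.20) -> float:
    """Disk to provision on primary DB nodes, in TB (rule-of-thumb model)."""
    replicated = raw_tb * rf                   # 100 TB -> 300 TB
    indexes = replicated * index_frac          # +90 TB
    wal = raw_tb * wal_frac                    # +20 TB of WAL / temp
    compaction = replicated * compaction_frac  # +60 TB of compaction headroom
    return replicated + indexes + wal + compaction

print(db_node_budget(100))  # 470.0 TB, matching the breakdown above
```

Backups and blob storage live in separate tiers, so they are deliberately excluded here.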
⚠️ Common Mistake: Estimating only the size of the "hot" primary database and ignoring blob storage, backups, and WAL overhead. In production systems, these secondary storage pools frequently exceed the primary database in total volume.
Communicating Your Estimate in an Interview
The mechanics of calculation matter less than how you communicate them. Follow this structure:
- State your assumptions out loud: "I'll assume 100M DAU, each writing 2 posts per day."
- Walk through BRR in order: bytes per record, then daily volume, then total over retention.
- Apply multipliers explicitly: "With 3x replication, that becomes X. Indexes add another 30%, bringing us to Y."
- Separate blob from structured data: treat them as distinct storage tiers with different technologies.
- Sanity-check your number: "Does 400 TB of database storage feel right for a platform at Twitter's scale? Twitter has disclosed petabytes of storage, so our estimate seems plausible."
🤔 Did you know? Twitter's infrastructure at peak operated roughly 300 TB of MySQL data before migrating significant portions to Manhattan (their distributed key-value store). Estimates in the hundreds of terabytes for a major social platform are in the right ballpark, which means our back-of-the-envelope math is well-calibrated.
📋 Quick Reference Card: Storage Estimation Checklist
| Step | What to Calculate | Typical Multiplier |
|---|---|---|
| Bytes per record | Sum field sizes in your schema | 1× (baseline) |
| Daily write volume | DAU × writes per user | 1× |
| Retention volume | Daily × days retained | 1× |
| Replication | Raw × replica count | 3× typical |
| Index overhead | Replicated × index fraction | 1.2-1.5× |
| Blob separation | (blob fraction × blob size × days) × replicas | Varies |
| WAL + compaction | Add 10-20% buffer on DB nodes | 1.1-1.2× |
| Backup storage | 30-100% of primary storage | 0.3-1× |
Putting It All Together
Storage estimation is a skill built through repetition. The Twitter example and the URL shortener example follow identical structural logic: BRR, then multipliers, then separation of concerns between structured data and blobs. What changes between problems is the specific numbers and which multipliers dominate.
The read-to-write ratio you derive from your user activity assumptions isn't just bookkeeping; it's the first signal about which storage technology is appropriate. A 25:1 read-heavy system suggests read replicas, caching, and denormalized schemas. A 1:10 write-heavy system suggests LSM-tree storage engines, minimal indexes, and write-optimized hardware.
💡 Mental Model: Think of storage estimation as building a budget before construction. You wouldn't start building a house without knowing whether you need a studio apartment's worth of space or a warehouse. Storage estimation gives you the blueprint's scale before you start arguing about which materials to use.
In the next section, we'll take the scale you've just learned to estimate and use it to drive a principled decision framework for choosing which storage technology (relational, document, column-family, or object store) fits the job at hand.
Choosing the Right Storage Layer for the Job
Every system design interview eventually reaches the same pivotal moment: the interviewer asks, "So what database would you use?" Many candidates freeze, or worse, reflexively answer "PostgreSQL" or "MongoDB" without any justification. The difference between a candidate who lands the offer and one who doesn't often comes down to how confidently and systematically they reason through storage choices. This section gives you that system.
Choosing a storage layer is not about picking your favorite technology. It is about matching the characteristics of a storage solution to the specific demands of your problem. The best engineers treat this as an engineering decision: one that can be reasoned about, justified, and even quantified.
The Storage Selection Checklist
Before naming any database, run through four fundamental dimensions. Think of these as the four questions your interviewer is secretly hoping you'll ask yourself out loud.
1. Consistency Needs
How critical is it that every read reflects the most recent write? A banking ledger where a transfer must be immediately visible across all nodes demands strong consistency. A social media "like" count that can lag by a second or two can tolerate eventual consistency. The CAP theorem (covered in a later lesson) frames this trade-off formally, but practically speaking, ask yourself: what is the worst-case impact of a stale read?
2. Query Patterns
This is one of the most underrated questions in storage selection. A relational database with SQL excels at flexible, ad-hoc queries with JOINs across normalized tables. A document store shines when your access patterns are almost always "fetch the entire entity by ID." A columnar store is purpose-built for aggregating across millions of rows on a handful of columns. If you design for the wrong query pattern, no amount of hardware will save you.
3. Scale Targets
Are you designing for 10,000 users or 100 million? Will your data grow by 1 GB per month or 1 TB per day? This dimension determines whether horizontal scaling matters, whether sharding is necessary, and whether you need a distributed database at all. Many problems that look like they need exotic infrastructure are perfectly served by a well-tuned single Postgres instance with read replicas.
4. Team Familiarity
This factor is often omitted from textbook frameworks, but it is deeply practical. A technology your team does not know becomes a liability in production. In an interview, acknowledging this dimension signals maturity. If two storage solutions are otherwise equivalent, the one your team can operate confidently is the better choice.
🧠 Mnemonic: Remember CQST (Consistency, Query patterns, Scale targets, Team familiarity). Run through CQST before you name any database.
STORAGE SELECTION CHECKLIST
┌────────────────────────────────────────────────────┐
│                  CQST Framework                    │
│                                                    │
│  C - Consistency ──────► Strong? Eventual? None?   │
│  Q - Query Pattern ────► Point lookup? Aggregate?  │
│                          Join-heavy? Full-text?    │
│  S - Scale Target ─────► Reads/sec? Write volume?  │
│                          Data size? Growth rate?   │
│  T - Team Familiarity ─► Operability? Expertise?   │
└────────────────────────────────────────────────────┘
💡 Pro Tip: In an interview, verbalize this checklist out loud before answering. Say something like: "Before I commit to a database, let me think through consistency requirements, query patterns, scale, and operational complexity." This immediately signals structured thinking.
Polyglot Persistence: Why Modern Systems Use Multiple Stores
Polyglot persistence is the practice of using multiple storage technologies within the same system, each chosen for what it does best. The term was popularized by Neal Ford and Martin Fowler, and it reflects a fundamental truth: no single database is optimal for every access pattern within a complex application.
Consider a mature e-commerce platform. It almost certainly uses:
- A relational database (PostgreSQL, MySQL) for orders, payments, and inventory, because these require ACID transactions and complex joins.
- A key-value cache (Redis) for session data and product listings, because speed matters and the data is transient or derived.
- A search engine (Elasticsearch) for product search, because full-text search with faceting and relevance ranking is what it is built for.
- A blob/object store (S3) for product images and static assets, because storing binary files in a relational database is expensive and slow.
- An analytics warehouse (Snowflake, BigQuery) for business intelligence queries, because OLAP workloads on terabytes of data require columnar storage.
POLYGLOT PERSISTENCE IN AN E-COMMERCE PLATFORM
┌─────────────────────────────────────────────┐
│              Application Layer              │
└──────────────────┬──────────────────────────┘
                   │
      ┌────────────┼──────────────────────┐
      │            │                      │
      ▼            ▼                      ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ PostgreSQL  │ │    Redis    │ │Elasticsearch│
│ (Orders,    │ │ (Sessions,  │ │  (Product   │
│  Payments,  │ │  Caching)   │ │   Search)   │
│  Inventory) │ └─────────────┘ └─────────────┘
└─────────────┘
      │               │
      ▼               ▼
┌─────────────┐ ┌─────────────┐
│     S3      │ │  Snowflake  │
│  (Images,   │ │ (Analytics, │
│   Assets)   │ │  Reporting) │
└─────────────┘ └─────────────┘
🎯 Key Principle: Polyglot persistence is not complexity for its own sake. Each additional storage technology must earn its place by solving a problem that existing stores cannot solve adequately.
⚠️ Common Mistake: Introducing polyglot persistence too eagerly. Every additional store is a new failure domain, a new operational burden, and a new consistency challenge between stores. Premature polyglot persistence is a real anti-pattern.
Scenario Walkthroughs
The best way to internalize this framework is to apply it. Let's walk through three distinct scenarios that commonly appear in system design interviews.
Scenario 1: E-Commerce Order System
An order system must record purchases, track status transitions (placed → paid → shipped → delivered), and support customer queries like "show me all orders from the last 30 days." It must never lose a transaction or record a partial write.
Running CQST:
- Consistency: Strong. A customer's payment and the corresponding order record must be written atomically. Double-charges or phantom orders are business-critical failures.
- Query Pattern: Mixed. Point lookups by order ID, but also range queries by customer and date. JOINs between orders, line items, and products.
- Scale: Moderate write volume, high read volume. Most e-commerce sites have far more order reads (status checks) than writes.
- Team: Most engineering teams know SQL. Relational databases have mature tooling.
Decision: A relational database (PostgreSQL) is the right primary store. ACID transactions protect payment integrity. SQL handles the mixed query patterns. A Redis cache in front of order status reads reduces load for the common "is my order shipped?" pattern.
Scenario 2: Real-Time Analytics Dashboard
A SaaS product wants a dashboard showing active users per minute, feature usage trends over 90 days, and funnel conversion rates β all computed over billions of event rows.
Running CQST:
- Consistency: Eventual is fine. A dashboard that is 30 seconds behind is acceptable. Approximate counts are often acceptable too.
- Query Pattern: Heavy aggregation over large datasets. Few or no JOINs. Queries scan millions to billions of rows but touch only a few columns.
- Scale: Massive. Potentially billions of rows, terabytes of data, growing continuously.
- Team: Needs SQL-like query language for analysts.
Decision: A columnar analytics warehouse (Snowflake, BigQuery, or ClickHouse for self-hosted) is the right answer. Columnar storage makes aggregation over sparse columns dramatically faster. A streaming ingestion layer (Kafka + a stream processor) feeds events into the warehouse in near-real-time.
Scenario 3: User Session Storage
A web application needs to store user session data: authentication tokens, shopping cart contents, and recent preferences. Sessions expire after 30 minutes of inactivity.
Running CQST:
- Consistency: Weak. If a session is momentarily stale, the user might see a slightly outdated cart. This is tolerable.
- Query Pattern: Pure key-value. Always "get session by session ID" or "set session by session ID." No aggregation, no joins.
- Scale: High read/write frequency, but tiny data per key. Total data volume is bounded by concurrent active users.
- Team: Redis is universally known. Expiry (TTL) is a native feature.
Decision: Redis is the canonical answer. Native TTL support handles session expiration automatically. Sub-millisecond latency serves authentication checks without adding perceptible delay to every request. Persistence is optional and should be evaluated based on whether losing sessions on a Redis restart is acceptable.
When a Single Database Is the Right Answer
One of the most counterintuitive lessons in system design is that simpler is often better. The instinct in interviews is to propose sophisticated multi-store architectures, but experienced interviewers often look for candidates who know when not to add complexity.
A single well-chosen database is likely sufficient when:
- The access patterns are uniform and don't span wildly different data models.
- The scale is within the range of a single powerful machine or a small cluster (millions of rows, not billions).
- The team is early-stage and operational complexity is a significant cost.
- The product is pre-product-market fit and requirements will change frequently.
💡 Real-World Example: Basecamp, a project management SaaS serving hundreds of thousands of users, ran on a single MySQL database for years. DHH (its creator) has written extensively about how this was a deliberate, successful choice. Not every system is Google.
❌ Wrong thinking (for a startup MVP with 500 users): "I'll use PostgreSQL for the main data, Redis for caching, Elasticsearch for search, Cassandra for time-series events, and S3 for files."
✅ Correct thinking: "PostgreSQL can handle all of this at our current scale. I'll add Redis when query latency becomes measurable, and Elasticsearch when full-text search requirements are confirmed."
When to move to a layered architecture: Add a new storage layer when you have measured evidence that the current store cannot handle a specific workload, not before. This might be a cache when p99 latency on hot reads exceeds a threshold, or a search engine when LIKE queries begin causing table scans.
Code Example: A Python Service Layer That Routes to Multiple Stores
The following code demonstrates a storage router: a service layer that abstracts the decision of which store to read from or write to. This pattern is common in production systems practicing polyglot persistence.
import redis
import psycopg2
import json
from typing import Optional

# Configuration (in production, use environment variables or a config service)
REDIS_HOST = "localhost"
REDIS_PORT = 6379
PG_DSN = "postgresql://user:pass@localhost:5432/ecommerce"
SESSION_TTL_SECONDS = 1800  # 30 minutes

class StorageRouter:
    """
    Routes reads and writes to the appropriate storage layer.
    - Sessions → Redis (fast, ephemeral, TTL-native)
    - Orders → PostgreSQL (durable, ACID, relational)
    - Hot product data → Redis cache with PostgreSQL as source of truth
    """

    def __init__(self):
        # Initialize Redis connection pool
        self.redis = redis.Redis(
            host=REDIS_HOST,
            port=REDIS_PORT,
            decode_responses=True
        )
        # Initialize PostgreSQL connection
        self.pg_conn = psycopg2.connect(PG_DSN)

    # ── SESSION OPERATIONS (Redis only) ─────────────────────────────────────
    def set_session(self, session_id: str, data: dict) -> None:
        """Write session data to Redis with automatic TTL expiry."""
        key = f"session:{session_id}"
        self.redis.setex(
            name=key,
            time=SESSION_TTL_SECONDS,
            value=json.dumps(data)
        )

    def get_session(self, session_id: str) -> Optional[dict]:
        """Read session data from Redis. Returns None if expired or missing."""
        key = f"session:{session_id}"
        raw = self.redis.get(key)
        return json.loads(raw) if raw else None

    # ── ORDER OPERATIONS (PostgreSQL only) ──────────────────────────────────
    def create_order(self, user_id: int, items: list, total: float) -> int:
        """
        Write a new order to PostgreSQL inside a transaction.
        Returns the new order ID.
        """
        with self.pg_conn.cursor() as cur:
            try:
                # Insert order header
                cur.execute(
                    "INSERT INTO orders (user_id, total, status) "
                    "VALUES (%s, %s, 'placed') RETURNING id",
                    (user_id, total)
                )
                order_id = cur.fetchone()[0]
                # Insert line items
                for item in items:
                    cur.execute(
                        "INSERT INTO order_items (order_id, product_id, quantity, price) "
                        "VALUES (%s, %s, %s, %s)",
                        (order_id, item["product_id"], item["qty"], item["price"])
                    )
                self.pg_conn.commit()  # Atomic commit for both tables
                return order_id
            except Exception:
                self.pg_conn.rollback()  # Rollback on any failure
                raise

    # ── PRODUCT READ (Cache-aside pattern) ──────────────────────────────────
    def get_product(self, product_id: int) -> Optional[dict]:
"""
Cache-aside pattern:
1. Check Redis for cached product data.
2. On miss, fetch from PostgreSQL and populate the cache.
"""
cache_key = f"product:{product_id}"
# Step 1: Try the cache first
cached = self.redis.get(cache_key)
if cached:
return json.loads(cached) # Cache hit: fast path
# Step 2: Cache miss β query the source of truth
with self.pg_conn.cursor() as cur:
cur.execute(
"SELECT id, name, price, stock FROM products WHERE id = %s",
(product_id,)
)
row = cur.fetchone()
if not row:
return None
product = {"id": row[0], "name": row[1],
"price": float(row[2]), "stock": row[3]}
# Step 3: Populate cache with a 5-minute TTL
self.redis.setex(cache_key, 300, json.dumps(product))
return product
This StorageRouter class encodes three distinct routing decisions in one place. Session operations go exclusively to Redis, leveraging its native TTL. Order creation goes exclusively to PostgreSQL inside a database transaction, ensuring atomicity across the orders and order_items tables. Product reads use the cache-aside pattern β check Redis first, fall back to PostgreSQL on a cache miss, then populate the cache for future reads.
Notice how the calling code (a web handler, a background job) never needs to know which database is being used for which operation. The storage routing logic is centralized and testable.
## Example usage β the caller knows nothing about which store is used
router = StorageRouter()
## Session management
router.set_session("abc123", {"user_id": 42, "cart": [{"sku": "WIDGET-1", "qty": 2}]})
session = router.get_session("abc123")
print(session) # {'user_id': 42, 'cart': [{'sku': 'WIDGET-1', 'qty': 2}]}
## Place an order (atomic PostgreSQL transaction)
order_id = router.create_order(
user_id=42,
items=[{"product_id": 1, "qty": 2, "price": 9.99}],
total=19.98
)
print(f"Order created: {order_id}")
## Fetch a product (cache-aside from Redis β PostgreSQL)
product = router.get_product(product_id=1)
print(product) # {'id': 1, 'name': 'Widget', 'price': 9.99, 'stock': 148}
π‘ Pro Tip: In an interview, you don't need to write this level of code. But sketching the idea β "I'd have a service layer that routes sessions to Redis and orders to Postgres" β shows precisely the architectural thinking that distinguishes strong candidates.
β οΈ Common Mistake: Forgetting cache invalidation. In the cache-aside pattern above, if a product's price is updated in PostgreSQL, the Redis cache will serve stale data for up to 5 minutes. The solution is either to invalidate the cache key on writes (redis.delete(cache_key)) or to accept bounded staleness. Always address this trade-off explicitly in interviews.
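To make the invalidate-on-write option concrete, here is a toy model with plain dicts standing in for PostgreSQL and Redis (the class and method names are illustrative, not part of the StorageRouter above):

```python
class InvalidatingStore:
    """Toy invalidate-on-write model: every price update deletes the
    cached copy, so the next read repopulates from the source of truth
    instead of serving stale data for the remainder of a TTL."""

    def __init__(self):
        self.db = {}      # stands in for PostgreSQL
        self.cache = {}   # stands in for Redis

    def update_price(self, product_id: int, price: float) -> None:
        self.db[product_id] = price                    # write source of truth
        self.cache.pop(f"product:{product_id}", None)  # invalidate cache key

    def get_price(self, product_id: int):
        key = f"product:{product_id}"
        if key in self.cache:
            return self.cache[key]                     # cache hit
        price = self.db.get(product_id)
        if price is not None:
            self.cache[key] = price                    # repopulate on miss
        return price
```

The trade-off versus bounded staleness: writes now pay an extra cache round-trip, but readers never see a price older than the last committed write.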
A Quick Reference Decision Guide
π Quick Reference Card:
| π― Use Case | β Recommended Store | β οΈ Key Reason |
|---|---|---|
| π Financial transactions | Relational (PostgreSQL) | ACID, JOINs, strong consistency |
| π§ Session / ephemeral data | Key-value (Redis) | TTL support, sub-ms latency |
| π Full-text product search | Search engine (Elasticsearch) | Relevance ranking, facets |
| π§ User activity / event streams | Time-series (InfluxDB, ClickHouse) | Efficient time-range queries |
| π― Flexible document schemas | Document (MongoDB) | Schema flexibility, nested docs |
| π Images, video, binary files | Object store (S3) | Cost, durability, CDN integration |
| π BI / aggregate analytics | Columnar warehouse (Snowflake) | Scan efficiency on wide tables |
| π Social graph / relationships | Graph (Neo4j) | Traversal, relationship queries |
π€ Did you know? Discord famously migrated from Cassandra to ScyllaDB (a Cassandra-compatible database written in C++) to handle billions of messages β not because Cassandra was wrong for the problem, but because ScyllaDB's architecture produced dramatically more consistent tail latency at their scale. Even correct technology choices get revisited as scale changes.
Bringing It Together: Making the Call in an Interview
When you sit across from an interviewer and they describe a system to design, your storage reasoning should follow a clear arc. First, identify the data entities and how they relate to each other. Second, map each entity or access pattern to a CQST analysis. Third, propose a storage architecture β simple if the problem warrants it, layered if it doesn't. Fourth, explicitly acknowledge the trade-offs you're making.
The last step is the one most candidates skip. Saying "I'd use Redis for sessions" is acceptable. Saying "I'd use Redis for sessions because its native TTL handles expiration without a background job, the access pattern is pure key-value so we don't need SQL, and the data is inherently ephemeral so durability is not a requirement" is what earns the offer.
Storage decisions, like all engineering decisions, are ultimately about trade-offs. The engineer who can articulate those trade-offs clearly β even imperfectly β is far more valuable than one who names the "right" database without knowing why.
Common Storage Pitfalls in System Design Interviews
Knowing which database to choose is only half the battle in a system design interview. The other half is knowing which traps to avoid. Experienced interviewers are not just listening for the right answer β they are actively listening for the reasoning behind your choices, and they know exactly where candidates stumble. This section maps the most common storage mistakes, explains why they happen, and gives you the concrete language and mental models to sidestep them confidently.
Pitfall 1: The Over-Engineering Trap
One of the most common signals of inexperience in a system design interview is reaching immediately for a distributed database when the problem does not yet require one. The moment a candidate hears "design a system for X," they jump to "we'll use a globally distributed, multi-region, sharded Cassandra cluster." This is the over-engineering trap, and it costs candidates in two ways: it signals poor judgment about tradeoffs, and it burns time on complexity that the problem has not earned yet.
π― Key Principle: A single well-tuned PostgreSQL instance can handle tens of thousands of queries per second and store hundreds of gigabytes of relational data. Most startups and mid-sized applications never outgrow it.
Consider a candidate asked to design a URL shortener. The write load is modest β users create shortened links far less frequently than they click them. A single-node relational database with proper indexing can handle millions of rows without breaking a sweat. Yet candidates routinely propose distributed key-value stores before ever asking "what is our expected write throughput?"
β Wrong thinking: "This is a scalable system, so I need a distributed database from day one."
β Correct thinking: "Let me estimate the load first, exhaust single-node options, and only introduce distribution when I can justify the operational cost."
The progression of scale should follow a deliberate ladder:
Load grows over time:
[Single-node DB] β [Vertical scaling] β [Read replicas] β [Sharding/Distribution]
β
Start here.
Most systems live here longer than engineers expect.
π‘ Pro Tip: In an interview, explicitly walking through this ladder β even if you ultimately land on a distributed solution β demonstrates mature engineering judgment. Say something like: "Before we introduce sharding complexity, let me check whether a primary with read replicas would cover us. Based on our earlier estimate of 5,000 reads/second, a single primary plus two replicas likely handles this comfortably."
β οΈ Common Mistake: Proposing DynamoDB or Cassandra for a system with 1,000 daily active users. Distributed databases come with real operational costs β eventual consistency, complex failure modes, higher latency for certain access patterns, and significantly more difficult debugging. Never pay those costs without earning them with scale requirements.
Pitfall 2: Ignoring CAP Theorem Implications
The CAP theorem states that a distributed system can guarantee at most two of three properties: Consistency (every read receives the most recent write), Availability (every request receives a response), and Partition Tolerance (the system continues operating despite network partitions). Because network partitions are a reality in any distributed system, the real tradeoff is between consistency and availability during a partition.
Many candidates can recite CAP theorem definitions on demand. Far fewer can reason through what their storage choice actually means when a network partition occurs in the system they just proposed.
π€ Did you know? Eric Brewer, who formulated the CAP theorem, later clarified that "2 of 3" is an oversimplification β the real nuance is about how much consistency and how much availability you sacrifice during partitions, which is why PACELC (an extension of CAP) is often more practical for interview discussions.
Consider a candidate designing a banking ledger system who proposes using Apache Cassandra (an AP-leaning system whose consistency is tunable per query). During a network partition, Cassandra configured for high availability may return stale data. For a balance inquiry or a funds transfer, this could mean a user sees an incorrect balance and the system allows an overdraft. The candidate failed to connect their storage choice to the consistency requirements of their domain.
CAP Tradeoffs During a Partition:
PARTITION OCCURS
β
ββββββββββββ΄βββββββββββ
βΌ βΌ
Choose Consistency Choose Availability
(CP System) (AP System)
β β
Some nodes All nodes
refuse writes accept writes
until sync (may diverge)
β β
β
Data safe β
System stays up
β Reduced availability β Possible stale reads
Examples:
CP: HBase, ZooKeeper, traditional RDBMS with sync replication
AP: Cassandra, CouchDB, DynamoDB (by default)
The correct approach is to work backward from your consistency requirements before choosing a storage engine. Ask yourself:
π§ Consistency-first domains: financial transactions, inventory counts, authentication tokens β choose CP systems or strong consistency settings.
π§ Availability-first domains: social media feeds, analytics counters, user preference flags β AP systems with eventual consistency are acceptable because a slightly stale "like count" does not harm the user.
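The "strong consistency settings" mentioned above are, in Dynamo-style stores such as Cassandra and DynamoDB, per-query quorum sizes. The classic overlap rule can be stated in a few lines (a sketch; N, W, and R are the standard replication/quorum parameters, introduced here for illustration):

```python
def reads_see_latest_write(n: int, w: int, r: int) -> bool:
    """Quorum overlap rule for Dynamo-style replication.

    With N replicas, a write acknowledged by W of them, and a read that
    consults R of them, the read set is guaranteed to intersect the
    latest write set whenever R + W > N.
    """
    return r + w > n

# N=3 replicas: QUORUM writes + QUORUM reads overlap; ONE + ONE does not
strong = reads_see_latest_write(n=3, w=2, r=2)  # overlapping quorums
weak = reads_see_latest_write(n=3, w=1, r=1)    # replicas may diverge
```

Naming this rule in an interview is a compact way to show you understand that "AP vs. CP" is a dial, not a switch.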
π‘ Real-World Example: Amazon's shopping cart is a famous example of an AP design choice. During a partition, the system allows writes to the cart even if some replicas are unreachable β resulting in occasional "merge conflicts" (two items added to the same cart slot). Amazon explicitly chose availability over consistency here because an unavailable cart loses revenue. The merge conflict is a tolerable edge case. This was the real-world motivation behind Dynamo, Amazon's internal key-value store.
β οΈ Common Mistake: Proposing eventual consistency for a system where the interviewer has already told you correctness is critical. If you say "we'll use an AP system and reconcile later" for a payment processor, you need to explain exactly how reconciliation works, what happens to the user experience during divergence, and how long convergence takes. If you cannot answer those follow-ups, you have proposed a system you do not understand.
Pitfall 3: Forgetting About Hot Spots
Sharding β the practice of distributing data across multiple database nodes β sounds like a clean solution to scale. But many candidates propose sharding without considering hot spots: situations where a disproportionate amount of traffic concentrates on a single shard, negating the benefits of distribution entirely.
A hot spot (or hot key) occurs when your shard key does not distribute load evenly. The classic example is sharding a social media platform's posts by user_id. This seems logical until you realize that a celebrity with 50 million followers generates vastly more reads and writes than an average user. Their user_id shard becomes a bottleneck while other shards sit idle.
## ANTI-PATTERN: Sharding by user_id for a social platform
## This creates hot spots for high-traffic users
def get_shard(user_id: int, num_shards: int) -> int:
# Naive modulo sharding β looks clean, but...
return user_id % num_shards
# Problem: user_id=1 (celebrity) routes ALL their
# 50M followers' feed reads to a single shard
## BETTER APPROACH: Composite key with randomization
## for write-heavy hot users
import hashlib
import random
def get_shard_with_salt(user_id: int, num_shards: int, salt_range: int = 10) -> int:
"""
For known hot users, append a random salt to distribute
writes across multiple physical shards. Reads must then
query all salt variants and merge β a deliberate tradeoff.
"""
salt = random.randint(0, salt_range - 1)
composite_key = f"{user_id}#{salt}"
hash_val = int(hashlib.md5(composite_key.encode()).hexdigest(), 16)
return hash_val % num_shards
# Now writes for user_id=1 spread across up to 10 shards
# Read cost increases (scatter-gather), but write throughput scales
The code above illustrates the write salting technique: for known hot keys, you intentionally spread writes across multiple shards by appending a random suffix to the key. Reads then perform a scatter-gather across all salt variants. This is a deliberate tradeoff β you increase read complexity to prevent write bottlenecks.
Other hot spot patterns to watch for:
π§ Time-based hot spots: Sharding by timestamp means all current writes go to the "latest" shard. Every time-series database and event log system faces this. Solutions include range sharding with rotation or hash-based time bucketing.
π Sequential ID hot spots: Auto-incrementing IDs route new records to the same shard until the range fills. Use UUID v4 or hash-based IDs to randomize distribution.
π― Geographic hot spots: If 80% of your users are in one region and you shard geographically, that region's shard is overloaded. Ensure your sharding strategy reflects actual traffic distribution, not theoretical geographic uniformity.
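The hash-based time bucketing mentioned for time-based hot spots can be sketched in the same style as the salting example above (a hypothetical helper, not from any specific database):

```python
import hashlib

def time_bucket_shard(series_key: str, epoch_seconds: int,
                      num_shards: int, bucket_seconds: int = 3600) -> int:
    """Shard a time-series write by hashing (series, time-bucket).

    Concurrent writes for different series spread across shards instead
    of all landing on the 'latest' shard, while a read for a given series
    and time range can still compute exactly which buckets to visit.
    """
    bucket = epoch_seconds // bucket_seconds
    digest = hashlib.sha256(f"{series_key}:{bucket}".encode()).hexdigest()
    return int(digest, 16) % num_shards
```

Note the deliberate trade-off: routing is still deterministic (no scatter-gather within one series and hour), but "all writes right now" no longer concentrate on one shard.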
π‘ Mental Model: Think of sharding like distributing weight across table legs. If you place all the weight in one corner, the table tips over no matter how many legs you added. The shard key is your weight distribution policy β choose it based on where the load actually falls, not where it theoretically should.
β οΈ Common Mistake: Proposing a shard key in an interview without discussing the access patterns. An interviewer will almost always follow up with "what happens if one user is significantly more popular than others?" Be ready to address this proactively.
Pitfall 4: Conflating Caching with Persistence
Caches are fast. Caches are seductive. And caches will lose your data if you treat them like databases. The conflation of caching and persistence is one of the most dangerous misunderstandings in system design interviews, because it can lead to architectures that silently lose user data under pressure.
The core distinction is this: a cache stores data that can be reconstructed β it is a performance optimization. A persistent store stores data that must survive failures β it is the system of record. Redis is frequently misused here. Redis is an excellent cache (and a respectable message broker), but unless you have explicitly configured RDB snapshots or AOF (Append-Only File) logging with appropriate durability settings, a Redis restart will lose everything in memory.
## DANGEROUS ANTI-PATTERN: Using Redis as the primary store
## for user session data without any persistence fallback
import redis
client = redis.Redis(host='localhost', port=6379)
## Writing a user session ONLY to Redis
def save_session(session_id: str, user_data: dict) -> None:
    client.hset(f"session:{session_id}", mapping=user_data)  # hmset is deprecated in redis-py
# β οΈ If Redis restarts with default config (no AOF, no RDB),
# ALL sessions are lost. Every logged-in user gets logged out.
# This is not a cache miss β this is data loss.
## SAFER PATTERN: Write-through with persistent fallback
import json
import psycopg2 # PostgreSQL driver
def save_session_with_persistence(
session_id: str,
user_data: dict,
redis_client: redis.Redis,
pg_conn: psycopg2.extensions.connection
) -> None:
"""
1. Write to persistent store first (source of truth)
2. Then populate cache for fast reads
Cache miss = read from DB, not data loss
"""
    # Persist to PostgreSQL first (source of truth)
    with pg_conn.cursor() as cursor:
        cursor.execute(
            "INSERT INTO sessions (id, data) VALUES (%s, %s) "
            "ON CONFLICT (id) DO UPDATE SET data = EXCLUDED.data",
            (session_id, json.dumps(user_data))
        )
    pg_conn.commit()
    # Then cache in Redis with an expiry (using the injected client, not a global)
    redis_client.setex(
        f"session:{session_id}",
        3600,  # 1 hour TTL
        json.dumps(user_data)
    )
    # Now a Redis failure = slower reads, NOT lost sessions
This code demonstrates the write-through caching pattern: data is written to the persistent store first, then placed in the cache. A cache failure degrades performance but never loses data. The alternative β cache-first with async write-back β is valid for some use cases (like analytics counters) but must be named and justified explicitly, because it accepts a window of data loss.
The data loss risk spectrum looks like this:
Data Loss Risk: Low βββββββββββββββββββββββββββββββββββΊ High
Write-through Write-behind Cache-only
(persist first, (cache first, (no persistence
then cache) async persist) configured)
β β β
β
No loss on β οΈ Loss during β Total loss
cache failure async window on restart
π§ Mnemonic: Think of the cache as a sticky note on your monitor and the database as your filed document. The sticky note is fast and convenient, but if your monitor restarts, the sticky note is gone. The filed document survives.
β οΈ Common Mistake: Saying "we'll cache writes in Redis and flush to the database periodically" without specifying the flush interval, what happens to unflushed writes during a crash, and whether the business domain can tolerate that loss window. For financial data, the answer is almost always no.
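The loss window of that "flush periodically" design can be made explicit with a toy write-behind buffer (an illustrative sketch, not a real client): whatever sits in the in-memory buffer at crash time is gone.

```python
class WriteBehindBuffer:
    """Toy write-behind buffer: writes accumulate in memory and are
    flushed to the persistent store in batches. Anything still in
    `pending` when the process dies is lost, and the maximum size of
    that window is exactly the data loss you must justify."""

    def __init__(self, flush_every: int):
        self.flush_every = flush_every
        self.pending = []     # at-risk writes (in memory only)
        self.persisted = []   # stands in for the durable store

    def write(self, record) -> None:
        self.pending.append(record)
        if len(self.pending) >= self.flush_every:
            self.flush()

    def flush(self) -> None:
        self.persisted.extend(self.pending)
        self.pending.clear()
```

For analytics counters a `flush_every` of thousands may be fine; for payments the only defensible buffer size is zero, which is just write-through again.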
Pitfall 5: Failing to Address Backup, Recovery, and Migration
The final pitfall is one of omission rather than commission. Candidates design an elegant storage architecture and then stop β as if data, once written, perpetually maintains itself. In reality, production storage systems require ongoing operational thinking: backup strategies, recovery time objectives, and data migration plans. Neglecting these in an interview signals that the candidate has never been responsible for a production database.
Interviewers, especially senior ones, often probe this explicitly with questions like "how do you handle a database corruption event?" or "what happens when you need to add a column to a table with 500 million rows?"
Backup and Recovery Fundamentals
Two critical metrics frame every backup conversation:
- RPO (Recovery Point Objective): How much data loss is acceptable? If your RPO is 1 hour, your backups must run at least every hour.
- RTO (Recovery Time Objective): How long can the system be down during recovery? An RTO of 15 minutes means your restore process must complete within 15 minutes.
Backup Strategy Comparison:
Strategy | RPO | RTO | Storage Cost
ββββββββββββββββββ|ββββββββββββββ|ββββββββββββ|βββββββββββββ
Full daily backup | Up to 24h | Moderate | High
Full + WAL logs | Near-zero | Moderate | High
Continuous backup | Seconds | Low | Very High
(e.g., AWS RDS | | |
Point-in-Time) | | |
Replica failover | Near-zero | Very low | 2x storage
(hot standby) | | (seconds) |
π‘ Pro Tip: In an interview, tie your backup strategy to the domain. A social media post system might accept an RPO of 1 hour (some posts could be re-created). A payment transaction system requires near-zero RPO β every transaction must be recoverable. Interviewers respect this nuance.
Data Migration Without Downtime
The question "how do you add a column to a live table?" is a classic senior-level probe. The naive answer β ALTER TABLE users ADD COLUMN preferences JSONB DEFAULT '{}' β takes an exclusive lock on the table, and on engines that rewrite every row to apply the default (MySQL without online DDL, or PostgreSQL prior to version 11, which made constant defaults a metadata-only change), that lock is held for the entire rewrite, causing downtime proportional to table size. On a 500-million-row table, that could be hours.
The correct answer involves online schema change techniques:
-- DANGEROUS on large tables: takes an exclusive lock
ALTER TABLE users ADD COLUMN preferences JSONB DEFAULT '{}';
-- On engines that rewrite the table to apply the default
-- (MySQL without online DDL, PostgreSQL before v11), this
-- can block all reads and writes for hours on 500M rows
-- SAFER APPROACH: Expand-Contract Pattern
-- Step 1: Add nullable column (fast β no data rewrite)
ALTER TABLE users ADD COLUMN preferences JSONB;
-- Step 2: Backfill in small batches during off-peak
-- UPDATE users SET preferences = '{}'
-- WHERE id BETWEEN :start AND :end AND preferences IS NULL;
-- Step 3: Add NOT NULL constraint only after backfill completes
-- ALTER TABLE users ALTER COLUMN preferences SET NOT NULL;
This is the expand-contract pattern: first expand the schema to be backward-compatible (adding a nullable column), backfill data gradually without locking, then contract to the final state (adding constraints). Tools like pt-online-schema-change (for MySQL) and pg_repack (for PostgreSQL) automate this process.
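The batched backfill in Step 2 is typically driven by a small loop over ID ranges. A generator like this keeps each UPDATE short-lived (a hypothetical helper; in practice `min_id`/`max_id` would come from `SELECT min(id), max(id)`):

```python
from typing import Iterator, Tuple

def backfill_batches(min_id: int, max_id: int,
                     batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) ID ranges covering [min_id, max_id].

    Each range feeds one UPDATE ... WHERE id BETWEEN start AND end, so
    no single statement locks more than batch_size rows at a time.
    """
    start = min_id
    while start <= max_id:
        end = min(start + batch_size - 1, max_id)
        yield start, end
        start = end + 1
```

In production you would also pause between batches (and watch replication lag) so the backfill never starves foreground traffic.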
π― Key Principle: Always have a rollback plan. Before running any migration, know exactly how to undo it. In an interview, proactively mentioning "and here is how we roll this back if something goes wrong" is a strong signal of production maturity.
Putting It All Together: The Self-Check Framework
Before you finish presenting any storage decision in an interview, run through this mental checklist:
π Quick Reference Card: Storage Pitfall Avoidance Checklist
| # | β Question to Ask Yourself | β Green Flag | β Red Flag |
|---|---|---|---|
| π§ 1 | Did I justify scale before distributing? | Estimated load, started simple | Jumped to Cassandra immediately |
| π 2 | What does my system do during a partition? | Named CP/AP tradeoff explicitly | Ignored partition behavior |
| π― 3 | Is my shard key evenly distributed? | Analyzed access patterns | Used naive sequential key |
| π 4 | Can I lose cache data without losing user data? | Cache is a performance layer only | Cache is the only write path |
| π§ 5 | What is my RPO/RTO and migration plan? | Named objectives and rollback | Stopped at "we store the data" |
π― Key Principle: The best system design candidates are not the ones who know the most databases β they are the ones who can reason about tradeoffs out loud, connect technical choices to business requirements, and proactively surface risks before the interviewer has to ask about them.
These five pitfalls share a common root cause: premature optimization without grounding in requirements. The antidote is always the same β ask clarifying questions first, estimate before architecting, and explain every tradeoff explicitly. An interviewer who sees you navigate these pitfalls consciously will trust that your production systems would navigate them too.
Key Takeaways and Interview Cheat Sheet
You made it to the end of the foundational storage module β and that matters more than it might seem. Most engineers walk into system design interviews knowing that databases exist; you now understand how they behave, why they differ, and when to choose one over another. That shift from consumer to architect is exactly what separates passing candidates from standout ones.
This section is your consolidation point. Think of it as the reference card you wish you had before your last interview: sharp definitions, the three questions that unlock every storage decision, a visual roadmap to the specialized topics ahead, and a hands-on practice prompt to test everything you've built.
Quick-Reference Summary: The Core Vocabulary in Plain Language
Before any decision framework can work, the vocabulary has to be automatic. Interviewers will use these terms without pausing to define them, and hesitation signals uncertainty. Here is every foundational concept from this lesson distilled to its sharpest form.
π Quick Reference Card: Core Storage Concepts
| π Concept | π Plain-Language Definition | β‘ Why It Matters in Interviews |
|---|---|---|
| π ACID | Atomicity, Consistency, Isolation, Durability β a transaction is all-or-nothing, moves the database between valid states, is shielded from concurrent transactions, and survives crashes once committed | Invoke ACID when correctness is non-negotiable: payments, inventory, user accounts |
| π BASE | Basically Available, Soft state, Eventually consistent β the trade-off NoSQL systems make to achieve horizontal scale | Invoke BASE when availability and partition tolerance beat strict accuracy |
| π¦ Replication | Copying data to multiple nodes so reads can scale and a node failure doesn't lose data | Always state your replication factor and whether reads hit replicas or only the primary |
| ποΈ Partitioning (Sharding) | Splitting data across nodes by a partition key so no single machine holds everything | Pick a partition key that distributes load evenly and supports your most common queries |
| π Data Model | The shape of your data: relational (tables), document (JSON), wide-column, graph, or key-value | Choose the model that matches your access patterns, not the one you're most comfortable with |
| π CAP Theorem | A distributed system can guarantee only two of: Consistency, Availability, Partition Tolerance | During a network partition you must choose C or A β frame your choice explicitly |
| π Normalization | Eliminating redundancy by breaking data into related tables | Reduces write anomalies; increases join cost β classic read vs. write trade-off |
| πͺ΅ Write-Ahead Log (WAL) | A sequential log of every change written before the change hits the main data files | Foundation of crash recovery and replication in most relational systems |
| π³ B-Tree Index | A balanced tree structure allowing O(log n) point lookups and range scans | Default index type; great for range queries on ordered data |
| #οΈβ£ LSM Tree | Log-Structured Merge-Tree β optimizes for sequential writes, compacts data in background | Used by Cassandra, RocksDB; shines in write-heavy workloads |
| π§ Cold vs. Hot Data | Hot data is accessed frequently; cold data rarely | Tier storage cost: hot on SSD/cache, cold on object storage |
| πΎ Durability | Data survives crashes, power failures, and hardware loss | Achieved through WAL, fsync, replication β not something to assume, something to design |
π§ Mnemonic: To remember the four ACID properties use "A Consistent Isolation Delivers" β Atomicity, Consistency, Isolation, Durability.
π§ Mnemonic: For CAP Theorem, picture a triangle with C, A, and P at the corners. During a partition (P is always present in real networks), you slide toward either the C corner or the A corner β never both.
The Three Questions to Always Ask Before Picking a Database
Every storage decision in a system design interview can be unlocked by asking three questions in sequence. These aren't tricks β they are the actual reasoning process that experienced engineers use. Internalizing them transforms an open-ended question like "what database would you use?" into a structured analysis.
Question 1: What Are My Access Patterns?
This is always first because the access pattern determines the data model, which in turn constrains which database families are even eligible. Ask yourself: will queries be point lookups by a known key, range scans over time windows, graph traversals across relationships, or full-text searches? A ride-sharing app needs to look up a driver's current location by driver ID (point lookup), scan a rider's trip history by date (range scan), and match nearby drivers to a rider (geospatial query). Three different access patterns that might point to three different storage layers.
π‘ Pro Tip: In an interview, state your access patterns before naming a database. Saying "I'll use PostgreSQL" before explaining why you need relational guarantees sounds like a habit. Saying "my primary access pattern is transactional writes with strong consistency requirements between a user's account balance and their transaction history, which points me toward a relational database with ACID guarantees β so PostgreSQL" sounds like engineering.
Question 2: What Are My Consistency and Availability Requirements?
Once you know the access pattern, ask whether the business can tolerate stale reads, lost writes, or partial failures. Payment systems cannot. Social media feeds can. This question forces you to engage with CAP Theorem concretely rather than abstractly. It also surfaces replication strategy: if you need strong consistency, replica reads are dangerous unless you use synchronous replication or read-your-writes sessions. If you can tolerate eventual consistency, you unlock massive read scalability through async replication.
π― Key Principle: Consistency is a spectrum, not a binary. Distinguish between linearizability (every read sees the most recent write globally), read-your-writes consistency (a user sees their own writes immediately), and eventual consistency (all replicas converge given enough time). Different parts of the same system can live at different points on this spectrum.
Question 3: What Are My Scale Dimensions?
Scale has at least three independent axes: read volume, write volume, and data size. Each has different solutions. High read volume β read replicas and caching. High write volume β write sharding and LSM-tree-based stores. Large data size β partitioning and tiered storage. Getting this wrong is a classic interview mistake β candidates immediately jump to horizontal sharding when their bottleneck is actually read-heavy and solvable with a read replica and a Redis cache layer.
π‘ Mental Model: Think of scale as a three-axis graph. Plot your system on it. Which axis is closest to the ceiling? That's where your design energy should go first.
## A simple capacity estimation helper that surfaces all three scale dimensions
## Use this kind of structured thinking to anchor your interview estimates
def estimate_storage_requirements(
daily_active_users: int,
writes_per_user_per_day: int,
reads_per_user_per_day: int,
avg_record_size_bytes: int,
retention_days: int
) -> dict:
"""
Estimates the three scale dimensions for a storage layer.
Returns a dict with read QPS, write QPS, and total storage.
"""
# Write throughput
total_writes_per_day = daily_active_users * writes_per_user_per_day
write_qps = total_writes_per_day / 86_400 # seconds in a day
# Read throughput
total_reads_per_day = daily_active_users * reads_per_user_per_day
read_qps = total_reads_per_day / 86_400
# Storage size
total_records = total_writes_per_day * retention_days
total_storage_gb = (total_records * avg_record_size_bytes) / (1024 ** 3)
return {
"write_qps": round(write_qps, 1),
"read_qps": round(read_qps, 1),
"total_storage_gb": round(total_storage_gb, 2),
"read_write_ratio": round(read_qps / write_qps, 1) if write_qps > 0 else float('inf')
}
# Example: Ride-sharing trip records
result = estimate_storage_requirements(
daily_active_users=1_000_000,
writes_per_user_per_day=2, # ~2 trips per active user per day
reads_per_user_per_day=10, # checking trip history, status, etc.
avg_record_size_bytes=500, # trip record with metadata
retention_days=365 * 3 # 3-year retention
)
print(f"Write QPS: {result['write_qps']}")
print(f"Read QPS: {result['read_qps']}")
print(f"Total Storage: {result['total_storage_gb']} GB")
print(f"Read/Write Ratio: {result['read_write_ratio']}x")
# Output:
# Write QPS: 23.1
# Read QPS: 115.7
# Total Storage: 1019.8 GB (~1 TB)
# Read/Write Ratio: 5.0x
This output tells you immediately: write QPS is modest (no need for exotic write-optimized storage), reads are 5Γ higher (read replicas or caching will help), and 1 TB over three years is manageable but worth partitioning by date for query efficiency.
Visual Decision Tree: Your Map to the Child Topics
The three specialized lessons that follow this one each deepen one pillar of what you've built here. Here is how they connect:
              ┌─────────────────────────────┐
              │    STORAGE DECISION TREE    │
              └──────────────┬──────────────┘
                             │
               Ask: What is my primary need?
                             │
     ┌───────────────────────┼───────────────────────┐
     │                       │                       │
┌────┴─────────┐     ┌───────┴───────┐     ┌─────────┴───────┐
│ Structured   │     │  Flexible /   │     │    Sub-10ms     │
│ data with    │     │   Massive     │     │  Read Latency   │
│ ACID txns    │     │     Scale     │     │    Required     │
└────┬─────────┘     └───────┬───────┘     └─────────┬───────┘
     │                       │                       │
┌────┴─────────┐     ┌───────┴───────┐     ┌─────────┴───────┐
│ RELATIONAL   │     │     NoSQL     │     │     CACHING     │
│ DATABASES    │     │   DATABASES   │     │      LAYER      │
│ (Lesson 7)   │     │  (Lesson 8)   │     │   (Lesson 9)    │
└────┬─────────┘     └───────┬───────┘     └─────────┬───────┘
     │                       │                       │
PostgreSQL,           Document, Key-Value,    Redis, Memcached,
MySQL, CockroachDB    Wide-column, Graph      CDN edge caches
     │                       │                       │
Deepens: Indexing,    Deepens: CAP tradeoffs, Deepens: Eviction,
Normalization,        Eventual consistency,   Cache invalidation,
Query optimization    Partition key design    Write-through/back
💡 Remember: This tree is a starting point, not a rulebook. Real systems almost always use multiple storage layers simultaneously. A ride-sharing app will likely use PostgreSQL for financial transactions, Cassandra or DynamoDB for trip history at scale, Redis for driver location caches, and an object store for receipts and photos.
Practice Prompt: Sketch the Storage Layer for a Ride-Sharing App
The best way to consolidate a mental model is to apply it under light pressure. Here is a realistic practice prompt, the kind you might encounter in the first fifteen minutes of a system design interview.
Prompt: "Design the storage layer for a ride-sharing app like Uber or Lyft. The system handles 1 million daily active riders, 500,000 active drivers, real-time location updates every 4 seconds per driver, trip booking and payment processing, and trip history queries."
Work through the three questions first:
Access patterns: Point lookups (get driver by ID), geospatial queries (find drivers near a coordinate), time-range scans (user's last 10 trips), and transactional writes (create trip + charge payment atomically).
Consistency requirements: Payment and trip creation need ACID guarantees; a charge must not succeed if the trip creation fails, and vice versa. Driver location can tolerate a few seconds of staleness. Trip history reads can be eventually consistent.
Scale dimensions: 500,000 drivers × one location update every 4 seconds = ~125,000 writes per second to the location store. That is a serious write bottleneck that rules out a naive relational approach for location data. Trip history reads will be frequent but can be served from replicas.
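That 125,000 writes/second figure is worth computing out loud in the interview; a two-line check keeps the arithmetic honest. (The PostgreSQL comparison in the comment is a rough order-of-magnitude assumption, not a benchmark.)

```python
# Sanity-check the location-write bottleneck cited above.
active_drivers = 500_000
update_interval_s = 4  # one location ping per driver every 4 seconds

location_write_qps = active_drivers / update_interval_s
print(f"{location_write_qps:,.0f} location writes/second")  # 125,000

# Rough assumption: a single well-tuned PostgreSQL node sustains on the
# order of thousands to low tens of thousands of writes/second, so this
# path calls for a write-optimized store (Redis, Cassandra, etc.).
```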
Here is what a well-structured storage sketch looks like, expressed as configuration pseudocode that mirrors how you'd explain it verbally:
# Ride-sharing storage architecture sketch
# Use this structure to organize your verbal explanation in an interview
storage_layers:
transactional_core:
technology: PostgreSQL # ACID guarantees for payments + trip lifecycle
use_for:
- user accounts and payment methods
- trip records (created, accepted, completed, cancelled states)
- driver profiles and documents
replication:
type: synchronous_primary_replica
replicas: 2
reason: "payment data cannot tolerate data loss"
partitioning:
strategy: none_initially # ~1M trips/day * 500B = ~500GB/year, manageable
future: partition_by_created_at_month if growth demands
indexes:
- trips(rider_id, created_at DESC) # trip history queries
- trips(driver_id, status) # driver's active trip lookup
- payments(trip_id) # payment join
location_store:
technology: Redis (Sorted Sets with geospatial commands)
use_for:
- real-time driver coordinates (GEOADD, GEORADIUS)
- driver availability status
write_qps: 125_000 # 500k drivers / 4 second intervals
ttl_seconds: 30 # expire stale locations automatically
persistence: RDB_snapshot_only # acceptable to lose a few seconds of location
reason: "Redis GEORADIUS gives O(N+log M) geospatial queries at sub-millisecond latency"
trip_history:
technology: Cassandra or DynamoDB
use_for:
- long-term trip history for analytics and user-facing history tab
partition_key: rider_id
    clustering_key: trip_id DESC # most recent first (assumes a time-ordered ID such as a TIMEUUID)
replication_factor: 3
consistency_level: LOCAL_QUORUM # strong enough without cross-region penalty
reason: "write-heavy append pattern, horizontal scale, no complex joins needed"
object_storage:
technology: S3 or GCS
use_for:
- driver documents (license photos, vehicle photos)
- trip receipts as PDFs
- audio recordings (safety feature)
reason: "unstructured blobs; cheap at scale; CDN-friendly"
caching:
technology: Redis (separate cluster from location store)
use_for:
- driver profile lookups (avoid DB hits on every trip match)
- surge pricing multipliers by zone (updated every 30s)
ttl: 60s for profiles, 30s for surge pricing
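The caching entries above describe cache-aside lookups with short TTLs. Here is a minimal sketch of that pattern, with an in-memory dict standing in for Redis and a stub standing in for the database; all names and values are illustrative, not part of the sketch's required stack.

```python
import time

# Cache-aside with TTL: check the cache, fall back to the database on a
# miss, then populate the cache so subsequent reads skip the DB.

PROFILE_TTL_S = 60  # mirrors the 60s profile TTL in the sketch above

_cache: dict[str, tuple[float, dict]] = {}  # key -> (expires_at, value)
db_hits = 0

def fetch_profile_from_db(driver_id: str) -> dict:
    """Stub for the real database read; counts round-trips."""
    global db_hits
    db_hits += 1
    return {"driver_id": driver_id, "rating": 4.9}

def get_driver_profile(driver_id: str) -> dict:
    key = f"driver:profile:{driver_id}"
    now = time.monotonic()
    entry = _cache.get(key)
    if entry is not None and entry[0] > now:      # cache hit, not expired
        return entry[1]
    profile = fetch_profile_from_db(driver_id)    # miss: go to the DB
    _cache[key] = (now + PROFILE_TTL_S, profile)  # repopulate with TTL
    return profile

get_driver_profile("d-1")  # miss -> one DB round-trip
get_driver_profile("d-1")  # hit  -> served from cache
print(db_hits)             # 1
```

The short TTL is the safety valve: even with no explicit invalidation, a stale profile can only be served for at most 60 seconds.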
💡 Real-World Example: Uber's actual architecture evolved to use a combination of MySQL (later migrated to a sharded setup, then partly to Docstore, their internal document store), Redis for location data, and Kafka as the nervous system between components. Their 2016 blog post on migrating from Postgres to MySQL (and the lessons learned) is required reading and illustrates exactly why storage decisions that seem fine at 100k users break at 10 million.
⚠️ Common Mistake: Putting driver location updates into PostgreSQL. At 125,000 writes per second, a standard Postgres setup would immediately saturate. Candidates who jump to "I'll use a relational database for everything" without calculating write throughput reliably fail this part of the question. ⚠️
What You Now Understand That You Didn't Before
It's worth making explicit the cognitive shift this lesson was designed to produce. Before this lesson, storage decisions in a system design interview might have felt like trivia: know the names of some databases, pick one that sounds right. After this lesson, storage decisions are an engineering process.
Here is the transformation in concrete terms:
| ❌ Before This Lesson | ✅ After This Lesson |
|---|---|
| "I'll use MySQL because I know MySQL" | "My access patterns and consistency requirements point toward a relational database; MySQL or PostgreSQL fits here" |
| Vague awareness that "NoSQL is faster" | Understanding that NoSQL trades specific guarantees (joins, ACID) for specific gains (horizontal writes, flexible schema) |
| Treating storage as a single layer | Designing multi-layer architectures: transactional DB + cache + object store + analytics store |
| Estimating storage with a hand-wave | Structured capacity math: users × events × size × retention = concrete numbers |
| Naming databases without justification | Anchoring every choice in access patterns, consistency needs, and scale dimensions |
| Avoiding CAP Theorem | Using CAP Theorem to articulate trade-offs confidently and specifically |
🎯 Key Principle: The storage layer is not a single technology choice. It is a layered architecture where each layer is chosen because its properties match the requirements of a specific subset of your data and access patterns. An interview answer that describes one database for everything should always raise your own internal flag.
Recommended Next Steps: How the Child Topics Deepen Each Pillar
Each of the three specialized lessons that follow this one takes one pillar of what you've built and makes it production-depth. Here is exactly what each one gives you and why you'll need it.
🗄️ Relational Databases (next lesson): This lesson gave you the vocabulary: ACID, normalization, B-tree indexes, WAL. The relational databases lesson turns that vocabulary into technique. You'll learn how to design schemas that survive both correctness requirements and query performance at scale, how indexes interact with query planners, and when to denormalize intentionally. You'll also explore distributed relational databases like CockroachDB and Spanner that bring horizontal scale without sacrificing ACID.
-- Preview of what the relational lesson covers:
-- Understanding why this index choice matters at scale
-- ❌ Problematic: Low-cardinality index on a write-heavy table
CREATE INDEX idx_trips_status ON trips(status);
-- 'status' has ~5 values, so the planner may ignore this index entirely
-- and it adds write overhead on every trip state change
-- ✅ Better: Composite index matching the actual query pattern
CREATE INDEX idx_trips_rider_created ON trips(rider_id, created_at DESC);
-- Covers "get last N trips for rider X" with index-only scans
-- High cardinality on rider_id; DESC on created_at matches ORDER BY
📦 NoSQL Databases (following lesson): This lesson gave you the CAP theorem and the concept of eventual consistency. The NoSQL lesson gives you the mechanics: how Cassandra's gossip protocol achieves decentralized coordination, how DynamoDB's partition key design determines whether you get uniform throughput or hot partition nightmares, how MongoDB's aggregation pipeline compares to SQL joins, and when graph databases like Neo4j unlock query patterns that would require dozens of self-joins in SQL. You'll come out able to design a Cassandra schema from scratch given a set of access patterns.
⚡ Caching (lesson after that): This lesson introduced caching as a layer for sub-millisecond reads. The caching lesson makes you dangerous with it: cache-aside vs. write-through vs. write-back patterns, cache invalidation strategies (the hardest problem in distributed systems), TTL design, cache stampede prevention with probabilistic early expiration, and how CDNs function as geographically distributed caches. You'll be able to explain why a 5% cache miss rate on a system with 1 million QPS is a very different problem than a 5% miss rate on 1,000 QPS.
💡 Pro Tip: As you move through the child topics, keep returning to the three-question framework from this lesson. Every new technology you learn should be filed mentally under: "this is the right answer when my access patterns look like X, my consistency requirement is Y, and my scale dimension is Z." Build a lookup table in your head, not a list of database names.
⚠️ Final Critical Point to Remember: System design interviews are not tests of knowledge; they are tests of reasoning. An interviewer who hears you say "I'd use Cassandra because it's write-optimized and I've calculated 125k writes/second on this path" is hearing an engineer. An interviewer who hears "I'd use Cassandra because it scales well" is hearing a candidate who read a blog post. The concepts in this lesson are the scaffolding for that reasoning. The next three lessons are the tools. The practice prompt above, and others like it, is how the reasoning becomes instinct. ⚠️
🤔 Did you know? The term "database" comes from the 1960s concept of a "data base": a shared collection of operational data used by an organization's application systems. The first commercial relational databases appeared in the late 1970s. The NoSQL movement as a named phenomenon is only about 15 years old, yet it already encompasses more distinct data models than relational databases do. The storage landscape will keep evolving, but the decision framework you now have will outlast any specific technology.
You've completed the foundational storage layer of this system design curriculum. The vocabulary is sharp, the decision framework is in hand, and the practice prompt has given you a concrete anchor. The next three lessons go deep; pick the one whose gap you feel most acutely, or work through them in order to build the full picture.