What Incidents Reveal
Reading the architectural flaws exposed by failures
This lesson covers post-incident insights, architectural patterns derived from failures, and transforming operational scars into resilient system design: essential skills for engineers working under pressure.
Welcome
Every production incident is a gift wrapped in stress and panic. While the immediate goal during an outage is restoration, the real value emerges afterward: incidents reveal truths about your system that no amount of theoretical design or testing can expose. They show you where your mental models diverge from reality, where your assumptions were wrong, and, most importantly, where your architecture needs fundamental rethinking.
This lesson explores how to extract maximum learning from incidents and translate those painful lessons into architectural improvements that prevent entire classes of future failures.
The Hidden Curriculum of Failures
Incidents teach lessons that can't be learned any other way. They expose the difference between how we think systems work and how they actually work under stress.
What Production Reveals That Testing Cannot
Real Load Patterns: Testing environments rarely capture actual user behavior. Incidents often reveal:
- Traffic patterns you never anticipated (e.g., retry storms, thundering herds)
- Data distributions that break assumptions (null values, extreme outliers, Unicode edge cases)
- Timing dependencies hidden in normal conditions but critical under load
- Emergent behaviors from component interactions
Operational Realities: Your system exists in a context broader than its code:
- How teams actually communicate during crises (not how the runbook says they should)
- Which monitoring gaps leave you blind at critical moments
- Where tribal knowledge lives and what happens when that person is unavailable
- How deployment processes fail in ways you didn't design for
System Boundaries: Incidents illuminate the edges of your system:
- Which dependencies you didn't know were critical
- Where timeouts are too generous or too strict
- How failures cascade across service boundaries
- Which rate limits actually protect you (and which are theatrical)
The Incident Forensics Mindset
Effective incident analysis requires a specific approach:
Blameless Curiosity: The goal is understanding, not punishment. Questions should be:
- "How did the system make this failure possible?"
- "What signals did we miss and why?"
- "What would have prevented this?"
Systems Thinking: Single root causes are myths. Look for:
- Contributing factors (usually 3-7 independent things that aligned)
- Normal operations that became pathological under stress
- Missing feedback loops that would have provided early warning
Evidence Over Narrative: Humans are storytelling machines, but our initial explanations are often wrong:
- Collect logs, metrics, and traces before forming hypotheses
- Look for disconfirming evidence to challenge your theories
- Distinguish between "what happened" and "why it happened"
Pro Tip: Create a timeline first, add theories second. The sequence of events often reveals causation that theory alone would miss.
Architectural Insights from Common Incident Patterns
Certain incident types recur across organizations. Each pattern suggests specific architectural interventions.
Pattern 1: The Cascading Failure
What It Looks Like:
┌───────────┐       ┌───────────┐       ┌───────────┐
│ Service A │───X───│ Service B │───X───│ Service C │
└───────────┘       └───────────┘       └───────────┘
      │                   │                   │
  Backing up          Slow/Down           Timing out
(no backpressure)                       (retry storms)
Service B slows down. Service A has no backpressure mechanism, so it queues requests, exhausting connection pools and memory. Service C's retries amplify the problem. The failure cascades.
What It Reveals:
- Missing circuit breakers between services
- Lack of proper timeout hierarchies
- No load shedding or admission control
- Retry logic that makes problems worse
Architectural Fix:
// Before: naive retry
async function callServiceB() {
  let attempts = 0;
  while (attempts < 5) {
    try {
      return await fetch('/service-b/endpoint');
    } catch (err) {
      attempts++;
      await sleep(1000); // Always retry
    }
  }
}
// After: circuit breaker (exponential backoff is added at the call site below)
class CircuitBreaker {
  constructor(threshold = 5, timeout = 60000) {
    this.failures = 0;
    this.threshold = threshold;
    this.timeout = timeout;
    this.state = 'CLOSED';
    this.openedAt = null;
  }

  async call(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt > this.timeout) {
        this.state = 'HALF_OPEN'; // Let a trial request through
      } else {
        throw new Error('Circuit breaker is OPEN');
      }
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }

  onSuccess() {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failures++;
    if (this.failures >= this.threshold) {
      this.state = 'OPEN';
      this.openedAt = Date.now();
    }
  }
}
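Putting the two together, here is a rough usage sketch (the sleep helper, the three-attempt limit, and the endpoint are illustrative choices, not part of the pattern): the caller retries with exponential backoff, while the breaker fails fast once it has opened instead of piling more work onto a struggling dependency.

// Usage sketch: circuit breaker at the call site, exponential backoff between attempts
const serviceBBreaker = new CircuitBreaker(5, 60000);
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function callServiceBSafely() {
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      // While the breaker is OPEN this throws immediately, so retries stop hammering Service B
      return await serviceBBreaker.call(() => fetch('/service-b/endpoint'));
    } catch (err) {
      if (attempt === 2) throw err;
      await sleep(1000 * 2 ** attempt); // 1s, then 2s: exponential backoff instead of a fixed delay
    }
  }
}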
Pattern 2: The Silent Data Corruption
What It Looks Like:
- Bad data enters the system through an edge case
- Validation exists but has a blind spot
- Data propagates to downstream systems
- Discovered only when reports look wrong weeks later
What It Reveals:
- Input validation is incomplete
- No data quality monitoring
- Missing invariant checks in critical paths
- Type systems aren't being fully leveraged
Architectural Fix:
# Before: partial validation
def process_transaction(amount, currency):
    if amount > 0:
        return save_transaction(amount, currency)

# After: parse, don't validate
from decimal import Decimal, InvalidOperation
from typing import Literal
from dataclasses import dataclass

Currency = Literal['USD', 'EUR', 'GBP']

@dataclass(frozen=True)
class Money:
    amount: Decimal
    currency: Currency

    def __post_init__(self):
        if self.amount <= 0:
            raise ValueError(f"Amount must be positive: {self.amount}")
        if self.amount.as_tuple().exponent < -2:
            raise ValueError(f"Too many decimal places: {self.amount}")

def process_transaction(money: Money):
    # If we get here, data is valid by construction
    return save_transaction(money)

# Usage
try:
    money = Money(Decimal('100.50'), 'USD')
    process_transaction(money)
except (ValueError, InvalidOperation) as e:
    log_validation_error(e)
Pattern 3: The Resource Exhaustion
What It Looks Like:
Memory/Connections/File Descriptors
 ^
 │                                ╱───── OOM Kill
 │                          ╱────╱
 │                    ╱────╱
 │              ╱────╱
 │        ╱────╱
 │  ╱────╱
 └────────────────────────────────────── Time
    Normal         Slow Leak         Critical
What It Reveals:
- Resource leaks (connections not closed, event listeners not removed)
- Missing backpressure in streaming/queue systems
- No bounds on in-memory caches or buffers
- Unbounded growth in data structures
Architectural Fix:
// Before: unbounded cache
use std::collections::HashMap;

struct Cache {
    data: HashMap<String, Vec<u8>>
}

impl Cache {
    fn get_or_fetch(&mut self, key: &str) -> Vec<u8> {
        if let Some(value) = self.data.get(key) {
            return value.clone();
        }
        let value = expensive_fetch(key);
        self.data.insert(key.to_string(), value.clone());
        value // Memory grows forever!
    }
}
// After: LRU cache with size limit
use lru::LruCache;
use std::num::NonZeroUsize;

struct BoundedCache {
    data: LruCache<String, Vec<u8>>,
    max_bytes: usize,
    current_bytes: usize,
}

impl BoundedCache {
    fn new(capacity: usize, max_bytes: usize) -> Self {
        BoundedCache {
            data: LruCache::new(NonZeroUsize::new(capacity).unwrap()),
            max_bytes,
            current_bytes: 0,
        }
    }

    fn get_or_fetch(&mut self, key: &str) -> Option<Vec<u8>> {
        if let Some(value) = self.data.get(key) {
            return Some(value.clone());
        }
        let value = expensive_fetch(key);
        let size = value.len();
        if size > self.max_bytes {
            return Some(value); // Don't cache huge items
        }
        // Evict until we have space
        while self.current_bytes + size > self.max_bytes {
            if let Some((_, evicted)) = self.data.pop_lru() {
                self.current_bytes -= evicted.len();
            } else {
                break;
            }
        }
        self.current_bytes += size;
        // push reports any entry evicted by the capacity limit, so its bytes are reclaimed too
        if let Some((_, evicted)) = self.data.push(key.to_string(), value.clone()) {
            self.current_bytes -= evicted.len();
        }
        Some(value)
    }
}
From Incident to Architecture: The Translation Process
Step 1: Identify the Vulnerability Class
Don't just fix the specific bug. Ask: "What class of problems does this represent?"
Example: Database connection leak in the checkout service.
- Specific fix: Close the connection in that code path
- Class fix: Implement connection pooling with max lifetime
- Architecture fix: All services use a resource manager that guarantees cleanup
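As a minimal sketch of that architecture-level fix (the ConnectionPool interface and withConnection helper are illustrative, not a specific library), the point is that cleanup lives in one place and callers cannot forget it:

// Sketch: a resource manager that guarantees cleanup, even when the work throws
interface Connection { query(sql: string, params?: unknown[]): Promise<unknown>; }
interface ConnectionPool {
  acquire(): Promise<Connection>;
  release(conn: Connection): Promise<void>;
}

async function withConnection<T>(pool: ConnectionPool, work: (conn: Connection) => Promise<T>): Promise<T> {
  const conn = await pool.acquire();
  try {
    return await work(conn);
  } finally {
    await pool.release(conn); // Guaranteed cleanup: the class of "forgot to close" bugs disappears
  }
}

// Usage sketch: the checkout path can no longer leak a connection
// const rows = await withConnection(pool, (conn) => conn.query('SELECT ...'));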
Step 2: Generalize the Solution
| Incident Type | Specific Cause | Architectural Pattern |
|---|---|---|
| Cascading Failure | Service A overwhelmed Service B | Bulkheads, circuit breakers, load shedding |
| Data Loss | Cache-aside pattern inconsistency | Write-through cache, event sourcing, CDC |
| Race Condition | Concurrent access to shared state | Immutable data, CRDTs, idempotency keys |
| Performance Cliff | O(n²) algorithm hit production scale | Pagination, streaming, job queues |
| Partial Failure | Timeout left system in inconsistent state | Saga pattern, two-phase commit, eventual consistency |
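Most of these patterns appear in code elsewhere in this lesson; bulkheads are the exception, so here is a minimal sketch (the Bulkhead class and its limit are illustrative): a per-dependency concurrency cap so one slow dependency cannot consume every worker or connection.

// Sketch: a simple bulkhead that caps concurrent calls to one dependency
class Bulkhead {
  private inFlight = 0;
  constructor(private readonly maxConcurrent: number) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.inFlight >= this.maxConcurrent) {
      // Shed load instead of queueing unboundedly; callers can fall back or fail fast
      throw new Error('Bulkhead full');
    }
    this.inFlight++;
    try {
      return await task();
    } finally {
      this.inFlight--;
    }
  }
}

// Usage sketch: the slow reporting dependency saturates only its own compartment
// const reportsBulkhead = new Bulkhead(10);
// await reportsBulkhead.run(() => fetch('/reports/slow-endpoint'));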
Step 3: Design for Observability
Every architectural change should answer: "How will we know if this is working?"
// Before: no visibility into circuit breaker behavior
func (cb *CircuitBreaker) Call(fn func() error) error {
    if cb.state == Open {
        return ErrCircuitOpen
    }
    return fn()
}
// After: observable circuit breaker
import "github.com/prometheus/client_golang/prometheus"

var (
    circuitBreakerState = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "circuit_breaker_state",
            Help: "Current state of circuit breakers (0=closed, 1=half-open, 2=open)",
        },
        []string{"service", "endpoint"},
    )
    circuitBreakerTransitions = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "circuit_breaker_transitions_total",
            Help: "Number of circuit breaker state transitions",
        },
        []string{"service", "endpoint", "from_state", "to_state"},
    )
)

func init() {
    // Metrics must be registered before they can be scraped
    prometheus.MustRegister(circuitBreakerState, circuitBreakerTransitions)
}

func (cb *CircuitBreaker) Call(fn func() error) error {
    if cb.state == Open {
        circuitBreakerState.WithLabelValues(cb.service, cb.endpoint).Set(2)
        return ErrCircuitOpen
    }
    err := fn()
    if err != nil {
        cb.recordFailure()
    } else {
        cb.recordSuccess()
    }
    return err
}

func (cb *CircuitBreaker) transition(from, to State) {
    circuitBreakerTransitions.WithLabelValues(
        cb.service,
        cb.endpoint,
        from.String(),
        to.String(),
    ).Inc()
    cb.state = to
    circuitBreakerState.WithLabelValues(cb.service, cb.endpoint).Set(float64(to))
}
Metrics to Add: For every architectural pattern you introduce, emit metrics about its behavior: circuit breaker state changes, cache hit ratios, queue depths, retry counts, timeout frequencies.
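For instance, a cache hit ratio falls out of two counters. A rough sketch using the Node prom-client package (the metric names and the in-memory cache wiring are illustrative):

// Sketch: counters that a dashboard can turn into a cache hit ratio
import { Counter } from 'prom-client';

const cacheHits = new Counter({ name: 'cache_hits_total', help: 'Lookups served from cache' });
const cacheMisses = new Counter({ name: 'cache_misses_total', help: 'Lookups that fell through to the source' });

async function getOrFetch<T>(cache: Map<string, T>, key: string, fetcher: () => Promise<T>): Promise<T> {
  const cached = cache.get(key);
  if (cached !== undefined) {
    cacheHits.inc();
    return cached;
  }
  cacheMisses.inc();
  const value = await fetcher();
  cache.set(key, value);
  return value;
}
// Hit ratio: rate(cache_hits_total) / (rate(cache_hits_total) + rate(cache_misses_total))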
Step 4: Test the Architectural Change
Architectural patterns need architectural tests:
# Test circuit breaker behavior under failure
import time

import pytest
from circuit_breaker import CircuitBreaker

# MockService, ServiceError, and CircuitOpenError are assumed to be defined in the test helpers

@pytest.mark.asyncio
async def test_circuit_breaker_opens_after_threshold():
    failing_service = MockService(failure_rate=1.0)
    cb = CircuitBreaker(threshold=3, timeout=1000)

    # First 3 failures should go through
    for i in range(3):
        with pytest.raises(ServiceError):
            await cb.call(failing_service.request)

    # Circuit should now be OPEN
    assert cb.state == "OPEN"

    # Subsequent calls should fail immediately
    start = time.time()
    with pytest.raises(CircuitOpenError):
        await cb.call(failing_service.request)
    duration = time.time() - start

    # Should fail fast (< 100ms, not wait for timeout)
    assert duration < 0.1

@pytest.mark.asyncio
async def test_circuit_breaker_recovers_after_timeout():
    flaky_service = MockService(failure_rate=0.0)  # Now working
    cb = CircuitBreaker(threshold=3, timeout=100)

    # Force circuit open
    cb.state = "OPEN"
    cb.opened_at = time.time() - 0.150  # 150ms ago

    # Should attempt request (HALF_OPEN)
    result = await cb.call(flaky_service.request)
    assert result == "success"
    assert cb.state == "CLOSED"
Real-World Example: Turning a Payment Failure into Idempotency
The Incident
A customer was charged twice for the same order. Investigation revealed:
- User clicked "Submit Payment" once
- Request timed out from client's perspective (slow network)
- Client retried the request
- Both requests succeeded (the first was just slow, not failed)
- Two charges, one order
The Insight
This wasn't a bug; it was a missing architectural guarantee. The system assumed requests wouldn't be retried, but networks are unreliable. Any request might be duplicated.
The Architectural Solution: Idempotency Keys
// Before: vulnerable to duplicate requests
app.post('/api/payments', async (req, res) => {
  const { userId, amount, cardToken } = req.body;

  const charge = await stripe.charges.create({
    amount,
    currency: 'usd',
    source: cardToken
  });

  await db.orders.create({
    userId,
    amount,
    chargeId: charge.id
  });

  res.json({ success: true, chargeId: charge.id });
});
// After: idempotent with deduplication
interface PaymentRequest {
  userId: string;
  amount: number;
  cardToken: string;
  idempotencyKey: string; // Client-generated UUID
}

app.post('/api/payments', async (req, res) => {
  const { userId, amount, cardToken, idempotencyKey } = req.body;

  if (!idempotencyKey) {
    return res.status(400).json({ error: 'idempotencyKey required' });
  }

  // Check if we've already processed this exact request
  // (a unique index on (key, userId) keeps concurrent duplicates from both passing this check)
  const existing = await db.idempotentRequests.findOne({
    key: idempotencyKey,
    userId
  });

  if (existing) {
    // Return the cached response
    return res.status(existing.statusCode).json(existing.response);
  }

  // Process the request
  try {
    const charge = await stripe.charges.create(
      {
        amount,
        currency: 'usd',
        source: cardToken
      },
      {
        idempotencyKey // Stripe also supports idempotency keys!
      }
    );

    const order = await db.orders.create({
      userId,
      amount,
      chargeId: charge.id
    });

    const response = { success: true, chargeId: charge.id, orderId: order.id };

    // Cache the response
    await db.idempotentRequests.create({
      key: idempotencyKey,
      userId,
      statusCode: 200,
      response,
      expiresAt: Date.now() + 24 * 60 * 60 * 1000 // 24 hours
    });

    res.json(response);
  } catch (err) {
    // Cache failures too (for a shorter time)
    await db.idempotentRequests.create({
      key: idempotencyKey,
      userId,
      statusCode: 500,
      response: { error: 'Payment failed' },
      expiresAt: Date.now() + 60 * 60 * 1000 // 1 hour
    });
    res.status(500).json({ error: 'Payment failed' });
  }
});
Client-side implementation:
import { v4 as uuidv4 } from 'uuid';

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function submitPayment(userId, amount, cardToken) {
  const idempotencyKey = uuidv4();
  let attempts = 0;

  while (attempts < 3) {
    try {
      const response = await fetch('/api/payments', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          userId,
          amount,
          cardToken,
          idempotencyKey // Same key for all retries
        })
      });
      if (response.ok) {
        return await response.json();
      }
      throw new Error(`HTTP ${response.status}`);
    } catch (err) {
      attempts++;
      if (attempts >= 3) throw err;
      await sleep(1000 * 2 ** (attempts - 1)); // Exponential backoff: 1s, then 2s
    }
  }
}
Key Architectural Decisions:
- Client generates the key: Ensures retries use the same key
- Cache both successes and failures: Prevents retry storms
- Time-limited cache: Balance between correctness and storage
- Include userId in uniqueness: Prevent cross-user collisions
- Pass idempotency downstream: Stripe and other APIs support this pattern
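One gap worth noting in the handler above: two concurrent retries can both pass the findOne check before either has cached a response. A common remedy, sketched here with an illustrative PENDING status field and a placeholder duplicate-key error code (the real code depends on your database), is to reserve the key with an insert backed by a unique index on (key, userId); the loser of the race reads back the winner's result. Passing the same key to Stripe already provides a second line of defense against double charges.

// Sketch: reserving the idempotency key up front so concurrent duplicates cannot both proceed
async function reserveIdempotencyKey(key, userId) {
  try {
    // Relies on a unique index over { key, userId }; the second concurrent insert fails
    await db.idempotentRequests.create({ key, userId, status: 'PENDING' });
    return true; // We own this key: safe to charge
  } catch (err) {
    if (err.code === 'DUPLICATE_KEY') { // Placeholder: use your database's duplicate-key error
      return false; // Another request with this key is in flight or already finished
    }
    throw err;
  }
}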
Common Mistakes in Post-Incident Analysis
Mistake 1: Fixing Symptoms Instead of Causes
- Symptom: "The database ran out of connections."
- Bad fix: Increase connection pool size to 500.
- Root cause: Services don't properly close connections; no timeout on long-running queries.
- Good fix: Implement connection lifecycle management, query timeouts, and monitoring for connection leaks.
Mistake 2: Adding Complexity Without Removing It
Every incident response adds something (monitoring, safeguards, redundancy). Few remove the complexity that made the incident possible in the first place.
# Adding complexity (circuit breaker) without simplifying
class ServiceClient:
    def __init__(self):
        self.circuit_breaker = CircuitBreaker()
        self.retry_policy = ExponentialRetry()
        self.fallback = FallbackCache()
        self.load_balancer = RoundRobin()
        # Now you have 5 things that can fail...
Better approach: Can you eliminate the dependency entirely? Can you make it asynchronous? Can you simplify the interaction model?
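To make "make it asynchronous" concrete, here is a minimal sketch (the Queue interface and topic name are illustrative): instead of calling the downstream service inline and inheriting its failures, the request path only records an event, and a worker processes it on its own schedule.

// Sketch: replacing a synchronous downstream call with an enqueued event
interface Queue { publish(topic: string, payload: unknown): Promise<void>; }

async function handleOrderCreated(queue: Queue, order: { id: string; userId: string }) {
  // The request path no longer depends on the notification service being up;
  // a consumer of 'order-created' events can retry independently of user-facing latency.
  await queue.publish('order-created', { orderId: order.id, userId: order.userId });
}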
Mistake 3: Over-Generalizing from Single Incidents
One timeout doesn't mean every service needs a 500ms timeout. One cache stampede doesn't mean you need distributed locking everywhere.
Good heuristic: Implement targeted solutions first, generalize only after seeing the pattern 2-3 times.
Mistake 4: Ignoring the Human System
Technical fixes address only half the problem. Consider:
- Communication: How did teams coordinate? What slowed them down?
- Documentation: What knowledge was missing? Where was the runbook wrong?
- Authority: Who could make decisions? Were there escalation bottlenecks?
- Tooling: What capabilities did responders lack?
Mistake 5: Not Testing the Countermeasures
Bad: Add a circuit breaker, assume it works. Good: Add a circuit breaker, write tests, then deliberately cause failures in staging to verify it opens and recovers correctly.
Try this: Schedule monthly "game days" where you intentionally trigger failure modes to test your defenses. Chaos engineering isn't optional for critical systems.
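A lightweight way to start, sketched here under assumed names (the CHAOS_ENABLED flag, failure rate, and wrapper are illustrative, not a chaos-engineering framework), is a fault-injecting wrapper that you enable only during staging game days:

// Sketch: a fault-injecting fetch wrapper for staging game days
const CHAOS_ENABLED = process.env.CHAOS_ENABLED === 'true'; // Set only in staging
const FAILURE_RATE = 0.2;

async function chaoticFetch(url: string, init?: RequestInit): Promise<Response> {
  if (CHAOS_ENABLED && Math.random() < FAILURE_RATE) {
    // Simulate the dependency being down so circuit breakers, retries, and fallbacks get exercised
    throw new Error(`Injected failure calling ${url}`);
  }
  return fetch(url, init);
}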
Key Takeaways
Incident Insights Quick Reference
| Topic | Key Point |
|---|---|
| Core Principle | Incidents reveal what testing cannot: real-world complexity, emergent behavior, operational reality |
| Analysis Goal | Identify vulnerability classes, not just specific bugs |
| Cascading Failures | Need circuit breakers, backpressure, load shedding, timeout hierarchies |
| Data Corruption | Parse, don't validate; use types to make illegal states unrepresentable |
| Resource Exhaustion | Bound all caches, pools, queues; implement proper cleanup |
| Idempotency | Critical for retry safety; client generates keys, server caches results |
| Observability | Every architectural pattern needs metrics showing it's working |
| Common Mistake | Fixing symptoms (increase limits) instead of causes (fix leaks) |
| Testing | Verify countermeasures work through chaos engineering and game days |
Further Study
- Release It! (Michael T. Nygard): Comprehensive patterns for production-ready systems, including circuit breakers, bulkheads, and more: https://pragprog.com/titles/mnee2/release-it-second-edition/
- Site Reliability Engineering (Google): Google's approach to incident management, postmortems, and learning from failures: https://sre.google/sre-book/postmortem-culture/
- Resilience Engineering Papers (Sidney Dekker et al.): Academic perspective on how complex systems fail and how humans adapt: https://www.resilienceengineering.org/
Remember: The best architects are those who've debugged the most disasters. Every incident is an opportunity to build systems that are not just functional, but antifragile: systems that get stronger when stressed. Your scars become your architecture.