
What Incidents Reveal

Reading the architectural flaws exposed by failures

This lesson covers post-incident insights, architectural patterns derived from failures, and transforming operational scars into resilient system design: essential skills for engineers working under pressure.

Welcome 🚨

Every production incident is a gift wrapped in stress and panic. While the immediate goal during an outage is restoration, the real value emerges afterward: incidents reveal truths about your system that no amount of theoretical design or testing can expose. They show you where your mental models diverge from reality, where your assumptions were wrong and, most importantly, where your architecture needs fundamental rethinking.

This lesson explores how to extract maximum learning from incidents and translate those painful lessons into architectural improvements that prevent entire classes of future failures.

The Hidden Curriculum of Failures 📚

Incidents teach lessons that can't be learned any other way. They expose the difference between how we think systems work and how they actually work under stress.

What Production Reveals That Testing Cannot

Real Load Patterns: Testing environments rarely capture actual user behavior. Incidents often reveal:

  • Traffic patterns you never anticipated (e.g., retry storms, thundering herds)
  • Data distributions that break assumptions (null values, extreme outliers, Unicode edge cases)
  • Timing dependencies hidden in normal conditions but critical under load
  • Emergent behaviors from component interactions

Operational Realities: Your system exists in a context broader than its code:

  • How teams actually communicate during crises (not how the runbook says they should)
  • Which monitoring gaps leave you blind at critical moments
  • Where tribal knowledge lives and what happens when that person is unavailable
  • How deployment processes fail in ways you didn't design for

System Boundaries: Incidents illuminate the edges of your system:

  • Which dependencies you didn't know were critical
  • Where timeouts are too generous or too strict
  • How failures cascade across service boundaries
  • Which rate limits actually protect you (and which are theatrical)

The Incident Forensics Mindset 🔍

Effective incident analysis requires a specific approach:

Blameless Curiosity: The goal is understanding, not punishment. Questions should be:

  • "How did the system make this failure possible?"
  • "What signals did we miss and why?"
  • "What would have prevented this?"

Systems Thinking: Single root causes are myths. Look for:

  • Contributing factors (usually 3-7 independent things that aligned)
  • Normal operations that became pathological under stress
  • Missing feedback loops that would have provided early warning

Evidence Over Narrative: Humans are storytelling machines, but our initial explanations are often wrong:

  • Collect logs, metrics, and traces before forming hypotheses
  • Look for disconfirming evidence to challenge your theories
  • Distinguish between "what happened" and "why it happened"

💡 Pro Tip: Create a timeline first, add theories second. The sequence of events often reveals causation that theory alone would miss.

Architectural Insights from Common Incident Patterns 🏗️

Certain incident types recur across organizations. Each pattern suggests specific architectural interventions.

Pattern 1: The Cascading Failure

What It Looks Like:

┌──────────┐        ┌──────────┐        ┌──────────┐
│ Service  │───X───→│ Service  │───X───→│ Service  │
│    A     │        │    B     │        │    C     │
└──────────┘        └──────────┘        └──────────┘
     ↓                   ↓                   ↓
  Slow/Down          Backing up          Timing out
                    (no backpressure)    (retry storms)

Service B slows down. Service A has no backpressure mechanism, so it queues requests, exhausting connection pools and memory. Service C's retries amplify the problem. The failure cascades.

What It Reveals:

  • Missing circuit breakers between services
  • Lack of proper timeout hierarchies
  • No load shedding or admission control
  • Retry logic that makes problems worse

Architectural Fix:

// Before: naive retry
async function callServiceB() {
  let attempts = 0;
  while (attempts < 5) {
    try {
      return await fetch('/service-b/endpoint');
    } catch (err) {
      attempts++;
      await sleep(1000); // Fixed 1s delay; retries on every error
    }
  }
  throw new Error('Service B unavailable after 5 attempts');
}

// After: circuit breaker so repeated failures fail fast instead of piling up
class CircuitBreaker {
  constructor(threshold = 5, timeout = 60000) {
    this.failures = 0;
    this.threshold = threshold;
    this.timeout = timeout;
    this.state = 'CLOSED';
    this.openedAt = 0;
  }

  async call(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt > this.timeout) {
        this.state = 'HALF_OPEN';
      } else {
        throw new Error('Circuit breaker is OPEN');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }

  onSuccess() {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failures++;
    if (this.failures >= this.threshold) {
      this.state = 'OPEN';
      this.openedAt = Date.now();
    }
  }
}
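
A sketch of how the two pieces fit together, assuming the CircuitBreaker class above: the caller keeps a bounded retry loop with exponential backoff, while the breaker guards the dependency.

// Sketch: one breaker per downstream dependency, shared across requests;
// retries with exponential backoff live in the caller, not in the breaker.
const serviceBBreaker = new CircuitBreaker(5, 60000);
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function callServiceBGuarded() {
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      return await serviceBBreaker.call(() => fetch('/service-b/endpoint'));
    } catch (err) {
      if (attempt === 2) throw err;        // Give up after three tries
      await sleep(1000 * 2 ** attempt);    // Exponential backoff: 1s, 2s
    }
  }
}

When the breaker is OPEN, each attempt fails immediately instead of holding a connection for the full timeout, which is what stops the cascade.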

Pattern 2: The Silent Data Corruption

What It Looks Like:

  • Bad data enters the system through an edge case
  • Validation exists but has a blind spot
  • Data propagates to downstream systems
  • Discovered only when reports look wrong weeks later

What It Reveals:

  • Input validation is incomplete
  • No data quality monitoring
  • Missing invariant checks in critical paths
  • Type systems aren't being fully leveraged

Architectural Fix:

## Before: partial validation
def process_transaction(amount, currency):
    if amount > 0:
        return save_transaction(amount, currency)

## After: parse, don't validate
from decimal import Decimal, InvalidOperation
from typing import Literal, get_args
from dataclasses import dataclass

Currency = Literal['USD', 'EUR', 'GBP']

@dataclass(frozen=True)
class Money:
    amount: Decimal
    currency: Currency

    def __post_init__(self):
        # Literal is only checked by static analysis; enforce it at runtime too
        if self.currency not in get_args(Currency):
            raise ValueError(f"Unsupported currency: {self.currency}")
        if self.amount <= 0:
            raise ValueError(f"Amount must be positive: {self.amount}")
        if self.amount.as_tuple().exponent < -2:
            raise ValueError(f"Too many decimal places: {self.amount}")

def process_transaction(money: Money):
    # If we get here, data is valid by construction
    return save_transaction(money)

## Usage
try:
    money = Money(Decimal('100.50'), 'USD')
    process_transaction(money)
except (ValueError, InvalidOperation) as e:
    log_validation_error(e)
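
Parsing at the boundary still leaves the "no data quality monitoring" gap above: bad data that slipped in earlier keeps flowing. A minimal sketch of an invariant check that counts violations before a batch propagates downstream (the metric hook and field names are illustrative, not a specific library):

// Sketch: run named invariants over a batch before it moves downstream,
// counting violations so a dashboard or alert can catch silent corruption early.
type Invariant<T> = { name: string; holds: (row: T) => boolean };

function emitMetric(name: string, labels: Record<string, string>): void {
  console.warn(`[metric] ${name}`, labels);  // Stand-in for a real metrics client
}

function checkInvariants<T>(rows: T[], invariants: Invariant<T>[]): number {
  let violations = 0;
  for (const inv of invariants) {
    for (const row of rows) {
      if (!inv.holds(row)) {
        violations++;
        emitMetric('data_quality_violation', { invariant: inv.name });
      }
    }
  }
  return violations;
}

// Usage: checkInvariants(transactions, [
//   { name: 'positive_amount', holds: (t) => t.amount > 0 },
//   { name: 'known_currency', holds: (t) => ['USD', 'EUR', 'GBP'].includes(t.currency) },
// ]);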

Pattern 3: The Resource Exhaustion

What It Looks Like:

Memory/Connections/File Descriptors
  ^
  │                           ╱───── OOM Kill
  │                      ╱────╱
  │                 ╱────╱
  │            ╱────╱
  │       ╱────╱
  │  ╱────╱
  └──────────────────────────────→ Time
   Normal   Slow Leak   Critical

What It Reveals:

  • Resource leaks (connections not closed, event listeners not removed)
  • Missing backpressure in streaming/queue systems
  • No bounds on in-memory caches or buffers
  • Unbounded growth in data structures

Architectural Fix:

// Before: unbounded cache
use std::collections::HashMap;

struct Cache {
    data: HashMap<String, Vec<u8>>
}

impl Cache {
    fn get_or_fetch(&mut self, key: &str) -> Vec<u8> {
        if let Some(value) = self.data.get(key) {
            return value.clone();
        }
        let value = expensive_fetch(key);
        self.data.insert(key.to_string(), value.clone());
        value  // Memory grows forever!
    }
}

// After: LRU cache with size limit
use lru::LruCache;
use std::num::NonZeroUsize;

struct BoundedCache {
    data: LruCache<String, Vec<u8>>,
    max_bytes: usize,
    current_bytes: usize,
}

impl BoundedCache {
    fn new(capacity: usize, max_bytes: usize) -> Self {
        BoundedCache {
            data: LruCache::new(NonZeroUsize::new(capacity).unwrap()),
            max_bytes,
            current_bytes: 0,
        }
    }

    fn get_or_fetch(&mut self, key: &str) -> Option<Vec<u8>> {
        if let Some(value) = self.data.get(key) {
            return Some(value.clone());
        }
        
        let value = expensive_fetch(key);
        let size = value.len();
        
        if size > self.max_bytes {
            return Some(value); // Don't cache huge items
        }
        
        // Evict until we have space
        while self.current_bytes + size > self.max_bytes {
            if let Some((_, evicted)) = self.data.pop_lru() {
                self.current_bytes -= evicted.len();
            } else {
                break;
            }
        }
        
        self.current_bytes += size;
        // `push` returns the entry evicted by the entry-count capacity (if any),
        // so the byte accounting stays in sync with what the cache actually holds
        if let Some((_, evicted)) = self.data.push(key.to_string(), value.clone()) {
            self.current_bytes -= evicted.len();
        }
        Some(value)
    }
}

From Incident to Architecture: The Translation Process 🔄

Step 1: Identify the Vulnerability Class

Don't just fix the specific bug. Ask: "What class of problems does this represent?"

Example: Database connection leak in the checkout service.

  • Specific fix: Close the connection in that code path
  • Class fix: Implement connection pooling with max lifetime
  • Architecture fix: All services use a resource manager that guarantees cleanup
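
As an illustration of the last two levels, a minimal sketch of a resource manager that guarantees release and retires connections past a maximum lifetime (the Conn interface and createConn factory are assumptions, not a specific driver's API):

// Sketch: callers borrow a connection through withConnection, so cleanup
// cannot be forgotten, and connections past maxLifetimeMs are retired.
interface Conn {
  createdAt: number;
  close(): Promise<void>;
  query(sql: string): Promise<unknown>;
}

class ConnectionManager {
  private idle: Conn[] = [];

  constructor(
    private createConn: () => Promise<Conn>,
    private maxLifetimeMs = 30 * 60 * 1000,
  ) {}

  async withConnection<T>(fn: (conn: Conn) => Promise<T>): Promise<T> {
    let conn = this.idle.pop();
    if (conn && Date.now() - conn.createdAt > this.maxLifetimeMs) {
      await conn.close();                  // Retire connections past their max lifetime
      conn = undefined;
    }
    if (!conn) conn = await this.createConn();
    try {
      const result = await fn(conn);
      this.idle.push(conn);                // Healthy connection returns to the pool
      return result;
    } catch (err) {
      await conn.close();                  // Don't reuse a connection in an unknown state
      throw err;
    }
  }
}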

Step 2: Generalize the Solution

| Incident Type | Specific Cause | Architectural Pattern |
|---|---|---|
| 🔥 Cascading Failure | Service A overwhelmed Service B | Bulkheads, circuit breakers, load shedding |
| 💾 Data Loss | Cache-aside pattern inconsistency | Write-through cache, event sourcing, CDC |
| ⚡ Race Condition | Concurrent access to shared state | Immutable data, CRDTs, idempotency keys |
| 🐌 Performance Cliff | O(n²) algorithm hit production scale | Pagination, streaming, job queues |
| 🕳️ Partial Failure | Timeout left system in inconsistent state | Saga pattern, two-phase commit, eventual consistency |

Step 3: Design for Observability

Every architectural change should answer: "How will we know if this is working?"

// Before: no visibility into circuit breaker behavior
func (cb *CircuitBreaker) Call(fn func() error) error {
    if cb.state == Open {
        return ErrCircuitOpen
    }
    return fn()
}

// After: observable circuit breaker
import "github.com/prometheus/client_golang/prometheus"

var (
    circuitBreakerState = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "circuit_breaker_state",
            Help: "Current state of circuit breakers (0=closed, 1=half-open, 2=open)",
        },
        []string{"service", "endpoint"},
    )
    
    circuitBreakerTransitions = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "circuit_breaker_transitions_total",
            Help: "Number of circuit breaker state transitions",
        },
        []string{"service", "endpoint", "from_state", "to_state"},
    )
)

// Register the collectors so the metrics are actually exported
func init() {
    prometheus.MustRegister(circuitBreakerState, circuitBreakerTransitions)
}

func (cb *CircuitBreaker) Call(fn func() error) error {
    if cb.state == Open {
        circuitBreakerState.WithLabelValues(cb.service, cb.endpoint).Set(2)
        return ErrCircuitOpen
    }
    
    err := fn()
    
    if err != nil {
        cb.recordFailure()
    } else {
        cb.recordSuccess()
    }
    
    return err
}

func (cb *CircuitBreaker) transition(from, to State) {
    circuitBreakerTransitions.WithLabelValues(
        cb.service,
        cb.endpoint,
        from.String(),
        to.String(),
    ).Inc()
    
    cb.state = to
    circuitBreakerState.WithLabelValues(cb.service, cb.endpoint).Set(float64(to))
}

💡 Metrics to Add: For every architectural pattern you introduce, emit metrics about its behavior: circuit breaker state changes, cache hit ratios, queue depths, retry counts, timeout frequencies.

Step 4: Test the Architectural Change

Architectural patterns need architectural tests:

## Test circuit breaker behavior under failure
import time

import pytest
from circuit_breaker import CircuitBreaker
# MockService, ServiceError, and CircuitOpenError come from the project's test helpers

@pytest.mark.asyncio
async def test_circuit_breaker_opens_after_threshold():
    failing_service = MockService(failure_rate=1.0)
    cb = CircuitBreaker(threshold=3, timeout=1000)
    
    # First 3 failures should go through
    for i in range(3):
        with pytest.raises(ServiceError):
            await cb.call(failing_service.request)
    
    # Circuit should now be OPEN
    assert cb.state == "OPEN"
    
    # Subsequent calls should fail immediately
    start = time.time()
    with pytest.raises(CircuitOpenError):
        await cb.call(failing_service.request)
    duration = time.time() - start
    
    # Should fail fast (< 100ms, not wait for timeout)
    assert duration < 0.1

@pytest.mark.asyncio
async def test_circuit_breaker_recovers_after_timeout():
    flaky_service = MockService(failure_rate=0.0)  # Now working
    cb = CircuitBreaker(threshold=3, timeout=100)
    
    # Force circuit open
    cb.state = "OPEN"
    cb.opened_at = time.time() - 0.150  # 150ms ago
    
    # Should attempt request (HALF_OPEN)
    result = await cb.call(flaky_service.request)
    assert result == "success"
    assert cb.state == "CLOSED"

Real-World Example: Turning a Payment Failure into Idempotency 💳

The Incident

A customer was charged twice for the same order. Investigation revealed:

  1. User clicked "Submit Payment" once
  2. Request timed out from client's perspective (slow network)
  3. Client retried the request
  4. Both requests succeeded (the first was just slow, not failed)
  5. Two charges, one order

The Insight

This wasn't a bug; it was a missing architectural guarantee. The system assumed requests wouldn't be retried, but networks are unreliable. Any request might be duplicated.

The Architectural Solution: Idempotency Keys

// Before: vulnerable to duplicate requests
app.post('/api/payments', async (req, res) => {
  const { userId, amount, cardToken } = req.body;
  
  const charge = await stripe.charges.create({
    amount,
    currency: 'usd',
    source: cardToken
  });
  
  await db.orders.create({
    userId,
    amount,
    chargeId: charge.id
  });
  
  res.json({ success: true, chargeId: charge.id });
});

// After: idempotent with deduplication
interface PaymentRequest {
  userId: string;
  amount: number;
  cardToken: string;
  idempotencyKey: string;  // Client-generated UUID
}

app.post('/api/payments', async (req, res) => {
  const { userId, amount, cardToken, idempotencyKey } = req.body;
  
  if (!idempotencyKey) {
    return res.status(400).json({ error: 'idempotencyKey required' });
  }
  
  // Check if we've already processed this exact request
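  // NOTE: this read-then-write check is not atomic; two concurrent retries could
  // both miss the lookup and charge twice. In production, back the idempotency
  // store with a unique constraint on (userId, key) and treat a duplicate-key
  // error as "already processed" (see the sketch after the decision list below).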
  const existing = await db.idempotentRequests.findOne({
    key: idempotencyKey,
    userId
  });
  
  if (existing) {
    // Return the cached response
    return res.status(existing.statusCode).json(existing.response);
  }
  
  // Process the request
  try {
    const charge = await stripe.charges.create(
      {
        amount,
        currency: 'usd',
        source: cardToken
      },
      {
        idempotencyKey  // Stripe also supports idempotency keys!
      }
    );
    
    const order = await db.orders.create({
      userId,
      amount,
      chargeId: charge.id
    });
    
    const response = { success: true, chargeId: charge.id, orderId: order.id };
    
    // Cache the response
    await db.idempotentRequests.create({
      key: idempotencyKey,
      userId,
      statusCode: 200,
      response,
      expiresAt: Date.now() + 24 * 60 * 60 * 1000  // 24 hours
    });
    
    res.json(response);
  } catch (err) {
    // Cache failures too (for a shorter time)
    await db.idempotentRequests.create({
      key: idempotencyKey,
      userId,
      statusCode: 500,
      response: { error: 'Payment failed' },
      expiresAt: Date.now() + 60 * 60 * 1000  // 1 hour
    });
    
    res.status(500).json({ error: 'Payment failed' });
  }
});

Client-side implementation:

import { v4 as uuidv4 } from 'uuid';

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function submitPayment(userId, amount, cardToken) {
  const idempotencyKey = uuidv4();
  
  let attempts = 0;
  while (attempts < 3) {
    try {
      const response = await fetch('/api/payments', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          userId,
          amount,
          cardToken,
          idempotencyKey  // Same key for all retries
        })
      });
      
      if (response.ok) {
        return await response.json();
      }
      
      throw new Error(`HTTP ${response.status}`);
    } catch (err) {
      attempts++;
      if (attempts >= 3) throw err;
      await sleep(1000 * 2 ** (attempts - 1));  // Exponential backoff: 1s, 2s
    }
  }
}

Key Architectural Decisions:

  1. Client generates the key: Ensures retries use the same key
  2. Cache both successes and failures: Prevents retry storms
  3. Time-limited cache: Balance between correctness and storage
  4. Include userId in uniqueness: Prevent cross-user collisions
  5. Pass idempotency downstream: Stripe and other APIs support this pattern
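
One refinement the handler above glosses over: findOne followed by create is not atomic, so two concurrent retries can both miss the lookup. The usual fix is to claim the key with a single insert backed by a unique constraint; a hedged sketch, with the constraint and error code as assumptions about the data layer:

// Sketch: claim the idempotency key with one atomic insert before doing any work.
// Assumes a unique index on (userId, key); the DUPLICATE_KEY code and the db
// client are stand-ins, not a specific library's API.
declare const db: any;  // Same hypothetical data layer as the handler above

async function claimIdempotencyKey(userId: string, key: string): Promise<boolean> {
  try {
    await db.idempotentRequests.create({
      key,
      userId,
      statusCode: null,               // Filled in once processing finishes
      response: null,
      expiresAt: Date.now() + 24 * 60 * 60 * 1000,
    });
    return true;                      // We own this key; safe to charge
  } catch (err: any) {
    if (err.code === 'DUPLICATE_KEY') {
      return false;                   // Another request already claimed it
    }
    throw err;
  }
}

If the claim fails, the handler re-reads the stored response (or returns a conflict while the first attempt is still in flight) instead of creating a second charge.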

Common Mistakes in Post-Incident Analysis ⚠️

Mistake 1: Fixing Symptoms Instead of Causes

Symptom: "The database ran out of connections." Bad fix: Increase connection pool size to 500. Root cause: Services don't properly close connections; no timeout on long-running queries. Good fix: Implement connection lifecycle management, query timeouts, and monitoring for connection leaks.

Mistake 2: Adding Complexity Without Removing It

Every incident response adds something (monitoring, safeguards, redundancy). Few remove the complexity that made the incident possible in the first place.

## Adding complexity (circuit breaker) without simplifying
class ServiceClient:
    def __init__(self):
        self.circuit_breaker = CircuitBreaker()
        self.retry_policy = ExponentialRetry()
        self.fallback = FallbackCache()
        self.load_balancer = RoundRobin()
        # Now you have 5 things that can fail...

Better approach: Can you eliminate the dependency entirely? Can you make it asynchronous? Can you simplify the interaction model?
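
Where the answer to "can you make it asynchronous?" is yes, the simplification can be dramatic. A sketch, assuming stand-in account, email, and queue clients (declared here only so the example is self-contained):

// Stand-ins for the real account, email, and queue clients
declare function createAccount(user: { id: string; email: string }): Promise<void>;
declare const emailService: { sendWelcome(user: { id: string; email: string }): Promise<void> };
declare function enqueue(queue: string, payload: object): Promise<void>;

// Before: the request path blocks on a flaky downstream call
async function handleSignup(user: { id: string; email: string }) {
  await createAccount(user);
  await emailService.sendWelcome(user);   // An email outage fails the whole signup
}

// After: decoupled via a queue; the request path has one less way to fail
async function handleSignupAsync(user: { id: string; email: string }) {
  await createAccount(user);
  await enqueue('welcome-email', { userId: user.id });  // A worker retries later
}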

Mistake 3: Over-Generalizing from Single Incidents

One timeout doesn't mean every service needs a 500ms timeout. One cache stampede doesn't mean you need distributed locking everywhere.

Good heuristic: Implement targeted solutions first, generalize only after seeing the pattern 2-3 times.

Mistake 4: Ignoring the Human System

Technical fixes address only half the problem. Consider:

  • Communication: How did teams coordinate? What slowed them down?
  • Documentation: What knowledge was missing? Where was the runbook wrong?
  • Authority: Who could make decisions? Were there escalation bottlenecks?
  • Tooling: What capabilities did responders lack?

Mistake 5: Not Testing the Countermeasures

Bad: Add a circuit breaker, assume it works.
Good: Add a circuit breaker, write tests, then deliberately cause failures in staging to verify it opens and recovers correctly.

🔧 Try this: Schedule monthly "game days" where you intentionally trigger failure modes to test your defenses. Chaos engineering isn't optional for critical systems.
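
A minimal fault-injection wrapper is often enough to get a game day started; a sketch, assuming you control which client staging uses:

// Sketch: wrap a client call so a configurable fraction of requests fail,
// letting you watch circuit breakers, retries, and alerts react in staging.
function withFaultInjection<Args extends unknown[], Res>(
  fn: (...args: Args) => Promise<Res>,
  failureRate = 0.2,
): (...args: Args) => Promise<Res> {
  return async (...args: Args) => {
    if (Math.random() < failureRate) {
      throw new Error('injected failure');   // Simulated dependency outage
    }
    return fn(...args);
  };
}

// Staging only: const flakyServiceB = withFaultInjection(callServiceBGuarded, 0.3);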

Key Takeaways 🎯

📋 Incident Insights Quick Reference

| Topic | Takeaway |
|---|---|
| Core Principle | Incidents reveal what testing cannot: real-world complexity, emergent behavior, operational reality |
| Analysis Goal | Identify vulnerability classes, not just specific bugs |
| Cascading Failures | Need circuit breakers, backpressure, load shedding, timeout hierarchies |
| Data Corruption | Parse, don't validate; use types to make illegal states unrepresentable |
| Resource Exhaustion | Bound all caches, pools, queues; implement proper cleanup |
| Idempotency | Critical for retry safety; client generates keys, server caches results |
| Observability | Every architectural pattern needs metrics showing it's working |
| Common Mistake | Fixing symptoms (increase limits) instead of causes (fix leaks) |
| Testing | Verify countermeasures work through chaos engineering and game days |

Remember: The best architects are those who've debugged the most disasters. Every incident is an opportunity to build systems that are not just functional, but antifragile: systems that get stronger when stressed. Your scars become your architecture.