Cascading Failures
Understanding how failures propagate through systems
Cascading Failures in Distributed Systems
Master the art of identifying and preventing cascading failures in distributed systems with free flashcards and spaced repetition practice. This lesson covers failure propagation patterns, circuit breaker implementations, and bulkhead isolation strategies: essential concepts for debugging production systems under pressure.
Welcome to the World of Cascading Failures
Imagine a single database query timeout bringing down your entire platform. Sounds dramatic? It happens more often than you'd think. Cascading failures occur when a single component failure triggers a chain reaction, causing multiple dependent services to fail in sequence. In distributed systems, these failures can spread like wildfire, turning a minor hiccup into a full-scale outage.
⚠️ Why This Matters: According to Google's SRE handbook, cascading failures are responsible for some of the most severe and longest-lasting outages in production systems. Understanding how they propagate, and how to stop them, is critical for any engineer working with distributed architectures.
Core Concepts: Understanding the Cascade
What is a Cascading Failure?
A cascading failure is a failure mode where the failure of one component causes other components to fail, which in turn causes more components to fail, creating a domino effect. Unlike isolated failures that affect only a single service, cascading failures amplify and propagate through system dependencies.
CASCADING FAILURE PROPAGATION

          Initial Failure
                |
                v
          +-----------+
          | Service A |  <- Database timeout
          +-----+-----+
                |
         +------+------+
         |             |
     +-------+     +-------+
     | Srv B |     | Srv C |  <- Retry storms
     +---+---+     +---+---+
         |             |
         +------+------+
                |
                v
          +-----------+
          |   Users   |  <- Complete outage
          +-----------+
The Anatomy of a Cascade
Cascading failures typically follow these stages:
- Trigger Event: An initial failure occurs (server crash, network partition, resource exhaustion)
- Load Redistribution: Traffic shifts to remaining healthy instances
- Resource Saturation: Healthy instances become overloaded
- Failure Propagation: Overloaded instances fail, repeating the cycle
- System Collapse: Critical mass of failures causes total system unavailability
| Stage | System State | Risk Level |
|---|---|---|
| Initial Failure | 1/10 servers down | Low |
| Load Shift | 9 servers at 110% capacity | Medium |
| Secondary Failures | 4/9 remaining servers fail | High |
| Critical Mass | 5 servers handling 10x load | Critical |
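To make the load-shift arithmetic concrete, here is a minimal sketch with made-up numbers (10 servers rated at 100 req/s each, 1,000 req/s of incoming traffic, and a server that fails once it runs above 110% of its rated capacity):

def simulate_cascade(servers=10, capacity=100.0, total_load=1000.0, overload_factor=1.1):
    healthy = servers - 1  # trigger event: one server crashes
    while healthy > 0:
        per_server = total_load / healthy
        print(f"{healthy} healthy servers, {per_server:.0f} req/s each "
              f"({per_server / capacity:.0%} of rated capacity)")
        if per_server <= capacity * overload_factor:
            print("remaining servers absorb the load; the cascade stops")
            return
        healthy -= 1  # an overloaded server fails, shifting its load to the rest
    print("total collapse: no healthy servers remain")

simulate_cascade()

With these numbers every remaining server is pushed past its limit, so the loop runs all the way to collapse; raise overload_factor (the spare headroom) to 1.2 and the cascade stops after the first failure.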
Common Cascade Triggers
Resource Exhaustion Cascades
When one component exhausts a shared resource (connection pools, memory, threads), dependent services can't function:
# Vulnerable code - no connection pooling limits
class DatabaseClient:
    def __init__(self):
        self.connections = []  # Unlimited growth!

    def query(self, sql):
        conn = create_new_connection()  # Creates a new connection every time
        self.connections.append(conn)
        return conn.execute(sql)

# Under load, this exhausts database connections,
# causing ALL services sharing the DB to fail
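For contrast, a minimal sketch of the bounded alternative (the pool size and the create_new_connection helper from the snippet above are illustrative): callers borrow from a fixed-size pool and fail fast when it is exhausted, instead of piling more connections onto a struggling database.

import queue

class BoundedDatabaseClient:
    def __init__(self, max_connections=10, acquire_timeout=2.0):
        # Pre-create a fixed number of connections; nothing grows under load
        self._pool = queue.Queue(maxsize=max_connections)
        for _ in range(max_connections):
            self._pool.put(create_new_connection())
        self._acquire_timeout = acquire_timeout

    def query(self, sql):
        try:
            conn = self._pool.get(timeout=self._acquire_timeout)
        except queue.Empty:
            # Fail fast instead of queueing unbounded work behind a saturated DB
            raise RuntimeError("connection pool exhausted")
        try:
            return conn.execute(sql)
        finally:
            self._pool.put(conn)  # always return the connection to the pool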
Retry Storm Cascades
When services automatically retry failed requests without backoff, they create amplified load:
// Dangerous retry logic
async function fetchData(url) {
  try {
    return await fetch(url);
  } catch (error) {
    // Immediate retry without backoff!
    return fetchData(url); // Recursive retry
  }
}

// While the server is slow, every failed call immediately issues another, and
// new requests keep arriving on top of the retries, so the in-flight load on
// the struggling server grows without bound.
Timeout Cascades
Misconfigured timeouts cause requests to pile up, consuming threads and memory:
// Service A calls Service B
public Response callServiceB() {
    HttpClient client = HttpClient.newBuilder()
        .connectTimeout(Duration.ofMinutes(5))  // Too long!
        .build();
    // If Service B is slow, threads block for up to 5 minutes.
    // Eventually all threads are blocked waiting.
    return client.send(request, handler);
}
Dependency Graphs and Failure Domains
Understanding your system's dependency graph is crucial for predicting cascade paths:
DEPENDENCY GRAPH EXAMPLE

              +---------+
              |  Cache  |
              +----+----+
                   |
          +--------+--------+
          |                 |
     +----------+      +----------+
     | Auth Svc |      | User Svc |
     +----+-----+      +-----+----+
          |                  |
          +--------+---------+
                   |
              +----------+
              | Database |  <- Single point of failure!
              +----------+
Failure domains are boundaries that contain failures. Without proper isolation, a database failure propagates to all dependent services.
💡 Pro Tip: Draw your system's dependency graph and identify critical paths. Services that many others depend on are cascade amplifiers.
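Fan-in is easy to compute once the graph is written down. Here is a minimal sketch using a hypothetical dependency map loosely based on the diagram above (the service names and edges are illustrative):

from collections import Counter

# Each service maps to the services it depends on
dependencies = {
    "auth_svc": ["cache", "database"],
    "user_svc": ["cache", "database"],
    "cart_svc": ["auth_svc", "user_svc", "database"],
}

# Fan-in: how many services depend on each node. High fan-in nodes are the
# shared dependencies a cascade will flow through.
fan_in = Counter(dep for deps in dependencies.values() for dep in deps)
for service, dependents in fan_in.most_common():
    print(f"{service}: {dependents} dependent service(s)")
# database: 3 dependent service(s)  <- cascade amplifier
# cache: 2 dependent service(s)
# auth_svc: 1 dependent service(s)
# user_svc: 1 dependent service(s)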
Defensive Patterns: Breaking the Chain
Circuit Breakers
The circuit breaker pattern stops cascades by failing fast when a dependency is unhealthy:
from enum import Enum
import time

class CircuitState(Enum):
    CLOSED = 1     # Normal operation
    OPEN = 2       # Failing fast
    HALF_OPEN = 3  # Testing recovery

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.state = CircuitState.CLOSED
        self.last_failure_time = None

    def call(self, func, *args):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.timeout:
                self.state = CircuitState.HALF_OPEN
            else:
                raise Exception("Circuit breaker is OPEN")
        try:
            result = func(*args)
            if self.state == CircuitState.HALF_OPEN:
                self.state = CircuitState.CLOSED
                self.failure_count = 0
            return result
        except Exception as e:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN
            raise e

# Usage (inside a request handler)
breaker = CircuitBreaker(failure_threshold=3, timeout=30)
try:
    response = breaker.call(call_external_service, request)
except Exception:
    return fallback_response()  # Fail gracefully
Circuit breaker states:
CIRCUIT BREAKER STATE MACHINE

      +---------------------+
      |       CLOSED        |
      | (normal operation)  |
      +----------+----------+
                 |
        Failure threshold
        reached (3-5 fails)
                 |
                 v
      +---------------------+
      |        OPEN         |
      |   (failing fast)    |  <- Returns error immediately
      +----------+----------+
                 |
         Timeout expires
         (30-60 seconds)
                 |
                 v
      +---------------------+
      |      HALF-OPEN      |
      | (testing recovery)  |
      +----------+----------+
                 |
         +-------+-------+
         |               |
      Success         Failure
         |               |
         v               v
      CLOSED           OPEN
Bulkhead Isolation
The bulkhead pattern (borrowed from ship design) isolates resources to prevent total failure:
package main

import (
    "context"
    "fmt"
    "time"
)

// Bulkhead limits concurrent calls to one dependency using a bounded semaphore
type Bulkhead struct {
    semaphore chan struct{}
    timeout   time.Duration
}

func NewBulkhead(maxConcurrent int, timeout time.Duration) *Bulkhead {
    return &Bulkhead{
        semaphore: make(chan struct{}, maxConcurrent),
        timeout:   timeout,
    }
}

func (b *Bulkhead) Execute(fn func() error) error {
    ctx, cancel := context.WithTimeout(context.Background(), b.timeout)
    defer cancel()
    select {
    case b.semaphore <- struct{}{}:
        defer func() { <-b.semaphore }()
        return fn()
    case <-ctx.Done():
        return fmt.Errorf("bulkhead full: %w", ctx.Err())
    }
}

// Separate bulkheads for different dependencies
var (
    dbBulkhead    = NewBulkhead(50, 5*time.Second)
    cacheBulkhead = NewBulkhead(100, 1*time.Second)
    apiBulkhead   = NewBulkhead(30, 10*time.Second)
)

// Database failures won't exhaust cache resources
func queryDatabase() error {
    return dbBulkhead.Execute(func() error {
        // Database query here
        return nil
    })
}
Bulkhead visualization:
WITHOUT BULKHEADS                   WITH BULKHEADS

+-----------------+                 +------+-------+------+
|                 |                 |  DB  | Cache |  API |
|  Shared Thread  |                 |  []  |  []   |  []  |
|      Pool       |                 |  []  |  []   |  []  |
|                 |                 |  XX  |  []   |  []  |
|  One slow       |                 +------+-------+------+
|  service blocks |
|  everything     |                 Only the DB pool is affected!
+-----------------+
Rate Limiting and Load Shedding
Rate limiting prevents cascades by controlling request rates:
use std::time::Instant;

// Token bucket rate limiter
pub struct RateLimiter {
    capacity: usize,
    tokens: usize,
    refill_rate: usize, // tokens per second
    last_refill: Instant,
}

impl RateLimiter {
    pub fn new(capacity: usize, refill_rate: usize) -> Self {
        RateLimiter {
            capacity,
            tokens: capacity,
            refill_rate,
            last_refill: Instant::now(),
        }
    }

    fn refill(&mut self) {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill);
        let tokens_to_add = (elapsed.as_secs() as usize) * self.refill_rate;
        if tokens_to_add > 0 {
            // Only advance the clock when whole tokens were added; otherwise
            // frequent sub-second calls would reset the timer and never refill
            self.tokens = std::cmp::min(self.capacity, self.tokens + tokens_to_add);
            self.last_refill = now;
        }
    }

    pub fn try_acquire(&mut self) -> bool {
        self.refill();
        if self.tokens > 0 {
            self.tokens -= 1;
            true
        } else {
            false // Rate limit exceeded
        }
    }
}

// Usage in a request handler
fn handle_request(limiter: &mut RateLimiter) -> Result<(), &'static str> {
    if !limiter.try_acquire() {
        return Err("Rate limit exceeded"); // Shed load
    }
    // Process request
    Ok(())
}
Load shedding drops low-priority requests when under stress:
from enum import Enum

class Priority(Enum):
    CRITICAL = 1
    HIGH = 2
    NORMAL = 3
    LOW = 4

class LoadShedder:
    def __init__(self, cpu_threshold=0.8, memory_threshold=0.9):
        self.cpu_threshold = cpu_threshold
        self.memory_threshold = memory_threshold

    def should_accept(self, priority: Priority) -> bool:
        cpu_usage = self.get_cpu_usage()
        memory_usage = self.get_memory_usage()

        # Critical requests are always accepted
        if priority == Priority.CRITICAL:
            return True

        # Extreme load: only critical (checked first, or it would never trigger)
        if cpu_usage > 0.95 or memory_usage > 0.95:
            return priority == Priority.CRITICAL

        # High load: only critical and high priority
        if cpu_usage > self.cpu_threshold or memory_usage > self.memory_threshold:
            return priority in [Priority.CRITICAL, Priority.HIGH]

        return True

    def get_cpu_usage(self):
        # Implementation depends on platform
        import psutil
        return psutil.cpu_percent() / 100.0

    def get_memory_usage(self):
        import psutil
        return psutil.virtual_memory().percent / 100.0

# Usage (inside a request handler)
shedder = LoadShedder()
if not shedder.should_accept(Priority.LOW):
    return {"error": "Service overloaded", "status": 503}
Timeouts and Deadlines
Aggressive timeouts prevent thread exhaustion:
// Proper timeout configuration with context propagation
interface TimeoutConfig {
  connect: number;  // Connection timeout
  request: number;  // Request timeout
  total: number;    // Total operation timeout
}

class TimeoutManager {
  private config: TimeoutConfig = {
    connect: 1000,   // 1 second to connect
    request: 5000,   // 5 seconds for a request
    total: 10000     // 10 seconds total
  };

  async callWithTimeout<T>(
    operation: () => Promise<T>,
    timeoutMs: number
  ): Promise<T> {
    return Promise.race([
      operation(),
      new Promise<T>((_, reject) =>
        setTimeout(() => reject(new Error('Timeout')), timeoutMs)
      )
    ]);
  }

  async chainedCall(): Promise<any> {
    const startTime = Date.now();
    try {
      // Each call has an independent timeout
      const result1 = await this.callWithTimeout(
        () => this.serviceA(),
        this.config.request
      );

      // Check the total deadline
      const elapsed = Date.now() - startTime;
      const remaining = this.config.total - elapsed;
      if (remaining <= 0) {
        throw new Error('Total deadline exceeded');
      }

      const result2 = await this.callWithTimeout(
        () => this.serviceB(result1),
        Math.min(this.config.request, remaining)
      );
      return result2;
    } catch (error) {
      // Fail fast, don't retry on timeout
      throw error;
    }
  }

  private async serviceA(): Promise<any> {
    // Simulated service call
    return {};
  }

  private async serviceB(input: any): Promise<any> {
    // Simulated service call
    return {};
  }
}
Real-World Examples
Example 1: The Database Connection Pool Cascade
Scenario: An e-commerce site experiences a cascading failure during Black Friday.
Initial state:
- Web servers: 20 instances
- Database connection pool: 10 connections per server
- Total: 200 database connections
Timeline:
# Before the cascade
class OrderService:
    def __init__(self):
        self.db_pool = ConnectionPool(
            max_connections=10,
            timeout=30  # 30-second timeout - too long!
        )

    def create_order(self, user_id, items):
        # No timeout on individual operations
        conn = self.db_pool.get_connection()  # Blocks if the pool is exhausted
        # Complex query without a timeout
        result = conn.execute("""
            INSERT INTO orders (user_id, items, total)
            SELECT %s, %s, calculate_total(%s)
        """, (user_id, items, items))
        conn.release()
        return result
What happened:
| Time | Event | Impact |
|---|---|---|
| 00:00 | Traffic spike: 10x normal load | All connection pools saturated |
| 00:01 | Slow query blocks connections | Threads waiting for 30s timeout |
| 00:02 | All web server threads blocked | Load balancer marks servers unhealthy |
| 00:03 | Database CPU at 100% | All queries slow, cascade amplifies |
| 00:05 | Complete outage | Zero successful requests |
The fix:
# After implementing defensive patterns
class OrderService:
    def __init__(self):
        # Bulkhead: separate pools for different operations
        self.read_pool = ConnectionPool(max_connections=15, timeout=5)
        self.write_pool = ConnectionPool(max_connections=5, timeout=10)

        # Circuit breaker for the database
        self.db_breaker = CircuitBreaker(
            failure_threshold=5,
            timeout=30
        )

        # Rate limiter
        self.rate_limiter = RateLimiter(capacity=100, refill_rate=50)

    def create_order(self, user_id, items):
        # Check the rate limit first
        if not self.rate_limiter.try_acquire():
            raise RateLimitError("Too many orders, try again")

        # Use the circuit breaker
        try:
            return self.db_breaker.call(self._do_create_order, user_id, items)
        except CircuitBreakerOpenError:
            # Queue for async processing instead
            return self.queue_order(user_id, items)

    def _do_create_order(self, user_id, items):
        # Use the write pool with an aggressive timeout
        with timeout(5):  # 5 seconds max for the whole operation
            conn = self.write_pool.get_connection(timeout=2)
            result = conn.execute("INSERT INTO orders...", ...)
            conn.release()
            return result
💡 Lesson: Unbounded resource consumption + high load = cascade. Always limit resources and fail fast.
Example 2: The Retry Storm Cascade
Scenario: A payment service experiences intermittent slowness, triggering a retry storm.
Vulnerable code:
// Payment service client - DANGEROUS!
class PaymentClient {
  async processPayment(orderId, amount) {
    const maxRetries = 5;
    let attempt = 0;
    while (attempt < maxRetries) {
      try {
        const response = await fetch('https://payment-api.example.com/charge', {
          method: 'POST',
          body: JSON.stringify({ orderId, amount })
        });
        if (response.ok) {
          return await response.json();
        }
        // Immediate retry - no backoff!
        attempt++;
      } catch (error) {
        attempt++;
        // Continue retrying immediately
      }
    }
    throw new Error('Payment failed after retries');
  }
}
What happened:
RETRY STORM AMPLIFICATION

Time t=0:  100 requests -> Payment Service
Time t=1:  (service slow, all fail)   100 requests x 5 retries =    500 requests
Time t=2:  (still slow)               500 requests x 5 retries =  2,500 requests
Time t=3:  (complete meltdown)      2,500 requests x 5 retries = 12,500 requests

Original load: 100 req/s
Peak load:     12,500 req/s (125x amplification!)
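The multiplication compounds per round of retries. Here is a tiny sketch of the arithmetic above, under the worst-case assumption that every in-flight request fails and is retried the full five times:

def retry_amplification(base_load=100, retries_per_failure=5, rounds=3):
    load = base_load
    for round_number in range(1, rounds + 1):
        load *= retries_per_failure  # every in-flight request fans out again
        print(f"round {round_number}: {load:,} requests")
    print(f"amplification: {load // base_load}x over the original load")

retry_amplification()
# round 1: 500 requests
# round 2: 2,500 requests
# round 3: 12,500 requests
# amplification: 125x over the original load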
The fix: Exponential backoff with jitter
class PaymentClient {
  async processPayment(orderId, amount) {
    const maxRetries = 3;    // Reduced
    const baseDelay = 1000;  // 1 second
    for (let attempt = 0; attempt < maxRetries; attempt++) {
      try {
        const response = await this.makeRequest(orderId, amount);
        return response; // Success
      } catch (error) {
        if (attempt === maxRetries - 1) {
          throw error; // Last attempt, give up
        }
        // Exponential backoff: 1s, 2s, 4s, ...
        const delay = baseDelay * Math.pow(2, attempt);
        // Add jitter: ±25% randomness to prevent a thundering herd
        const jitter = delay * (0.75 + Math.random() * 0.5);
        console.log(`Retry ${attempt + 1}/${maxRetries} after ${jitter}ms`);
        await this.sleep(jitter);
      }
    }
  }

  async makeRequest(orderId, amount) {
    const controller = new AbortController();
    const timeout = setTimeout(() => controller.abort(), 5000);
    try {
      const response = await fetch('https://payment-api.example.com/charge', {
        method: 'POST',
        body: JSON.stringify({ orderId, amount }),
        signal: controller.signal
      });
      clearTimeout(timeout);
      if (!response.ok) {
        throw new Error(`HTTP ${response.status}`);
      }
      return await response.json();
    } catch (error) {
      clearTimeout(timeout);
      throw error;
    }
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}
Backoff comparison:
| Attempt | No Backoff | Linear Backoff | Exponential + Jitter |
|---|---|---|---|
| 1 | 0ms | 1000ms | 750-1250ms |
| 2 | 0ms | 2000ms | 1500-2500ms |
| 3 | 0ms | 3000ms | 3000-5000ms |
| 4 | 0ms | 4000ms | 6000-10000ms |
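If it helps to see where these numbers come from, here is a small sketch that reproduces the table (a 1000 ms base delay, doubling each attempt, with a ±25% jitter band, mirroring the retry code above):

def backoff_schedule(max_attempts=4, base_delay_ms=1000):
    for attempt in range(1, max_attempts + 1):
        linear = attempt * base_delay_ms
        exponential = base_delay_ms * 2 ** (attempt - 1)
        low, high = 0.75 * exponential, 1.25 * exponential  # +/-25% jitter band
        print(f"attempt {attempt}: no backoff 0ms | linear {linear}ms | "
              f"exponential + jitter {low:.0f}-{high:.0f}ms")

backoff_schedule()
# attempt 1: no backoff 0ms | linear 1000ms | exponential + jitter 750-1250ms
# attempt 4: no backoff 0ms | linear 4000ms | exponential + jitter 6000-10000ms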
Example 3: The Microservices Timeout Cascade
Scenario: A shopping cart service depends on multiple microservices, each with cascading timeouts.
Problematic architecture:
// Each downstream call has a 30-second timeout
public class ShoppingCartService {
    private InventoryService inventory;
    private PricingService pricing;
    private ShippingService shipping;
    private TaxService tax;

    public Cart buildCart(String userId) {
        // Sequential calls, each with a 30s timeout
        List<Item> items = inventory.getItems(userId);                 // 30s max
        List<Price> prices = pricing.getPrices(items);                 // 30s max
        ShippingOptions shippingOptions = shipping.getOptions(items);  // 30s max
        Tax taxes = tax.calculate(items, prices);                      // 30s max
        // Total possible wait: 120 seconds!
        return new Cart(items, prices, shippingOptions, taxes);
    }
}
What goes wrong:
TIMEOUT CASCADE SCENARIO

User Request -> Cart Service
                     |
              +------+------+
              |   Timeout   |  120s total possible
              |   Budget    |
              +------+------+
                     |
      +--------------+--------------+
      |              |              |
  Inventory       Pricing       Shipping
  (30s each)       (30s)          (30s)
      |              |              |
      v              v              v
  Database        Database       Database
   (slow)          (slow)         (slow)

Result: All threads blocked for up to 120s
        Thread pool exhausted
        Service completely down
The fix: Deadline propagation and parallel calls
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ShoppingCartService {
    private static final Duration REQUEST_DEADLINE = Duration.ofSeconds(10);
    // inventory, pricing, shipping, and tax service fields as declared above

    public Cart buildCart(String userId, Instant deadline) {
        // Calculate the remaining time budget
        Duration remaining = Duration.between(Instant.now(), deadline);
        if (remaining.isNegative()) {
            throw new DeadlineExceededException("No time remaining");
        }

        // Independent calls run in parallel, sharing the same deadline
        CompletableFuture<List<Item>> itemsFuture =
            CompletableFuture.supplyAsync(() ->
                inventory.getItems(userId, deadline)
            );
        CompletableFuture<ShippingOptions> shippingFuture =
            CompletableFuture.supplyAsync(() ->
                shipping.getOptions(userId, deadline)
            );

        try {
            // Wait for the items with whatever budget is left
            List<Item> items = itemsFuture.get(
                remaining.toMillis(),
                TimeUnit.MILLISECONDS
            );

            // Recalculate the remaining time
            remaining = Duration.between(Instant.now(), deadline);

            // Dependent calls with the propagated deadline
            List<Price> prices = pricing.getPrices(items, deadline);
            Tax taxes = tax.calculate(items, prices, deadline);

            ShippingOptions shippingOpts = shippingFuture.get(
                remaining.toMillis(),
                TimeUnit.MILLISECONDS
            );
            return new Cart(items, prices, shippingOpts, taxes);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            // Cancel pending operations
            itemsFuture.cancel(true);
            shippingFuture.cancel(true);
            throw new DeadlineExceededException("Cart build timeout");
        }
    }

    public Cart buildCart(String userId) {
        // Create a deadline starting from now
        Instant deadline = Instant.now().plus(REQUEST_DEADLINE);
        return buildCart(userId, deadline);
    }
}
Example 4: The Autoscaling Positive Feedback Loop
Scenario: Autoscaling creates a positive feedback loop that amplifies the cascade.
What happened:
# Kubernetes autoscaling configuration - DANGEROUS
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  minReplicas: 10
  maxReplicas: 100
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # Too aggressive!
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0  # Instant scaling - dangerous!
      policies:
      - type: Percent
        value: 100  # Double pods immediately
        periodSeconds: 15
The cascade:
AUTOSCALING DEATH SPIRAL

t=0:  Initial state
      10 pods, 50% CPU each
      Everything normal

t=30: Database slow query
      10 pods, 90% CPU (blocked on DB)
      Autoscaler triggers: scale to 20 pods

t=45: New pods start
      20 pods ALL hitting the slow database
      Database connections: 200 -> 400
      Database CPU: 80% -> 100%

t=60: All pods slow
      20 pods at 95% CPU
      Autoscaler triggers: scale to 40 pods

t=75: Cascade accelerates
      40 pods x 20 connections = 800 DB connections
      Database out of memory
      Database crashes

t=90: Complete failure
      40 pods all failing
      No database available
      System down
The fix: Smarter autoscaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  minReplicas: 10
  maxReplicas: 50  # Reduced max to protect downstream
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  # KEY FIX: Rate limit scaling
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60  # Wait before scaling
      policies:
      - type: Pods
        value: 2  # Add at most 2 pods at a time
        periodSeconds: 60
      - type: Percent
        value: 25  # Or a 25% increase
        periodSeconds: 60
      selectPolicy: Min  # Choose the smaller increase
    scaleDown:
      stabilizationWindowSeconds: 300  # 5-minute cooldown
      policies:
      - type: Pods
        value: 1
        periodSeconds: 60
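A quick way to sanity-check a cap like maxReplicas: 50 before rolling it out is to confirm the worst-case connection fan-out still fits the database. A minimal sketch; the database connection limit and per-pod pool sizes below are assumed values for illustration:

def check_scaling_policy(max_replicas, connections_per_pod, db_max_connections):
    # Worst-case fan-out if the autoscaler scales all the way to its ceiling
    worst_case = max_replicas * connections_per_pod
    headroom = db_max_connections - worst_case
    print(f"worst case: {worst_case} connections "
          f"(database limit: {db_max_connections}, headroom: {headroom})")
    return headroom >= 0

# Assumed database limit of 500 connections; the timeline above suggests the
# database toppled somewhere past 800 connections.
check_scaling_policy(max_replicas=100, connections_per_pod=20, db_max_connections=500)  # False
check_scaling_policy(max_replicas=50, connections_per_pod=10, db_max_connections=500)   # True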
Common Mistakes to Avoid
Mistake 1: No Resource Limits
❌ Wrong:

# Unlimited thread creation
def handle_request(request):
    thread = Thread(target=process_request, args=(request,))
    thread.start()  # Unbounded!
✅ Right:

from concurrent.futures import ThreadPoolExecutor, TimeoutError

# Bounded thread pool
executor = ThreadPoolExecutor(max_workers=50)

def handle_request(request):
    future = executor.submit(process_request, request)
    try:
        return future.result(timeout=5)
    except TimeoutError:
        future.cancel()
        return error_response()
Mistake 2: Sequential Cascading Calls
❌ Wrong:

// Sequential calls, each waiting on the previous one
async function getUserProfile(userId) {
  const user = await fetchUser(userId);       // 2s
  const posts = await fetchPosts(userId);     // 2s
  const friends = await fetchFriends(userId); // 2s
  return { user, posts, friends };            // Total: 6s
}
✅ Right:

// Parallel calls with a timeout
async function getUserProfile(userId) {
  const timeoutPromise = new Promise((_, reject) =>
    setTimeout(() => reject(new Error('Timeout')), 3000)
  );
  const dataPromise = Promise.all([
    fetchUser(userId),
    fetchPosts(userId),
    fetchFriends(userId)
  ]);
  const [user, posts, friends] = await Promise.race([
    dataPromise,
    timeoutPromise
  ]);
  return { user, posts, friends }; // Total: ~2s
}
Mistake 3: Ignoring Partial Failures
❌ Wrong:

// All-or-nothing approach
func GetDashboard(userID string) (*Dashboard, error) {
    user, err := userService.Get(userID)
    if err != nil {
        return nil, err // Fails the entire dashboard
    }
    stats, err := statsService.Get(userID)
    if err != nil {
        return nil, err // One failure = total failure
    }
    return &Dashboard{User: user, Stats: stats}, nil
}
✅ Right:

// Graceful degradation
func GetDashboard(userID string) (*Dashboard, error) {
    dashboard := &Dashboard{}

    // Critical data - must succeed
    user, err := userService.Get(userID)
    if err != nil {
        return nil, err
    }
    dashboard.User = user

    // Optional data - best effort
    stats, err := statsService.Get(userID)
    if err != nil {
        log.Warn("Stats unavailable", err)
        dashboard.Stats = nil // Degrade gracefully
    } else {
        dashboard.Stats = stats
    }
    return dashboard, nil
}
Mistake 4: Missing Health Checks
❌ Wrong:

# No health check - sends traffic to dying instances
@app.route('/api/data')
def get_data():
    return database.query("SELECT * FROM data")
✅ Right:

# Proper health checks
@app.route('/health/liveness')
def liveness():
    # Just check whether the process is alive
    return {'status': 'alive'}, 200

@app.route('/health/readiness')
def readiness():
    # Check whether the instance can handle traffic
    try:
        # Quick dependency checks
        database.execute("SELECT 1", timeout=1)
        cache.ping(timeout=1)
        return {'status': 'ready'}, 200
    except Exception as e:
        # Load balancer removes the instance
        return {'status': 'not ready', 'error': str(e)}, 503

@app.route('/api/data')
def get_data():
    try:
        return database.query("SELECT * FROM data")
    except Exception:
        # Mark unhealthy for subsequent readiness checks
        mark_unhealthy()
        raise
Mistake 5: No Observability
❌ Wrong:

// Silent failures
pub fn call_service(request: Request) -> Result<Response, Error> {
    http_client.post("/api/endpoint")
        .json(&request)
        .send()
        .map_err(|_| Error::ServiceError)
}
✅ Right:

use tracing::{info, error, instrument};
use metrics::{counter, histogram};

#[instrument(skip(http_client))]
pub fn call_service(request: Request) -> Result<Response, Error> {
    let start = Instant::now();
    info!("Calling external service");

    match http_client.post("/api/endpoint")
        .json(&request)
        .timeout(Duration::from_secs(5))
        .send() {
        Ok(response) => {
            let duration = start.elapsed();
            histogram!("service_call_duration_ms", duration.as_millis() as f64);
            counter!("service_call_success", 1);
            info!("Service call succeeded in {:?}", duration);
            Ok(response)
        }
        Err(e) => {
            let duration = start.elapsed();
            counter!("service_call_error", 1, "error_type" => e.to_string());
            error!("Service call failed after {:?}: {}", duration, e);
            Err(Error::ServiceError)
        }
    }
}
Key Takeaways
- Cascading failures amplify: A single failure can trigger exponential load increases through retries, load redistribution, and dependency chains
- Fail fast, not slow: Circuit breakers and aggressive timeouts prevent thread exhaustion and resource starvation
- Isolate blast radius: Bulkheads ensure failures in one component don't exhaust resources for others
- Exponential backoff with jitter: Prevents retry storms from amplifying load during failures
- Deadline propagation: Pass timeout budgets through the call chain to prevent accumulating waits
- Graceful degradation: Return partial results rather than complete failures when possible
- Rate limiting: Control incoming load before it overwhelms the system
- Observability is critical: You can't debug what you can't see, so instrument everything
- Test failure modes: Use chaos engineering to verify your defensive patterns work
- Autoscaling isn't always the answer: Sometimes scaling up amplifies the problem
Quick Reference Card: Cascade Prevention Patterns
| Pattern | Purpose | When to Use |
|---|---|---|
| Circuit Breaker | Stop calling failing dependencies | External service calls, database queries |
| Bulkhead | Isolate resource pools | Multiple dependencies sharing resources |
| Rate Limiter | Control incoming request rate | Public APIs, resource-intensive operations |
| Timeout | Prevent indefinite blocking | Every I/O operation, always |
| Exponential Backoff | Prevent retry storms | All retry logic |
| Load Shedding | Drop low-priority work | Overload conditions |
| Health Checks | Remove unhealthy instances | Load-balanced services |
Further Study
Google SRE Book - Cascading Failures: https://sre.google/sre-book/addressing-cascading-failures/ - In-depth analysis of real-world cascading failures and prevention strategies
AWS Well-Architected Framework - Reliability: https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/ - Best practices for building resilient distributed systems
Martin Fowler's Circuit Breaker Pattern: https://martinfowler.com/bliki/CircuitBreaker.html - Comprehensive explanation of the circuit breaker pattern with examples