Partial Failure Patterns
When some nodes fail while others appear healthy
Partial Failure Patterns in Distributed Systems
Debugging distributed systems requires mastering partial failure patterns: scenarios where some components fail while others continue operating. This lesson covers circuit breakers, bulkheads, timeout strategies, and graceful degradation, with free flashcards to reinforce these critical resilience concepts through spaced repetition practice.
Welcome to Distributed Debugging
In monolithic applications, failures are binary: the system either works or crashes entirely. But distributed systems inhabit a messier reality where partial failures, in which working and broken components coexist, create uniquely challenging debugging scenarios. A database might be accessible from one service but unreachable from another. Network packets might arrive out of order, or not at all. A downstream API could respond in 50ms or time out after 30 seconds.
These partial failure patterns demand specialized debugging strategies. You can't simply restart the system and hope for the best. You need to identify which components are failing, understand cascading effects, and implement resilience patterns that prevent localized failures from becoming system-wide outages.
Core Concepts: Understanding Partial Failures
What Makes Partial Failures Unique
Partial failures occur when:
- Some nodes in a cluster respond while others are silent
- Network partitions separate components that should communicate
- Services degrade performance without fully failing
- Transient errors affect requests intermittently
- Cascading failures propagate through dependent services
Unlike complete system failures, partial failures create ambiguity. When a remote call doesn't return, you don't know if:
- The request never arrived
- The service processed it but the response was lost
- The service is still processing (slow, not failed)
- The service crashed mid-request
This ambiguity makes debugging exponentially harder.
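To make the ambiguity concrete, here is a minimal sketch (the endpoint and payload are hypothetical, and the `requests` library is assumed): when the call below times out, the client cannot distinguish any of the four cases above, so the only safe follow-up is an idempotent retry or an explicit status check.

import requests

def submit_order(payload):
    try:
        resp = requests.post(
            "https://orders.example.com/api/orders",  # hypothetical endpoint
            json=payload,
            timeout=2,
        )
        resp.raise_for_status()
        return resp.json()
    except requests.exceptions.Timeout:
        # Ambiguous outcome: the order may or may not have been created.
        # Blindly retrying without deduplication risks creating a duplicate.
        raise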
The Fallacies of Distributed Computing
Each of these eight assumptions seems reasonable, and each produces bugs when reality violates it:
| Fallacy | Reality | Failure Pattern |
|---|---|---|
| The network is reliable | Packets drop, connections break | Silent message loss |
| Latency is zero | Network calls take time | Timeout cascades |
| Bandwidth is infinite | Throughput has limits | Backpressure failures |
| The network is secure | Attackers exist | DoS, data corruption |
| Topology doesn't change | Nodes join/leave constantly | Stale routing |
| There is one administrator | Multiple teams, configs | Configuration drift |
| Transport cost is zero | Serialization has overhead | Resource exhaustion |
| The network is homogeneous | Mixed protocols, versions | Compatibility breaks |
Key Resilience Patterns
1. Circuit Breaker Pattern
A circuit breaker prevents cascading failures by stopping requests to failing services:
CIRCUIT BREAKER STATE MACHINE

  CLOSED    --(failure threshold exceeded)--> OPEN
  OPEN      --(recovery timeout elapses)----> HALF-OPEN
  HALF-OPEN --(test request succeeds)-------> CLOSED
  HALF-OPEN --(test request fails)----------> OPEN
States explained:
- CLOSED: Normal operation, requests flow through
- OPEN: Too many failures detected, requests immediately fail (fast-fail)
- HALF-OPEN: Testing if service recovered, allow limited requests
💡 Tip: Circuit breakers should track failure rates, not just counts. 5 failures out of 10 requests (50%) is more concerning than 5 out of 10,000 (0.05%).
import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout  # seconds to wait before probing for recovery
        self.state = "CLOSED"
        self.last_failure_time = None

    def call(self, func):
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.timeout:
                self.state = "HALF_OPEN"  # allow one test request through
            else:
                raise CircuitOpenError("Circuit breaker is OPEN")
        try:
            result = func()
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
            raise
        # Success: close the circuit and reset the failure count
        self.state = "CLOSED"
        self.failure_count = 0
        return result
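A short usage sketch for the class above (`fetch_inventory` is a hypothetical stand-in for a real remote call):

import random

breaker = CircuitBreaker(failure_threshold=5, timeout=60)

def fetch_inventory():
    # Hypothetical remote call that fails ~30% of the time
    if random.random() < 0.3:
        raise ConnectionError("inventory service unreachable")
    return {"sku-123": 7}

try:
    stock = breaker.call(fetch_inventory)
except CircuitOpenError:
    stock = None  # Fast-fail: serve cached or degraded data instead
except ConnectionError:
    stock = None  # Failure recorded by the breaker; degrade this request

To follow the rate-based advice in the tip, the breaker could instead keep a sliding window of recent outcomes (for example a `collections.deque`) and open when the failure fraction over that window crosses a threshold.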
2. Bulkhead Pattern
Inspired by ship compartments that prevent one leak from sinking the entire vessel, bulkheads isolate resources:
BULKHEAD ISOLATION

  Thread pool: 100 threads total, partitioned per downstream service

    Service A: 30 threads
    Service B: 40 threads
    Service C: 30 threads

  If Service B hangs, it only exhausts its own 40 threads.
  Services A and C continue operating normally.
Without bulkheads: One slow service exhausts the entire thread pool, starving all other services.
With bulkheads: Resource pools are isolated, containing failures.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Separate thread pools per service
ExecutorService serviceAPool = Executors.newFixedThreadPool(30);
ExecutorService serviceBPool = Executors.newFixedThreadPool(40);
ExecutorService serviceCPool = Executors.newFixedThreadPool(30);
// Service B hanging won't affect Service A
serviceAPool.submit(() -> callServiceA());
serviceBPool.submit(() -> callServiceB()); // This might hang
serviceCPool.submit(() -> callServiceC());
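The same isolation can be sketched in async Python with per-service semaphores instead of thread pools (a minimal sketch; the pool sizes and the `call_service_b` coroutine are illustrative assumptions):

import asyncio

async def call_with_bulkhead(semaphore, coro_factory):
    # Requests beyond this compartment's limit queue here instead of
    # starving the other services' compartments
    async with semaphore:
        return await coro_factory()

async def call_service_b():
    await asyncio.sleep(0.1)  # hypothetical downstream call
    return "ok"

async def main():
    # One concurrency compartment per downstream service
    bulkheads = {
        "service_a": asyncio.Semaphore(30),
        "service_b": asyncio.Semaphore(40),
        "service_c": asyncio.Semaphore(30),
    }
    result = await call_with_bulkhead(bulkheads["service_b"], call_service_b)
    print(result)

asyncio.run(main())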
3. Timeout Strategies
Timeouts are your first defense against indefinite waits, but they require careful tuning:
| Strategy | Use Case | Risk |
|---|---|---|
| Fixed Timeout | Predictable operations | Too short = false failures; too long = resource waste |
| Adaptive Timeout | Variable latency | Complex to implement |
| Percentile-Based | Set timeout to P99 latency | Requires metrics collection |
| Deadline Propagation | Multi-hop requests | Clock sync issues |
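As a sketch of the percentile-based row: derive the timeout from recently observed latencies using the standard library. The sample data and the 20% headroom factor below are arbitrary assumptions.

import statistics

def percentile_timeout(latencies_ms, percentile=99, headroom=1.2):
    # Take the requested percentile of observed latencies, then add headroom
    # so normal P99 traffic doesn't trip the timeout
    cut_points = statistics.quantiles(latencies_ms, n=100)
    return cut_points[percentile - 1] * headroom

recent_latencies = [42, 55, 61, 48, 300, 52, 47, 59, 120, 66] * 10  # hypothetical samples
timeout_ms = percentile_timeout(recent_latencies)
print(f"Use a timeout of ~{timeout_ms:.0f} ms")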
Timeout cascade problem:
import asyncio

# BAD: each hop sets its own fresh timeout, ignoring time already spent upstream
async def service_a():
    return await asyncio.wait_for(service_b(), timeout=5)  # 5-second timeout

async def service_b():
    return await asyncio.wait_for(service_c(), timeout=5)  # Another fresh 5 seconds!

async def service_c():
    await asyncio.sleep(6)  # Slow operation
    return "data"

# In production these hops are separate services: end-to-end latency can approach
# the sum of the per-hop timeouts, and downstream hops keep working long after
# service_a's caller has given up.
GOOD: Budget the remaining time at each hop:

import asyncio, time

async def service_a(deadline):
    remaining = deadline - time.time()
    # Enforce the caller's deadline locally and pass it downstream, with a buffer
    return await asyncio.wait_for(service_b(deadline), timeout=remaining * 0.8)

async def service_b(deadline):
    remaining = deadline - time.time()  # time left in the budget, not a fresh timer
    return await asyncio.wait_for(service_c(), timeout=remaining * 0.8)
💡 Tip: Always set connection timeouts (time to establish a connection) shorter than request timeouts (time for the full response). A connection timeout of 2s with a request timeout of 30s prevents indefinite waiting.
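For example, the `requests` library accepts a `(connect, read)` tuple so the two limits can be tuned separately (the URL below is a placeholder):

import requests

try:
    # 2s to establish the TCP connection, 30s for the response itself
    resp = requests.get("https://api.example.com/inventory", timeout=(2, 30))
    resp.raise_for_status()
except requests.exceptions.ConnectTimeout:
    print("Could not even connect - fail fast and use a fallback")
except requests.exceptions.ReadTimeout:
    print("Connected, but the response never arrived - outcome unknown")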
4. Graceful Degradation
When components fail, gracefully degrade instead of crashing:
SERVICE DEGRADATION LEVELS

Level 1: FULL FUNCTIONALITY
  ✓ Personalized recommendations
  ✓ Real-time inventory
  ✓ User reviews
  ✓ High-res images

    ↓ (recommendation service fails)

Level 2: REDUCED FUNCTIONALITY
  ✗ Personalized recommendations
  ✓ Popular items (cached)
  ✓ Real-time inventory
  ✓ User reviews

    ↓ (database slow)

Level 3: MINIMAL FUNCTIONALITY
  ✗ Personalized recommendations
  ✓ Popular items (cached)
  ✗ Real-time inventory
  ✓ Cached inventory (stale OK)
  ✗ User reviews (skip)

    ↓ (critical systems only)

Level 4: MAINTENANCE MODE
  Display: "Reduced service..."
  ✓ Cached static content only
  ✗ All dynamic features
func GetProductPage(productID string) (*Page, error) {
page := &Page{ProductID: productID}
// Critical: Product details (must succeed)
details, err := getProductDetails(productID)
if err != nil {
return nil, err // Can't show page without product
}
page.Details = details
// Optional: Recommendations (degrade gracefully)
recommendations, err := getRecommendations(productID)
if err != nil {
log.Warn("Recommendations failed, using fallback")
recommendations = getCachedPopularItems() // Fallback
}
page.Recommendations = recommendations
// Optional: Reviews (can skip entirely)
reviews, err := getReviews(productID)
if err != nil {
log.Warn("Reviews unavailable")
// page.Reviews remains nil, UI hides review section
} else {
page.Reviews = reviews
}
return page, nil
}
5. Retry Strategies
Retries handle transient failures, but naive implementations cause problems:
❌ BAD: Immediate retry storm

# All clients retry immediately -> thundering herd
for attempt in range(3):
    try:
        return call_service()
    except Exception:
        continue  # Retry immediately, then silently give up after 3 attempts
✅ GOOD: Exponential backoff with jitter

import random
import time

def exponential_backoff_with_jitter(attempt, base_delay=1, max_delay=60):
    # 2^attempt * base_delay, capped at max_delay, plus random jitter
    delay = min(max_delay, base_delay * (2 ** attempt))
    jitter = delay * random.uniform(0, 0.3)  # Add 0-30% jitter
    return delay + jitter

for attempt in range(5):
    try:
        return call_service()
    except TransientError:
        if attempt == 4:  # Last attempt
            raise
        time.sleep(exponential_backoff_with_jitter(attempt))
Retry timeline:
Attempt 1: immediate
Attempt 2: after ~1.2s (1s + jitter)
Attempt 3: after ~2.3s (2s + jitter)
Attempt 4: after ~4.1s (4s + jitter)
Attempt 5: after ~8.4s (8s + jitter)
Total: ~16 seconds, with the retries spread out
⚠️ Common Mistake: Retrying non-idempotent operations (like payment processing) without deduplication. Always use idempotency keys:
// Generate one idempotency key per logical operation and reuse it on retries
const idempotencyKey = generateUUID();

const response = await fetch('/api/payment', {
  method: 'POST',
  headers: {
    'Idempotency-Key': idempotencyKey, // Same key if the request is retried
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({amount: 100})
});
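On the server side, the key is what makes the retry safe. A minimal sketch of the idea, assuming an in-memory store and a hypothetical `process_payment` helper (a real service would persist keys with a TTL and handle concurrent retries):

# Maps idempotency key -> previously returned response
_processed: dict[str, dict] = {}

def process_payment(amount: int) -> dict:
    # Stand-in for the real charge call
    return {"status": "charged", "amount": amount}

def handle_payment(idempotency_key: str, amount: int) -> dict:
    # A retried request replays the stored result instead of charging twice
    if idempotency_key in _processed:
        return _processed[idempotency_key]
    result = process_payment(amount)
    _processed[idempotency_key] = result
    return result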
Examples: Debugging Partial Failures
Example 1: The Cascading Timeout Disaster
Scenario: An e-commerce site experiences complete outage during flash sale. No errors in logs, just extreme slowness.
Investigation:
# Frontend service (timeout: 30s)
async def render_product_page(product_id):
product = await product_service.get_details(product_id) # 30s timeout
reviews = await review_service.get_reviews(product_id) # 30s timeout
inventory = await inventory_service.check_stock(product_id) # 30s timeout
return render_template('product.html', product, reviews, inventory)
The problem:
- Inventory service is slow (25 seconds per request)
- Each frontend request waits 25s for inventory
- Frontend thread pool exhausted (all threads waiting)
- New requests queue up, waiting for threads
- Queue grows, effective timeout becomes 30s + queue time
- Eventually, even fast requests timeout
Debug trace:
Thread pool (50 threads):
  Thread 1:  [waiting on inventory ... 25s]
  Thread 2:  [waiting on inventory ... 25s]
  Thread 3:  [waiting on inventory ... 25s]
  ...        (all 50 threads blocked)
  Thread 50: [waiting on inventory ... 25s]

Request queue (growing):
  Requests 51, 52, 53 ... 500+ queued
  Clients see timeouts
Solution:
async def render_product_page(product_id):
    # Critical data - must have (fail the page if this fails)
    product = await product_service.get_details(product_id, timeout=5)

    # Optional data - aggressive total budget, fetched in parallel
    reviews, inventory = [], None
    try:
        async with asyncio.timeout(3):  # Python 3.11+; total budget for optional features
            results = await asyncio.gather(
                review_service.get_reviews(product_id),
                inventory_service.check_stock(product_id),
                return_exceptions=True,  # one failure doesn't cancel the other
            )
        if not isinstance(results[0], Exception):
            reviews = results[0]
        if not isinstance(results[1], Exception):
            inventory = results[1]
    except TimeoutError:
        pass  # Budget exhausted - render the page without the optional data

    return render_template('product.html', product, reviews, inventory)
Key fixes:
- Aggressive timeouts (3s for optional features)
- Parallel fetching (don't wait serially)
- Graceful degradation (page works without reviews/inventory)
- Bulkhead pattern would further isolate inventory service
Example 2: The Distributed Transaction Deadlock
Scenario: Order service occasionally hangs for 30+ seconds, then fails with "transaction timeout."
Investigation reveals:
-- Order Service Transaction
BEGIN TRANSACTION;
UPDATE inventory SET quantity = quantity - 1 WHERE product_id = 123;
-- Network call to Payment Service (holds DB lock during RPC!)
CALL payment_service.charge(user_id, amount);
INSERT INTO orders (user_id, product_id, amount) VALUES (456, 123, 99.99);
COMMIT;
The problem:
TIME (top to bottom)

Order Service                     Payment Service              Database
BEGIN TRANSACTION  ---------------------------------------->   tx 1 opened
UPDATE inventory  ----------------------------------------->   row lock held by tx 1
RPC: charge()  ----------------->                               (lock still held during the RPC!)
                                  BEGIN TRANSACTION  ------->   tx 2 opened
                                  UPDATE user_balance  ----->   tx 2 blocks on a row tx 1 locked
waiting for the RPC               waiting for the lock
        ... distributed deadlock: neither side can make progress ...
30s TIMEOUT fires
ROLLBACK  -------------------------------------------------->   locks released, order fails
Solution: Saga Pattern (split into local transactions):
## Step 1: Reserve inventory (local transaction)
def reserve_inventory(product_id, quantity):
with db.transaction():
inventory = db.query(
"SELECT quantity FROM inventory WHERE product_id = ?",
product_id
)
if inventory < quantity:
raise InsufficientInventoryError()
reservation_id = generate_id()
db.execute(
"INSERT INTO reservations (id, product_id, quantity, expires_at) "
"VALUES (?, ?, ?, ?)",
reservation_id, product_id, quantity, now() + timedelta(minutes=10)
)
return reservation_id
## Step 2: Charge payment (separate transaction, no locks held)
def charge_payment(user_id, amount):
return payment_service.charge(user_id, amount) # External call
## Step 3: Finalize order (local transaction)
def finalize_order(reservation_id, payment_id):
with db.transaction():
reservation = db.query(
"SELECT * FROM reservations WHERE id = ?",
reservation_id
)
db.execute(
"UPDATE inventory SET quantity = quantity - ? WHERE product_id = ?",
reservation.quantity, reservation.product_id
)
db.execute("DELETE FROM reservations WHERE id = ?", reservation_id)
db.execute(
"INSERT INTO orders (reservation_id, payment_id, status) "
"VALUES (?, ?, 'completed')",
reservation_id, payment_id
)
## Orchestration with compensation
try:
reservation_id = reserve_inventory(product_id, 1)
try:
payment_id = charge_payment(user_id, amount)
try:
finalize_order(reservation_id, payment_id)
except Exception as e:
payment_service.refund(payment_id) # Compensate
raise
except Exception as e:
cancel_reservation(reservation_id) # Compensate
raise
except Exception as e:
log.error(f"Order failed: {e}")
return {"error": "Order could not be completed"}
Example 3: The Silent Network Partition
Scenario: Distributed cache (Redis cluster) shows inconsistent data. Some services see updated values, others see stale data.
Root cause: Network partition split the cluster:
NORMAL STATE

  Redis cluster: Node 1 (master) <-> Node 2 (replica) <-> Node 3 (replica)
  Clients: Service A, Service B, Service C, Service D all write through Node 1

AFTER NETWORK PARTITION

  Partition 1                            |  Partition 2
  Node 1 (master)                        |  Node 3 (promoted to master!)
  Service A, Service B write to Node 1   |  Service C, Service D write to Node 3
Result: Split-brain scenario with two masters accepting writes.
Detection:
## Health check that detects partition
def check_cluster_health():
nodes = redis_cluster.get_nodes()
masters = [n for n in nodes if n.role == 'master']
if len(masters) > 1:
alert("CRITICAL: Multiple Redis masters detected - possible split-brain")
# Verify with external consensus (etcd, ZooKeeper, etc.)
elected_master = consensus_service.get_leader('redis-cluster')
for master in masters:
if master.id != elected_master:
# Demote false master
master.set_role('replica')
log.info(f"Demoted {master.id} from master to replica")
Prevention: Quorum-based writes:
def safe_write(key, value):
cluster_size = 3
min_acks = (cluster_size // 2) + 1 # Majority quorum (2 of 3)
try:
acks = redis_cluster.set_with_quorum(
key, value,
min_acks=min_acks,
timeout=1.0 # Fail fast if can't reach quorum
)
if acks < min_acks:
raise QuorumNotReachedError(
f"Only {acks}/{min_acks} nodes acknowledged write"
)
return True
except QuorumNotReachedError:
# Don't claim success if quorum not reached
log.error("Write failed - cluster partition suspected")
raise
Example 4: The Retry Amplification Attack
Scenario: Backend service experiences 2x normal load during incident. Load increases as more instances are added!
Investigation:
CASCADING RETRY AMPLIFICATION

  Frontend (3 retries)
        |
        v
  API Gateways x3 (3 retries each)
        |
        v
  Service A instances

  1 user request -> 3 frontend retries
                 -> 9 API gateway retries
                 -> 27 backend requests!
                 -> load amplification
Fix: Request budget tracking:
import uuid

class RequestContext:
    """Retry budget that travels with the request across every hop."""
    def __init__(self, max_retries=3):
        self.attempt_count = 0
        self.max_retries = max_retries
        self.request_id = str(uuid.uuid4())

    def can_retry(self) -> bool:
        return self.attempt_count < self.max_retries

    def record_attempt(self):
        self.attempt_count += 1

# Pass the context through the whole call chain
async def frontend_handler(request):
    ctx = RequestContext(max_retries=3)
    return await call_api_gateway(request, ctx)

async def api_gateway(request, ctx: RequestContext):
    # DON'T create a new retry budget at this layer - spend the shared one
    while True:
        ctx.record_attempt()
        try:
            return await call_service_a(request, ctx)
        except TransientError:
            if not ctx.can_retry():
                raise  # budget exhausted somewhere in the chain - stop amplifying

async def service_a(request, ctx: RequestContext):
    # Defensive check: reject requests that have already been retried too often
    if ctx.attempt_count > 5:  # global limit across all layers
        log.warn(f"Rejecting request {ctx.request_id} - too many retries")
        raise TooManyRetriesError()
    # Process request...
Common Mistakes When Debugging Partial Failures ⚠️
1. Assuming "No Error" Means "Success"
In distributed systems, silence is ambiguous. A request that doesn't return might be:
- Still processing (slow)
- Lost in transit
- Processed but response lost
- Deadlocked
❌ Wrong:
try {
await sendMessage(data);
// Assumes success if no exception
} catch (error) {
// Only handles explicit errors
}
✅ Right:

try {
  const ack = await sendMessageWithAck(data, { timeout: 5000 });
  if (!ack.confirmed) {
    throw new Error('Message not confirmed by receiver');
  }
} catch (error) {
  if (error instanceof TimeoutError) {
    // Explicit timeout handling - outcome is unknown
    log.warn('Message delivery unknown - might be a duplicate on retry');
  }
  throw error;
}
2. Not Testing Partial Failure Modes
Most tests assume "all up" or "all down" - neither reflects production.
❌ Wrong:
def test_order_creation():
order = create_order(user_id=123, product_id=456)
assert order.status == 'completed'
✅ Right:
@pytest.mark.parametrize('failure_scenario', [
'payment_slow',
'payment_timeout',
'inventory_stale',
'network_partition',
'partial_inventory_failure'
])
def test_order_creation_partial_failure(failure_scenario, chaos_monkey):
chaos_monkey.inject_fault(failure_scenario)
order = create_order(user_id=123, product_id=456)
# System should degrade gracefully
assert order.status in ['completed', 'pending', 'failed']
assert order.has_valid_state() # No corrupt data
# Verify compensating actions if failed
if order.status == 'failed':
assert not payment_charged(order.id)
assert inventory_restored(order.product_id)
3. Ignoring Thundering Herd on Recovery
When a failing service recovers, every circuit breaker moves from OPEN to HALF-OPEN at the same moment and probes the service simultaneously, causing a spike.

❌ Wrong:

# All circuit breakers test recovery at the same time
if time.time() - self.last_failure > 60:
    self.state = 'HALF_OPEN'

✅ Right:

import random

# Add jitter so recovery probes are spread out
recovery_timeout = 60 + random.uniform(0, 20)  # 60-80 seconds
if time.time() - self.last_failure > recovery_timeout:
    self.state = 'HALF_OPEN'
4. Not Propagating Deadlines
Each service adds its own timeout, causing total latency to exceed SLAs.
❌ Wrong:

// Service A
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
response, err := serviceB.Call(ctx, request)

// Service B creates a brand-new context - it ignores the caller's deadline!
func (s *ServiceB) Call(ctx context.Context, req Request) (Response, error) {
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    return s.serviceC.Call(ctx, req) // Another 30s!
}
✅ Right:

// Service A sets the overall deadline
ctx, cancel := context.WithDeadline(
    context.Background(),
    time.Now().Add(30*time.Second),
)
defer cancel()
response, err := serviceB.Call(ctx, request)

// Service B respects the existing deadline
func (s *ServiceB) Call(ctx context.Context, req Request) (Response, error) {
    if _, ok := ctx.Deadline(); !ok {
        // No deadline set upstream - apply a default
        var cancel context.CancelFunc
        ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
        defer cancel()
    }
    // The deadline already carried in ctx is propagated automatically
    return s.serviceC.Call(ctx, req)
}
5. Logging Without Correlation IDs
You can't trace requests across services without identifiers.
❌ Wrong:
log.info("Processing order")
## Which order? Which user? Which request?
✅ Right:

import uuid
import structlog

logger = structlog.get_logger()

def handle_request(request):
    request_id = request.headers.get('X-Request-ID', str(uuid.uuid4()))
    # Bind context to all subsequent logs for this request
    log = logger.bind(
        request_id=request_id,
        user_id=request.user_id,
        service='order-service'
    )
    log.info('processing_order', order_id=request.order_id)
    # Logs: {"request_id": "abc123", "user_id": 456, ...}

    # Propagate to downstream services
    response = payment_service.charge(
        user_id=request.user_id,
        headers={'X-Request-ID': request_id}  # Pass it along!
    )
With correlation IDs, you can trace the entire request flow:
## Query logs across all services
grep 'request_id=abc123' logs/*.log
## Shows complete timeline:
## 10:23:45.123 [order-service] processing_order request_id=abc123
## 10:23:45.145 [payment-service] charging_card request_id=abc123
## 10:23:45.890 [payment-service] charge_successful request_id=abc123
## 10:23:45.902 [order-service] order_completed request_id=abc123
Key Takeaways
Partial Failure Debugging Checklist
Essential Patterns:
- ✅ Circuit Breakers: Prevent cascading failures (CLOSED → OPEN → HALF-OPEN)
- ✅ Bulkheads: Isolate resource pools to contain failures
- ✅ Timeouts: Set connection timeout < request timeout < total deadline
- ✅ Graceful Degradation: Critical features work even when optional ones fail
- ✅ Exponential Backoff: Add jitter to prevent retry storms
Debugging Principles:
- Silence ≠ Success (explicit acknowledgments required)
- Always use correlation IDs to trace requests
- Propagate deadlines through call chains
- Add jitter to all timing-based operations
- Test partial failure modes with chaos engineering
Anti-Patterns to Avoid:
- ❌ Holding database locks during network calls
- ❌ Unbounded retry budgets (causes amplification)
- ❌ Assuming "all up" or "all down" (partial failures are normal)
- ❌ Creating new timeout contexts at each layer
- ❌ Missing idempotency keys on retryable operations
Monitoring Requirements:
| Metric | Alert Threshold | Indicates |
|---|---|---|
| Circuit breaker open rate | >10% services | Widespread issues |
| P99 latency | >2x P50 | Resource exhaustion |
| Timeout rate | >5% | Cascading timeouts |
| Retry rate | >20% | Instability |
| Thread pool saturation | >80% | Need bulkheads |
Further Study
Essential Reading:
- AWS Builders' Library: Avoiding fallback in distributed systems - Deep dive into graceful degradation patterns from AWS engineers
- Martin Fowler: Circuit Breaker Pattern - Canonical explanation of circuit breakers with implementation guidance
- Google SRE Book: Handling Overload - Google's approach to cascading failures, load shedding, and graceful degradation
Advanced Topics:
- Chaos Engineering: Use tools like Chaos Monkey, Gremlin, or Litmus to inject faults systematically
- Distributed Tracing: Implement OpenTelemetry for end-to-end request tracking
- Consensus Algorithms: Study Raft/Paxos for handling split-brain scenarios
- Backpressure: Learn reactive streams for propagating load signals upstream
💡 Practice Tip: Set up a local microservices environment (Docker Compose) and use tc (Linux traffic control) to inject network delays, packet loss, and partitions. Experience debugging partial failures hands-on!