
Partial Failure Patterns

When some nodes fail while others appear healthy

Partial Failure Patterns in Distributed Systems

Debugging distributed systems requires mastering partial failure patterns: scenarios where some components fail while others continue operating. This lesson covers circuit breakers, bulkheads, timeout strategies, and graceful degradation, with free flashcards to reinforce these critical resilience concepts through spaced repetition practice.

Welcome to Distributed Debugging 💻

In monolithic applications, failures are binary: the system either works or crashes entirely. But distributed systems inhabit a messier reality where partial failures, in which working and broken components exist side by side, create uniquely challenging debugging scenarios. A database might be accessible from one service but unreachable from another. Network packets might arrive out of order, or not at all. A downstream API could respond in 50ms or timeout after 30 seconds.

These partial failure patterns demand specialized debugging strategies. You can't simply restart the system and hope for the best. You need to identify which components are failing, understand cascading effects, and implement resilience patterns that prevent localized failures from becoming system-wide outages.

Core Concepts: Understanding Partial Failures 🔍

What Makes Partial Failures Unique

Partial failures occur when:

  • Some nodes in a cluster respond while others are silent
  • Network partitions separate components that should communicate
  • Services degrade performance without fully failing
  • Transient errors hit some requests but not others
  • Cascading failures propagate through dependent services

Unlike complete system failures, partial failures create ambiguity. When a remote call doesn't return, you don't know if:

  1. The request never arrived
  2. The service processed it but the response was lost
  3. The service is still processing (slow, not failed)
  4. The service crashed mid-request

This ambiguity makes debugging exponentially harder.

The Fallacies of Distributed Computing

These eight assumptions lead to bugs when violated:

Fallacy                     | Reality                          | Failure Pattern
The network is reliable     | Packets drop, connections break  | Silent message loss
Latency is zero             | Network calls take time          | Timeout cascades
Bandwidth is infinite       | Throughput has limits            | Backpressure failures
The network is secure       | Attackers exist                  | DoS, data corruption
Topology doesn't change     | Nodes join/leave constantly      | Stale routing
There is one administrator  | Multiple teams, configs          | Configuration drift
Transport cost is zero      | Serialization has overhead       | Resource exhaustion
The network is homogeneous  | Mixed protocols, versions        | Compatibility breaks

Key Resilience Patterns

1. Circuit Breaker Pattern 🔌

A circuit breaker prevents cascading failures by stopping requests to failing services:

CIRCUIT BREAKER STATE MACHINE

     ┌─────────────────────────────┐
     │                             │
     ▼                             │
  🟢 CLOSED ──(failures)──→ 🔴 OPEN
     ▲                         │
     │                      (timeout)
     │                         │
     │                         ▼
     └────(success)────  🟡 HALF-OPEN
                              │
                         (test request)
                              │
                    ┌─────────┴─────────┐
                    │                   │
                (success)           (failure)
                    │                   │
                    ▼                   ▼
                 CLOSED               OPEN

States explained:

  • CLOSED: Normal operation, requests flow through
  • OPEN: Too many failures detected, requests immediately fail (fast-fail)
  • HALF-OPEN: Testing if service recovered, allow limited requests

💡 Tip: Circuit breakers should track failure rates, not just counts. 5 failures out of 10 requests (50%) is more concerning than 5 out of 10,000 (0.05%).

import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.state = "CLOSED"
        self.last_failure_time = None
    
    def call(self, func):
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.timeout:
                self.state = "HALF_OPEN"
            else:
                raise CircuitOpenError("Circuit breaker is OPEN")
        
        try:
            result = func()
            if self.state == "HALF_OPEN":
                # Test request succeeded - close the circuit again
                self.state = "CLOSED"
                self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
                # A failed test request (or too many failures) opens the circuit
                self.state = "OPEN"
            raise
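
The class above trips on a raw failure count. A rate-based variant tracks recent outcomes in a sliding window, in the spirit of the tip above; the sketch below is a minimal illustration (window size and thresholds are arbitrary placeholders), not a production-ready breaker.

import time
from collections import deque

class RollingFailureRate:
    """Records (timestamp, success) pairs and reports the failure rate over a window."""
    def __init__(self, window_seconds=30):
        self.window_seconds = window_seconds
        self.outcomes = deque()

    def record(self, succeeded):
        now = time.time()
        self.outcomes.append((now, succeeded))
        # Drop outcomes that have aged out of the window
        while self.outcomes and now - self.outcomes[0][0] > self.window_seconds:
            self.outcomes.popleft()

    def failure_rate(self):
        if not self.outcomes:
            return 0.0
        failures = sum(1 for _, ok in self.outcomes if not ok)
        return failures / len(self.outcomes)

def should_open(stats, min_requests=20, threshold=0.5):
    # Require a minimum volume so 1 failure out of 2 requests doesn't trip the breaker
    return len(stats.outcomes) >= min_requests and stats.failure_rate() >= threshold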

2. Bulkhead Pattern 🚢

Inspired by ship compartments that prevent one leak from sinking the entire vessel, bulkheads isolate resources:

BULKHEAD ISOLATION

┌─────────────────────────────────────────┐
│         THREAD POOL (100 threads)       │
├─────────────┬──────────────┬────────────┤
│ Service A   │  Service B   │ Service C  │
│ (30 threads)│ (40 threads) │(30 threads)│
│             │              │            │
│  ⚡⚡⚡⚡     │  ⚡⚡⚡⚡⚡    │  ⚡⚡⚡⚡    │
│  ⚡⚡⚡⚡     │  ⚡⚡⚡⚡⚡    │  ⚡⚡⚡⚡    │
│  ⚡⚡⚡⚡     │  ⚡⚡⚡⚡⚡    │  ⚡⚡⚡⚡    │
└─────────────┴──────────────┴────────────┘

If Service B hangs, it only exhausts its 40 threads.
Services A and C continue operating normally.

Without bulkheads: One slow service exhausts the entire thread pool, starving all other services.

With bulkheads: Resource pools are isolated, containing failures.

// Separate thread pools per service
ExecutorService serviceAPool = Executors.newFixedThreadPool(30);
ExecutorService serviceBPool = Executors.newFixedThreadPool(40);
ExecutorService serviceCPool = Executors.newFixedThreadPool(30);

// Service B hanging won't affect Service A
serviceAPool.submit(() -> callServiceA());
serviceBPool.submit(() -> callServiceB()); // This might hang
serviceCPool.submit(() -> callServiceC());
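
The same isolation works without a thread-per-request model. A minimal asyncio sketch (the permit counts and call_service_* names are placeholders) caps concurrent calls per dependency with a semaphore, so a hung dependency can only occupy its own permits:

import asyncio

# One semaphore per downstream dependency = one bulkhead each
BULKHEADS = {
    "service_a": asyncio.Semaphore(30),
    "service_b": asyncio.Semaphore(40),
    "service_c": asyncio.Semaphore(30),
}

async def call_with_bulkhead(name, make_call, timeout=5):
    async with BULKHEADS[name]:  # waits only if THIS dependency's permits are gone
        return await asyncio.wait_for(make_call(), timeout)

# A hang in Service B exhausts at most its 40 permits:
# result = await call_with_bulkhead("service_b", call_service_b)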

3. Timeout Strategies ⏱️

Timeouts are your first defense against indefinite waits, but they require careful tuning:

Strategy             | Use Case                    | Risk
Fixed Timeout        | Predictable operations      | Too short = false failures; too long = resource waste
Adaptive Timeout     | Variable latency            | Complex to implement
Percentile-Based     | Set timeout to P99 latency  | Requires metrics collection
Deadline Propagation | Multi-hop requests          | Clock sync issues
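
For the Percentile-Based row, a hedged sketch of deriving a timeout from observed latencies with the standard-library statistics module (in practice you would read the same percentile from your metrics system; the fallback value is a placeholder):

import statistics

def percentile_timeout_ms(latencies_ms, headroom=1.2):
    """Pick a timeout slightly above the observed P99 latency."""
    if len(latencies_ms) < 100:
        return 5000  # not enough samples yet - fall back to a generous default
    # quantiles(n=100) returns the 1st..99th percentile cut points; index 98 is P99
    p99 = statistics.quantiles(latencies_ms, n=100)[98]
    return p99 * headroom

# e.g. if P99 of recent calls is 180ms, the timeout becomes ~216ms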

Timeout cascade problem:

# BAD: Timeouts don't consider call depth
async def service_a():
    return await service_b(timeout=5)  # 5 second timeout

async def service_b(timeout=5):
    return await service_c(timeout=5)  # Another fresh 5 seconds!

async def service_c(timeout=5):
    await asyncio.sleep(6)  # Slow operation
    return "data"

# Nothing tracks the remaining end-to-end budget: with queueing or a single
# retry at any hop, service_a's caller waits 10+ seconds before timing out!

GOOD: Budget remaining time:

import time

async def service_a(deadline):
    remaining = deadline - time.time()
    return await service_b(timeout=remaining * 0.8)  # Leave buffer

async def service_b(timeout):
    remaining = timeout * 0.8  # Propagate reduced budget
    return await service_c(timeout=remaining)

💡 Tip: Always set connection timeouts (time to establish connection) shorter than request timeouts (time for full response). A connection timeout of 2s with request timeout of 30s prevents indefinite waiting.
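
With Python's requests library, for example, the two limits are passed as a (connect, read) tuple; the URL and numbers below are placeholders:

import requests

try:
    # 2s to establish the connection, 30s to receive the full response
    resp = requests.get("https://inventory.internal/stock/123", timeout=(2, 30))
    resp.raise_for_status()
except requests.exceptions.ConnectTimeout:
    print("could not connect - fail fast and use a fallback")
except requests.exceptions.ReadTimeout:
    print("connected, but the response never finished - outcome unknown")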

4. Graceful Degradation 📉

When components fail, gracefully degrade instead of crashing:

SERVICE DEGRADATION LEVELS

┌─────────────────────────────────────┐
│ Level 1: FULL FUNCTIONALITY         │
│ ✓ Personalized recommendations      │
│ ✓ Real-time inventory               │
│ ✓ User reviews                      │
│ ✓ High-res images                   │
└─────────────────────────────────────┘
             ↓ (recommendation service fails)
┌─────────────────────────────────────┐
│ Level 2: REDUCED FUNCTIONALITY      │
│ ✗ Personalized recommendations      │
│ ✓ Popular items (cached)            │
│ ✓ Real-time inventory               │
│ ✓ User reviews                      │
└─────────────────────────────────────┘
             ↓ (database slow)
┌─────────────────────────────────────┐
│ Level 3: MINIMAL FUNCTIONALITY      │
│ ✗ Personalized recommendations      │
│ ✓ Popular items (cached)            │
│ ✗ Real-time inventory               │
│ ✓ Cached inventory (stale OK)       │
│ ✗ User reviews (skip)               │
└─────────────────────────────────────┘
             ↓ (critical systems only)
┌─────────────────────────────────────┐
│ Level 4: MAINTENANCE MODE           │
│ Display: "Reduced service..."       │
│ ✓ Cached static content only        │
│ ✗ All dynamic features              │
└─────────────────────────────────────┘

func GetProductPage(productID string) (*Page, error) {
    page := &Page{ProductID: productID}
    
    // Critical: Product details (must succeed)
    details, err := getProductDetails(productID)
    if err != nil {
        return nil, err  // Can't show page without product
    }
    page.Details = details
    
    // Optional: Recommendations (degrade gracefully)
    recommendations, err := getRecommendations(productID)
    if err != nil {
        log.Warn("Recommendations failed, using fallback")
        recommendations = getCachedPopularItems()  // Fallback
    }
    page.Recommendations = recommendations
    
    // Optional: Reviews (can skip entirely)
    reviews, err := getReviews(productID)
    if err != nil {
        log.Warn("Reviews unavailable")
        // page.Reviews remains nil, UI hides review section
    } else {
        page.Reviews = reviews
    }
    
    return page, nil
}

5. Retry Strategies 🔄

Retries handle transient failures, but naive implementations cause problems:

❌ BAD: Immediate retry storm

# All clients retry immediately → thundering herd
for attempt in range(3):
    try:
        return call_service()
    except:
        continue  # Retry immediately

✅ GOOD: Exponential backoff with jitter

import random
import time

def exponential_backoff_with_jitter(attempt, base_delay=1, max_delay=60):
    # 2^attempt * base_delay + random jitter
    delay = min(max_delay, base_delay * (2 ** attempt))
    jitter = delay * random.uniform(0, 0.3)  # Add 0-30% jitter
    return delay + jitter

for attempt in range(5):
    try:
        return call_service()
    except TransientError as e:
        if attempt == 4:  # Last attempt
            raise
        delay = exponential_backoff_with_jitter(attempt)
        time.sleep(delay)

Retry timeline:

Attempt 1: ✗ ────────────────────────── (immediate)
Attempt 2: ✗ ──────(1.2s)────────────── (1s + jitter)
Attempt 3: ✗ ──────────────(2.3s)────── (2s + jitter)
Attempt 4: ✗ ──────────────────────(4.1s) (4s + jitter)
Attempt 5: ✓ ──────────────────────────────(8.4s)

Total time: ~16 seconds, with jitter spreading the retries across clients

⚠️ Common Mistake: Retrying non-idempotent operations (like payment processing) without deduplication. Always use idempotency keys:

// Include unique request ID
const response = await fetch('/api/payment', {
  method: 'POST',
  headers: {
    'Idempotency-Key': generateUUID(),  // Same key for retries
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({amount: 100})
});
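
The key only helps if the server deduplicates on it. A minimal server-side sketch, with an in-process dict standing in for the shared store (Redis or a database table) a real deployment would need:

# idempotency key -> previously returned response
_completed_payments = {}

def handle_payment(idempotency_key, amount):
    if idempotency_key in _completed_payments:
        # A retry of a request we already processed: replay the result, charge nothing
        return _completed_payments[idempotency_key]

    result = charge_card(amount)  # assumed payment call
    _completed_payments[idempotency_key] = result
    return result

In a real store the check-and-insert must be atomic (or protected by a unique constraint) so two concurrent retries cannot both charge.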

Examples: Debugging Partial Failures 🔧

Example 1: The Cascading Timeout Disaster

Scenario: An e-commerce site experiences complete outage during flash sale. No errors in logs, just extreme slowness.

Investigation:

# Frontend service (timeout: 30s)
async def render_product_page(product_id):
    product = await product_service.get_details(product_id)  # 30s timeout
    reviews = await review_service.get_reviews(product_id)  # 30s timeout
    inventory = await inventory_service.check_stock(product_id)  # 30s timeout
    return render_template('product.html', product, reviews, inventory)

The problem:

  1. Inventory service is slow (25 seconds per request)
  2. Each frontend request waits 25s for inventory
  3. Frontend thread pool exhausted (all threads waiting)
  4. New requests queue up, waiting for threads
  5. Queue grows, effective timeout becomes 30s + queue time
  6. Eventually, even fast requests timeout

Debug trace:

┌─────────────────────────────────────────┐
│ Thread Pool (50 threads)                │
├─────────────────────────────────────────┤
│ Thread 1:  [Waiting on inventory...25s] │
│ Thread 2:  [Waiting on inventory...25s] │
│ Thread 3:  [Waiting on inventory...25s] │
│    ...     (all 50 threads blocked)     │
│ Thread 50: [Waiting on inventory...25s] │
└─────────────────────────────────────────┘
              ↓
┌─────────────────────────────────────────┐
│ Request Queue (growing...)              │
│ Request 51, 52, 53... 500+ queued       │
│ Clients see timeouts                    │
└─────────────────────────────────────────┘

Solution:

async def render_product_page(product_id):
    # Critical data - must have
    product = await product_service.get_details(product_id, timeout=5)
    
    # Use bulkheads + aggressive timeouts for optional data
    reviews, inventory = [], None
    try:
        async with asyncio.timeout(3):  # Total budget for optional features (Python 3.11+)
            # Fetch in parallel; gather returns exceptions instead of raising them
            results = await asyncio.gather(
                review_service.get_reviews(product_id),
                inventory_service.check_stock(product_id),
                return_exceptions=True,
            )
            if not isinstance(results[0], Exception):
                reviews = results[0]
            if not isinstance(results[1], Exception):
                inventory = results[1]
    except TimeoutError:
        # Optional-feature budget exhausted - render the page without them
        pass
    
    return render_template('product.html', product, reviews, inventory)

Key fixes:

  • Aggressive timeouts (3s for optional features)
  • Parallel fetching (don't wait serially)
  • Graceful degradation (page works without reviews/inventory)
  • Bulkhead pattern would further isolate inventory service

Example 2: The Distributed Transaction Deadlock

Scenario: Order service occasionally hangs for 30+ seconds, then fails with "transaction timeout."

Investigation reveals:

-- Order Service Transaction
BEGIN TRANSACTION;
UPDATE inventory SET quantity = quantity - 1 WHERE product_id = 123;
-- Network call to Payment Service (holds DB lock during RPC!)
CALL payment_service.charge(user_id, amount);
INSERT INTO orders (user_id, product_id, amount) VALUES (456, 123, 99.99);
COMMIT;

The problem:

TIME →

Order Service          Payment Service       Database
     │                       │                    │
     │ BEGIN TX              │                    │
     ├──────────────────────────────────────────→ │
     │ UPDATE inventory      │                    │
     │ (acquires row lock)   │                    │
     ├──────────────────────────────────────────→ │
     │                       │                    │
     │ RPC: charge()         │                    │
     ├──────────────────────→│                    │
     │  (holding DB lock!)   │                    │
     │                       │ BEGIN TX           │
     │                       ├───────────────────→│
     │                       │ UPDATE user_balance│
     │                       │ (tries same row!)  │
     │                       ├───────────────────→│
     │                       │   ⏳ BLOCKED       │
     │  ⏳ WAITING           │   ⏳ WAITING       │
     │     FOR RPC           │      FOR LOCK      │
     │                       │                    │
     │     ⏰ 30s TIMEOUT ───────────────────────→│
     │                       │                    │
     └── ROLLBACK ──────────────────────────────→ │

Solution: Saga Pattern (split into local transactions):

# Step 1: Reserve inventory (local transaction)
def reserve_inventory(product_id, quantity):
    with db.transaction():
        inventory = db.query(
            "SELECT quantity FROM inventory WHERE product_id = ?", 
            product_id
        )
        if inventory < quantity:
            raise InsufficientInventoryError()
        
        reservation_id = generate_id()
        db.execute(
            "INSERT INTO reservations (id, product_id, quantity, expires_at) "
            "VALUES (?, ?, ?, ?)",
            reservation_id, product_id, quantity, now() + timedelta(minutes=10)
        )
        return reservation_id

# Step 2: Charge payment (separate transaction, no locks held)
def charge_payment(user_id, amount):
    return payment_service.charge(user_id, amount)  # External call

# Step 3: Finalize order (local transaction)
def finalize_order(reservation_id, payment_id):
    with db.transaction():
        reservation = db.query(
            "SELECT * FROM reservations WHERE id = ?", 
            reservation_id
        )
        db.execute(
            "UPDATE inventory SET quantity = quantity - ? WHERE product_id = ?",
            reservation.quantity, reservation.product_id
        )
        db.execute("DELETE FROM reservations WHERE id = ?", reservation_id)
        db.execute(
            "INSERT INTO orders (reservation_id, payment_id, status) "
            "VALUES (?, ?, 'completed')",
            reservation_id, payment_id
        )

# Orchestration with compensation
try:
    reservation_id = reserve_inventory(product_id, 1)
    try:
        payment_id = charge_payment(user_id, amount)
        try:
            finalize_order(reservation_id, payment_id)
        except Exception as e:
            payment_service.refund(payment_id)  # Compensate
            raise
    except Exception as e:
        cancel_reservation(reservation_id)  # Compensate
        raise
except Exception as e:
    log.error(f"Order failed: {e}")
    return {"error": "Order could not be completed"}

Example 3: The Silent Network Partition

Scenario: Distributed cache (Redis cluster) shows inconsistent data. Some services see updated values, others see stale data.

Root cause: Network partition split the cluster:

NORMAL STATE:
┌─────────────────────────────────────────┐
│            REDIS CLUSTER                │
│                                         │
│  ┌──────┐    ┌───────┐    ┌───────┐     │
│  │Node 1│◄──►│Node 2 │◄──►│Node 3 │     │
│  │Master│    │Replica│    │Replica│     │
│  └──────┘    └───────┘    └───────┘     │
│      ▲           ▲            ▲         │
└──────┼───────────┼────────────┼─────────┘
       │           │            │
   ┌───┴───┬───────┴───┬────────┴───┐
   │       │           │            │
 ServiceA ServiceB  ServiceC    ServiceD

AFTER NETWORK PARTITION:
┌─────────────────┐   ╳╳╳╳╳   ┌─────────────────┐
│  ┌──────┐       │   SPLIT   │  ┌──────────┐   │
│  │Node 1│       │           │  │Node 3    │   │
│  │Master│       │           │  │Promoted  │   │
│  └──────┘       │           │  │to Master!│   │
│      ▲          │           │  └────▲─────┘   │
└──────┼──────────┘           └───────┼─────────┘
       │                              │
   ServiceA                       ServiceC
   ServiceB                       ServiceD
   (writes to Node 1)            (writes to Node 3!)

Result: Split-brain scenario with two masters accepting writes.

Detection:

# Health check that detects partition
def check_cluster_health():
    nodes = redis_cluster.get_nodes()
    masters = [n for n in nodes if n.role == 'master']
    
    if len(masters) > 1:
        alert("CRITICAL: Multiple Redis masters detected - possible split-brain")
        # Verify with external consensus (etcd, ZooKeeper, etc.)
        elected_master = consensus_service.get_leader('redis-cluster')
        
        for master in masters:
            if master.id != elected_master:
                # Demote false master
                master.set_role('replica')
                log.info(f"Demoted {master.id} from master to replica")

Prevention: Quorum-based writes:

def safe_write(key, value):
    cluster_size = 3
    min_acks = (cluster_size // 2) + 1  # Majority quorum (2 of 3)
    
    try:
        acks = redis_cluster.set_with_quorum(
            key, value, 
            min_acks=min_acks,
            timeout=1.0  # Fail fast if can't reach quorum
        )
        if acks < min_acks:
            raise QuorumNotReachedError(
                f"Only {acks}/{min_acks} nodes acknowledged write"
            )
        return True
    except QuorumNotReachedError:
        # Don't claim success if quorum not reached
        log.error("Write failed - cluster partition suspected")
        raise

Example 4: The Retry Amplification Attack

Scenario: Backend service experiences 2x normal load during incident. Load increases as more instances are added!

Investigation:

CASCADING RETRY AMPLIFICATION

                     ┌─────────────────┐
                     │   Frontend      │
                     │   (3 retries)   │
                     └────────┬────────┘
                              │
                    ┌─────────┼─────────┐
                    │         │         │
              ┌─────▼───┐ ┌───▼─────┐ ┌─▼───────┐
              │ API GW  │ │ API GW  │ │ API GW  │
              │(3 retry)│ │(3 retry)│ │(3 retry)│
              └────┬────┘ └────┬────┘ └────┬────┘
                   │           │           │
          ┌────────┴──┬────────┴──┬────────┴──┐
          │           │           │           │
     ┌────▼───┐  ┌────▼───┐  ┌────▼───┐  ┌────▼───┐
     │Service │  │Service │  │Service │  │Service │
     │   A    │  │   A    │  │   A    │  │   A    │
     └────────┘  └────────┘  └────────┘  └────────┘

1 user request → 3 Frontend retries
               → 9 API Gateway retries
               → 27 backend requests!
               → 💥 LOAD AMPLIFICATION

Fix: Request budget tracking:

class RequestContext:
    def __init__(self, max_retries=3):
        self.attempt_count = 0
        self.max_retries = max_retries
        self.request_id = generate_uuid()
    
    def can_retry(self) -> bool:
        return self.attempt_count < self.max_retries
    
    def record_attempt(self):
        self.attempt_count += 1

# Pass context through call chain
async def frontend_handler(request):
    ctx = RequestContext(max_retries=3)
    return await call_api_gateway(request, ctx)

async def api_gateway(request, ctx: RequestContext):
    # DON'T create a new retry budget here - reuse the one passed in
    if not ctx.can_retry():
        # Budget already exhausted upstream: single attempt, no retries
        return await call_service_a(request, ctx)
    
    while True:
        ctx.record_attempt()
        try:
            return await call_service_a(request, ctx)
        except TransientError:
            if not ctx.can_retry():
                raise  # Shared budget spent - stop retrying at this layer
    
async def service_a(request, ctx: RequestContext):
    # Check budget before processing
    if ctx.attempt_count > 5:  # Global limit across all layers
        log.warn(f"Rejecting request {ctx.request_id} - too many retries")
        raise TooManyRetriesError()
    
    # Process request...
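
Since the frontend, gateway, and Service A are separate processes, the shared RequestContext has to travel over the wire rather than as an in-memory object. A hedged sketch using custom headers (the header names are an assumption, not a standard):

def budget_headers(ctx):
    # Outgoing hop: serialize the shared budget into headers
    return {
        "X-Request-ID": ctx.request_id,
        "X-Retry-Attempts": str(ctx.attempt_count),
    }

def context_from_headers(headers, max_retries=3):
    # Incoming hop: rebuild the context instead of starting a fresh budget
    ctx = RequestContext(max_retries=max_retries)
    ctx.request_id = headers.get("X-Request-ID", ctx.request_id)
    ctx.attempt_count = int(headers.get("X-Retry-Attempts", "0"))
    return ctx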

Common Mistakes When Debugging Partial Failures ⚠️

1. Assuming "No Error" Means "Success"

In distributed systems, silence is ambiguous. A request that doesn't return might be:

  • Still processing (slow)
  • Lost in transit
  • Processed but response lost
  • Deadlocked

❌ Wrong:

try {
  await sendMessage(data);
  // Assumes success if no exception
} catch (error) {
  // Only handles explicit errors
}

✅ Right:

try {
  const ack = await sendMessageWithAck(data, { timeout: 5000 });
  if (!ack.confirmed) {
    throw new Error('Message not confirmed by receiver');
  }
} catch (error) {
  if (error instanceof TimeoutError) {
    // Explicit timeout handling - delivery state is unknown
    log.warn('Message delivery unknown - might be duplicate on retry');
  }
  throw error;
}

2. Not Testing Partial Failure Modes

Most tests assume "all up" or "all down" - neither reflects production.

❌ Wrong:

def test_order_creation():
    order = create_order(user_id=123, product_id=456)
    assert order.status == 'completed'

✅ Right:

@pytest.mark.parametrize('failure_scenario', [
    'payment_slow',
    'payment_timeout', 
    'inventory_stale',
    'network_partition',
    'partial_inventory_failure'
])
def test_order_creation_partial_failure(failure_scenario, chaos_monkey):
    chaos_monkey.inject_fault(failure_scenario)
    
    order = create_order(user_id=123, product_id=456)
    
    # System should degrade gracefully
    assert order.status in ['completed', 'pending', 'failed']
    assert order.has_valid_state()  # No corrupt data
    
    # Verify compensating actions if failed
    if order.status == 'failed':
        assert not payment_charged(order.id)
        assert inventory_restored(order.product_id)

3. Ignoring Thundering Herd on Recovery

When a failing service recovers, every breaker's recovery timer expires at roughly the same moment, so they all transition OPEN → HALF-OPEN (and then CLOSED) together and release their pent-up traffic simultaneously, causing a spike.

❌ Wrong:

# All circuit breakers test recovery at same time
if time.time() - self.last_failure > 60:
    self.state = 'HALF_OPEN'

✅ Right:

import random

# Add jitter to recovery attempts
recovery_timeout = 60 + random.uniform(0, 20)  # 60-80 seconds
if time.time() - self.last_failure > recovery_timeout:
    self.state = 'HALF_OPEN'

4. Not Propagating Deadlines

Each service adds its own timeout, causing total latency to exceed SLAs.

❌ Wrong:

// Service A
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
response, err := serviceB.Call(ctx, request)

// Service B ignores the caller's ctx and creates a new one!
func (s *ServiceB) Call(ctx context.Context, req Request) (Response, error) {
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    return s.serviceC.Call(ctx, req) // Another 30s on top of A's 30s!
}

✅ Right:

// Service A sets overall deadline
ctx, cancel := context.WithDeadline(
    context.Background(),
    time.Now().Add(30*time.Second),
)
defer cancel()
response, err := serviceB.Call(ctx, request)

// Service B respects existing deadline
func (s *ServiceB) Call(ctx context.Context, req Request) (Response, error) {
    if _, ok := ctx.Deadline(); !ok {
        // No deadline set upstream - apply a default
        var cancel context.CancelFunc
        ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
        defer cancel()
    }

    // The deadline already travels inside ctx - just pass it on
    return s.serviceC.Call(ctx, req)
}

5. Logging Without Correlation IDs

You can't trace requests across services without identifiers.

❌ Wrong:

log.info("Processing order")
# Which order? Which user? Which request?

✅ Right:

import uuid
import structlog

logger = structlog.get_logger()

def handle_request(request):
    request_id = request.headers.get('X-Request-ID', str(uuid.uuid4()))
    
    # Bind context to all subsequent logs (new name avoids shadowing the module logger)
    log = logger.bind(
        request_id=request_id,
        user_id=request.user_id,
        service='order-service'
    )
    
    log.info('processing_order', order_id=order.id)
    # Logs: {"request_id": "abc123", "user_id": 456, ...}
    
    # Propagate to downstream services
    response = payment_service.charge(
        user_id=request.user_id,
        headers={'X-Request-ID': request_id}  # Pass along!
    )

With correlation IDs, you can trace the entire request flow:

# Query logs across all services
grep 'request_id=abc123' logs/*.log

# Shows complete timeline:
# 10:23:45.123 [order-service] processing_order request_id=abc123
# 10:23:45.145 [payment-service] charging_card request_id=abc123
# 10:23:45.890 [payment-service] charge_successful request_id=abc123
# 10:23:45.902 [order-service] order_completed request_id=abc123

Key Takeaways 🎯

📋 Partial Failure Debugging Checklist

Essential Patterns:

  • ✅ Circuit Breakers: Prevent cascading failures (CLOSED → OPEN → HALF-OPEN)
  • ✅ Bulkheads: Isolate resource pools to contain failures
  • ✅ Timeouts: Set connection timeout < request timeout < total deadline
  • ✅ Graceful Degradation: Critical features work even when optional ones fail
  • ✅ Exponential Backoff: Add jitter to prevent retry storms

Debugging Principles:

  • πŸ” Silence β‰  Success (explicit acknowledgments required)
  • πŸ†” Always use correlation IDs to trace requests
  • ⏰ Propagate deadlines through call chains
  • 🎲 Add jitter to all timing-based operations
  • πŸ’Ύ Test partial failure modes with chaos engineering

Anti-Patterns to Avoid:

  • ❌ Holding database locks during network calls
  • ❌ Unbounded retry budgets (causes amplification)
  • ❌ Assuming "all up" or "all down" (partial failures are normal)
  • ❌ Creating new timeout contexts at each layer
  • ❌ Missing idempotency keys on retryable operations

Monitoring Requirements:

Metric                    | Alert Threshold   | Indicates
Circuit breaker open rate | >10% of services  | Widespread issues
P99 latency               | >2x P50           | Resource exhaustion
Timeout rate              | >5%               | Cascading timeouts
Retry rate                | >20%              | Instability
Thread pool saturation    | >80%              | Need bulkheads
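
A rough sketch of the latency-ratio and timeout-rate checks, assuming you already collect per-request durations and outcome counters (thresholds mirror the table; a real deployment would express these as alerting rules in its monitoring system):

import statistics

def latency_ratio_ok(durations_ms, max_ratio=2.0):
    """False when P99 exceeds 2x P50 - a resource-exhaustion smell."""
    p50 = statistics.median(durations_ms)
    p99 = statistics.quantiles(durations_ms, n=100)[98]
    return p50 > 0 and p99 / p50 <= max_ratio

def timeout_rate_ok(timeout_count, total_count, max_rate=0.05):
    """False when more than 5% of requests time out."""
    return total_count == 0 or timeout_count / total_count <= max_rate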

Further Study 📚

Advanced Topics:

  • Chaos Engineering: Use tools like Chaos Monkey, Gremlin, or Litmus to inject faults systematically
  • Distributed Tracing: Implement OpenTelemetry for end-to-end request tracking
  • Consensus Algorithms: Study Raft/Paxos for handling split-brain scenarios
  • Backpressure: Learn reactive streams for propagating load signals upstream
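
For the backpressure item, the core idea fits in a few lines: a bounded queue that sheds load (or makes producers wait) instead of letting work pile up without limit. A toy asyncio sketch (queue size and the process handler are placeholders):

import asyncio

class Overloaded(Exception):
    """Raised to tell the caller to back off (e.g. respond with HTTP 429)."""

work_queue = asyncio.Queue(maxsize=100)  # the bound IS the backpressure signal

def submit(item):
    try:
        work_queue.put_nowait(item)  # raises immediately when the queue is full
    except asyncio.QueueFull:
        raise Overloaded("queue full - shed load instead of queueing forever")

async def worker(process):
    while True:
        item = await work_queue.get()
        await process(item)
        work_queue.task_done()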

💡 Practice Tip: Set up a local microservices environment (Docker Compose) and use tc (Linux traffic control) to inject network delays, packet loss, and partitions. Experience debugging partial failures hands-on!