
Debugging in Distributed Reality

Understanding failure modes in distributed systems


This lesson covers distributed tracing, log aggregation strategies, debugging race conditions across services, and handling cascading failures: essential skills for building resilient microservices and cloud-native applications.

Welcome to the Chaos: Debugging Distributed Systems 🌐

Debugging a single-threaded application on your local machine? That's like finding your keys in your apartment. Debugging a distributed system? That's like finding your keys somewhere in a city where different neighborhoods exist in different time zones, some streets randomly disappear, and your keys might be in multiple places simultaneously. Welcome to distributed reality! 💫

In distributed systems, the traditional "step through with a debugger" approach crumbles. You're dealing with:

  • Multiple processes across different machines
  • Network latency and failures
  • Asynchronous operations happening concurrently
  • Partial failures where some components work while others don't
  • Clock skew between machines
  • Eventual consistency challenges

The debugging techniques you learned for monolithic applications still apply, but now you need an entirely new toolkit. Let's build it together.

Core Concepts: The Distributed Debugging Arsenal 🔧

1. Distributed Tracing: Following the Breadcrumb Trail 🍞

When a user request touches 15 different microservices, how do you track what happened? Distributed tracing creates a "trace" that follows a request through your entire system.

How it works:

User Request → API Gateway → Auth Service → Order Service → Payment Service → Notification Service
     │              │              │              │                │                    │
     └──────────────┴──────────────┴──────────────┴────────────────┴────────────────────┘
                                    TRACE ID: abc-123-def
                              (propagated through all services)

Each service adds spans (time segments showing what it did) to the trace:

Service            Span ID   Parent Span   Duration   Status
API Gateway        span-1    null          245ms      ✅ OK
Auth Service       span-2    span-1        12ms       ✅ OK
Order Service      span-3    span-1        187ms      ✅ OK
Payment Service    span-4    span-3        3200ms     ⚠️ SLOW
Notification       span-5    span-3        45ms       ❌ FAILED

Key components:

  • Trace ID: Unique identifier for the entire request journey
  • Span ID: Unique identifier for each operation within the trace
  • Parent Span ID: Links spans into a hierarchy
  • Baggage: Key-value pairs propagated across service boundaries

Implementation pattern in C#:

using System.Diagnostics;

public class OrderService
{
    private static ActivitySource activitySource = 
        new ActivitySource("OrderService");
    
    public async Task<Order> CreateOrder(OrderRequest request)
    {
        using var activity = activitySource.StartActivity("CreateOrder");
        
        // Add custom tags for debugging
        activity?.SetTag("order.id", request.OrderId);
        activity?.SetTag("user.id", request.UserId);
        
        try
        {
            var order = await ProcessOrder(request);
            activity?.SetStatus(ActivityStatusCode.Ok);
            return order;
        }
        catch (Exception ex)
        {
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            activity?.RecordException(ex);
            throw;
        }
    }
}

💡 Pro Tip: Use OpenTelemetry; it's the industry standard that works with Jaeger, Zipkin, and cloud providers. Don't reinvent this wheel!

2. Structured Logging: Making Logs Searchable 📋

In a distributed system, you might have thousands of log files across hundreds of servers. Plain text logs like "Error occurred" are useless. Structured logging saves the day by making logs machine-readable.

Bad logging (unstructured):

logger.LogInformation($"User {userId} ordered item {itemId} for ${price}");
// Output: "User 12345 ordered item ABC-789 for $49.99"
// Problem: Can't easily search for all orders over $100

Good logging (structured):

logger.LogInformation(
    "Order created: {UserId} ordered {ItemId} for {Price}",
    userId, itemId, price
);
// Output (JSON): {"message":"Order created","userId":12345,"itemId":"ABC-789","price":49.99,"timestamp":"2024-01-15T10:30:00Z"}
// Now you can query: WHERE price > 100

Essential fields for distributed logs:

Field       Purpose                            Example
timestamp   When it happened (use UTC!)        2024-01-15T10:30:00Z
traceId     Connect logs across services       abc-123-def
spanId      Connect to specific operation      span-3
service     Which service logged this          order-service
level       Severity (DEBUG/INFO/WARN/ERROR)   ERROR
message     Human-readable description         Payment timeout
context     Business-relevant data             {userId, orderId, amount}
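One way to attach these fields automatically is to enrich every log event with the current trace and span IDs. A minimal sketch using Serilog (assumes the Serilog and Serilog.Sinks.Console packages; the property names mirror the table above and are illustrative):

```csharp
using System.Diagnostics;
using Serilog;
using Serilog.Context;

// Configure once at startup: every event carries the service name,
// plus whatever is pushed onto the LogContext below.
Log.Logger = new LoggerConfiguration()
    .Enrich.WithProperty("service", "order-service")
    .Enrich.FromLogContext()
    .WriteTo.Console() // swap for a JSON sink in production
    .CreateLogger();

// In request-handling code: push the current trace/span IDs so every
// log line written in this scope carries them.
var activity = Activity.Current;
using (LogContext.PushProperty("traceId", activity?.TraceId.ToString()))
using (LogContext.PushProperty("spanId", activity?.SpanId.ToString()))
{
    Log.Information("Payment timeout for {OrderId}", "ord-42");
}
```

With a JSON sink configured, the emitted event contains `service`, `traceId`, and `spanId` alongside the message, so log queries can join on the same IDs the tracer uses.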

3. Correlation IDs: The Golden Thread 🧵

A correlation ID is a unique identifier that travels with a request through all services. It's your golden thread through the distributed maze.

┌─────────────────────────────────────────────────────────────┐
│  REQUEST: correlation-id: req-789-xyz                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Service A: [req-789-xyz] Received request                  │
│      ↓                                                      │
│  Service A: [req-789-xyz] Calling Service B                 │
│      ↓                                                      │
│  Service B: [req-789-xyz] Processing data                   │
│      ↓                                                      │
│  Service B: [req-789-xyz] Calling Service C                 │
│      ↓                                                      │
│  Service C: [req-789-xyz] ERROR: Database timeout           │
│      ↓                                                      │
│  Service B: [req-789-xyz] ERROR: Downstream failure         │
│      ↓                                                      │
│  Service A: [req-789-xyz] ERROR: Request failed             │
│                                                             │
└─────────────────────────────────────────────────────────────┘
        Now you can grep all logs for "req-789-xyz"!

Implementation middleware:

public class CorrelationMiddleware
{
    private readonly RequestDelegate _next;
    private const string CorrelationIdHeader = "X-Correlation-Id";
    
    public CorrelationMiddleware(RequestDelegate next)
    {
        _next = next;
    }
    
    public async Task InvokeAsync(HttpContext context)
    {
        // Extract or generate correlation ID
        string correlationId = context.Request.Headers[CorrelationIdHeader]
            .FirstOrDefault() ?? Guid.NewGuid().ToString();
        
        // Add to response headers for debugging
        context.Response.Headers[CorrelationIdHeader] = correlationId;
        
        // Store in context for logging
        context.Items["CorrelationId"] = correlationId;
        
        // Attach to all log events in this scope (Serilog's LogContext)
        using (LogContext.PushProperty("CorrelationId", correlationId))
        {
            await _next(context);
        }
    }
}
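The middleware above attaches the ID to incoming requests and logs; to hand it to downstream services, an outgoing DelegatingHandler can stamp every HTTP call. A minimal sketch (the CorrelationContext ambient store is an illustrative helper, not a library API):

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical ambient store for the current request's correlation ID
// (set it from the middleware; AsyncLocal flows across awaits).
public static class CorrelationContext
{
    private static readonly AsyncLocal<string?> _current = new();
    public static string? Current
    {
        get => _current.Value;
        set => _current.Value = value;
    }
}

// Stamps every outgoing request so downstream services see the same
// golden thread in their own logs.
public class CorrelationPropagationHandler : DelegatingHandler
{
    private const string Header = "X-Correlation-Id";

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        if (!request.Headers.Contains(Header))
        {
            request.Headers.Add(Header,
                CorrelationContext.Current ?? Guid.NewGuid().ToString());
        }
        return base.SendAsync(request, cancellationToken);
    }
}
```

Register it on the HttpClient pipeline (e.g. via AddHttpMessageHandler in ASP.NET Core) so propagation happens on every call without the caller thinking about it.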

4. The Observability Trinity: Metrics, Logs, Traces 📊

Observability is answering "what's wrong?" without having predicted the failure mode. The three pillars:

┌─────────────────┬─────────────────┬────────────────────────┐
│              OBSERVABILITY TRINITY                         │
├─────────────────┬─────────────────┬────────────────────────┤
│   📈 METRICS    │   📋 LOGS       │   🔍 TRACES            │
├─────────────────┼─────────────────┼────────────────────────┤
│ What:           │ What:           │ What:                  │
│ Aggregated data │ Event records   │ Request journeys       │
│                 │                 │                        │
│ When:           │ When:           │ When:                  │
│ "CPU is 90%"    │ "Error at 10:30"│ "Request took 3s"      │
│                 │                 │                        │
│ Tools:          │ Tools:          │ Tools:                 │
│ Prometheus      │ ELK/Splunk      │ Jaeger/Zipkin          │
│ Grafana         │ Loki            │ OpenTelemetry          │
└─────────────────┴─────────────────┴────────────────────────┘

When to use each:

  • Metrics: "Is there a problem?" (dashboards, alerts)
  • Logs: "What's the context?" (debugging specific issues)
  • Traces: "Where's the bottleneck?" (performance investigation)

Example workflow:

  1. 🚨 Alert fires: "API latency > 2s" (from metrics)
  2. πŸ” Check dashboard: Which endpoint? (metrics drill-down)
  3. πŸ“‹ Search logs: Find errors with that endpoint (logs)
  4. πŸ” View trace: See the slow database query (traces)

5. Debugging Race Conditions Across Services ⚡

Race conditions in distributed systems are nightmares because:

  • Events happen on different machines
  • Network delays are unpredictable
  • You can't use a single lock

Classic distributed race condition:

// Service A: Inventory Service
public async Task<bool> ReserveItem(string itemId)
{
    var stock = await _db.GetStock(itemId);  // Read: stock = 1
    
    // Meanwhile, Service B on another machine does the same...
    
    if (stock > 0)
    {
        await _db.UpdateStock(itemId, stock - 1);  // Both decrement!
        return true;  // Now stock = -1 (oversold!)
    }
    return false;
}

Solution patterns:

1. Optimistic Locking (version numbers):

public async Task<bool> ReserveItem(string itemId)
{
    while (true)
    {
        var item = await _db.GetItemWithVersion(itemId);
        
        if (item.Stock <= 0) return false;
        
        var updated = await _db.UpdateIfVersionMatches(
            itemId,
            newStock: item.Stock - 1,
            expectedVersion: item.Version
        );
        
        if (updated) return true;
        
        // Version mismatch: someone else modified it, so retry
        await Task.Delay(Random.Shared.Next(10, 50)); // random jitter (add exponential backoff for heavy contention)
    }
}

2. Distributed Locks (use with caution!):

public async Task<bool> ReserveItem(string itemId)
{
    var lockKey = $"lock:item:{itemId}";
    
    // Try to acquire distributed lock (Redis, Consul, etc.)
    await using var distributedLock = await _lockProvider.AcquireAsync(
        lockKey,
        timeout: TimeSpan.FromSeconds(5)
    );
    
    if (distributedLock == null)
    {
        throw new TimeoutException("Could not acquire lock");
    }
    
    var stock = await _db.GetStock(itemId);
    if (stock > 0)
    {
        await _db.UpdateStock(itemId, stock - 1);
        return true;
    }
    return false;
}

⚠️ Warning: Distributed locks are tricky! Watch for:

  • Lock holder crashes β†’ use timeouts and lease renewal
  • Clock skew β†’ locks expire at different times on different machines
  • Network partitions β†’ two nodes might both think they have the lock

💡 Pro Tip: Prefer idempotency over locks when possible. Design operations so doing them twice gives the same result.

6. Cascading Failures: The Domino Effect 🎲

A cascading failure occurs when one component's failure causes others to fail, like dominoes falling.

Classic scenario:

┌─────────────────────────────────────────────────────────┐
│  CASCADING FAILURE TIMELINE                             │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  t=0s:  💾 Database slows down (90% CPU)                │
│          ↓                                              │
│  t=5s:  🔌 API requests queue up (waiting for DB)       │
│          ↓                                              │
│  t=10s: 🔥 API servers run out of threads               │
│          ↓                                              │
│  t=15s: ⚡ Load balancer health checks fail             │
│          ↓                                              │
│  t=20s: 🌊 Traffic floods to remaining servers          │
│          ↓                                              │
│  t=25s: 💥 All servers crash (out of memory)            │
│          ↓                                              │
│  t=30s: 🚨 TOTAL OUTAGE                                 │
│                                                         │
└─────────────────────────────────────────────────────────┘

Defense mechanisms:

1. Circuit Breakers:

using Polly;
using Polly.CircuitBreaker;

public class PaymentServiceClient
{
    private readonly AsyncCircuitBreakerPolicy _circuitBreaker;
    
    public PaymentServiceClient()
    {
        _circuitBreaker = Policy
            .Handle<HttpRequestException>()
            .CircuitBreakerAsync(
                handledEventsAllowedBeforeBreaking: 5,
                durationOfBreak: TimeSpan.FromSeconds(30)
            );
    }
    
    public async Task<PaymentResult> ProcessPayment(Payment payment)
    {
        try
        {
            return await _circuitBreaker.ExecuteAsync(async () =>
            {
                return await CallPaymentService(payment);
            });
        }
        catch (BrokenCircuitException)
        {
            // Circuit is open - fail fast instead of waiting
            return new PaymentResult 
            { 
                Success = false, 
                Error = "Payment service unavailable" 
            };
        }
    }
}

Circuit breaker states:

     ┌─────────┐
     │ CLOSED  │ ← Normal operation, requests flow
     └────┬────┘
          │ 5 failures
          ↓
     ┌─────────┐
     │  OPEN   │ ← Fails fast, no requests sent
     └────┬────┘
          │ 30 seconds pass
          ↓
     ┌─────────┐
     │HALF-OPEN│ ← Test with 1 request
     └────┬────┘
       Success?│
      ┌────┴────┐
      ↓         ↓
   CLOSED     OPEN

2. Bulkheads (Isolation):

// Separate thread pools for different operations
public class ServiceClient
{
    private readonly AsyncBulkheadPolicy _criticalBulkhead;
    private readonly AsyncBulkheadPolicy _nonCriticalBulkhead;
    
    public ServiceClient()
    {
        // Critical operations get 80% of resources
        _criticalBulkhead = Policy.BulkheadAsync(
            maxParallelization: 80,
            maxQueuingActions: 20
        );
        
        // Non-critical get 20%
        _nonCriticalBulkhead = Policy.BulkheadAsync(
            maxParallelization: 20,
            maxQueuingActions: 5
        );
    }
    
    public async Task<Order> CreateOrder(OrderRequest request)
    {
        // Use critical bulkhead - won't be starved by analytics
        return await _criticalBulkhead.ExecuteAsync(
            async () => await ProcessOrder(request)
        );
    }
    
    public async Task<Report> GenerateReport()
    {
        // Use non-critical bulkhead - can't overwhelm system
        return await _nonCriticalBulkhead.ExecuteAsync(
            async () => await CreateReport()
        );
    }
}

3. Timeouts (Aggressive):

public async Task<User> GetUser(string userId)
{
    using var cts = new CancellationTokenSource();
    cts.CancelAfter(TimeSpan.FromSeconds(2)); // Aggressive timeout
    
    try
    {
        // GetFromJsonAsync (System.Net.Http.Json) deserializes the body
        return await _httpClient.GetFromJsonAsync<User>(
            $"/users/{userId}", 
            cts.Token
        );
    }
    catch (OperationCanceledException)
    {
        // Log and return cached/default value
        _logger.LogWarning("GetUser timeout for {UserId}", userId);
        return await _cache.GetUserOrDefault(userId);
    }
}

💡 Mnemonic for resilience patterns: "CRT" - Circuit breakers, Retries, Timeouts
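Retries are the one leg of CRT without a dedicated example above. A sketch using Polly's WaitAndRetryAsync (retry count and base delay are illustrative), with exponential backoff plus random jitter so many clients don't retry in lockstep:

```csharp
using System;
using System.Net.Http;
using Polly;
using Polly.Retry;

public static class RetryPolicies
{
    // Waits ~200ms, ~400ms, ~800ms between attempts, each plus up to
    // 100ms of random jitter.
    public static AsyncRetryPolicy<HttpResponseMessage> TransientHttp() =>
        Policy<HttpResponseMessage>
            .Handle<HttpRequestException>()
            .OrResult(r => (int)r.StatusCode >= 500)   // retry 5xx responses
            .WaitAndRetryAsync(
                retryCount: 3,
                sleepDurationProvider: attempt =>
                    TimeSpan.FromMilliseconds(100 * Math.Pow(2, attempt))
                    + TimeSpan.FromMilliseconds(Random.Shared.Next(0, 100)));
}
```

Usage: `var response = await RetryPolicies.TransientHttp().ExecuteAsync(() => httpClient.GetAsync("/api/data"));`. Only combine this with idempotent operations; retrying a non-idempotent call is how duplicate orders happen (see Example 1).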

7. Time and Ordering: Clocks Are Lies ⏰

In distributed systems, you can't trust clocks. Different machines have different times, even with NTP synchronization.

The problem:

// Server A (clock: 10:00:00)
var eventA = new Event 
{ 
    Timestamp = DateTime.UtcNow, // 10:00:00
    Data = "User logged in" 
};

// Server B (clock: 09:59:58 - 2 seconds behind!)
var eventB = new Event 
{ 
    Timestamp = DateTime.UtcNow, // 09:59:58
    Data = "User clicked checkout" 
};

// When merged, events are out of order!
// Says user checked out BEFORE logging in!

Solutions:

1. Vector Clocks (track causality):

public class VectorClock
{
    private Dictionary<string, int> _clocks = new();
    
    public void Increment(string nodeId)
    {
        if (!_clocks.ContainsKey(nodeId))
            _clocks[nodeId] = 0;
        _clocks[nodeId]++;
    }
    
    public void Merge(VectorClock other)
    {
        foreach (var (nodeId, timestamp) in other._clocks)
        {
            _clocks[nodeId] = Math.Max(
                _clocks.GetValueOrDefault(nodeId, 0),
                timestamp
            );
        }
    }
    
    // A happened before B if every entry in A is ≤ the matching
    // entry in B, and at least one entry is strictly less
    public bool HappenedBefore(VectorClock other)
    {
        bool allLessOrEqual = _clocks.All(kvp =>
            kvp.Value <= other._clocks.GetValueOrDefault(kvp.Key, 0));
        bool strictlyLess = other._clocks.Any(kvp =>
            _clocks.GetValueOrDefault(kvp.Key, 0) < kvp.Value);
        return allLessOrEqual && strictlyLess;
    }
}

2. Sequence Numbers (ordered events):

public class EventLog
{
    private long _sequenceNumber = 0;
    
    public Event RecordEvent(string data)
    {
        var seqNum = Interlocked.Increment(ref _sequenceNumber);
        
        return new Event
        {
            SequenceNumber = seqNum,
            NodeId = Environment.MachineName,
            Timestamp = DateTimeOffset.UtcNow,
            Data = data
        };
    }
    
    // Order events by sequence number; this is only meaningful per
    // node - cross-node ties are broken deterministically, not
    // causally (use vector clocks for causal ordering)
    public List<Event> MergeAndSort(List<Event> events)
    {
        return events
            .OrderBy(e => e.SequenceNumber)
            .ThenBy(e => e.NodeId) // Break ties deterministically
            .ToList();
    }
}
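A middle ground between per-node sequence numbers and full vector clocks is the Lamport logical clock: each node ticks on every local event, and on receiving a message it jumps its clock past the sender's timestamp. A minimal sketch:

```csharp
using System;
using System.Threading;

// Lamport logical clock: guarantees that if event A causally precedes
// event B, then timestamp(A) < timestamp(B). (The converse does not
// hold - equal or nearby timestamps say nothing about causality.)
public class LamportClock
{
    private long _time;

    // Local event or message send: tick the clock
    public long Tick() => Interlocked.Increment(ref _time);

    // Message receive: jump past the sender's timestamp, then tick
    public long Receive(long senderTime)
    {
        long current, next;
        do
        {
            current = Interlocked.Read(ref _time);
            next = Math.Max(current, senderTime) + 1;
        } while (Interlocked.CompareExchange(ref _time, next, current) != current);
        return next;
    }
}
```

Stamp every cross-service message with the sender's `Tick()` value and call `Receive` on arrival; sorting events by (timestamp, nodeId) then gives a total order consistent with causality.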

🤔 Did you know? Google's Spanner database uses atomic clocks and GPS receivers in every datacenter to keep time accurate within 7ms across the globe!

Examples: Debugging Real Distributed Problems 🔍

Example 1: The Mysterious Duplicate Orders 🛒

Problem: Users occasionally get charged twice for the same order.

Investigation steps:

Step 1 - Check metrics:

-- Query your metrics database
SELECT 
    COUNT(*) as duplicate_orders
FROM orders
GROUP BY user_id, cart_items
HAVING COUNT(*) > 1;
-- Result: 127 duplicates in the last week

Step 2 - Search logs with correlation ID:

// Found in logs:
{
  "correlationId": "req-abc-123",
  "service": "order-service",
  "message": "Creating order",
  "timestamp": "2024-01-15T10:30:00.000Z"
}
{
  "correlationId": "req-abc-123",
  "service": "order-service",
  "message": "Creating order",
  "timestamp": "2024-01-15T10:30:00.015Z"
}
// Same correlation ID, TWO creates!

Step 3 - View distributed trace:

Trace ID: req-abc-123
├─ API Gateway (45ms)
│  ├─ Order Service Attempt 1 (2500ms) ← TIMEOUT!
│  └─ Order Service Attempt 2 (150ms)  ← RETRY succeeded
└─ Both attempts hit the database!

Root cause: The client retried after a timeout, but the first request actually succeeded; it was just slow. The operation wasn't idempotent.

Fix:

public class OrderService
{
    public async Task<Order> CreateOrder(OrderRequest request)
    {
        var idempotencyKey = request.IdempotencyKey; // Client provides!
        
        // Check if we already processed this request
        var existing = await _db.GetOrderByIdempotencyKey(idempotencyKey);
        if (existing != null)
        {
            _logger.LogInformation(
                "Duplicate request detected: {IdempotencyKey}",
                idempotencyKey
            );
            return existing; // Return existing order
        }
        
        // Process order and store with idempotency key
        var order = await ProcessNewOrder(request);
        order.IdempotencyKey = idempotencyKey;
        await _db.SaveOrder(order);
        
        return order;
    }
}

Key lesson: In distributed systems, assume every request will be retried. Make operations idempotent!

Example 2: The Vanishing Logs 👻

Problem: Production error happened, but logs are missing.

Investigation:

Theory 1 - Logs not shipped:

## Check log shipping agent status
systemctl status filebeat
## Result: active, no errors

Theory 2 - Logs filtered out:

## Check logging configuration
Serilog:
  MinimumLevel: "Warning"  # ← AHA!
## Debug and Info logs are dropped!

Theory 3 - Wrong time window:

// Application uses local time
logger.LogError("Error at {Time}", DateTime.Now); // Local time!

// But searching in UTC
Elasticsearch query: timestamp:[2024-01-15T10:00:00 TO 2024-01-15T11:00:00]
// If server is in PST (UTC-8), logs are 8 hours off!

Root cause: The application logged in local time, but the log aggregator indexed in UTC. Searching for "10 AM UTC" missed logs actually at "10 AM PST" (18:00 UTC).

Fix:

// ALWAYS use UTC for logging
logger.LogError(
    "Error occurred at {Timestamp}",
    DateTimeOffset.UtcNow // UTC timestamp with offset
);

// Better: Use structured logging with automatic UTC
public class LoggingConfiguration
{
    public static void Configure()
    {
        Log.Logger = new LoggerConfiguration()
            .Enrich.WithProperty("Environment", "Production")
            .WriteTo.Console(
                new JsonFormatter(
                    renderMessage: true,
                    formatProvider: CultureInfo.InvariantCulture
                )
            )
            .CreateLogger();
    }
}

Key lesson: Time zones are evil. Always log in UTC. Always search in UTC. Always display to users in their local time.

Example 3: The Intermittent 500 Errors 🎲

Problem: API returns HTTP 500 errors randomly, ~2% of requests.

Investigation:

Step 1 - Check error rate per instance:

-- Query metrics grouped by server
SELECT 
    server_id,
    COUNT(*) as errors
FROM api_errors
WHERE timestamp > NOW() - INTERVAL '1 hour'
GROUP BY server_id;

-- Result:
-- server-1: 0 errors
-- server-2: 0 errors  
-- server-3: 1,247 errors ← CULPRIT!

Step 2 - View logs from server-3:

{
  "level": "ERROR",
  "service": "api",
  "server": "server-3",
  "message": "Socket exception: Connection reset",
  "stackTrace": "..."
}

Step 3 - Check distributed trace:

Trace ID: req-def-456
├─ Load Balancer → server-3 (12ms)
├─ server-3 → Database (0ms) ← FAILED IMMEDIATELY
└─ Error: Connection reset

Step 4 - Check infrastructure:

## On server-3, check database connections
netstat -an | grep 5432 | grep ESTABLISHED | wc -l
## Result: 0 connections!

## Check database connection pool
ps aux | grep postgres
## Connection pool shows: MAX_CONNECTIONS=100, IN_USE=100
## Pool exhausted!

Root cause: Database connection pool on server-3 was exhausted. Connections weren't being properly returned to the pool.

Fix:

public class DatabaseService
{
    // BAD: Connection not disposed
    public async Task<User> GetUserBad(int userId)
    {
        var connection = await _connectionPool.GetConnection();
        var user = await connection.QueryAsync<User>(
            "SELECT * FROM users WHERE id = @Id",
            new { Id = userId }
        );
        return user.FirstOrDefault();
        // Connection never returned to pool!
    }
    
    // GOOD: Using statement ensures disposal
    public async Task<User> GetUserGood(int userId)
    {
        await using var connection = await _connectionPool.GetConnection();
        var user = await connection.QueryAsync<User>(
            "SELECT * FROM users WHERE id = @Id",
            new { Id = userId }
        );
        return user.FirstOrDefault();
    } // Connection automatically returned here
}

Key lesson: Resource leaks in distributed systems are insidious. One small leak can take down an entire service. Always dispose resources properly!

Example 4: The Cascading Latency Spike 📈

Problem: API latency suddenly jumps from 100ms to 5 seconds.

Investigation with distributed tracing:

┌─ Normal Request (100ms total) ───────────────────┐
│                                                  │
│ API Gateway    ████ 5ms                          │
│ Auth Service   ██ 2ms                            │
│ Order Service  ████████████ 50ms                 │
│ Database       ████████ 40ms                     │
│ Cache          ██ 3ms                            │
│                                                  │
└──────────────────────────────────────────────────┘

┌─ Slow Request (5000ms total) ────────────────────┐
│                                                  │
│ API Gateway    ████ 5ms                          │
│ Auth Service   ████████████████████████ 4500ms   │ ← BOTTLENECK!
│ Order Service  (waiting)                         │
│ Database       (not reached)                     │
│                                                  │
└──────────────────────────────────────────────────┘

Drilling into Auth Service trace:

Auth Service Span (4500ms):
├─ Validate Token (5ms)
├─ Check Permissions (4490ms) ← PROBLEM!
│  ├─ Call User Service (4485ms)
│  │  ├─ HTTP Request (4480ms)
│  │  │  └─ Read Timeout (4500ms timeout)

Root cause: User Service was down, but Auth Service was waiting the full 4.5s timeout before failing.

Fix - Add circuit breaker and aggressive timeout:

public class AuthService
{
    private readonly AsyncCircuitBreakerPolicy _circuitBreaker;
    private readonly AsyncTimeoutPolicy _timeout;
    
    public AuthService()
    {
        _timeout = Policy.TimeoutAsync(
            TimeSpan.FromMilliseconds(500), // Aggressive!
            TimeoutStrategy.Pessimistic
        );
        
        _circuitBreaker = Policy
            .Handle<HttpRequestException>()
            .Or<TimeoutRejectedException>()
            .CircuitBreakerAsync(
                handledEventsAllowedBeforeBreaking: 3,
                durationOfBreak: TimeSpan.FromSeconds(30)
            );
    }
    
    public async Task<bool> CheckPermissions(string userId)
    {
        try
        {
            return await _circuitBreaker.ExecuteAsync(async () =>
            {
                return await _timeout.ExecuteAsync(async () =>
                {
                    var user = await _userServiceClient.GetUser(userId);
                    return user.HasPermission();
                });
            });
        }
        catch (BrokenCircuitException)
        {
            // Fall back to cached permissions
            return await _cache.GetPermissions(userId);
        }
        catch (TimeoutRejectedException)
        {
            _logger.LogWarning("User service timeout for {UserId}", userId);
            return await _cache.GetPermissions(userId);
        }
    }
}

Key lesson: Slow is worse than down. Use aggressive timeouts and circuit breakers to fail fast and prevent cascading delays.

Common Mistakes to Avoid ⚠️

1. Not Propagating Context 🧵

❌ Bad:

public async Task ProcessOrder(Order order)
{
    // Start new HTTP request without propagating trace context
    var response = await _httpClient.GetAsync("/api/payment");
    // This request is DISCONNECTED from the parent trace!
}

✅ Good:

public async Task ProcessOrder(Order order)
{
    // Tag the current activity (don't dispose it - the framework owns it)
    Activity.Current?.SetTag("order.id", order.Id);
    
    var response = await _httpClient.GetAsync("/api/payment");
    // HttpClient propagates the trace context in the traceparent header
}

2. Logging Sensitive Data 🔐

❌ Bad:

logger.LogInformation(
    "Processing payment: {CreditCard}",
    payment.CreditCardNumber  // DON'T LOG THIS!
);

✅ Good:

logger.LogInformation(
    "Processing payment: {PaymentId} for {Amount}",
    payment.Id,
    payment.Amount  // Log identifiers, not secrets
);

3. Assuming Synchronous Clocks ⏰

❌ Bad:

if (event1.Timestamp > event2.Timestamp)
{
    // Assume event1 happened after event2
    // WRONG if events from different servers!
}

✅ Good:

if (event1.SequenceNumber > event2.SequenceNumber && 
    event1.NodeId == event2.NodeId)
{
    // Compare sequence numbers from same node
}
else
{
    // Use vector clocks for cross-node ordering
    if (event1.VectorClock.HappenedBefore(event2.VectorClock))
    {
        // event1 causally before event2
    }
}

4. No Timeout on External Calls 💣

❌ Bad:

// This can hang forever if service is unresponsive
var result = await _httpClient.GetAsync("/api/data");

✅ Good:

using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(2));
try
{
    var result = await _httpClient.GetAsync("/api/data", cts.Token);
}
catch (OperationCanceledException)
{
    // Handle timeout
}

5. Ignoring Partial Failures 🎯

❌ Bad:

public async Task<OrderResult> CreateOrder(Order order)
{
    await _orderDb.SaveOrder(order);      // Succeeds
    await _inventoryService.Reserve();    // Fails
    await _paymentService.Charge();       // Never called
    
    // Now order is saved but inventory not reserved!
    // Inconsistent state!
}

✅ Good:

public async Task<OrderResult> CreateOrder(Order order)
{
    // Use Saga pattern or distributed transaction
    var saga = new OrderSaga(order);
    
    try
    {
        await saga.ExecuteStep(() => _orderDb.SaveOrder(order));
        await saga.ExecuteStep(() => _inventoryService.Reserve());
        await saga.ExecuteStep(() => _paymentService.Charge());
        
        await saga.Complete();
    }
    catch (Exception ex)
    {
        // Compensate: undo all completed steps
        await saga.Compensate();
        throw;
    }
}

6. Over-Relying on Distributed Locks 🔒

❌ Bad:

// Locks for everything = performance killer
await using var lock1 = await _lockService.Acquire("users");
await using var lock2 = await _lockService.Acquire("orders");
// Holding two locks = high chance of deadlock!

✅ Good:

// Use optimistic concurrency when possible
var user = await _db.GetUser(userId);
user.Version++;
var updated = await _db.UpdateIfVersionMatches(user);
if (!updated)
{
    // Retry or fail
}

7. Not Testing Under Failure ⚡

❌ Bad:

// Only test happy path
[Test]
public async Task CreateOrder_Success()
{
    var result = await _orderService.CreateOrder(order);
    Assert.True(result.Success);
}

✅ Good:

[Test]
public async Task CreateOrder_WhenPaymentServiceDown_ReturnsError()
{
    _paymentServiceMock
        .Setup(x => x.Charge(It.IsAny<Payment>()))
        .ThrowsAsync(new HttpRequestException());
    
    var result = await _orderService.CreateOrder(order);
    
    Assert.False(result.Success);
    Assert.AreEqual("Payment service unavailable", result.Error);
}

[Test]
public async Task CreateOrder_WhenPaymentServiceSlow_Timeouts()
{
    _paymentServiceMock
        .Setup(x => x.Charge(It.IsAny<Payment>()))
        .Returns(async () => 
        {
            await Task.Delay(TimeSpan.FromSeconds(10));
            return new PaymentResult { Success = true };
        });
    
    await Assert.ThrowsAsync<TimeoutException>(
        () => _orderService.CreateOrder(order)
    );
}

Key Takeaways 🎓

📋 Distributed Debugging Quick Reference

Concept            Key Point                                   Tool/Pattern
Tracing            Track requests across services              OpenTelemetry, Jaeger
Logging            Use structured JSON logs                    Serilog, ELK Stack
Correlation IDs    Unique ID per request                       Middleware propagation
Metrics            Aggregate data for alerting                 Prometheus, Grafana
Circuit Breakers   Fail fast when downstream fails             Polly, Resilience4j
Timeouts           Never wait forever                          CancellationToken, Polly
Idempotency        Safe to retry operations                    Idempotency keys
Time               Always use UTC                              DateTimeOffset.UtcNow
Ordering           Use sequence numbers or vector clocks       Lamport timestamps
Resilience         CRT: Circuit breakers, Retries, Timeouts    Polly policies

The Golden Rules 🏆

  1. Observability First: You can't debug what you can't observe. Invest in tracing, logging, and metrics BEFORE you have problems.

  2. Assume Failure: Every network call will fail eventually. Every service will be down eventually. Design for it.

  3. Fail Fast: Timeouts should be aggressive (seconds, not minutes). Circuit breakers should trip quickly.

  4. Make It Idempotent: Every operation should be safe to retry. Use idempotency keys.

  5. UTC Always: Never use local time in distributed systems. Ever.

  6. Correlation Everywhere: Every log, trace, and metric should have a correlation ID.

  7. Test Failures: Unit tests for failures are more important than tests for success.

  8. Structured Everything: Logs, metrics, and traces should all be structured and searchable.

🧠 Memory Device: The "DISTRIBUTED" Mnemonic

  • Distributed tracing tracks requests
  • Idempotency prevents duplicates
  • Structured logging enables search
  • Timeouts prevent hangs
  • Retries need exponential backoff
  • Isolation with bulkheads
  • Breakers stop cascading failures
  • UTC time, always
  • Tags in logs for context
  • Error rates in metrics
  • Deadlines for all operations

📚 Further Study

  1. Distributed Systems Observability - Cindy Sridharan's comprehensive guide: https://www.oreilly.com/library/view/distributed-systems-observability/9781492033431/

  2. OpenTelemetry Documentation - Industry standard for tracing and observability: https://opentelemetry.io/docs/

  3. Google SRE Book - Monitoring Distributed Systems - Battle-tested practices from Google: https://sre.google/sre-book/monitoring-distributed-systems/


💡 Practice tip: Set up a local microservices environment with Docker Compose, Jaeger, and Grafana. Deliberately break services and practice using distributed tracing to debug them. The hands-on experience is invaluable!

🎯 Next steps: Study distributed transactions and the Saga pattern for handling complex workflows across services. Then explore chaos engineering tools like Chaos Monkey to test your system's resilience.