
Debugging in Distributed Reality

Understanding failure modes in distributed systems


This lesson covers distributed tracing, log aggregation strategies, debugging race conditions across services, and handling cascading failures: essential skills for building resilient microservices and cloud-native applications.

Welcome to the Chaos: Debugging Distributed Systems 🌐

Debugging a single-threaded application on your local machine? That's like finding your keys in your apartment. Debugging a distributed system? That's like finding your keys somewhere in a city where different neighborhoods exist in different time zones, some streets randomly disappear, and your keys might be in multiple places simultaneously. Welcome to distributed reality! 💫

In distributed systems, the traditional "step through with a debugger" approach crumbles. You're dealing with:

  • Multiple processes across different machines
  • Network latency and failures
  • Asynchronous operations happening concurrently
  • Partial failures where some components work while others don't
  • Clock skew between machines
  • Eventual consistency challenges

The debugging techniques you learned for monolithic applications still apply, but now you need an entirely new toolkit. Let's build it together.

Core Concepts: The Distributed Debugging Arsenal 🔧

1. Distributed Tracing: Following the Breadcrumb Trail 🍞

When a user request touches 15 different microservices, how do you track what happened? Distributed tracing creates a "trace" that follows a request through your entire system.

How it works:

User Request → API Gateway → Auth Service → Order Service → Payment Service → Notification Service
     │              │              │              │                │                    │
     └──────────────┴──────────────┴──────────────┴────────────────┴────────────────────┘
                                    TRACE ID: abc-123-def
                              (propagated through all services)

Each service adds spans (time segments showing what it did) to the trace:

Service            Span ID   Parent Span   Duration   Status
API Gateway        span-1    null          245ms      ✅ OK
Auth Service       span-2    span-1        12ms       ✅ OK
Order Service      span-3    span-1        187ms      ✅ OK
Payment Service    span-4    span-3        3200ms     ⚠️ SLOW
Notification       span-5    span-3        45ms       ❌ FAILED

Key components:

  • Trace ID: Unique identifier for the entire request journey
  • Span ID: Unique identifier for each operation within the trace
  • Parent Span ID: Links spans into a hierarchy
  • Baggage: Key-value pairs propagated across service boundaries

Implementation pattern in C#:

using System.Diagnostics;

public class OrderService
{
    private static ActivitySource activitySource = 
        new ActivitySource("OrderService");
    
    public async Task<Order> CreateOrder(OrderRequest request)
    {
        using var activity = activitySource.StartActivity("CreateOrder");
        
        // Add custom tags for debugging
        activity?.SetTag("order.id", request.OrderId);
        activity?.SetTag("user.id", request.UserId);
        
        try
        {
            var order = await ProcessOrder(request);
            activity?.SetStatus(ActivityStatusCode.Ok);
            return order;
        }
        catch (Exception ex)
        {
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            activity?.RecordException(ex);
            throw;
        }
    }
}

💡 Pro Tip: Use OpenTelemetry; it's the industry standard that works with Jaeger, Zipkin, and cloud providers. Don't reinvent this wheel!

2. Structured Logging: Making Logs Searchable 📋

In a distributed system, you might have thousands of log files across hundreds of servers. Plain text logs like "Error occurred" are useless. Structured logging saves the day by making logs machine-readable.

Bad logging (unstructured):

logger.LogInformation($"User {userId} ordered item {itemId} for ${price}");
// Output: "User 12345 ordered item ABC-789 for $49.99"
// Problem: Can't easily search for all orders over $100

Good logging (structured):

logger.LogInformation(
    "Order created: {UserId} ordered {ItemId} for {Price}",
    userId, itemId, price
);
// Output (JSON): {"message":"Order created","userId":12345,"itemId":"ABC-789","price":49.99,"timestamp":"2024-01-15T10:30:00Z"}
// Now you can query: WHERE price > 100

Essential fields for distributed logs:

Field       Purpose                            Example
timestamp   When it happened (use UTC!)        2024-01-15T10:30:00Z
traceId     Connect logs across services       abc-123-def
spanId      Connect to specific operation      span-3
service     Which service logged this          order-service
level       Severity (DEBUG/INFO/WARN/ERROR)   ERROR
message     Human-readable description         Payment timeout
context     Business-relevant data             {userId, orderId, amount}
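One way to attach these fields automatically is to enrich every log event with the current trace and span IDs. A minimal sketch using Serilog (assumes the Serilog and Serilog.Sinks.Console packages; the property names mirror the table above and are illustrative):

```csharp
using System.Diagnostics;
using Serilog;
using Serilog.Context;

// Configure once at startup: every event carries the service name,
// plus whatever is pushed onto the LogContext below.
Log.Logger = new LoggerConfiguration()
    .Enrich.WithProperty("service", "order-service")
    .Enrich.FromLogContext()
    .WriteTo.Console() // swap for a JSON sink in production
    .CreateLogger();

// In request-handling code: push the current trace/span IDs so every
// log line written in this scope carries them.
var activity = Activity.Current;
using (LogContext.PushProperty("traceId", activity?.TraceId.ToString()))
using (LogContext.PushProperty("spanId", activity?.SpanId.ToString()))
{
    Log.Information("Payment timeout for {OrderId}", "ord-42");
}
```

With a JSON sink configured, the emitted event contains `service`, `traceId`, and `spanId` alongside the message, so log queries can join on the same IDs the tracer uses.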

3. Correlation IDs: The Golden Thread 🧵

A correlation ID is a unique identifier that travels with a request through all services. It's your golden thread through the distributed maze.

┌─────────────────────────────────────────────────────────────┐
│  REQUEST: correlation-id: req-789-xyz                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Service A: [req-789-xyz] Received request                  │
│      ↓                                                      │
│  Service A: [req-789-xyz] Calling Service B                 │
│      ↓                                                      │
│  Service B: [req-789-xyz] Processing data                   │
│      ↓                                                      │
│  Service B: [req-789-xyz] Calling Service C                 │
│      ↓                                                      │
│  Service C: [req-789-xyz] ERROR: Database timeout           │
│      ↓                                                      │
│  Service B: [req-789-xyz] ERROR: Downstream failure         │
│      ↓                                                      │
│  Service A: [req-789-xyz] ERROR: Request failed             │
│                                                             │
└─────────────────────────────────────────────────────────────┘
        Now you can grep all logs for "req-789-xyz"!

Implementation middleware:

public class CorrelationMiddleware
{
    private readonly RequestDelegate _next;
    private const string CorrelationIdHeader = "X-Correlation-Id";
    
    public CorrelationMiddleware(RequestDelegate next)
    {
        _next = next;
    }
    
    public async Task InvokeAsync(HttpContext context)
    {
        // Extract or generate correlation ID
        string correlationId = context.Request.Headers[CorrelationIdHeader]
            .FirstOrDefault() ?? Guid.NewGuid().ToString();
        
        // Add to response headers for debugging
        context.Response.Headers[CorrelationIdHeader] = correlationId;
        
        // Store in context for logging
        context.Items["CorrelationId"] = correlationId;
        
        // Attach to all log events in this scope (Serilog's LogContext)
        using (LogContext.PushProperty("CorrelationId", correlationId))
        {
            await _next(context);
        }
    }
}
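The middleware above attaches the ID to incoming requests and logs; to hand it to downstream services, an outgoing DelegatingHandler can stamp every HTTP call. A minimal sketch (the CorrelationContext ambient store is an illustrative helper, not a library API):

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical ambient store for the current request's correlation ID
// (set it from the middleware; AsyncLocal flows across awaits).
public static class CorrelationContext
{
    private static readonly AsyncLocal<string?> _current = new();
    public static string? Current
    {
        get => _current.Value;
        set => _current.Value = value;
    }
}

// Stamps every outgoing request so downstream services see the same
// golden thread in their own logs.
public class CorrelationPropagationHandler : DelegatingHandler
{
    private const string Header = "X-Correlation-Id";

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        if (!request.Headers.Contains(Header))
        {
            request.Headers.Add(Header,
                CorrelationContext.Current ?? Guid.NewGuid().ToString());
        }
        return base.SendAsync(request, cancellationToken);
    }
}
```

Register it on the HttpClient pipeline (e.g. via AddHttpMessageHandler in ASP.NET Core) so propagation happens on every call without the caller thinking about it.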

4. The Observability Trinity: Metrics, Logs, Traces 📊

Observability is answering "what's wrong?" without having predicted the failure mode. The three pillars:

┌─────────────────┬─────────────────┬────────────────────────┐
│              OBSERVABILITY TRINITY                         │
├─────────────────┬─────────────────┬────────────────────────┤
│   📈 METRICS    │   📋 LOGS       │   🔍 TRACES            │
├─────────────────┼─────────────────┼────────────────────────┤
│ What:           │ What:           │ What:                  │
│ Aggregated data │ Event records   │ Request journeys       │
│                 │                 │                        │
│ When:           │ When:           │ When:                  │
│ "CPU is 90%"    │ "Error at 10:30"│ "Request took 3s"      │
│                 │                 │                        │
│ Tools:          │ Tools:          │ Tools:                 │
│ Prometheus      │ ELK/Splunk      │ Jaeger/Zipkin          │
│ Grafana         │ Loki            │ OpenTelemetry          │
└─────────────────┴─────────────────┴────────────────────────┘

When to use each:

  • Metrics: "Is there a problem?" (dashboards, alerts)
  • Logs: "What's the context?" (debugging specific issues)
  • Traces: "Where's the bottleneck?" (performance investigation)

Example workflow:

  1. 🚨 Alert fires: "API latency > 2s" (from metrics)
  2. πŸ” Check dashboard: Which endpoint? (metrics drill-down)
  3. πŸ“‹ Search logs: Find errors with that endpoint (logs)
  4. πŸ” View trace: See the slow database query (traces)

5. Debugging Race Conditions Across Services ⚡

Race conditions in distributed systems are nightmares because:

  • Events happen on different machines
  • Network delays are unpredictable
  • You can't use a single lock

Classic distributed race condition:

// Service A: Inventory Service
public async Task<bool> ReserveItem(string itemId)
{
    var stock = await _db.GetStock(itemId);  // Read: stock = 1
    
    // Meanwhile, Service B on another machine does the same...
    
    if (stock > 0)
    {
        await _db.UpdateStock(itemId, stock - 1);  // Both decrement!
        return true;  // Now stock = -1 (oversold!)
    }
    return false;
}

Solution patterns:

1. Optimistic Locking (version numbers):

public async Task<bool> ReserveItem(string itemId)
{
    while (true)
    {
        var item = await _db.GetItemWithVersion(itemId);
        
        if (item.Stock <= 0) return false;
        
        var updated = await _db.UpdateIfVersionMatches(
            itemId,
            newStock: item.Stock - 1,
            expectedVersion: item.Version
        );
        
        if (updated) return true;
        
        // Version mismatch: someone else modified it, so retry
        await Task.Delay(Random.Shared.Next(10, 50)); // random jitter (add exponential backoff for heavy contention)
    }
}

2. Distributed Locks (use with caution!):

public async Task<bool> ReserveItem(string itemId)
{
    var lockKey = $"lock:item:{itemId}";
    
    // Try to acquire distributed lock (Redis, Consul, etc.)
    await using var distributedLock = await _lockProvider.AcquireAsync(
        lockKey,
        timeout: TimeSpan.FromSeconds(5)
    );
    
    if (distributedLock == null)
    {
        throw new TimeoutException("Could not acquire lock");
    }
    
    var stock = await _db.GetStock(itemId);
    if (stock > 0)
    {
        await _db.UpdateStock(itemId, stock - 1);
        return true;
    }
    return false;
}

⚠️ Warning: Distributed locks are tricky! Watch for:

  • Lock holder crashes β†’ use timeouts and lease renewal
  • Clock skew β†’ locks expire at different times on different machines
  • Network partitions β†’ two nodes might both think they have the lock

💡 Pro Tip: Prefer idempotency over locks when possible. Design operations so doing them twice gives the same result.

6. Cascading Failures: The Domino Effect 🎲

A cascading failure occurs when one component's failure causes others to fail, like dominoes falling.

Classic scenario:

┌─────────────────────────────────────────────────────────┐
│  CASCADING FAILURE TIMELINE                             │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  t=0s:  💾 Database slows down (90% CPU)                │
│          ↓                                              │
│  t=5s:  🔌 API requests queue up (waiting for DB)       │
│          ↓                                              │
│  t=10s: 🔥 API servers run out of threads               │
│          ↓                                              │
│  t=15s: ⚡ Load balancer health checks fail             │
│          ↓                                              │
│  t=20s: 🌊 Traffic floods to remaining servers          │
│          ↓                                              │
│  t=25s: 💥 All servers crash (out of memory)            │
│          ↓                                              │
│  t=30s: 🚨 TOTAL OUTAGE                                 │
│                                                         │
└─────────────────────────────────────────────────────────┘

Defense mechanisms:

1. Circuit Breakers:

using Polly;
using Polly.CircuitBreaker;

public class PaymentServiceClient
{
    private readonly AsyncCircuitBreakerPolicy _circuitBreaker;
    
    public PaymentServiceClient()
    {
        _circuitBreaker = Policy
            .Handle<HttpRequestException>()
            .CircuitBreakerAsync(
                handledEventsAllowedBeforeBreaking: 5,
                durationOfBreak: TimeSpan.FromSeconds(30)
            );
    }
    
    public async Task<PaymentResult> ProcessPayment(Payment payment)
    {
        try
        {
            return await _circuitBreaker.ExecuteAsync(async () =>
            {
                return await CallPaymentService(payment);
            });
        }
        catch (BrokenCircuitException)
        {
            // Circuit is open - fail fast instead of waiting
            return new PaymentResult 
            { 
                Success = false, 
                Error = "Payment service unavailable" 
            };
        }
    }
}

Circuit breaker states:

     ┌─────────┐
     │ CLOSED  │ ← Normal operation, requests flow
     └────┬────┘
          │ 5 failures
          ↓
     ┌─────────┐
     │  OPEN   │ ← Fails fast, no requests sent
     └────┬────┘
          │ 30 seconds pass
          ↓
     ┌─────────┐
     │HALF-OPEN│ ← Test with 1 request
     └────┬────┘
       Success?│
      ┌────┴────┐
      ↓         ↓
   CLOSED     OPEN

2. Bulkheads (Isolation):

// Separate thread pools for different operations
public class ServiceClient
{
    private readonly AsyncBulkheadPolicy _criticalBulkhead;
    private readonly AsyncBulkheadPolicy _nonCriticalBulkhead;
    
    public ServiceClient()
    {
        // Critical operations get 80% of resources
        _criticalBulkhead = Policy.BulkheadAsync(
            maxParallelization: 80,
            maxQueuingActions: 20
        );
        
        // Non-critical get 20%
        _nonCriticalBulkhead = Policy.BulkheadAsync(
            maxParallelization: 20,
            maxQueuingActions: 5
        );
    }
    
    public async Task<Order> CreateOrder(OrderRequest request)
    {
        // Use critical bulkhead - won't be starved by analytics
        return await _criticalBulkhead.ExecuteAsync(
            async () => await ProcessOrder(request)
        );
    }
    
    public async Task<Report> GenerateReport()
    {
        // Use non-critical bulkhead - can't overwhelm system
        return await _nonCriticalBulkhead.ExecuteAsync(
            async () => await CreateReport()
        );
    }
}

3. Timeouts (Aggressive):

public async Task<User> GetUser(string userId)
{
    using var cts = new CancellationTokenSource();
    cts.CancelAfter(TimeSpan.FromSeconds(2)); // Aggressive timeout
    
    try
    {
        // GetFromJsonAsync (System.Net.Http.Json) deserializes the body
        return await _httpClient.GetFromJsonAsync<User>(
            $"/users/{userId}", 
            cts.Token
        );
    }
    catch (OperationCanceledException)
    {
        // Log and return cached/default value
        _logger.LogWarning("GetUser timeout for {UserId}", userId);
        return await _cache.GetUserOrDefault(userId);
    }
}

💡 Mnemonic for resilience patterns: "CRT" - Circuit breakers, Retries, Timeouts
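Retries are the one leg of CRT without a dedicated example above. A sketch using Polly's WaitAndRetryAsync (retry count and base delay are illustrative), with exponential backoff plus random jitter so many clients don't retry in lockstep:

```csharp
using System;
using System.Net.Http;
using Polly;
using Polly.Retry;

public static class RetryPolicies
{
    // Waits ~200ms, ~400ms, ~800ms between attempts, each plus up to
    // 100ms of random jitter.
    public static AsyncRetryPolicy<HttpResponseMessage> TransientHttp() =>
        Policy<HttpResponseMessage>
            .Handle<HttpRequestException>()
            .OrResult(r => (int)r.StatusCode >= 500)   // retry 5xx responses
            .WaitAndRetryAsync(
                retryCount: 3,
                sleepDurationProvider: attempt =>
                    TimeSpan.FromMilliseconds(100 * Math.Pow(2, attempt))
                    + TimeSpan.FromMilliseconds(Random.Shared.Next(0, 100)));
}
```

Usage: `var response = await RetryPolicies.TransientHttp().ExecuteAsync(() => httpClient.GetAsync("/api/data"));`. Only combine this with idempotent operations; retrying a non-idempotent call is how duplicate orders happen (see Example 1).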

7. Time and Ordering: Clocks Are Lies ⏰

In distributed systems, you can't trust clocks. Different machines have different times, even with NTP synchronization.

The problem:

// Server A (clock: 10:00:00)
var eventA = new Event 
{ 
    Timestamp = DateTime.UtcNow, // 10:00:00
    Data = "User logged in" 
};

// Server B (clock: 09:59:58 - 2 seconds behind!)
var eventB = new Event 
{ 
    Timestamp = DateTime.UtcNow, // 09:59:58
    Data = "User clicked checkout" 
};

// When merged, events are out of order!
// Says user checked out BEFORE logging in!

Solutions:

1. Vector Clocks (track causality):

public class VectorClock
{
    private Dictionary<string, int> _clocks = new();
    
    public void Increment(string nodeId)
    {
        if (!_clocks.ContainsKey(nodeId))
            _clocks[nodeId] = 0;
        _clocks[nodeId]++;
    }
    
    public void Merge(VectorClock other)
    {
        foreach (var (nodeId, timestamp) in other._clocks)
        {
            _clocks[nodeId] = Math.Max(
                _clocks.GetValueOrDefault(nodeId, 0),
                timestamp
            );
        }
    }
    
    // A happened before B if every entry in A is ≤ the matching
    // entry in B, and at least one entry is strictly less
    public bool HappenedBefore(VectorClock other)
    {
        bool allLessOrEqual = _clocks.All(kvp =>
            kvp.Value <= other._clocks.GetValueOrDefault(kvp.Key, 0));
        bool strictlyLess = other._clocks.Any(kvp =>
            _clocks.GetValueOrDefault(kvp.Key, 0) < kvp.Value);
        return allLessOrEqual && strictlyLess;
    }
}

2. Sequence Numbers (ordered events):

public class EventLog
{
    private long _sequenceNumber = 0;
    
    public Event RecordEvent(string data)
    {
        var seqNum = Interlocked.Increment(ref _sequenceNumber);
        
        return new Event
        {
            SequenceNumber = seqNum,
            NodeId = Environment.MachineName,
            Timestamp = DateTimeOffset.UtcNow,
            Data = data
        };
    }
    
    // Order events by sequence number; this is only meaningful per
    // node - cross-node ties are broken deterministically, not
    // causally (use vector clocks for causal ordering)
    public List<Event> MergeAndSort(List<Event> events)
    {
        return events
            .OrderBy(e => e.SequenceNumber)
            .ThenBy(e => e.NodeId) // Break ties deterministically
            .ToList();
    }
}
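A middle ground between per-node sequence numbers and full vector clocks is the Lamport logical clock: each node ticks on every local event, and on receiving a message it jumps its clock past the sender's timestamp. A minimal sketch:

```csharp
using System;
using System.Threading;

// Lamport logical clock: guarantees that if event A causally precedes
// event B, then timestamp(A) < timestamp(B). (The converse does not
// hold - equal or nearby timestamps say nothing about causality.)
public class LamportClock
{
    private long _time;

    // Local event or message send: tick the clock
    public long Tick() => Interlocked.Increment(ref _time);

    // Message receive: jump past the sender's timestamp, then tick
    public long Receive(long senderTime)
    {
        long current, next;
        do
        {
            current = Interlocked.Read(ref _time);
            next = Math.Max(current, senderTime) + 1;
        } while (Interlocked.CompareExchange(ref _time, next, current) != current);
        return next;
    }
}
```

Stamp every cross-service message with the sender's `Tick()` value and call `Receive` on arrival; sorting events by (timestamp, nodeId) then gives a total order consistent with causality.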

🤔 Did you know? Google's Spanner database uses atomic clocks and GPS receivers in every datacenter to keep time accurate within 7ms across the globe!

Examples: Debugging Real Distributed Problems 🔍

Example 1: The Mysterious Duplicate Orders 🛒

Problem: Users occasionally get charged twice for the same order.

Investigation steps:

Step 1 - Check metrics:

-- Query your metrics database
SELECT 
    COUNT(*) as duplicate_orders
FROM orders
GROUP BY user_id, cart_items
HAVING COUNT(*) > 1;
-- Result: 127 duplicates in the last week

Step 2 - Search logs with correlation ID:

// Found in logs:
{
  "correlationId": "req-abc-123",
  "service": "order-service",
  "message": "Creating order",
  "timestamp": "2024-01-15T10:30:00.000Z"
}
{
  "correlationId": "req-abc-123",
  "service": "order-service",
  "message": "Creating order",
  "timestamp": "2024-01-15T10:30:00.015Z"
}
// Same correlation ID, TWO creates!

Step 3 - View distributed trace:

Trace ID: req-abc-123
├─ API Gateway (45ms)
│  ├─ Order Service Attempt 1 (2500ms) ← TIMEOUT!
│  └─ Order Service Attempt 2 (150ms)  ← RETRY succeeded
└─ Both attempts hit the database!

Root cause: The client retried after a timeout, but the first request actually succeeded; it was just slow. The operation wasn't idempotent.

Fix:

public class OrderService
{
    public async Task<Order> CreateOrder(OrderRequest request)
    {
        var idempotencyKey = request.IdempotencyKey; // Client provides!
        
        // Check if we already processed this request
        var existing = await _db.GetOrderByIdempotencyKey(idempotencyKey);
        if (existing != null)
        {
            _logger.LogInformation(
                "Duplicate request detected: {IdempotencyKey}",
                idempotencyKey
            );
            return existing; // Return existing order
        }
        
        // Process order and store with idempotency key
        var order = await ProcessNewOrder(request);
        order.IdempotencyKey = idempotencyKey;
        await _db.SaveOrder(order);
        
        return order;
    }
}

Key lesson: In distributed systems, assume every request will be retried. Make operations idempotent!

Example 2: The Vanishing Logs 👻

Problem: Production error happened, but logs are missing.

Investigation:

Theory 1 - Logs not shipped:

## Check log shipping agent status
systemctl status filebeat
## Result: active, no errors

Theory 2 - Logs filtered out:

## Check logging configuration
Serilog:
  MinimumLevel: "Warning"  # ← AHA!
## Debug and Info logs are dropped!

Theory 3 - Wrong time window:

// Application uses local time
logger.LogError("Error at {Time}", DateTime.Now); // Local time!

// But searching in UTC
Elasticsearch query: timestamp:[2024-01-15T10:00:00 TO 2024-01-15T11:00:00]
// If server is in PST (UTC-8), logs are 8 hours off!

Root cause: The application logged in local time, but the log aggregator indexed in UTC. Searching for "10 AM UTC" missed logs actually at "10 AM PST" (18:00 UTC).

Fix:

// ALWAYS use UTC for logging
logger.LogError(
    "Error occurred at {Timestamp}",
    DateTimeOffset.UtcNow // UTC timestamp with offset
);

// Better: Use structured logging with automatic UTC
public class LoggingConfiguration
{
    public static void Configure()
    {
        Log.Logger = new LoggerConfiguration()
            .Enrich.WithProperty("Environment", "Production")
            .WriteTo.Console(
                new JsonFormatter(
                    renderMessage: true,
                    formatProvider: CultureInfo.InvariantCulture
                )
            )
            .CreateLogger();
    }
}

Key lesson: Time zones are evil. Always log in UTC. Always search in UTC. Always display to users in their local time.

Example 3: The Intermittent 500 Errors 🎲

Problem: API returns HTTP 500 errors randomly, ~2% of requests.

Investigation:

Step 1 - Check error rate per instance:

-- Query metrics grouped by server
SELECT 
    server_id,
    COUNT(*) as errors
FROM api_errors
WHERE timestamp > NOW() - INTERVAL '1 hour'
GROUP BY server_id;

-- Result:
-- server-1: 0 errors
-- server-2: 0 errors  
-- server-3: 1,247 errors ← CULPRIT!

Step 2 - View logs from server-3:

{
  "level": "ERROR",
  "service": "api",
  "server": "server-3",
  "message": "Socket exception: Connection reset",
  "stackTrace": "..."
}

Step 3 - Check distributed trace:

Trace ID: req-def-456
├─ Load Balancer → server-3 (12ms)
├─ server-3 → Database (0ms) ← FAILED IMMEDIATELY
└─ Error: Connection reset

Step 4 - Check infrastructure:

## On server-3, check database connections
netstat -an | grep 5432 | grep ESTABLISHED | wc -l
## Result: 0 connections!

## Check database connection pool
ps aux | grep postgres
## Connection pool shows: MAX_CONNECTIONS=100, IN_USE=100
## Pool exhausted!

Root cause: Database connection pool on server-3 was exhausted. Connections weren't being properly returned to the pool.

Fix:

public class DatabaseService
{
    // BAD: Connection not disposed
    public async Task<User> GetUserBad(int userId)
    {
        var connection = await _connectionPool.GetConnection();
        var user = await connection.QueryAsync<User>(
            "SELECT * FROM users WHERE id = @Id",
            new { Id = userId }
        );
        return user.FirstOrDefault();
        // Connection never returned to pool!
    }
    
    // GOOD: Using statement ensures disposal
    public async Task<User> GetUserGood(int userId)
    {
        await using var connection = await _connectionPool.GetConnection();
        var user = await connection.QueryAsync<User>(
            "SELECT * FROM users WHERE id = @Id",
            new { Id = userId }
        );
        return user.FirstOrDefault();
    } // Connection automatically returned here
}

Key lesson: Resource leaks in distributed systems are insidious. One small leak can take down an entire service. Always dispose resources properly!

Example 4: The Cascading Latency Spike 📈

Problem: API latency suddenly jumps from 100ms to 5 seconds.

Investigation with distributed tracing:

┌─ Normal Request (100ms total) ───────────────────┐
│                                                  │
│ API Gateway    ████ 5ms                          │
│ Auth Service   ██ 2ms                            │
│ Order Service  ████████████ 50ms                 │
│ Database       ████████ 40ms                     │
│ Cache          ██ 3ms                            │
│                                                  │
└──────────────────────────────────────────────────┘

┌─ Slow Request (5000ms total) ────────────────────┐
│                                                  │
│ API Gateway    ████ 5ms                          │
│ Auth Service   ████████████████████████ 4500ms   │ ← BOTTLENECK!
│ Order Service  (waiting)                         │
│ Database       (not reached)                     │
│                                                  │
└──────────────────────────────────────────────────┘

Drilling into Auth Service trace:

Auth Service Span (4500ms):
├─ Validate Token (5ms)
├─ Check Permissions (4490ms) ← PROBLEM!
│  ├─ Call User Service (4485ms)
│  │  ├─ HTTP Request (4480ms)
│  │  │  └─ Read Timeout (4500ms timeout)

Root cause: User Service was down, but Auth Service was waiting the full 4.5s timeout before failing.

Fix - Add circuit breaker and aggressive timeout:

public class AuthService
{
    private readonly AsyncCircuitBreakerPolicy _circuitBreaker;
    private readonly AsyncTimeoutPolicy _timeout;
    
    public AuthService()
    {
        _timeout = Policy.TimeoutAsync(
            TimeSpan.FromMilliseconds(500), // Aggressive!
            TimeoutStrategy.Pessimistic
        );
        
        _circuitBreaker = Policy
            .Handle<HttpRequestException>()
            .Or<TimeoutRejectedException>()
            .CircuitBreakerAsync(
                handledEventsAllowedBeforeBreaking: 3,
                durationOfBreak: TimeSpan.FromSeconds(30)
            );
    }
    
    public async Task<bool> CheckPermissions(string userId)
    {
        try
        {
            return await _circuitBreaker.ExecuteAsync(async () =>
            {
                return await _timeout.ExecuteAsync(async () =>
                {
                    var user = await _userServiceClient.GetUser(userId);
                    return user.HasPermission();
                });
            });
        }
        catch (BrokenCircuitException)
        {
            // Fall back to cached permissions
            return await _cache.GetPermissions(userId);
        }
        catch (TimeoutRejectedException)
        {
            _logger.LogWarning("User service timeout for {UserId}", userId);
            return await _cache.GetPermissions(userId);
        }
    }
}

Key lesson: Slow is worse than down. Use aggressive timeouts and circuit breakers to fail fast and prevent cascading delays.

Common Mistakes to Avoid ⚠️

1. Not Propagating Context 🧵

❌ Bad:

public async Task ProcessOrder(Order order)
{
    // Start new HTTP request without propagating trace context
    var response = await _httpClient.GetAsync("/api/payment");
    // This request is DISCONNECTED from the parent trace!
}

✅ Good:

public async Task ProcessOrder(Order order)
{
    // Tag the current activity (don't dispose it - the framework owns it)
    Activity.Current?.SetTag("order.id", order.Id);
    
    var response = await _httpClient.GetAsync("/api/payment");
    // HttpClient propagates the trace context in the traceparent header
}

2. Logging Sensitive Data 🔐

❌ Bad:

logger.LogInformation(
    "Processing payment: {CreditCard}",
    payment.CreditCardNumber  // DON'T LOG THIS!
);

✅ Good:

logger.LogInformation(
    "Processing payment: {PaymentId} for {Amount}",
    payment.Id,
    payment.Amount  // Log identifiers, not secrets
);

3. Assuming Synchronous Clocks ⏰

❌ Bad:

if (event1.Timestamp > event2.Timestamp)
{
    // Assume event1 happened after event2
    // WRONG if events from different servers!
}

✅ Good:

if (event1.SequenceNumber > event2.SequenceNumber && 
    event1.NodeId == event2.NodeId)
{
    // Compare sequence numbers from same node
}
else
{
    // Use vector clocks for cross-node ordering
    if (event1.VectorClock.HappenedBefore(event2.VectorClock))
    {
        // event1 causally before event2
    }
}

4. No Timeout on External Calls 💣

❌ Bad:

// This can hang forever if service is unresponsive
var result = await _httpClient.GetAsync("/api/data");

✅ Good:

using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(2));
try
{
    var result = await _httpClient.GetAsync("/api/data", cts.Token);
}
catch (OperationCanceledException)
{
    // Handle timeout
}

5. Ignoring Partial Failures 🎯

❌ Bad:

public async Task<OrderResult> CreateOrder(Order order)
{
    await _orderDb.SaveOrder(order);      // Succeeds
    await _inventoryService.Reserve();    // Fails
    await _paymentService.Charge();       // Never called
    
    // Now order is saved but inventory not reserved!
    // Inconsistent state!
}

✅ Good:

public async Task<OrderResult> CreateOrder(Order order)
{
    // Use Saga pattern or distributed transaction
    var saga = new OrderSaga(order);
    
    try
    {
        await saga.ExecuteStep(() => _orderDb.SaveOrder(order));
        await saga.ExecuteStep(() => _inventoryService.Reserve());
        await saga.ExecuteStep(() => _paymentService.Charge());
        
        await saga.Complete();
    }
    catch (Exception ex)
    {
        // Compensate: undo all completed steps
        await saga.Compensate();
        throw;
    }
}

6. Over-Relying on Distributed Locks 🔒

❌ Bad:

// Locks for everything = performance killer
await using var lock1 = await _lockService.Acquire("users");
await using var lock2 = await _lockService.Acquire("orders");
// Holding two locks = high chance of deadlock!

✅ Good:

// Use optimistic concurrency when possible
var user = await _db.GetUser(userId);
user.Version++;
var updated = await _db.UpdateIfVersionMatches(user);
if (!updated)
{
    // Retry or fail
}

7. Not Testing Under Failure ⚡

❌ Bad:

// Only test happy path
[Test]
public async Task CreateOrder_Success()
{
    var result = await _orderService.CreateOrder(order);
    Assert.True(result.Success);
}

✅ Good:

[Test]
public async Task CreateOrder_WhenPaymentServiceDown_ReturnsError()
{
    _paymentServiceMock
        .Setup(x => x.Charge(It.IsAny<Payment>()))
        .ThrowsAsync(new HttpRequestException());
    
    var result = await _orderService.CreateOrder(order);
    
    Assert.False(result.Success);
    Assert.AreEqual("Payment service unavailable", result.Error);
}

[Test]
public async Task CreateOrder_WhenPaymentServiceSlow_Timeouts()
{
    _paymentServiceMock
        .Setup(x => x.Charge(It.IsAny<Payment>()))
        .Returns(async () => 
        {
            await Task.Delay(TimeSpan.FromSeconds(10));
            return new PaymentResult { Success = true };
        });
    
    await Assert.ThrowsAsync<TimeoutException>(
        () => _orderService.CreateOrder(order)
    );
}

Key Takeaways 🎓

📋 Distributed Debugging Quick Reference

Concept            Key Point                                   Tool/Pattern
Tracing            Track requests across services              OpenTelemetry, Jaeger
Logging            Use structured JSON logs                    Serilog, ELK Stack
Correlation IDs    Unique ID per request                       Middleware propagation
Metrics            Aggregate data for alerting                 Prometheus, Grafana
Circuit Breakers   Fail fast when downstream fails             Polly, Resilience4j
Timeouts           Never wait forever                          CancellationToken, Polly
Idempotency        Safe to retry operations                    Idempotency keys
Time               Always use UTC                              DateTimeOffset.UtcNow
Ordering           Use sequence numbers or vector clocks       Lamport timestamps
Resilience         CRT: Circuit breakers, Retries, Timeouts    Polly policies

The Golden Rules 🏆

  1. Observability First: You can't debug what you can't observe. Invest in tracing, logging, and metrics BEFORE you have problems.

  2. Assume Failure: Every network call will fail eventually. Every service will be down eventually. Design for it.

  3. Fail Fast: Timeouts should be aggressive (seconds, not minutes). Circuit breakers should trip quickly.

  4. Make It Idempotent: Every operation should be safe to retry. Use idempotency keys.

  5. UTC Always: Never use local time in distributed systems. Ever.

  6. Correlation Everywhere: Every log, trace, and metric should have a correlation ID.

  7. Test Failures: Unit tests for failures are more important than tests for success.

  8. Structured Everything: Logs, metrics, and traces should all be structured and searchable.

🧠 Memory Device: The "DISTRIBUTED" Mnemonic

  • Distributed tracing tracks requests
  • Idempotency prevents duplicates
  • Structured logging enables search
  • Timeouts prevent hangs
  • Retries need exponential backoff
  • Isolation with bulkheads
  • Breakers stop cascading failures
  • UTC time, always
  • Tags in logs for context
  • Error rates in metrics
  • Deadlines for all operations

📚 Further Study

  1. Distributed Systems Observability - Cindy Sridharan's comprehensive guide: https://www.oreilly.com/library/view/distributed-systems-observability/9781492033431/

  2. OpenTelemetry Documentation - Industry standard for tracing and observability: https://opentelemetry.io/docs/

  3. Google SRE Book - Monitoring Distributed Systems - Battle-tested practices from Google: https://sre.google/sre-book/monitoring-distributed-systems/


💡 Practice tip: Set up a local microservices environment with Docker Compose, Jaeger, and Grafana. Deliberately break services and practice using distributed tracing to debug them. The hands-on experience is invaluable!

🎯 Next steps: Study distributed transactions and the Saga pattern for handling complex workflows across services. Then explore chaos engineering tools like Chaos Monkey to test your system's resilience.