Debugging in Distributed Reality
Understanding failure modes in distributed systems
Master distributed systems debugging with free flashcards and spaced repetition practice. This lesson covers distributed tracing, log aggregation strategies, debugging race conditions across services, and handling cascading failures: essential skills for building resilient microservices and cloud-native applications.
Welcome to the Chaos: Debugging Distributed Systems
Debugging a single-threaded application on your local machine? That's like finding your keys in your apartment. Debugging a distributed system? That's like finding your keys somewhere in a city where different neighborhoods exist in different time zones, some streets randomly disappear, and your keys might be in multiple places simultaneously. Welcome to distributed reality!
In distributed systems, the traditional "step through with a debugger" approach crumbles. You're dealing with:
- Multiple processes across different machines
- Network latency and failures
- Asynchronous operations happening concurrently
- Partial failures where some components work while others don't
- Clock skew between machines
- Eventual consistency challenges
The debugging techniques you learned for monolithic applications still apply, but now you need an entirely new toolkit. Let's build it together.
Core Concepts: The Distributed Debugging Arsenal
1. Distributed Tracing: Following the Breadcrumb Trail
When a user request touches 15 different microservices, how do you track what happened? Distributed tracing creates a "trace" that follows a request through your entire system.
How it works:
User Request → API Gateway → Auth Service → Order Service → Payment Service → Notification Service

TRACE ID: abc-123-def (propagated through all services)
Each service adds spans (time segments showing what it did) to the trace:
| Service | Span ID | Parent Span | Duration | Status |
|---|---|---|---|---|
| API Gateway | span-1 | null | 245ms | ✅ OK |
| Auth Service | span-2 | span-1 | 12ms | ✅ OK |
| Order Service | span-3 | span-1 | 187ms | ✅ OK |
| Payment Service | span-4 | span-3 | 3200ms | ⚠️ SLOW |
| Notification | span-5 | span-3 | 45ms | ❌ FAILED |
Key components:
- Trace ID: Unique identifier for the entire request journey
- Span ID: Unique identifier for each operation within the trace
- Parent Span ID: Links spans into a hierarchy
- Baggage: Key-value pairs propagated across service boundaries
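For example, baggage can be set and read through .NET's built-in Activity API. A minimal self-contained sketch (the source name and baggage key are made up; the listener exists only so the sample activity is actually created):

```csharp
using System;
using System.Diagnostics;

// Sketch: baggage set on an activity is visible downstream (and travels
// across services via the W3C "baggage" header).
string? tier = null;
var source = new ActivitySource("Demo");
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "Demo",
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        ActivitySamplingResult.AllData
};
ActivitySource.AddActivityListener(listener);

using (var activity = source.StartActivity("CheckoutFlow"))
{
    activity?.AddBaggage("user.tier", "premium");
    // Downstream code reads it from whatever activity is current:
    tier = Activity.Current?.GetBaggageItem("user.tier");
}
Console.WriteLine(tier); // premium
```

Unlike tags, baggage is forwarded to every downstream service, so keep it small: every entry rides along on every outgoing request.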
Implementation pattern in C#:
using System.Diagnostics;
public class OrderService
{
private static ActivitySource activitySource =
new ActivitySource("OrderService");
public async Task<Order> CreateOrder(OrderRequest request)
{
using var activity = activitySource.StartActivity("CreateOrder");
// Add custom tags for debugging
activity?.SetTag("order.id", request.OrderId);
activity?.SetTag("user.id", request.UserId);
try
{
var order = await ProcessOrder(request);
activity?.SetStatus(ActivityStatusCode.Ok);
return order;
}
catch (Exception ex)
{
activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
activity?.RecordException(ex);
throw;
}
}
}
Pro Tip: Use OpenTelemetry; it's the industry standard that works with Jaeger, Zipkin, and cloud providers. Don't reinvent this wheel!
2. Structured Logging: Making Logs Searchable
In a distributed system, you might have thousands of log files across hundreds of servers. Plain text logs like "Error occurred" are useless. Structured logging saves the day by making logs machine-readable.
Bad logging (unstructured):
logger.LogInformation($"User {userId} ordered item {itemId} for ${price}");
// Output: "User 12345 ordered item ABC-789 for $49.99"
// Problem: Can't easily search for all orders over $100
Good logging (structured):
logger.LogInformation(
"Order created: {UserId} ordered {ItemId} for {Price}",
userId, itemId, price
);
// Output (JSON): {"message":"Order created","userId":12345,"itemId":"ABC-789","price":49.99,"timestamp":"2024-01-15T10:30:00Z"}
// Now you can query: WHERE price > 100
Essential fields for distributed logs:
| Field | Purpose | Example |
|---|---|---|
| timestamp | When it happened (use UTC!) | 2024-01-15T10:30:00Z |
| traceId | Connect logs across services | abc-123-def |
| spanId | Connect to specific operation | span-3 |
| service | Which service logged this | order-service |
| level | Severity (DEBUG/INFO/WARN/ERROR) | ERROR |
| message | Human-readable description | Payment timeout |
| context | Business-relevant data | {userId, orderId, amount} |
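As a concrete illustration, here is one such event emitted without any logging library, just serializing the fields from the table above (all values are made up):

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Sketch: one log event carrying the essential distributed-log fields,
// serialized as a single queryable JSON line.
var logEvent = new Dictionary<string, object?>
{
    ["timestamp"] = DateTimeOffset.UtcNow.ToString("O"), // always UTC
    ["traceId"]   = "abc-123-def",
    ["spanId"]    = "span-3",
    ["service"]   = "order-service",
    ["level"]     = "ERROR",
    ["message"]   = "Payment timeout",
    ["context"]   = new { userId = 12345, orderId = "ord-42", amount = 49.99 }
};
var json = JsonSerializer.Serialize(logEvent);
Console.WriteLine(json);
```

In practice a library like Serilog produces this shape for you; the point is that every event becomes a row you can filter and aggregate.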
3. Correlation IDs: The Golden Thread
A correlation ID is a unique identifier that travels with a request through all services. It's your golden thread through the distributed maze.
REQUEST: correlation-id: req-789-xyz

Service A: [req-789-xyz] Received request
Service A: [req-789-xyz] Calling Service B
Service B: [req-789-xyz] Processing data
Service B: [req-789-xyz] Calling Service C
Service C: [req-789-xyz] ERROR: Database timeout
Service B: [req-789-xyz] ERROR: Downstream failure
Service A: [req-789-xyz] ERROR: Request failed
Now you can grep all logs for "req-789-xyz"!
Implementation middleware:
public class CorrelationMiddleware
{
private readonly RequestDelegate _next;
private const string CorrelationIdHeader = "X-Correlation-Id";
public async Task InvokeAsync(HttpContext context)
{
// Extract or generate correlation ID
string correlationId = context.Request.Headers[CorrelationIdHeader]
.FirstOrDefault() ?? Guid.NewGuid().ToString();
// Add to response headers for debugging
context.Response.Headers[CorrelationIdHeader] = correlationId;
// Store in context for logging
context.Items["CorrelationId"] = correlationId;
// Propagate to downstream services
using (LogContext.PushProperty("CorrelationId", correlationId))
{
await _next(context);
}
}
}
4. The Observability Trinity: Metrics, Logs, Traces
Observability means being able to answer "what's wrong?" without having predicted that failure mode in advance. The three pillars:

| | METRICS | LOGS | TRACES |
|---|---|---|---|
| What | Aggregated data | Event records | Request journeys |
| When | "CPU is 90%" | "Error at 10:30" | "Request took 3s" |
| Tools | Prometheus, Grafana | ELK/Splunk, Loki | Jaeger/Zipkin, OpenTelemetry |
When to use each:
- Metrics: "Is there a problem?" (dashboards, alerts)
- Logs: "What's the context?" (debugging specific issues)
- Traces: "Where's the bottleneck?" (performance investigation)
Example workflow:
- Alert fires: "API latency > 2s" (from metrics)
- Check dashboard: Which endpoint? (metrics drill-down)
- Search logs: Find errors with that endpoint (logs)
- View trace: See the slow database query (traces)
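Step 1 of that workflow starts with instrumented code. A minimal sketch of the metrics pillar using .NET's built-in `Meter` API (meter and counter names are illustrative; the `MeterListener` stands in for a real scraper like Prometheus):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Sketch: emit a counter measurement with dimensions ("tags") that a
// dashboard can later slice by endpoint or status code.
long observed = 0;
var meter = new Meter("OrderService");
var requestCounter = meter.CreateCounter<long>("orders.requests");

var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "OrderService")
        l.EnableMeasurementEvents(instrument);
};
listener.SetMeasurementEventCallback<long>(
    (instrument, value, tags, state) => observed += value);
listener.Start();

// Record one failed request
requestCounter.Add(1,
    new KeyValuePair<string, object?>("endpoint", "/orders"),
    new KeyValuePair<string, object?>("status", "500"));
Console.WriteLine(observed); // 1
```

The OpenTelemetry SDK can export measurements from exactly this API, so instrumenting with `Meter` keeps you vendor-neutral.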
5. Debugging Race Conditions Across Services
Race conditions in distributed systems are nightmares because:
- Events happen on different machines
- Network delays are unpredictable
- You can't use a single lock
Classic distributed race condition:
// Service A: Inventory Service
public async Task<bool> ReserveItem(string itemId)
{
var stock = await _db.GetStock(itemId); // Read: stock = 1
// Meanwhile, Service B on another machine does the same...
if (stock > 0)
{
await _db.UpdateStock(itemId, stock - 1); // Both decrement!
return true; // Now stock = -1 (oversold!)
}
return false;
}
Solution patterns:
1. Optimistic Locking (version numbers):
public async Task<bool> ReserveItem(string itemId)
{
while (true)
{
var item = await _db.GetItemWithVersion(itemId);
if (item.Stock <= 0) return false;
var updated = await _db.UpdateIfVersionMatches(
itemId,
newStock: item.Stock - 1,
expectedVersion: item.Version
);
if (updated) return true;
// Version mismatch = someone else modified it, retry
await Task.Delay(Random.Shared.Next(10, 50)); // Brief jittered pause before retrying
}
}
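The retry loop above sleeps for a small random interval; a common refinement is exponential backoff with full jitter, sketched here with illustrative base and cap values:

```csharp
using System;

// Sketch: exponential backoff with "full jitter" - each retry waits a
// random amount up to an exponentially growing ceiling.
static TimeSpan BackoffDelay(int attempt, int baseMs = 100, int capMs = 10_000)
{
    var ceiling = Math.Min(capMs, baseMs * (1 << attempt)); // 100, 200, 400, ...
    return TimeSpan.FromMilliseconds(Random.Shared.Next(0, ceiling));
}

for (int attempt = 0; attempt < 4; attempt++)
    Console.WriteLine($"attempt {attempt}: waited {BackoffDelay(attempt).TotalMilliseconds}ms");
```

The jitter matters: without it, all contending clients retry at the same instants and keep colliding (the "thundering herd" problem).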
2. Distributed Locks (use with caution!):
public async Task<bool> ReserveItem(string itemId)
{
var lockKey = $"lock:item:{itemId}";
// Try to acquire distributed lock (Redis, Consul, etc.)
await using var distributedLock = await _lockProvider.AcquireAsync(
lockKey,
timeout: TimeSpan.FromSeconds(5)
);
if (distributedLock == null)
{
throw new TimeoutException("Could not acquire lock");
}
var stock = await _db.GetStock(itemId);
if (stock > 0)
{
await _db.UpdateStock(itemId, stock - 1);
return true;
}
return false;
}
⚠️ Warning: Distributed locks are tricky! Watch for:
- Lock holder crashes → use timeouts and lease renewal
- Clock skew → locks expire at different times on different machines
- Network partitions → two nodes might both think they have the lock
Pro Tip: Prefer idempotency over locks when possible. Design operations so doing them twice gives the same result.
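A tiny self-contained illustration of that principle, with a `HashSet` standing in for a database unique index on the request ID:

```csharp
using System;
using System.Collections.Generic;

// Sketch: an idempotent reservation. Retrying the same request ID is a
// no-op, so a client retry can never double-decrement stock.
var reservations = new HashSet<string>(); // stands in for a unique DB index
var stock = 5;

bool Reserve(string requestId)
{
    if (!reservations.Add(requestId)) return true;  // duplicate retry: no-op
    if (stock <= 0) { reservations.Remove(requestId); return false; }
    stock--;
    return true;
}

Reserve("req-1");
Reserve("req-1"); // the client retried after a timeout
Console.WriteLine(stock); // 4, not 3
```

Because the outcome is keyed by request ID rather than by "how many times the handler ran", no lock is needed to make retries safe.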
6. Cascading Failures: The Domino Effect
A cascading failure occurs when one component's failure causes others to fail, like dominoes falling.
Classic scenario:
CASCADING FAILURE TIMELINE

t=0s:  Database slows down (90% CPU)
t=5s:  API requests queue up (waiting for DB)
t=10s: API servers run out of threads
t=15s: Load balancer health checks fail
t=20s: Traffic floods to remaining servers
t=25s: All servers crash (out of memory)
t=30s: TOTAL OUTAGE
Defense mechanisms:
1. Circuit Breakers:
using Polly;
using Polly.CircuitBreaker;
public class PaymentServiceClient
{
private readonly AsyncCircuitBreakerPolicy _circuitBreaker;
public PaymentServiceClient()
{
_circuitBreaker = Policy
.Handle<HttpRequestException>()
.CircuitBreakerAsync(
handledEventsAllowedBeforeBreaking: 5,
durationOfBreak: TimeSpan.FromSeconds(30)
);
}
public async Task<PaymentResult> ProcessPayment(Payment payment)
{
try
{
return await _circuitBreaker.ExecuteAsync(async () =>
{
return await CallPaymentService(payment);
});
}
catch (BrokenCircuitException)
{
// Circuit is open - fail fast instead of waiting
return new PaymentResult
{
Success = false,
Error = "Payment service unavailable"
};
}
}
}
Circuit breaker states:
┌─────────┐
│ CLOSED  │ ← Normal operation, requests flow
└────┬────┘
     │ 5 failures
     ▼
┌─────────┐
│  OPEN   │ ← Fails fast, no requests sent
└────┬────┘
     │ 30 seconds pass
     ▼
┌───────────┐
│ HALF-OPEN │ ← Test with 1 request
└────┬──────┘
     │ success → CLOSED
     │ failure → OPEN
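The state machine above can be sketched in a few lines (illustrative thresholds, not production code; libraries like Polly implement this properly):

```csharp
using System;

// Sketch of the circuit breaker states using plain variables.
var state = "CLOSED";
var failures = 0;
var openedAt = DateTimeOffset.MinValue;
const int Threshold = 5;
var breakFor = TimeSpan.FromSeconds(30);

bool AllowRequest()
{
    // After the break duration, let one probe request through
    if (state == "OPEN" && DateTimeOffset.UtcNow - openedAt >= breakFor)
        state = "HALF-OPEN";
    return state != "OPEN";
}

void RecordFailure()
{
    failures++;
    // A failed probe, or too many failures, trips (or re-trips) the breaker
    if (state == "HALF-OPEN" || failures >= Threshold)
    {
        state = "OPEN";
        openedAt = DateTimeOffset.UtcNow;
    }
}

void RecordSuccess() { failures = 0; state = "CLOSED"; }

for (int i = 0; i < 5; i++) RecordFailure();
Console.WriteLine(AllowRequest()); // False: open circuit fails fast
```

The key property: while OPEN, callers fail in microseconds instead of piling up threads waiting on a dead dependency.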
2. Bulkheads (Isolation):
// Separate thread pools for different operations
public class ServiceClient
{
private readonly Policy _criticalBulkhead;
private readonly Policy _nonCriticalBulkhead;
public ServiceClient()
{
// Critical operations get 80% of resources
_criticalBulkhead = Policy.BulkheadAsync(
maxParallelization: 80,
maxQueuingActions: 20
);
// Non-critical get 20%
_nonCriticalBulkhead = Policy.BulkheadAsync(
maxParallelization: 20,
maxQueuingActions: 5
);
}
public async Task<Order> CreateOrder(OrderRequest request)
{
// Use critical bulkhead - won't be starved by analytics
return await _criticalBulkhead.ExecuteAsync(
async () => await ProcessOrder(request)
);
}
public async Task<Report> GenerateReport()
{
// Use non-critical bulkhead - can't overwhelm system
return await _nonCriticalBulkhead.ExecuteAsync(
async () => await CreateReport()
);
}
}
3. Timeouts (Aggressive):
public async Task<User> GetUser(string userId)
{
using var cts = new CancellationTokenSource();
cts.CancelAfter(TimeSpan.FromSeconds(2)); // Aggressive timeout
try
{
var response = await _httpClient.GetAsync($"/users/{userId}", cts.Token);
response.EnsureSuccessStatusCode();
return await response.Content.ReadFromJsonAsync<User>(cancellationToken: cts.Token);
}
catch (OperationCanceledException)
{
// Log and fall back to a cached/default value
_logger.LogWarning("GetUser timeout for {UserId}", userId);
return await _cache.GetUserOrDefault(userId);
}
}
Mnemonic for resilience patterns: "CRT" - Circuit breakers, Retries, Timeouts
7. Time and Ordering: Clocks Are Lies
In distributed systems, you can't trust clocks. Different machines have different times, even with NTP synchronization.
The problem:
// Server A (clock: 10:00:00)
var eventA = new Event
{
Timestamp = DateTime.UtcNow, // 10:00:00
Data = "User logged in"
};
// Server B (clock: 09:59:58 - 2 seconds behind!)
var eventB = new Event
{
Timestamp = DateTime.UtcNow, // 09:59:58
Data = "User clicked checkout"
};
// When merged, events are out of order!
// Says user checked out BEFORE logging in!
Solutions:
1. Vector Clocks (track causality):
public class VectorClock
{
private Dictionary<string, int> _clocks = new();
public void Increment(string nodeId)
{
if (!_clocks.ContainsKey(nodeId))
_clocks[nodeId] = 0;
_clocks[nodeId]++;
}
public void Merge(VectorClock other)
{
foreach (var (nodeId, timestamp) in other._clocks)
{
_clocks[nodeId] = Math.Max(
_clocks.GetValueOrDefault(nodeId, 0),
timestamp
);
}
}
// A happened before B if every entry in A is <= the matching entry in B
// and some entry in B is strictly greater (i.e., the clocks aren't equal)
public bool HappenedBefore(VectorClock other)
{
bool allLessOrEqual = _clocks.All(kvp =>
kvp.Value <= other._clocks.GetValueOrDefault(kvp.Key, 0));
bool strictlyGreaterSomewhere = other._clocks.Any(kvp =>
kvp.Value > _clocks.GetValueOrDefault(kvp.Key, 0));
return allLessOrEqual && strictlyGreaterSomewhere;
}
}
2. Sequence Numbers (ordered events):
public class EventLog
{
private long _sequenceNumber = 0;
public Event RecordEvent(string data)
{
var seqNum = Interlocked.Increment(ref _sequenceNumber);
return new Event
{
SequenceNumber = seqNum,
NodeId = Environment.MachineName,
Timestamp = DateTimeOffset.UtcNow,
Data = data
};
}
// Merge events from multiple nodes into one deterministic order
// (per-node sequence numbers give a stable sort, not true causal order)
public List<Event> MergeAndSort(List<Event> events)
{
return events
.OrderBy(e => e.SequenceNumber)
.ThenBy(e => e.NodeId) // Break ties deterministically
.ToList();
}
}
Did you know? Google's Spanner database uses atomic clocks and GPS receivers in every datacenter to keep time accurate to within about 7ms across the globe!
Examples: Debugging Real Distributed Problems
Example 1: The Mysterious Duplicate Orders
Problem: Users occasionally get charged twice for the same order.
Investigation steps:
Step 1 - Check metrics:
-- Query your metrics database
SELECT
COUNT(*) as duplicate_orders
FROM orders
GROUP BY user_id, cart_items
HAVING COUNT(*) > 1;
-- Result: 127 duplicates in the last week
Step 2 - Search logs with correlation ID:
// Found in logs:
{
"correlationId": "req-abc-123",
"service": "order-service",
"message": "Creating order",
"timestamp": "2024-01-15T10:30:00.000Z"
}
{
"correlationId": "req-abc-123",
"service": "order-service",
"message": "Creating order",
"timestamp": "2024-01-15T10:30:00.015Z"
}
// Same correlation ID, TWO creates!
Step 3 - View distributed trace:
Trace ID: req-abc-123
└─ API Gateway (45ms)
    ├─ Order Service Attempt 1 (2500ms) ← TIMEOUT!
    ├─ Order Service Attempt 2 (150ms) ← RETRY succeeded
    └─ Both attempts hit the database!
Root cause: The client retried after a timeout, but the first request actually succeededβit was just slow. The operation wasn't idempotent.
Fix:
public class OrderService
{
public async Task<Order> CreateOrder(OrderRequest request)
{
var idempotencyKey = request.IdempotencyKey; // Client provides!
// Check if we already processed this request
// (back this check with a unique index on IdempotencyKey so two
// concurrent retries can't both pass it and insert)
var existing = await _db.GetOrderByIdempotencyKey(idempotencyKey);
if (existing != null)
{
_logger.LogInformation(
"Duplicate request detected: {IdempotencyKey}",
idempotencyKey
);
return existing; // Return existing order
}
// Process order and store with idempotency key
var order = await ProcessNewOrder(request);
order.IdempotencyKey = idempotencyKey;
await _db.SaveOrder(order);
return order;
}
}
Key lesson: In distributed systems, assume every request will be retried. Make operations idempotent!
Example 2: The Vanishing Logs
Problem: Production error happened, but logs are missing.
Investigation:
Theory 1 - Logs not shipped:
## Check log shipping agent status
systemctl status filebeat
## Result: active, no errors
Theory 2 - Logs filtered out:
## Check logging configuration
Serilog:
MinimumLevel: "Warning" # ← AHA!
## Debug and Info logs are dropped!
Theory 3 - Wrong time window:
// Application uses local time
logger.LogError("Error at {Time}", DateTime.Now); // Local time!
// But searching in UTC
Elasticsearch query: timestamp:[2024-01-15T10:00:00 TO 2024-01-15T11:00:00]
// If server is in PST (UTC-8), logs are 8 hours off!
Root cause: The application logged in local time, but the log aggregator indexed in UTC. Searching for "10 AM UTC" missed logs actually at "10 AM PST" (18:00 UTC).
Fix:
// ALWAYS use UTC for logging
logger.LogError(
"Error occurred at {Timestamp}",
DateTimeOffset.UtcNow // UTC timestamp with offset
);
// Better: Use structured logging with automatic UTC
public class LoggingConfiguration
{
public static void Configure()
{
Log.Logger = new LoggerConfiguration()
.Enrich.WithProperty("Environment", "Production")
.WriteTo.Console(
new JsonFormatter(
renderMessage: true,
formatProvider: CultureInfo.InvariantCulture
)
)
.CreateLogger();
}
}
Key lesson: Time zones are evil. Always log in UTC. Always search in UTC. Always display to users in their local time.
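One way to keep instants comparable regardless of zone is `DateTimeOffset`, which carries the UTC offset with the value (a tiny sketch):

```csharp
using System;

// Sketch: DateTimeOffset keeps the offset, so the same instant compares
// equal no matter which zone rendered it.
var utcNow = DateTimeOffset.UtcNow;
var local = utcNow.ToLocalTime();        // same instant, local clock face
Console.WriteLine(utcNow == local);      // True: instants are compared
Console.WriteLine(utcNow.ToString("O")); // round-trip format for logs
```

Plain `DateTime.Now` drops the offset, which is exactly how the "logs 8 hours off" bug above happens.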
Example 3: The Intermittent 500 Errors
Problem: API returns HTTP 500 errors randomly, ~2% of requests.
Investigation:
Step 1 - Check error rate per instance:
-- Query metrics grouped by server
SELECT
server_id,
COUNT(*) as errors
FROM api_errors
WHERE timestamp > NOW() - INTERVAL '1 hour'
GROUP BY server_id;
-- Result:
-- server-1: 0 errors
-- server-2: 0 errors
-- server-3: 1,247 errors ← CULPRIT!
Step 2 - View logs from server-3:
{
"level": "ERROR",
"service": "api",
"server": "server-3",
"message": "Socket exception: Connection reset",
"stackTrace": "..."
}
Step 3 - Check distributed trace:
Trace ID: req-def-456
├─ Load Balancer → server-3 (12ms)
└─ server-3 → Database (0ms) ← FAILED IMMEDIATELY
    └─ Error: Connection reset
Step 4 - Check infrastructure:
## On server-3, check database connections
netstat -an | grep 5432 | grep ESTABLISHED | wc -l
## Result: 0 connections!
## Check connection usage on the database side
psql -c "SELECT count(*) FROM pg_stat_activity;"
## Result: 100 (the configured max_connections limit)
## Pool exhausted!
Root cause: Database connection pool on server-3 was exhausted. Connections weren't being properly returned to the pool.
Fix:
public class DatabaseService
{
// BAD: Connection not disposed
public async Task<User> GetUserBad(int userId)
{
var connection = await _connectionPool.GetConnection();
var user = await connection.QueryAsync<User>(
"SELECT * FROM users WHERE id = @Id",
new { Id = userId }
);
return user.FirstOrDefault();
// Connection never returned to pool!
}
// GOOD: Using statement ensures disposal
public async Task<User> GetUserGood(int userId)
{
await using var connection = await _connectionPool.GetConnection();
var user = await connection.QueryAsync<User>(
"SELECT * FROM users WHERE id = @Id",
new { Id = userId }
);
return user.FirstOrDefault();
} // Connection automatically returned here
}
Key lesson: Resource leaks in distributed systems are insidious. One small leak can take down an entire service. Always dispose resources properly!
Example 4: The Cascading Latency Spike
Problem: API latency suddenly jumps from 100ms to 5 seconds.
Investigation with distributed tracing:
Normal Request (100ms total):
  API Gateway    ▇▇ 5ms
  Auth Service   ▇ 2ms
  Order Service  ▇▇▇▇▇▇▇▇▇▇ 50ms
  Database       ▇▇▇▇▇▇▇▇ 40ms
  Cache          ▇ 3ms

Slow Request (5000ms total):
  API Gateway    ▇▇ 5ms
  Auth Service   ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4500ms ← BOTTLENECK!
  Order Service  (waiting)
  Database       (not reached)
Drilling into Auth Service trace:
Auth Service Span (4500ms):
├─ Validate Token (5ms)
└─ Check Permissions (4490ms) ← PROBLEM!
    └─ Call User Service (4485ms)
        └─ HTTP Request (4480ms)
            └─ Read Timeout (4500ms timeout)
Root cause: User Service was down, but Auth Service was waiting the full 4.5s timeout before failing.
Fix - Add circuit breaker and aggressive timeout:
public class AuthService
{
private readonly AsyncCircuitBreakerPolicy _circuitBreaker;
private readonly AsyncTimeoutPolicy _timeout;
public AuthService()
{
_timeout = Policy.TimeoutAsync(
TimeSpan.FromMilliseconds(500), // Aggressive!
TimeoutStrategy.Pessimistic
);
_circuitBreaker = Policy
.Handle<HttpRequestException>()
.Or<TimeoutRejectedException>()
.CircuitBreakerAsync(
handledEventsAllowedBeforeBreaking: 3,
durationOfBreak: TimeSpan.FromSeconds(30)
);
}
public async Task<bool> CheckPermissions(string userId)
{
try
{
return await _circuitBreaker.ExecuteAsync(async () =>
{
return await _timeout.ExecuteAsync(async () =>
{
var user = await _userServiceClient.GetUser(userId);
return user.HasPermission();
});
});
}
catch (BrokenCircuitException)
{
// Fall back to cached permissions
return await _cache.GetPermissions(userId);
}
catch (TimeoutRejectedException)
{
_logger.LogWarning("User service timeout for {UserId}", userId);
return await _cache.GetPermissions(userId);
}
}
}
Key lesson: Slow is worse than down. Use aggressive timeouts and circuit breakers to fail fast and prevent cascading delays.
Common Mistakes to Avoid ⚠️
1. Not Propagating Context
❌ Bad:
public async Task ProcessOrder(Order order)
{
// Start new HTTP request without propagating trace context
var response = await _httpClient.GetAsync("/api/payment");
// This request is DISCONNECTED from the parent trace!
}
✅ Good:
public async Task ProcessOrder(Order order)
{
// HttpClient propagates the current Activity's trace context automatically;
// reference it to enrich it, but don't dispose an activity you didn't start
var activity = Activity.Current;
activity?.SetTag("order.id", order.Id);
var response = await _httpClient.GetAsync("/api/payment");
// Trace context (the traceparent header) rides along on the HTTP request
}
2. Logging Sensitive Data
❌ Bad:
logger.LogInformation(
"Processing payment: {CreditCard}",
payment.CreditCardNumber // DON'T LOG THIS!
);
✅ Good:
logger.LogInformation(
"Processing payment: {PaymentId} for {Amount}",
payment.Id,
payment.Amount // Log identifiers, not secrets
);
3. Assuming Synchronized Clocks
❌ Bad:
if (event1.Timestamp > event2.Timestamp)
{
// Assume event1 happened after event2
// WRONG if events from different servers!
}
✅ Good:
if (event1.SequenceNumber > event2.SequenceNumber &&
event1.NodeId == event2.NodeId)
{
// Compare sequence numbers from same node
}
else
{
// Use vector clocks for cross-node ordering
if (event1.VectorClock.HappenedBefore(event2.VectorClock))
{
// event1 causally before event2
}
}
4. No Timeout on External Calls
❌ Bad:
// This can hang forever if service is unresponsive
var result = await _httpClient.GetAsync("/api/data");
✅ Good:
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(2));
try
{
var result = await _httpClient.GetAsync("/api/data", cts.Token);
}
catch (OperationCanceledException)
{
// Handle timeout
}
5. Ignoring Partial Failures
❌ Bad:
public async Task<OrderResult> CreateOrder(Order order)
{
await _orderDb.SaveOrder(order); // Succeeds
await _inventoryService.Reserve(); // Fails
await _paymentService.Charge(); // Never called
// Now order is saved but inventory not reserved!
// Inconsistent state!
}
✅ Good:
public async Task<OrderResult> CreateOrder(Order order)
{
// Use Saga pattern or distributed transaction
var saga = new OrderSaga(order);
try
{
await saga.ExecuteStep(() => _orderDb.SaveOrder(order));
await saga.ExecuteStep(() => _inventoryService.Reserve());
await saga.ExecuteStep(() => _paymentService.Charge());
await saga.Complete();
}
catch (Exception ex)
{
// Compensate: undo all completed steps
await saga.Compensate();
throw;
}
}
6. Over-Relying on Distributed Locks
❌ Bad:
// Locks for everything = performance killer
await using var lock1 = await _lockService.Acquire("users");
await using var lock2 = await _lockService.Acquire("orders");
// Holding two locks = high chance of deadlock!
✅ Good:
// Use optimistic concurrency when possible
var user = await _db.GetUser(userId);
user.Version++;
var updated = await _db.UpdateIfVersionMatches(user);
if (!updated)
{
// Retry or fail
}
7. Not Testing Under Failure
❌ Bad:
// Only test happy path
[Test]
public async Task CreateOrder_Success()
{
var result = await _orderService.CreateOrder(order);
Assert.True(result.Success);
}
✅ Good:
[Test]
public async Task CreateOrder_WhenPaymentServiceDown_ReturnsError()
{
_paymentServiceMock
.Setup(x => x.Charge(It.IsAny<Payment>()))
.ThrowsAsync(new HttpRequestException());
var result = await _orderService.CreateOrder(order);
Assert.False(result.Success);
Assert.Equal("Payment service unavailable", result.Error);
}
[Test]
public async Task CreateOrder_WhenPaymentServiceSlow_Timeouts()
{
_paymentServiceMock
.Setup(x => x.Charge(It.IsAny<Payment>()))
.Returns(async () =>
{
await Task.Delay(TimeSpan.FromSeconds(10));
return new PaymentResult { Success = true };
});
await Assert.ThrowsAsync<TimeoutException>(
() => _orderService.CreateOrder(order)
);
}
Key Takeaways
Distributed Debugging Quick Reference
| Concept | Key Point | Tool/Pattern |
|---|---|---|
| Tracing | Track requests across services | OpenTelemetry, Jaeger |
| Logging | Use structured JSON logs | Serilog, ELK Stack |
| Correlation IDs | Unique ID per request | Middleware propagation |
| Metrics | Aggregate data for alerting | Prometheus, Grafana |
| Circuit Breakers | Fail fast when downstream fails | Polly, Resilience4j |
| Timeouts | Never wait forever | CancellationToken, Polly |
| Idempotency | Safe to retry operations | Idempotency keys |
| Time | Always use UTC | DateTimeOffset.UtcNow |
| Ordering | Use sequence numbers or vector clocks | Lamport timestamps |
| Resilience | CRT: Circuit breakers, Retries, Timeouts | Polly policies |
The Golden Rules
Observability First: You can't debug what you can't observe. Invest in tracing, logging, and metrics BEFORE you have problems.
Assume Failure: Every network call will fail eventually. Every service will be down eventually. Design for it.
Fail Fast: Timeouts should be aggressive (seconds, not minutes). Circuit breakers should trip quickly.
Make It Idempotent: Every operation should be safe to retry. Use idempotency keys.
UTC Always: Never use local time in distributed systems. Ever.
Correlation Everywhere: Every log, trace, and metric should have a correlation ID.
Test Failures: Unit tests for failures are more important than tests for success.
Structured Everything: Logs, metrics, and traces should all be structured and searchable.
Memory Device: The "DISTRIBUTED" Mnemonic
- Distributed tracing tracks requests
- Idempotency prevents duplicates
- Structured logging enables search
- Timeouts prevent hangs
- Retries need exponential backoff
- Isolation with bulkheads
- Breakers stop cascading failures
- UTC time, always
- Tags in logs for context
- Error rates in metrics
- Deadlines for all operations
Further Study
Distributed Systems Observability - Cindy Sridharan's comprehensive guide: https://www.oreilly.com/library/view/distributed-systems-observability/9781492033431/
OpenTelemetry Documentation - Industry standard for tracing and observability: https://opentelemetry.io/docs/
Google SRE Book - Monitoring Distributed Systems - Battle-tested practices from Google: https://sre.google/sre-book/monitoring-distributed-systems/
Practice tip: Set up a local microservices environment with Docker Compose, Jaeger, and Grafana. Deliberately break services and practice using distributed tracing to debug them. The hands-on experience is invaluable!
Next steps: Study distributed transactions and the Saga pattern for handling complex workflows across services. Then explore chaos engineering tools like Chaos Monkey to test your system's resilience.