Turning Scars into Architecture
Converting incident learnings into better system design
Learn how to transform production failures into robust system designs with free flashcards and spaced repetition practice. This lesson covers post-mortem analysis, defensive design patterns, and circuit breakers, all essential skills for building resilient systems under pressure. When bugs escape to production, the difference between a mature engineer and a novice isn't just fixing the bug; it's ensuring it can never happen again.
Welcome
Every seasoned developer carries battle scars: the midnight database migration that corrupted records, the race condition that crashed payment processing, the memory leak that took down production. But what separates great engineers from the rest is what happens after the fire is out.
Turning scars into architecture means systematically analyzing failures and encoding the lessons learned directly into your system's structure. It's defensive programming elevated to an architectural principle. Instead of just patching holes, you redesign the ship so those holes can't form.
This lesson will teach you how to:
- Conduct effective post-mortem analysis
- Design defensive systems that fail safely
- Implement circuit breakers and bulkheads
- Build idempotent operations
- Create comprehensive observability
Core Concepts
The Post-Mortem Mindset
A post-mortem (or "blameless retrospective") is a structured analysis of what went wrong, conducted without assigning blame to individuals. The goal is organizational learning, not punishment.
The Five Whys Technique helps uncover root causes:
INCIDENT: Payment processor crashed
|
↓ Why?
Memory leak in transaction handler
|
↓ Why?
Objects weren't being disposed
|
↓ Why?
Exception in disposal path prevented cleanup
|
↓ Why?
No unit test for disposal error cases
|
↓ Why?
Code review checklist didn't include resource management verification
The true root cause isn't the memory leak; it's the missing checklist item that would have caught it.
💡 Key Principle: Every incident reveals not just a code bug, but a process gap that allowed that bug to reach production.
Defensive Design Patterns
Defensive architecture assumes everything will fail. Your code should be paranoid about:
- Network calls timing out
- Dependencies returning garbage data
- Users providing malicious input
- Race conditions under load
- Running out of memory, disk, connections
Input Validation at Boundaries
Validate aggressively at system boundaries, not just at the UI:
public class PaymentRequest
{
private decimal _amount;
public decimal Amount
{
get => _amount;
set
{
if (value <= 0)
throw new ArgumentException("Amount must be positive");
if (value > 1_000_000)
throw new ArgumentException("Amount exceeds maximum");
_amount = value;
}
}
}
Fail Fast Principle: Detect errors as early as possible and fail loudly. Silent failures are debugging nightmares.
Circuit Breaker Pattern
When a dependency fails, stop calling it for a cooldown period. This prevents cascade failures where one broken service brings down everything that depends on it.
CIRCUIT BREAKER STATE MACHINE

+-----------+
|  CLOSED   |  <-- Normal operation
|   (OK)    |      All requests pass through
+-----+-----+
      |
      |  Failures exceed threshold
      v
+-----------+
|   OPEN    |  <-- Failing fast
| (BLOCKED) |      Reject requests immediately
+-----+-----+      (don't even try the call)
      |
      |  After timeout period
      v
+-----------+
| HALF-OPEN |  <-- Testing recovery
| (TESTING) |      Allow limited requests
+-----+-----+
      |
  +---+-----------+
  |               |
  v Success       v Failure
CLOSED          OPEN
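using System;
using System.Threading.Tasks;

// Minimal single-threaded sketch; add locking before sharing one instance across threads.
public enum CircuitState { Closed, Open, HalfOpen }

public class CircuitBreakerOpenException : Exception { }

public class CircuitBreaker
{
    private const int FailureThreshold = 5;
    private static readonly TimeSpan Cooldown = TimeSpan.FromSeconds(60);

    private int _failureCount = 0;
    private DateTime _lastFailureTime;
    private CircuitState _state = CircuitState.Closed;

    public async Task<T> ExecuteAsync<T>(Func<Task<T>> operation)
    {
        if (_state == CircuitState.Open)
        {
            if (DateTime.UtcNow - _lastFailureTime > Cooldown)
            {
                _state = CircuitState.HalfOpen; // Cooldown over: let a trial call through
            }
            else
            {
                throw new CircuitBreakerOpenException(); // Fail fast without calling
            }
        }

        try
        {
            var result = await operation();
            // Any success closes the circuit and resets the count, so only
            // consecutive failures can trip the breaker.
            _state = CircuitState.Closed;
            _failureCount = 0;
            return result;
        }
        catch (Exception)
        {
            _failureCount++;
            _lastFailureTime = DateTime.UtcNow;
            // A failed trial call reopens immediately; otherwise wait for the threshold.
            if (_state == CircuitState.HalfOpen || _failureCount >= FailureThreshold)
            {
                _state = CircuitState.Open;
            }
            throw;
        }
    }
}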
⚠️ Without circuit breakers: One slow database can make your entire API time out, causing retry storms that make the problem worse.
✅ With circuit breakers: A failed dependency fails fast, protecting your system's resources.
Idempotency: Making Operations Retry-Safe
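Example 2 later in this lesson passes a fallback to the breaker, which the class above does not accept. The TypeScript sketch below shows one way the same state machine could support that. It is synchronous and takes an injected clock so it can be exercised without waiting; the names `SketchBreaker`, `failureThreshold`, and `cooldownMs` are illustrative assumptions, not part of any library.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Hypothetical sketch: a synchronous circuit breaker with an optional fallback.
class SketchBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold: number,
    private readonly cooldownMs: number,
    private readonly now: () => number = () => Date.now(),
  ) {}

  currentState(): BreakerState {
    return this.state;
  }

  execute<T>(operation: () => T, fallback?: () => T): T {
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.cooldownMs) {
        // Still cooling down: skip the call entirely.
        if (fallback) return fallback();
        throw new Error("circuit open");
      }
      this.state = "half-open"; // cooldown elapsed: allow one trial call
    }
    try {
      const value = operation();
      // Any success closes the circuit and clears the failure streak.
      this.state = "closed";
      this.failures = 0;
      return value;
    } catch (err) {
      this.failures += 1;
      // A failed trial call, or too many consecutive failures, opens the circuit.
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
      if (fallback) return fallback();
      throw err;
    }
  }
}
```

With a fallback supplied, callers never see the "circuit open" error; they get the degraded value instead, which is the behavior a checkout flow that defers payment would rely on.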
An idempotent operation produces the same result whether you execute it once or multiple times. This is crucial for retry logic.
Non-idempotent (dangerous to retry):
public void ProcessPayment(decimal amount)
{
account.Balance -= amount; // ❌ Retrying deducts twice!
}
Idempotent (safe to retry):
public void ProcessPayment(string transactionId, decimal amount)
{
if (_processedTransactions.Contains(transactionId))
return; // Already processed
account.Balance -= amount;
_processedTransactions.Add(transactionId);
}
HTTP Method Idempotency:
| Method | Idempotent? | Safe to Retry? |
|---|---|---|
| GET | ✅ Yes | ✅ Yes |
| PUT | ✅ Yes | ✅ Yes |
| DELETE | ✅ Yes | ✅ Yes |
| POST | ❌ No | ⚠️ Only with idempotency keys |
| PATCH | ❌ No | ⚠️ Depends on implementation |
💡 Design Tip: Add unique request IDs to all operations that modify state. Store processed IDs to detect duplicates.
Bulkhead Pattern: Isolating Failures
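To make the design tip concrete, here is a small TypeScript sketch of a ledger that records the outcome of each request ID and replays it on duplicates. The class `IdempotentLedger` and its in-memory `Map` are assumptions for illustration; a production system would more likely keep the IDs in a durable store with a unique constraint.

```typescript
// Hypothetical in-memory ledger keyed by request ID.
class IdempotentLedger {
  private readonly results = new Map<string, number>();
  private balance: number;

  constructor(openingBalance: number) {
    this.balance = openingBalance;
  }

  // Deducts `amount` at most once per `requestId` and returns the balance
  // that resulted from the first execution of that request.
  debit(requestId: string, amount: number): number {
    const previous = this.results.get(requestId);
    if (previous !== undefined) {
      return previous; // duplicate: replay the recorded outcome
    }
    this.balance -= amount;
    this.results.set(requestId, this.balance);
    return this.balance;
  }
}
```

Storing the result, rather than only the ID, lets a retried request receive the same answer as the original instead of an empty acknowledgement.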
In ships, bulkheads are watertight compartments. If one section floods, the others remain safe. Apply this to system design:
MONOLITHIC CONNECTION POOL
+------------------------------------------+
| ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●  |
| (20 connections shared by all)           |
+------------------------------------------+
❌ One slow query can exhaust all connections

BULKHEAD CONNECTION POOLS
+----------------+  +----------------+
| Critical API   |  | Reports        |
| ● ● ● ● ● ● ●  |  | ● ● ● ● ●      |
+----------------+  +----------------+
+----------------+  +----------------+
| Background     |  | Analytics      |
| ● ● ● ● ● ●    |  | ● ● ●          |
+----------------+  +----------------+
✅ Slow report query can't starve critical API
public class DatabaseConnectionFactory
{
private readonly ConnectionPool _criticalPool = new ConnectionPool(maxSize: 10);
private readonly ConnectionPool _reportingPool = new ConnectionPool(maxSize: 5);
private readonly ConnectionPool _backgroundPool = new ConnectionPool(maxSize: 3);
public IDbConnection GetConnection(WorkloadType type)
{
return type switch
{
WorkloadType.Critical => _criticalPool.GetConnection(),
WorkloadType.Reporting => _reportingPool.GetConnection(),
WorkloadType.Background => _backgroundPool.GetConnection(),
_ => throw new ArgumentException("Unknown workload type")
};
}
}
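The same isolation idea applies to any limited resource, not only database connections. The TypeScript sketch below caps concurrent work per workload with a simple permit counter and sheds load when a compartment is full. `Bulkhead`, `bulkheads`, and `runIsolated` are hypothetical names, and the limits simply mirror the pool sizes in the C# example above.

```typescript
// Hypothetical sketch: one fixed-size permit pool per workload.
class Bulkhead {
  private inUse = 0;

  constructor(private readonly maxConcurrent: number) {}

  // Takes a permit and returns true if one is free; never blocks.
  tryAcquire(): boolean {
    if (this.inUse >= this.maxConcurrent) return false;
    this.inUse += 1;
    return true;
  }

  release(): void {
    if (this.inUse > 0) this.inUse -= 1;
  }
}

const bulkheads = {
  critical: new Bulkhead(10),
  reporting: new Bulkhead(5),
  background: new Bulkhead(3),
};

type Workload = keyof typeof bulkheads;

// Runs `work` only if the workload's compartment has capacity;
// otherwise returns the `rejected` value without touching other compartments.
function runIsolated<T>(workload: Workload, work: () => T, rejected: () => T): T {
  const compartment = bulkheads[workload];
  if (!compartment.tryAcquire()) return rejected();
  try {
    return work();
  } finally {
    compartment.release();
  }
}
```

Rejecting immediately when a compartment is full is one design choice; queueing with a bounded wait is another, at the cost of holding the caller longer.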
Observability: Making Systems Debuggable
You can't fix what you can't see. Mature systems have three pillars of observability:
1. Metrics (aggregated numbers)
public class PaymentProcessor
{
private static readonly Counter PaymentAttempts =
Metrics.CreateCounter("payment_attempts_total", "Total payment attempts");
private static readonly Histogram PaymentDuration =
Metrics.CreateHistogram("payment_duration_seconds", "Payment processing time");
public async Task<PaymentResult> ProcessAsync(PaymentRequest request)
{
PaymentAttempts.Inc();
using (PaymentDuration.NewTimer())
{
return await _processor.ProcessAsync(request);
}
}
}
2. Logs (discrete events)
_logger.LogWarning(
"Payment processing slow: {Duration}ms for transaction {TransactionId}",
duration.TotalMilliseconds,
transactionId
);
3. Traces (request flow through services)
using var activity = _activitySource.StartActivity("ProcessPayment");
activity?.SetTag("transaction.id", transactionId);
activity?.SetTag("payment.amount", amount);
TRACE VISUALIZATION:
Request ID: abc-123
API Gateway              [====]      50ms
├─ Auth Service          [==]        20ms
├─ Payment Service       [========]  80ms
│  ├─ Database           [====]      40ms
│  └─ Fraud Check        [======]    60ms  ⚠️ SLOW
└─ Notification          [==]        20ms

Total: 170ms
Examples
Example 1: Learning from a Race Condition
The Incident: E-commerce site oversold concert tickets. Two users bought the last ticket simultaneously.
The Original (Broken) Code:
public async Task<bool> PurchaseTicket(int eventId, int userId)
{
var availableTickets = await _db.GetAvailableTickets(eventId);
if (availableTickets > 0) // ❌ RACE CONDITION!
{
// Another thread can check between these lines
await _db.DecrementTickets(eventId);
await _db.CreateOrder(userId, eventId);
return true;
}
return false;
}
The Timeline of Failure:
Time    Thread A                      Thread B
-----   ---------------------------   ---------------------------
10:00   Check: 1 ticket available
10:01                                 Check: 1 ticket available
10:02   Decrement to 0
10:03                                 Decrement to -1 ❌
10:04   Create order for User A
10:05                                 Create order for User B ❌
Architectural Fix #1: Optimistic Concurrency
public async Task<bool> PurchaseTicket(int eventId, int userId)
{
using var transaction = await _db.BeginTransactionAsync();
var eventData = await _db.Events
.Where(e => e.Id == eventId)
.FirstOrDefaultAsync();
if (eventData != null && eventData.AvailableTickets > 0) // FirstOrDefaultAsync returns null for an unknown event
{
var originalVersion = eventData.RowVersion;
eventData.AvailableTickets--;
var rowsAffected = await _db.Database.ExecuteSqlRawAsync(
"UPDATE Events SET AvailableTickets = {0}, RowVersion = RowVersion + 1 " +
"WHERE Id = {1} AND RowVersion = {2}",
eventData.AvailableTickets,
eventId,
originalVersion
);
if (rowsAffected == 0)
{
// Another transaction modified it first
return false; // ✅ Fail safely
}
await _db.CreateOrder(userId, eventId);
await transaction.CommitAsync();
return true;
}
return false;
}
Architectural Fix #2: Reservation System
Even better, split the purchase into two phases, reserve and then confirm (a reservation pattern, not a distributed two-phase commit):
public async Task<ReservationResult> ReserveTicket(int eventId, int userId)
{
var reservation = new TicketReservation
{
EventId = eventId,
UserId = userId,
ExpiresAt = DateTime.UtcNow.AddMinutes(10),
Status = ReservationStatus.Pending
};
using var transaction = await _db.BeginTransactionAsync(
IsolationLevel.Serializable // ✅ Strongest isolation
);
var available = await _db.Events
.Where(e => e.Id == eventId && e.AvailableTickets > 0)
.ExecuteUpdateAsync(e => e.SetProperty(
p => p.AvailableTickets,
p => p.AvailableTickets - 1
));
if (available == 0)
return ReservationResult.SoldOut;
await _db.Reservations.AddAsync(reservation);
await transaction.CommitAsync();
return ReservationResult.Success(reservation.Id);
}
public async Task<bool> CompleteReservation(Guid reservationId)
{
var reservation = await _db.Reservations.FindAsync(reservationId);
if (reservation == null || reservation.IsExpired())
{
// Background job will return expired reservations to pool
return false;
}
reservation.Status = ReservationStatus.Completed;
await _db.SaveChangesAsync();
return true;
}
🧠 Memory Device: "ACID for tickets": Atomicity, Consistency, Isolation, Durability. Database transactions provide these properties; combined with a strict enough isolation level or an atomic conditional update, they prevent this kind of race condition.
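The comment in CompleteReservation mentions a background job that returns expired reservations to the pool. A minimal TypeScript sketch of that sweep follows, using in-memory data. The shapes `EventInventory` and `TicketReservation` and the function `releaseExpired` are assumptions for illustration; a real job would run the same logic inside a single database transaction.

```typescript
type ReservationStatus = "pending" | "completed" | "expired";

interface TicketReservation {
  id: string;
  expiresAt: number; // epoch milliseconds
  status: ReservationStatus;
}

interface EventInventory {
  availableTickets: number;
  reservations: TicketReservation[];
}

// Marks overdue pending reservations as expired and returns their tickets
// to the pool. Returns the number of tickets released.
function releaseExpired(inventory: EventInventory, nowMs: number): number {
  let released = 0;
  for (const reservation of inventory.reservations) {
    if (reservation.status === "pending" && reservation.expiresAt <= nowMs) {
      reservation.status = "expired";
      released += 1;
    }
  }
  inventory.availableTickets += released;
  return released;
}
```

Because a released reservation changes status, running the sweep twice does not return the same ticket twice, which keeps the job itself safe to retry.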
Example 2: Surviving Dependency Failure
The Incident: Payment gateway had an outage. All checkout attempts failed for 2 hours.
The Problem: Synchronous blocking calls with no fallback:
public async Task<CheckoutResult> Checkout(Order order)
{
var paymentResult = await _paymentGateway.ChargeCard(order.Total); // ❌ Hangs or throws during an outage
if (paymentResult.Success)
{
await _db.SaveOrder(order);
return CheckoutResult.Success;
}
return CheckoutResult.Failed;
}
Architectural Fix: Circuit Breaker + Async Processing
public async Task<CheckoutResult> Checkout(Order order)
{
// Save order as pending immediately
order.Status = OrderStatus.PendingPayment;
await _db.Orders.AddAsync(order);
await _db.SaveChangesAsync();
try
{
// Try payment with circuit breaker
var paymentResult = await _circuitBreaker.ExecuteAsync(
() => _paymentGateway.ChargeCard(order.Total),
fallback: () => Task.FromResult(PaymentResult.Deferred)
);
if (paymentResult.Success)
{
order.Status = OrderStatus.Confirmed;
await _db.SaveChangesAsync();
return CheckoutResult.Success;
}
else if (paymentResult.IsDeferred) // Hypothetical instance flag; it cannot share a name with the static PaymentResult.Deferred
{
// Queue for retry
await _messageQueue.EnqueueAsync(new ProcessPaymentMessage
{
OrderId = order.Id,
RetryCount = 0
});
return CheckoutResult.Pending(
"Payment processing. You'll receive confirmation within 1 hour."
);
}
}
catch (Exception ex) when (ex is CircuitBreakerOpenException or HttpRequestException)
{
// Gateway is down or unreachable - queue for later
await _messageQueue.EnqueueAsync(new ProcessPaymentMessage
{
OrderId = order.Id,
RetryCount = 0
});
return CheckoutResult.Pending(
"High traffic. Your order is saved and will be processed shortly."
);
}
return CheckoutResult.Failed;
}
Background Processor with Exponential Backoff:
public async Task ProcessPendingPayment(ProcessPaymentMessage message)
{
var order = await _db.Orders.FindAsync(message.OrderId);
if (order.Status != OrderStatus.PendingPayment)
return; // Already processed
try
{
var result = await _paymentGateway.ChargeCard(order.Total);
if (result.Success)
{
order.Status = OrderStatus.Confirmed;
await _db.SaveChangesAsync();
await _emailService.SendConfirmationAsync(order);
}
else
{
await RetryOrCancel(message, order);
}
}
catch (Exception ex)
{
_logger.LogError(ex, "Payment processing failed for order {OrderId}", order.Id);
await RetryOrCancel(message, order);
}
}
private async Task RetryOrCancel(ProcessPaymentMessage message, Order order)
{
if (message.RetryCount < 5)
{
// Exponential backoff: 1min, 2min, 4min, 8min, 16min
var delayMinutes = Math.Pow(2, message.RetryCount);
await _messageQueue.EnqueueAsync(
new ProcessPaymentMessage
{
OrderId = order.Id,
RetryCount = message.RetryCount + 1
},
delay: TimeSpan.FromMinutes(delayMinutes)
);
}
else
{
order.Status = OrderStatus.PaymentFailed;
await _db.SaveChangesAsync();
await _emailService.SendPaymentFailureAsync(order);
}
}
✅ What changed:
- Orders persist immediately (data never lost)
- Circuit breaker prevents timeout pile-up
- Async processing decouples checkout from payment
- User gets immediate feedback, not a hanging browser
- System degrades gracefully during outages
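The retry schedule in RetryOrCancel doubles the delay on each attempt. When many orders fail during the same outage, they would all retry at the same moments, so a common refinement is to cap the delay and add random jitter. The TypeScript sketch below shows both; the function names and the 60-minute cap are assumptions, not values from the code above.

```typescript
// Delay in minutes before retry number `retryCount` (0-based), doubling each
// time and never exceeding `capMinutes`.
function backoffMinutes(retryCount: number, capMinutes = 60): number {
  return Math.min(Math.pow(2, retryCount), capMinutes);
}

// "Full jitter": pick a random delay between 0 and the exponential value so
// that queued retries spread out instead of arriving together.
// `random` is injectable so the behavior can be tested deterministically.
function jitteredBackoffMinutes(
  retryCount: number,
  random: () => number = Math.random,
  capMinutes = 60,
): number {
  return random() * backoffMinutes(retryCount, capMinutes);
}
```

For retry counts 0 through 4, `backoffMinutes` reproduces the 1, 2, 4, 8, 16 minute schedule from the comment in RetryOrCancel.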
Example 3: The Cascading Timeout
The Incident: Slow database query caused entire API to become unresponsive.
The Problem Chain:
1. Database runs slow query (60s)
2. Web server thread blocks (60s)
3. More requests arrive, threads exhausted
4. Load balancer times out (30s)
5. Retries make it worse
6. COMPLETE OUTAGE
Architectural Fix: Timeouts + Bulkheads + Fallbacks
public class DashboardController : ControllerBase
{
private readonly IDbConnectionFactory _analyticsDb; // Hypothetical factory over a separate connection pool
private readonly ICache _cache;
[HttpGet("dashboard")]
public async Task<DashboardData> GetDashboard()
{
// Start all three queries concurrently, keeping a typed reference to each
var ordersTask = GetRecentOrders();
var revenueTask = GetRevenueStats();
var productsTask = GetTopProducts();
// ✅ Wait for the queries, but never longer than the 5-second budget
var allQueries = Task.WhenAll(ordersTask, revenueTask, productsTask);
await Task.WhenAny(allQueries, Task.Delay(TimeSpan.FromSeconds(5)));
return new DashboardData
{
RecentOrders = ordersTask.IsCompletedSuccessfully
? await ordersTask
: GetCachedOrders(), // ✅ Fallback to cache
RevenueStats = revenueTask.IsCompletedSuccessfully
? await revenueTask
: GetEstimatedRevenue(), // ✅ Fallback to estimate
TopProducts = productsTask.IsCompletedSuccessfully
? await productsTask
: Array.Empty<Product>() // ✅ Degrade gracefully
};
}
private async Task<Order[]> GetRecentOrders()
{
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(5));
// ✅ Use separate connection pool for analytics
using var connection = _analyticsDb.CreateConnection();
try
{
return await connection.QueryAsync<Order>(
"SELECT TOP 10 * FROM Orders ORDER BY CreatedAt DESC",
cancellationToken: cts.Token
);
}
catch (OperationCanceledException)
{
_logger.LogWarning("Recent orders query timed out");
return GetCachedOrders();
}
}
}
Query Timeout in SQL (defense in depth):
public class AnalyticsDbContext : DbContext
{
protected override void OnConfiguring(DbContextOptionsBuilder options)
{
options.UseSqlServer(
_connectionString,
sqlOptions => sqlOptions
.CommandTimeout(10) // ✅ Kill long queries at database level
.EnableRetryOnFailure(maxRetryCount: 3)
);
}
}
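💡 Key Principle: "Set timeouts at every layer". Don't rely on a single timeout; each layer should have its own. Layers closer to the user generally get longer limits than the calls they wrap, so an inner timeout fires first and the outer layer still has time to return a fallback. In the table below, the 10s database command timeout acts as a backstop for queries that the 5s application-level cancellation fails to stop.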
| Layer | Timeout | Reason |
|---|---|---|
| Database | 10s | Kill runaway queries |
| Application | 5s | Release thread pool |
| Load Balancer | 30s | Last resort cutoff |
| Browser | 60s | User experience |
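One way to keep per-layer limits consistent is to carry a single request deadline and derive each inner timeout from whatever time is left. The TypeScript sketch below illustrates that idea with an injected clock; `RequestDeadline` and its methods are hypothetical names, not an API from the examples above.

```typescript
// Hypothetical sketch: a request deadline that inner calls derive their
// timeouts from, so no layer waits longer than the request has left.
class RequestDeadline {
  private readonly expiresAt: number;

  constructor(
    budgetMs: number,
    private readonly now: () => number = () => Date.now(),
  ) {
    this.expiresAt = this.now() + budgetMs;
  }

  // Milliseconds left before the overall request budget runs out.
  remainingMs(): number {
    return Math.max(0, this.expiresAt - this.now());
  }

  // Timeout for an inner call: its own limit, but never more than what is left.
  timeoutFor(layerLimitMs: number): number {
    return Math.min(layerLimitMs, this.remainingMs());
  }

  expired(): boolean {
    return this.remainingMs() === 0;
  }
}
```

With a 5-second request budget, a database call whose own limit is 10 seconds would be given at most 5 seconds, and less if earlier steps already used part of the budget.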
Example 4: The Idempotency Key Pattern
The Incident: Network glitch caused duplicate payment charges when users clicked "Pay" multiple times.
Architectural Fix: Client-Generated Idempotency Keys
Frontend:
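class CheckoutService {
  // ✅ Generate the idempotency key once per checkout attempt and reuse it for
  // every click and retry; a fresh key per call would defeat deduplication.
  private readonly idempotencyKey = crypto.randomUUID();

  async submitPayment(amount: number): Promise<PaymentResult> {
    try {
      return await this.send(amount);
    } catch (networkError) {
      // ✅ Retry with SAME key
      return await this.send(amount);
    }
  }

  private async send(amount: number): Promise<PaymentResult> {
    const response = await fetch('/api/payments', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Idempotency-Key': this.idempotencyKey
      },
      body: JSON.stringify({ amount })
    });
    // fetch resolves to a Response; parse the body to get the result
    return (await response.json()) as PaymentResult;
  }
}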
Backend:
public class PaymentsController : ControllerBase
{
private readonly IDistributedCache _cache;
[HttpPost]
public async Task<IActionResult> ProcessPayment(
[FromBody] PaymentRequest request,
[FromHeader(Name = "Idempotency-Key")] string idempotencyKey)
{
if (string.IsNullOrEmpty(idempotencyKey))
return BadRequest("Idempotency-Key header required");
var cacheKey = $"payment:{idempotencyKey}";
// ✅ Check if we've seen this request before
var cached = await _cache.GetStringAsync(cacheKey);
if (cached != null)
{
var previousResult = JsonSerializer.Deserialize<PaymentResult>(cached);
return Ok(previousResult); // Return same response
}
// ✅ Process payment
// Note: two concurrent requests with the same key can both miss the cache;
// closing that window needs an atomic "claim" of the key before charging.
var result = await _paymentProcessor.ChargeAsync(request.Amount);
// ✅ Cache result for 24 hours
await _cache.SetStringAsync(
cacheKey,
JsonSerializer.Serialize(result),
new DistributedCacheEntryOptions
{
AbsoluteExpirationRelativeToNow = TimeSpan.FromHours(24)
}
);
return Ok(result);
}
}
Why this works:
- Same idempotency key = same response (safe to retry)
- Key generated on client = survives network failures
- Cached response = instant for duplicates
- 24-hour expiry = prevents indefinite storage growth
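A check followed by a later write leaves a window in which two requests carrying the same key can both start processing. The TypeScript sketch below shows one way to close it: claim the key before doing the work and distinguish "started", "in-progress", and "done". `IdempotencyKeyStore` and `Claim` are hypothetical; the claim is atomic here only because the code is synchronous and in-memory, and a shared cache would need an equivalent set-if-absent operation.

```typescript
type Claim<T> =
  | { kind: "started" }          // caller owns the work for this key
  | { kind: "in-progress" }      // another request already holds the key
  | { kind: "done"; result: T }; // replay the stored result

// Hypothetical in-memory store of idempotency keys and their outcomes.
class IdempotencyKeyStore<T> {
  private readonly entries = new Map<string, { done: boolean; result?: T }>();

  // Registers the key if unseen; otherwise reports its current state.
  claim(key: string): Claim<T> {
    const entry = this.entries.get(key);
    if (entry === undefined) {
      this.entries.set(key, { done: false });
      return { kind: "started" };
    }
    if (!entry.done) return { kind: "in-progress" };
    return { kind: "done", result: entry.result as T };
  }

  // Records the outcome so later duplicates receive the same response.
  complete(key: string, result: T): void {
    this.entries.set(key, { done: true, result });
  }
}
```

A server using this shape could answer an "in-progress" claim with a conflict or retry-later response rather than charging a second time.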
Common Mistakes ⚠️
Mistake 1: Blame-Focused Post-Mortems
❌ Wrong: "Bob's code caused the outage because he didn't test properly."
✅ Right: "The deployment process lacks automated integration tests. We'll add a pre-production staging environment with production-like load."
Why it matters: Blame shuts down learning. The goal is fixing systems, not people.
Mistake 2: Logging Without Context
❌ Wrong:
_logger.LogError("Payment failed");
✅ Right:
_logger.LogError(
"Payment failed: {Reason} for user {UserId}, transaction {TransactionId}, amount {Amount}",
ex.Message,
userId,
transactionId,
amount
);
Why it matters: Future debugging depends on having rich context. Logs should answer "who, what, when, where, why" without needing to add more logging.
Mistake 3: Forgetting to Test Failure Modes
❌ Wrong: Only testing happy paths
✅ Right: Test what happens when a dependency fails:
[Fact]
public async Task Checkout_ShouldQueue_WhenPaymentGatewayDown()
{
_mockGateway.Setup(g => g.ChargeCard(It.IsAny<decimal>()))
.ThrowsAsync(new HttpRequestException("Service unavailable"));
var result = await _checkoutService.Checkout(new Order());
Assert.Equal(CheckoutStatus.Pending, result.Status);
_mockQueue.Verify(q => q.EnqueueAsync(It.IsAny<ProcessPaymentMessage>()), Times.Once);
}
Mistake 4: Synchronous Blocking in Async Code
❌ Wrong (causes thread pool starvation):
public async Task<Order> GetOrder(int id)
{
var data = _httpClient.GetStringAsync($"/orders/{id}").Result; // ❌ Blocks!
return JsonSerializer.Deserialize<Order>(data);
}
✅ Right:
public async Task<Order> GetOrder(int id)
{
var data = await _httpClient.GetStringAsync($"/orders/{id}");
return JsonSerializer.Deserialize<Order>(data);
}
Mistake 5: Not Setting Request Deadlines
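❌ Wrong: No explicit deadline
public async Task<Data> FetchData()
{
// ❌ Assuming _client is an HttpClient, this is bounded only by its default 100-second timeout
return await _client.GetFromJsonAsync<Data>("/data");
}
✅ Right: Always set timeouts
public async Task<Data> FetchData()
{
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(10));
// GetFromJsonAsync (System.Net.Http.Json) deserializes the body and honors the token
return await _client.GetFromJsonAsync<Data>("/data", cts.Token);
}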
Key Takeaways
1. Every incident is a learning opportunity. Use blameless post-mortems to uncover systemic issues, not individual mistakes.
2. Design for failure from the start. Assume dependencies will fail, networks will be unreliable, and users will send bad data.
3. Circuit breakers prevent cascade failures. Stop calling broken dependencies to preserve system resources.
4. Idempotency makes retries safe. Design operations so executing them multiple times has the same effect as once.
5. Bulkheads isolate failures. Partition resources so one failure can't exhaust everything.
6. Set timeouts at every layer. Don't let one slow operation block your entire system.
7. Observability is non-negotiable. You can't fix what you can't measure. Emit metrics, logs, and traces.
8. Test failure scenarios. Your system's behavior during failures matters more than during success.
Quick Reference Card
| Pattern | Problem Solved | Implementation |
|---|---|---|
| Circuit Breaker | Cascade failures | Stop calling failed dependencies |
| Bulkhead | Resource exhaustion | Partition connection pools/threads |
| Idempotency Key | Duplicate operations | Track processed request IDs |
| Timeout | Hanging operations | CancellationToken with deadline |
| Retry with Backoff | Transient failures | Exponential delays between retries |
| Fallback | Dependency unavailable | Return cached/default data |
| Health Check | Routing to broken instances | Endpoint reporting service status |
Further Study
- Microsoft Azure Architecture Patterns - Comprehensive catalog of resilience patterns
- Site Reliability Engineering Book - Google's approach to building reliable systems
- Release It! by Michael Nygard - Classic book on production-ready software design
Remember: Your most valuable code isn't the features you ship; it's the defensive architecture that keeps them running when everything goes wrong. Every scar is a lesson waiting to be encoded into your system's design.