Slow vs Broken
Distinguishing degraded performance from actual failures
Slow vs Broken: Debugging Distributed Systems
Master the critical distinction between slow and broken systems. This lesson covers performance degradation patterns, failure detection strategies, and diagnostic techniques, all essential for debugging distributed systems under pressure. Understanding whether a system is merely slow or completely broken determines your entire debugging approach and incident response strategy.
💡 Why This Matters: In distributed systems, misdiagnosing slowness as failure (or vice versa) can lead to catastrophic decisions like unnecessary failovers, data loss, or prolonged outages. A slow system may recover on its own; a broken one will not recover without intervention.
Welcome to the Chaos Zone 🌪️
When your monitoring dashboard lights up like a Christmas tree at 3 AM, the first question isn't "What's wrong?" It's "Is it slow or is it broken?" This distinction is the foundation of effective debugging under pressure.
In distributed reality, systems exhibit complex failure modes:
- Slow: High latency, degraded throughput, but still processing requests
- Broken: Complete failure, no progress, data corruption, or infinite loops
The difference isn't always obvious. A system that's 99.9% slow might be effectively broken for your users. A system that appears broken might just be overwhelmed.
Core Concepts: The Fundamental Distinction 🎯
What Makes a System "Slow"?
Slow systems are still making forward progress, just not at the expected rate. Key characteristics:
| Indicator | What You See | Why It Matters |
|---|---|---|
| Requests Complete | Eventually return success | System is functional, just degraded |
| Latency Increase | P99 goes from 100ms to 5s | Performance issue, not availability |
| Partial Success | Some requests fast, others slow | Indicates resource contention |
| Queue Growth | Work backlog increases steadily | Throughput < arrival rate |
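The last row is worth checking numerically: if completions lag arrivals over a sustained window, the backlog can only grow. A minimal sketch, where the per-window counts are hypothetical values sampled from your own metrics system:
def backlog_trend(enqueued_in_window, completed_in_window, window_seconds):
    """Compare arrival and completion rates over one sampling window."""
    arrival_rate = enqueued_in_window / window_seconds
    completion_rate = completed_in_window / window_seconds
    if completion_rate < arrival_rate:
        # Still making progress (slow), but the queue grows without bound
        return f"backlog growing at {arrival_rate - completion_rate:.1f} items/s"
    return "keeping up"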
Example slow system behaviors:
# Slow database queries - still completing (db and log are the app's existing database client and logger)
import time
def get_user_data(user_id):
start = time.time()
result = db.query("SELECT * FROM users WHERE id = ?", user_id)
duration = time.time() - start
if duration > 1.0: # Slow but not broken
log.warning(f"Slow query: {duration}s")
return result # Eventually returns
// API endpoint experiencing high latency
app.get('/api/orders', async (req, res) => {
const start = Date.now();
// Takes 8 seconds instead of 200ms
const orders = await Order.find({ userId: req.user.id });
const duration = Date.now() - start;
metrics.recordLatency('orders.fetch', duration);
res.json(orders); // Responds eventually
});
What Makes a System "Broken"?
Broken systems have stopped making meaningful progress. Critical indicators:
| Indicator | What You See | Immediate Action |
|---|---|---|
| Requests Timeout | Never complete, hang indefinitely | Circuit breaker, failover |
| Error Spike | 500s, connection refused, crashes | Stop traffic, investigate |
| Data Corruption | Invalid responses, wrong results | Immediate rollback |
| Deadlock/Livelock | No progress despite CPU activity | Thread dump, force restart |
Example broken system behaviors:
// Deadlock - system completely stuck (mu1 and mu2 are package-level sync.Mutex values)
func ProcessOrder(orderID string) error {
mu1.Lock()
defer mu1.Unlock()
// Another goroutine has mu2 and waits for mu1
mu2.Lock() // DEADLOCK - never proceeds
defer mu2.Unlock()
// This code never executes
return saveOrder(orderID)
}
// Panic causing complete service failure
fn handle_request(data: &str) -> Result<Response, Error> {
let parsed = data.parse::<i32>().unwrap(); // PANIC on invalid input
// Service crashes, no recovery
Ok(Response { value: parsed })
}
# Infinite retry loop - appears active but makes no progress
def send_notification(user_id, message):
while True: # BROKEN: Never exits on permanent failure
try:
api.send(user_id, message)
break
except NetworkError:
time.sleep(1) # Retries forever even if user deleted
The Gray Zone: When Slow Becomes Broken ⚠️
The most dangerous scenarios are borderline cases:
System state spectrum:
✅ Healthy (50ms p99) → ⚠️ Degraded (500ms p99) → 🔥 Critical (30s p99) → 💀 Dead (timeouts)
The left end of the spectrum is the slow zone, the right end is the broken zone, and the crossover between them is rarely sharp.
💡 Critical threshold: When timeout duration exceeds user/system tolerance, "slow" effectively becomes "broken" regardless of technical progress.
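One way to make that threshold concrete is to classify latency against the caller's timeout budget rather than against an absolute number. A minimal sketch; the 80% headroom factor is an assumption, not a universal rule:
def classify_against_budget(p99_latency_s, caller_timeout_s):
    """Is 'slow' still usable, or effectively broken for this caller?"""
    if p99_latency_s >= caller_timeout_s:
        return "effectively_broken"      # callers give up before we answer
    if p99_latency_s >= 0.8 * caller_timeout_s:
        return "critical"                # almost no headroom left
    return "slow_but_usable"

# classify_against_budget(5.0, 30.0)  -> 'slow_but_usable'
# classify_against_budget(28.0, 30.0) -> 'critical'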
Diagnostic Techniques: Detective Work
The Timeout Test Pattern
First rule of distributed debugging: Test with aggressive timeouts
import requests
from requests.exceptions import Timeout
def diagnose_service_health(endpoint):
"""
Quick test: Is it slow or broken?
"""
short_timeout = 1.0 # Fail fast
long_timeout = 30.0 # Wait and see
# Test 1: Can it complete quickly?
try:
response = requests.get(endpoint, timeout=short_timeout)
print("β
HEALTHY: Fast response")
return "healthy"
except Timeout:
print("β οΈ Might be slow or broken...")
# Test 2: Can it complete eventually?
try:
response = requests.get(endpoint, timeout=long_timeout)
print("π SLOW: Completed but took >1s")
return "slow"
except Timeout:
print("π BROKEN: Cannot complete even with 30s")
return "broken"
except Exception as e:
print(f"π BROKEN: Error - {e}")
return "broken"
Progress Monitoring Pattern
Track whether work is advancing, even if slowly:
public class ProgressMonitor {
private AtomicLong lastProgressTime = new AtomicLong(System.currentTimeMillis());
private AtomicLong itemsProcessed = new AtomicLong(0);
// Expected items per measurement window, set from configuration
private long expectedRate = 1000;
public void recordProgress() {
itemsProcessed.incrementAndGet();
lastProgressTime.set(System.currentTimeMillis());
}
public SystemState diagnose() {
long timeSinceProgress = System.currentTimeMillis() - lastProgressTime.get();
long rate = itemsProcessed.get();
if (timeSinceProgress > 60000) {
// No progress in 60 seconds
return SystemState.BROKEN;
} else if (rate < expectedRate * 0.1) {
// Processing at <10% expected rate
return SystemState.SEVERELY_DEGRADED;
} else if (rate < expectedRate * 0.5) {
return SystemState.SLOW;
} else {
return SystemState.HEALTHY;
}
}
}
The Canary Request Pattern
Send synthetic test requests to detect failure modes:
type HealthCheck struct {
endpoint string
interval time.Duration
}
func (hc *HealthCheck) ContinuousMonitor() {
ticker := time.NewTicker(hc.interval)
for range ticker.C {
start := time.Now()
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
req, _ := http.NewRequestWithContext(ctx, http.MethodGet, hc.endpoint, nil)
resp, err := http.DefaultClient.Do(req) // honors the 5s deadline, unlike plain http.Get
duration := time.Since(start)
cancel()
if err != nil {
if errors.Is(err, context.DeadlineExceeded) {
log.Error("BROKEN: Request timed out")
alerting.Trigger("service_broken")
} else {
log.Error("BROKEN: Request failed", err)
alerting.Trigger("service_broken")
}
continue // resp is nil on error; skip the Body.Close below
} else if duration > 1*time.Second {
log.Warn("SLOW: Response took", duration)
metrics.RecordLatency("canary", duration)
} else {
log.Debug("HEALTHY: Fast response")
}
resp.Body.Close()
}
}
Real-World Examples
Example 1: Database Connection Pool Exhaustion (SLOW)
Scenario: Web application suddenly starts responding in 10+ seconds instead of 200ms.
# Symptom: Requests queuing, eventually completing
import psycopg2.pool
import time
class DatabaseService:
def __init__(self):
# Pool with only 10 connections
self.pool = psycopg2.pool.SimpleConnectionPool(1, 10,
host="db.prod", database="users")
def get_user(self, user_id):
start = time.time()
# With 100 concurrent requests, 90 wait here
conn = self.pool.getconn() # SLOW: Waits for available connection
wait_time = time.time() - start
if wait_time > 1.0:
print(f"β οΈ SLOW: Waited {wait_time}s for connection")
try:
cursor = conn.cursor()
cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
result = cursor.fetchone()
return result # Eventually succeeds
finally:
self.pool.putconn(conn)
Diagnosis:
- ✅ Requests complete successfully
- ✅ No errors in logs
- ⚠️ High latency (P99 = 12s)
- ⚠️ Connection pool metrics show saturation
Verdict: SLOW - System making progress, just resource-constrained.
Fix: Increase connection pool size or reduce query duration.
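A sketch of the first half of that fix, assuming psycopg2's ThreadedConnectionPool (the thread-safe variant); the pool sizes are illustrative and must stay below the database's own connection limit:
import psycopg2.pool

# Larger, thread-safe pool so concurrent requests rarely queue for a connection.
pool = psycopg2.pool.ThreadedConnectionPool(
    minconn=10,
    maxconn=50,          # illustrative; keep below the DB's max_connections
    host="db.prod",
    database="users",
)
Raising the pool only helps if the database can absorb the extra connections; otherwise, reducing query duration (indexes, narrower SELECTs) is the better lever.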
Example 2: Distributed Deadlock (BROKEN)
Scenario: Microservice completely stops processing orders.
// Service A
async function processOrder(orderId) {
const orderLock = await redis.lock(`order:${orderId}`);
// Call Service B while still holding the order lock
const inventory = await serviceB.reserveInventory(orderId);
await orderLock.unlock();
}
// Service B
async function reserveInventory(orderId) {
const inventoryLock = await redis.lock(`inventory:${orderId}`);
// Calls back to Service A to verify order - DEADLOCK!
const order = await serviceA.getOrderDetails(orderId);
await inventoryLock.unlock();
}
Diagnosis:
- ❌ No requests completing
- ❌ All threads blocked
- ❌ Locks held indefinitely
- ❌ System appears active (CPU usage) but makes zero progress
Verdict: BROKEN - Circular dependency creates permanent deadlock.
Fix: Redesign lock acquisition order or use timeouts on locks.
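A sketch of the first option: every service acquires locks in one global order (here, sorted by name) and with a TTL, so circular waits cannot form and a stuck holder cannot block others forever. The lock_client.acquire API is a hypothetical stand-in, not a specific Redis client:
def acquire_in_order(lock_client, names, ttl_seconds=5):
    """Acquire several named locks in a fixed global order, each with a TTL."""
    held = []
    try:
        for name in sorted(names):              # same order everywhere -> no cycles
            held.append(lock_client.acquire(name, ttl=ttl_seconds))
        return held
    except Exception:
        for lock in reversed(held):             # release anything we did get
            lock.release()
        raise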
Example 3: Memory Leak Leading to GC Thrashing (SLOW → BROKEN)
Scenario: Java application gradually degrades over 6 hours, then becomes unresponsive.
public class CacheService {
// Memory leak: cache never evicts
private static Map<String, byte[]> cache = new HashMap<>();
public byte[] getData(String key) {
if (!cache.containsKey(key)) {
byte[] data = fetchFromDatabase(key);
cache.put(key, data); // LEAK: Grows forever
}
return cache.get(key);
}
}
// GC behavior over time:
// Hour 1: Minor GC every 10s (10ms pause) - HEALTHY
// Hour 3: Minor GC every 5s (50ms pause) - SLOW
// Hour 5: Full GC every 30s (5s pause) - SEVERELY SLOW
// Hour 6: Continuous Full GC (permanent pause) - BROKEN
Diagnosis timeline:
| Time | Heap Usage | GC Behavior | State |
|---|---|---|---|
| 0-2h | 40% | Quick minor GCs | ✅ Healthy |
| 2-4h | 70% | Frequent GCs, rising latency | ⚠️ Slow |
| 4-6h | 95% | Constant full GCs | 🔥 Critical |
| 6h+ | 99% | GC death spiral | 💀 Broken |
Verdict: Starts SLOW (high latency but functioning), transitions to BROKEN (GC thrashing prevents any useful work).
Fix: Add cache eviction policy (LRU) with size limits.
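The lesson's example is Java, where a bounded cache library (for example an LRU map) plays this role; as a language-neutral sketch, here is the same idea in Python on top of OrderedDict:
from collections import OrderedDict

class BoundedLRUCache:
    """Size-limited cache: evicts the least recently used entry when full."""

    def __init__(self, max_entries=10_000):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key, loader):
        if key in self._data:
            self._data.move_to_end(key)       # mark as recently used
            return self._data[key]
        value = loader(key)                   # e.g. fetch_from_database
        self._data[key] = value
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)    # evict the oldest entry
        return value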
Example 4: Network Partition (BROKEN)
Scenario: Service can't reach database, retries infinitely.
use std::time::Duration;
use tokio::time::sleep;
async fn fetch_user_data(user_id: u64) -> Result<User, Error> {
let mut attempts = 0;
loop {
attempts += 1;
match database::query_user(user_id).await {
Ok(user) => return Ok(user),
Err(e) if e.is_network_error() => {
// Network partition: database unreachable
println!("Attempt {} failed: {:?}", attempts, e);
sleep(Duration::from_secs(1)).await;
// BROKEN: Retries forever, never succeeds
}
Err(e) => return Err(e),
}
}
}
Diagnosis:
- ❌ 100% error rate
- ❌ Connection refused / timeout
- ❌ No successful requests
- ✅ Service itself is running
Verdict: BROKEN - Complete inability to fulfill requests, despite service being "up".
Fix: Implement circuit breaker, fail fast instead of infinite retry.
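The circuit breaker is shown in the next section; the fail-fast half of the fix is simply to bound the retries. A minimal sketch with illustrative attempt counts and delays, using exponential backoff with jitter:
import random
import time

def fetch_with_bounded_retries(query, max_attempts=4, base_delay=0.2):
    """Retry a flaky call a few times, then give up so the caller can fail over."""
    for attempt in range(1, max_attempts + 1):
        try:
            return query()
        except ConnectionError:
            if attempt == max_attempts:
                raise                         # fail fast instead of looping forever
            # exponential backoff with jitter to avoid synchronized retry storms
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            time.sleep(delay)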
Common Mistakes ⚠️
Mistake 1: Treating Slow as Broken (Premature Failover)
# ❌ WRONG: Immediate failover on first slow response
def call_service(request):
try:
response = service.call(request, timeout=0.5)
return response
except Timeout:
# Assumes broken, switches to backup immediately
print("Primary broken! Failing over...")
return backup_service.call(request)
Why it's wrong: Temporary slowness (e.g., GC pause, CPU spike) triggers unnecessary failover, potentially overloading the backup.
# ✅ RIGHT: Circuit breaker with failure threshold
class CircuitBreaker:
def __init__(self, failure_threshold=5, timeout=10):  # timeout: how long to stay open before retrying (half-open logic omitted)
self.failures = 0
self.threshold = failure_threshold
self.state = "closed" # closed = normal, open = broken
def call(self, func, *args):
if self.state == "open":
raise Exception("Circuit open - service broken")
try:
result = func(*args)
self.failures = 0 # Reset on success
return result
except Exception:
self.failures += 1
if self.failures >= self.threshold:
self.state = "open" # Multiple failures = broken
print("Circuit opened - service is broken")
raise
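Usage looks roughly like this, reusing service and backup_service from the WRONG example above; only a circuit that has opened after repeated failures justifies failing over:
breaker = CircuitBreaker(failure_threshold=5)

def call_with_fallback(request):
    if breaker.state == "open":
        # Repeated failures opened the circuit: treat the primary as broken.
        return backup_service.call(request)
    # A single slow or failed call just counts toward the threshold.
    return breaker.call(service.call, request)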
Mistake 2: Treating Broken as Slow (Waiting Forever)
// ❌ WRONG: No timeout, waits indefinitely
func fetchData(url string) ([]byte, error) {
resp, err := http.Get(url) // No timeout!
if err != nil {
return nil, err
}
defer resp.Body.Close()
// If server is broken and never responds, this hangs forever
return ioutil.ReadAll(resp.Body)
}
Why it's wrong: Broken services consume resources (goroutines, connections) indefinitely, which can cascade into a wider failure.
// ✅ RIGHT: Aggressive timeouts and failure detection
func fetchData(url string) ([]byte, error) {
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
req, err := http.NewRequestWithContext(ctx, "GET", url, nil)
if err != nil {
return nil, err
}
client := &http.Client{
Timeout: 5 * time.Second,
}
resp, err := client.Do(req)
if err != nil {
// Timeout or connection error = broken
return nil, fmt.Errorf("service broken: %w", err)
}
defer resp.Body.Close()
return ioutil.ReadAll(resp.Body)
}
Mistake 3: Ignoring the Gray Zone
// ❌ WRONG: Binary classification
function diagnoseSystem(latency) {
if (latency < 1000) {
return 'healthy';
} else {
return 'broken';
}
// Misses the critical "degraded" state
}
Why it's wrong: Real systems have gradual degradation. A 3-second response isn't "broken" but isn't healthy either.
// ✅ RIGHT: Multi-level health states
function diagnoseSystem(latency, errorRate) {
if (errorRate > 0.5) {
return 'broken'; // >50% errors = broken
} else if (latency > 10000) {
return 'effectively_broken'; // >10s = unusable
} else if (latency > 2000 || errorRate > 0.1) {
return 'degraded'; // Slow but functioning
} else if (latency > 500) {
return 'slow'; // Noticeable but acceptable
} else {
return 'healthy';
}
}
Mistake 4: Not Measuring Progress
# ❌ WRONG: Only measuring latency
class QueueProcessor:
def __init__(self):
self.queue = Queue()
def process(self):
while True:
item = self.queue.get() # Might block forever if broken
start = time.time()
handle_item(item)
latency = time.time() - start
metrics.record('latency', latency)
Why it's wrong: If queue.get() blocks forever (queue broken), latency metrics never record anything. System appears fine (no high latency reported) but makes zero progress.
# ✅ RIGHT: Track throughput and progress
import queue
import time
from queue import Queue
class QueueProcessor:
def __init__(self):
self.queue = Queue()
self.items_processed = 0
self.last_process_time = time.time()
def process(self):
while True:
try:
# Timeout allows detecting stuck queue
item = self.queue.get(timeout=10)
start = time.time()
handle_item(item)
duration = time.time() - start
self.items_processed += 1
self.last_process_time = time.time()
metrics.record('latency', duration)
metrics.record('throughput', self.items_processed)
except queue.Empty:
# No progress in 10s - check if broken
time_since_progress = time.time() - self.last_process_time
if time_since_progress > 60:
alert.trigger('queue_processor_broken')
Decision Framework: Your Action Matrix 🎯
Slow vs Broken Decision Tree
Are requests completing?
- YES → Is latency acceptable?
  - YES → ✅ HEALTHY: keep monitoring
  - NO → ⚠️ SLOW: investigate, scale up
- NO → Are errors retriable?
  - YES → 🔥 DEGRADED: retry with backoff
  - NO → 💀 BROKEN: fail over immediately
| Symptom | Classification | Action | Urgency |
|---|---|---|---|
| Requests succeed, latency 2x normal | SLOW | Investigate, scale resources | Medium |
| Requests timeout after 30s | BROKEN | Circuit break, failover | Critical |
| 50% success rate, slow responses | DEGRADED | Partial failover, shed load | High |
| 100% error rate (500s) | BROKEN | Stop traffic, rollback | Critical |
| Queue growing but draining | SLOW | Add workers, optimize | Medium |
| Queue growing, workers idle | BROKEN | Restart workers, fix deadlock | High |
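The tree and the table translate directly into code. A minimal sketch; the three boolean inputs would come from whatever health checks you already run:
def classify_system(requests_completing, latency_acceptable, errors_retriable):
    """Direct transcription of the decision tree above."""
    if requests_completing:
        if latency_acceptable:
            return "HEALTHY", "keep monitoring"
        return "SLOW", "investigate, scale up"
    if errors_retriable:
        return "DEGRADED", "retry with backoff"
    return "BROKEN", "fail over immediately"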
Key Metrics for Diagnosis
The Golden Signals (Modified for Slow vs Broken)
class SystemHealth:
"""
Essential metrics for diagnosing slow vs broken
"""
def __init__(self, expected_throughput=1200):
    self.expected_throughput = expected_throughput  # baseline req/s this service normally sustains
    self.metrics = {
'latency_p50': 0,
'latency_p99': 0,
'latency_p999': 0,
'error_rate': 0,
'success_rate': 0,
'throughput': 0,
'timeout_rate': 0,
}
def diagnose(self):
# Decision logic based on metrics
if self.metrics['success_rate'] < 0.01:
return "BROKEN", "<1% success rate"
if self.metrics['timeout_rate'] > 0.5:
return "BROKEN", ">50% timeouts"
if self.metrics['latency_p99'] > 30000: # 30s
return "EFFECTIVELY_BROKEN", "P99 > 30s"
if self.metrics['error_rate'] > 0.1:
return "DEGRADED", ">10% errors"
if self.metrics['latency_p99'] > 2000: # 2s
return "SLOW", "P99 > 2s"
if self.metrics['throughput'] < self.expected_throughput * 0.5:
return "SLOW", "Throughput <50% expected"
return "HEALTHY", "All metrics nominal"
Monitoring Dashboard Layout
SYSTEM HEALTH DASHBOARD (example)
STATUS: 🟡 SLOW
Latency (last 5m): P50 450ms | P99 2.1s ⚠️ above threshold | P999 8.5s
Success rate: 98.5% | Error rate: 1.5% | Timeout rate: 0.2%
Throughput: 850 req/s (expected: 1200 req/s)
Active connections: 245 / 300
⚠️ DIAGNOSIS: System is SLOW, not broken
- Requests completing successfully
- Latency elevated but not timing out
- Likely cause: resource saturation
Key Takeaways
Quick Reference Card
| Aspect | SLOW | BROKEN |
|---|---|---|
| Progress | Making forward progress | No meaningful progress |
| Completion | Requests eventually succeed | Requests timeout or error |
| Errors | Low error rate (<10%) | High error rate (>50%) |
| Latency | High but bounded | Infinite (timeouts) |
| Response | Optimize, scale resources | Failover, circuit break |
| Urgency | Medium - investigate | Critical - immediate action |
| Recovery | May self-recover | Requires intervention |
Critical Decision Points
- First 30 seconds: Can ANY request complete successfully?
- Timeout test: Set aggressive timeout (5s), does it ever succeed?
- Progress check: Is throughput > 0 or completely stalled?
- Error pattern: Retriable errors (slow) or fatal errors (broken)?
- Resource state: Saturated (slow) or deadlocked (broken)?
🧠 Remember
- Slow can become broken: Degradation has a tipping point
- Measure progress, not just latency: Zero throughput = broken
- Time matters: A 30-second response is effectively broken for most users
- Fail fast on broken: Don't retry indefinitely
- Be patient with slow: Allow time for recovery before failover
Further Study
- Google SRE Book - Monitoring Distributed Systems - Golden signals and effective monitoring
- Martin Fowler - Circuit Breaker Pattern - Handling failure gracefully
- AWS - Timeouts, Retries and Backoff with Jitter - Practical patterns for distributed systems
💡 Pro tip: In production, always instrument your systems to distinguish between "no requests" (broken intake), "requests hanging" (broken processing), and "requests slow" (degraded performance). These require completely different debugging approaches!