
Slow vs Broken

Distinguishing degraded performance from actual failures

Slow vs Broken: Debugging Distributed Systems

Master the critical distinction between slow and broken systems with free flashcards and spaced repetition practice. This lesson covers performance degradation patterns, failure detection strategies, and diagnostic techniques: essential concepts for debugging distributed systems under pressure. Understanding whether a system is merely slow or completely broken determines your entire debugging approach and incident response strategy.

💡 Why This Matters: In distributed systems, misdiagnosing slowness as failure (or vice versa) can lead to catastrophic decisions like unnecessary failovers, data loss, or prolonged outages. A slow system may recover on its own; a broken one will not recover without intervention.

Welcome to the Chaos Zone 🌪️

When your monitoring dashboard lights up like a Christmas tree at 3 AM, the first question isn't "What's wrong?" It's "Is it slow or is it broken?" This distinction is the foundation of effective debugging under pressure.

In practice, distributed systems exhibit failure modes that fall into two broad categories:

  • Slow: High latency, degraded throughput, but still processing requests
  • Broken: Complete failure, no progress, data corruption, or infinite loops

The difference isn't always obvious. A system that still responds, but only after users have given up waiting, is effectively broken for them. A system that appears broken might just be overwhelmed.

Core Concepts: The Fundamental Distinction 🎯

What Makes a System "Slow"? 🐌

Slow systems are still making forward progress, just not at the expected rate. Key characteristics:

Indicator | What You See | Why It Matters
Requests Complete | Eventually return success | System is functional, just degraded
Latency Increase | P99 goes from 100ms to 5s | Performance issue, not availability
Partial Success | Some requests fast, others slow | Indicates resource contention
Queue Growth | Work backlog increases steadily | Throughput < arrival rate

Example slow system behaviors:

# Slow database queries - still completing
def get_user_data(user_id):
    start = time.time()
    result = db.query("SELECT * FROM users WHERE id = ?", user_id)
    duration = time.time() - start
    
    if duration > 1.0:  # Slow but not broken
        log.warning(f"Slow query: {duration}s")
    
    return result  # Eventually returns

// API endpoint experiencing high latency
app.get('/api/orders', async (req, res) => {
  const start = Date.now();
  
  // Takes 8 seconds instead of 200ms
  const orders = await Order.find({ userId: req.user.id });
  
  const duration = Date.now() - start;
  metrics.recordLatency('orders.fetch', duration);
  
  res.json(orders);  // Responds eventually
});

What Makes a System "Broken"? 💔

Broken systems have stopped making meaningful progress. Critical indicators:

Indicator | What You See | Immediate Action
Requests Timeout | Never complete, hang indefinitely | Circuit breaker, failover
Error Spike | 500s, connection refused, crashes | Stop traffic, investigate
Data Corruption | Invalid responses, wrong results | Immediate rollback
Deadlock/Livelock | No progress despite CPU activity | Thread dump, force restart

Example broken system behaviors:

// Deadlock - system completely stuck
func ProcessOrder(orderID string) error {
    mu1.Lock()
    defer mu1.Unlock()
    
    // Another goroutine has mu2 and waits for mu1
    mu2.Lock()  // DEADLOCK - never proceeds
    defer mu2.Unlock()
    
    // This code never executes
    return saveOrder(orderID)
}

// Panic causing complete service failure
fn handle_request(data: &str) -> Result<Response, Error> {
    let parsed = data.parse::<i32>().unwrap();  // PANIC on invalid input
    
    // Service crashes, no recovery
    Ok(Response { value: parsed })
}

# Infinite retry loop - appears active but makes no progress
def send_notification(user_id, message):
    while True:  # BROKEN: Never exits on permanent failure
        try:
            api.send(user_id, message)
            break
        except NetworkError:
            time.sleep(1)  # Retries forever even if user deleted

The Gray Zone: When Slow Becomes Broken ⚠️

The most dangerous scenarios are borderline cases:

┌──────────────────────────────────────────────────────┐
│                SYSTEM STATE SPECTRUM                 │
├──────────────────────────────────────────────────────┤
│                                                      │
│  ✅ Healthy  →  ⚠️ Degraded  →  🔥 Critical  →  💀 Dead │
│                                                      │
│  50ms p99   →   500ms p99   →   30s p99   →  Timeout │
│                                                      │
│              ↑                              ↑        │
│         SLOW ZONE                    BROKEN ZONE     │
│                                                      │
└──────────────────────────────────────────────────────┘

💡 Critical threshold: When timeout duration exceeds user/system tolerance, "slow" effectively becomes "broken" regardless of technical progress.
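
A minimal sketch of that idea, classifying a single observation against an assumed user-tolerance budget (the thresholds below are illustrative, not universal):

def classify_latency(observed_s, slo_s=1.0, tolerance_s=10.0):
    """Map one observed latency onto the slow/broken spectrum."""
    if observed_s <= slo_s:
        return "healthy"
    if observed_s <= tolerance_s:
        return "slow"                # degraded but still useful
    return "effectively_broken"      # technically progressing, practically useless

# A 30s query may "complete", but a user who gives up after 10s never sees it
print(classify_latency(30.0))  # -> effectively_broken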

Diagnostic Techniques: Detective Work 🔍

The Timeout Test Pattern

First rule of distributed debugging: Test with aggressive timeouts

import requests
from requests.exceptions import Timeout

def diagnose_service_health(endpoint):
    """
    Quick test: Is it slow or broken?
    """
    short_timeout = 1.0  # Fail fast
    long_timeout = 30.0   # Wait and see
    
    # Test 1: Can it complete quickly?
    try:
        response = requests.get(endpoint, timeout=short_timeout)
        print("✅ HEALTHY: Fast response")
        return "healthy"
    except Timeout:
        print("⚠️ Might be slow or broken...")
    
    # Test 2: Can it complete eventually?
    try:
        response = requests.get(endpoint, timeout=long_timeout)
        print("🐌 SLOW: Completed but took >1s")
        return "slow"
    except Timeout:
        print("💔 BROKEN: Cannot complete even with 30s")
        return "broken"
    except Exception as e:
        print(f"💀 BROKEN: Error - {e}")
        return "broken"

Progress Monitoring Pattern

Track whether work is advancing, even if slowly:

public class ProgressMonitor {
    private AtomicLong lastProgressTime = new AtomicLong(System.currentTimeMillis());
    private AtomicLong itemsProcessed = new AtomicLong(0);
    private long expectedRate;  // expected items per monitoring window, set from historical throughput
    
    public void recordProgress() {
        itemsProcessed.incrementAndGet();
        lastProgressTime.set(System.currentTimeMillis());
    }
    
    public SystemState diagnose() {
        long timeSinceProgress = System.currentTimeMillis() - lastProgressTime.get();
        long rate = itemsProcessed.get();
        
        if (timeSinceProgress > 60000) {
            // No progress in 60 seconds
            return SystemState.BROKEN;
        } else if (rate < expectedRate * 0.1) {
            // Processing at <10% expected rate
            return SystemState.SEVERELY_DEGRADED;
        } else if (rate < expectedRate * 0.5) {
            return SystemState.SLOW;
        } else {
            return SystemState.HEALTHY;
        }
    }
}

The Canary Request Pattern

Send synthetic test requests to detect failure modes:

type HealthCheck struct {
    endpoint string
    interval time.Duration
}

func (hc *HealthCheck) ContinuousMonitor() {
    ticker := time.NewTicker(hc.interval)

    for range ticker.C {
        start := time.Now()
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)

        // Attach the context so the 5s deadline actually cancels the request
        req, _ := http.NewRequestWithContext(ctx, http.MethodGet, hc.endpoint, nil)
        resp, err := http.DefaultClient.Do(req)
        duration := time.Since(start)
        cancel()

        if err != nil {
            if errors.Is(err, context.DeadlineExceeded) {
                log.Error("BROKEN: Request timed out")
            } else {
                log.Error("BROKEN: Request failed", err)
            }
            alerting.Trigger("service_broken")
            continue // no response body to close on error
        }

        if duration > 1*time.Second {
            log.Warn("SLOW: Response took", duration)
            metrics.RecordLatency("canary", duration)
        } else {
            log.Debug("HEALTHY: Fast response")
        }

        resp.Body.Close()
    }
}

Real-World Examples 🌍

Example 1: Database Connection Pool Exhaustion (SLOW)

Scenario: Web application suddenly starts responding in 10+ seconds instead of 200ms.

# Symptom: Requests queuing, eventually completing
import psycopg2.pool
import time

class DatabaseService:
    def __init__(self):
        # Pool with only 10 connections
        self.pool = psycopg2.pool.SimpleConnectionPool(1, 10,
            host="db.prod", database="users")
    
    def get_user(self, user_id):
        start = time.time()
        
        # With 100 concurrent requests, 90 wait here
        conn = self.pool.getconn()  # SLOW: Waits for available connection
        wait_time = time.time() - start
        
        if wait_time > 1.0:
            print(f"⚠️ SLOW: Waited {wait_time}s for connection")
        
        try:
            cursor = conn.cursor()
            cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
            result = cursor.fetchone()
            return result  # Eventually succeeds
        finally:
            self.pool.putconn(conn)

Diagnosis:

  • ✅ Requests complete successfully
  • ✅ No errors in logs
  • ⚠️ High latency (P99 = 12s)
  • ⚠️ Connection pool metrics show saturation

Verdict: SLOW - System making progress, just resource-constrained.

Fix: Increase connection pool size or reduce query duration.
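
One hedged way to express that fix is a queue-backed pool that makes the waiting explicit and bounded; make_connection is a hypothetical factory standing in for your database driver, and the sizes/timeouts are illustrative:

import queue

class BlockingPool:
    """Tiny connection pool: callers wait up to wait_timeout for a free connection."""
    def __init__(self, size, factory, wait_timeout=2.0):
        self._free = queue.Queue(maxsize=size)
        self._wait_timeout = wait_timeout
        for _ in range(size):
            self._free.put(factory())

    def acquire(self):
        # Raises queue.Empty if saturation outlasts the timeout,
        # turning silent slowness into a visible, bounded failure
        return self._free.get(timeout=self._wait_timeout)

    def release(self, conn):
        self._free.put(conn)

# Sized closer to peak concurrency instead of a fixed 10:
# pool = BlockingPool(size=50, factory=make_connection)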

Example 2: Distributed Deadlock (BROKEN)

Scenario: Microservice completely stops processing orders.

// Service A
async function processOrder(orderId) {
  const orderLock = await redis.lock(`order:${orderId}`);
  
  // Call Service B, which tries to lock the same order
  const inventory = await serviceB.reserveInventory(orderId);
  
  await orderLock.unlock();
}

// Service B
async function reserveInventory(orderId) {
  const inventoryLock = await redis.lock(`inventory:${orderId}`);
  
  // Calls back to Service A to verify order - DEADLOCK!
  const order = await serviceA.getOrderDetails(orderId);
  
  await inventoryLock.unlock();
}

Diagnosis:

  • ❌ No requests completing
  • ❌ All threads blocked
  • ❌ Locks held indefinitely
  • ❌ System appears active (CPU usage) but makes zero progress

Verdict: BROKEN - Circular dependency creates permanent deadlock.

Fix: Redesign lock acquisition order or use timeouts on locks.
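
A minimal sketch of both remedies using Python's threading locks: acquire locks in one global order so a circular wait cannot form, and bound each acquisition so a would-be deadlock becomes a retriable error (save_order is a hypothetical persistence call; timeouts are illustrative):

import threading

order_lock = threading.Lock()
inventory_lock = threading.Lock()

def process_order(order_id):
    # Every code path takes order_lock before inventory_lock (consistent ordering)
    if not order_lock.acquire(timeout=5):
        raise TimeoutError("could not get order lock - back off instead of deadlocking")
    try:
        if not inventory_lock.acquire(timeout=5):
            raise TimeoutError("could not get inventory lock - back off and retry")
        try:
            return save_order(order_id)
        finally:
            inventory_lock.release()
    finally:
        order_lock.release()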

Example 3: Memory Leak Leading to GC Thrashing (SLOW → BROKEN)

Scenario: Java application gradually degrades over 6 hours, then becomes unresponsive.

public class CacheService {
    // Memory leak: cache never evicts
    private static Map<String, byte[]> cache = new HashMap<>();
    
    public byte[] getData(String key) {
        if (!cache.containsKey(key)) {
            byte[] data = fetchFromDatabase(key);
            cache.put(key, data);  // LEAK: Grows forever
        }
        return cache.get(key);
    }
}

// GC behavior over time:
// Hour 1: Minor GC every 10s (10ms pause) - HEALTHY
// Hour 3: Minor GC every 5s (50ms pause) - SLOW
// Hour 5: Full GC every 30s (5s pause) - SEVERELY SLOW
// Hour 6: Continuous Full GC (permanent pause) - BROKEN

Diagnosis timeline:

Time | Heap Usage | GC Behavior | State
0-2h | 40% | Quick minor GCs | ✅ Healthy
2-4h | 70% | Frequent GCs, rising latency | ⚠️ Slow
4-6h | 95% | Constant full GCs | 🔥 Critical
6h+ | 99% | GC death spiral | 💀 Broken

Verdict: Starts SLOW (high latency but functioning), transitions to BROKEN (GC thrashing prevents any useful work).

Fix: Add cache eviction policy (LRU) with size limits.
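
A hedged sketch of that fix in Python terms: a cache bounded by size that evicts the least recently used entry (the Java service above would typically reach for LinkedHashMap in access order or a caching library; the size limit here is illustrative):

from collections import OrderedDict

class BoundedLRUCache:
    def __init__(self, max_entries=10_000):
        self._max = max_entries
        self._data = OrderedDict()

    def get(self, key, loader):
        if key in self._data:
            self._data.move_to_end(key)        # mark as recently used
            return self._data[key]
        value = loader(key)                    # e.g. fetch_from_database
        self._data[key] = value
        if len(self._data) > self._max:
            self._data.popitem(last=False)     # evict least recently used
        return value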

Example 4: Network Partition (BROKEN)

Scenario: Service can't reach database, retries infinitely.

use std::time::Duration;
use tokio::time::sleep;

async fn fetch_user_data(user_id: u64) -> Result<User, Error> {
    let mut attempts = 0;
    
    loop {
        attempts += 1;
        
        match database::query_user(user_id).await {
            Ok(user) => return Ok(user),
            Err(e) if e.is_network_error() => {
                // Network partition: database unreachable
                println!("Attempt {} failed: {:?}", attempts, e);
                sleep(Duration::from_secs(1)).await;
                // BROKEN: Retries forever, never succeeds
            }
            Err(e) => return Err(e),
        }
    }
}

Diagnosis:

  • ❌ 100% error rate
  • ❌ Connection refused / timeout
  • ❌ No successful requests
  • ✅ Service itself is running

Verdict: BROKEN - Complete inability to fulfill requests, despite service being "up".

Fix: Implement circuit breaker, fail fast instead of infinite retry.
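
A minimal sketch of the fail-fast half of that fix: cap retries with an attempt budget and a deadline, then surface the error instead of looping forever (send and the exception type are stand-ins for your client; a fuller circuit breaker appears under Common Mistakes below):

import time

def call_with_retry_budget(send, max_attempts=3, deadline_s=10.0):
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except ConnectionError as exc:   # stand-in for a transient network error
            out_of_budget = (attempt == max_attempts or
                             time.monotonic() - start > deadline_s)
            if out_of_budget:
                # Give up: let callers (and circuit breakers) see the failure
                raise RuntimeError("dependency unreachable - failing fast") from exc
            time.sleep(min(2 ** attempt, 5))  # bounded exponential backoff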

Common Mistakes ⚠️

Mistake 1: Treating Slow as Broken (Premature Failover)

# ❌ WRONG: Immediate failover on first slow response
def call_service(request):
    try:
        response = service.call(request, timeout=0.5)
        return response
    except Timeout:
        # Assumes broken, switches to backup immediately
        print("Primary broken! Failing over...")
        return backup_service.call(request)

Why it's wrong: Temporary slowness (e.g., GC pause, CPU spike) triggers unnecessary failover, potentially overloading backup.

# ✅ RIGHT: Circuit breaker with failure threshold
class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=10):
        self.failures = 0
        self.threshold = failure_threshold
        self.state = "closed"  # closed = normal, open = broken
    
    def call(self, func, *args):
        if self.state == "open":
            raise Exception("Circuit open - service broken")
        
        try:
            result = func(*args)
            self.failures = 0  # Reset on success
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.state = "open"  # Multiple failures = broken
                print("Circuit opened - service is broken")
            raise
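
For example, routing calls through the breaker so that failover only happens once sustained failures have opened the circuit (service.call and backup_service.call are taken from the surrounding example):

breaker = CircuitBreaker(failure_threshold=5)

def call_service(request):
    try:
        return breaker.call(service.call, request)
    except Exception:
        if breaker.state == "open":
            # Sustained failures: primary is treated as broken, use the backup
            return backup_service.call(request)
        raise  # a single slow or flaky call is surfaced, not failed over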

Mistake 2: Treating Broken as Slow (Waiting Forever)

// ❌ WRONG: No timeout, waits indefinitely
func fetchData(url string) ([]byte, error) {
    resp, err := http.Get(url)  // No timeout!
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    
    // If server is broken and never responds, this hangs forever
    return ioutil.ReadAll(resp.Body)
}

Why it's wrong: Broken services consume resources (goroutines, connections) indefinitely, cascading failure.

// ✅ RIGHT: Aggressive timeouts and failure detection
func fetchData(url string) ([]byte, error) {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    
    req, err := http.NewRequestWithContext(ctx, "GET", url, nil)
    if err != nil {
        return nil, err
    }
    
    client := &http.Client{
        Timeout: 5 * time.Second,
    }
    
    resp, err := client.Do(req)
    if err != nil {
        // Timeout or connection error = broken
        return nil, fmt.Errorf("service broken: %w", err)
    }
    defer resp.Body.Close()
    
    return ioutil.ReadAll(resp.Body)
}

Mistake 3: Ignoring the Gray Zone

// ❌ WRONG: Binary classification
function diagnoseSystem(latency) {
  if (latency < 1000) {
    return 'healthy';
  } else {
    return 'broken';
  }
  // Misses the critical "degraded" state
}

Why it's wrong: Real systems have gradual degradation. A 3-second response isn't "broken" but isn't healthy either.

// ✅ RIGHT: Multi-level health states
function diagnoseSystem(latency, errorRate) {
  if (errorRate > 0.5) {
    return 'broken';  // >50% errors = broken
  } else if (latency > 10000) {
    return 'effectively_broken';  // >10s = unusable
  } else if (latency > 2000 || errorRate > 0.1) {
    return 'degraded';  // Slow but functioning
  } else if (latency > 500) {
    return 'slow';  // Noticeable but acceptable
  } else {
    return 'healthy';
  }
}

Mistake 4: Not Measuring Progress

# ❌ WRONG: Only measuring latency
class QueueProcessor:
    def __init__(self):
        self.queue = Queue()
    
    def process(self):
        while True:
            item = self.queue.get()  # Might block forever if broken
            start = time.time()
            handle_item(item)
            latency = time.time() - start
            metrics.record('latency', latency)

Why it's wrong: If queue.get() blocks forever (queue broken), latency metrics never record anything. System appears fine (no high latency reported) but makes zero progress.

# ✅ RIGHT: Track throughput and progress
class QueueProcessor:
    def __init__(self):
        self.queue = Queue()
        self.items_processed = 0
        self.last_process_time = time.time()
    
    def process(self):
        while True:
            try:
                # Timeout allows detecting stuck queue
                item = self.queue.get(timeout=10)
                
                start = time.time()
                handle_item(item)
                duration = time.time() - start
                
                self.items_processed += 1
                self.last_process_time = time.time()
                
                metrics.record('latency', duration)
                metrics.record('throughput', self.items_processed)
                
            except queue.Empty:
                # No progress in 10s - check if broken
                time_since_progress = time.time() - self.last_process_time
                if time_since_progress > 60:
                    alert.trigger('queue_processor_broken')

Decision Framework: Your Action Matrix 🎯

📋 Slow vs Broken Decision Tree

                    ┌──────────────────┐
                    │ Are requests     │
                    │ completing?      │
                    └────────┬─────────┘
                             │
              ┌──────────────┴──────────────┐
              │                             │
           ┌──┴──┐                       ┌──┴──┐
           │ YES │                       │ NO  │
           └──┬──┘                       └──┬──┘
              │                             │
              ▼                             ▼
     ┌─────────────────┐         ┌──────────────────┐
     │ Is latency      │         │ Are errors       │
     │ acceptable?     │         │ retriable?       │
     └────────┬────────┘         └────────┬─────────┘
              │                           │
       ┌──────┴──────┐         ┌──────────┴──────────┐
       │             │         │                     │
    ┌──┴──┐       ┌──┴──┐   ┌──┴──┐               ┌──┴──┐
    │ YES │       │ NO  │   │ YES │               │ NO  │
    └──┬──┘       └──┬──┘   └──┬──┘               └──┬──┘
       │             │         │                     │
       ▼             ▼         ▼                     ▼
   ✅ HEALTHY    ⚠️ SLOW   🔥 DEGRADED         💀 BROKEN
       │             │         │                     │
       ▼             ▼         ▼                     ▼
    Monitor     Investigate  Retry with          Failover
                scale up     backoff             immediately

Symptom | Classification | Action | Urgency
Requests succeed, latency 2x normal | SLOW | Investigate, scale resources | Medium
Requests timeout after 30s | BROKEN | Circuit break, failover | Critical
50% success rate, slow responses | DEGRADED | Partial failover, shed load | High
100% error rate (500s) | BROKEN | Stop traffic, rollback | Critical
Queue growing but draining | SLOW | Add workers, optimize | Medium
Queue growing, workers idle | BROKEN | Restart workers, fix deadlock | High

Key Metrics for Diagnosis 📊

The Golden Signals (Modified for Slow vs Broken)

class SystemHealth:
    """
    Essential metrics for diagnosing slow vs broken
    """
    def __init__(self, expected_throughput=1000):
        # Baseline req/s the service normally sustains; used by diagnose()
        self.expected_throughput = expected_throughput
        self.metrics = {
            'latency_p50': 0,
            'latency_p99': 0,
            'latency_p999': 0,
            'error_rate': 0,
            'success_rate': 0,
            'throughput': 0,
            'timeout_rate': 0,
        }
    
    def diagnose(self):
        # Decision logic based on metrics
        if self.metrics['success_rate'] < 0.01:
            return "BROKEN", "<1% success rate"
        
        if self.metrics['timeout_rate'] > 0.5:
            return "BROKEN", ">50% timeouts"
        
        if self.metrics['latency_p99'] > 30000:  # 30s
            return "EFFECTIVELY_BROKEN", "P99 > 30s"
        
        if self.metrics['error_rate'] > 0.1:
            return "DEGRADED", ">10% errors"
        
        if self.metrics['latency_p99'] > 2000:  # 2s
            return "SLOW", "P99 > 2s"
        
        if self.metrics['throughput'] < self.expected_throughput * 0.5:
            return "SLOW", "Throughput <50% expected"
        
        return "HEALTHY", "All metrics nominal"

Monitoring Dashboard Layout

┌────────────────────────────────────────────────────────┐
│              SYSTEM HEALTH DASHBOARD                   │
├────────────────────────────────────────────────────────┤
│                                                        │
│  STATUS: 🟡 SLOW                                       │
│                                                        │
│  Latency (last 5m):                                    │
│    P50:  ████ 450ms                                    │
│    P99:  ███████████ 2.1s  ⚠️ Above threshold          │
│    P999: █████████████████ 8.5s                        │
│                                                        │
│  Success Rate: ████████████████████ 98.5% ✅           │
│  Error Rate:   ██ 1.5%                                 │
│  Timeout Rate: ▌ 0.2%                                  │
│                                                        │
│  Throughput: 850 req/s (expected: 1200 req/s)          │
│  Active Connections: 245 / 300                         │
│                                                        │
│  ⚠️ DIAGNOSIS: System is SLOW, not broken              │
│     - Requests completing successfully                 │
│     - Latency elevated but not timing out              │
│     - Likely cause: resource saturation                │
│                                                        │
└────────────────────────────────────────────────────────┘

Key Takeaways 🎓

📋 Quick Reference Card

Aspect | SLOW 🐌 | BROKEN 💔
Progress | Making forward progress | No meaningful progress
Completion | Requests eventually succeed | Requests timeout or error
Errors | Low error rate (<10%) | High error rate (>50%)
Latency | High but bounded | Unbounded (requests time out)
Response | Optimize, scale resources | Failover, circuit break
Urgency | Medium - investigate | Critical - immediate action
Recovery | May self-recover | Requires intervention

🔑 Critical Decision Points (a combined triage sketch follows this list)

  1. First 30 seconds: Can ANY request complete successfully?
  2. Timeout test: Set aggressive timeout (5s), does it ever succeed?
  3. Progress check: Is throughput > 0 or completely stalled?
  4. Error pattern: Retriable errors (slow) or fatal errors (broken)?
  5. Resource state: Saturated (slow) or deadlocked (broken)?
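
A minimal triage sketch under those assumptions; probe is any synthetic request (like diagnose_service_health above) that returns latency in seconds or raises on failure, and the item counters come from your own progress metrics:

def quick_triage(probe, items_now, items_one_minute_ago):
    try:
        latency_s = probe()
    except Exception:
        latency_s = None                 # probe errored or timed out

    making_progress = items_now > items_one_minute_ago

    if latency_s is None and not making_progress:
        return "broken"                  # nothing completes, nothing advances
    if latency_s is None:
        return "degraded"                # probes fail but background work moves
    if latency_s > 5.0:                  # illustrative tolerance
        return "slow"
    return "healthy"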

🧠 Remember

  • Slow can become broken: Degradation has a tipping point
  • Measure progress, not just latency: Zero throughput = broken
  • Time matters: A 30-second response is effectively broken for most users
  • Fail fast on broken: Don't retry indefinitely
  • Be patient with slow: Allow time for recovery before failover

📚 Further Study

💡 Pro tip: In production, always instrument your systems to distinguish between "no requests" (broken intake), "requests hanging" (broken processing), and "requests slow" (degraded performance). These require completely different debugging approaches!