Slow vs Broken
Distinguishing degraded performance from actual failures
Slow vs Broken: Debugging Distributed Systems
Master the critical distinction between slow and broken systems. This lesson covers performance degradation patterns, failure detection strategies, and diagnostic techniques, all essential for debugging distributed systems under pressure. Understanding whether a system is merely slow or completely broken determines your entire debugging approach and incident response strategy.
💡 Why This Matters: In distributed systems, misdiagnosing slowness as failure (or vice versa) can lead to catastrophic decisions like unnecessary failovers, data loss, or prolonged outages. A slow system may recover on its own; a broken one will not recover without intervention.
Welcome to the Chaos Zone 🌪️
When your monitoring dashboard lights up like a Christmas tree at 3 AM, the first question isn't "What's wrong?" It's "Is it slow or is it broken?" This distinction is the foundation of effective debugging under pressure.
In distributed reality, systems exhibit complex failure modes:
- Slow: High latency, degraded throughput, but still processing requests
- Broken: Complete failure, no progress, data corruption, or infinite loops
The difference isn't always obvious. A system that's 99.9% slow might be effectively broken for your users. A system that appears broken might just be overwhelmed.
Core Concepts: The Fundamental Distinction 🎯
What Makes a System "Slow"?
Slow systems are still making forward progress, just not at the expected rate. Key characteristics:
| Indicator | What You See | Why It Matters |
|---|---|---|
| Requests Complete | Eventually return success | System is functional, just degraded |
| Latency Increase | P99 goes from 100ms to 5s | Performance issue, not availability |
| Partial Success | Some requests fast, others slow | Indicates resource contention |
| Queue Growth | Work backlog increases steadily | Throughput < arrival rate |
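The last row is worth checking numerically: if completions lag arrivals over a sustained window, the backlog can only grow. A minimal sketch, where the per-window counts are hypothetical values sampled from your own metrics system:
def backlog_trend(enqueued_in_window, completed_in_window, window_seconds):
    """Compare arrival and completion rates over one sampling window."""
    arrival_rate = enqueued_in_window / window_seconds
    completion_rate = completed_in_window / window_seconds
    if completion_rate < arrival_rate:
        # Still making progress (slow), but the queue grows without bound
        return f"backlog growing at {arrival_rate - completion_rate:.1f} items/s"
    return "keeping up"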
Example slow system behaviors:
# Slow database queries - still completing (db and log are the app's existing database client and logger)
import time
def get_user_data(user_id):
start = time.time()
result = db.query("SELECT * FROM users WHERE id = ?", user_id)
duration = time.time() - start
if duration > 1.0: # Slow but not broken
log.warning(f"Slow query: {duration}s")
return result # Eventually returns
// API endpoint experiencing high latency
app.get('/api/orders', async (req, res) => {
const start = Date.now();
// Takes 8 seconds instead of 200ms
const orders = await Order.find({ userId: req.user.id });
const duration = Date.now() - start;
metrics.recordLatency('orders.fetch', duration);
res.json(orders); // Responds eventually
});
What Makes a System "Broken"?
Broken systems have stopped making meaningful progress. Critical indicators:
| Indicator | What You See | Immediate Action |
|---|---|---|
| Requests Timeout | Never complete, hang indefinitely | Circuit breaker, failover |
| Error Spike | 500s, connection refused, crashes | Stop traffic, investigate |
| Data Corruption | Invalid responses, wrong results | Immediate rollback |
| Deadlock/Livelock | No progress despite CPU activity | Thread dump, force restart |
Example broken system behaviors:
// Deadlock - system completely stuck (mu1 and mu2 are package-level sync.Mutex values)
func ProcessOrder(orderID string) error {
mu1.Lock()
defer mu1.Unlock()
// Another goroutine has mu2 and waits for mu1
mu2.Lock() // DEADLOCK - never proceeds
defer mu2.Unlock()
// This code never executes
return saveOrder(orderID)
}
// Panic causing complete service failure
fn handle_request(data: &str) -> Result<Response, Error> {
let parsed = data.parse::<i32>().unwrap(); // PANIC on invalid input
// Service crashes, no recovery
Ok(Response { value: parsed })
}
# Infinite retry loop - appears active but makes no progress
def send_notification(user_id, message):
while True: # BROKEN: Never exits on permanent failure
try:
api.send(user_id, message)
break
except NetworkError:
time.sleep(1) # Retries forever even if user deleted
The Gray Zone: When Slow Becomes Broken ⚠️
The most dangerous scenarios are borderline cases:
System state spectrum:
✅ Healthy (50ms p99) → ⚠️ Degraded (500ms p99) → 🔥 Critical (30s p99) → 💀 Dead (timeouts)
The left end of the spectrum is the slow zone, the right end is the broken zone, and the crossover between them is rarely sharp.
💡 Critical threshold: When timeout duration exceeds user/system tolerance, "slow" effectively becomes "broken" regardless of technical progress.
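One way to make that threshold concrete is to classify latency against the caller's timeout budget rather than against an absolute number. A minimal sketch; the 80% headroom factor is an assumption, not a universal rule:
def classify_against_budget(p99_latency_s, caller_timeout_s):
    """Is 'slow' still usable, or effectively broken for this caller?"""
    if p99_latency_s >= caller_timeout_s:
        return "effectively_broken"      # callers give up before we answer
    if p99_latency_s >= 0.8 * caller_timeout_s:
        return "critical"                # almost no headroom left
    return "slow_but_usable"

# classify_against_budget(5.0, 30.0)  -> 'slow_but_usable'
# classify_against_budget(28.0, 30.0) -> 'critical'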
Diagnostic Techniques: Detective Work
The Timeout Test Pattern
First rule of distributed debugging: Test with aggressive timeouts
import requests
from requests.exceptions import Timeout
def diagnose_service_health(endpoint):
"""
Quick test: Is it slow or broken?
"""
short_timeout = 1.0 # Fail fast
long_timeout = 30.0 # Wait and see
# Test 1: Can it complete quickly?
try:
response = requests.get(endpoint, timeout=short_timeout)
print("β
HEALTHY: Fast response")
return "healthy"
except Timeout:
print("β οΈ Might be slow or broken...")
# Test 2: Can it complete eventually?
try:
response = requests.get(endpoint, timeout=long_timeout)
print("π SLOW: Completed but took >1s")
return "slow"
except Timeout:
print("π BROKEN: Cannot complete even with 30s")
return "broken"
except Exception as e:
print(f"π BROKEN: Error - {e}")
return "broken"
Progress Monitoring Pattern
Track whether work is advancing, even if slowly:
public class ProgressMonitor {
private AtomicLong lastProgressTime = new AtomicLong(System.currentTimeMillis());
private AtomicLong itemsProcessed = new AtomicLong(0);
// Expected items per measurement window, set from configuration
private long expectedRate = 1000;
public void recordProgress() {
itemsProcessed.incrementAndGet();
lastProgressTime.set(System.currentTimeMillis());
}
public SystemState diagnose() {
long timeSinceProgress = System.currentTimeMillis() - lastProgressTime.get();
long rate = itemsProcessed.get();
if (timeSinceProgress > 60000) {
// No progress in 60 seconds
return SystemState.BROKEN;
} else if (rate < expectedRate * 0.1) {
// Processing at <10% expected rate
return SystemState.SEVERELY_DEGRADED;
} else if (rate < expectedRate * 0.5) {
return SystemState.SLOW;
} else {
return SystemState.HEALTHY;
}
}
}
The Canary Request Pattern
Send synthetic test requests to detect failure modes:
type HealthCheck struct {
endpoint string
interval time.Duration
}
func (hc *HealthCheck) ContinuousMonitor() {
ticker := time.NewTicker(hc.interval)
for range ticker.C {
start := time.Now()
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
req, _ := http.NewRequestWithContext(ctx, http.MethodGet, hc.endpoint, nil)
resp, err := http.DefaultClient.Do(req) // honors the 5s deadline, unlike plain http.Get
duration := time.Since(start)
cancel()
if err != nil {
if errors.Is(err, context.DeadlineExceeded) {
log.Error("BROKEN: Request timed out")
alerting.Trigger("service_broken")
} else {
log.Error("BROKEN: Request failed", err)
alerting.Trigger("service_broken")
}
continue // resp is nil on error; skip the Body.Close below
} else if duration > 1*time.Second {
log.Warn("SLOW: Response took", duration)
metrics.RecordLatency("canary", duration)
} else {
log.Debug("HEALTHY: Fast response")
}
resp.Body.Close()
}
}
Real-World Examples
Example 1: Database Connection Pool Exhaustion (SLOW)
Scenario: Web application suddenly starts responding in 10+ seconds instead of 200ms.
# Symptom: Requests queuing, eventually completing
import psycopg2.pool
import time
class DatabaseService:
def __init__(self):
# Pool with only 10 connections
self.pool = psycopg2.pool.SimpleConnectionPool(1, 10,
host="db.prod", database="users")
def get_user(self, user_id):
start = time.time()
# With 100 concurrent requests, 90 wait here
conn = self.pool.getconn() # SLOW: Waits for available connection
wait_time = time.time() - start
if wait_time > 1.0:
print(f"β οΈ SLOW: Waited {wait_time}s for connection")
try:
cursor = conn.cursor()
cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
result = cursor.fetchone()
return result # Eventually succeeds
finally:
self.pool.putconn(conn)
Diagnosis:
- ✅ Requests complete successfully
- ✅ No errors in logs
- ⚠️ High latency (P99 = 12s)
- ⚠️ Connection pool metrics show saturation
Verdict: SLOW - System making progress, just resource-constrained.
Fix: Increase connection pool size or reduce query duration.
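A sketch of the first half of that fix, assuming psycopg2's ThreadedConnectionPool (the thread-safe variant); the pool sizes are illustrative and must stay below the database's own connection limit:
import psycopg2.pool

# Larger, thread-safe pool so concurrent requests rarely queue for a connection.
pool = psycopg2.pool.ThreadedConnectionPool(
    minconn=10,
    maxconn=50,          # illustrative; keep below the DB's max_connections
    host="db.prod",
    database="users",
)
Raising the pool only helps if the database can absorb the extra connections; otherwise, reducing query duration (indexes, narrower SELECTs) is the better lever.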
Example 2: Distributed Deadlock (BROKEN)
Scenario: Microservice completely stops processing orders.
// Service A
async function processOrder(orderId) {
const orderLock = await redis.lock(`order:${orderId}`);
// Call Service B while still holding the order lock
const inventory = await serviceB.reserveInventory(orderId);
await orderLock.unlock();
}
// Service B
async function reserveInventory(orderId) {
const inventoryLock = await redis.lock(`inventory:${orderId}`);
// Calls back to Service A to verify order - DEADLOCK!
const order = await serviceA.getOrderDetails(orderId);
await inventoryLock.unlock();
}
Diagnosis:
- ❌ No requests completing
- ❌ All threads blocked
- ❌ Locks held indefinitely
- ❌ System appears active (CPU usage) but makes zero progress
Verdict: BROKEN - Circular dependency creates permanent deadlock.
Fix: Redesign lock acquisition order or use timeouts on locks.
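A sketch of the first option: every service acquires locks in one global order (here, sorted by name) and with a TTL, so circular waits cannot form and a stuck holder cannot block others forever. The lock_client.acquire API is a hypothetical stand-in, not a specific Redis client:
def acquire_in_order(lock_client, names, ttl_seconds=5):
    """Acquire several named locks in a fixed global order, each with a TTL."""
    held = []
    try:
        for name in sorted(names):              # same order everywhere -> no cycles
            held.append(lock_client.acquire(name, ttl=ttl_seconds))
        return held
    except Exception:
        for lock in reversed(held):             # release anything we did get
            lock.release()
        raise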
Example 3: Memory Leak Leading to GC Thrashing (SLOW → BROKEN)
Scenario: Java application gradually degrades over 6 hours, then becomes unresponsive.
public class CacheService {
// Memory leak: cache never evicts
private static Map<String, byte[]> cache = new HashMap<>();
public byte[] getData(String key) {
if (!cache.containsKey(key)) {
byte[] data = fetchFromDatabase(key);
cache.put(key, data); // LEAK: Grows forever
}
return cache.get(key);
}
}
// GC behavior over time:
// Hour 1: Minor GC every 10s (10ms pause) - HEALTHY
// Hour 3: Minor GC every 5s (50ms pause) - SLOW
// Hour 5: Full GC every 30s (5s pause) - SEVERELY SLOW
// Hour 6: Continuous Full GC (permanent pause) - BROKEN
Diagnosis timeline:
| Time | Heap Usage | GC Behavior | State |
|---|---|---|---|
| 0-2h | 40% | Quick minor GCs | ✅ Healthy |
| 2-4h | 70% | Frequent GCs, rising latency | ⚠️ Slow |
| 4-6h | 95% | Constant full GCs | 🔥 Critical |
| 6h+ | 99% | GC death spiral | 💀 Broken |
Verdict: Starts SLOW (high latency but functioning), transitions to BROKEN (GC thrashing prevents any useful work).
Fix: Add cache eviction policy (LRU) with size limits.
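The lesson's example is Java, where a bounded cache library (for example an LRU map) plays this role; as a language-neutral sketch, here is the same idea in Python on top of OrderedDict:
from collections import OrderedDict

class BoundedLRUCache:
    """Size-limited cache: evicts the least recently used entry when full."""

    def __init__(self, max_entries=10_000):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key, loader):
        if key in self._data:
            self._data.move_to_end(key)       # mark as recently used
            return self._data[key]
        value = loader(key)                   # e.g. fetch_from_database
        self._data[key] = value
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)    # evict the oldest entry
        return value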
Example 4: Network Partition (BROKEN)
Scenario: Service can't reach database, retries infinitely.
use std::time::Duration;
use tokio::time::sleep;
async fn fetch_user_data(user_id: u64) -> Result<User, Error> {
let mut attempts = 0;
loop {
attempts += 1;
match database::query_user(user_id).await {
Ok(user) => return Ok(user),
Err(e) if e.is_network_error() => {
// Network partition: database unreachable
println!("Attempt {} failed: {:?}", attempts, e);
sleep(Duration::from_secs(1)).await;
// BROKEN: Retries forever, never succeeds
}
Err(e) => return Err(e),
}
}
}
Diagnosis:
- ❌ 100% error rate
- ❌ Connection refused / timeout
- ❌ No successful requests
- ✅ Service itself is running
Verdict: BROKEN - Complete inability to fulfill requests, despite service being "up".
Fix: Implement circuit breaker, fail fast instead of infinite retry.
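The circuit breaker is shown in the next section; the fail-fast half of the fix is simply to bound the retries. A minimal sketch with illustrative attempt counts and delays, using exponential backoff with jitter:
import random
import time

def fetch_with_bounded_retries(query, max_attempts=4, base_delay=0.2):
    """Retry a flaky call a few times, then give up so the caller can fail over."""
    for attempt in range(1, max_attempts + 1):
        try:
            return query()
        except ConnectionError:
            if attempt == max_attempts:
                raise                         # fail fast instead of looping forever
            # exponential backoff with jitter to avoid synchronized retry storms
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            time.sleep(delay)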
Common Mistakes ⚠️
Mistake 1: Treating Slow as Broken (Premature Failover)
# ❌ WRONG: Immediate failover on first slow response
def call_service(request):
try:
response = service.call(request, timeout=0.5)
return response
except Timeout:
# Assumes broken, switches to backup immediately
print("Primary broken! Failing over...")
return backup_service.call(request)
Why it's wrong: Temporary slowness (e.g., GC pause, CPU spike) triggers unnecessary failover, potentially overloading the backup.
# ✅ RIGHT: Circuit breaker with failure threshold
class CircuitBreaker:
def __init__(self, failure_threshold=5, timeout=10):  # timeout: how long to stay open before retrying (half-open logic omitted)
self.failures = 0
self.threshold = failure_threshold
self.state = "closed" # closed = normal, open = broken
def call(self, func, *args):
if self.state == "open":
raise Exception("Circuit open - service broken")
try:
result = func(*args)
self.failures = 0 # Reset on success
return result
except Exception:
self.failures += 1
if self.failures >= self.threshold:
self.state = "open" # Multiple failures = broken
print("Circuit opened - service is broken")
raise
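Usage looks roughly like this, reusing service and backup_service from the WRONG example above; only a circuit that has opened after repeated failures justifies failing over:
breaker = CircuitBreaker(failure_threshold=5)

def call_with_fallback(request):
    if breaker.state == "open":
        # Repeated failures opened the circuit: treat the primary as broken.
        return backup_service.call(request)
    # A single slow or failed call just counts toward the threshold.
    return breaker.call(service.call, request)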
Mistake 2: Treating Broken as Slow (Waiting Forever)
// ❌ WRONG: No timeout, waits indefinitely
func fetchData(url string) ([]byte, error) {
resp, err := http.Get(url) // No timeout!
if err != nil {
return nil, err
}
defer resp.Body.Close()
// If server is broken and never responds, this hangs forever
return ioutil.ReadAll(resp.Body)
}
Why it's wrong: Broken services consume resources (goroutines, connections) indefinitely, which can cascade into a wider failure.
// ✅ RIGHT: Aggressive timeouts and failure detection
func fetchData(url string) ([]byte, error) {
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
req, err := http.NewRequestWithContext(ctx, "GET", url, nil)
if err != nil {
return nil, err
}
client := &http.Client{
Timeout: 5 * time.Second,
}
resp, err := client.Do(req)
if err != nil {
// Timeout or connection error = broken
return nil, fmt.Errorf("service broken: %w", err)
}
defer resp.Body.Close()
return ioutil.ReadAll(resp.Body)
}
Mistake 3: Ignoring the Gray Zone
// ❌ WRONG: Binary classification
function diagnoseSystem(latency) {
if (latency < 1000) {
return 'healthy';
} else {
return 'broken';
}
// Misses the critical "degraded" state
}
Why it's wrong: Real systems have gradual degradation. A 3-second response isn't "broken" but isn't healthy either.
// ✅ RIGHT: Multi-level health states
function diagnoseSystem(latency, errorRate) {
if (errorRate > 0.5) {
return 'broken'; // >50% errors = broken
} else if (latency > 10000) {
return 'effectively_broken'; // >10s = unusable
} else if (latency > 2000 || errorRate > 0.1) {
return 'degraded'; // Slow but functioning
} else if (latency > 500) {
return 'slow'; // Noticeable but acceptable
} else {
return 'healthy';
}
}
Mistake 4: Not Measuring Progress
# ❌ WRONG: Only measuring latency
class QueueProcessor:
def __init__(self):
self.queue = Queue()
def process(self):
while True:
item = self.queue.get() # Might block forever if broken
start = time.time()
handle_item(item)
latency = time.time() - start
metrics.record('latency', latency)
Why it's wrong: If queue.get() blocks forever (queue broken), latency metrics never record anything. System appears fine (no high latency reported) but makes zero progress.
# ✅ RIGHT: Track throughput and progress
import queue
import time
from queue import Queue
class QueueProcessor:
def __init__(self):
self.queue = Queue()
self.items_processed = 0
self.last_process_time = time.time()
def process(self):
while True:
try:
# Timeout allows detecting stuck queue
item = self.queue.get(timeout=10)
start = time.time()
handle_item(item)
duration = time.time() - start
self.items_processed += 1
self.last_process_time = time.time()
metrics.record('latency', duration)
metrics.record('throughput', self.items_processed)
except queue.Empty:
# No progress in 10s - check if broken
time_since_progress = time.time() - self.last_process_time
if time_since_progress > 60:
alert.trigger('queue_processor_broken')
Decision Framework: Your Action Matrix 🎯
Slow vs Broken Decision Tree
Are requests completing?
- YES → Is latency acceptable?
  - YES → ✅ HEALTHY: keep monitoring
  - NO → ⚠️ SLOW: investigate, scale up
- NO → Are errors retriable?
  - YES → 🔥 DEGRADED: retry with backoff
  - NO → 💀 BROKEN: fail over immediately
| Symptom | Classification | Action | Urgency |
|---|---|---|---|
| Requests succeed, latency 2x normal | SLOW | Investigate, scale resources | Medium |
| Requests timeout after 30s | BROKEN | Circuit break, failover | Critical |
| 50% success rate, slow responses | DEGRADED | Partial failover, shed load | High |
| 100% error rate (500s) | BROKEN | Stop traffic, rollback | Critical |
| Queue growing but draining | SLOW | Add workers, optimize | Medium |
| Queue growing, workers idle | BROKEN | Restart workers, fix deadlock | High |
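The tree and the table translate directly into code. A minimal sketch; the three boolean inputs would come from whatever health checks you already run:
def classify_system(requests_completing, latency_acceptable, errors_retriable):
    """Direct transcription of the decision tree above."""
    if requests_completing:
        if latency_acceptable:
            return "HEALTHY", "keep monitoring"
        return "SLOW", "investigate, scale up"
    if errors_retriable:
        return "DEGRADED", "retry with backoff"
    return "BROKEN", "fail over immediately"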
Key Metrics for Diagnosis
The Golden Signals (Modified for Slow vs Broken)
class SystemHealth:
"""
Essential metrics for diagnosing slow vs broken
"""
def __init__(self, expected_throughput=1200):
    self.expected_throughput = expected_throughput  # baseline req/s this service normally sustains
    self.metrics = {
'latency_p50': 0,
'latency_p99': 0,
'latency_p999': 0,
'error_rate': 0,
'success_rate': 0,
'throughput': 0,
'timeout_rate': 0,
}
def diagnose(self):
# Decision logic based on metrics
if self.metrics['success_rate'] < 0.01:
return "BROKEN", "<1% success rate"
if self.metrics['timeout_rate'] > 0.5:
return "BROKEN", ">50% timeouts"
if self.metrics['latency_p99'] > 30000: # 30s
return "EFFECTIVELY_BROKEN", "P99 > 30s"
if self.metrics['error_rate'] > 0.1:
return "DEGRADED", ">10% errors"
if self.metrics['latency_p99'] > 2000: # 2s
return "SLOW", "P99 > 2s"
if self.metrics['throughput'] < self.expected_throughput * 0.5:
return "SLOW", "Throughput <50% expected"
return "HEALTHY", "All metrics nominal"
Monitoring Dashboard Layout
SYSTEM HEALTH DASHBOARD (example)
STATUS: 🟡 SLOW
Latency (last 5m): P50 450ms | P99 2.1s ⚠️ above threshold | P999 8.5s
Success rate: 98.5% | Error rate: 1.5% | Timeout rate: 0.2%
Throughput: 850 req/s (expected: 1200 req/s)
Active connections: 245 / 300
⚠️ DIAGNOSIS: System is SLOW, not broken
- Requests completing successfully
- Latency elevated but not timing out
- Likely cause: resource saturation
Key Takeaways
Quick Reference Card
| Aspect | SLOW | BROKEN |
|---|---|---|
| Progress | Making forward progress | No meaningful progress |
| Completion | Requests eventually succeed | Requests timeout or error |
| Errors | Low error rate (<10%) | High error rate (>50%) |
| Latency | High but bounded | Infinite (timeouts) |
| Response | Optimize, scale resources | Failover, circuit break |
| Urgency | Medium - investigate | Critical - immediate action |
| Recovery | May self-recover | Requires intervention |
Critical Decision Points
- First 30 seconds: Can ANY request complete successfully?
- Timeout test: Set aggressive timeout (5s), does it ever succeed?
- Progress check: Is throughput > 0 or completely stalled?
- Error pattern: Retriable errors (slow) or fatal errors (broken)?
- Resource state: Saturated (slow) or deadlocked (broken)?
🧠 Remember
- Slow can become broken: Degradation has a tipping point
- Measure progress, not just latency: Zero throughput = broken
- Time matters: A 30-second response is effectively broken for most users
- Fail fast on broken: Don't retry indefinitely
- Be patient with slow: Allow time for recovery before failover
Further Study
- Google SRE Book - Monitoring Distributed Systems - Golden signals and effective monitoring
- Martin Fowler - Circuit Breaker Pattern - Handling failure gracefully
- AWS - Timeouts, Retries and Backoff with Jitter - Practical patterns for distributed systems
💡 Pro tip: In production, always instrument your systems to distinguish between "no requests" (broken intake), "requests hanging" (broken processing), and "requests slow" (degraded performance). These require completely different debugging approaches!