The Moment of Crisis
Understanding the psychology and initial response when systems fail under pressure
Debugging under pressure becomes critical when production systems fail, deadlines loom, and stakeholders demand immediate answers. Master crisis management with free flashcards covering incident response, mental strategies, and systematic troubleshooting techniques: essential skills for professional software engineers facing high-stakes situations.
Welcome to Crisis Debugging 🚨
Every developer eventually faces the moment of crisis: production is down, users are affected, your manager is hovering, and you can feel your heart racing. The temptation to panic-fix, randomly change code, or revert everything becomes overwhelming. Yet the best debugging happens when you maintain composure and follow proven strategies.
This lesson equips you with:
- Mental frameworks for staying calm under pressure
- Triage techniques to prioritize what matters most
- Communication strategies to manage stakeholders
- Systematic approaches that prevent crisis-induced errors
💡 Key Insight: The worst debugging decisions happen in the first 60 seconds of a crisis. Learning to pause, assess, and plan separates senior engineers from juniors.
Core Concepts: Anatomy of a Crisis
The Crisis Response Cycle 🔄
When a critical bug surfaces, you enter a predictable psychological and technical cycle:
┌─────────────────────────────────────────────┐
│            CRISIS RESPONSE CYCLE            │
└─────────────────────────────────────────────┘

      🚨 ALERT/DISCOVERY
              ↓
      😰 PANIC RESPONSE (Fight/Flight/Freeze)
              ↓
      🧠 COGNITIVE OVERRIDE (Force calm)
              ↓
      🔍 TRIAGE & ASSESSMENT
              ↓
         ┌────┴────┐
         ↓         ↓
   🔥 Critical   ⚠️ Important
         ↓         ↓
  🛠️ IMMEDIATE   📝 DOCUMENT
     ACTION        & QUEUE
         ↓
     RESOLUTION
         ↓
     📋 POST-MORTEM
The key to crisis debugging is recognizing where you are in this cycle and consciously moving to the next phase rather than getting stuck in panic.
The Debugging Triangle Under Pressure ⚠️
In crisis situations, you're balancing three competing forces:
| Factor | Pressure | Management Strategy |
|---|---|---|
| ⏰ Time | Every minute costs money/reputation | Set explicit time-boxes: "15 minutes on this approach, then pivot" (see the sketch below) |
| 🎯 Accuracy | Wrong fix could make it worse | Use safeguards: feature flags, canary deploys, rollback plans |
| 👥 Stakeholders | Manager/customers demanding updates | Scheduled updates (every 15 min) prevent constant interruptions |
The Anti-Pattern: Trying to satisfy all three simultaneously leads to:
- Hasty, untested fixes (sacrificing accuracy for speed)
- Analysis paralysis (sacrificing speed for perfect understanding)
- Information overload from constant context-switching
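One way to make the time-box strategy from the table above concrete is a tiny timer helper. A minimal sketch (the timebox/expired helpers and the example theory are illustrative, not part of any incident tooling):
import time

def timebox(minutes, label):
    """Start a labelled investigation window and return its deadline."""
    print(f"Time-box started: {label} ({minutes} min)")
    return time.monotonic() + minutes * 60

def expired(deadline):
    """True once the window has run out - time to pivot or extend deliberately."""
    return time.monotonic() >= deadline

## Usage sketch: commit to ONE approach, then consciously re-decide
deadline = timebox(15, "Connection-pool exhaustion theory")
while not expired(deadline):
    # ...investigate this single theory...
    time.sleep(30)
print("Time-box expired - write down findings, then pivot or extend on purpose")
The value isn't the timer itself; it's the forced decision point at the end of the window.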
Mental State Management 🧠
The Physiological Response: When crisis hits, your body releases cortisol and adrenaline:
- Tunnel vision (miss obvious clues)
- Working memory impairment (forget what you just checked)
- Impulsive decision-making (skip verification steps)
- Confirmation bias intensifies (see only what confirms your theory)
The 90-Second Reset Technique:
- STOP 🛑 - Literally pause for 5 seconds
- BREATHE 🌬️ - Three deep breaths (activates parasympathetic nervous system)
- ORIENT 🧭 - Ask: "What do I actually know right now?"
- PLAN 📝 - Write down next 3 concrete steps
- EXECUTE ⚡ - Follow the plan for at least 5 minutes before pivoting
💡 Pro Tip: Keep a physical notepad during incidents. Writing by hand forces you to slow down and engages different brain regions, breaking the panic loop.
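If you prefer a digital trail, the same discipline works as an append-only, timestamped log. A minimal sketch (the file name, helper, and example notes are illustrative):
from datetime import datetime

def log_incident_note(note, path="incident_notes.md"):
    """Append a timestamped note: what you know, what you'll try next."""
    with open(path, "a") as f:
        f.write(f"- {datetime.now():%H:%M:%S} {note}\n")

log_incident_note("KNOW: checkout 500s started 14:20, only promo-code users")
log_incident_note("NEXT: (1) read PromoValidator logs (2) diff deploy (3) check for nulls")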
The Triage Protocol 🏥
Borrowed from emergency medicine, triage means categorizing issues by severity and treating the most critical first.
The SEVER Framework for bug prioritization:
| Letter | Factor | Questions to Ask |
|---|---|---|
| S | Scope | How many users affected? All? 10%? One customer? |
| E | Effect | What's broken? Data loss? Service down? UI glitch? |
| V | Visibility | Who notices? External customers? Internal team only? |
| E | Escalation | Is it getting worse over time? Spreading? Stable? |
| R | Revenue | Direct financial impact? SLA violations? Legal issues? |
Priority Classification:
🔴 P0 - CRITICAL (All hands on deck)
- Data loss occurring
- Service completely down for >10% users
- Security breach
- Action: Drop everything, team swarm
🟠 P1 - URGENT (Primary focus)
- Major feature broken for subset of users
- Performance degraded >50%
- Revenue-generating flow impaired
- Action: Dedicated engineer(s), frequent updates
🟡 P2 - IMPORTANT (Work during business hours)
- Non-critical feature broken
- Workaround available
- Affects internal tools only
- Action: Schedule investigation, normal pace
The Quick Triage Script (< 2 minutes):
## Mental checklist - answer these fast
triage_questions = [
    "Is user data at risk? (YES = P0)",
    "Can users complete critical flows? (NO = P0/P1)",
    "Is the issue spreading/worsening? (YES = escalate priority)",
    "Do we have monitoring/logs? (NO = first restore visibility)",
    "Is there a safe rollback? (YES = consider it)",
]

## Decision tree - the flags are your yes/no answers to the checklist above
def triage(data_at_risk, service_down, revenue_impacted, escalating):
    if data_at_risk or service_down:
        return "P0", "Immediate mitigation - perfect understanding comes later"
    elif revenue_impacted or escalating:
        return "P1", "Focused debugging - timebox each theory"
    else:
        return "P2+", "Document and queue - don't let it derail planned work"

priority, action = triage(data_at_risk=False, service_down=True,
                          revenue_impacted=True, escalating=False)
Communication Under Pressure 📢
The biggest mistake in crisis debugging: Going silent while you investigate. Silence creates:
- Anxiety in stakeholders
- Duplicate efforts (others start debugging too)
- Perception that nobody's handling it
The Status Update Template (use every 15-30 minutes):
[TIME] - [PRIORITY] Issue Update
STATUS: [Investigating | Root cause found | Fix in progress | Deployed | Resolved]
WHAT WE KNOW:
- Symptom: [specific observable behavior]
- Scope: [X users / Y% of requests / Z feature]
- Started: [timestamp or "unknown"]
WHAT WE'VE TRIED:
- ✅ [Thing that worked or ruled out]
- ❌ [Thing that didn't help]
NEXT STEPS:
- [Specific action 1] (ETA: X minutes)
- [Specific action 2] (ETA: Y minutes)
WORKAROUND: [If available] or "None yet"
NEXT UPDATE: [timestamp]
Example:
14:37 - P1 Issue Update
STATUS: Root cause found, fix in progress
WHAT WE KNOW:
- Symptom: Checkout failing with 500 error
- Scope: ~15% of users (those with promo codes)
- Started: ~14:20 UTC after deployment
WHAT WE'VE TRIED:
- ✅ Checked database - no issues
- ✅ Found exception in logs: NullPointerException in PromoValidator
- ❌ Rolling back didn't help (bug was dormant)
NEXT STEPS:
- Add null check to PromoValidator (ETA: 5 min)
- Deploy to canary (ETA: 10 min)
- Monitor for 5 min, then full rollout
WORKAROUND: Users can checkout without promo codes
NEXT UPDATE: 15:00 or when deployed
💡 Why this works:
- Specific timestamps prevent "when did you last check?"
- Eliminated possibilities prevent duplicate debugging
- Clear next steps show you have a plan
- ETAs are short and realistic (under-promise, over-deliver)
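Under pressure it helps to have the template pre-baked in code so you only fill in the blanks. A minimal sketch assuming the fields from the template above (the status_update function name and its parameters are illustrative):
from datetime import datetime, timedelta

def status_update(priority, status, known, tried, next_steps,
                  workaround="None yet", update_in_minutes=15):
    """Render the incident status template as a plain-text message."""
    now = datetime.now()
    lines = [
        f"{now:%H:%M} - {priority} Issue Update",
        f"STATUS: {status}",
        "WHAT WE KNOW:",
        *[f"- {item}" for item in known],
        "WHAT WE'VE TRIED:",
        *[f"- {item}" for item in tried],
        "NEXT STEPS:",
        *[f"- {item}" for item in next_steps],
        f"WORKAROUND: {workaround}",
        f"NEXT UPDATE: {now + timedelta(minutes=update_in_minutes):%H:%M}",
    ]
    return "\n".join(lines)

print(status_update(
    priority="P1",
    status="Root cause found, fix in progress",
    known=["Symptom: Checkout failing with 500 error",
           "Scope: ~15% of users (those with promo codes)"],
    tried=["✅ Found exception in logs: NullPointerException in PromoValidator",
           "❌ Rolling back didn't help (bug was dormant)"],
    next_steps=["Add null check to PromoValidator (ETA: 5 min)"],
    workaround="Users can checkout without promo codes",
))
In practice this could post straight to the incident channel; the point is that the structure never gets dropped when you're stressed.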
Systematic Crisis Debugging Process 🔬
When under pressure, discipline matters most. Follow this sequence:
Phase 1: CONTAIN (0-5 minutes)
Goal: Stop the bleeding before understanding the wound.
## Containment checklist
containment_actions = {
"Can we rollback safely?": "Do it now, debug the rolled-back version",
"Can we disable the feature?": "Feature flag off, restore core service",
"Can we route around it?": "Failover, circuit breaker, cache",
"Can we scale resources?": "Sometimes buys time to investigate"
}
## The containment decision tree
if safe_rollback_available:
rollback() # Restore service first
debug_the_rolled_back_version() # Then figure out what went wrong
elif feature_can_be_disabled:
feature_flag_off() # Isolate the problem
investigate_with_reduced_pressure()
elif can_route_traffic:
enable_failover() # Keep users flowing
fix_the_broken_path()
else:
# No quick containment - must debug under full pressure
proceed_to_phase_2()
Phase 2: INVESTIGATE (5-30 minutes)
Goal: Form and test hypotheses systematically.
The Hypothesis-Driven Investigation Loop:
┌────────────────────────────────────────┐
│ 1. OBSERVE                             │
│    What's the specific symptom?        │
│    Gather logs, metrics, error traces  │
└──────────┬─────────────────────────────┘
           ↓
┌────────────────────────────────────────┐
│ 2. HYPOTHESIZE                         │
│    Brainstorm 2-3 possible causes      │
│    (Don't commit to one too early!)    │
└──────────┬─────────────────────────────┘
           ↓
┌────────────────────────────────────────┐
│ 3. PREDICT                             │
│    "If X is the cause, I should see Y" │
│    Make falsifiable predictions        │
└──────────┬─────────────────────────────┘
           ↓
┌────────────────────────────────────────┐
│ 4. TEST                                │
│    Look for Y (the prediction)         │
│    ONE test at a time!                 │
└──────────┬─────────────────────────────┘
           ↓
      ┌────┴─────┐
      ↓          ↓
  Confirmed?   Ruled out?
      ↓          ↓
   Fix it!    Try next hypothesis
                 ↓
           (loop back to step 2)
Critical Debugging Discipline:
// ❌ WRONG: Panic debugging (changing multiple things)
function panicDebug() {
// Change database timeout
db.setTimeout(5000);
// Also restart service
service.restart();
// And clear cache
cache.clear();
// And update library
updateDependency('problematic-lib');
// Now if it works, which change fixed it??
// If it breaks worse, which change caused it??
}
// ✅ RIGHT: Controlled debugging (one variable at a time)
function systematicDebug() {
// HYPOTHESIS: Database timeout is too low
// PREDICTION: If true, I'll see timeout errors in logs
const timeoutErrors = logs.filter(e => e.type === 'TIMEOUT');
if (timeoutErrors.length > 0) {
// Evidence supports hypothesis
// Test ONE change
db.setTimeout(5000);
// Monitor for 2 minutes
waitAndVerify(120000);
if (issueResolved()) {
// Success! We know exactly what fixed it
logRootCause('Database timeout was too low');
} else {
// Didn't work, rollback and try next hypothesis
db.setTimeout(ORIGINAL_VALUE);
nextHypothesis();
}
}
}
Phase 3: FIX (30-60 minutes)
Goal: Deploy the smallest safe change that resolves the issue.
## Fix deployment safety checklist
def deploy_crisis_fix(fix_code):
# 1. Can this fix make things WORSE?
risk_assessment = analyze_blast_radius(fix_code)
if risk_assessment == "HIGH":
get_second_pair_of_eyes() # Don't deploy alone
# 2. Do we have a quick rollback?
ensure_feature_flag_or_quick_revert_available()
# 3. Can we test in isolation?
if canary_environment_available:
deploy_to_canary(fix_code)
monitor(duration="5min", metrics=["error_rate", "latency"])
if canary_healthy:
proceed_to_full_deployment()
else:
rollback_canary()
return "Fix made it worse, investigating further"
# 4. Deploy with observability
deploy_with_monitoring(fix_code, alert_on=["error_spike", "latency_increase"])
# 5. Verify the fix
verify_symptom_resolved()
verify_no_new_errors_introduced()
return "Fix deployed and verified"
Real-World Examples 🌍
Example 1: The Production Database Meltdown 💾
The Crisis:
09:47 UTC - Monitoring alerts:
- API latency: 200ms → 15000ms (75x increase)
- Database CPU: 40% → 98%
- Error rate: 0.1% → 12%
- Customer support tickets: 50 in 3 minutes
The Panic Response (what junior dev did):
-- ❌ Started killing queries randomly
KILL QUERY 12847;
KILL QUERY 12849;
KILL QUERY 12851;
-- ❌ Restarted database (made it worse - lost connections)
-- (shell, not SQL): systemctl restart postgresql
-- ❌ Changed configuration blindly
SET max_connections = 1000; -- Was 100, now overloaded
SET shared_buffers = '8GB'; -- Crashed the server
The Systematic Response (what senior dev did):
-- ✅ STEP 1: OBSERVE (30 seconds)
-- Check what queries are running
SELECT pid, query, query_start, state, wait_event
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;
-- Found: 47 instances of same slow query, all started ~09:45
-- Query: SELECT * FROM orders WHERE user_id = X (no index on user_id!)
-- ✅ STEP 2: HYPOTHESIZE
-- "New code deployed at 09:44 introduced N+1 query problem"
-- ✅ STEP 3: PREDICT
-- "If true, should see deployment correlation and repeated pattern"
-- ✅ STEP 4: VERIFY
SELECT query, COUNT(*) as occurrences
FROM pg_stat_activity
GROUP BY query
ORDER BY occurrences DESC;
-- Confirmed: Same query repeated 47 times
-- ✅ STEP 5: IMMEDIATE MITIGATION (don't wait for perfect fix)
-- Kill only the problematic queries
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE query LIKE '%FROM orders WHERE user_id%'
AND state = 'active';
-- ✅ STEP 6: TEMPORARY FIX
-- Rollback the deployment (takes 2 minutes, via deploy tooling rather than SQL)
-- rollback to version 2.4.1
-- ✅ STEP 7: VERIFY
-- Latency back to 200ms within 30 seconds
-- Error rate back to 0.1%
-- ✅ STEP 8: PROPER FIX (after service restored)
-- Add missing index (in rolled-back version)
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);
-- ✅ STEP 9: RE-DEPLOY (with monitoring)
-- Deploy fixed version 2.4.2 with index in place
Timeline Comparison:
| Time | Panic Approach | Systematic Approach |
|---|---|---|
| 09:47 | Alert fires | Alert fires |
| 09:48 | Random query killing | Gather data (pg_stat_activity) |
| 09:50 | Restart DB (service down!) | Identify root cause (N+1 query) |
| 09:55 | Change config, crashes | Rollback deployment |
| 10:00 | Still down, escalating | ✅ Service restored |
| 10:30 | Finally restored via full rollback | ✅ Proper fix deployed |
Key Lessons:
- 🎯 Observe before acting (30 seconds of data gathering saved 30 minutes)
- 🎯 One change at a time (rollback worked because we knew exactly what it would undo)
- 🎯 Mitigation ≠ root cause fix (kill queries to buy time, then fix properly)
Example 2: The Memory Leak Under Load 🔥
The Crisis:
## Symptom: Application crashes every 2 hours in production
## Happens only under high traffic (>1000 req/sec)
## No crashes in staging (max 100 req/sec)
import os
import psutil
import time
## ❌ WRONG: Panic response
def panic_response():
# "Let's just restart it every hour!"
# (Treating symptom, not cause)
while True:
time.sleep(3600)
os.system('systemctl restart app')
# Users experience hourly downtime
# Root cause still exists
## ✅ RIGHT: Systematic investigation
def systematic_investigation():
# STEP 1: OBSERVE - What changes before crash?
process = psutil.Process(os.getpid())
baseline_memory = process.memory_info().rss
print(f"Baseline memory: {baseline_memory / 1024 / 1024:.2f} MB")
# Monitor memory every minute
for minute in range(120): # 2 hours
time.sleep(60)
current_memory = process.memory_info().rss
growth = current_memory - baseline_memory
print(f"Minute {minute}: {current_memory / 1024 / 1024:.2f} MB "
f"(+{growth / 1024 / 1024:.2f} MB)")
# Output reveals steady growth: ~50MB per minute
# HYPOTHESIS: Memory leak, not handling cleanup
The Investigation:
## STEP 2: Profile memory allocations
import tracemalloc
tracemalloc.start()
## Run for 10 minutes under load
time.sleep(600)
## Get top memory consumers
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')
for stat in top_stats[:10]:
print(f"{stat.filename}:{stat.lineno}: {stat.size / 1024 / 1024:.2f} MB")
## Output reveals:
## app/cache.py:47: 2847.3 MB  ← 🔴 SMOKING GUN
## app/handlers.py:112: 89.2 MB
## app/models.py:203: 45.1 MB
The Root Cause:
## cache.py (the problematic code)
class RequestCache:
def __init__(self):
self.cache = {} # ❌ Never cleared!
def cache_response(self, user_id, response_data):
# Each request adds to cache
self.cache[user_id] = response_data
# ❌ No eviction policy
# ❌ No size limit
# At 1000 req/sec, adds ~1MB/sec to memory
## ✅ THE FIX: Add LRU cache with size limit
from collections import OrderedDict
class RequestCache:
def __init__(self, max_size=10000):
self.cache = OrderedDict()
self.max_size = max_size
def cache_response(self, user_id, response_data):
self.cache[user_id] = response_data
# Evict oldest entries when limit reached
if len(self.cache) > self.max_size:
self.cache.popitem(last=False) # Remove oldest
Deployment Strategy:
## Don't deploy blindly - verify the fix
def verify_fix():
# STEP 1: Load test with fix in staging
run_load_test(duration="4 hours", requests_per_sec=1000)
# STEP 2: Monitor memory profile
memory_snapshots = collect_memory_over_time()
# STEP 3: Verify memory is bounded
assert max(memory_snapshots) < baseline + 500 * 1024 * 1024 # Reasonable growth (< 500 MB)
assert memory_is_stable(memory_snapshots) # Not constantly growing
# STEP 4: Deploy to production canary
deploy_to_canary(instances=2)
# STEP 5: Compare canary vs old version
monitor_for(duration="1 hour")
if canary_memory_stable and no_crashes:
deploy_to_all_instances()
else:
rollback_canary()
investigate_further()
Example 3: The Race Condition Heisenbug 🏁
The Crisis: Intermittent data corruption affecting ~0.1% of transactions.
// THE PROBLEM CODE
class OrderProcessor {
constructor() {
this.orderTotal = 0;
}
// ❌ Race condition: Two async operations modify shared state
async processOrder(order) {
// Step 1: Calculate total
const itemTotal = await this.calculateItems(order);
// ⚠️ DANGER ZONE: Await causes context switch
// Another processOrder() call could start here!
// Step 2: Add tax
this.orderTotal = itemTotal; // ❌ Might overwrite another order's total
const tax = await this.calculateTax(this.orderTotal);
// Step 3: Save
this.orderTotal += tax; // ❌ Wrong total if interrupted!
await this.saveOrder(order, this.orderTotal);
}
}
// Under high concurrency:
// Thread A: calculates Order #1 = $100
// → awaits tax calculation
// Thread B: calculates Order #2 = $200
// → overwrites this.orderTotal = $200
// Thread A: resumes, adds tax to $200 (wrong order!)
// → Order #1 saved with Order #2's price
The Debugging Challenge:
// ❌ WRONG: Try to reproduce with simple test
async function testRaceCondition() {
const processor = new OrderProcessor();
// This test passes! (Single-threaded)
await processor.processOrder({id: 1, items: [...]});
await processor.processOrder({id: 2, items: [...]});
// ⚠️ FALSE CONFIDENCE: Race condition requires true concurrency
}
// ✅ RIGHT: Reproduce with concurrent load
async function properRaceConditionTest() {
const processor = new OrderProcessor();
// Fire 100 concurrent orders
const orders = Array.from({length: 100}, (_, i) => ({
id: i,
items: [{price: 100}] // All should total $100 + tax
}));
// Process all simultaneously
const results = await Promise.all(
orders.map(order => processor.processOrder(order))
);
// Check for corruption
const incorrectTotals = results.filter(total =>
total < 105 || total > 115 // Expected: ~$110 with tax
);
if (incorrectTotals.length > 0) {
console.log(`🔴 RACE CONDITION CONFIRMED: ${incorrectTotals.length} corrupted`);
}
}
The Fix:
// ✅ SOLUTION 1: Remove shared state (preferred)
class OrderProcessor {
// No instance variables!
async processOrder(order) {
// Use local variables only
const itemTotal = await this.calculateItems(order);
const tax = await this.calculateTax(itemTotal); // ✅ Use local var
const finalTotal = itemTotal + tax; // ✅ All local, thread-safe
await this.saveOrder(order, finalTotal);
return finalTotal;
}
}
// ✅ SOLUTION 2: Use mutex/lock (if shared state needed)
const { Mutex } = require('async-mutex');
class OrderProcessor {
constructor() {
this.mutex = new Mutex();
this.orderTotal = 0;
}
async processOrder(order) {
// Acquire lock - only one order processed at a time
const release = await this.mutex.acquire();
try {
const itemTotal = await this.calculateItems(order);
this.orderTotal = itemTotal;
const tax = await this.calculateTax(this.orderTotal);
this.orderTotal += tax;
await this.saveOrder(order, this.orderTotal);
} finally {
release(); // Always release lock
}
}
}
Lesson: Race conditions are the hardest bugs to debug under pressure because:
- They're intermittent (Heisenbugs - disappear when you look closely)
- Standard logging doesn't help (timing matters)
- They reproduce only under production-like load
- Solution: Always design for concurrency from the start
Example 4: The Cascading Failure 💥
The Scenario: One microservice goes down, takes entire system with it.
## THE VULNERABLE ARCHITECTURE
class OrderService:
def create_order(self, user_id, items):
# β No timeout, no retry limit, no fallback
user = UserService.get_user(user_id) # External call
inventory = InventoryService.reserve(items) # External call
payment = PaymentService.charge(user, total) # External call
# If ANY service is slow/down, this request hangs forever
# Under load: Thread pool exhausted → Service appears down
# Other services calling THIS service also hang → Cascade!
## THE CASCADE:
## 1. PaymentService latency spikes (3rd party API slow)
## 2. OrderService threads wait indefinitely
## 3. OrderService stops responding (thread pool exhausted)
## 4. Frontend calls to OrderService timeout
## 5. Frontend appears down to users
## 6. Load balancer marks Frontend unhealthy
## 7. All traffic shifts to remaining Frontends
## 8. Remaining Frontends overloaded → They go down too
## 9. Total system failure from one slow dependency
The Crisis Response:
## ✅ STEP 1: CIRCUIT BREAKER (stop the bleeding)
from circuitbreaker import circuit
class OrderService:
@circuit(failure_threshold=5, recovery_timeout=60)
def create_order(self, user_id, items):
try:
# Timeouts prevent hanging
user = UserService.get_user(user_id, timeout=2)
inventory = InventoryService.reserve(items, timeout=2)
payment = PaymentService.charge(user, total, timeout=5)
return self.save_order(user, inventory, payment)
except TimeoutError as e:
# Circuit opens after 5 failures
# Future requests fail fast (don't wait)
raise ServiceUnavailable("Order service temporarily unavailable")
## ✅ STEP 2: GRACEFUL DEGRADATION
class OrderService:
def create_order(self, user_id, items):
try:
user = UserService.get_user(user_id, timeout=2)
except TimeoutError:
# ✅ Use cached user data (slightly stale is better than down)
user = self.get_cached_user(user_id)
try:
inventory = InventoryService.reserve(items, timeout=2)
except TimeoutError:
# ✅ Optimistically assume in stock, verify async later
inventory = self.optimistic_reserve(items)
try:
payment = PaymentService.charge(user, total, timeout=5)
except TimeoutError:
# ✅ Queue payment for later processing
self.queue_payment(user, total)
return {"status": "pending", "order_id": order_id}
## ✅ STEP 3: BULKHEAD PATTERN (isolate failures)
import concurrent.futures
class OrderService:
def __init__(self):
# Separate thread pools for each dependency
self.user_pool = concurrent.futures.ThreadPoolExecutor(max_workers=10)
self.inventory_pool = concurrent.futures.ThreadPoolExecutor(max_workers=10)
self.payment_pool = concurrent.futures.ThreadPoolExecutor(max_workers=10)
# If PaymentService hangs, it exhausts payment_pool
# But user_pool and inventory_pool still function
# → Partial degradation instead of total failure
Monitoring Dashboard During Crisis:
┌──────────────────────────────────────────────
│           SERVICE HEALTH DASHBOARD
├──────────────────────────────────────────────
│  UserService        [██████████░]  98% ✅
│  InventoryService   [██████████░]  92% ✅
│  PaymentService     [████░░░░░░░]  35% 🔴
│  OrderService       [█████████░░]  87% ⚠️
│  Frontend           [████████░░░]  79% ⚠️
│
│  Circuit Breakers:
│   ├─ Payment → Order:   OPEN 🔴 (failing fast)
│   ├─ Order → Frontend:  HALF-OPEN ⚠️ (testing)
│   └─ Others:            CLOSED ✅
│
│  Fallback Strategies Active:
│   ├─ Using cached user data
│   ├─ Queuing 127 payments
│   └─ Optimistic inventory (92% success rate)
└──────────────────────────────────────────────
Key Lessons:
- 🎯 Timeouts everywhere: Never wait indefinitely
- 🎯 Circuit breakers: Fail fast when dependency is down
- 🎯 Graceful degradation: Reduced functionality > no functionality
- 🎯 Bulkheads: Isolate failures to prevent cascade
Common Mistakes ⚠️
Mistake 1: The "Shotgun Debug" 🔫
What it is: Changing multiple things simultaneously hoping something works.
## ❌ WRONG
def fix_performance_issue():
increase_cache_size() # Change 1
update_library_version() # Change 2
modify_database_pool() # Change 3
adjust_thread_count() # Change 4
restart_service() # Change 5
# If it works: Which change fixed it?
# If it fails: Which change broke it further?
# Now you have 5 variables to untangle!
## ✅ RIGHT
def fix_performance_systematically():
# Test ONE hypothesis at a time
hypothesis = "Cache size is too small"
# Make ONE change
original_cache_size = get_cache_size()
increase_cache_size()
# Measure impact
performance_improved = measure_for(duration=300) # 5 min
if performance_improved:
log_success("Cache size was the issue")
else:
# Revert and try next hypothesis
set_cache_size(original_cache_size)
next_hypothesis()
Why it happens: Panic + time pressure = "try everything!"
The fix: Force yourself to write down your hypothesis before changing anything.
Mistake 2: The "It Works On My Machine" Trap 💻
What it is: Debugging in the wrong environment.
## ❌ WRONG
## Developer: "Can't reproduce the bug locally"
## (Local machine has different config, data, load, network)
## Developer adds debug logging
print("Debug: user_id =", user_id)
print("Debug: Starting payment processing")
print("Debug: Payment completed successfully")
## Deploys to production...
## Bug still happens, logs show nothing useful
## ✅ RIGHT
## Reproduce in production-like environment
## Option 1: Connect to production logs/metrics
tail -f /var/log/app/production.log | grep "ERROR"
## Option 2: Create staging environment with production data
copy_production_database_to_staging()
apply_same_load_pattern()
reproduce_user_journey()
## Option 3: Production debugging (careful!)
enable_verbose_logging_for_single_user(user_id="affected_user")
## Don't enable for ALL users - log volume will crash system
Why it happens: Production has complexity that dev environments don't:
- Scale (1 vs 10,000 concurrent users)
- Data (clean test data vs messy real data)
- Network (localhost vs distributed services)
- Configuration (different env vars, secrets)
The fix: Always debug in an environment that mirrors production.
Mistake 3: Ignoring the Logs 📋
What it is: Acting on intuition instead of evidence.
// ❌ WRONG
function debugBasedOnAssumption() {
// Developer: "It's probably the database"
// (Hasn't actually checked logs)
// Spends 2 hours optimizing queries
optimizeDatabaseQueries();
addMoreIndexes();
tweakConnectionPool();
// Still broken!
// Finally checks logs...
// Actual error: "Network timeout connecting to Redis"
// Wrong service entirely!
}
// ✅ RIGHT
function debugBasedOnEvidence() {
// STEP 1: Check the actual error
const errorLogs = getRecentErrors();
console.log(errorLogs);
// Output: "RedisConnectionError: ETIMEDOUT"
// STEP 2: Form hypothesis based on evidence
// "Redis is unreachable or slow"
// STEP 3: Verify
const redisPing = measureRedisLatency();
console.log(`Redis latency: ${redisPing}ms`);
// Output: "Redis latency: 15847ms" (should be <10ms)
// STEP 4: Fix the RIGHT thing
investigateRedisPerformance();
}
Why it happens: Confirmation bias + time pressure = "I know what the problem is!"
The fix: Logs first, hypotheses second. Always.
Mistake 4: The Premature Optimization 🏎️
What it is: Fixing performance when the issue is correctness.
// ❌ WRONG: Performance "fix" for correctness bug
func processPayment(amount float64) error {
// Bug: Occasionally charges wrong amount
// Developer's "fix": Make it faster!
// (Doesn't address the bug)
// Add caching
if cached := cache.Get("payment"); cached != nil {
return nil // ❌ Returns cached result for DIFFERENT payment!
}
// Use goroutines
go func() {
charge(amount) // ❌ Race condition now!
}()
return nil // ❌ Returns before payment completes
}
// ✅ RIGHT: Fix the actual bug first
func processPayment(userID string, amount float64) error {
// STEP 1: Reproduce and understand the bug
log.Printf("Processing payment: user=%s, amount=%.2f", userID, amount)
// STEP 2: Fix correctness issue
// (Bug was: shared variable between requests)
result := chargePayment(userID, amount) // Each call independent
// STEP 3: Verify fix
if result.ChargedAmount != amount {
log.Errorf("CRITICAL: Charged %.2f but expected %.2f",
result.ChargedAmount, amount)
return errors.New("incorrect amount charged")
}
// STEP 4: THEN optimize (if needed)
// But only after correctness is guaranteed!
return nil
}
Why it happens: Misidentifying symptoms (slow = needs optimization) vs root cause (slow because buggy code is retrying).
The fix: Correct first, fast second. A fast bug is still a bug.
Mistake 5: Going Silent 🤐
What it is: Disappearing into deep debugging while stakeholders panic.
❌ WRONG:
10:00 - Manager: "The site is down!"
10:01 - You: "I'm on it"
10:02 - [silence]
10:10 - Manager: "Any update??"
10:11 - [still silence]
10:20 - Manager: "Should we call in the VP?"
10:25 - [silence continues]
10:30 - You: "Fixed it!"
10:31 - Manager: "What was it? How did you fix it?
Why didn't you update us??"
✅ RIGHT:
10:00 - Manager: "The site is down!"
10:01 - You: "On it. First update in 15 min or when I know more."
10:05 - You: "Status: Identified database connection issue.
Testing connection pool settings. Next update 10:15."
10:12 - You: "Update: Connection pool fix didn't work.
Now checking for query causing locks.
Next update 10:25."
10:18 - You: "Found it: Long-running analytics query blocking writes.
Killing query now."
10:20 - You: "Site restored. Writing up incident report.
Root cause: Analytics query needs separate read replica."
Why it happens: You're focused and forget others are waiting/worrying.
The fix: Set a timer for status updates (every 15-30 min). Communication prevents escalation.
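The update cadence itself can be automated. A minimal sketch using only the standard library (the 15-minute default and the print-based nudge are illustrative; a real setup might post to the incident channel instead):
import threading

def remind_to_update(interval_minutes=15):
    """Nag yourself to send a status update until the incident is resolved."""
    stop = threading.Event()

    def nudge():
        # wait() returns False on timeout, True once stop is set
        while not stop.wait(interval_minutes * 60):
            print("⏰ Send a status update now (even 'still investigating' counts).")

    threading.Thread(target=nudge, daemon=True).start()
    return stop  # call .set() once the incident is resolved

## Usage sketch
reminder = remind_to_update(15)
## ... debug ...
## reminder.set()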
Key Takeaways 🎯
📋 Crisis Debugging Quick Reference
| Phase | Time | Action | Key Rule |
|---|---|---|---|
| 🚨 Alert | 0-30s | Read alert, check severity | Don't touch anything yet |
| 🧠 Pause | 30-60s | Deep breath, assess, write down what you know | 90-second reset technique |
| 🏥 Triage | 1-2min | SEVER framework: Scope, Effect, Visibility, Escalation, Revenue | P0/P1/P2 priority assignment |
| 📢 Communicate | 2-3min | Send initial status: "Investigating X, update in 15min" | Never go silent |
| 🛡️ Contain | 3-5min | Rollback? Feature flag? Failover? | Stop bleeding first |
| 🔬 Investigate | 5-30min | Hypothesis → Predict → Test (one variable at a time) | Logs before assumptions |
| 🔧 Fix | 30-60min | Smallest safe change, canary deploy, verify | Correct first, fast second |
| 📊 Verify | 1-2hr | Monitor metrics, confirm resolution, watch for regressions | Measure, don't assume |
| 📝 Document | Next day | Post-mortem: timeline, root cause, prevention steps | Learn from every incident |
🧠 Mental Model: The Debugging Mantra
"STOP. BREATHE. OBSERVE. ONE CHANGE. VERIFY."
Repeat this every time you feel panic rising.
⚡ Emergency Shortcuts
- Can't find logs? → grep -r "ERROR" /var/log/
- Don't know what changed? → Check recent deployments, git log, config changes
- Can't reproduce? → Match production: data, load, config, network conditions
- Too many hypotheses? → Write them all down, test highest probability first
- Stakeholders interrupting? → Set explicit update schedule: "Next update at X:YY"
🔴 Red Flags (Stop and reassess)
- You've made >3 changes without improvement
- You can't explain how your fix would work
- You're changing things you don't understand
- You haven't looked at logs in 10+ minutes
- Nobody else knows what you're trying
✅ Success Indicators
- You can explain the bug's root cause
- Your fix addresses that specific cause
- Metrics confirm resolution
- The fix is documented
- Prevention steps are identified
Further Study 📚
Books:
- Site Reliability Engineering (Google) - Chapter 14: Managing Incidents
- The Practice of System and Network Administration - Crisis Management section
- Debugging: The 9 Indispensable Rules by David Agans
Resources:
- Incident Response Guide - PagerDuty - Comprehensive incident management framework
- Post-Mortem Culture - Google SRE - Learning from failures
- Chaos Engineering Principles - Proactive failure testing
Practice:
- Set up a test environment and intentionally break things (chaos engineering)
- Do fire drills: Time yourself responding to simulated incidents
- Review real incident reports from companies (search "[Company] incident report")
Remember: Every senior engineer has a story about the crisis they mishandled early in their career. The difference between junior and senior isn't avoiding crises; it's staying calm and systematic when they hit. 🎯
You've got this. 💪