Anti-Patterns to Avoid
Common mistakes that make incidents worse
When debugging under pressure, understanding what not to do is just as critical as knowing effective strategies. This lesson covers panic-driven development, cargo cult debugging, and blame-oriented problem-solving: essential knowledge for maintaining code quality and team effectiveness during critical incidents.
Welcome
Every developer has been there: production is down, users are complaining, and your manager is breathing down your neck. In these high-pressure moments, our instincts can actually work against us. We fall into predictable patterns of behavior that feel productive but actually delay resolution and create additional problems.
This lesson examines the most damaging anti-patterns that emerge during crisis debugging. Understanding these pitfalls will help you recognize them in real-time and choose more effective approaches instead.
Core Concepts
The Panic Loop
The panic loop is perhaps the most common anti-pattern in crisis debugging. It follows a predictable cycle:
Incident Detected
  ↓
Panic Response
  ↓
Random Changes
  ↓
Still Broken
  ↓
More Panic
  ↓
(repeat)
When we panic, our prefrontal cortex (responsible for logical thinking) becomes impaired while our amygdala (fight-or-flight response) takes over. This biological response was great for escaping predators but terrible for systematic problem-solving.
Symptoms of the panic loop:
- Making changes without understanding the problem
- Skipping documentation/logging
- Reverting changes immediately if they don't work
- Jumping between multiple theories without testing any thoroughly
- Restarting services hoping the problem "just goes away"
Breaking the panic loop: The five-second rule. When you notice panic rising, count to five and take one deep breath. This simple action reengages your prefrontal cortex and interrupts the automatic panic response.
Cargo Cult Debugging
Cargo cult debugging refers to applying fixes without understanding why they work. The term comes from WWII-era islanders who built fake airstrips hoping planes would land, mimicking what they saw but not understanding the underlying mechanisms.
In debugging, this manifests as:
# Developer adds this without understanding why
try:
    result = perform_operation()
except:
    pass  # "This fixed it last time"
    result = None
The code above silently swallows errors, masking the real problem. It might appear to "fix" the immediate issue, but it creates a ticking time bomb.
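For contrast, here is a minimal sketch (not from the original lesson) of handling the same call without hiding the failure: catch only the exception you actually expect, log it with context, and let anything else propagate. The choice of `TimeoutError` and the logger setup are illustrative assumptions.

```python
import logging

logger = logging.getLogger(__name__)

def run_operation_visibly():
    try:
        return perform_operation()
    except TimeoutError:  # illustrative: catch only the failure you understand
        # Keep the evidence: record the full traceback before falling back
        logger.exception("perform_operation timed out; falling back to None")
        return None
    # Any other exception propagates, so the real problem stays visible
```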
Common cargo cult patterns:
| Anti-Pattern | What It Looks Like | Why It's Dangerous |
|---|---|---|
| Magic delays | `Thread.sleep(1000)` | Masks race conditions without fixing them |
| Blanket try-catch | `catch (Exception e) {}` | Hides errors you need to see |
| "Just restart it" | Scheduled nightly reboots | Treats symptoms, not causes |
| Copy-paste solutions | Stack Overflow without understanding | Introduces code you can't maintain |
| Voodoo configuration | `max_connections=1000 # "just in case"` | Obscures actual resource needs |
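To make the "magic delays" row concrete, here is a hedged sketch contrasting a sleep that merely papers over a race with an explicit, bounded wait on the real precondition. The `resource_ready()` check and the timeout values are hypothetical.

```python
import time

# Cargo cult: hope one second is always enough (it won't be under load)
def wait_cargo_cult():
    time.sleep(1)

# Better: poll the actual condition with an explicit, bounded timeout
def wait_for_resource(timeout_s=5.0, poll_s=0.1):
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if resource_ready():  # hypothetical check for the real precondition
            return
        time.sleep(poll_s)
    raise TimeoutError(f"resource not ready within {timeout_s:.1f}s")
```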
Mnemonic - CARGO:
- Copying without comprehension
- Assuming correlation means causation
- Repeating past fixes blindly
- Guessing instead of measuring
- Omitting root cause analysis
Blame-Oriented Debugging
When systems fail under pressure, there's often an impulse to find who caused the problem rather than what caused it. This blame-oriented approach is incredibly destructive.
The blame cycle:
Production Incident
  ↓
"Who deployed last?"
  ↓
Developer on defensive
  ↓
Less information sharing
  ↓
Slower resolution
  ↓
Eroded team trust
// Blame-oriented communication
"Why did you deploy this broken code?"
"Didn't you test this?"
"This is obviously your fault."
// Problem-oriented communication
"The deployment at 14:32 correlates with error spikes."
"What tests would have caught this?"
"Let's understand the failure chain."
Why blame is toxic: When people fear blame, they:
- Hide mistakes instead of reporting them quickly
- Avoid taking ownership of problems
- Spend energy on defense rather than solutions
- Leave the organization, taking knowledge with them
The blameless postmortem: Companies like Google, Etsy, and Netflix conduct blameless postmortems after incidents. The focus is entirely on systems and processes, not individuals. Questions like "Why did our deployment process allow this?" replace "Why did Bob deploy this?"
Random Walk Debugging
Random walk debugging occurs when developers try fixes randomly without a hypothesis. It's like trying to find your way out of a maze by randomly turning; you might eventually escape, but you'll waste enormous time and might make things worse.
// Random walk debugging in action
public void fixSlowness() {
    // Try 1: Maybe it's the cache?
    cache.clear();
    // Try 2: Increase thread pool?
    executor.setCorePoolSize(100);
    // Try 3: Disable some feature?
    featureFlags.disable("new_algorithm");
    // Try 4: More memory?
    System.gc();
    // Try 5: Different timeout?
    connection.setTimeout(30000);
}
The problem: You've made five changes simultaneously. If performance improves (or worsens), you have no idea which change mattered. You've also potentially:
- Introduced new bugs
- Created unsupported configurations
- Made the system harder to understand
- Wasted time on irrelevant changes
Better approach - Scientific Method:
| Step | Action | Example |
|---|---|---|
| 1. Observe | Gather data about the problem | API response time: 500ms → 5000ms |
| 2. Hypothesize | Form testable theory | "Database query is slow due to missing index" |
| 3. Predict | What should happen if true? | "EXPLAIN shows full table scan" |
| 4. Test | Make ONE change | Add index on user_id column |
| 5. Verify | Measure the result | Response time now 50ms - hypothesis confirmed |
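The same discipline can be expressed in code: establish a baseline, apply exactly one change, and measure again before drawing conclusions. This is a minimal sketch; `run_query` and `apply_single_change` are placeholders for whatever you are actually testing.

```python
import statistics
import time

def measure_ms(fn, runs=20):
    """Call fn several times and return the median duration in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)

baseline = measure_ms(run_query)   # Step 1: observe before touching anything
apply_single_change()              # Step 4: make ONE change (e.g. add the index)
after = measure_ms(run_query)      # Step 5: verify against the baseline
print(f"median latency: {baseline:.1f}ms -> {after:.1f}ms")
```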
Fix-and-Forget
The fix-and-forget anti-pattern happens when you resolve the immediate crisis but don't address underlying issues or document what happened.
# Fix-and-forget in action
$ service myapp restart
# Service comes back up
$ # Walk away, no investigation

# Three weeks later, same problem
$ service myapp restart
# "Why does this keep happening?"
Consequences:
- Problem recurs repeatedly
- No organizational learning
- Increasing technical debt
- Loss of customer trust
- Burnout from repeated firefighting
What you should do instead:
### Incident #2847 - Service Restart Required
**Date:** 2024-01-15 14:35 UTC
**Duration:** 12 minutes downtime
**Impact:** 50% of users received 503 errors
**Immediate Fix:** Restarted application service
**Root Cause:** Memory leak in session management code
- Sessions not properly garbage collected
- Memory usage grew linearly with user logins
- OOMKiller terminated process at 95% memory
**Permanent Fix:** PR #3421 - Fixed session cleanup
**Prevention:**
- Added memory usage alerting at 80%
- Added unit tests for session lifecycle
- Scheduled weekly memory profiling
**Follow-up:** Review other potential memory leaks (JIRA-5123)
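As a sketch of the "memory usage alerting at 80%" prevention item, assuming the third-party psutil package is available and a hypothetical `send_alert` helper is wired to your paging system:

```python
import psutil  # third-party dependency, assumed available

ALERT_THRESHOLD_PERCENT = 80.0

def check_memory():
    usage = psutil.virtual_memory().percent
    if usage >= ALERT_THRESHOLD_PERCENT:
        # send_alert is a placeholder for your alerting/paging integration
        send_alert(f"Memory at {usage:.0f}% (threshold {ALERT_THRESHOLD_PERCENT:.0f}%)")
    return usage
```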
The 5 Whys technique: Ask "why" five times to get to root causes:
- Why did service crash? → Out of memory
- Why out of memory? → Memory leak
- Why memory leak? → Sessions not cleaned up
- Why not cleaned up? → Missing cleanup code
- Why missing? → No code review checklist for resource management
Communication Failures
During incidents, poor communication amplifies problems:
Silent debugging: Working on a critical issue without telling anyone. Your team doesn't know:
- That the incident is being worked on
- What's been tried already
- Whether they should help
- When to expect resolution
Communication spam: The opposite extreme, constant updates with no new information:
14:35 "Looking into it"
14:36 "Still looking"
14:37 "Making progress"
14:38 "Almost there"
14:39 "Still working on it"
Better communication pattern:
14:35 INCIDENT: API returning 500 errors
      Impact: All users affected
      Incident commander: @alice
      War room: #incident-2847

14:40 UPDATE: Root cause identified
      Database connection pool exhausted
      Attempted: Restarted app servers (no effect)
      Next: Investigating database load

14:55 RESOLVED: Blocked query identified and killed
      Added query timeout protection
      Monitoring for recurrence
      Postmortem scheduled tomorrow 10am
Examples
Example 1: The Timeout That Wasn't
Scenario: A developer notices API timeouts in production during peak traffic.
Anti-pattern approach (Random Walk + Cargo Cult):
## "I'll just increase all the timeouts!"
requests.get(url, timeout=30) # was 5
db_connection.timeout = 60 # was 10
cache_timeout = 300 # was 60
## "And add some retries for good measure"
for i in range(5):
try:
result = api_call()
break
except:
time.sleep(1) # "This always helps"
Problem: The developer:
- Never measured actual response times
- Made multiple changes simultaneously
- Added retries that multiply load during incidents
- Used magic numbers without understanding
Better approach (Scientific Method):
# Step 1: Observe
import logging
import time

import requests  # used for the HTTP calls below

logger = logging.getLogger(__name__)

def instrumented_api_call():
    start = time.time()
    try:
        result = requests.get(url, timeout=5)
        duration = time.time() - start
        logger.info(f"API call succeeded in {duration:.3f}s")
        return result
    except requests.Timeout:
        duration = time.time() - start
        logger.error(f"API call timed out after {duration:.3f}s")
        raise

# Step 2: Analyze logs - discovered 95% succeed in <1s
# Only 5% timeout, and they timeout at exactly 5s

# Step 3: Hypothesis - Some requests legitimately take >5s
# Not a timeout problem, but a slow request problem

# Step 4: Investigate those specific slow requests
# Found: Large report generation requests

# Step 5: Proper solution
def api_call_with_appropriate_timeout(request_type):
    timeout = 5 if request_type == "normal" else 30
    return requests.get(url, timeout=timeout)
Key lesson: The problem wasn't timeouts; it was treating all requests the same. Measurement revealed the real issue.
Example 2: The Blame Game Disaster
Scenario: A deployment causes widespread outages.
Anti-pattern approach (Blame-Oriented):
CTO: "Who approved this deployment?"
PM: "Engineering said it was tested."
Dev1: "QA signed off on it."
QA: "We only test what we're given."
Dev2: "The requirements were unclear."
PM: "I provided detailed specs!"
- 45 minutes arguing
- Service still down
- Finger-pointing continues
Better approach (Blameless, Problem-Focused):
Incident Commander: "Status update every 10 minutes.
All focus on restoration."
14:05 - Rolled back deployment
14:15 - Service restored
14:20 - Customer communication sent
[Next day - Blameless Postmortem]
What happened (timeline):
- 13:45 Deployment began
- 13:52 Error rate spiked to 60%
- 14:05 Rollback initiated
- 14:15 Service normal
What went wrong (systems view):
- Database migration script not backwards compatible
- Staging environment had different DB version
- Deployment checklist didn't verify DB compatibility
- No automated rollback on error spike
Action items:
- Add DB version check to deployment script (Owner: Dev1)
- Align staging/prod DB versions (Owner: DevOps)
- Implement auto-rollback on error threshold (Owner: Dev2)
- Update deployment checklist (Owner: PM)
What went right:
- Fast detection (7 minutes)
- Clear rollback procedure
- Good team communication
Key lesson: The blame-oriented meeting wasted 45 minutes and damaged relationships. The blameless approach restored service quickly and improved systems.
Example 3: The Debugging Detective Story
Scenario: Application crashes intermittently in production.
Anti-pattern approach (Panic Loop + Fix-and-Forget):
# Monday 3pm - first crash
$ service myapp restart
$ # "Fixed it!"

# Tuesday 11am - crashes again
$ service myapp restart
$ # "Weird, but okay"

# Wednesday 2pm - crashes again
$ service myapp restart
$ vim config.yml   # increase memory limit
$ # "Maybe this will help"

# Thursday 9am - crashes again
$ service myapp restart
$ # "This is getting ridiculous"

# Friday 4pm - crashes again
$ service myapp restart
$ # Everyone goes home frustrated
Better approach (Systematic Investigation):
# Monday 3pm - first crash
$ systemctl status myapp
● myapp.service - My Application
   Active: failed (Result: signal) since Mon 15:23:17
   Process: 12847 ExecStart=/usr/bin/myapp (code=killed, signal=SEGV)

# Step 1: Preserve evidence
$ journalctl -u myapp > crash_log_monday.txt
$ cp /var/log/myapp/app.log app_log_monday.txt
$ dmesg | tail -50 > kernel_log_monday.txt

# Step 2: Look for patterns
$ grep -i "error\|exception\|fatal" app_log_monday.txt
2024-01-15 15:23:15 ERROR: Failed to allocate buffer
2024-01-15 15:23:16 FATAL: Segmentation fault in module: image_processor

# Step 3: Form hypothesis
# "Crashes related to image processing"

# Step 4: Enable detailed logging
$ vim /etc/myapp/config.yml
log_level: DEBUG
image_processor:
  log_operations: true
$ service myapp restart

# Step 5: Wait for next crash with better data

# Tuesday 11am - crash with detailed logs
$ grep "image_processor" app.log | tail -20
DEBUG: Processing image: large_file.tif (250MB)
DEBUG: Allocating buffer: 750MB
ERROR: malloc failed - out of memory

# Step 6: Root cause found
# Large TIFF files require 3x memory for decompression
# Server has 2GB RAM, some images need 750MB

# Step 7: Proper fix (multiple options evaluated)
# Option A: Add memory (expensive)
# Option B: Reject large files (bad UX)
# Option C: Stream processing (complex)
# Option D: Limit concurrent processing + queue (chosen)

import threading

max_concurrent_images = 2
image_semaphore = threading.Semaphore(max_concurrent_images)

def process_image_safely(image_path):
    with image_semaphore:
        # Only 2 images processed simultaneously
        # Others wait in queue
        return process_image(image_path)
Key lesson: Systematic investigation with evidence collection solved in 1 day what panic-and-restart couldn't solve in a week.
Example 4: The Copy-Paste Catastrophe
Scenario: Slow database queries need optimization.
Anti-pattern approach (Cargo Cult from Stack Overflow):
-- Developer searches "make SQL faster"
-- Copies top Stack Overflow answer
-- Original query
SELECT * FROM users WHERE email = 'test@example.com';
-- "Optimized" with cargo cult patterns
SELECT /*+ PARALLEL(users 8) */
*
FROM users WITH (NOLOCK, INDEX(idx_all_columns))
WHERE email = 'test@example.com'
OPTION (OPTIMIZE FOR UNKNOWN, MAXDOP 8, FAST 1);
Problems:
- `PARALLEL` hint doesn't help single-row lookups
- `NOLOCK` can return dirty/inconsistent data
- `INDEX(idx_all_columns)` probably doesn't exist
- `OPTIMIZE FOR UNKNOWN` hurts when you DO know values
- Developer has no idea what any of this does
Better approach (Understanding Before Applying):
-- Step 1: Measure current performance
SET STATISTICS TIME ON;
SELECT * FROM users WHERE email = 'test@example.com';
-- Result: 450ms
-- Step 2: Analyze execution plan
EXPLAIN ANALYZE
SELECT * FROM users WHERE email = 'test@example.com';
-- Shows: Sequential Scan on users (cost=0..35000 rows=1000000)
-- Step 3: Identify actual problem
-- No index on email column! Table scan checking 1M rows.
-- Step 4: Targeted fix
CREATE INDEX idx_users_email ON users(email);
-- Step 5: Verify improvement
EXPLAIN ANALYZE
SELECT * FROM users WHERE email = 'test@example.com';
-- Shows: Index Scan using idx_users_email (cost=0..8 rows=1)
-- Result: 3ms (150x faster!)
-- Step 6: Document why
/*
Index on email column added 2024-01-15
Reason: User lookup by email is common operation (login, password reset)
Before: 450ms (table scan)
After: 3ms (index scan)
Trade-off: Slightly slower INSERTs (acceptable for this use case)
*/
Key lesson: One simple index based on understanding outperformed a dozen mysterious hints cargo-culted from Stack Overflow.
Common Mistakes
Mistake #1: Changing multiple things at once
// WRONG: Can't tell which change helped
function tryToFix() {
    cache.clear();
    database.reconnect();
    config.reload();
    service.restart();
}

// RIGHT: Isolated changes
function systematicFix() {
    // Try hypothesis 1
    cache.clear();
    if (!problemSolved()) {
        cache.restore();
        // Try hypothesis 2
        database.reconnect();
        if (!problemSolved()) {
            // Continue with next hypothesis
        }
    }
}
Mistake #2: Ignoring successful attempts
When debugging intermittent issues, we obsess over failures but ignore successes. Both contain information!
# WRONG: Only logging failures
if result.failed:
    log.error(f"Request failed: {result}")

# RIGHT: Compare success vs failure
log.info(f"Request result: success={result.success}, "
         f"duration={result.duration}ms, "
         f"server={result.server}, "
         f"cache_hit={result.cache_hit}")

# Analysis might reveal:
# - Failures happen only from server-3
# - Or only when cache misses
# - Or only during specific time windows
Mistake #3: Treating symptoms as root causes
Symptom: Server running out of disk space
  ↓
Symptom: Log files growing too large
  ↓
Symptom: Too many error messages
  ↓
Root Cause: Memory leak causing crashes, generating error logs
Deleting log files treats the symptom. Fixing the memory leak treats the cause.
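One way to chase the cause instead of the symptom is to compare memory snapshots over time. Below is a minimal sketch using Python's built-in tracemalloc module; the workload call is a placeholder for whatever code you suspect.

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

handle_requests_for_a_while()  # placeholder for the suspect workload

after = tracemalloc.take_snapshot()
# Print the source lines whose allocations grew the most between snapshots
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)
```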
Mistake #4: "Works on my machine" dismissal
# WRONG: Dismissive response
"I can't reproduce it locally, so it's probably
user error or a network issue."

# RIGHT: Investigate environmental differences
"Interesting, it works in dev but fails in prod.
Let me check:
- Python versions (dev: 3.9, prod: 3.8)
- Dependency versions
- Environment variables
- Data volumes (dev: 100 rows, prod: 10M rows)
- Resource constraints
- Network latency"
Pro tip: If you can't reproduce locally, reproduce the production environment locally (Docker, VMs, staging servers).
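A small sketch of capturing an environment fingerprint in both dev and prod so the differences above can be diffed rather than guessed at. The chosen fields, the `requests` example dependency, and the `APP_ENV` variable name are assumptions; mask anything secret before logging.

```python
import json
import os
import platform
import sys
from importlib import metadata

def environment_fingerprint():
    """Collect basic facts worth diffing between dev and prod."""
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "requests_version": metadata.version("requests"),  # example dependency
        "app_env": os.environ.get("APP_ENV"),  # assumed variable name
    }

print(json.dumps(environment_fingerprint(), indent=2))
```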
Mistake #5: Skipping the rollback option
When a deployment causes issues, there's often pressure to "fix forward" rather than rollback:
# WRONG: Trying to patch in production under pressure
$ vim app.py   # editing live code
$ # "Just this one quick fix..."

# RIGHT: Rollback first, fix properly later
$ ./rollback.sh   # restore last known good state
$ # Service restored in 2 minutes
$ # Now fix the bug properly in dev with tests
The rollback rule: If you can't fix it in 15 minutes, rollback. You can always deploy the fix later.
Key Takeaways
Core Anti-Patterns Summary:
Quick Reference Card: Anti-Patterns to Avoid
| Anti-Pattern | Recognition Signs | Antidote |
|---|---|---|
| Panic Loop | Random changes, no documentation, increasing desperation | Five-second pause, one change at a time, hypothesis-driven |
| Cargo Cult | "This worked before", copying solutions without understanding | Always ask "Why does this work?" Measure before and after |
| Blame-Oriented | "Who did this?", defensive responses, finger-pointing | Blameless postmortems, focus on systems not people |
| Random Walk | No hypothesis, trying everything, multiple simultaneous changes | Scientific method: observe, hypothesize, test, verify |
| Fix-and-Forget | Same issues recurring, no documentation, no follow-up | Document incidents, identify root causes, track prevention |
Remember the PAUSE framework when pressure builds:
- Pause and breathe (break the panic loop)
- Assess the situation (gather data, not assumptions)
- Understand before acting (form hypothesis)
- Single changes only (isolate variables)
- Examine and document (learn from every incident)
Golden Rules:
- One change at a time - You can't learn from experiments with multiple variables
- Measure, don't guess - "I think" is not a debugging strategy
- Rollback is not failure - It's the fastest path to stability
- Systems fail, people don't - Blame destroys the learning culture
- Document everything - Your future self will thank you
Debugging Under Pressure Checklist:
- Take a 5-second pause to prevent panic response
- Establish incident timeline and communication channel
- Document current state before making changes
- Form testable hypothesis based on evidence
- Make ONE change and measure its effect
- Keep stakeholders updated (not spammed)
- Consider rollback if fix takes >15 minutes
- After resolution: schedule blameless postmortem
- Document root cause and prevention steps
- Update runbooks and monitoring
The debugging anti-patterns discussed here are universal across software engineering. Understanding them helps you maintain effectiveness even during the most stressful incidents. Remember: the goal isn't just to fix the current problem; it's to build systems and teams that handle problems better over time.
Further Study
- Google SRE Book - Postmortem Culture - https://sre.google/sre-book/postmortem-culture/ - Essential reading on blameless postmortems and learning from incidents
- Debugging: The 9 Indispensable Rules - https://debuggingrules.com/ - David Agans' systematic approach to debugging that prevents common anti-patterns
- Etsy's Debriefing Facilitation Guide - https://etsy.github.io/DebriefingFacilitationGuide/ - Practical framework for conducting blameless incident reviews