Anti-Patterns to Avoid
Common mistakes that make incidents worse
When debugging under pressure, understanding what not to do is just as critical as knowing effective strategies. This lesson covers panic-driven development, cargo cult debugging, and blame-oriented problem-solving: essential knowledge for maintaining code quality and team effectiveness during critical incidents.
Welcome
Every developer has been there: production is down, users are complaining, and your manager is breathing down your neck. In these high-pressure moments, our instincts can actually work against us. We fall into predictable patterns of behavior that feel productive but actually delay resolution and create additional problems.
This lesson examines the most damaging anti-patterns that emerge during crisis debugging. Understanding these pitfalls will help you recognize them in real-time and choose more effective approaches instead.
Core Concepts
The Panic Loop
The panic loop is perhaps the most common anti-pattern in crisis debugging. It follows a predictable cycle:
Incident Detected
  ↓
Panic Response
  ↓
Random Changes
  ↓
Still Broken
  ↓
More Panic
  ↓
(repeat)
When we panic, our prefrontal cortex (responsible for logical thinking) becomes impaired while our amygdala (fight-or-flight response) takes over. This biological response was great for escaping predators but terrible for systematic problem-solving.
Symptoms of the panic loop:
- Making changes without understanding the problem
- Skipping documentation/logging
- Reverting changes immediately if they don't work
- Jumping between multiple theories without testing any thoroughly
- Restarting services hoping the problem "just goes away"
Breaking the panic loop: The five-second rule. When you notice panic rising, count to five and take one deep breath. This simple action reengages your prefrontal cortex and interrupts the automatic panic response.
Cargo Cult Debugging
Cargo cult debugging refers to applying fixes without understanding why they work. The term comes from WWII-era islanders who built fake airstrips hoping planes would land, mimicking what they saw but not understanding the underlying mechanisms.
In debugging, this manifests as:
# Developer adds this without understanding why
try:
    result = perform_operation()
except:
    pass  # "This fixed it last time"
    result = None
The code above silently swallows errors, masking the real problem. It might appear to "fix" the immediate issue, but it creates a ticking time bomb.
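For contrast, here is a minimal sketch (not from the original lesson) of handling the same call without hiding the failure: catch only the exception you actually expect, log it with context, and let anything else propagate. The choice of `TimeoutError` and the logger setup are illustrative assumptions.

```python
import logging

logger = logging.getLogger(__name__)

def run_operation_visibly():
    try:
        return perform_operation()
    except TimeoutError:  # illustrative: catch only the failure you understand
        # Keep the evidence: record the full traceback before falling back
        logger.exception("perform_operation timed out; falling back to None")
        return None
    # Any other exception propagates, so the real problem stays visible
```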
Common cargo cult patterns:
| Anti-Pattern | What It Looks Like | Why It's Dangerous |
|---|---|---|
| Magic delays | `Thread.sleep(1000)` | Masks race conditions without fixing them |
| Blanket try-catch | `catch (Exception e) {}` | Hides errors you need to see |
| "Just restart it" | Scheduled nightly reboots | Treats symptoms, not causes |
| Copy-paste solutions | Stack Overflow without understanding | Introduces code you can't maintain |
| Voodoo configuration | `max_connections=1000 # "just in case"` | Obscures actual resource needs |
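To make the "magic delays" row concrete, here is a hedged sketch contrasting a sleep that merely papers over a race with an explicit, bounded wait on the real precondition. The `resource_ready()` check and the timeout values are hypothetical.

```python
import time

# Cargo cult: hope one second is always enough (it won't be under load)
def wait_cargo_cult():
    time.sleep(1)

# Better: poll the actual condition with an explicit, bounded timeout
def wait_for_resource(timeout_s=5.0, poll_s=0.1):
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if resource_ready():  # hypothetical check for the real precondition
            return
        time.sleep(poll_s)
    raise TimeoutError(f"resource not ready within {timeout_s:.1f}s")
```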
Mnemonic - CARGO:
- Copying without comprehension
- Assuming correlation means causation
- Repeating past fixes blindly
- Guessing instead of measuring
- Omitting root cause analysis
Blame-Oriented Debugging
When systems fail under pressure, there's often an impulse to find who caused the problem rather than what caused it. This blame-oriented approach is incredibly destructive.
The blame cycle:
Production Incident
  ↓
"Who deployed last?"
  ↓
Developer on defensive
  ↓
Less information sharing
  ↓
Slower resolution
  ↓
Eroded team trust
// Blame-oriented communication
"Why did you deploy this broken code?"
"Didn't you test this?"
"This is obviously your fault."
// Problem-oriented communication
"The deployment at 14:32 correlates with error spikes."
"What tests would have caught this?"
"Let's understand the failure chain."
Why blame is toxic: When people fear blame, they:
- Hide mistakes instead of reporting them quickly
- Avoid taking ownership of problems
- Spend energy on defense rather than solutions
- Leave the organization, taking knowledge with them
The blameless postmortem: Companies like Google, Etsy, and Netflix conduct blameless postmortems after incidents. The focus is entirely on systems and processes, not individuals. Questions like "Why did our deployment process allow this?" replace "Why did Bob deploy this?"
Random Walk Debugging
Random walk debugging occurs when developers try fixes randomly without a hypothesis. It's like trying to find your way out of a maze by randomly turning; you might eventually escape, but you'll waste enormous time and might make things worse.
// Random walk debugging in action
public void fixSlowness() {
    // Try 1: Maybe it's the cache?
    cache.clear();
    // Try 2: Increase thread pool?
    executor.setCorePoolSize(100);
    // Try 3: Disable some feature?
    featureFlags.disable("new_algorithm");
    // Try 4: More memory?
    System.gc();
    // Try 5: Different timeout?
    connection.setTimeout(30000);
}
The problem: You've made five changes simultaneously. If performance improves (or worsens), you have no idea which change mattered. You've also potentially:
- Introduced new bugs
- Created unsupported configurations
- Made the system harder to understand
- Wasted time on irrelevant changes
Better approach - Scientific Method:
| Step | Action | Example |
|---|---|---|
| 1. Observe | Gather data about the problem | API response time: 500ms → 5000ms |
| 2. Hypothesize | Form testable theory | "Database query is slow due to missing index" |
| 3. Predict | What should happen if true? | "EXPLAIN shows full table scan" |
| 4. Test | Make ONE change | Add index on user_id column |
| 5. Verify | Measure the result | Response time now 50ms - hypothesis confirmed |
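The same discipline can be expressed in code: establish a baseline, apply exactly one change, and measure again before drawing conclusions. This is a minimal sketch; `run_query` and `apply_single_change` are placeholders for whatever you are actually testing.

```python
import statistics
import time

def measure_ms(fn, runs=20):
    """Call fn several times and return the median duration in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)

baseline = measure_ms(run_query)   # Step 1: observe before touching anything
apply_single_change()              # Step 4: make ONE change (e.g. add the index)
after = measure_ms(run_query)      # Step 5: verify against the baseline
print(f"median latency: {baseline:.1f}ms -> {after:.1f}ms")
```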
Fix-and-Forget
The fix-and-forget anti-pattern happens when you resolve the immediate crisis but don't address underlying issues or document what happened.
# Fix-and-forget in action
$ service myapp restart
# Service comes back up
$ # Walk away, no investigation

# Three weeks later, same problem
$ service myapp restart
# "Why does this keep happening?"
Consequences:
- Problem recurs repeatedly
- No organizational learning
- Increasing technical debt
- Loss of customer trust
- Burnout from repeated firefighting
What you should do instead:
### Incident #2847 - Service Restart Required
**Date:** 2024-01-15 14:35 UTC
**Duration:** 12 minutes downtime
**Impact:** 50% of users received 503 errors
**Immediate Fix:** Restarted application service
**Root Cause:** Memory leak in session management code
- Sessions not properly garbage collected
- Memory usage grew linearly with user logins
- OOMKiller terminated process at 95% memory
**Permanent Fix:** PR #3421 - Fixed session cleanup
**Prevention:**
- Added memory usage alerting at 80%
- Added unit tests for session lifecycle
- Scheduled weekly memory profiling
**Follow-up:** Review other potential memory leaks (JIRA-5123)
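As a sketch of the "memory usage alerting at 80%" prevention item, assuming the third-party psutil package is available and a hypothetical `send_alert` helper is wired to your paging system:

```python
import psutil  # third-party dependency, assumed available

ALERT_THRESHOLD_PERCENT = 80.0

def check_memory():
    usage = psutil.virtual_memory().percent
    if usage >= ALERT_THRESHOLD_PERCENT:
        # send_alert is a placeholder for your alerting/paging integration
        send_alert(f"Memory at {usage:.0f}% (threshold {ALERT_THRESHOLD_PERCENT:.0f}%)")
    return usage
```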
The 5 Whys technique: Ask "why" five times to get to root causes:
- Why did service crash? → Out of memory
- Why out of memory? → Memory leak
- Why memory leak? → Sessions not cleaned up
- Why not cleaned up? → Missing cleanup code
- Why missing? → No code review checklist for resource management
Communication Failures
During incidents, poor communication amplifies problems:
Silent debugging: Working on a critical issue without telling anyone. Your team doesn't know:
- That the incident is being worked on
- What's been tried already
- Whether they should help
- When to expect resolution
Communication spam: The opposite extreme, constant updates with no new information:
14:35 "Looking into it"
14:36 "Still looking"
14:37 "Making progress"
14:38 "Almost there"
14:39 "Still working on it"
Better communication pattern:
14:35 INCIDENT: API returning 500 errors
      Impact: All users affected
      Incident commander: @alice
      War room: #incident-2847

14:40 UPDATE: Root cause identified
      Database connection pool exhausted
      Attempted: Restarted app servers (no effect)
      Next: Investigating database load

14:55 RESOLVED: Blocked query identified and killed
      Added query timeout protection
      Monitoring for recurrence
      Postmortem scheduled tomorrow 10am
Examples
Example 1: The Timeout That Wasn't
Scenario: A developer notices API timeouts in production during peak traffic.
Anti-pattern approach (Random Walk + Cargo Cult):
## "I'll just increase all the timeouts!"
requests.get(url, timeout=30) # was 5
db_connection.timeout = 60 # was 10
cache_timeout = 300 # was 60
## "And add some retries for good measure"
for i in range(5):
try:
result = api_call()
break
except:
time.sleep(1) # "This always helps"
Problem: The developer:
- Never measured actual response times
- Made multiple changes simultaneously
- Added retries that multiply load during incidents
- Used magic numbers without understanding
Better approach (Scientific Method):
# Step 1: Observe
import logging
import time

import requests  # used for the HTTP calls below

logger = logging.getLogger(__name__)

def instrumented_api_call():
    start = time.time()
    try:
        result = requests.get(url, timeout=5)
        duration = time.time() - start
        logger.info(f"API call succeeded in {duration:.3f}s")
        return result
    except requests.Timeout:
        duration = time.time() - start
        logger.error(f"API call timed out after {duration:.3f}s")
        raise

# Step 2: Analyze logs - discovered 95% succeed in <1s
# Only 5% timeout, and they timeout at exactly 5s

# Step 3: Hypothesis - Some requests legitimately take >5s
# Not a timeout problem, but a slow request problem

# Step 4: Investigate those specific slow requests
# Found: Large report generation requests

# Step 5: Proper solution
def api_call_with_appropriate_timeout(request_type):
    timeout = 5 if request_type == "normal" else 30
    return requests.get(url, timeout=timeout)
Key lesson: The problem wasn't timeouts; it was treating all requests the same. Measurement revealed the real issue.
Example 2: The Blame Game Disaster
Scenario: A deployment causes widespread outages.
Anti-pattern approach (Blame-Oriented):
CTO: "Who approved this deployment?"
PM: "Engineering said it was tested."
Dev1: "QA signed off on it."
QA: "We only test what we're given."
Dev2: "The requirements were unclear."
PM: "I provided detailed specs!"
- 45 minutes arguing
- Service still down
- Finger-pointing continues
Better approach (Blameless, Problem-Focused):
Incident Commander: "Status update every 10 minutes.
All focus on restoration."
14:05 - Rolled back deployment
14:15 - Service restored
14:20 - Customer communication sent
[Next day - Blameless Postmortem]
What happened (timeline):
- 13:45 Deployment began
- 13:52 Error rate spiked to 60%
- 14:05 Rollback initiated
- 14:15 Service normal
What went wrong (systems view):
- Database migration script not backwards compatible
- Staging environment had different DB version
- Deployment checklist didn't verify DB compatibility
- No automated rollback on error spike
Action items:
- Add DB version check to deployment script (Owner: Dev1)
- Align staging/prod DB versions (Owner: DevOps)
- Implement auto-rollback on error threshold (Owner: Dev2)
- Update deployment checklist (Owner: PM)
What went right:
- Fast detection (7 minutes)
- Clear rollback procedure
- Good team communication
Key lesson: The blame-oriented meeting wasted 45 minutes and damaged relationships. The blameless approach restored service quickly and improved systems.
Example 3: The Debugging Detective Story
Scenario: Application crashes intermittently in production.
Anti-pattern approach (Panic Loop + Fix-and-Forget):
# Monday 3pm - first crash
$ service myapp restart
$ # "Fixed it!"

# Tuesday 11am - crashes again
$ service myapp restart
$ # "Weird, but okay"

# Wednesday 2pm - crashes again
$ service myapp restart
$ vim config.yml   # increase memory limit
$ # "Maybe this will help"

# Thursday 9am - crashes again
$ service myapp restart
$ # "This is getting ridiculous"

# Friday 4pm - crashes again
$ service myapp restart
$ # Everyone goes home frustrated
Better approach (Systematic Investigation):
# Monday 3pm - first crash
$ systemctl status myapp
● myapp.service - My Application
   Active: failed (Result: signal) since Mon 15:23:17
   Process: 12847 ExecStart=/usr/bin/myapp (code=killed, signal=SEGV)

# Step 1: Preserve evidence
$ journalctl -u myapp > crash_log_monday.txt
$ cp /var/log/myapp/app.log app_log_monday.txt
$ dmesg | tail -50 > kernel_log_monday.txt

# Step 2: Look for patterns
$ grep -i "error\|exception\|fatal" app_log_monday.txt
2024-01-15 15:23:15 ERROR: Failed to allocate buffer
2024-01-15 15:23:16 FATAL: Segmentation fault in module: image_processor

# Step 3: Form hypothesis
# "Crashes related to image processing"

# Step 4: Enable detailed logging
$ vim /etc/myapp/config.yml
log_level: DEBUG
image_processor:
  log_operations: true
$ service myapp restart

# Step 5: Wait for next crash with better data

# Tuesday 11am - crash with detailed logs
$ grep "image_processor" app.log | tail -20
DEBUG: Processing image: large_file.tif (250MB)
DEBUG: Allocating buffer: 750MB
ERROR: malloc failed - out of memory

# Step 6: Root cause found
# Large TIFF files require 3x memory for decompression
# Server has 2GB RAM, some images need 750MB

# Step 7: Proper fix (multiple options evaluated)
# Option A: Add memory (expensive)
# Option B: Reject large files (bad UX)
# Option C: Stream processing (complex)
# Option D: Limit concurrent processing + queue (chosen)

import threading

max_concurrent_images = 2
image_semaphore = threading.Semaphore(max_concurrent_images)

def process_image_safely(image_path):
    with image_semaphore:
        # Only 2 images processed simultaneously
        # Others wait in queue
        return process_image(image_path)
Key lesson: Systematic investigation with evidence collection solved in 1 day what panic-and-restart couldn't solve in a week.
Example 4: The Copy-Paste Catastrophe
Scenario: Slow database queries need optimization.
Anti-pattern approach (Cargo Cult from Stack Overflow):
-- Developer searches "make SQL faster"
-- Copies top Stack Overflow answer
-- Original query
SELECT * FROM users WHERE email = 'test@example.com';
-- "Optimized" with cargo cult patterns
SELECT /*+ PARALLEL(users 8) */
*
FROM users WITH (NOLOCK, INDEX(idx_all_columns))
WHERE email = 'test@example.com'
OPTION (OPTIMIZE FOR UNKNOWN, MAXDOP 8, FAST 1);
Problems:
- `PARALLEL` hint doesn't help single-row lookups
- `NOLOCK` can return dirty/inconsistent data
- `INDEX(idx_all_columns)` probably doesn't exist
- `OPTIMIZE FOR UNKNOWN` hurts when you DO know values
- Developer has no idea what any of this does
Better approach (Understanding Before Applying):
-- Step 1: Measure current performance
SET STATISTICS TIME ON;
SELECT * FROM users WHERE email = 'test@example.com';
-- Result: 450ms
-- Step 2: Analyze execution plan
EXPLAIN ANALYZE
SELECT * FROM users WHERE email = 'test@example.com';
-- Shows: Sequential Scan on users (cost=0..35000 rows=1000000)
-- Step 3: Identify actual problem
-- No index on email column! Table scan checking 1M rows.
-- Step 4: Targeted fix
CREATE INDEX idx_users_email ON users(email);
-- Step 5: Verify improvement
EXPLAIN ANALYZE
SELECT * FROM users WHERE email = 'test@example.com';
-- Shows: Index Scan using idx_users_email (cost=0..8 rows=1)
-- Result: 3ms (150x faster!)
-- Step 6: Document why
/*
Index on email column added 2024-01-15
Reason: User lookup by email is common operation (login, password reset)
Before: 450ms (table scan)
After: 3ms (index scan)
Trade-off: Slightly slower INSERTs (acceptable for this use case)
*/
Key lesson: One simple index based on understanding outperformed a dozen mysterious hints cargo-culted from Stack Overflow.
Common Mistakes
Mistake #1: Changing multiple things at once
// WRONG: Can't tell which change helped
function tryToFix() {
    cache.clear();
    database.reconnect();
    config.reload();
    service.restart();
}

// RIGHT: Isolated changes
function systematicFix() {
    // Try hypothesis 1
    cache.clear();
    if (!problemSolved()) {
        cache.restore();
        // Try hypothesis 2
        database.reconnect();
        if (!problemSolved()) {
            // Continue with next hypothesis
        }
    }
}
Mistake #2: Ignoring successful attempts
When debugging intermittent issues, we obsess over failures but ignore successes. Both contain information!
# WRONG: Only logging failures
if result.failed:
    log.error(f"Request failed: {result}")

# RIGHT: Compare success vs failure
log.info(f"Request result: success={result.success}, "
         f"duration={result.duration}ms, "
         f"server={result.server}, "
         f"cache_hit={result.cache_hit}")

# Analysis might reveal:
# - Failures happen only from server-3
# - Or only when cache misses
# - Or only during specific time windows
Mistake #3: Treating symptoms as root causes
Symptom: Server running out of disk space
  ↓
Symptom: Log files growing too large
  ↓
Symptom: Too many error messages
  ↓
Root Cause: Memory leak causing crashes, generating error logs
Deleting log files treats the symptom. Fixing the memory leak treats the cause.
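One way to chase the cause instead of the symptom is to compare memory snapshots over time. Below is a minimal sketch using Python's built-in tracemalloc module; the workload call is a placeholder for whatever code you suspect.

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

handle_requests_for_a_while()  # placeholder for the suspect workload

after = tracemalloc.take_snapshot()
# Print the source lines whose allocations grew the most between snapshots
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)
```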
Mistake #4: "Works on my machine" dismissal
# WRONG: Dismissive response
"I can't reproduce it locally, so it's probably
user error or a network issue."

# RIGHT: Investigate environmental differences
"Interesting, it works in dev but fails in prod.
Let me check:
- Python versions (dev: 3.9, prod: 3.8)
- Dependency versions
- Environment variables
- Data volumes (dev: 100 rows, prod: 10M rows)
- Resource constraints
- Network latency"
Pro tip: If you can't reproduce locally, reproduce the production environment locally (Docker, VMs, staging servers).
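A small sketch of capturing an environment fingerprint in both dev and prod so the differences above can be diffed rather than guessed at. The chosen fields, the `requests` example dependency, and the `APP_ENV` variable name are assumptions; mask anything secret before logging.

```python
import json
import os
import platform
import sys
from importlib import metadata

def environment_fingerprint():
    """Collect basic facts worth diffing between dev and prod."""
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "requests_version": metadata.version("requests"),  # example dependency
        "app_env": os.environ.get("APP_ENV"),  # assumed variable name
    }

print(json.dumps(environment_fingerprint(), indent=2))
```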
Mistake #5: Skipping the rollback option
When a deployment causes issues, there's often pressure to "fix forward" rather than rollback:
# WRONG: Trying to patch in production under pressure
$ vim app.py   # editing live code
$ # "Just this one quick fix..."

# RIGHT: Rollback first, fix properly later
$ ./rollback.sh   # restore last known good state
$ # Service restored in 2 minutes
$ # Now fix the bug properly in dev with tests
The rollback rule: If you can't fix it in 15 minutes, rollback. You can always deploy the fix later.
Key Takeaways
Core Anti-Patterns Summary:
Quick Reference Card: Anti-Patterns to Avoid
| Anti-Pattern | Recognition Signs | Antidote |
|---|---|---|
| Panic Loop | Random changes, no documentation, increasing desperation | Five-second pause, one change at a time, hypothesis-driven |
| Cargo Cult | "This worked before", copying solutions without understanding | Always ask "Why does this work?" Measure before and after |
| Blame-Oriented | "Who did this?", defensive responses, finger-pointing | Blameless postmortems, focus on systems not people |
| Random Walk | No hypothesis, trying everything, multiple simultaneous changes | Scientific method: observe, hypothesize, test, verify |
| Fix-and-Forget | Same issues recurring, no documentation, no follow-up | Document incidents, identify root causes, track prevention |
Remember the PAUSE framework when pressure builds:
- Pause and breathe (break the panic loop)
- Assess the situation (gather data, not assumptions)
- Understand before acting (form hypothesis)
- Single changes only (isolate variables)
- Examine and document (learn from every incident)
Golden Rules:
- One change at a time - You can't learn from experiments with multiple variables
- Measure, don't guess - "I think" is not a debugging strategy
- Rollback is not failure - It's the fastest path to stability
- Systems fail, people don't - Blame destroys the learning culture
- Document everything - Your future self will thank you
Debugging Under Pressure Checklist:
- Take a 5-second pause to prevent panic response
- Establish incident timeline and communication channel
- Document current state before making changes
- Form testable hypothesis based on evidence
- Make ONE change and measure its effect
- Keep stakeholders updated (not spammed)
- Consider rollback if fix takes >15 minutes
- After resolution: schedule blameless postmortem
- Document root cause and prevention steps
- Update runbooks and monitoring
The debugging anti-patterns discussed here are universal across software engineering. Understanding them helps you maintain effectiveness even during the most stressful incidents. Remember: the goal isn't just to fix the current problem; it's to build systems and teams that handle problems better over time.
Further Study
- Google SRE Book - Postmortem Culture - https://sre.google/sre-book/postmortem-culture/ - Essential reading on blameless postmortems and learning from incidents
- Debugging: The 9 Indispensable Rules - https://debuggingrules.com/ - David Agans' systematic approach to debugging that prevents common anti-patterns
- Etsy's Debriefing Facilitation Guide - https://etsy.github.io/DebriefingFacilitationGuide/ - Practical framework for conducting blameless incident reviews