
Triage Before Diagnosis: Efficient Debugging Under Pressure

Stopping the bleeding before understanding the wound

This lesson covers triage methodologies, prioritization frameworks, and rapid assessment techniques: essential skills for debugging systems under pressure when multiple issues compete for attention.

Welcome to Systematic Bug Triage 🔍

When production is on fire 🔥 and multiple alerts are flooding your dashboard, the instinct to dive deep into the first bug you see can be deadly. Triage before diagnosis is the professional developer's mantra: assess the landscape, prioritize ruthlessly, and allocate your limited time where it will have the greatest impact.

This approach, borrowed from emergency medicine, transforms chaotic crisis response into systematic problem-solving. Just as an ER doctor doesn't treat patients in arrival order, you shouldn't debug issues in the order they appear in your logs.

Core Concepts: The Triage Mindset 🎯

What is Bug Triage?

Bug triage is the systematic process of quickly assessing multiple issues to determine their relative priority, severity, and appropriate response strategy. It's about asking "Which problem should I solve first?" before asking "How do I solve this problem?"

The key distinction:

  • Triage = Assessment, categorization, prioritization (minutes)
  • Diagnosis = Root cause analysis, detailed investigation (hours)

💡 Mental Model: Think of triage as the "sorting hat" phase. You're not fixing bugs yet; you're organizing them into buckets that determine your response strategy.

The Three-Tier Severity Model 📊

Most organizations use a variant of this classification system:

| Severity | Impact | Response Time | Examples |
|---|---|---|---|
| 🔴 P0 / Critical | System down, data loss, security breach | Immediate (drop everything) | Payment processing broken, authentication failing, data corruption |
| 🟡 P1 / High | Major feature broken, significant user impact | Same day | Search not working, emails not sending, API returning 500s |
| 🟢 P2 / Medium | Minor feature broken, workaround exists | This sprint | UI glitch, slow performance on edge case, minor visual bug |
| 🔵 P3 / Low | Nice-to-have, cosmetic, future improvement | Backlog | Color inconsistency, typo, feature request |

⚠️ Common Mistake: Treating everything as urgent. If everything is P0, nothing is P0. Proper triage requires honest assessment of actual business impact.
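
The table above translates naturally into data your triage tooling can enforce. Here's a minimal sketch in Python; the Severity enum and RESPONSE_POLICY names are illustrative, not part of any standard library:

# A minimal sketch of the three-tier model as data; names and response
# windows are illustrative, mirroring the table above.
from enum import IntEnum

class Severity(IntEnum):
    P0 = 0  # Critical: system down, data loss, security breach
    P1 = 1  # High: major feature broken, significant user impact
    P2 = 2  # Medium: minor feature broken, workaround exists
    P3 = 3  # Low: cosmetic, nice-to-have

RESPONSE_POLICY = {
    Severity.P0: "Immediate (drop everything)",
    Severity.P1: "Same day",
    Severity.P2: "This sprint",
    Severity.P3: "Backlog",
}

print(RESPONSE_POLICY[Severity.P1])  # "Same day"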

The RICE Framework for Prioritization 🎲

When severity levels aren't enough (multiple P1s competing), use RICE scoring:

  • Reach: How many users affected? (1-1000+)
  • Impact: How severely? (0.25=minimal, 0.5=low, 1=medium, 2=high, 3=massive)
  • Confidence: How sure are you? (50%, 80%, 100%)
  • Effort: Hours/days to fix? (0.5-20+)

RICE Score = (Reach × Impact × Confidence) / Effort

Higher scores = higher priority.

# Example RICE calculation
bugs = [
    {"name": "Checkout broken", "reach": 1000, "impact": 3, "confidence": 1.0, "effort": 4},
    {"name": "Slow loading", "reach": 5000, "impact": 1, "confidence": 0.8, "effort": 8},
    {"name": "Email typo", "reach": 10000, "impact": 0.25, "confidence": 1.0, "effort": 0.5}
]

for bug in bugs:
    score = (bug["reach"] * bug["impact"] * bug["confidence"]) / bug["effort"]
    print(f"{bug['name']}: RICE = {score:.1f}")

# Output:
# Checkout broken: RICE = 750.0  ← Fix this first!
# Slow loading: RICE = 500.0
# Email typo: RICE = 5000.0      ← High score but low actual priority

⚠️ Watch Out: RICE works best for feature prioritization. For critical bugs, severity trumps RICE scores.

The Five-Minute Triage Assessment ⏱️

When a new issue arrives, spend exactly 5 minutes gathering this information:

┌────────────────────────────────────────────────┐
│         TRIAGE CHECKLIST (5 MINUTES)           │
├────────────────────────────────────────────────┤
│                                                │
│  ✅ What broke?                                │
│  ✅ How many users affected?                   │
│  ✅ Is there a workaround?                     │
│  ✅ Is it getting worse?                       │
│  ✅ What's the business impact?                │
│  ✅ Can we roll back?                          │
│  ✅ Do we have logs/repro steps?               │
│                                                │
└────────────────────────────────────────────────┘

💡 Pro Tip: Use a template. Create a Slack/Teams bot that auto-asks these questions when someone reports a bug. This standardizes triage and prevents information gaps.
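
As a sketch of what such a template might capture, here's a hypothetical TriageReport structure; the field names and sample values are illustrative:

# Hypothetical TriageReport capturing the five-minute checklist answers.
from dataclasses import dataclass

@dataclass
class TriageReport:
    what_broke: str
    users_affected: str
    workaround: str          # "none", "manual", or "easy"
    getting_worse: bool
    business_impact: str     # e.g. "revenue", "major feature", "cosmetic"
    can_roll_back: bool
    has_logs_or_repro: bool

report = TriageReport(
    what_broke="Payment webhook timeouts",
    users_affected="~50 transactions/hour",
    workaround="none",
    getting_worse=True,
    business_impact="revenue",
    can_roll_back=False,
    has_logs_or_repro=True,
)
print(report)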

The Blast Radius Concept 💥

Blast radius measures how widely a bug spreads if left unfixed:

        BLAST RADIUS VISUALIZATION

    Small Radius            Large Radius
    (Contained)             (Spreading)

       ┌───┐                ┌─────────────┐
       │ ● │                │    ╱ │ ╲    │
       └───┘                │   ╱  │  ╲   │
    One feature             │  ●───┼───●  │
    affected                │   ╲  │  ╱   │
                            │    ╲ │ ╱    │
    ┌─────────┐             └─────────────┘
    │    ●    │             Multiple systems,
    │  ╱   ╲  │             cascading failures,
    │ ●     ● │             data corruption risk
    └─────────┘
    Small feature,
    complex fix

Prioritize bugs with large, growing blast radius over bugs with small, contained impact, even if the small bug is easier to fix.

Example: A steadily growing memory leak might seem minor today, but left alone it will exhaust memory and crash your server within days. Fix it before the prettier UI bug that affects one modal.
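
Estimating blast radius growth is often back-of-the-envelope arithmetic. A quick sketch with made-up numbers:

# Rough time-to-failure estimate for a growing problem (numbers are made up).
free_memory_mb = 2048       # headroom remaining on the server
leak_rate_mb_per_hour = 50  # observed growth of the leak

hours_until_crash = free_memory_mb / leak_rate_mb_per_hour
print(f"Estimated time until out-of-memory: {hours_until_crash:.0f} hours")
# ~41 hours: invisible today, a P0 by Monday morning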

Symptom vs. Root Cause During Triage 🌳

During triage, you're collecting symptoms, not diagnosing root causes:

| Triage (Symptoms) | Diagnosis (Root Causes) |
|---|---|
| "Users can't log in" | "Session token expiry logic has off-by-one error" |
| "API returns 500 errors" | "Database connection pool exhausted under load" |
| "Page loads slowly" | "N+1 query on User.orders relationship" |
| "Payment processing fails" | "Third-party gateway timeout not handled" |

🧠 Mnemonic: S.T.O.P. during triage

  • Symptoms only
  • Time-box your investigation
  • Observe impact
  • Prioritize, don't diagnose

The Decision Matrix 📋

Combine severity and effort to create an action matrix:

           TRIAGE DECISION MATRIX

    Effort │
           │
    High   │  🟡 Schedule     🔴 All-hands
           │  (Plan fix)      (Emergency)
           │
           ├──────────────────────────────
           │
    Low    │  🟢 Quick win    🟠 Urgent
           │  (Next sprint)   (Fix today)
           │
           └──────────────────────────────
                Low           High
                       Impact/Severity

🔴 P0: Drop everything, full team response
🟠 P1: Fix today, reassign resources if needed
🟡 P1/P2: Schedule into sprint, assign owner
🟢 P2/P3: Backlog, fix during cleanup
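
The matrix can be encoded as a simple lookup. A minimal sketch, with illustrative labels and thresholds:

# Sketch: the decision matrix as a lookup of (impact, effort) -> action.
def triage_action(impact: str, effort: str) -> str:
    matrix = {
        ("high", "high"): "🔴 All-hands (emergency)",
        ("high", "low"):  "🟠 Urgent (fix today)",
        ("low",  "high"): "🟡 Schedule (plan the fix)",
        ("low",  "low"):  "🟢 Quick win (next sprint)",
    }
    return matrix[(impact, effort)]

print(triage_action("high", "low"))  # 🟠 Urgent (fix today)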

Real-World Triage Examples 🔬

Example 1: The Production Alert Storm ⛈️

Scenario: It's 3 PM on Friday. Your monitoring dashboard lights up:

// Incoming alerts (simulated)
const alerts = [
  { 
    id: 1, 
    message: "High CPU usage on web-server-03 (95%)",
    timestamp: "15:02",
    affected_users: "unknown"
  },
  { 
    id: 2, 
    message: "Payment webhook timeouts increased 300%",
    timestamp: "15:03",
    affected_users: "~50 transactions/hour"
  },
  { 
    id: 3, 
    message: "Slow query detected: UserProfile.getAll() (12s avg)",
    timestamp: "15:04",
    affected_users: "Admin dashboard only"
  },
  { 
    id: 4, 
    message: "SSL certificate expires in 14 days",
    timestamp: "15:05",
    affected_users: "0 (future risk)"
  },
  { 
    id: 5, 
    message: "Error rate spike on /api/search (45% errors)",
    timestamp: "15:06",
    affected_users: "~2000 searches/hour"
  }
];

Triage Process:

| Alert | Impact | Blast Radius | Priority | Action |
|---|---|---|---|---|
| #2 Payment webhooks | 💰 Revenue loss | Growing (queue backup) | 🔴 P0 | Investigate immediately |
| #5 Search errors | Major feature broken | Stable but high volume | 🟠 P1 | Parallel investigation |
| #1 High CPU | Possible symptom of #2 or #5 | Contained to one server | 🟡 Monitor | Watch, may resolve with fix |
| #3 Slow admin query | Internal tool only | Small, contained | 🟢 P2 | Schedule for next week |
| #4 SSL expiry | Future risk, no current impact | None yet | 🟢 P2 | Schedule renewal task |

Reasoning:

  • Payment webhooks (#2) are P0 because they directly impact revenue and create growing debt (failed payments pile up)
  • Search errors (#5) are P1 because they affect many users but cause no direct revenue loss
  • High CPU (#1) might be a symptom of #2 or #5, so we monitor it but don't directly investigate
  • Admin slowness (#3) and SSL warning (#4) are important but not urgent

💡 Key Insight: The CPU alert was a red herring. By triaging first, you avoid spending 30 minutes optimizing server resources when the real issue is the payment webhook timeout.

Example 2: The Deceptive Low Error Rate 📉

Scenario: Your error monitoring shows this pattern:

# Error rate data
class ErrorMetric:
    def __init__(self, endpoint, rate, total_requests):
        self.endpoint = endpoint
        self.rate = rate  # percentage
        self.total_requests = total_requests
        self.actual_errors = int((rate / 100) * total_requests)

errors = [
    ErrorMetric("/api/checkout", rate=0.5, total_requests=10000),  # 50 errors
    ErrorMetric("/api/newsletter", rate=15.0, total_requests=500),  # 75 errors
    ErrorMetric("/api/profile", rate=2.0, total_requests=8000)     # 160 errors
]

for e in errors:
    print(f"{e.endpoint}: {e.rate}% error rate ({e.actual_errors} errors)")

# Output:
# /api/checkout: 0.5% error rate (50 errors)
# /api/newsletter: 15.0% error rate (75 errors)
# /api/profile: 2.0% error rate (160 errors)

Naive Triage (wrong): "Newsletter has 15% errors; that's the worst!"

Expert Triage (correct):

  1. Checkout (0.5% = 50 errors) is P0 despite low rate

    • Critical path: blocks purchases
    • Even 0.5% means lost revenue
    • High-value user action
  2. Profile (2.0% = 160 errors) is P1

    • Highest absolute error count
    • Common user action
    • Degraded experience for many
  3. Newsletter (15% = 75 errors) is P2

    • High rate but low impact
    • Non-critical feature
    • Low traffic volume

🧠 Mental Model: Error rate shows symptom severity. Error count shows impact scale. Business criticality determines priority. You need all three.
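
One possible way to combine the three signals into a single triage score; the criticality weights below are assumptions for illustration, not a standard formula:

# Weight absolute error count by business criticality (weights are assumed).
CRITICALITY = {"/api/checkout": 10, "/api/profile": 3, "/api/newsletter": 1}

endpoints = [
    # (endpoint, error_rate_percent, total_requests)
    ("/api/checkout", 0.5, 10_000),
    ("/api/newsletter", 15.0, 500),
    ("/api/profile", 2.0, 8_000),
]

def triage_score(endpoint, rate, total):
    error_count = (rate / 100) * total          # impact scale
    return error_count * CRITICALITY[endpoint]  # weighted by criticality

for ep, rate, total in sorted(endpoints, key=lambda e: -triage_score(*e)):
    print(f"{ep}: score={triage_score(ep, rate, total):.0f}")
# Checkout ranks first despite having the lowest error *rate*.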

Example 3: The Cascading Failure Chain ⛓️

Scenario: Multiple services are failing:

// System status
type ServiceStatus struct {
    Name       string
    Status     string
    ErrorMsg   string
    DependsOn  []string
}

statuses := []ServiceStatus{
    {
        Name:      "frontend-web",
        Status:    "degraded",
        ErrorMsg:  "Slow response times, timeouts on user dashboard",
        DependsOn: []string{"api-gateway"},
    },
    {
        Name:      "api-gateway",
        Status:    "degraded",
        ErrorMsg:  "High latency, connection pool saturation",
        DependsOn: []string{"auth-service", "user-service"},
    },
    {
        Name:      "auth-service",
        Status:    "healthy",
        ErrorMsg:  "",
        DependsOn: []string{"redis-cache"},
    },
    {
        Name:      "user-service",
        Status:    "failing",
        ErrorMsg:  "Database connection timeout",
        DependsOn: []string{"postgres-db"},
    },
    {
        Name:      "postgres-db",
        Status:    "degraded",
        ErrorMsg:  "Max connections reached (500/500)",
        DependsOn: []string{},
    },
}

Dependency Map:

        CASCADING FAILURE ANALYSIS

    ┌──────────────┐
    │ Frontend-Web │ ← User-facing symptoms
    │  (degraded)  │
    └──────┬───────┘
           │
           ↓
    ┌──────────────┐
    │ API-Gateway  │ ← Propagating failures
    │  (degraded)  │
    └───┬──────┬───┘
        │      │
    ┌───┴──┐ ┌─┴────────────┐
    │ Auth │ │ User-Service │ ← One failing
    │ (OK) │ │  (FAILING)   │
    └──────┘ └───────┬──────┘
                     │
              ┌──────┴───────┐
              │ Postgres-DB  │ ← ROOT CAUSE
              │  (degraded)  │   Max connections!
              │  500/500     │
              └──────────────┘

Triage Decision:

  • Don't fix: Frontend (symptom)
  • Don't fix: API Gateway (symptom)
  • Don't fix: User Service (symptom)
  • Fix: Postgres connection limit (root cause)

-- Quick triage action: raise the connection limit
-- (note: max_connections only takes effect after a server restart;
--  pg_reload_conf() alone will not apply this parameter)
ALTER SYSTEM SET max_connections = 1000;

-- Then investigate WHY connections maxed out
-- (connection leak? missing connection pooling? traffic spike?)

💡 Triage Principle: In cascading failures, trace dependencies backward to find the root. Fixing symptoms wastes time and distracts from the real issue.
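
As a sketch of that backward trace (mirroring the service data above, written in Python for brevity), you can walk the dependency map and report the deepest unhealthy service:

# Walk the dependency graph from the user-facing symptom and report the
# deepest unhealthy service(s) as the likely root cause.
STATUS = {
    "frontend-web": "degraded",
    "api-gateway": "degraded",
    "auth-service": "healthy",
    "user-service": "failing",
    "postgres-db": "degraded",
}
DEPENDS_ON = {
    "frontend-web": ["api-gateway"],
    "api-gateway": ["auth-service", "user-service"],
    "auth-service": ["redis-cache"],
    "user-service": ["postgres-db"],
    "postgres-db": [],
}

def likely_root_causes(service):
    """Return unhealthy services with no unhealthy dependencies of their own."""
    unhealthy_deps = [
        dep for dep in DEPENDS_ON.get(service, [])
        if STATUS.get(dep, "healthy") != "healthy"
    ]
    if not unhealthy_deps:
        return [service]
    roots = []
    for dep in unhealthy_deps:
        roots.extend(likely_root_causes(dep))
    return roots

print(likely_root_causes("frontend-web"))  # ['postgres-db']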

Example 4: The Friday Afternoon Dilemma 🕔

Scenario: It's 4:30 PM Friday. Your team wants to leave at 5:00 PM. Three bugs are reported:

// Bug reports
interface BugReport {
  title: string;
  severity: string;
  estimatedFix: string;
  reportedBy: string;
  workaround: string;
}

const bugs: BugReport[] = [
  {
    title: "Dark mode toggle doesn't persist after refresh",
    severity: "P2 - Minor annoyance",
    estimatedFix: "15 minutes (localStorage bug)",
    reportedBy: "Internal QA",
    workaround: "Users can re-toggle each session"
  },
  {
    title: "Email confirmation links expire after 1 hour (should be 24h)",
    severity: "P1 - Users can't activate accounts",
    estimatedFix: "5 minutes (config change) + 30 min deploy",
    reportedBy: "Customer support (3 tickets)",
    workaround: "Support can manually activate accounts"
  },
  {
    title: "Analytics dashboard shows wrong revenue numbers",
    severity: "P1 - Executives making decisions on bad data",
    estimatedFix: "Unknown (data pipeline investigation needed)",
    reportedBy: "CFO",
    workaround: "Pull reports manually from database"
  }
];

Triage Decision Matrix:

| Bug | Impact | Urgency | Fix Time | Decision |
|---|---|---|---|---|
| Email expiry | 🟠 Blocking new users | High (tickets piling up) | 35 min | ✅ Fix now (before leaving) |
| Analytics data | 🔴 Business-critical | Low (weekend, no decisions) | Unknown (hours?) | ⏰ Schedule Monday morning |
| Dark mode | 🟢 Cosmetic | Low | 15 min | 📅 Backlog |

Reasoning:

  • Email bug: Quick fix, actively causing problems right now, workaround is manual (burden on support)
  • Analytics bug: High severity but low urgency (weekend), unknown complexity (investigation might take hours), has workaround
  • Dark mode: Low impact, has workaround (minor inconvenience)

⚠️ Common Mistake: Fixing the "quick win" (dark mode) first because it's easy. But email expiry is actively damaging user experience and creating support burden. Fix that first, even if it takes longer.

🔔 Triage Rule: Time-to-fix is a factor, not the deciding factor. A 5-hour fix for a P0 beats a 5-minute fix for a P3.
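
In code, that rule means severity dominates the sort order and fix time only breaks ties. A small sketch with illustrative data:

# Severity dominates; estimated fix time only breaks ties within a level.
bugs = [
    {"title": "Dark mode toggle", "severity": "P2", "fix_hours": 0.25},
    {"title": "Email link expiry", "severity": "P1", "fix_hours": 0.6},
    {"title": "Checkout broken",  "severity": "P0", "fix_hours": 5.0},
]

SEVERITY_RANK = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}

for bug in sorted(bugs, key=lambda b: (SEVERITY_RANK[b["severity"]], b["fix_hours"])):
    print(bug["severity"], bug["title"])
# P0 Checkout broken    <- the 5-hour P0 fix still comes first
# P1 Email link expiry
# P2 Dark mode toggle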

Common Mistakes in Bug Triage ⚠️

1. The "Squeaky Wheel" Trap πŸ”Š

Mistake: Prioritizing bugs based on who's yelling loudest, not actual impact.

# Anti-pattern: Priority based on noise level
class Bug:
    def __init__(self, title, actual_severity, reported_by, follow_ups):
        self.title = title
        self.actual_severity = actual_severity
        self.reported_by = reported_by
        self.follow_ups = follow_ups  # How many times they've asked

bugs = [
    Bug("Checkout broken", "P0", "Anonymous user", follow_ups=1),
    Bug("Logo 2px off-center", "P3", "CEO", follow_ups=5)
]

# ❌ WRONG: Prioritize by follow_ups (noise)
# ✅ RIGHT: Prioritize by actual_severity (impact)

Solution: Establish a clear severity rubric. Communicate it. Use it consistently, regardless of who reports the bug.

2. Analysis Paralysis 🔄

Mistake: Spending 45 minutes triaging when you should be fixing.

// Anti-pattern: Over-analyzing during triage
function triageBug(bug) {
  const impact = assessImpact(bug);        // 5 min ✅
  const users = countAffectedUsers(bug);   // 5 min ✅
  const rootCause = findRootCause(bug);    // 30 min ❌ TOO DEEP!
  const similarBugs = searchHistory(bug);  // 15 min ❌ NOT TRIAGE!
  
  // You're diagnosing, not triaging!
}

// ✅ CORRECT: Time-box triage
function triageBugCorrectly(bug) {
  const deadline = Date.now() + (5 * 60 * 1000); // 5-minute timer
  
  const impact = quickImpactCheck(bug);    // Surface-level only
  const priority = assignPriority(impact);
  
  if (Date.now() > deadline) {
    return { priority, needsMoreInfo: true };
  }
  
  return { priority, readyForDiagnosis: true };
}

Solution: Set a timer. If you can't triage in 5 minutes, mark it "needs investigation" and move on. You can deep-dive after prioritization.

3. Ignoring the Business Context 💼

Mistake: Technical assessment without business awareness.

# Example: Black Friday e-commerce site
class BugTriage
  def assess_priority(bug, date)
    base_priority = bug.technical_severity

    # ❌ WRONG: ignore calendar/business events
    # return base_priority

    # ✅ RIGHT: Context matters
    if date == BLACK_FRIDAY && bug.affects_checkout?
      return "CRITICAL" # Even minor checkout bugs are P0 today
    elsif date == SUNDAY_3AM && bug.affects_admin_panel?
      return "LOW" # Admin bugs can wait till Monday
    else
      return base_priority
    end
  end
end

Examples of business context:

  • Product launch week: features > bugs
  • End of quarter: revenue-impacting issues escalate
  • After security breach: all security issues escalate
  • Low-traffic period: good time for risky fixes

4. The "I Can Fix This Quick" Temptation πŸƒ

Mistake: Starting to fix during triage because "it'll only take a minute."

// Anti-pattern: Fixing while triaging
func TriageIncomingIssues(issues []Issue) {
    for _, issue := range issues {
        priority := assessPriority(issue)
        
        // ❌ TEMPTATION: "Oh, this is just a typo, let me fix it real quick..."
        if priority == "P3" && looksEasy(issue) {
            fixIssue(issue) // STOP! You're context-switching!
            continue
        }
        
        addToPriorityQueue(issue, priority)
    }
}

// ✅ CORRECT: Triage everything FIRST, then fix in priority order
func TriageCorrectly(issues []Issue) {
    priorityQueue := []Issue{}

    // Phase 1: Triage ALL issues (no fixing!)
    for _, issue := range issues {
        issue.Priority = assessPriority(issue) // keep the priority with the issue
        priorityQueue = append(priorityQueue, issue)
    }
    }
    
    // Phase 2: Sort by priority
    sort.Slice(priorityQueue, func(i, j int) bool {
        return priorityQueue[i].Priority > priorityQueue[j].Priority
    })
    
    // Phase 3: NOW fix in order
    for _, issue := range priorityQueue {
        fixIssue(issue)
    }
}

Why this matters: That "quick 2-minute fix" often turns into 20 minutes. Meanwhile, a P0 issue sits unaddressed because you were distracted by an easy P3.

5. Solo Triage for Complex Situations 👥

Mistake: Triaging a major incident alone when you need multiple perspectives.

Solution: For P0 incidents affecting multiple systems, do collaborative triage:

┌────────────────────────────────────────┐
│   COLLABORATIVE TRIAGE PROTOCOL        │
├────────────────────────────────────────┤
│                                        │
│  1. Incident Commander (IC)            │
│     → Coordinates, makes final call    │
│                                        │
│  2. Technical Lead (TL)                │
│     → Assesses technical severity      │
│                                        │
│  3. Product Owner (PO)                 │
│     → Assesses business impact         │
│                                        │
│  4. Customer Support (CS)              │
│     → Reports user impact              │
│                                        │
│  🎯 15-minute sync:                    │
│     - Each shares their view           │
│     - IC assigns priority              │
│     - Team executes                    │
│                                        │
└────────────────────────────────────────┘

When to escalate from solo to collaborative triage:

  • Multiple services affected
  • Unclear impact scope
  • Priority conflicts between stakeholders
  • Potential for cascading failures

Key Takeaways 🎯

📋 Quick Reference: Triage Essentials

| Principle | Guideline |
|---|---|
| 🎯 Golden Rule | Triage is about what to fix and when, not how to fix it |
| ⏱️ Time Limit | 5 minutes per issue maximum during initial triage |
| 📊 Priority Formula | Impact × Users Affected × Business Criticality |
| 💥 Blast Radius | Growing > Large > Contained |
| 🎭 Context Matters | Same bug has different priority on Black Friday vs Sunday 3 AM |
| 🔗 Cascading Failures | Trace dependencies backward; fix root cause, not symptoms |
| 🚫 Avoid | Analysis paralysis, squeaky wheel prioritization, solo triage for P0s |
| ✅ Remember | Symptoms first, diagnosis later; prioritize by impact, not ease |

The Triage Checklist (Print & Post) 📌

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
     BUG TRIAGE DECISION CHECKLIST
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

□ How many users affected?
  ○ 1-10      → Consider P2/P3
  ○ 10-100    → Consider P1
  ○ 100+      → Consider P0/P1
  ○ All users → Likely P0

□ What's the business impact?
  ○ Revenue loss          → P0
  ○ Data loss/security    → P0
  ○ Major feature down    → P1
  ○ Minor feature down    → P2
  ○ Cosmetic/nice-to-have → P3

□ Is there a workaround?
  ○ No workaround     → +1 severity
  ○ Manual workaround → Keep severity
  ○ Easy workaround   → -1 severity

□ Is it getting worse?
  ○ Growing blast radius → +1 severity
  ○ Stable               → Keep severity
  ○ Self-limiting        → -1 severity

□ Can we roll back?
  ○ Yes, easily → Rollback first, diagnose later
  ○ No          → Fix forward

□ Business context?
  ○ High-traffic period      → +1 severity
  ○ Critical business period → +1 severity
  ○ Low-traffic period       → Keep severity

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
FINAL DECISION:
  □ P0 - Drop everything
  □ P1 - Fix today
  □ P2 - This sprint
  □ P3 - Backlog
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
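
The +1/-1 adjustments lend themselves to a tiny calculator. A sketch that treats the step sizes above as given; clamping to the P0-P3 range is an assumption:

# Apply the checklist's +1/-1 adjustments to a base severity.
# Lower number = more severe (P0 < P1 < P2 < P3).
def adjust_severity(base: int, workaround: str, getting_worse: bool,
                    critical_business_period: bool) -> int:
    severity = base
    if workaround == "none":
        severity -= 1          # "+1 severity" means one level more severe
    elif workaround == "easy":
        severity += 1
    if getting_worse:
        severity -= 1
    if critical_business_period:
        severity -= 1
    return max(0, min(3, severity))  # clamp to P0..P3

# Major feature down (P1), no workaround, getting worse, on Black Friday:
print(f"P{adjust_severity(1, 'none', True, True)}")  # P0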

Parting Wisdom 💡

"Hours of debugging can save you minutes of planning." (Anonymous, paraphrased)

The 5 minutes you invest in proper triage can save hours of misdirected debugging effort. When production is burning and adrenaline is high, the discipline to stop, assess, and prioritize separates experienced engineers from those who fight fires ineffectively.

Triage isn't about being slow; it's about being precisely fast. You're sprinting in the right direction, not just sprinting.


📚 Further Study

  1. Google SRE Book - Chapter 12: "Effective Troubleshooting": https://sre.google/sre-book/effective-troubleshooting/ - Deep dive into systematic debugging and triage methodologies from Google's Site Reliability Engineering team

  2. Incident Management Best Practices: https://www.atlassian.com/incident-management/incident-response/severity-levels - Atlassian's guide to severity classification and triage workflows

  3. The RICE Prioritization Framework: https://www.intercom.com/blog/rice-simple-prioritization-for-product-managers/ - Detailed explanation of RICE scoring for feature and bug prioritization