Triage Before Diagnosis: Efficient Debugging Under Pressure
Stopping the bleeding before understanding the wound
This lesson covers triage methodologies, prioritization frameworks, and rapid assessment techniques: essential skills for debugging systems under pressure when multiple issues compete for attention.
Welcome to Systematic Bug Triage
When production is on fire and multiple alerts are flooding your dashboard, the instinct to dive deep into the first bug you see can be deadly. Triage before diagnosis is the professional developer's mantra: assess the landscape, prioritize ruthlessly, and allocate your limited time where it will have the greatest impact.
This approach, borrowed from emergency medicine, transforms chaotic crisis response into systematic problem-solving. Just as an ER doctor doesn't treat patients in arrival order, you shouldn't debug issues in the order they appear in your logs.
Core Concepts: The Triage Mindset
What is Bug Triage?
Bug triage is the systematic process of quickly assessing multiple issues to determine their relative priority, severity, and appropriate response strategy. It's about asking "Which problem should I solve first?" before asking "How do I solve this problem?"
The key distinction:
- Triage = Assessment, categorization, prioritization (minutes)
- Diagnosis = Root cause analysis, detailed investigation (hours)
Mental Model: Think of triage as the "sorting hat" phase. You're not fixing bugs yet; you're organizing them into buckets that determine your response strategy.
The Four-Tier Severity Model
Most organizations use a variant of this classification system:
| Severity | Impact | Response Time | Examples |
|---|---|---|---|
| P0 / Critical | System down, data loss, security breach | Immediate (drop everything) | Payment processing broken, authentication failing, data corruption |
| P1 / High | Major feature broken, significant user impact | Same day | Search not working, emails not sending, API returning 500s |
| P2 / Medium | Minor feature broken, workaround exists | This sprint | UI glitch, slow performance on edge case, minor visual bug |
| P3 / Low | Nice-to-have, cosmetic, future improvement | Backlog | Color inconsistency, typo, feature request |
Common Mistake: Treating everything as urgent. If everything is P0, nothing is P0. Proper triage requires honest assessment of actual business impact.
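If your team wants the rubric to be machine-readable (for a triage bot or dashboard), it can be encoded as plain data. The sketch below simply restates the table above in Python; the labels and response windows are illustrative, not a standard.

# A severity rubric encoded as data, mirroring the table above (illustrative values).
SEVERITY_RUBRIC = {
    "P0": {"impact": "system down, data loss, security breach", "respond": "immediately"},
    "P1": {"impact": "major feature broken, significant user impact", "respond": "same day"},
    "P2": {"impact": "minor feature broken, workaround exists", "respond": "this sprint"},
    "P3": {"impact": "cosmetic or nice-to-have", "respond": "backlog"},
}

def describe(severity):
    entry = SEVERITY_RUBRIC[severity]
    return f"{severity}: {entry['impact']} (respond: {entry['respond']})"

print(describe("P1"))
# P1: major feature broken, significant user impact (respond: same day)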
The RICE Framework for Prioritization
When severity levels aren't enough (multiple P1s competing), use RICE scoring:
- Reach: How many users affected? (1-1000+)
- Impact: How severely? (0.25=minimal, 0.5=low, 1=medium, 2=high, 3=massive)
- Confidence: How sure are you? (50%, 80%, 100%)
- Effort: Hours/days to fix? (0.5-20+)
RICE Score = (Reach × Impact × Confidence) / Effort
Higher scores = higher priority.
# Example RICE calculation
bugs = [
    {"name": "Checkout broken", "reach": 1000, "impact": 3, "confidence": 1.0, "effort": 4},
    {"name": "Slow loading", "reach": 5000, "impact": 1, "confidence": 0.8, "effort": 8},
    {"name": "Email typo", "reach": 10000, "impact": 0.25, "confidence": 1.0, "effort": 0.5}
]

for bug in bugs:
    score = (bug["reach"] * bug["impact"] * bug["confidence"]) / bug["effort"]
    print(f"{bug['name']}: RICE = {score:.1f}")

# Output:
# Checkout broken: RICE = 750.0   <- fix this first
# Slow loading: RICE = 500.0
# Email typo: RICE = 5000.0       <- high score but low actual priority
Watch Out: RICE works best for feature prioritization. For critical bugs, severity trumps RICE scores.
The Five-Minute Triage Assessment
When a new issue arrives, spend exactly 5 minutes gathering this information:
TRIAGE CHECKLIST (5 MINUTES)
[ ] What broke?
[ ] How many users affected?
[ ] Is there a workaround?
[ ] Is it getting worse?
[ ] What's the business impact?
[ ] Can we roll back?
[ ] Do we have logs/repro steps?

Pro Tip: Use a template. Create a Slack/Teams bot that auto-asks these questions when someone reports a bug. This standardizes triage and prevents information gaps.
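As a rough sketch of what such a template might capture, the dataclass below mirrors the checklist; the field names and the TriageIntake type are hypothetical, not a real bot API.

from dataclasses import dataclass

# Hypothetical intake form mirroring the 5-minute checklist above.
@dataclass
class TriageIntake:
    what_broke: str
    users_affected: str       # e.g. "~2000 searches/hour" or "unknown"
    workaround: str           # "" if none exists
    getting_worse: bool
    business_impact: str      # e.g. "revenue", "internal tooling"
    can_roll_back: bool
    repro_or_logs: str = ""   # link to logs or reproduction steps

    def missing_fields(self):
        """Checklist items the reporter left empty."""
        return [name for name, value in vars(self).items() if value == ""]

report = TriageIntake(
    what_broke="Payment webhook timeouts",
    users_affected="~50 transactions/hour",
    workaround="",
    getting_worse=True,
    business_impact="revenue",
    can_roll_back=False,
)
print(report.missing_fields())  # ['workaround', 'repro_or_logs']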
The Blast Radius Concept
Blast radius measures how widely a bug spreads if left unfixed:
Blast radius visualization: a small radius is contained (one feature affected), while a large radius spreads (multiple systems, cascading failures, risk of data corruption).
Prioritize bugs with a large, growing blast radius over bugs with small, contained impact, even if the small bug is easier to fix.
Example: A memory leak that grows 1MB/hour might seem minor, but in 48 hours it will crash your server. Fix it before the prettier UI bug that affects one modal.
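A quick back-of-the-envelope estimate is often enough to see whether a "minor" issue has a hidden deadline. The numbers below are illustrative; the available headroom is assumed.

# Rough time-to-failure estimate for a growing problem.
free_memory_mb = 48        # headroom left on the server (assumed for illustration)
leak_rate_mb_per_hour = 1  # observed growth of the leak

hours_until_crash = free_memory_mb / leak_rate_mb_per_hour
print(f"Server falls over in roughly {hours_until_crash:.0f} hours "
      f"({hours_until_crash / 24:.1f} days)")
# Server falls over in roughly 48 hours (2.0 days)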
Symptom vs. Root Cause During Triage
During triage, you're collecting symptoms, not diagnosing root causes:
| Triage (Symptoms) | Diagnosis (Root Causes) |
|---|---|
| "Users can't log in" | "Session token expiry logic has off-by-one error" |
| "API returns 500 errors" | "Database connection pool exhausted under load" |
| "Page loads slowly" | "N+1 query on User.orders relationship" |
| "Payment processing fails" | "Third-party gateway timeout not handled" |
Mnemonic: S.T.O.P. during triage
- Symptoms only
- Time-box your investigation
- Observe impact
- Prioritize, don't diagnose
The Decision Matrix
Combine severity and effort to create an action matrix:
TRIAGE DECISION MATRIX

| | Low impact/severity | High impact/severity |
|---|---|---|
| High effort | Schedule the fix (P1/P2) | All-hands emergency (P0) |
| Low effort | Quick win, next sprint (P2/P3) | Urgent, fix today (P1) |

P0: Drop everything, full team response
P1: Fix today, reassign resources if needed
P1/P2: Schedule into sprint, assign owner
P2/P3: Backlog, fix during cleanup
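The same matrix can be expressed as a small lookup so the quadrant, not gut feel, picks the action. This is only a sketch: the thresholds are arbitrary placeholders you would tune to your own scoring scale.

# The decision matrix as a lookup: (high_impact, high_effort) -> action.
# Thresholds and wording are illustrative placeholders.
ACTIONS = {
    (True,  True):  "P0: all-hands emergency response",
    (True,  False): "P1: urgent, fix today",
    (False, True):  "P1/P2: schedule the fix, assign an owner",
    (False, False): "P2/P3: quick win, next sprint or during cleanup",
}

def matrix_action(impact_score, effort_hours, impact_threshold=7, effort_threshold=8.0):
    """Map an issue onto a quadrant of the triage decision matrix."""
    quadrant = (impact_score >= impact_threshold, effort_hours >= effort_threshold)
    return ACTIONS[quadrant]

print(matrix_action(impact_score=9, effort_hours=2))   # P1: urgent, fix today
print(matrix_action(impact_score=3, effort_hours=20))  # P1/P2: schedule the fix, assign an owner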
Real-World Triage Examples
Example 1: The Production Alert Storm
Scenario: It's 3 PM on Friday. Your monitoring dashboard lights up:
// Incoming alerts (simulated)
const alerts = [
{
id: 1,
message: "High CPU usage on web-server-03 (95%)",
timestamp: "15:02",
affected_users: "unknown"
},
{
id: 2,
message: "Payment webhook timeouts increased 300%",
timestamp: "15:03",
affected_users: "~50 transactions/hour"
},
{
id: 3,
message: "Slow query detected: UserProfile.getAll() (12s avg)",
timestamp: "15:04",
affected_users: "Admin dashboard only"
},
{
id: 4,
message: "SSL certificate expires in 14 days",
timestamp: "15:05",
affected_users: "0 (future risk)"
},
{
id: 5,
message: "Error rate spike on /api/search (45% errors)",
timestamp: "15:06",
affected_users: "~2000 searches/hour"
}
];
Triage Process:
| Alert | Impact | Blast Radius | Priority | Action |
|---|---|---|---|---|
| #2 Payment webhooks | Revenue loss | Growing (queue backup) | P0 | Investigate immediately |
| #5 Search errors | Major feature broken | Stable but high volume | P1 | Parallel investigation |
| #1 High CPU | Possible symptom of #2 or #5 | Contained to one server | Monitor | Watch, may resolve with the other fixes |
| #3 Slow admin query | Internal tool only | Small, contained | P2 | Schedule for next week |
| #4 SSL expiry | Future risk, no current impact | None yet | P2 | Schedule renewal task |
Reasoning:
- Payment webhooks (#2) are P0 because they directly impact revenue and create growing debt (failed payments pile up)
- Search errors (#5) are P1 because they affect many users, but no monetary loss
- High CPU (#1) might be a symptom of #2 or #5, so we monitor it but don't directly investigate
- Admin slowness (#3) and SSL warning (#4) are important but not urgent
Key Insight: The CPU alert was a red herring. By triaging first, you avoid spending 30 minutes optimizing server resources when the real issue is the payment webhook timeout.
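To make the outcome of the triage pass explicit, you can turn the alert storm into an ordered work queue. The sketch below hand-assigns the priorities from the table above (lower number = work it sooner):

# Hand-assigned priorities from the triage table above; lower number = work it sooner.
triaged_alerts = [
    {"id": 2, "summary": "Payment webhook timeouts",    "priority": 0},  # P0
    {"id": 5, "summary": "Search error spike",          "priority": 1},  # P1
    {"id": 1, "summary": "High CPU on web-server-03",   "priority": 2},  # monitor (likely a symptom)
    {"id": 3, "summary": "Slow admin query",            "priority": 3},  # P2
    {"id": 4, "summary": "SSL cert expires in 14 days", "priority": 3},  # P2
]

work_queue = sorted(triaged_alerts, key=lambda alert: alert["priority"])
for position, alert in enumerate(work_queue, start=1):
    print(f"{position}. [alert #{alert['id']}] {alert['summary']}")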
Example 2: The Deceptive Low Error Rate
Scenario: Your error monitoring shows this pattern:
# Error rate data
class ErrorMetric:
    def __init__(self, endpoint, rate, total_requests):
        self.endpoint = endpoint
        self.rate = rate  # percentage
        self.total_requests = total_requests
        self.actual_errors = int((rate / 100) * total_requests)

errors = [
    ErrorMetric("/api/checkout", rate=0.5, total_requests=10000),   # 50 errors
    ErrorMetric("/api/newsletter", rate=15.0, total_requests=500),  # 75 errors
    ErrorMetric("/api/profile", rate=2.0, total_requests=8000)      # 160 errors
]

for e in errors:
    print(f"{e.endpoint}: {e.rate}% error rate ({e.actual_errors} errors)")

# Output:
# /api/checkout: 0.5% error rate (50 errors)
# /api/newsletter: 15.0% error rate (75 errors)
# /api/profile: 2.0% error rate (160 errors)
Naive Triage (wrong): "Newsletter has 15% errors; that's the worst!"
Expert Triage (correct):
Checkout (0.5% = 50 errors) is P0 despite low rate
- Critical path: blocks purchases
- Even 0.5% means lost revenue
- High-value user action
Profile (2.0% = 160 errors) is P1
- Highest absolute error count
- Common user action
- Degraded experience for many
Newsletter (15% = 75 errors) is P2
- High rate but low impact
- Non-critical feature
- Low traffic volume
Mental Model: Error rate shows symptom severity. Error count shows impact scale. Business criticality determines priority. You need all three.
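One way to fold all three signals together is a weighted score, sketched below. The criticality weights are invented for illustration; in practice a human judgment call still overrides the number.

# Illustrative only: weight the absolute error count by business criticality.
# The weights are invented; checkout is weighted heavily because it blocks revenue.
CRITICALITY = {"/api/checkout": 10.0, "/api/profile": 3.0, "/api/newsletter": 1.0}

endpoint_errors = [
    ("/api/checkout",   50),  # 0.5% of 10,000 requests
    ("/api/profile",   160),  # 2.0% of 8,000 requests
    ("/api/newsletter", 75),  # 15% of 500 requests
]

ranked = sorted(endpoint_errors, key=lambda e: CRITICALITY[e[0]] * e[1], reverse=True)
for endpoint, error_count in ranked:
    print(f"{endpoint}: weighted score {CRITICALITY[endpoint] * error_count:.0f}")
# /api/checkout: weighted score 500    <- highest despite the lowest error rate
# /api/profile: weighted score 480
# /api/newsletter: weighted score 75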
Example 3: The Cascading Failure Chain
Scenario: Multiple services are failing:
// System status
type ServiceStatus struct {
Name string
Status string
ErrorMsg string
DependsOn []string
}
statuses := []ServiceStatus{
{
Name: "frontend-web",
Status: "degraded",
ErrorMsg: "Slow response times, timeouts on user dashboard",
DependsOn: []string{"api-gateway"},
},
{
Name: "api-gateway",
Status: "degraded",
ErrorMsg: "High latency, connection pool saturation",
DependsOn: []string{"auth-service", "user-service"},
},
{
Name: "auth-service",
Status: "healthy",
ErrorMsg: "",
DependsOn: []string{"redis-cache"},
},
{
Name: "user-service",
Status: "failing",
ErrorMsg: "Database connection timeout",
DependsOn: []string{"postgres-db"},
},
{
Name: "postgres-db",
Status: "degraded",
ErrorMsg: "Max connections reached (500/500)",
DependsOn: []string{},
},
}
Dependency Map:
CASCADING FAILURE ANALYSIS

frontend-web (degraded)            <- user-facing symptoms
  └── api-gateway (degraded)       <- propagating failures
        ├── auth-service (healthy)
        └── user-service (FAILING)
              └── postgres-db (degraded)   <- ROOT CAUSE: max connections reached (500/500)
Triage Decision:
- Don't fix: Frontend (symptom)
- Don't fix: API Gateway (symptom)
- Don't fix: User Service (symptom)
- Fix: Postgres connection limit (root cause)
-- Quick triage action: raise the connection limit
ALTER SYSTEM SET max_connections = 1000;
-- Note: max_connections is not reloadable; the new value takes effect
-- only after a server restart (pg_reload_conf() is not enough here)
-- Then investigate WHY connections maxed out
-- (connection leak? missing connection pooling? traffic spike?)
Triage Principle: In cascading failures, trace dependencies backward to find the root. Fixing symptoms wastes time and distracts from the real issue.
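As a sketch, that backward trace can be automated with the dependency map from the status snippet above (statuses simplified to dicts; redis-cache is assumed healthy since no status was reported for it):

# Walk the dependency graph downward from the user-facing service and report the
# unhealthy services whose own dependencies are all healthy -- the likeliest root causes.
DEPENDS_ON = {
    "frontend-web": ["api-gateway"],
    "api-gateway": ["auth-service", "user-service"],
    "auth-service": ["redis-cache"],
    "user-service": ["postgres-db"],
    "redis-cache": [],
    "postgres-db": [],
}
STATUS = {
    "frontend-web": "degraded", "api-gateway": "degraded", "auth-service": "healthy",
    "user-service": "failing", "redis-cache": "healthy", "postgres-db": "degraded",
}

def root_suspects(entry_point):
    suspects, stack, seen = [], [entry_point], set()
    while stack:
        service = stack.pop()
        if service in seen:
            continue
        seen.add(service)
        unhealthy_deps = [d for d in DEPENDS_ON[service] if STATUS[d] != "healthy"]
        if STATUS[service] != "healthy" and not unhealthy_deps:
            suspects.append(service)
        stack.extend(DEPENDS_ON[service])
    return suspects

print(root_suspects("frontend-web"))  # ['postgres-db']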
Example 4: The Friday Afternoon Dilemma
Scenario: It's 4:30 PM Friday. Your team wants to leave at 5:00 PM. Three bugs are reported:
// Bug reports
interface BugReport {
title: string;
severity: string;
estimatedFix: string;
reportedBy: string;
workaround: string;
}
const bugs: BugReport[] = [
{
title: "Dark mode toggle doesn't persist after refresh",
severity: "P2 - Minor annoyance",
estimatedFix: "15 minutes (localStorage bug)",
reportedBy: "Internal QA",
workaround: "Users can re-toggle each session"
},
{
title: "Email confirmation links expire after 1 hour (should be 24h)",
severity: "P1 - Users can't activate accounts",
estimatedFix: "5 minutes (config change) + 30 min deploy",
reportedBy: "Customer support (3 tickets)",
workaround: "Support can manually activate accounts"
},
{
title: "Analytics dashboard shows wrong revenue numbers",
severity: "P1 - Executives making decisions on bad data",
estimatedFix: "Unknown (data pipeline investigation needed)",
reportedBy: "CFO",
workaround: "Pull reports manually from database"
}
];
Triage Decision Matrix:
| Bug | Impact | Urgency | Fix Time | Decision |
|---|---|---|---|---|
| Email expiry | Blocking new users | High (tickets piling up) | 35 min | Fix now (before leaving) |
| Analytics data | Business-critical | Low (weekend, no decisions) | Unknown (hours?) | Schedule Monday morning |
| Dark mode | Cosmetic | Low | 15 min | Backlog |
Reasoning:
- Email bug: Quick fix, actively causing problems right now, workaround is manual (burden on support)
- Analytics bug: High severity but low urgency (weekend), unknown complexity (investigation might take hours), has workaround
- Dark mode: Low impact, has workaround (minor inconvenience)
Common Mistake: Fixing the "quick win" (dark mode) first because it's easy. But email expiry is actively damaging user experience and creating a support burden. Fix that first, even if it takes longer.
Triage Rule: Time-to-fix is a factor, not the deciding factor. A 5-hour fix for a P0 beats a 5-minute fix for a P3.
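In sorting terms, severity is the primary key and estimated fix time only breaks ties. A minimal sketch with illustrative numbers:

# Severity is the primary sort key; estimated fix time only breaks ties.
# 0 = P0 (most severe) ... 3 = P3; numbers are illustrative.
bugs = [
    {"title": "Dark mode toggle resets",       "severity": 2, "fix_minutes": 15},
    {"title": "Checkout intermittently fails", "severity": 0, "fix_minutes": 300},
    {"title": "Email links expire too early",  "severity": 1, "fix_minutes": 35},
]

for bug in sorted(bugs, key=lambda b: (b["severity"], b["fix_minutes"])):
    print(f"P{bug['severity']}: {bug['title']} (~{bug['fix_minutes']} min)")
# P0: Checkout intermittently fails (~300 min)   <- the 5-hour P0 fix still comes first
# P1: Email links expire too early (~35 min)
# P2: Dark mode toggle resets (~15 min)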
Common Mistakes in Bug Triage
1. The "Squeaky Wheel" Trap
Mistake: Prioritizing bugs based on who's yelling loudest, not actual impact.
# Anti-pattern: priority based on noise level
class Bug:
    def __init__(self, title, actual_severity, reported_by, follow_ups):
        self.title = title
        self.actual_severity = actual_severity
        self.reported_by = reported_by
        self.follow_ups = follow_ups  # How many times they've asked

bugs = [
    Bug("Checkout broken", "P0", "Anonymous user", follow_ups=1),
    Bug("Logo 2px off-center", "P3", "CEO", follow_ups=5)
]

# WRONG: prioritize by follow_ups (noise)
# RIGHT: prioritize by actual_severity (impact)
Solution: Establish a clear severity rubric. Communicate it. Use it consistently, regardless of who reports the bug.
2. Analysis Paralysis
Mistake: Spending 45 minutes triaging when you should be fixing.
// Anti-pattern: Over-analyzing during triage
function triageBug(bug) {
const impact = assessImpact(bug); // 5 min - fine
const users = countAffectedUsers(bug); // 5 min - fine
const rootCause = findRootCause(bug); // 30 min - TOO DEEP!
const similarBugs = searchHistory(bug); // 15 min - NOT TRIAGE!
// You're diagnosing, not triaging!
}
// CORRECT: Time-box triage
function triageBugCorrectly(bug) {
const deadline = Date.now() + (5 * 60 * 1000); // 5-minute timer
const impact = quickImpactCheck(bug); // Surface-level only
const priority = assignPriority(impact);
if (Date.now() > deadline) {
return { priority, needsMoreInfo: true };
}
return { priority, readyForDiagnosis: true };
}
Solution: Set a timer. If you can't triage in 5 minutes, mark it "needs investigation" and move on. You can deep-dive after prioritization.
3. Ignoring the Business Context
Mistake: Technical assessment without business awareness.
# Example: Black Friday e-commerce site
# (BLACK_FRIDAY and SUNDAY_3AM are assumed to be defined elsewhere)
class BugTriage
  def assess_priority(bug, date)
    base_priority = bug.technical_severity

    # WRONG: ignore calendar/business events
    # return base_priority

    # RIGHT: context matters
    if date == BLACK_FRIDAY && bug.affects_checkout?
      return "CRITICAL" # Even minor checkout bugs are P0 today
    elsif date == SUNDAY_3AM && bug.affects_admin_panel?
      return "LOW" # Admin bugs can wait till Monday
    else
      return base_priority
    end
  end
end
Examples of business context:
- Product launch week: features > bugs
- End of quarter: revenue-impacting issues escalate
- After security breach: all security issues escalate
- Low-traffic period: good time for risky fixes
4. The "I Can Fix This Quick" Temptation
Mistake: Starting to fix during triage because "it'll only take a minute."
// Anti-pattern: fixing while triaging
func TriageIncomingIssues(issues []Issue) {
    for _, issue := range issues {
        priority := assessPriority(issue)
        // TEMPTATION: "Oh, this is just a typo, let me fix it real quick..."
        if priority == "P3" && looksEasy(issue) {
            fixIssue(issue) // STOP! You're context-switching!
            continue
        }
        addToPriorityQueue(issue, priority)
    }
}

// CORRECT: triage everything FIRST, then fix in priority order
func TriageCorrectly(issues []Issue) {
    priorityQueue := []Issue{}
    // Phase 1: Triage ALL issues (no fixing!)
    for _, issue := range issues {
        issue.Priority = assessPriority(issue)
        priorityQueue = append(priorityQueue, issue)
    }
    // Phase 2: Sort by priority ("P0" sorts before "P3")
    sort.Slice(priorityQueue, func(i, j int) bool {
        return priorityQueue[i].Priority < priorityQueue[j].Priority
    })
    // Phase 3: NOW fix in order
    for _, issue := range priorityQueue {
        fixIssue(issue)
    }
}
Why this matters: That "quick 2-minute fix" often turns into 20 minutes. Meanwhile, a P0 issue sits unaddressed because you were distracted by an easy P3.
5. Solo Triage for Complex Situations
Mistake: Triaging a major incident alone when you need multiple perspectives.
Solution: For P0 incidents affecting multiple systems, do collaborative triage:
COLLABORATIVE TRIAGE PROTOCOL
1. Incident Commander (IC) - coordinates, makes the final call
2. Technical Lead (TL) - assesses technical severity
3. Product Owner (PO) - assesses business impact
4. Customer Support (CS) - reports user impact

15-minute sync: each role shares their view, the IC assigns priority, the team executes.
When to escalate from solo to collaborative triage:
- Multiple services affected
- Unclear impact scope
- Priority conflicts between stakeholders
- Potential for cascading failures
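A minimal sketch of that escalation rule, with criteria flags that simply mirror the list above:

# Escalate from solo to collaborative triage if any of these criteria hold.
def needs_collaborative_triage(multiple_services, unclear_scope,
                               stakeholder_conflict, cascading_risk):
    return any([multiple_services, unclear_scope, stakeholder_conflict, cascading_risk])

print(needs_collaborative_triage(
    multiple_services=True, unclear_scope=False,
    stakeholder_conflict=False, cascading_risk=True))  # True -> pull in IC, TL, PO, CS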
Key Takeaways
Quick Reference: Triage Essentials
| Principle | Guideline |
|---|---|
| Golden Rule | Triage is about what to fix and when, not how to fix it |
| Time Limit | 5 minutes per issue maximum during initial triage |
| Priority Formula | Impact × Users Affected × Business Criticality |
| Blast Radius | Growing > Large > Contained |
| Context Matters | The same bug has a different priority on Black Friday vs. Sunday 3 AM |
| Cascading Failures | Trace dependencies backward; fix the root cause, not symptoms |
| Avoid | Analysis paralysis, squeaky-wheel prioritization, solo triage for P0s |
| Remember | Symptoms first, diagnosis later; prioritize by impact, not ease |
The Triage Checklist (Print & Post)
-------------------------------------
BUG TRIAGE DECISION CHECKLIST
-------------------------------------
[ ] How many users affected?
    1-10      -> Consider P2/P3
    10-100    -> Consider P1
    100+      -> Consider P0/P1
    All users -> Likely P0

[ ] What's the business impact?
    Revenue loss          -> P0
    Data loss/security    -> P0
    Major feature down    -> P1
    Minor feature down    -> P2
    Cosmetic/nice-to-have -> P3

[ ] Is there a workaround?
    No workaround     -> +1 severity
    Manual workaround -> Keep severity
    Easy workaround   -> -1 severity

[ ] Is it getting worse?
    Growing blast radius -> +1 severity
    Stable               -> Keep severity
    Self-limiting        -> -1 severity

[ ] Can we roll back?
    Yes, easily -> Roll back first, diagnose later
    No          -> Fix forward

[ ] Business context?
    High-traffic period      -> +1 severity
    Critical business period -> +1 severity
    Low-traffic period       -> Keep severity

-------------------------------------
FINAL DECISION:
[ ] P0 - Drop everything
[ ] P1 - Fix today
[ ] P2 - This sprint
[ ] P3 - Backlog
-------------------------------------
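For teams that want to automate the checklist, the severity modifiers can be turned into a small calculator. The sketch below just restates the printed checklist (0 = P0, 3 = P3) and is no substitute for judgment:

# Start from business impact, then apply the checklist's +/- severity modifiers.
# 0 = P0 (most severe) ... 3 = P3.
BASE_SEVERITY = {"revenue_loss": 0, "data_or_security": 0,
                 "major_feature": 1, "minor_feature": 2, "cosmetic": 3}

def triage_priority(impact, has_workaround, getting_worse, critical_business_period):
    severity = BASE_SEVERITY[impact]
    if not has_workaround:
        severity -= 1  # no workaround -> bump severity up one level
    if getting_worse:
        severity -= 1  # growing blast radius -> bump again
    if critical_business_period:
        severity -= 1  # Black Friday, end of quarter, etc.
    return f"P{max(0, min(3, severity))}"

print(triage_priority("minor_feature", has_workaround=False,
                      getting_worse=False, critical_business_period=False))  # P1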
Parting Wisdom
"Hours of debugging can save you minutes of planning." (Anonymous, paraphrased)
The 5 minutes you invest in proper triage can save hours of misdirected debugging effort. When production is burning and adrenaline is high, the discipline to stop, assess, and prioritize separates experienced engineers from those who fight fires ineffectively.
Triage isn't about being slow; it's about being precisely fast. You're sprinting in the right direction, not just sprinting.
Further Study
Google SRE Book - Chapter 12: "Effective Troubleshooting": https://sre.google/sre-book/effective-troubleshooting/ - Deep dive into systematic debugging and triage methodologies from Google's Site Reliability Engineering team
Incident Management Best Practices: https://www.atlassian.com/incident-management/incident-response/severity-levels - Atlassian's guide to severity classification and triage workflows
The RICE Prioritization Framework: https://www.intercom.com/blog/rice-simple-prioritization-for-product-managers/ - Detailed explanation of RICE scoring for feature and bug prioritization