Triage Before Diagnosis: Efficient Debugging Under Pressure
Stopping the bleeding before understanding the wound
This lesson covers triage methodologies, prioritization frameworks, and rapid assessment techniques: essential skills for debugging systems under pressure when multiple issues compete for attention.
Welcome to Systematic Bug Triage
When production is on fire and multiple alerts are flooding your dashboard, the instinct to dive deep into the first bug you see can be deadly. Triage before diagnosis is the professional developer's mantra: assess the landscape, prioritize ruthlessly, and allocate your limited time where it will have the greatest impact.
This approach, borrowed from emergency medicine, transforms chaotic crisis response into systematic problem-solving. Just as an ER doctor doesn't treat patients in arrival order, you shouldn't debug issues in the order they appear in your logs.
Core Concepts: The Triage Mindset
What is Bug Triage?
Bug triage is the systematic process of quickly assessing multiple issues to determine their relative priority, severity, and appropriate response strategy. It's about asking "Which problem should I solve first?" before asking "How do I solve this problem?"
The key distinction:
- Triage = Assessment, categorization, prioritization (minutes)
- Diagnosis = Root cause analysis, detailed investigation (hours)
Mental Model: Think of triage as the "sorting hat" phase. You're not fixing bugs yet; you're organizing them into buckets that determine your response strategy.
The Four-Tier Severity Model
Most organizations use a variant of this classification system:
| Severity | Impact | Response Time | Examples |
|---|---|---|---|
| P0 / Critical | System down, data loss, security breach | Immediate (drop everything) | Payment processing broken, authentication failing, data corruption |
| P1 / High | Major feature broken, significant user impact | Same day | Search not working, emails not sending, API returning 500s |
| P2 / Medium | Minor feature broken, workaround exists | This sprint | UI glitch, slow performance on edge case, minor visual bug |
| P3 / Low | Nice-to-have, cosmetic, future improvement | Backlog | Color inconsistency, typo, feature request |
Common Mistake: Treating everything as urgent. If everything is P0, nothing is P0. Proper triage requires honest assessment of actual business impact.
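If your team wants the rubric to be machine-readable (for a triage bot or dashboard), it can be encoded as plain data. The sketch below simply restates the table above in Python; the labels and response windows are illustrative, not a standard.

# A severity rubric encoded as data, mirroring the table above (illustrative values).
SEVERITY_RUBRIC = {
    "P0": {"impact": "system down, data loss, security breach", "respond": "immediately"},
    "P1": {"impact": "major feature broken, significant user impact", "respond": "same day"},
    "P2": {"impact": "minor feature broken, workaround exists", "respond": "this sprint"},
    "P3": {"impact": "cosmetic or nice-to-have", "respond": "backlog"},
}

def describe(severity):
    entry = SEVERITY_RUBRIC[severity]
    return f"{severity}: {entry['impact']} (respond: {entry['respond']})"

print(describe("P1"))
# P1: major feature broken, significant user impact (respond: same day)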
The RICE Framework for Prioritization
When severity levels aren't enough (multiple P1s competing), use RICE scoring:
- Reach: How many users affected? (1-1000+)
- Impact: How severely? (0.25=minimal, 0.5=low, 1=medium, 2=high, 3=massive)
- Confidence: How sure are you? (50%, 80%, 100%)
- Effort: Hours/days to fix? (0.5-20+)
RICE Score = (Reach × Impact × Confidence) / Effort
Higher scores = higher priority.
# Example RICE calculation
bugs = [
    {"name": "Checkout broken", "reach": 1000, "impact": 3, "confidence": 1.0, "effort": 4},
    {"name": "Slow loading", "reach": 5000, "impact": 1, "confidence": 0.8, "effort": 8},
    {"name": "Email typo", "reach": 10000, "impact": 0.25, "confidence": 1.0, "effort": 0.5}
]

for bug in bugs:
    score = (bug["reach"] * bug["impact"] * bug["confidence"]) / bug["effort"]
    print(f"{bug['name']}: RICE = {score:.1f}")

# Output:
# Checkout broken: RICE = 750.0   <- fix this first
# Slow loading: RICE = 500.0
# Email typo: RICE = 5000.0       <- high score but low actual priority
Watch Out: RICE works best for feature prioritization. For critical bugs, severity trumps RICE scores.
The Five-Minute Triage Assessment
When a new issue arrives, spend exactly 5 minutes gathering this information:
TRIAGE CHECKLIST (5 MINUTES)
[ ] What broke?
[ ] How many users affected?
[ ] Is there a workaround?
[ ] Is it getting worse?
[ ] What's the business impact?
[ ] Can we roll back?
[ ] Do we have logs/repro steps?

Pro Tip: Use a template. Create a Slack/Teams bot that auto-asks these questions when someone reports a bug. This standardizes triage and prevents information gaps.
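As a rough sketch of what such a template might capture, the dataclass below mirrors the checklist; the field names and the TriageIntake type are hypothetical, not a real bot API.

from dataclasses import dataclass

# Hypothetical intake form mirroring the 5-minute checklist above.
@dataclass
class TriageIntake:
    what_broke: str
    users_affected: str       # e.g. "~2000 searches/hour" or "unknown"
    workaround: str           # "" if none exists
    getting_worse: bool
    business_impact: str      # e.g. "revenue", "internal tooling"
    can_roll_back: bool
    repro_or_logs: str = ""   # link to logs or reproduction steps

    def missing_fields(self):
        """Checklist items the reporter left empty."""
        return [name for name, value in vars(self).items() if value == ""]

report = TriageIntake(
    what_broke="Payment webhook timeouts",
    users_affected="~50 transactions/hour",
    workaround="",
    getting_worse=True,
    business_impact="revenue",
    can_roll_back=False,
)
print(report.missing_fields())  # ['workaround', 'repro_or_logs']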
The Blast Radius Concept
Blast radius measures how widely a bug spreads if left unfixed:
Blast radius visualization: a small radius is contained (one feature affected), while a large radius spreads (multiple systems, cascading failures, risk of data corruption).
Prioritize bugs with a large, growing blast radius over bugs with small, contained impact, even if the small bug is easier to fix.
Example: A memory leak that grows 1MB/hour might seem minor, but in 48 hours it will crash your server. Fix it before the prettier UI bug that affects one modal.
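A quick back-of-the-envelope estimate is often enough to see whether a "minor" issue has a hidden deadline. The numbers below are illustrative; the available headroom is assumed.

# Rough time-to-failure estimate for a growing problem.
free_memory_mb = 48        # headroom left on the server (assumed for illustration)
leak_rate_mb_per_hour = 1  # observed growth of the leak

hours_until_crash = free_memory_mb / leak_rate_mb_per_hour
print(f"Server falls over in roughly {hours_until_crash:.0f} hours "
      f"({hours_until_crash / 24:.1f} days)")
# Server falls over in roughly 48 hours (2.0 days)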
Symptom vs. Root Cause During Triage
During triage, you're collecting symptoms, not diagnosing root causes:
| Triage (Symptoms) | Diagnosis (Root Causes) |
|---|---|
| "Users can't log in" | "Session token expiry logic has off-by-one error" |
| "API returns 500 errors" | "Database connection pool exhausted under load" |
| "Page loads slowly" | "N+1 query on User.orders relationship" |
| "Payment processing fails" | "Third-party gateway timeout not handled" |
Mnemonic: S.T.O.P. during triage
- Symptoms only
- Time-box your investigation
- Observe impact
- Prioritize, don't diagnose
The Decision Matrix
Combine severity and effort to create an action matrix:
TRIAGE DECISION MATRIX

| | Low impact/severity | High impact/severity |
|---|---|---|
| High effort | Schedule the fix (P1/P2) | All-hands emergency (P0) |
| Low effort | Quick win, next sprint (P2/P3) | Urgent, fix today (P1) |

P0: Drop everything, full team response
P1: Fix today, reassign resources if needed
P1/P2: Schedule into sprint, assign owner
P2/P3: Backlog, fix during cleanup
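The same matrix can be expressed as a small lookup so the quadrant, not gut feel, picks the action. This is only a sketch: the thresholds are arbitrary placeholders you would tune to your own scoring scale.

# The decision matrix as a lookup: (high_impact, high_effort) -> action.
# Thresholds and wording are illustrative placeholders.
ACTIONS = {
    (True,  True):  "P0: all-hands emergency response",
    (True,  False): "P1: urgent, fix today",
    (False, True):  "P1/P2: schedule the fix, assign an owner",
    (False, False): "P2/P3: quick win, next sprint or during cleanup",
}

def matrix_action(impact_score, effort_hours, impact_threshold=7, effort_threshold=8.0):
    """Map an issue onto a quadrant of the triage decision matrix."""
    quadrant = (impact_score >= impact_threshold, effort_hours >= effort_threshold)
    return ACTIONS[quadrant]

print(matrix_action(impact_score=9, effort_hours=2))   # P1: urgent, fix today
print(matrix_action(impact_score=3, effort_hours=20))  # P1/P2: schedule the fix, assign an owner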
Real-World Triage Examples
Example 1: The Production Alert Storm
Scenario: It's 3 PM on Friday. Your monitoring dashboard lights up:
// Incoming alerts (simulated)
const alerts = [
{
id: 1,
message: "High CPU usage on web-server-03 (95%)",
timestamp: "15:02",
affected_users: "unknown"
},
{
id: 2,
message: "Payment webhook timeouts increased 300%",
timestamp: "15:03",
affected_users: "~50 transactions/hour"
},
{
id: 3,
message: "Slow query detected: UserProfile.getAll() (12s avg)",
timestamp: "15:04",
affected_users: "Admin dashboard only"
},
{
id: 4,
message: "SSL certificate expires in 14 days",
timestamp: "15:05",
affected_users: "0 (future risk)"
},
{
id: 5,
message: "Error rate spike on /api/search (45% errors)",
timestamp: "15:06",
affected_users: "~2000 searches/hour"
}
];
Triage Process:
| Alert | Impact | Blast Radius | Priority | Action |
|---|---|---|---|---|
| #2 Payment webhooks | Revenue loss | Growing (queue backup) | P0 | Investigate immediately |
| #5 Search errors | Major feature broken | Stable but high volume | P1 | Parallel investigation |
| #1 High CPU | Possible symptom of #2 or #5 | Contained to one server | Monitor | Watch, may resolve with the other fixes |
| #3 Slow admin query | Internal tool only | Small, contained | P2 | Schedule for next week |
| #4 SSL expiry | Future risk, no current impact | None yet | P2 | Schedule renewal task |
Reasoning:
- Payment webhooks (#2) are P0 because they directly impact revenue and create growing debt (failed payments pile up)
- Search errors (#5) are P1 because they affect many users, but no monetary loss
- High CPU (#1) might be a symptom of #2 or #5, so we monitor it but don't directly investigate
- Admin slowness (#3) and SSL warning (#4) are important but not urgent
Key Insight: The CPU alert was a red herring. By triaging first, you avoid spending 30 minutes optimizing server resources when the real issue is the payment webhook timeout.
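To make the outcome of the triage pass explicit, you can turn the alert storm into an ordered work queue. The sketch below hand-assigns the priorities from the table above (lower number = work it sooner):

# Hand-assigned priorities from the triage table above; lower number = work it sooner.
triaged_alerts = [
    {"id": 2, "summary": "Payment webhook timeouts",    "priority": 0},  # P0
    {"id": 5, "summary": "Search error spike",          "priority": 1},  # P1
    {"id": 1, "summary": "High CPU on web-server-03",   "priority": 2},  # monitor (likely a symptom)
    {"id": 3, "summary": "Slow admin query",            "priority": 3},  # P2
    {"id": 4, "summary": "SSL cert expires in 14 days", "priority": 3},  # P2
]

work_queue = sorted(triaged_alerts, key=lambda alert: alert["priority"])
for position, alert in enumerate(work_queue, start=1):
    print(f"{position}. [alert #{alert['id']}] {alert['summary']}")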
Example 2: The Deceptive Low Error Rate
Scenario: Your error monitoring shows this pattern:
# Error rate data
class ErrorMetric:
    def __init__(self, endpoint, rate, total_requests):
        self.endpoint = endpoint
        self.rate = rate  # percentage
        self.total_requests = total_requests
        self.actual_errors = int((rate / 100) * total_requests)

errors = [
    ErrorMetric("/api/checkout", rate=0.5, total_requests=10000),   # 50 errors
    ErrorMetric("/api/newsletter", rate=15.0, total_requests=500),  # 75 errors
    ErrorMetric("/api/profile", rate=2.0, total_requests=8000)      # 160 errors
]

for e in errors:
    print(f"{e.endpoint}: {e.rate}% error rate ({e.actual_errors} errors)")

# Output:
# /api/checkout: 0.5% error rate (50 errors)
# /api/newsletter: 15.0% error rate (75 errors)
# /api/profile: 2.0% error rate (160 errors)
Naive Triage (wrong): "Newsletter has 15% errors; that's the worst!"
Expert Triage (correct):
Checkout (0.5% = 50 errors) is P0 despite low rate
- Critical path: blocks purchases
- Even 0.5% means lost revenue
- High-value user action
Profile (2.0% = 160 errors) is P1
- Highest absolute error count
- Common user action
- Degraded experience for many
Newsletter (15% = 75 errors) is P2
- High rate but low impact
- Non-critical feature
- Low traffic volume
Mental Model: Error rate shows symptom severity. Error count shows impact scale. Business criticality determines priority. You need all three.
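One way to fold all three signals together is a weighted score, sketched below. The criticality weights are invented for illustration; in practice a human judgment call still overrides the number.

# Illustrative only: weight the absolute error count by business criticality.
# The weights are invented; checkout is weighted heavily because it blocks revenue.
CRITICALITY = {"/api/checkout": 10.0, "/api/profile": 3.0, "/api/newsletter": 1.0}

endpoint_errors = [
    ("/api/checkout",   50),  # 0.5% of 10,000 requests
    ("/api/profile",   160),  # 2.0% of 8,000 requests
    ("/api/newsletter", 75),  # 15% of 500 requests
]

ranked = sorted(endpoint_errors, key=lambda e: CRITICALITY[e[0]] * e[1], reverse=True)
for endpoint, error_count in ranked:
    print(f"{endpoint}: weighted score {CRITICALITY[endpoint] * error_count:.0f}")
# /api/checkout: weighted score 500    <- highest despite the lowest error rate
# /api/profile: weighted score 480
# /api/newsletter: weighted score 75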
Example 3: The Cascading Failure Chain
Scenario: Multiple services are failing:
// System status
type ServiceStatus struct {
Name string
Status string
ErrorMsg string
DependsOn []string
}
statuses := []ServiceStatus{
{
Name: "frontend-web",
Status: "degraded",
ErrorMsg: "Slow response times, timeouts on user dashboard",
DependsOn: []string{"api-gateway"},
},
{
Name: "api-gateway",
Status: "degraded",
ErrorMsg: "High latency, connection pool saturation",
DependsOn: []string{"auth-service", "user-service"},
},
{
Name: "auth-service",
Status: "healthy",
ErrorMsg: "",
DependsOn: []string{"redis-cache"},
},
{
Name: "user-service",
Status: "failing",
ErrorMsg: "Database connection timeout",
DependsOn: []string{"postgres-db"},
},
{
Name: "postgres-db",
Status: "degraded",
ErrorMsg: "Max connections reached (500/500)",
DependsOn: []string{},
},
}
Dependency Map:
CASCADING FAILURE ANALYSIS

frontend-web (degraded)            <- user-facing symptoms
  └── api-gateway (degraded)       <- propagating failures
        ├── auth-service (healthy)
        └── user-service (FAILING)
              └── postgres-db (degraded)   <- ROOT CAUSE: max connections reached (500/500)
Triage Decision:
- Don't fix: Frontend (symptom)
- Don't fix: API Gateway (symptom)
- Don't fix: User Service (symptom)
- Fix: Postgres connection limit (root cause)
-- Quick triage action: raise the connection limit
ALTER SYSTEM SET max_connections = 1000;
-- Note: max_connections is not reloadable; the new value takes effect
-- only after a server restart (pg_reload_conf() is not enough here)
-- Then investigate WHY connections maxed out
-- (connection leak? missing connection pooling? traffic spike?)
Triage Principle: In cascading failures, trace dependencies backward to find the root. Fixing symptoms wastes time and distracts from the real issue.
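As a sketch, that backward trace can be automated with the dependency map from the status snippet above (statuses simplified to dicts; redis-cache is assumed healthy since no status was reported for it):

# Walk the dependency graph downward from the user-facing service and report the
# unhealthy services whose own dependencies are all healthy -- the likeliest root causes.
DEPENDS_ON = {
    "frontend-web": ["api-gateway"],
    "api-gateway": ["auth-service", "user-service"],
    "auth-service": ["redis-cache"],
    "user-service": ["postgres-db"],
    "redis-cache": [],
    "postgres-db": [],
}
STATUS = {
    "frontend-web": "degraded", "api-gateway": "degraded", "auth-service": "healthy",
    "user-service": "failing", "redis-cache": "healthy", "postgres-db": "degraded",
}

def root_suspects(entry_point):
    suspects, stack, seen = [], [entry_point], set()
    while stack:
        service = stack.pop()
        if service in seen:
            continue
        seen.add(service)
        unhealthy_deps = [d for d in DEPENDS_ON[service] if STATUS[d] != "healthy"]
        if STATUS[service] != "healthy" and not unhealthy_deps:
            suspects.append(service)
        stack.extend(DEPENDS_ON[service])
    return suspects

print(root_suspects("frontend-web"))  # ['postgres-db']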
Example 4: The Friday Afternoon Dilemma
Scenario: It's 4:30 PM Friday. Your team wants to leave at 5:00 PM. Three bugs are reported:
// Bug reports
interface BugReport {
title: string;
severity: string;
estimatedFix: string;
reportedBy: string;
workaround: string;
}
const bugs: BugReport[] = [
{
title: "Dark mode toggle doesn't persist after refresh",
severity: "P2 - Minor annoyance",
estimatedFix: "15 minutes (localStorage bug)",
reportedBy: "Internal QA",
workaround: "Users can re-toggle each session"
},
{
title: "Email confirmation links expire after 1 hour (should be 24h)",
severity: "P1 - Users can't activate accounts",
estimatedFix: "5 minutes (config change) + 30 min deploy",
reportedBy: "Customer support (3 tickets)",
workaround: "Support can manually activate accounts"
},
{
title: "Analytics dashboard shows wrong revenue numbers",
severity: "P1 - Executives making decisions on bad data",
estimatedFix: "Unknown (data pipeline investigation needed)",
reportedBy: "CFO",
workaround: "Pull reports manually from database"
}
];
Triage Decision Matrix:
| Bug | Impact | Urgency | Fix Time | Decision |
|---|---|---|---|---|
| Email expiry | Blocking new users | High (tickets piling up) | 35 min | Fix now (before leaving) |
| Analytics data | Business-critical | Low (weekend, no decisions) | Unknown (hours?) | Schedule Monday morning |
| Dark mode | Cosmetic | Low | 15 min | Backlog |
Reasoning:
- Email bug: Quick fix, actively causing problems right now, workaround is manual (burden on support)
- Analytics bug: High severity but low urgency (weekend), unknown complexity (investigation might take hours), has workaround
- Dark mode: Low impact, has workaround (minor inconvenience)
Common Mistake: Fixing the "quick win" (dark mode) first because it's easy. But email expiry is actively damaging user experience and creating a support burden. Fix that first, even if it takes longer.
Triage Rule: Time-to-fix is a factor, not the deciding factor. A 5-hour fix for a P0 beats a 5-minute fix for a P3.
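In sorting terms, severity is the primary key and estimated fix time only breaks ties. A minimal sketch with illustrative numbers:

# Severity is the primary sort key; estimated fix time only breaks ties.
# 0 = P0 (most severe) ... 3 = P3; numbers are illustrative.
bugs = [
    {"title": "Dark mode toggle resets",       "severity": 2, "fix_minutes": 15},
    {"title": "Checkout intermittently fails", "severity": 0, "fix_minutes": 300},
    {"title": "Email links expire too early",  "severity": 1, "fix_minutes": 35},
]

for bug in sorted(bugs, key=lambda b: (b["severity"], b["fix_minutes"])):
    print(f"P{bug['severity']}: {bug['title']} (~{bug['fix_minutes']} min)")
# P0: Checkout intermittently fails (~300 min)   <- the 5-hour P0 fix still comes first
# P1: Email links expire too early (~35 min)
# P2: Dark mode toggle resets (~15 min)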
Common Mistakes in Bug Triage
1. The "Squeaky Wheel" Trap
Mistake: Prioritizing bugs based on who's yelling loudest, not actual impact.
# Anti-pattern: priority based on noise level
class Bug:
    def __init__(self, title, actual_severity, reported_by, follow_ups):
        self.title = title
        self.actual_severity = actual_severity
        self.reported_by = reported_by
        self.follow_ups = follow_ups  # How many times they've asked

bugs = [
    Bug("Checkout broken", "P0", "Anonymous user", follow_ups=1),
    Bug("Logo 2px off-center", "P3", "CEO", follow_ups=5)
]

# WRONG: prioritize by follow_ups (noise)
# RIGHT: prioritize by actual_severity (impact)
Solution: Establish a clear severity rubric. Communicate it. Use it consistently, regardless of who reports the bug.
2. Analysis Paralysis
Mistake: Spending 45 minutes triaging when you should be fixing.
// Anti-pattern: Over-analyzing during triage
function triageBug(bug) {
const impact = assessImpact(bug); // 5 min - fine
const users = countAffectedUsers(bug); // 5 min - fine
const rootCause = findRootCause(bug); // 30 min - TOO DEEP!
const similarBugs = searchHistory(bug); // 15 min - NOT TRIAGE!
// You're diagnosing, not triaging!
}
// CORRECT: Time-box triage
function triageBugCorrectly(bug) {
const deadline = Date.now() + (5 * 60 * 1000); // 5-minute timer
const impact = quickImpactCheck(bug); // Surface-level only
const priority = assignPriority(impact);
if (Date.now() > deadline) {
return { priority, needsMoreInfo: true };
}
return { priority, readyForDiagnosis: true };
}
Solution: Set a timer. If you can't triage in 5 minutes, mark it "needs investigation" and move on. You can deep-dive after prioritization.
3. Ignoring the Business Context
Mistake: Technical assessment without business awareness.
# Example: Black Friday e-commerce site
# (BLACK_FRIDAY and SUNDAY_3AM are assumed to be defined elsewhere)
class BugTriage
  def assess_priority(bug, date)
    base_priority = bug.technical_severity

    # WRONG: ignore calendar/business events
    # return base_priority

    # RIGHT: context matters
    if date == BLACK_FRIDAY && bug.affects_checkout?
      return "CRITICAL" # Even minor checkout bugs are P0 today
    elsif date == SUNDAY_3AM && bug.affects_admin_panel?
      return "LOW" # Admin bugs can wait till Monday
    else
      return base_priority
    end
  end
end
Examples of business context:
- Product launch week: features > bugs
- End of quarter: revenue-impacting issues escalate
- After security breach: all security issues escalate
- Low-traffic period: good time for risky fixes
4. The "I Can Fix This Quick" Temptation
Mistake: Starting to fix during triage because "it'll only take a minute."
// Anti-pattern: fixing while triaging
func TriageIncomingIssues(issues []Issue) {
    for _, issue := range issues {
        priority := assessPriority(issue)
        // TEMPTATION: "Oh, this is just a typo, let me fix it real quick..."
        if priority == "P3" && looksEasy(issue) {
            fixIssue(issue) // STOP! You're context-switching!
            continue
        }
        addToPriorityQueue(issue, priority)
    }
}

// CORRECT: triage everything FIRST, then fix in priority order
func TriageCorrectly(issues []Issue) {
    priorityQueue := []Issue{}
    // Phase 1: Triage ALL issues (no fixing!)
    for _, issue := range issues {
        issue.Priority = assessPriority(issue)
        priorityQueue = append(priorityQueue, issue)
    }
    // Phase 2: Sort by priority ("P0" sorts before "P3")
    sort.Slice(priorityQueue, func(i, j int) bool {
        return priorityQueue[i].Priority < priorityQueue[j].Priority
    })
    // Phase 3: NOW fix in order
    for _, issue := range priorityQueue {
        fixIssue(issue)
    }
}
Why this matters: That "quick 2-minute fix" often turns into 20 minutes. Meanwhile, a P0 issue sits unaddressed because you were distracted by an easy P3.
5. Solo Triage for Complex Situations
Mistake: Triaging a major incident alone when you need multiple perspectives.
Solution: For P0 incidents affecting multiple systems, do collaborative triage:
COLLABORATIVE TRIAGE PROTOCOL
1. Incident Commander (IC) - coordinates, makes the final call
2. Technical Lead (TL) - assesses technical severity
3. Product Owner (PO) - assesses business impact
4. Customer Support (CS) - reports user impact

15-minute sync: each role shares their view, the IC assigns priority, the team executes.
When to escalate from solo to collaborative triage:
- Multiple services affected
- Unclear impact scope
- Priority conflicts between stakeholders
- Potential for cascading failures
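A minimal sketch of that escalation rule, with criteria flags that simply mirror the list above:

# Escalate from solo to collaborative triage if any of these criteria hold.
def needs_collaborative_triage(multiple_services, unclear_scope,
                               stakeholder_conflict, cascading_risk):
    return any([multiple_services, unclear_scope, stakeholder_conflict, cascading_risk])

print(needs_collaborative_triage(
    multiple_services=True, unclear_scope=False,
    stakeholder_conflict=False, cascading_risk=True))  # True -> pull in IC, TL, PO, CS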
Key Takeaways
Quick Reference: Triage Essentials
| Principle | Guideline |
|---|---|
| Golden Rule | Triage is about what to fix and when, not how to fix it |
| Time Limit | 5 minutes per issue maximum during initial triage |
| Priority Formula | Impact × Users Affected × Business Criticality |
| Blast Radius | Growing > Large > Contained |
| Context Matters | The same bug has a different priority on Black Friday vs. Sunday 3 AM |
| Cascading Failures | Trace dependencies backward; fix the root cause, not symptoms |
| Avoid | Analysis paralysis, squeaky-wheel prioritization, solo triage for P0s |
| Remember | Symptoms first, diagnosis later; prioritize by impact, not ease |
The Triage Checklist (Print & Post)
-------------------------------------
BUG TRIAGE DECISION CHECKLIST
-------------------------------------
[ ] How many users affected?
    1-10      -> Consider P2/P3
    10-100    -> Consider P1
    100+      -> Consider P0/P1
    All users -> Likely P0

[ ] What's the business impact?
    Revenue loss          -> P0
    Data loss/security    -> P0
    Major feature down    -> P1
    Minor feature down    -> P2
    Cosmetic/nice-to-have -> P3

[ ] Is there a workaround?
    No workaround     -> +1 severity
    Manual workaround -> Keep severity
    Easy workaround   -> -1 severity

[ ] Is it getting worse?
    Growing blast radius -> +1 severity
    Stable               -> Keep severity
    Self-limiting        -> -1 severity

[ ] Can we roll back?
    Yes, easily -> Roll back first, diagnose later
    No          -> Fix forward

[ ] Business context?
    High-traffic period      -> +1 severity
    Critical business period -> +1 severity
    Low-traffic period       -> Keep severity

-------------------------------------
FINAL DECISION:
[ ] P0 - Drop everything
[ ] P1 - Fix today
[ ] P2 - This sprint
[ ] P3 - Backlog
-------------------------------------
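For teams that want to automate the checklist, the severity modifiers can be turned into a small calculator. The sketch below just restates the printed checklist (0 = P0, 3 = P3) and is no substitute for judgment:

# Start from business impact, then apply the checklist's +/- severity modifiers.
# 0 = P0 (most severe) ... 3 = P3.
BASE_SEVERITY = {"revenue_loss": 0, "data_or_security": 0,
                 "major_feature": 1, "minor_feature": 2, "cosmetic": 3}

def triage_priority(impact, has_workaround, getting_worse, critical_business_period):
    severity = BASE_SEVERITY[impact]
    if not has_workaround:
        severity -= 1  # no workaround -> bump severity up one level
    if getting_worse:
        severity -= 1  # growing blast radius -> bump again
    if critical_business_period:
        severity -= 1  # Black Friday, end of quarter, etc.
    return f"P{max(0, min(3, severity))}"

print(triage_priority("minor_feature", has_workaround=False,
                      getting_worse=False, critical_business_period=False))  # P1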
Parting Wisdom
"Hours of debugging can save you minutes of planning." (Anonymous, paraphrased)
The 5 minutes you invest in proper triage can save hours of misdirected debugging effort. When production is burning and adrenaline is high, the discipline to stop, assess, and prioritize separates experienced engineers from those who fight fires ineffectively.
Triage isn't about being slow; it's about being precisely fast. You're sprinting in the right direction, not just sprinting.
Further Study
Google SRE Book - Chapter 12: "Effective Troubleshooting": https://sre.google/sre-book/effective-troubleshooting/ - Deep dive into systematic debugging and triage methodologies from Google's Site Reliability Engineering team
Incident Management Best Practices: https://www.atlassian.com/incident-management/incident-response/severity-levels - Atlassian's guide to severity classification and triage workflows
The RICE Prioritization Framework: https://www.intercom.com/blog/rice-simple-prioritization-for-product-managers/ - Detailed explanation of RICE scoring for feature and bug prioritization