
How Incidents Actually End

Recognizing resolution versus temporary stability


This lesson covers real-world incident closure patterns, post-incident stabilization, and the hidden coordination work that actually stops the bleeding: essential skills for anyone debugging production systems under pressure.

Welcome 🚨

We often imagine incidents ending with a dramatic fix: someone types the perfect command, pushes the magic commit, and everything returns to normal. The reality is messier, more collaborative, and far more interesting. Understanding how incidents actually end is crucial for effective crisis management and helps you recognize when you're truly done versus when you're just seeing a temporary lull.

💡 Key Insight: Most incidents don't end with a single heroic action. They end through a series of incremental improvements, coordinated rollbacks, and careful verification that the system has stabilized.

Core Concepts 💻

The Myth of the "Single Fix" 🎯

Popular narratives about incident response often feature a lone engineer who finds the bug at 3 AM and deploys the fix that saves the day. This makes for good storytelling but terrible operational practice.

Reality Check: Most incidents end through:

  • Incremental mitigation - Multiple small actions that reduce impact
  • Coordinated rollbacks - Reverting changes in the right order
  • Traffic shaping - Gradually draining problem areas
  • Cache warming - Preparing systems before full restoration
  • Feature flagging - Selectively disabling problematic code paths
INCIDENT RESOLUTION PATTERN

  Initial State: 🔥🔥🔥 Complete Outage
       |
       ↓
  First Action: 🔥🔥 Reduce Blast Radius
       |         (isolate affected systems)
       ↓
  Second Action: 🔥 Stabilize Core Services
       |         (restore basic functionality)
       ↓
  Third Action: ⚠️ Monitor & Verify
       |         (watch for regression)
       ↓
  Final State: ✅ Full Service Restored
       |         (all features working)
       ↓
  Post-Incident: 📊 Observe Extended Period
                (ensure stability holds)

The Resolution Timeline ⏰

Incidents don't end at "fix deployed"; they end when the system proves it can sustain normal operations.

| Phase | What's Happening | Duration | Key Actions |
|-------|------------------|----------|-------------|
| 🔴 Active Crisis | System is degraded/down | Minutes-Hours | Mitigation, rollbacks, rerouting |
| 🟡 Stabilization | Services restored but fragile | Hours | Monitoring, capacity checks, validation |
| 🟢 Recovery | Normal operations resuming | Hours-Days | Gradual traffic restoration, cleanup |
| ✅ Resolution | Sustained stability confirmed | Days | Extended observation, post-mortem |

⚠️ Critical Point: Declaring an incident "resolved" too early is one of the most common mistakes. The system needs time to prove stability under normal load.
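
To make the phases concrete, here is a minimal sketch of how a team might encode the timeline as explicit states with controlled transitions. The IncidentPhase and advance names are illustrative, not from any particular incident tool.

## Sketch: the resolution timeline as an explicit state machine (names are illustrative)
from enum import Enum

class IncidentPhase(Enum):
    ACTIVE_CRISIS = 1
    STABILIZATION = 2
    RECOVERY = 3
    RESOLVED = 4

## Phases normally move forward; a regression sends the incident back to ACTIVE_CRISIS
ALLOWED_TRANSITIONS = {
    IncidentPhase.ACTIVE_CRISIS: {IncidentPhase.STABILIZATION},
    IncidentPhase.STABILIZATION: {IncidentPhase.RECOVERY, IncidentPhase.ACTIVE_CRISIS},
    IncidentPhase.RECOVERY: {IncidentPhase.RESOLVED, IncidentPhase.ACTIVE_CRISIS},
    IncidentPhase.RESOLVED: set(),
}

def advance(current, target):
    """Move to the next phase only if the transition is allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"Cannot move from {current.name} to {target.name}")
    return target

Jumping straight from ACTIVE_CRISIS to RESOLVED is deliberately impossible here, which mirrors the warning above: stabilization and recovery cannot be skipped.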

The Coordination Problem 🤝

Incident resolution requires orchestrating multiple people, systems, and actions. This coordination work is often invisible but absolutely critical.

Who needs to coordinate?

  • Incident Commander - Maintains overall picture, makes final calls
  • Service Owners - Know their systems' quirks and failure modes
  • SREs/Operations - Control infrastructure and deployment
  • Communications Lead - Updates stakeholders and customers
  • Subject Matter Experts - Provide domain-specific knowledge
COORDINATION FLOW DURING RESOLUTION

┌─────────────────────────────────────────────┐
│         INCIDENT COMMANDER                  │
│      "We're rolling back the deploy"        │
└──────┬────────────┬─────────────┬───────────┘
       │            │             │
       ↓            ↓             ↓
┌───────────┐  ┌──────────┐  ┌──────────────┐
│ SRE Team  │  │ Service  │  │ Comms Lead   │
│ "Executing│  │ Owners   │  │ "Updating    │
│ rollback" │  │ "Verify  │  │ status page" │
│           │  │ deps OK" │  │              │
└──────┬────┘  └────┬─────┘  └──────────────┘
       │            │
       ↓            ↓
┌──────────────────────────────┐
│  System Returns to Normal    │
│  ✅ Load balanced            │
│  ✅ Errors dropping          │
│  ✅ Latency normalizing      │
└──────────────────────────────┘

💡 Pro Tip: Use a dedicated incident channel or bridge where all coordination happens. This creates a clear record and prevents information silos.
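
One lightweight way to build that record is to have responders log every action through a small helper that timestamps it and posts it to the incident channel. This is a sketch only; post_to_channel stands in for whatever chat webhook or API your team actually uses.

## Sketch: timestamped action log for the incident channel (post_to_channel is a stand-in)
from datetime import datetime, timezone

def post_to_channel(channel, message):
    ## Replace with your chat tool's webhook or API call
    print(f"[{channel}] {message}")

def log_action(channel, actor, action):
    """Timestamp an action and post it to the shared incident channel."""
    stamp = datetime.now(timezone.utc).strftime("%H:%M UTC")
    entry = f"{stamp} - {actor}: {action}"
    post_to_channel(channel, entry)
    return entry

## Usage during resolution:
## log_action("#incident-2024-01-15", "SRE", "Executing rollback of api-service")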

Verification: Proving It's Actually Fixed 🔍

How do you know the incident is truly over? You need multiple signals confirming stability:

Technical Indicators:

  • ✅ Error rates return to baseline
  • ✅ Latency percentiles (p50, p95, p99) normalize
  • ✅ CPU/memory/disk metrics stable
  • ✅ Database connection pools healthy
  • ✅ Queue depths returning to normal
  • ✅ Cache hit rates recovered

Business Indicators:

  • ✅ User-facing features working
  • ✅ Transaction success rates normal
  • ✅ Customer complaints decreasing
  • ✅ Revenue/conversion metrics recovered
## Example: Verification checklist in code
class IncidentVerification:
    def __init__(self, incident_id):
        self.incident_id = incident_id
        self.checks_passed = []
        self.checks_failed = []
    
    def verify_error_rate(self, threshold=0.01):
        """Error rate must be below 1% for 15 minutes"""
        current_rate = metrics.get_error_rate(
            window_minutes=15
        )
        if current_rate < threshold:
            self.checks_passed.append("error_rate")
            return True
        self.checks_failed.append(
            f"error_rate: {current_rate} > {threshold}"
        )
        return False
    
    def verify_latency(self, p95_threshold_ms=500):
        """p95 latency must be under threshold"""
        p95 = metrics.get_percentile(
            percentile=95,
            window_minutes=10
        )
        if p95 < p95_threshold_ms:
            self.checks_passed.append("latency_p95")
            return True
        self.checks_failed.append(
            f"latency_p95: {p95}ms > {p95_threshold_ms}ms"
        )
        return False
    
    def can_close_incident(self):
        """All checks must pass"""
        return len(self.checks_failed) == 0

⚠️ Watch Out: Systems can appear stable for 10-15 minutes then degrade again. Always wait for sustained stability before declaring victory.
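
One way to guard against that pattern is to require a run of consecutive healthy checks rather than trusting a single green reading. A minimal sketch, assuming health_check is whatever probe your team already runs:

## Sketch: require N consecutive healthy polls before trusting stability
import time

def sustained_stability(health_check, required_passes=30, interval_seconds=60):
    """Return True only after `required_passes` consecutive healthy checks."""
    consecutive = 0
    while consecutive < required_passes:
        if health_check():
            consecutive += 1
        else:
            consecutive = 0  # any failure restarts the observation window
        time.sleep(interval_seconds)
    return True

With the defaults this demands 30 straight minutes of healthy checks, matching the stability window suggested later in this lesson.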

The Hidden Cleanup Work 🧹

Even after services are restored, significant work remains:

Immediate Cleanup:

  • Remove temporary workarounds
  • Clear incident-related alerts
  • Update status pages
  • Notify all stakeholders
  • Stop recording/logging at elevated levels

Short-term Cleanup:

  • Revert emergency configuration changes
  • Restore normal monitoring thresholds
  • Re-enable background jobs that were paused
  • Clear incident command channel
  • Archive incident logs

Long-term Cleanup:

  • Conduct post-incident review
  • Document timeline and decisions
  • Create follow-up tickets for root cause fixes
  • Update runbooks based on learnings
  • Share knowledge with broader team
CLEANUP CHECKLIST

┌─────────────────────────────────────┐
│ ✅ Services Restored                │
├─────────────────────────────────────┤
│ ⬜ Temporary fixes removed          │
│ ⬜ Config returned to normal        │
│ ⬜ Monitoring thresholds reset      │
│ ⬜ Status page updated              │
│ ⬜ Stakeholders notified            │
├─────────────────────────────────────┤
│ ⬜ Post-incident review scheduled   │
│ ⬜ Timeline documented              │
│ ⬜ Action items created             │
│ ⬜ Runbooks updated                 │
└─────────────────────────────────────┘
     Incident truly closed only
     when ALL boxes checked ✓

The Post-Incident Stabilization Period 📊

After declaring an incident resolved, teams should maintain heightened awareness:

Day 1 Post-Resolution:

  • Keep incident responders on call
  • Monitor metrics more frequently
  • Keep incident channel open (muted)
  • Have rollback plan ready

Week 1 Post-Resolution:

  • Continue elevated monitoring
  • Watch for related issues
  • Gather data for post-mortem
  • Begin implementing preventive measures
// Example: Post-incident monitoring
class PostIncidentMonitor {
  constructor(incidentId, resolutionTime) {
    this.incidentId = incidentId;
    this.resolutionTime = resolutionTime;
    this.observationPeriodHours = 72; // 3 days
  }
  
  isInObservationWindow() {
    const hoursSinceResolution = 
      (Date.now() - this.resolutionTime) / (1000 * 60 * 60);
    return hoursSinceResolution < this.observationPeriodHours;
  }
  
  checkForRegression() {
    if (!this.isInObservationWindow()) {
      return { stable: true, message: "Observation period complete" };
    }
    
    const metrics = this.getCurrentMetrics();
    const baseline = this.getBaselineMetrics();
    
    if (metrics.errorRate > baseline.errorRate * 1.5) {
      return {
        stable: false,
        message: "Error rate elevated",
        action: "Consider reopening incident"
      };
    }
    
    return { stable: true, message: "Within normal parameters" };
  }
}

Common Resolution Patterns 🎨

Pattern 1: The Rollback

## Most common resolution: undo the breaking change
$ git revert abc123
$ kubectl rollout undo deployment/api-service
$ terraform apply -target=aws_instance.broken  # run after reverting the config in version control

## Verify rollback succeeded
$ kubectl rollout status deployment/api-service
## Waiting for deployment "api-service" rollout to finish
## deployment "api-service" successfully rolled out

## Monitor for stability
$ watch -n 5 'curl -s http://health-check/api | jq .status'

Pattern 2: The Traffic Drain

## Drain traffic from the problematic version by pointing the Service selector back at known-good pods
apiVersion: v1
kind: Service
metadata:
  name: api-service
spec:
  selector:
    app: api
    version: v2  # Changed from v3 back to v2
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
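
The manifest above flips the selector back in one step. In practice, teams often drain traffic gradually instead, shifting a percentage at a time and checking error rates between steps. A rough sketch of that loop; set_traffic_split and error_rate are hypothetical helpers wired to your mesh or load balancer and your metrics system:

## Sketch: gradual drain from v3 back to v2, verifying between steps (helpers are hypothetical)
import time

def drain_traffic(set_traffic_split, error_rate,
                  step=25, wait_seconds=300, max_error_rate=0.01):
    """Shift traffic to v2 in `step`% increments, pausing to verify each step."""
    for v2_share in range(step, 101, step):
        set_traffic_split(v2=v2_share, v3=100 - v2_share)
        time.sleep(wait_seconds)  # let metrics settle before the next shift
        if error_rate() > max_error_rate:
            raise RuntimeError(
                f"Error rate still elevated at {v2_share}% on v2; pausing drain"
            )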

Pattern 3: The Feature Flag Kill Switch

## Disable problematic feature without full deployment
import feature_flags

## During incident
feature_flags.set("new_recommendation_engine", False)

## System stabilizes as traffic routes to old code path
if feature_flags.is_enabled("new_recommendation_engine", user):
    return new_recommendations(user)  # Skipped now
else:
    return legacy_recommendations(user)  # Falls back here

Pattern 4: The Cache Flush & Rebuild

## Corrupted cache causing issues
import time

import redis_client

## Clear bad data
redis_client.flushdb("recommendation_cache")

## Warm cache with known-good data
for user_id in critical_users:
    recommendations = generate_fresh_recommendations(user_id)
    redis_client.set(
        f"rec:{user_id}",
        recommendations,
        ex=3600
    )

## Gradually enable cache reads
for region in ["us-west", "us-east", "eu-west"]:
    enable_cache_for_region(region)
    time.sleep(300)  # Wait 5 min between regions
    verify_metrics(region)

Examples 📚

Example 1: Database Connection Pool Exhaustion

The Incident: API services can't connect to the database. Error rate spikes to 45%. Users seeing "Service Unavailable" errors.

Initial Discovery:

-- Check current connections
SELECT count(*) FROM pg_stat_activity;
-- Returns: 2000 (max_connections = 2000)

SELECT state, count(*) 
FROM pg_stat_activity 
GROUP BY state;
/*
  state   | count
----------+-------
 idle     | 1850
 active   |  150
*/

The Problem: Connection leak in new deployment. Services opening connections but not closing them.
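
For illustration, the leak looked roughly like the first function below: a connection opened per request and never returned. The eventual root-cause fix closes it deterministically. This is a sketch using psycopg2, not the actual code from the incident.

## Sketch: the leak pattern vs. the fix (illustrative)
import psycopg2

def get_user_leaky(user_id):
    conn = psycopg2.connect("dbname=production")  # opened per request...
    cur = conn.cursor()
    cur.execute("SELECT * FROM users WHERE id = %s", (user_id,))
    return cur.fetchone()  # ...but never closed, so the pool slowly drains

def get_user_fixed(user_id):
    conn = psycopg2.connect("dbname=production")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM users WHERE id = %s", (user_id,))
            return cur.fetchone()
    finally:
        conn.close()  # always released, even if the query raises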

Resolution Steps:

| Step | Action | Impact | Time |
|------|--------|--------|------|
| 1 | Rollback deployment to previous version | Stop new leaks | T+5min |
| 2 | Restart API pods to clear leaked connections | Release DB connections | T+8min |
| 3 | Verify connection count dropping | Confirm fix working | T+12min |
| 4 | Monitor error rates for 30 minutes | Ensure stability | T+42min |
| 5 | Declare incident resolved | Resume normal ops | T+45min |

Verification:

## Monitoring script used during stabilization
import time
import psycopg2

def check_db_health():
    conn = psycopg2.connect("dbname=production")
    cur = conn.cursor()
    
    cur.execute("""
        SELECT
            max_conn,
            used,
            res_for_super,
            max_conn - used - res_for_super AS available
        FROM (
            SELECT
                (SELECT setting::int FROM pg_settings
                 WHERE name = 'max_connections') AS max_conn,
                (SELECT count(*) FROM pg_stat_activity) AS used,
                (SELECT setting::int FROM pg_settings
                 WHERE name = 'superuser_reserved_connections') AS res_for_super
        ) q
    """)

    result = cur.fetchone()
    cur.close()
    conn.close()  # don't leak connections from the monitoring script itself
    print(f"Available connections: {result[3]} / {result[0]}")

    if result[3] < 100:  # Less than 100 available
        return False
    return True

## Run every 30 seconds
while True:
    if check_db_health():
        print("✅ DB connection pool healthy")
    else:
        print("⚠️ Connection pool still stressed")
    time.sleep(30)

Key Lesson: The incident "ended" at T+8min when connections were released, but it wasn't resolved until T+45min after sustained stability.

Example 2: Cascading Cache Failure

The Incident: Memcached cluster crashes. All requests hit the database. Database becomes overloaded. Entire site slows to a crawl.

The Problem: Cache warming logic had a bug that caused stampede when cache was cold.
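
A common defense against that stampede is single-flight caching: only one caller rebuilds a missing key while the rest wait and reuse its result. A minimal single-process sketch of the idea; cache is assumed to be any Redis-like client with get and set, and production systems usually use a distributed lock rather than thread locks:

## Sketch: single-flight cache fill to avoid a cold-cache stampede
import threading

_fill_locks = {}  # one lock per cache key
_fill_locks_guard = threading.Lock()

def get_or_rebuild(cache, key, rebuild):
    """Let exactly one thread recompute a missing key; others reuse the result."""
    value = cache.get(key)
    if value is not None:
        return value
    with _fill_locks_guard:
        lock = _fill_locks.setdefault(key, threading.Lock())
    with lock:
        value = cache.get(key)  # re-check: another thread may have filled it
        if value is None:
            value = rebuild()  # exactly one expensive DB hit per cold key
            cache.set(key, value)
    return value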

Resolution Approach:

## Emergency traffic shaping to prevent total collapse
import time

class EmergencyThrottler:
    def __init__(self):
        self.cache = get_redis_client()  # Backup cache
        self.db = get_db_connection()
        self.allow_db_queries = False  # Start with DB protection
    
    def get_user_data(self, user_id):
        # Try backup cache first
        cached = self.cache.get(f"user:{user_id}")
        if cached:
            return cached
        
        # Only query DB if we're allowing it
        if self.allow_db_queries:
            data = self.db.query(
                "SELECT * FROM users WHERE id = %s",
                (user_id,)
            )
            self.cache.set(f"user:{user_id}", data, ex=300)
            return data
        
        # Serve stale/degraded experience
        return {"id": user_id, "degraded": True}

## Gradual restoration plan
throttler = EmergencyThrottler()

## Phase 1: Protect database, serve degraded
time.sleep(180)  # 3 minutes to stabilize

## Phase 2: Slowly warm backup cache
for user_id in get_active_user_ids(limit=1000):
    data = throttler.db.query("SELECT * FROM users WHERE id = %s", (user_id,))
    throttler.cache.set(f"user:{user_id}", data, ex=3600)
    time.sleep(0.1)  # Rate limit warming

## Phase 3: Enable DB queries with rate limiting
throttler.allow_db_queries = True

## Phase 4: Fix and restart primary memcached cluster
## Phase 5: Gradually shift traffic back to primary

Timeline:

  • T+0: Memcached cluster crashes
  • T+2min: Emergency throttling activated (degraded mode)
  • T+5min: Database load stabilizes
  • T+10min: Backup cache warming begins
  • T+25min: Primary cache cluster fixed and restarted
  • T+40min: Traffic gradually shifted back to primary
  • T+65min: Full functionality restored
  • T+2hours: Declared resolved after sustained stability

Why It Took So Long: Couldn't just "flip a switch" back on. Had to ensure cache was warm to prevent immediate re-collapse.

Example 3: Kubernetes Networking Split Brain

The Incident: Some pods can't reach others. Service mesh partially broken. Intermittent 503 errors for 30% of requests.

Diagnosis:

## Check pod connectivity
$ kubectl exec -it api-pod-abc123 -- curl service-b.default.svc.cluster.local
curl: (7) Failed to connect to service-b: Connection refused

## Check DNS resolution
$ kubectl exec -it api-pod-abc123 -- nslookup service-b.default.svc.cluster.local
Server:    10.96.0.10
Address:   10.96.0.10:53

Name:      service-b.default.svc.cluster.local
Address:   10.100.200.50

## DNS works, but connection fails - network policy issue?
$ kubectl get networkpolicies --all-namespaces
## Shows misconfigured policy blocking traffic

Resolution:

## The problematic network policy that was blocking traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-ingress
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: frontend
      ports:
        - protocol: TCP
          port: 8080
## Missing: backend services couldn't reach API!

The Fix:

## Quick mitigation: Delete problematic policy
$ kubectl delete networkpolicy api-ingress -n default

## Verify connectivity restored
$ for pod in $(kubectl get pods -l app=api -o name); do
    kubectl exec $pod -- curl -s service-b.default.svc.cluster.local/health
  done
## All return {"status": "healthy"}

## Monitor for 15 minutes
$ watch -n 10 'kubectl top pods; kubectl get pods | grep -v Running'

## Create corrected policy
$ kubectl apply -f corrected-network-policy.yaml

## Verify correction didn't break anything
$ ./run-smoke-tests.sh

Key Point: The incident "ended" when the policy was deleted (T+8min), but resolution required:

  1. Verifying connectivity restored (T+15min)
  2. Creating correct policy (T+30min)
  3. Testing the correction (T+40min)
  4. Monitoring for regression (T+90min)

Example 4: Multi-Region Failover Gone Wrong

The Incident: Automatic failover to backup region triggered, but backup region wasn't ready. Both regions now degraded.

The Complexity:

## The failover logic that backfired
class RegionFailover:
    def __init__(self):
        self.primary_region = "us-east-1"
        self.backup_region = "us-west-2"
        self.failover_threshold = 0.05  # 5% error rate
    
    def check_and_failover(self):
        primary_errors = self.get_error_rate(self.primary_region)
        
        if primary_errors > self.failover_threshold:
            # Problem: Didn't check if backup was ready!
            self.set_active_region(self.backup_region)
            return True
        return False
    
    def set_active_region(self, region):
        # Updates DNS to point to new region
        # Problem: Backup region cache was cold,
        # database read replicas weren't warmed up
        dns.update_route53_weighted_routing(
            primary_weight=0 if region == self.backup_region else 100,
            backup_weight=100 if region == self.backup_region else 0
        )
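
The missing piece was a readiness check on the backup region before shifting traffic. A sketch of what that guard might look like, building on the class above; backup_cache_hit_rate and replica_lag_seconds are hypothetical helpers backed by your metrics system:

## Sketch: gate automatic failover on backup-region readiness (helpers are hypothetical)
class SafeRegionFailover(RegionFailover):
    MIN_CACHE_HIT_RATE = 0.80       # backup cache must be warm enough
    MAX_REPLICA_LAG_SECONDS = 30    # read replicas must be caught up

    def backup_is_ready(self):
        return (backup_cache_hit_rate(self.backup_region) >= self.MIN_CACHE_HIT_RATE
                and replica_lag_seconds(self.backup_region) <= self.MAX_REPLICA_LAG_SECONDS)

    def check_and_failover(self):
        primary_errors = self.get_error_rate(self.primary_region)
        if primary_errors > self.failover_threshold and self.backup_is_ready():
            self.set_active_region(self.backup_region)
            return True
        return False  # stay on the degraded primary rather than fail over to a cold region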

Resolution Required Careful Orchestration:

RESOLUTION ORCHESTRATION

T+0    Primary region degraded (15% errors)
       ↓
T+2    Automatic failover triggered
       ↓
T+3    Backup region receiving traffic
       BUT: Cold cache, slow DB replicas
       ↓
T+5    Backup region now at 25% errors!
       ↓
T+7    MANUAL INTERVENTION
       → Pause automatic failover
       → Route traffic 50/50 both regions
       ↓
T+15   Warm up backup region:
       → Pre-populate cache
       → Promote read replicas
       ↓
T+30   Gradually increase backup to 70%
       Monitor: Both regions stable
       ↓
T+45   Investigate primary region issue
       Found: Memory leak in new service
       ↓
T+60   Deploy fix to primary region
       ↓
T+75   Gradually shift back to primary
       ↓
T+90   Primary at 100%, backup at 0%
       ↓
T+120  Extended monitoring confirms stable
       ↓
T+180  Incident declared RESOLVED

Lessons:

  • Multiple regions make incidents more complex, not simpler
  • Resolution required coordinating DNS, cache, database, and traffic
  • "Fixed" happened at T+60, "Resolved" at T+180

Common Mistakes ⚠️

Mistake 1: Declaring Victory Too Early 🏁

The Problem: Team sees error rates drop and immediately closes the incident, only to have it flare up again 20 minutes later.

Why It Happens:

  • Pressure to "resolve" the incident quickly
  • Fatigue after hours of firefighting
  • Relief that things appear to be working

The Fix:

## Enforce waiting period before closure
from datetime import datetime

class IncidentLifecycle:
    MIN_STABILITY_MINUTES = 30
    
    def can_resolve(self, incident):
        if not incident.all_systems_green():
            return False, "Systems not fully recovered"
        
        time_since_green = datetime.now() - incident.became_green_at
        minutes_stable = time_since_green.total_seconds() / 60
        
        if minutes_stable < self.MIN_STABILITY_MINUTES:
            return False, f"Only {minutes_stable:.0f} minutes stable, need {self.MIN_STABILITY_MINUTES}"
        
        return True, "Incident can be resolved"

Mistake 2: Not Documenting the Timeline ⏱️

The Problem: Team fixes the incident but can't remember what they did or when. Post-mortem is impossible.

The Fix: Designate a scribe to log all actions in real-time:

## Incident Timeline Template

### 14:23 UTC - Incident Detected
- Alert fired: API error rate > 5%
- On-call: Jane Smith paged

### 14:26 UTC - Initial Investigation
- Jane: Checked error logs, seeing DB connection timeouts
- Jane: Started incident channel #incident-2024-01-15

### 14:30 UTC - Escalation
- Jane: Paged DB team (Mike Rodriguez)
- Mike joined incident channel

### 14:32 UTC - Hypothesis
- Mike: DB connection pool exhausted (1998/2000 used)
- Recent deploy by Alice Chen at 14:15 UTC suspected

### 14:35 UTC - Mitigation Started
- Decision: Rollback Alice's deploy
- Command: kubectl rollout undo deployment/api-service

### 14:38 UTC - Mitigation Complete
- Rollback finished
- Restarting pods to clear leaked connections

### 14:42 UTC - Recovery Observed
- Error rate dropping: now 2%
- DB connections: 1450/2000

### 14:55 UTC - Stability Confirmed
- Error rate at baseline (0.1%)
- DB connections stable at ~800
- Monitoring for regression

### 15:25 UTC - Incident Resolved
- 30 minutes of stability confirmed
- Post-mortem scheduled for tomorrow 10am

Mistake 3: Forgetting About Downstream Systems 🌊

The Problem: Team fixes their service, but downstream consumers are still broken because they cached bad data or got into a bad state.

Example:

// Upstream service is fixed, but downstream still broken
class DownstreamService {
  constructor() {
    this.cache = new Map();
    this.circuitBreaker = new CircuitBreaker();
  }
  
  async callUpstream(id) {
    // Circuit breaker tripped during incident
    if (this.circuitBreaker.isOpen()) {
      // Still returning errors even though upstream is fixed!
      return { error: "Circuit breaker open" };
    }
    
    // Cache populated with bad data during incident
    if (this.cache.has(id)) {
      return this.cache.get(id);  // Serving stale errors!
    }
    
    return await http.get(`http://upstream/api/${id}`);
  }
}

The Fix: After fixing upstream, coordinate downstream recovery:

## 1. Clear caches in downstream services
$ kubectl exec -it downstream-pod -- redis-cli FLUSHDB

## 2. Reset circuit breakers
$ curl -X POST http://downstream/admin/circuit-breaker/reset

## 3. Restart downstream pods if needed
$ kubectl rollout restart deployment/downstream-service

## 4. Verify end-to-end flow
$ ./run-integration-tests.sh

Mistake 4: Skipping the Cleanup Work 🧹

The Problem: Emergency changes and workarounds left in place "temporarily" become permanent technical debt.

Examples of Forgotten Cleanup:

  • Rate limits lowered during incident
  • Circuit breakers manually opened
  • Feature flags disabled
  • Monitoring alerts silenced
  • Debug logging left at elevated levels
  • Temporary firewall rules

The Fix: Create cleanup tickets immediately:

## Automated cleanup tracking
from datetime import datetime, timedelta

class IncidentCleanup:
    def __init__(self, incident_id):
        self.incident_id = incident_id
        self.cleanup_items = []
    
    def add_emergency_change(self, change_type, details, owner):
        self.cleanup_items.append({
            "type": change_type,
            "details": details,
            "owner": owner,
            "created_at": datetime.now(),
            "must_revert_by": datetime.now() + timedelta(days=7),
            "status": "pending"
        })
        
        # Automatically create ticket
        jira.create_issue(
            project="OPS",
            summary=f"Cleanup: {change_type} from incident {self.incident_id}",
            description=details,
            assignee=owner,
            due_date=datetime.now() + timedelta(days=7)
        )
    
    def verify_all_cleaned(self):
        pending = [item for item in self.cleanup_items 
                   if item["status"] != "completed"]
        return len(pending) == 0

## Usage during incident
cleanup = IncidentCleanup("INC-2024-001")
cleanup.add_emergency_change(
    "rate_limit",
    "Lowered API rate limit from 1000/min to 100/min",
    "jane.smith@company.com"
)

Mistake 5: Not Communicating Resolution 📢

The Problem: Engineering team fixes everything but forgets to tell stakeholders. Customer support still fielding complaints. Status page still shows "Major Outage."

The Fix: Communication checklist:

### Resolution Communication Checklist

#### Internal (Immediate)
- [ ] Update incident channel with "RESOLVED" message
- [ ] Post in #engineering channel
- [ ] Notify customer support team
- [ ] Update internal status dashboard
- [ ] Email affected team leads

#### External (Within 15 minutes)
- [ ] Update public status page to "Resolved"
- [ ] Post on Twitter/X if public-facing
- [ ] Send customer email if SLA breach
- [ ] Update any support tickets

#### Follow-up (Within 24 hours)
- [ ] Post incident summary to status page
- [ ] Internal post-mortem scheduled
- [ ] Stakeholders notified of follow-up actions

Key Takeaways 🎯

📋 Quick Reference: Incident Resolution

  • Resolution ≠ Fixed - Services working doesn't mean incident is over
  • Sustained Stability - Wait 30+ minutes of normal operations before declaring resolved
  • Coordination Critical - Multiple people/systems must work together
  • Verify Everything - Check technical AND business metrics
  • Document Timeline - Real-time logging enables effective post-mortems
  • Downstream Impact - Check consumers, caches, circuit breakers
  • Cleanup Required - Emergency changes must be reverted
  • Communicate Widely - Internal and external stakeholders need updates

The Three Phases of True Resolution:

  1. 🔧 Fixed - The immediate problem is mitigated (minutes to hours)
  2. ✅ Stable - Systems prove they can sustain normal operations (hours)
  3. 📚 Resolved - Cleanup complete, knowledge captured, team ready for next time (days)

💡 Remember: The best incident responses aren't the fastest; they're the most thorough. Taking time to verify stability and complete cleanup prevents the same incident from recurring hours or days later.

📚 Further Study

  1. Google SRE Book - Postmortem Culture: https://sre.google/sre-book/postmortem-culture/ - Deep dive into learning from incidents
  2. PagerDuty Incident Response Guide: https://response.pagerduty.com/ - Comprehensive incident management practices
  3. Etsy's Debriefing Facilitation Guide: https://extfiles.etsy.com/DebriefingFacilitationGuide.pdf - How to run effective post-incident reviews