Reversible Actions First
Choosing interventions you can safely undo
Master debugging under pressure with free flashcards and spaced repetition practice. This lesson covers non-destructive investigation techniques, read-only commands, and safe rollback strategies: essential concepts for triaging production incidents without making them worse.
Welcome
When systems fail in production, every second counts. But here's the paradox: the urgency to fix things quickly can lead to hasty actions that compound the problem. You've seen it happen: someone restarts a service to "fix" an issue, wiping out crucial logs that would have revealed the root cause. Or a well-intentioned configuration change cascades into a broader outage.
Reversible actions first is a fundamental principle in incident response: prioritize investigation and remediation steps that can be undone or that don't permanently alter system state. Think of it like a surgeon examining a patient before making any incisions: gather information safely before making irreversible changes.
This lesson will equip you with:
- Techniques for gathering diagnostic information without side effects
- Strategies for making changes that can be cleanly rolled back
- Recognition of dangerous "point of no return" actions
- Practical command patterns for safe investigation
Core Concepts
The Reversibility Spectrum
Not all debugging actions carry the same risk. Understanding this spectrum helps you prioritize effectively:
| Risk Level | Action Type | Examples | Reversibility |
|---|---|---|---|
| Green: Safe | Read-only observation | View logs, check metrics, list processes | Fully reversible (no changes made) |
| Yellow: Low Risk | Non-destructive changes | Enable debug logging, create snapshots | Easily reversible (state preserved) |
| Orange: Medium Risk | Reversible modifications | Config changes, feature flags, rolling restarts | Reversible with effort (rollback possible) |
| Red: High Risk | Destructive operations | Delete data, force kill processes, drop tables | Irreversible or very difficult to undo |
Golden Rule: Work from green to red. Exhaust safer options before escalating to riskier ones.
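One way to make this spectrum actionable is to encode risk levels in your incident tooling so that riskier steps require an explicit escalation decision. A minimal Python sketch, assuming hypothetical RiskLevel and Action types (these names are illustrative, not from any standard library):
from enum import IntEnum
from dataclasses import dataclass

class RiskLevel(IntEnum):
    SAFE = 1     # read-only observation
    LOW = 2      # non-destructive changes
    MEDIUM = 3   # reversible modifications
    HIGH = 4     # destructive operations

@dataclass
class Action:
    description: str
    risk: RiskLevel

def allowed(action: Action, max_authorized: RiskLevel) -> bool:
    """Only allow actions at or below the currently authorized risk level."""
    return action.risk <= max_authorized

# Log inspection is always fine; a forced restart needs explicit escalation
print(allowed(Action("tail error logs", RiskLevel.SAFE), RiskLevel.SAFE))    # True
print(allowed(Action("force-kill worker", RiskLevel.HIGH), RiskLevel.SAFE))  # False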
The Read-Only Investigation Phase
Before you change anything, understand what's happening. This phase focuses on gathering evidence without altering system state.
Essential Read-Only Commands
System Status Observation:
## Check process status
ps aux | grep <service_name>
top -b -n 1
htop # Interactive, but read-only
## Network connections
netstat -tulpn
ss -tulpn
lsof -i :<port>
## Disk usage (no modifications)
df -h
du -sh /path/to/directory
## Memory information
free -m
vmstat 1 5
Application-Level Diagnostics:
## View current application state
## (read_config_file and check_db_connection are app-specific read-only helpers;
##  minimal stand-ins are sketched after the Node.js snippet below)
import sys
import os

def safe_diagnostic_dump():
    """Gather info without side effects"""
    info = {
        "pid": os.getpid(),
        "python_version": sys.version,
        "module_paths": sys.path,
        # Read config, don't modify
        "config": read_config_file(),
        # Check connections, don't restart
        "db_status": check_db_connection()
    }
    return info
// Node.js: Check state without mutations
const diagnostics = {
  memory: process.memoryUsage(),
  uptime: process.uptime(),
  version: process.version,
  activeHandles: process._getActiveHandles().length,
  // View, don't modify
  env: process.env.NODE_ENV
};
console.log(JSON.stringify(diagnostics, null, 2));
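The Python snippet above assumes two app-specific helpers. Minimal read-only stand-ins might look like the following; the file paths and the sqlite3 usage are illustrative assumptions, not part of any particular stack:
import json
import sqlite3

def read_config_file(path="/etc/myapp/config.json"):
    """Read the config without modifying it (the path is an assumed example)."""
    with open(path, "r") as f:
        return json.load(f)

def check_db_connection(db_path="/var/lib/myapp/app.db"):
    """Issue a trivial read-only query; never restart services or reset pools."""
    try:
        conn = sqlite3.connect(db_path)
        conn.execute("SELECT 1")
        conn.close()
        return "ok"
    except sqlite3.Error as e:
        return f"error: {e}"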
Pro Tip: Redirect diagnostic output to files rather than stdout in production. This preserves evidence without impacting the running system:
## Good: Save to file for later analysis
ps aux > /tmp/process_snapshot_$(date +%s).txt
## Better: Timestamped and organized
mkdir -p /var/log/incident_$(date +%Y%m%d_%H%M%S)
ps aux > /var/log/incident_$(date +%Y%m%d_%H%M%S)/processes.txt
netstat -tulpn > /var/log/incident_$(date +%Y%m%d_%H%M%S)/network.txt
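If you capture these snapshots often, it is worth scripting the routine. A small Python sketch (the command list and output location are assumptions you would adapt) that writes read-only command output into a timestamped incident directory:
import subprocess
import time
from pathlib import Path

# Read-only commands to capture; adjust to your environment
SNAPSHOT_COMMANDS = {
    "processes.txt": ["ps", "aux"],
    "network.txt": ["ss", "-tulpn"],
    "disk.txt": ["df", "-h"],
    "memory.txt": ["free", "-m"],
}

def capture_incident_snapshot(base_dir="/var/log"):
    """Run read-only commands and save their output; makes no changes to services."""
    out_dir = Path(base_dir) / f"incident_{time.strftime('%Y%m%d_%H%M%S')}"
    out_dir.mkdir(parents=True, exist_ok=True)
    for filename, cmd in SNAPSHOT_COMMANDS.items():
        result = subprocess.run(cmd, capture_output=True, text=True)
        (out_dir / filename).write_text(result.stdout)
    return out_dir

if __name__ == "__main__":
    print(f"Snapshot written to {capture_incident_snapshot()}")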
Reversible Changes: Making Safe Modifications
When you must make changes, structure them so they can be undone. This requires planning before execution.
Configuration Management
Always back up before modifying:
## Backup configuration files
cp /etc/nginx/nginx.conf /etc/nginx/nginx.conf.backup_$(date +%s)
## For complex configs, use version control
cd /etc/myapp
git add .
git commit -m "Snapshot before emergency change - ticket INC-1234"
## Now make your changes
vim config.yaml
## If it breaks, easy rollback:
git checkout config.yaml
Feature flags for runtime behavior:
## Toggle behavior without deployment
class FeatureFlags:
def __init__(self):
# Read from external config that can be updated live
self.flags = self.load_flags()
def is_enabled(self, feature_name):
return self.flags.get(feature_name, False)
## In your code
if feature_flags.is_enabled("new_algorithm"):
result = new_fast_algorithm(data)
else:
result = stable_legacy_algorithm(data) # Fallback
## If new_algorithm causes issues, flip the flag remotely
## No deployment needed, instant rollback
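The load_flags() call above is deliberately abstract. One common approach, sketched here under the assumption of a small JSON flag file at an example path, is to re-read the flags on each check so an operator can flip a value without redeploying; FeatureFlags.load_flags could simply delegate to this loader:
import json
from pathlib import Path

FLAG_FILE = Path("/etc/myapp/flags.json")  # assumed location, e.g. {"new_algorithm": false}

def load_flags():
    """Read current flag values from disk; editing the file is the 'remote flip'."""
    try:
        return json.loads(FLAG_FILE.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return {}  # fail safe: unknown flags default to disabled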
// Go: Circuit breaker pattern for reversible routing
type CircuitBreaker struct {
    failureThreshold int
    failures         int
    state            string // "closed", "open", "half-open"
}

func (cb *CircuitBreaker) Call(fn func() error) error {
    if cb.state == "open" {
        // Fail fast, don't make problem worse
        return errors.New("circuit breaker open")
    }
    err := fn()
    if err != nil {
        cb.failures++
        if cb.failures >= cb.failureThreshold {
            cb.state = "open" // Reversible: can be reset
        }
        return err
    }
    // Success: reset
    cb.failures = 0
    cb.state = "closed"
    return nil
}
Database Operations: The Transactional Approach
Always use transactions for exploratory queries:
-- Start transaction (changes not committed)
BEGIN TRANSACTION;
-- Your diagnostic/fix query
UPDATE users SET status = 'active'
WHERE last_login > NOW() - INTERVAL '30 days'
AND status = 'pending';
-- Check affected rows
SELECT COUNT(*) FROM users WHERE status = 'active';
-- If wrong, rollback
ROLLBACK;
-- If correct, commit
-- COMMIT;
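The same discipline carries over to application code that applies fixes programmatically. A sketch using Python's built-in sqlite3 module (table and column names are invented for illustration): the change commits only if a verification query passes, otherwise everything rolls back:
import sqlite3

def apply_with_verification(db_path, fix_sql, verify_sql):
    """Run a fix inside a transaction; roll back unless the verification count is zero."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # opens a transaction; commits on success, rolls back on exception
            conn.execute(fix_sql)
            remaining = conn.execute(verify_sql).fetchone()[0]
            if remaining != 0:
                raise RuntimeError(f"Verification failed: {remaining} rows still bad")
    finally:
        conn.close()

# Example usage with invented table/column names:
# apply_with_verification(
#     "app.db",
#     "UPDATE users SET status = 'active' WHERE status = 'pending'",
#     "SELECT COUNT(*) FROM users WHERE status = 'pending'",
# )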
Soft deletes instead of hard deletes:
## Bad: Irreversible
def delete_old_records(cutoff_date):
    db.execute("DELETE FROM logs WHERE created_at < ?", cutoff_date)

## Good: Reversible
def soft_delete_old_records(cutoff_date):
    db.execute(
        "UPDATE logs SET deleted_at = NOW() "
        "WHERE created_at < ? AND deleted_at IS NULL",
        cutoff_date
    )
    # Can undo: UPDATE logs SET deleted_at = NULL WHERE ...
Recognizing Point-of-No-Return Actions
Some actions cannot be undone. Learn to recognize them and establish safeguards.
Destructive Operations Checklist
STOP Before Running These
- rm -rf on production data
- DROP TABLE / TRUNCATE
- kill -9 (force kill without cleanup)
- Overwriting log files or rotating them out
- Deploying new code without a tested rollback plan
- Scaling down infrastructure during active traffic
- Revoking credentials that might be in use
Safe alternatives:
## Instead of: rm -rf /var/log/app/*
## Do: Archive then verify
tar -czf /backup/logs_$(date +%s).tar.gz /var/log/app/
ls -la /backup/logs_*.tar.gz # Verify backup exists
## Then consider if deletion is really necessary
## Instead of: kill -9 <pid>
## Do: Graceful shutdown
kill -TERM <pid> # SIGTERM allows cleanup
sleep 5
## Check if stopped
ps -p <pid> || echo "Process stopped gracefully"
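A small guard rail in front of truly destructive operations helps too: refuse to proceed until a backup exists and has been sanity-checked. A hedged Python sketch (the /backup directory convention is an assumption):
import os
import subprocess
import time

def archive_before_delete(target_dir, backup_dir="/backup"):
    """Archive a directory and verify the archive exists before any deletion is considered."""
    os.makedirs(backup_dir, exist_ok=True)
    archive = os.path.join(backup_dir, f"logs_{int(time.time())}.tar.gz")
    subprocess.run(["tar", "-czf", archive, target_dir], check=True)
    if os.path.getsize(archive) == 0:
        raise RuntimeError(f"Backup {archive} is empty; refusing to continue")
    print(f"Backup verified at {archive}; deletion can now be evaluated separately")
    return archive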
The "Undo Plan" Requirement
Before any medium or high-risk action, document the reversal steps:
## incident-response-runbook.yml
action: "Deploy hotfix v2.3.1 to production"
risk_level: "medium"
rollback_plan:
  - step: "Identify current version"
    command: "kubectl get deployment myapp -o yaml | grep image:"
  - step: "Revert deployment"
    command: "kubectl set image deployment/myapp myapp=myapp:v2.3.0"
  - step: "Verify rollback"
    command: "kubectl rollout status deployment/myapp"
  - step: "Check health endpoint"
    command: "curl https://api.example.com/health"
max_execution_time: "30 seconds"
if_rollback_fails: "Contact @oncall-lead, consider full service restart"
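A pre-flight check can enforce this requirement mechanically: refuse to start a risky action unless the runbook entry actually contains rollback steps with concrete commands. A sketch assuming the YAML shape above and that PyYAML is installed:
import sys
import yaml  # assumes PyYAML is available

REQUIRED_FIELDS = {"action", "risk_level", "rollback_plan"}

def validate_runbook(path):
    """Return the parsed runbook only if every rollback step has a concrete command."""
    with open(path) as f:
        runbook = yaml.safe_load(f)
    missing = REQUIRED_FIELDS - set(runbook or {})
    if missing:
        raise ValueError(f"Runbook missing fields: {sorted(missing)}")
    for step in runbook["rollback_plan"]:
        if not step.get("command"):
            raise ValueError(f"Rollback step lacks a command: {step.get('step')}")
    return runbook

if __name__ == "__main__":
    validate_runbook(sys.argv[1])
    print("Rollback plan present; proceed with the documented change")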
Progressive Intervention Strategy
A framework for escalating from safe to riskier actions:
PROGRESSIVE INTERVENTION LADDER

Level 1 (Green): OBSERVE (Read-Only)
- View logs and metrics
- Check process status
- Query database (SELECT only)
- Network traffic analysis
If that doesn't yield enough info, move to...

Level 2 (Yellow): ENHANCE VISIBILITY (Non-Destructive)
- Enable debug logging
- Attach profiler/debugger
- Create snapshots/checkpoints
- Add temporary instrumentation
If the problem persists, move to...

Level 3 (Orange): REVERSIBLE CHANGES
- Toggle feature flags
- Adjust configuration (with backup)
- Rolling restart (one instance)
- Route traffic to healthy instances
If still not resolved, move to...

Level 4 (Red): CONTROLLED RISK (Documented Rollback)
- Deploy hotfix (tested rollback ready)
- Scale resources significantly
- Failover to backup systems
- Emergency configuration override
Last resort only...

Level 5 (Black): HIGH RISK (Irreversible)
- Requires approval from the incident commander
- Full service restart/outage
- Data deletion or schema changes
- Infrastructure teardown

Move up the ladder ONLY when lower levels fail to resolve the issue.
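If your team tracks incident actions in tooling, the ladder itself can be encoded so every escalation is an explicit, logged decision. A minimal Python sketch; the level names mirror the ladder above and nothing here is a standard library:
INTERVENTION_LADDER = [
    ("observe", "read-only: logs, metrics, SELECT queries"),
    ("enhance visibility", "debug logging, profilers, snapshots"),
    ("reversible changes", "feature flags, config with backup, rolling restart"),
    ("controlled risk", "hotfix with tested rollback, failover"),
    ("high risk", "irreversible actions; incident commander approval required"),
]

class Escalation:
    """Tracks the current level and forces one-step, justified escalation."""
    def __init__(self):
        self.level = 0

    def escalate(self, justification: str):
        if self.level + 1 >= len(INTERVENTION_LADDER):
            raise RuntimeError("Already at the highest level")
        self.level += 1
        name, examples = INTERVENTION_LADDER[self.level]
        print(f"Escalating to '{name}' ({examples}) because: {justification}")

# Usage: start at observe, escalate only with a reason
esc = Escalation()
esc.escalate("read-only checks did not reveal the root cause")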
Practical Examples
Example 1: Web Service Responding Slowly
Scenario: Users report your API is taking 10+ seconds to respond. You have SSH access to the production server.
Panic Response (Don't Do This):
sudo systemctl restart myapp # Wipes current state!
## Lost: in-memory data, active connections, stack traces
Reversible Actions Approach:
Step 1 - Observe (Read-Only):
## Capture current state FIRST
timestamp=$(date +%Y%m%d_%H%M%S)
mkdir -p /tmp/incident_$timestamp
## What's running?
ps aux > /tmp/incident_$timestamp/processes.txt
top -b -n 1 > /tmp/incident_$timestamp/cpu.txt
## Network connections
netstat -tulpn > /tmp/incident_$timestamp/network.txt
## Application logs (last 1000 lines, no truncation)
tail -n 1000 /var/log/myapp/error.log > /tmp/incident_$timestamp/errors.txt
## Check for blocked processes
lsof -p $(pgrep myapp) > /tmp/incident_$timestamp/open_files.txt
Step 2 - Analyze Without Touching:
## Are we CPU-bound?
grep "myapp" /tmp/incident_$timestamp/cpu.txt | awk '{print $3}'
## Memory leak?
free -m
## Database connections piling up?
grep -c "ESTABLISHED.*3306" /tmp/incident_$timestamp/network.txt
## Errors spiking?
grep -E "ERROR|FATAL" /tmp/incident_$timestamp/errors.txt | tail -20
Discovery: 500 ESTABLISHED connections to database (normal is 20).
Step 3 - Reversible Fix:
## Instead of restarting, add connection pooling (code example)
from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

## Current broken code (no pooling)
## engine = create_engine('mysql://...')

## Reversible: Add pool limits via config
engine = create_engine(
    'mysql://user:pass@localhost/db',
    poolclass=QueuePool,
    pool_size=20,        # Can adjust this
    max_overflow=10,     # Can adjust this
    pool_pre_ping=True   # Non-destructive check
)

## Deploy via feature flag
if config.get('use_connection_pooling', False):
    db = PooledDatabase(engine)
else:
    db = LegacyDatabase()  # Fallback
Result: Enabled pooling via config flag. Connections dropped to 20. Response time recovered. If it hadn't worked, flip flag back.
Example 2: Mysterious Data Corruption
Scenario: Users report seeing wrong prices in checkout. Database contains incorrect values.
Impulsive Fix (Don't Do This):
-- DON'T: Immediate "fix" without investigation
UPDATE products SET price = ABS(price) WHERE price < 0;
COMMIT;
-- Now you've lost evidence of which records were corrupt!
Evidence-Preserving Approach:
-- Step 1: Snapshot corrupt data (copies rows; the source table is untouched)
CREATE TABLE corrupt_prices_20240115 AS
SELECT product_id, price, updated_at, updated_by
FROM products
WHERE price < 0 OR price > 1000000; -- Suspicious values
-- Step 2: Analyze patterns
SELECT updated_by, COUNT(*), MIN(updated_at), MAX(updated_at)
FROM corrupt_prices_20240115
GROUP BY updated_by;
-- Discovery: All bad prices updated by 'batch_import_job'
-- between 02:00-02:15 this morning
-- Step 3: Identify correct values (don't modify yet)
-- DISTINCT ON picks the latest pre-corruption price per product (PostgreSQL)
CREATE TABLE price_corrections AS
SELECT DISTINCT ON (cp.product_id)
    cp.product_id,
    cp.price AS corrupt_price,
    ph.price AS correct_price -- From price history table
FROM corrupt_prices_20240115 cp
JOIN price_history ph ON cp.product_id = ph.product_id
WHERE ph.updated_at < '2024-01-15 02:00:00'
ORDER BY cp.product_id, ph.updated_at DESC;
-- Step 4: Reversible fix (transaction with verification)
BEGIN TRANSACTION;
UPDATE products p
SET price = pc.correct_price,
updated_at = NOW(),
updated_by = 'incident_correction_INC789'
FROM price_corrections pc
WHERE p.product_id = pc.product_id;
-- Verify
SELECT COUNT(*) FROM products WHERE price < 0; -- Should be 0
-- If good:
COMMIT;
-- If suspicious:
-- ROLLBACK;
Key Principle: Create shadow tables for analysis. Never modify source data until you understand the problem.
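The shadow-table reflex is easy to wrap in a helper so it becomes the default first step. A Python sketch using sqlite3 purely for illustration (the incident above involves a server database, and the table names here are placeholders):
import sqlite3
import time

def snapshot_suspect_rows(conn, source_table, where_clause):
    """Copy suspicious rows into a timestamped shadow table before touching the source."""
    shadow = f"{source_table}_suspect_{time.strftime('%Y%m%d_%H%M%S')}"
    # Note: table names cannot be parameterized; only pass trusted identifiers here.
    conn.execute(f"CREATE TABLE {shadow} AS SELECT * FROM {source_table} WHERE {where_clause}")
    count = conn.execute(f"SELECT COUNT(*) FROM {shadow}").fetchone()[0]
    print(f"Preserved {count} rows in {shadow}")
    return shadow

# Example with placeholder schema:
# conn = sqlite3.connect("shop.db")
# snapshot_suspect_rows(conn, "products", "price < 0 OR price > 1000000")
# conn.commit()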
Example 3: Container Crash Loop
Scenario: Kubernetes pod keeps restarting. kubectl get pods shows CrashLoopBackOff.
Destructive Debugging (Don't Do This):
kubectl delete pod myapp-7d8f9c-xkq2p # Loses crash state!
kubectl rollout restart deployment/myapp # Nuclear option
Stateful Investigation:
## Step 1: Capture logs BEFORE pod cycles
kubectl logs myapp-7d8f9c-xkq2p > /tmp/crash_logs.txt
kubectl logs myapp-7d8f9c-xkq2p --previous > /tmp/previous_crash.txt
## Step 2: Describe pod (non-destructive)
kubectl describe pod myapp-7d8f9c-xkq2p > /tmp/pod_describe.txt
## Check events
kubectl get events --sort-by='.lastTimestamp' | grep myapp
## Step 3: Check config without restarting
kubectl get configmap myapp-config -o yaml
kubectl get secret myapp-secrets -o yaml
## Discovery from logs: "Cannot connect to database: connection refused"
## Step 4: Test connectivity from a debug pod (non-destructive)
kubectl run debug-pod --rm -it --image=busybox -- /bin/sh
## Inside debug pod:
telnet postgres-service 5432
ping postgres-service
## Exit (pod auto-deletes due to --rm)
## Discovery: postgres-service resolves but port closed
## Step 5: Reversible fix (scale up database, don't touch app yet)
kubectl scale deployment postgres --replicas=2
## Wait and observe (read-only check)
kubectl get pods -l app=postgres -w
## App pods self-recover because we fixed root cause
Principle: Debug pods and --dry-run are your friends for safe exploration.
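The connectivity test run from the debug pod can also be reproduced from any host that has Python, again without touching the failing application. A small sketch; the service name and port simply mirror this example:
import socket

def port_reachable(host, port, timeout=3):
    """Attempt a TCP connection; purely observational, nothing on the server changes."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(port_reachable("postgres-service", 5432))  # expected False in this incident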
Example 4: Production Deployment Gone Wrong
Scenario: You deployed v2.0.0. Error rate spiked from 0.1% to 15%.
Forward-Only Thinking (Don't Do This):
## Keep deploying fixes...
git commit -m "Quick fix for issue A"
docker build -t myapp:v2.0.1 .
kubectl set image deployment/myapp myapp=myapp:v2.0.1
## Error rate now 20%...
git commit -m "Fix for fix..."
## Downward spiral
Immediate Rollback (Designed In):
## deployment.yaml - designed for reversibility
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  annotations:
    deployment.kubernetes.io/revision: "42"
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1    # Only 1 pod down at a time
      maxSurge: 1          # Only 1 extra pod during update
  revisionHistoryLimit: 10 # Keep rollback history
  template:
    metadata:
      labels:
        app: myapp
        version: "2.0.0"
    spec:
      containers:
        - name: myapp
          image: myapp:v2.0.0
          livenessProbe:   # Restarts unhealthy containers (not a rollback by itself)
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
Instant rollback:
## Check what versions are available
kubectl rollout history deployment/myapp
## Rollback to previous (v1.9.5)
kubectl rollout undo deployment/myapp
## Or specific revision
kubectl rollout undo deployment/myapp --to-revision=41
## Monitor rollback
kubectl rollout status deployment/myapp
## Verify error rate dropped (read-only)
curl https://metrics.example.com/api/error_rate
After rollback (safe state restored):
## NOW investigate what went wrong
kubectl logs -l app=myapp,version=2.0.0 --tail=1000 > v2_failure_logs.txt
## Compare configs
kubectl diff -f deployment_v2.yaml
## Fix in development, not production
git checkout -b fix-v2-issues
## Make fixes, test locally, THEN redeploy
Common Mistakes
Mistake 1: "Let's Just Restart It"
The Problem: Restarting services is often the first instinct, but it destroys evidence.
## Bad: Automatic restart on any error
def run_service():
try:
service.start()
except Exception as e:
logger.error(f"Error: {e}")
service.restart() # Loses stack trace context!
Better Approach:
import traceback
import time

def run_service_with_diagnostics():
    try:
        service.start()
    except Exception as e:
        # PRESERVE state before any restart
        incident_id = int(time.time())
        # Dump everything
        with open(f'/tmp/crash_{incident_id}.log', 'w') as f:
            f.write(f"Exception: {str(e)}\n")
            f.write(f"Traceback:\n{traceback.format_exc()}\n")
            f.write(f"Service state: {service.get_state()}\n")
            f.write(f"Active connections: {service.connection_count()}\n")
        # NOW consider restart (with backoff)
        if service.restart_count < 3:
            time.sleep(2 ** service.restart_count)  # Exponential backoff
            service.restart()
        else:
            # Too many crashes, keep it down for investigation
            logger.critical(f"Service crashed {service.restart_count} times. Manual intervention required.")
            service.shutdown()
Mistake 2: Skipping Transactions (Auto-Commit Changes)
The Problem: Running UPDATE/DELETE without transactions means no undo button.
-- DANGER: Auto-commit mode
UPDATE users SET account_balance = account_balance * 0.9
WHERE signup_date < '2020-01-01';
-- Oops, meant to filter by inactive users too! No way back.
Fix:
-- Always wrap in transaction
BEGIN TRANSACTION;
-- Dry run: SELECT first
SELECT user_id, account_balance,
account_balance * 0.9 AS new_balance
FROM users
WHERE signup_date < '2020-01-01'
LIMIT 10; -- Sample check
-- Looks good? Run the update
UPDATE users
SET account_balance = account_balance * 0.9
WHERE signup_date < '2020-01-01';
-- Verify
SELECT COUNT(*), AVG(account_balance)
FROM users
WHERE signup_date < '2020-01-01';
-- If wrong:
ROLLBACK;
-- If right:
COMMIT;
Mistake 3: Overwriting Logs
The Problem: Log rotation or truncation during an incident.
## BAD: Clears the file
echo "Starting debug" > /var/log/app.log
## BAD: Rotates out current logs
logrotate -f /etc/logrotate.conf
Fix:
## GOOD: Append, never overwrite
echo "[$(date)] Starting investigation" >> /var/log/app.log
## GOOD: Copy before any operations
cp /var/log/app.log /var/log/app.log.incident_$(date +%s)
## GOOD: Use tee to preserve and display
tail -f /var/log/app.log | tee /tmp/incident_capture.log
Mistake 4: No Rollback Plan
The Problem: Making changes without knowing how to undo them.
## Dangerous: No way back
ssh production "sed -i 's/old_value/new_value/g' /etc/app/config.ini"
Fix:
## 1. Backup first
ssh production "cp /etc/app/config.ini /etc/app/config.ini.backup_$(date +%s)"
## 2. Make change
ssh production "sed -i 's/old_value/new_value/g' /etc/app/config.ini"
## 3. Test
ssh production "app --validate-config"
## 4. If bad, rollback ready:
## ssh production "cp /etc/app/config.ini.backup_TIMESTAMP /etc/app/config.ini"
## 5. Document in runbook
cat >> incident_log.md << EOF
### Rollback Procedure
If config change causes issues:
\`\`\`bash
ssh production "cp /etc/app/config.ini.backup_TIMESTAMP /etc/app/config.ini"
ssh production "systemctl restart app"
\`\`\`
EOF
Key Takeaways
Quick Reference Card: Reversible Actions First
| Principle | Implementation |
|---|---|
| Observe First | Read logs, metrics, state BEFORE changing anything |
| Preserve Evidence | Copy logs/configs to timestamped backups |
| Use Transactions | Wrap DB changes in BEGIN...COMMIT/ROLLBACK |
| Document Rollback | Write undo steps BEFORE executing risky changes |
| Progressive Escalation | Start with read-only, escalate only when necessary |
| Feature Flags | Toggle behavior via config, not deployments |
| Test Rollback | Verify undo procedure works before execution |
Golden Rules:
- If you can't undo it, don't do it (unless no other option)
- Restarts are last resort, not first response
- Snapshots are cheap, lost evidence is expensive
- When in doubt, wait and gather more information
The Reversibility Checklist
Before executing any debugging action, ask:
- Can I gather this information without modifying state?
- Have I captured current state (logs, configs, metrics)?
- Do I have a tested rollback procedure?
- Will this action destroy evidence I might need later?
- Is there a safer alternative that gives me the same information?
- Have I documented what I'm about to do and why?
- Can I test this in a non-production environment first?
Mnemonic: R.E.V.E.R.S.E
- Read-only investigation first
- Evidence preservation (backups, snapshots)
- Verify before committing changes
- Escalate progressively (low-risk to high-risk)
- Rollback plan documented
- Snapshot system state
- Expect to undo (design for reversibility)
Further Study
- Google SRE Book - Incident Response: https://sre.google/sre-book/effective-troubleshooting/ (Chapter on systematic troubleshooting and safe intervention)
- Database Reliability Engineering: https://www.oreilly.com/library/view/database-reliability-engineering/9781491925935/ (Reversible database operations and transaction management)
- Kubernetes Documentation - Debug Pods: https://kubernetes.io/docs/tasks/debug/debug-application/debug-running-pod/ (Non-destructive debugging techniques for containers)
Remember: Acting impulsively to fix a production incident quickly is often the fastest way to make it worse. The most reliable path to resolution is methodical, evidence-preserving investigation. When systems are on fire, your calm, reversible approach is what saves the day.
Master these patterns, and you'll be the engineer everyone wants on-call.