Reversible Actions First
Choosing interventions you can safely undo
Master debugging under pressure with free flashcards and spaced repetition practice. This lesson covers non-destructive investigation techniques, read-only commands, and safe rollback strategies: essential concepts for triaging production incidents without making them worse.
Welcome
When systems fail in production, every second counts. But here's the paradox: the urgency to fix things quickly can lead to hasty actions that compound the problem. You've seen it happen: someone restarts a service to "fix" an issue, wiping out crucial logs that would have revealed the root cause. Or a well-intentioned configuration change cascades into a broader outage.
Reversible actions first is a fundamental principle in incident response: prioritize investigation and remediation steps that can be undone or that don't permanently alter system state. Think of it like a surgeon examining a patient before making any incisions: gather information safely before making irreversible changes.
This lesson will equip you with:
- Techniques for gathering diagnostic information without side effects
- Strategies for making changes that can be cleanly rolled back
- Recognition of dangerous "point of no return" actions
- Practical command patterns for safe investigation
Core Concepts
The Reversibility Spectrum
Not all debugging actions carry the same risk. Understanding this spectrum helps you prioritize effectively:
| Risk Level | Action Type | Examples | Reversibility |
|---|---|---|---|
| Green: Safe | Read-only observation | View logs, check metrics, list processes | Fully reversible (no changes made) |
| Yellow: Low Risk | Non-destructive changes | Enable debug logging, create snapshots | Easily reversible (state preserved) |
| Orange: Medium Risk | Reversible modifications | Config changes, feature flags, rolling restarts | Reversible with effort (rollback possible) |
| Red: High Risk | Destructive operations | Delete data, force kill processes, drop tables | Irreversible or very difficult to undo |
Golden Rule: Work from green to red. Exhaust safer options before escalating to riskier ones.
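One way to make this spectrum actionable is to encode risk levels in your incident tooling so that riskier steps require an explicit escalation decision. A minimal Python sketch, assuming hypothetical RiskLevel and Action types (these names are illustrative, not from any standard library):
from enum import IntEnum
from dataclasses import dataclass

class RiskLevel(IntEnum):
    SAFE = 1     # read-only observation
    LOW = 2      # non-destructive changes
    MEDIUM = 3   # reversible modifications
    HIGH = 4     # destructive operations

@dataclass
class Action:
    description: str
    risk: RiskLevel

def allowed(action: Action, max_authorized: RiskLevel) -> bool:
    """Only allow actions at or below the currently authorized risk level."""
    return action.risk <= max_authorized

# Log inspection is always fine; a forced restart needs explicit escalation
print(allowed(Action("tail error logs", RiskLevel.SAFE), RiskLevel.SAFE))    # True
print(allowed(Action("force-kill worker", RiskLevel.HIGH), RiskLevel.SAFE))  # False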
The Read-Only Investigation Phase
Before you change anything, understand what's happening. This phase focuses on gathering evidence without altering system state.
Essential Read-Only Commands
System Status Observation:
## Check process status
ps aux | grep <service_name>
top -b -n 1
htop # Interactive, but read-only
## Network connections
netstat -tulpn
ss -tulpn
lsof -i :<port>
## Disk usage (no modifications)
df -h
du -sh /path/to/directory
## Memory information
free -m
vmstat 1 5
Application-Level Diagnostics:
## View current application state
## (read_config_file and check_db_connection are app-specific read-only helpers;
##  minimal stand-ins are sketched after the Node.js snippet below)
import sys
import os

def safe_diagnostic_dump():
    """Gather info without side effects"""
    info = {
        "pid": os.getpid(),
        "python_version": sys.version,
        "module_paths": sys.path,
        # Read config, don't modify
        "config": read_config_file(),
        # Check connections, don't restart
        "db_status": check_db_connection()
    }
    return info
// Node.js: Check state without mutations
const diagnostics = {
  memory: process.memoryUsage(),
  uptime: process.uptime(),
  version: process.version,
  activeHandles: process._getActiveHandles().length,
  // View, don't modify
  env: process.env.NODE_ENV
};
console.log(JSON.stringify(diagnostics, null, 2));
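The Python snippet above assumes two app-specific helpers. Minimal read-only stand-ins might look like the following; the file paths and the sqlite3 usage are illustrative assumptions, not part of any particular stack:
import json
import sqlite3

def read_config_file(path="/etc/myapp/config.json"):
    """Read the config without modifying it (the path is an assumed example)."""
    with open(path, "r") as f:
        return json.load(f)

def check_db_connection(db_path="/var/lib/myapp/app.db"):
    """Issue a trivial read-only query; never restart services or reset pools."""
    try:
        conn = sqlite3.connect(db_path)
        conn.execute("SELECT 1")
        conn.close()
        return "ok"
    except sqlite3.Error as e:
        return f"error: {e}"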
Pro Tip: Redirect diagnostic output to files rather than stdout in production. This preserves evidence without impacting the running system:
## Good: Save to file for later analysis
ps aux > /tmp/process_snapshot_$(date +%s).txt
## Better: Timestamped and organized
mkdir -p /var/log/incident_$(date +%Y%m%d_%H%M%S)
ps aux > /var/log/incident_$(date +%Y%m%d_%H%M%S)/processes.txt
netstat -tulpn > /var/log/incident_$(date +%Y%m%d_%H%M%S)/network.txt
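If you capture these snapshots often, it is worth scripting the routine. A small Python sketch (the command list and output location are assumptions you would adapt) that writes read-only command output into a timestamped incident directory:
import subprocess
import time
from pathlib import Path

# Read-only commands to capture; adjust to your environment
SNAPSHOT_COMMANDS = {
    "processes.txt": ["ps", "aux"],
    "network.txt": ["ss", "-tulpn"],
    "disk.txt": ["df", "-h"],
    "memory.txt": ["free", "-m"],
}

def capture_incident_snapshot(base_dir="/var/log"):
    """Run read-only commands and save their output; makes no changes to services."""
    out_dir = Path(base_dir) / f"incident_{time.strftime('%Y%m%d_%H%M%S')}"
    out_dir.mkdir(parents=True, exist_ok=True)
    for filename, cmd in SNAPSHOT_COMMANDS.items():
        result = subprocess.run(cmd, capture_output=True, text=True)
        (out_dir / filename).write_text(result.stdout)
    return out_dir

if __name__ == "__main__":
    print(f"Snapshot written to {capture_incident_snapshot()}")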
Reversible Changes: Making Safe Modifications
When you must make changes, structure them so they can be undone. This requires planning before execution.
Configuration Management
Always back up before modifying:
## Backup configuration files
cp /etc/nginx/nginx.conf /etc/nginx/nginx.conf.backup_$(date +%s)
## For complex configs, use version control
cd /etc/myapp
git add .
git commit -m "Snapshot before emergency change - ticket INC-1234"
## Now make your changes
vim config.yaml
## If it breaks, easy rollback:
git checkout config.yaml
Feature flags for runtime behavior:
## Toggle behavior without deployment
class FeatureFlags:
def __init__(self):
# Read from external config that can be updated live
self.flags = self.load_flags()
def is_enabled(self, feature_name):
return self.flags.get(feature_name, False)
## In your code
if feature_flags.is_enabled("new_algorithm"):
result = new_fast_algorithm(data)
else:
result = stable_legacy_algorithm(data) # Fallback
## If new_algorithm causes issues, flip the flag remotely
## No deployment needed, instant rollback
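The load_flags() call above is deliberately abstract. One common approach, sketched here under the assumption of a small JSON flag file at an example path, is to re-read the flags on each check so an operator can flip a value without redeploying; FeatureFlags.load_flags could simply delegate to this loader:
import json
from pathlib import Path

FLAG_FILE = Path("/etc/myapp/flags.json")  # assumed location, e.g. {"new_algorithm": false}

def load_flags():
    """Read current flag values from disk; editing the file is the 'remote flip'."""
    try:
        return json.loads(FLAG_FILE.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return {}  # fail safe: unknown flags default to disabled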
// Go: Circuit breaker pattern for reversible routing
type CircuitBreaker struct {
    failureThreshold int
    failures         int
    state            string // "closed", "open", "half-open"
}

func (cb *CircuitBreaker) Call(fn func() error) error {
    if cb.state == "open" {
        // Fail fast, don't make problem worse
        return errors.New("circuit breaker open")
    }
    err := fn()
    if err != nil {
        cb.failures++
        if cb.failures >= cb.failureThreshold {
            cb.state = "open" // Reversible: can be reset
        }
        return err
    }
    // Success: reset
    cb.failures = 0
    cb.state = "closed"
    return nil
}
Database Operations: The Transactional Approach
Always use transactions for exploratory queries:
-- Start transaction (changes not committed)
BEGIN TRANSACTION;
-- Your diagnostic/fix query
UPDATE users SET status = 'active'
WHERE last_login > NOW() - INTERVAL '30 days'
AND status = 'pending';
-- Check affected rows
SELECT COUNT(*) FROM users WHERE status = 'active';
-- If wrong, rollback
ROLLBACK;
-- If correct, commit
-- COMMIT;
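The same discipline carries over to application code that applies fixes programmatically. A sketch using Python's built-in sqlite3 module (table and column names are invented for illustration): the change commits only if a verification query passes, otherwise everything rolls back:
import sqlite3

def apply_with_verification(db_path, fix_sql, verify_sql):
    """Run a fix inside a transaction; roll back unless the verification count is zero."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # opens a transaction; commits on success, rolls back on exception
            conn.execute(fix_sql)
            remaining = conn.execute(verify_sql).fetchone()[0]
            if remaining != 0:
                raise RuntimeError(f"Verification failed: {remaining} rows still bad")
    finally:
        conn.close()

# Example usage with invented table/column names:
# apply_with_verification(
#     "app.db",
#     "UPDATE users SET status = 'active' WHERE status = 'pending'",
#     "SELECT COUNT(*) FROM users WHERE status = 'pending'",
# )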
Soft deletes instead of hard deletes:
## Bad: Irreversible
def delete_old_records(cutoff_date):
    db.execute("DELETE FROM logs WHERE created_at < ?", cutoff_date)

## Good: Reversible
def soft_delete_old_records(cutoff_date):
    db.execute(
        "UPDATE logs SET deleted_at = NOW() "
        "WHERE created_at < ? AND deleted_at IS NULL",
        cutoff_date
    )
    # Can undo: UPDATE logs SET deleted_at = NULL WHERE ...
Recognizing Point-of-No-Return Actions
Some actions cannot be undone. Learn to recognize them and establish safeguards.
Destructive Operations Checklist
STOP Before Running These
- rm -rf on production data
- DROP TABLE / TRUNCATE
- kill -9 (force kill without cleanup)
- Overwriting log files or rotating them out
- Deploying new code without a tested rollback plan
- Scaling down infrastructure during active traffic
- Revoking credentials that might be in use
Safe alternatives:
## Instead of: rm -rf /var/log/app/*
## Do: Archive then verify
tar -czf /backup/logs_$(date +%s).tar.gz /var/log/app/
ls -la /backup/logs_*.tar.gz # Verify backup exists
## Then consider if deletion is really necessary
## Instead of: kill -9 <pid>
## Do: Graceful shutdown
kill -TERM <pid> # SIGTERM allows cleanup
sleep 5
## Check if stopped
ps -p <pid> || echo "Process stopped gracefully"
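A small guard rail in front of truly destructive operations helps too: refuse to proceed until a backup exists and has been sanity-checked. A hedged Python sketch (the /backup directory convention is an assumption):
import os
import subprocess
import time

def archive_before_delete(target_dir, backup_dir="/backup"):
    """Archive a directory and verify the archive exists before any deletion is considered."""
    os.makedirs(backup_dir, exist_ok=True)
    archive = os.path.join(backup_dir, f"logs_{int(time.time())}.tar.gz")
    subprocess.run(["tar", "-czf", archive, target_dir], check=True)
    if os.path.getsize(archive) == 0:
        raise RuntimeError(f"Backup {archive} is empty; refusing to continue")
    print(f"Backup verified at {archive}; deletion can now be evaluated separately")
    return archive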
The "Undo Plan" Requirement
Before any medium or high-risk action, document the reversal steps:
## incident-response-runbook.yml
action: "Deploy hotfix v2.3.1 to production"
risk_level: "medium"
rollback_plan:
  - step: "Identify current version"
    command: "kubectl get deployment myapp -o yaml | grep image:"
  - step: "Revert deployment"
    command: "kubectl set image deployment/myapp myapp=myapp:v2.3.0"
  - step: "Verify rollback"
    command: "kubectl rollout status deployment/myapp"
  - step: "Check health endpoint"
    command: "curl https://api.example.com/health"
max_execution_time: "30 seconds"
if_rollback_fails: "Contact @oncall-lead, consider full service restart"
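A pre-flight check can enforce this requirement mechanically: refuse to start a risky action unless the runbook entry actually contains rollback steps with concrete commands. A sketch assuming the YAML shape above and that PyYAML is installed:
import sys
import yaml  # assumes PyYAML is available

REQUIRED_FIELDS = {"action", "risk_level", "rollback_plan"}

def validate_runbook(path):
    """Return the parsed runbook only if every rollback step has a concrete command."""
    with open(path) as f:
        runbook = yaml.safe_load(f)
    missing = REQUIRED_FIELDS - set(runbook or {})
    if missing:
        raise ValueError(f"Runbook missing fields: {sorted(missing)}")
    for step in runbook["rollback_plan"]:
        if not step.get("command"):
            raise ValueError(f"Rollback step lacks a command: {step.get('step')}")
    return runbook

if __name__ == "__main__":
    validate_runbook(sys.argv[1])
    print("Rollback plan present; proceed with the documented change")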
Progressive Intervention Strategy
A framework for escalating from safe to riskier actions:
PROGRESSIVE INTERVENTION LADDER

Level 1 (Green): OBSERVE (Read-Only)
- View logs and metrics
- Check process status
- Query database (SELECT only)
- Network traffic analysis
If that doesn't yield enough info, move to...

Level 2 (Yellow): ENHANCE VISIBILITY (Non-Destructive)
- Enable debug logging
- Attach profiler/debugger
- Create snapshots/checkpoints
- Add temporary instrumentation
If the problem persists, move to...

Level 3 (Orange): REVERSIBLE CHANGES
- Toggle feature flags
- Adjust configuration (with backup)
- Rolling restart (one instance)
- Route traffic to healthy instances
If still not resolved, move to...

Level 4 (Red): CONTROLLED RISK (Documented Rollback)
- Deploy hotfix (tested rollback ready)
- Scale resources significantly
- Failover to backup systems
- Emergency configuration override
Last resort only...

Level 5 (Black): HIGH RISK (Irreversible)
- Requires approval from the incident commander
- Full service restart/outage
- Data deletion or schema changes
- Infrastructure teardown

Move up the ladder ONLY when lower levels fail to resolve the issue.
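If your team tracks incident actions in tooling, the ladder itself can be encoded so every escalation is an explicit, logged decision. A minimal Python sketch; the level names mirror the ladder above and nothing here is a standard library:
INTERVENTION_LADDER = [
    ("observe", "read-only: logs, metrics, SELECT queries"),
    ("enhance visibility", "debug logging, profilers, snapshots"),
    ("reversible changes", "feature flags, config with backup, rolling restart"),
    ("controlled risk", "hotfix with tested rollback, failover"),
    ("high risk", "irreversible actions; incident commander approval required"),
]

class Escalation:
    """Tracks the current level and forces one-step, justified escalation."""
    def __init__(self):
        self.level = 0

    def escalate(self, justification: str):
        if self.level + 1 >= len(INTERVENTION_LADDER):
            raise RuntimeError("Already at the highest level")
        self.level += 1
        name, examples = INTERVENTION_LADDER[self.level]
        print(f"Escalating to '{name}' ({examples}) because: {justification}")

# Usage: start at observe, escalate only with a reason
esc = Escalation()
esc.escalate("read-only checks did not reveal the root cause")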
Practical Examples
Example 1: Web Service Responding Slowly
Scenario: Users report your API is taking 10+ seconds to respond. You have SSH access to the production server.
Panic Response (Don't Do This):
sudo systemctl restart myapp # Wipes current state!
## Lost: in-memory data, active connections, stack traces
Reversible Actions Approach:
Step 1 - Observe (Read-Only):
## Capture current state FIRST
timestamp=$(date +%Y%m%d_%H%M%S)
mkdir -p /tmp/incident_$timestamp
## What's running?
ps aux > /tmp/incident_$timestamp/processes.txt
top -b -n 1 > /tmp/incident_$timestamp/cpu.txt
## Network connections
netstat -tulpn > /tmp/incident_$timestamp/network.txt
## Application logs (last 1000 lines, no truncation)
tail -n 1000 /var/log/myapp/error.log > /tmp/incident_$timestamp/errors.txt
## Check for blocked processes
lsof -p $(pgrep myapp) > /tmp/incident_$timestamp/open_files.txt
Step 2 - Analyze Without Touching:
## Are we CPU-bound?
grep "myapp" /tmp/incident_$timestamp/cpu.txt | awk '{print $3}'
## Memory leak?
free -m
## Database connections piling up?
grep -c "ESTABLISHED.*3306" /tmp/incident_$timestamp/network.txt
## Errors spiking?
grep -E "ERROR|FATAL" /tmp/incident_$timestamp/errors.txt | tail -20
Discovery: 500 ESTABLISHED connections to database (normal is 20).
Step 3 - Reversible Fix:
## Instead of restarting, add connection pooling (code example)
from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

## Current broken code (no pooling)
## engine = create_engine('mysql://...')

## Reversible: Add pool limits via config
engine = create_engine(
    'mysql://user:pass@localhost/db',
    poolclass=QueuePool,
    pool_size=20,        # Can adjust this
    max_overflow=10,     # Can adjust this
    pool_pre_ping=True   # Non-destructive check
)

## Deploy via feature flag
if config.get('use_connection_pooling', False):
    db = PooledDatabase(engine)
else:
    db = LegacyDatabase()  # Fallback
Result: Enabled pooling via config flag. Connections dropped to 20. Response time recovered. If it hadn't worked, flip flag back.
Example 2: Mysterious Data Corruption
Scenario: Users report seeing wrong prices in checkout. Database contains incorrect values.
Impulsive Fix (Don't Do This):
-- DON'T: Immediate "fix" without investigation
UPDATE products SET price = ABS(price) WHERE price < 0;
COMMIT;
-- Now you've lost evidence of which records were corrupt!
Evidence-Preserving Approach:
-- Step 1: Snapshot corrupt data (copies rows; the source table is untouched)
CREATE TABLE corrupt_prices_20240115 AS
SELECT product_id, price, updated_at, updated_by
FROM products
WHERE price < 0 OR price > 1000000; -- Suspicious values
-- Step 2: Analyze patterns
SELECT updated_by, COUNT(*), MIN(updated_at), MAX(updated_at)
FROM corrupt_prices_20240115
GROUP BY updated_by;
-- Discovery: All bad prices updated by 'batch_import_job'
-- between 02:00-02:15 this morning
-- Step 3: Identify correct values (don't modify yet)
-- DISTINCT ON picks the latest pre-corruption price per product (PostgreSQL)
CREATE TABLE price_corrections AS
SELECT DISTINCT ON (cp.product_id)
    cp.product_id,
    cp.price AS corrupt_price,
    ph.price AS correct_price -- From price history table
FROM corrupt_prices_20240115 cp
JOIN price_history ph ON cp.product_id = ph.product_id
WHERE ph.updated_at < '2024-01-15 02:00:00'
ORDER BY cp.product_id, ph.updated_at DESC;
-- Step 4: Reversible fix (transaction with verification)
BEGIN TRANSACTION;
UPDATE products p
SET price = pc.correct_price,
updated_at = NOW(),
updated_by = 'incident_correction_INC789'
FROM price_corrections pc
WHERE p.product_id = pc.product_id;
-- Verify
SELECT COUNT(*) FROM products WHERE price < 0; -- Should be 0
-- If good:
COMMIT;
-- If suspicious:
-- ROLLBACK;
Key Principle: Create shadow tables for analysis. Never modify source data until you understand the problem.
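The shadow-table reflex is easy to wrap in a helper so it becomes the default first step. A Python sketch using sqlite3 purely for illustration (the incident above involves a server database, and the table names here are placeholders):
import sqlite3
import time

def snapshot_suspect_rows(conn, source_table, where_clause):
    """Copy suspicious rows into a timestamped shadow table before touching the source."""
    shadow = f"{source_table}_suspect_{time.strftime('%Y%m%d_%H%M%S')}"
    # Note: table names cannot be parameterized; only pass trusted identifiers here.
    conn.execute(f"CREATE TABLE {shadow} AS SELECT * FROM {source_table} WHERE {where_clause}")
    count = conn.execute(f"SELECT COUNT(*) FROM {shadow}").fetchone()[0]
    print(f"Preserved {count} rows in {shadow}")
    return shadow

# Example with placeholder schema:
# conn = sqlite3.connect("shop.db")
# snapshot_suspect_rows(conn, "products", "price < 0 OR price > 1000000")
# conn.commit()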
Example 3: Container Crash Loop
Scenario: Kubernetes pod keeps restarting. kubectl get pods shows CrashLoopBackOff.
Destructive Debugging (Don't Do This):
kubectl delete pod myapp-7d8f9c-xkq2p # Loses crash state!
kubectl rollout restart deployment/myapp # Nuclear option
Stateful Investigation:
## Step 1: Capture logs BEFORE pod cycles
kubectl logs myapp-7d8f9c-xkq2p > /tmp/crash_logs.txt
kubectl logs myapp-7d8f9c-xkq2p --previous > /tmp/previous_crash.txt
## Step 2: Describe pod (non-destructive)
kubectl describe pod myapp-7d8f9c-xkq2p > /tmp/pod_describe.txt
## Check events
kubectl get events --sort-by='.lastTimestamp' | grep myapp
## Step 3: Check config without restarting
kubectl get configmap myapp-config -o yaml
kubectl get secret myapp-secrets -o yaml
## Discovery from logs: "Cannot connect to database: connection refused"
## Step 4: Test connectivity from a debug pod (non-destructive)
kubectl run debug-pod --rm -it --image=busybox -- /bin/sh
## Inside debug pod:
telnet postgres-service 5432
ping postgres-service
## Exit (pod auto-deletes due to --rm)
## Discovery: postgres-service resolves but port closed
## Step 5: Reversible fix (scale up database, don't touch app yet)
kubectl scale deployment postgres --replicas=2
## Wait and observe (read-only check)
kubectl get pods -l app=postgres -w
## App pods self-recover because we fixed root cause
Principle: Debug pods and --dry-run are your friends for safe exploration.
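The connectivity test run from the debug pod can also be reproduced from any host that has Python, again without touching the failing application. A small sketch; the service name and port simply mirror this example:
import socket

def port_reachable(host, port, timeout=3):
    """Attempt a TCP connection; purely observational, nothing on the server changes."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(port_reachable("postgres-service", 5432))  # expected False in this incident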
Example 4: Production Deployment Gone Wrong
Scenario: You deployed v2.0.0. Error rate spiked from 0.1% to 15%.
Forward-Only Thinking (Don't Do This):
## Keep deploying fixes...
git commit -m "Quick fix for issue A"
docker build -t myapp:v2.0.1 .
kubectl set image deployment/myapp myapp=myapp:v2.0.1
## Error rate now 20%...
git commit -m "Fix for fix..."
## Downward spiral
Immediate Rollback (Designed In):
## deployment.yaml - designed for reversibility
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  annotations:
    deployment.kubernetes.io/revision: "42"
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1    # Only 1 pod down at a time
      maxSurge: 1          # Only 1 extra pod during update
  revisionHistoryLimit: 10 # Keep rollback history
  template:
    metadata:
      labels:
        app: myapp
        version: "2.0.0"
    spec:
      containers:
        - name: myapp
          image: myapp:v2.0.0
          livenessProbe:   # Restarts unhealthy containers (not a rollback by itself)
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
Instant rollback:
## Check what versions are available
kubectl rollout history deployment/myapp
## Rollback to previous (v1.9.5)
kubectl rollout undo deployment/myapp
## Or specific revision
kubectl rollout undo deployment/myapp --to-revision=41
## Monitor rollback
kubectl rollout status deployment/myapp
## Verify error rate dropped (read-only)
curl https://metrics.example.com/api/error_rate
After rollback (safe state restored):
## NOW investigate what went wrong
kubectl logs -l app=myapp,version=2.0.0 --tail=1000 > v2_failure_logs.txt
## Compare configs
kubectl diff -f deployment_v2.yaml
## Fix in development, not production
git checkout -b fix-v2-issues
## Make fixes, test locally, THEN redeploy
Common Mistakes
Mistake 1: "Let's Just Restart It"
The Problem: Restarting services is often the first instinct, but it destroys evidence.
## Bad: Automatic restart on any error
def run_service():
try:
service.start()
except Exception as e:
logger.error(f"Error: {e}")
service.restart() # Loses stack trace context!
Better Approach:
import traceback
import time

def run_service_with_diagnostics():
    try:
        service.start()
    except Exception as e:
        # PRESERVE state before any restart
        incident_id = int(time.time())
        # Dump everything
        with open(f'/tmp/crash_{incident_id}.log', 'w') as f:
            f.write(f"Exception: {str(e)}\n")
            f.write(f"Traceback:\n{traceback.format_exc()}\n")
            f.write(f"Service state: {service.get_state()}\n")
            f.write(f"Active connections: {service.connection_count()}\n")
        # NOW consider restart (with backoff)
        if service.restart_count < 3:
            time.sleep(2 ** service.restart_count)  # Exponential backoff
            service.restart()
        else:
            # Too many crashes, keep it down for investigation
            logger.critical(f"Service crashed {service.restart_count} times. Manual intervention required.")
            service.shutdown()
Mistake 2: Skipping Transactions (Auto-Commit Changes)
The Problem: Running UPDATE/DELETE without transactions means no undo button.
-- DANGER: Auto-commit mode
UPDATE users SET account_balance = account_balance * 0.9
WHERE signup_date < '2020-01-01';
-- Oops, meant to filter by inactive users too! No way back.
Fix:
-- Always wrap in transaction
BEGIN TRANSACTION;
-- Dry run: SELECT first
SELECT user_id, account_balance,
account_balance * 0.9 AS new_balance
FROM users
WHERE signup_date < '2020-01-01'
LIMIT 10; -- Sample check
-- Looks good? Run the update
UPDATE users
SET account_balance = account_balance * 0.9
WHERE signup_date < '2020-01-01';
-- Verify
SELECT COUNT(*), AVG(account_balance)
FROM users
WHERE signup_date < '2020-01-01';
-- If wrong:
ROLLBACK;
-- If right:
COMMIT;
Mistake 3: Overwriting Logs
The Problem: Log rotation or truncation during an incident.
## BAD: Clears the file
echo "Starting debug" > /var/log/app.log
## BAD: Rotates out current logs
logrotate -f /etc/logrotate.conf
Fix:
## GOOD: Append, never overwrite
echo "[$(date)] Starting investigation" >> /var/log/app.log
## GOOD: Copy before any operations
cp /var/log/app.log /var/log/app.log.incident_$(date +%s)
## GOOD: Use tee to preserve and display
tail -f /var/log/app.log | tee /tmp/incident_capture.log
Mistake 4: No Rollback Plan
The Problem: Making changes without knowing how to undo them.
## Dangerous: No way back
ssh production "sed -i 's/old_value/new_value/g' /etc/app/config.ini"
Fix:
## 1. Backup first
ssh production "cp /etc/app/config.ini /etc/app/config.ini.backup_$(date +%s)"
## 2. Make change
ssh production "sed -i 's/old_value/new_value/g' /etc/app/config.ini"
## 3. Test
ssh production "app --validate-config"
## 4. If bad, rollback ready:
## ssh production "cp /etc/app/config.ini.backup_TIMESTAMP /etc/app/config.ini"
## 5. Document in runbook
cat >> incident_log.md << EOF
### Rollback Procedure
If config change causes issues:
\`\`\`bash
ssh production "cp /etc/app/config.ini.backup_TIMESTAMP /etc/app/config.ini"
ssh production "systemctl restart app"
\`\`\`
EOF
Key Takeaways
Quick Reference Card: Reversible Actions First
| Principle | Implementation |
|---|---|
| Observe First | Read logs, metrics, state BEFORE changing anything |
| Preserve Evidence | Copy logs/configs to timestamped backups |
| Use Transactions | Wrap DB changes in BEGIN...COMMIT/ROLLBACK |
| Document Rollback | Write undo steps BEFORE executing risky changes |
| Progressive Escalation | Start with read-only, escalate only when necessary |
| Feature Flags | Toggle behavior via config, not deployments |
| Test Rollback | Verify undo procedure works before execution |
Golden Rules:
- If you can't undo it, don't do it (unless no other option)
- Restarts are last resort, not first response
- Snapshots are cheap, lost evidence is expensive
- When in doubt, wait and gather more information
The Reversibility Checklist
Before executing any debugging action, ask:
- Can I gather this information without modifying state?
- Have I captured current state (logs, configs, metrics)?
- Do I have a tested rollback procedure?
- Will this action destroy evidence I might need later?
- Is there a safer alternative that gives me the same information?
- Have I documented what I'm about to do and why?
- Can I test this in a non-production environment first?
Mnemonic: R.E.V.E.R.S.E
- Read-only investigation first
- Evidence preservation (backups, snapshots)
- Verify before committing changes
- Escalate progressively (low-risk to high-risk)
- Rollback plan documented
- Snapshot system state
- Expect to undo (design for reversibility)
Further Study
- Google SRE Book - Incident Response: https://sre.google/sre-book/effective-troubleshooting/ (Chapter on systematic troubleshooting and safe intervention)
- Database Reliability Engineering: https://www.oreilly.com/library/view/database-reliability-engineering/9781491925935/ (Reversible database operations and transaction management)
- Kubernetes Documentation - Debug Pods: https://kubernetes.io/docs/tasks/debug/debug-application/debug-running-pod/ (Non-destructive debugging techniques for containers)
Remember: Acting impulsively to fix a production incident quickly is often the fastest way to make it worse. The most reliable path to resolution is methodical, evidence-preserving investigation. When systems are on fire, your calm, reversible approach is what saves the day.
Master these patterns, and you'll be the engineer everyone wants on-call.