
Feature Flags and Kill Switches

Using runtime toggles to disable problematic features


Master the art of deploying safer code with feature flags and kill switches. This lesson covers implementing reversible deployment controls, designing emergency shutoff mechanisms, and strategic rollback patterns: essential skills for debugging production systems under pressure.

Welcome to Defensive Deployment 🚀

When production breaks at 3 AM, the last thing you want is to be scrambling through a complex rollback procedure. Feature flags and kill switches are your safety net: mechanisms that let you disable problematic code instantly without redeploying. Think of them as circuit breakers for your application: when something goes wrong, you flip a switch and the problem disappears.

In high-pressure debugging scenarios, reversibility is paramount. These tools let you answer "Can we turn it off RIGHT NOW?" with a confident "Yes." This lesson will show you how to build and use these critical safety mechanisms.

Core Concepts 🔧

What Are Feature Flags? 🎏

Feature flags (also called feature toggles or feature switches) are conditional statements in your code that control whether specific functionality executes. Instead of:

# Without feature flag
def process_payment(order):
    return new_payment_system.charge(order)  # No way to disable!

You write:

# With feature flag
def process_payment(order):
    if feature_flags.is_enabled('use_new_payment_system'):
        return new_payment_system.charge(order)
    else:
        return legacy_payment_system.charge(order)

Now you can toggle between implementations without touching code or redeploying.

Types of Feature Flags 📊

| Flag Type | Lifetime | Purpose | Example |
|---|---|---|---|
| Release Toggles | Temporary (days/weeks) | Hide incomplete features | New dashboard UI |
| Ops Toggles (Kill Switches) | Permanent | Emergency shutoff | Disable expensive query |
| Experiment Toggles | Temporary (test duration) | A/B testing | Button color variation |
| Permission Toggles | Permanent | Access control | Admin-only features |

Kill Switches: Your Emergency Brake 🛑

A kill switch is a specific type of operational feature flag designed for emergency use. It should:

✅ Be accessible instantly (no deployment needed)
✅ Take effect immediately (no cache delays)
✅ Have minimal dependencies (works even when other systems fail)
✅ Be easy to operate (simple boolean, no complex configuration)
✅ Be monitored (alert when toggled)

💡 Golden Rule: Every risky feature should have a kill switch from day one, not added after the first outage.
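
The checklist above can be condensed into a minimal sketch. This is an illustration, not a specific product's API: the `fetch_state` and `alert` callables are assumed stand-ins for your flag store and paging system.

```python
import threading

class KillSwitch:
    """Minimal kill switch: locally cached, safe default, alert on toggle."""

    def __init__(self, fetch_state, alert):
        self._fetch_state = fetch_state  # callable returning {flag_name: bool}
        self._alert = alert              # callable(message), e.g. pages on-call
        self._cache = {}
        self._lock = threading.Lock()

    def refresh(self):
        """Called by a background timer (e.g. every 30 seconds)."""
        try:
            new_state = self._fetch_state()
        except Exception:
            return  # flag store unreachable: keep last known state
        with self._lock:
            for name, enabled in new_state.items():
                if self._cache.get(name) != enabled:
                    self._alert(f"Kill switch '{name}' set to {enabled}")
                self._cache[name] = enabled

    def is_enabled(self, name):
        with self._lock:
            return self._cache.get(name, False)  # unknown flag: default off
```

Checks are in-memory lookups (instant, minimal dependencies), unknown or unfetchable flags default to off, and every state change fires an alert.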

Architecture Patterns 🏗️

1. Configuration-Based (Simple)
// config.json (can be updated without deployment)
{
  "features": {
    "enable_new_search": true,
    "enable_ml_recommendations": false
  }
}

// Application code
const config = require('./config.json');

function search(query) {
  if (config.features.enable_new_search) {
    return elasticSearch(query);
  }
  return legacySearch(query);
}

Pros: Simple, no external dependencies
Cons: Requires reload/restart to take effect

2. Database-Based (Dynamic)
# Feature flags stored in database
import hashlib

class FeatureFlag:
    def is_enabled(self, flag_name, user_id=None):
        flag = db.query("SELECT enabled, rollout_percentage "
                       "FROM feature_flags WHERE name = ?", flag_name)
        
        if not flag:
            return False  # Safe default
        
        if flag.rollout_percentage == 100:
            return flag.enabled
        
        # Gradual rollout based on user ID (use a stable hash; Python's
        # built-in hash() is randomized per process)
        if user_id:
            digest = hashlib.md5(f"{flag_name}:{user_id}".encode()).hexdigest()
            hash_value = int(digest[:8], 16) % 100
            return flag.enabled and hash_value < flag.rollout_percentage
        
        return flag.enabled

# Usage
ff = FeatureFlag()

if ff.is_enabled('new_checkout', user_id=current_user.id):
    show_new_checkout()
else:
    show_old_checkout()

Pros: Dynamic updates, supports gradual rollout
Cons: Database dependency (what if DB is down?)
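
One common answer to the "what if the DB is down?" question is a last-known-good cache in front of the query: serve the cached value when the lookup fails, and a safe default when nothing was ever cached. A sketch in which the `query_db` callable stands in for the database access above:

```python
import time

class CachedDbFlags:
    """Wrap a DB-backed flag lookup with a last-known-good fallback."""

    def __init__(self, query_db, ttl_seconds=30):
        self._query_db = query_db  # callable(name) -> bool, may raise
        self._ttl = ttl_seconds
        self._cache = {}  # name -> (value, fetched_at)

    def is_enabled(self, name):
        value, fetched_at = self._cache.get(name, (False, 0.0))
        if time.time() - fetched_at < self._ttl:
            return value  # fresh enough: no DB round trip at all
        try:
            value = bool(self._query_db(name))
            self._cache[name] = (value, time.time())
        except Exception:
            pass  # DB down: serve last known value, or the safe False default
        return value
```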

3. Remote Config Service (Production-Grade)
// Using a service like LaunchDarkly, Split.io, or custom
// (illustrative snippet; exact SDK import paths and API vary by version)
import ld "github.com/launchdarkly/go-server-sdk/v6"

ldClient, _ := ld.MakeClient(sdkKey, 5*time.Second)
user := lduser.NewUser(userID)

showFeature, _ := ldClient.BoolVariation("new-dashboard", user, false)

if showFeature {
    renderNewDashboard()
} else {
    renderOldDashboard()
}

Pros: Instant updates, built-in analytics, gradual rollout, targeting rules
Cons: External dependency, cost

Implementation Best Practices 🎯

1. Fail-Safe Defaults
// ALWAYS provide safe defaults
fn get_feature_flag(name: &str) -> bool {
    match flag_service.get(name) {
        Ok(value) => value,
        Err(e) => {
            log::error!("Flag service error: {}", e);
            // Default to SAFE behavior (usually false/off)
            false
        }
    }
}

⚠️ Critical: When the flag system fails, default to the safest behavior (usually the old/stable code path).

2. Fast Evaluation
// BAD: Blocking API call on every check
async function isEnabled(flag: string): Promise<boolean> {
  const response = await fetch(`/api/flags/${flag}`);  // SLOW!
  return (await response.json()).enabled;
}

// GOOD: Cached with background refresh
class FeatureFlagCache {
  private cache: Map<string, boolean> = new Map();
  
  constructor() {
    this.refreshCache();  // Initial load
    setInterval(() => this.refreshCache(), 30000);  // Refresh every 30s
  }
  
  async refreshCache() {
    try {
      const flags = await fetch('/api/flags').then(r => r.json());
      flags.forEach(f => this.cache.set(f.name, f.enabled));
    } catch (err) {
      console.error('Flag refresh failed, keeping last known values', err);
    }
  }
  
  isEnabled(flag: string): boolean {
    return this.cache.get(flag) ?? false;  // Fast lookup, safe default
  }
}

💡 Performance Tip: Feature flag checks happen in hot code paths. Keep them sub-millisecond with local caching.

3. Gradual Rollout Strategy
# Percentage-based rollout
import hashlib

def is_enabled_for_user(flag_name: str, user_id: int, rollout_pct: int) -> bool:
    # Deterministic: a stable hash guarantees the same user always gets the
    # same result (Python's built-in hash() is randomized per process)
    digest = hashlib.md5(f"{flag_name}:{user_id}".encode()).hexdigest()
    user_bucket = int(digest[:8], 16) % 100
    return user_bucket < rollout_pct

# Roll out to 1% -> 5% -> 25% -> 100%
if is_enabled_for_user('new_algorithm', user.id, rollout_pct=5):
    return new_algorithm(data)
else:
    return stable_algorithm(data)

GRADUAL ROLLOUT VISUALIZATION

Day 1:  [█░░░░░░░░░░░░░░░░░░░] 5%   → Monitor metrics
Day 2:  [███░░░░░░░░░░░░░░░░░] 15%  → Still stable?
Day 3:  [██████░░░░░░░░░░░░░░] 30%  → Metrics look good
Day 4:  [██████████░░░░░░░░░░] 50%  → Halfway there
Day 5:  [████████████████████] 100% → Full rollout!

If ANY issues detected → Immediately roll back to 0%

4. Monitoring and Alerting
// Instrument flag usage
func CheckFeatureFlag(ctx context.Context, flagName string, userID string) bool {
    enabled := flagService.IsEnabled(flagName, userID)
    
    // Metrics
    metrics.Increment("feature_flag.check", map[string]string{
        "flag": flagName,
        "enabled": strconv.FormatBool(enabled),
    })
    
    // Alert if kill switch activated
    if strings.HasPrefix(flagName, "killswitch_") && !enabled {
        alerts.Send(AlertCritical, 
            fmt.Sprintf("Kill switch %s activated!", flagName))
    }
    
    return enabled
}

📊 Track these metrics:

  • Flag check frequency
  • True/false split ratio
  • Error rates per code path
  • Performance difference between paths
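
The first two of these can be collected with nothing more than stdlib counters; a minimal sketch (the helper and metric names are illustrative, not a real metrics library):

```python
from collections import Counter

flag_checks = Counter()   # (flag, enabled) -> number of checks
path_errors = Counter()   # (flag, enabled) -> number of errors

def instrumented_call(flag_name, enabled, fn):
    """Run one code path, recording check and error counts per path."""
    key = (flag_name, enabled)
    flag_checks[key] += 1
    try:
        return fn()
    except Exception:
        path_errors[key] += 1
        raise

def true_false_split(flag_name):
    """Fraction of checks that took the enabled path."""
    on = flag_checks[(flag_name, True)]
    off = flag_checks[(flag_name, False)]
    total = on + off
    return on / total if total else 0.0
```

Comparing `path_errors` across the two `enabled` values is exactly the per-path error-rate signal that automated rollback needs.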

Detailed Examples 💻

Example 1: Emergency Database Query Kill Switch

Scenario: You deploy a new analytics query that's accidentally causing database load spikes. You need to disable it NOW.

-- Feature flags table
CREATE TABLE feature_flags (
    name VARCHAR(100) PRIMARY KEY,
    enabled BOOLEAN NOT NULL DEFAULT false,
    description TEXT,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

INSERT INTO feature_flags (name, enabled, description) VALUES
('enable_analytics_query', true, 'New user behavior analytics');
// Application code with kill switch
public class AnalyticsService {
    private final FeatureFlagService flags;
    
    public AnalyticsReport generateReport(String userId) {
        // Kill switch check (fast, cached)
        if (!flags.isEnabled("enable_analytics_query")) {
            logger.info("Analytics query disabled, returning cached data");
            return getCachedReport(userId);  // Fallback to safe option
        }
        
        try {
            // Potentially expensive operation
            return database.executeComplexAnalytics(userId);
        } catch (Exception e) {
            logger.error("Analytics query failed", e);
            // Automatic fallback
            return getCachedReport(userId);
        }
    }
    
    private AnalyticsReport getCachedReport(String userId) {
        // Safe, fast alternative
        return cache.get("analytics:" + userId, 
            () -> generateBasicReport(userId));
    }
}

When database load spikes:

-- DBA runs this ONE command to fix it immediately:
UPDATE feature_flags SET enabled = false 
WHERE name = 'enable_analytics_query';

✅ Result: Within 30 seconds (cache refresh), all servers stop running the expensive query. No deployment, no restart, no downtime.

Example 2: Canary Deployment with Automatic Rollback

Scenario: Deploy new payment processing code to 10% of users, automatically rollback if error rate exceeds threshold.

import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class RolloutConfig:
    flag_name: str
    target_percentage: int
    error_threshold: float  # e.g., 0.05 = 5%
    monitoring_window_seconds: int = 300  # 5 minutes

class AutoRollbackFeatureFlag:
    def __init__(self, db, metrics, alerting):
        self.db = db
        self.metrics = metrics
        self.alerting = alerting
    
    def check_and_rollback_if_needed(self, config: RolloutConfig):
        """Background job that monitors and auto-rolls back"""
        current_pct = self.get_current_percentage(config.flag_name)
        
        if current_pct == 0:
            return  # Already disabled
        
        # Check error rate for flag=true path
        error_rate = self.metrics.get_error_rate(
            flag=config.flag_name,
            value=True,
            window_seconds=config.monitoring_window_seconds
        )
        
        if error_rate > config.error_threshold:
            # AUTOMATIC ROLLBACK
            self.set_percentage(config.flag_name, 0)
            
            self.alerting.send_critical(
                title=f"Auto-rollback triggered: {config.flag_name}",
                message=f"Error rate {error_rate:.2%} exceeded threshold "
                        f"{config.error_threshold:.2%}. Rolled back to 0%."
            )
            
            self.db.log_rollback_event(
                flag=config.flag_name,
                reason="error_rate_exceeded",
                error_rate=error_rate,
                timestamp=time.time()
            )
    
    def gradual_rollout(self, flag_name: str, target_pct: int, 
                       step_pct: int = 10, step_delay_minutes: int = 30):
        """Gradually increase percentage with monitoring"""
        current_pct = 0
        
        while current_pct < target_pct:
            next_pct = min(current_pct + step_pct, target_pct)
            
            # Increase percentage
            self.set_percentage(flag_name, next_pct)
            print(f"Rolled out to {next_pct}%")
            
            # Wait and monitor
            time.sleep(step_delay_minutes * 60)
            
            # Check if auto-rollback occurred
            if self.get_current_percentage(flag_name) == 0:
                print("Rollback detected, stopping rollout")
                return False
            
            current_pct = next_pct
        
        return True  # Successfully reached target

# Usage
rollout = AutoRollbackFeatureFlag(db, metrics, alerting)

# Start gradual rollout with safety net
rollout.gradual_rollout(
    flag_name='new_payment_processor',
    target_pct=100,
    step_pct=10,      # Increase by 10% each step
    step_delay_minutes=30  # Wait 30 min between steps
)

# Background monitoring job
config = RolloutConfig(
    flag_name='new_payment_processor',
    target_percentage=100,
    error_threshold=0.05,  # Auto-rollback if >5% error rate
    monitoring_window_seconds=300
)

while True:
    rollout.check_and_rollback_if_needed(config)
    time.sleep(60)  # Check every minute

Example 3: Multi-Level Circuit Breaker

Scenario: Implement defense-in-depth with multiple levels of protection.

interface CircuitBreakerConfig {
  flagName: string;
  errorThreshold: number;      // e.g., 10 errors
  errorWindowMs: number;       // e.g., 60000 (1 minute)
  openCircuitDurationMs: number; // e.g., 30000 (30 seconds)
}

class CircuitBreakerFeatureFlag {
  private errorCounts: Map<string, number[]> = new Map();
  private circuitOpen: Map<string, number> = new Map();
  
  isEnabled(config: CircuitBreakerConfig): boolean {
    const now = Date.now();
    
    // Level 1: Check manual kill switch
    const manualFlag = featureFlagService.get(config.flagName);
    if (!manualFlag) {
      return false;  // Manually disabled
    }
    
    // Level 2: Check if circuit breaker is open
    const openUntil = this.circuitOpen.get(config.flagName);
    if (openUntil && now < openUntil) {
      console.log(`Circuit open for ${config.flagName}, using fallback`);
      return false;  // Auto-disabled due to errors
    }
    
    // Level 3: Check error rate
    const recentErrors = this.getRecentErrors(config.flagName, 
                                               config.errorWindowMs);
    if (recentErrors.length >= config.errorThreshold) {
      // Open the circuit
      this.circuitOpen.set(config.flagName, 
                          now + config.openCircuitDurationMs);
      
      alerting.send({
        level: 'warning',
        message: `Circuit breaker opened: ${config.flagName}. ` +
                `${recentErrors.length} errors in ` +
                `${config.errorWindowMs}ms window.`
      });
      
      return false;
    }
    
    return true;  // All checks passed, feature is enabled
  }
  
  recordError(flagName: string): void {
    const errors = this.errorCounts.get(flagName) || [];
    errors.push(Date.now());
    this.errorCounts.set(flagName, errors);
  }
  
  recordSuccess(flagName: string): void {
    // If we're getting successes, consider closing circuit
    const openUntil = this.circuitOpen.get(flagName);
    if (openUntil && Date.now() >= openUntil) {
      this.circuitOpen.delete(flagName);
      this.errorCounts.delete(flagName);  // Reset error count
      console.log(`Circuit closed for ${flagName}, resuming normal operation`);
    }
  }
  
  private getRecentErrors(flagName: string, windowMs: number): number[] {
    const errors = this.errorCounts.get(flagName) || [];
    const cutoff = Date.now() - windowMs;
    return errors.filter(timestamp => timestamp > cutoff);
  }
}

// Usage in application code
const circuitBreaker = new CircuitBreakerFeatureFlag();

const config = {
  flagName: 'external_api_integration',
  errorThreshold: 10,
  errorWindowMs: 60000,
  openCircuitDurationMs: 30000
};

async function callExternalAPI(data: any) {
  if (!circuitBreaker.isEnabled(config)) {
    // Fallback to cached data or degraded functionality
    return getCachedData(data);
  }
  
  try {
    const result = await externalAPI.call(data);
    circuitBreaker.recordSuccess(config.flagName);
    return result;
  } catch (error) {
    circuitBreaker.recordError(config.flagName);
    throw error;
  }
}

CIRCUIT BREAKER STATE MACHINE

    ┌──────────────┐
    │   CLOSED     │ ← Normal operation
    │  (Enabled)   │
    └──────┬───────┘
           │
    Errors exceed
    threshold
           │
           ↓
    ┌──────────────┐
    │     OPEN     │ ← Fallback active
    │  (Disabled)  │
    └──────┬───────┘
           │
    Wait timeout
    period
           │
           ↓
    ┌──────────────┐
    │  HALF-OPEN   │ ← Testing recovery
    │  (Cautious)  │
    └──────┬───────┘
           │
      ┌────┴────┐
      │         │
   Success    Failure
      │         │
      ↓         ↓
   CLOSED    OPEN

Example 4: User-Targeted Feature Flags

Scenario: Enable new feature only for internal employees first, then beta users, then everyone.

class TargetedFeatureFlag
  def initialize(flag_service, user_service)
    @flag_service = flag_service
    @user_service = user_service
  end
  
  def enabled?(flag_name, user_id)
    flag = @flag_service.get(flag_name)
    return false unless flag&.enabled
    
    user = @user_service.find(user_id)
    
    # Targeting rules (evaluated in order)
    case flag.targeting_mode
    when 'all'
      true
    when 'employees_only'
      user.email.end_with?('@company.com')
    when 'beta_users'
      user.email.end_with?('@company.com') || user.beta_tester?
    when 'percentage_rollout'
      user_in_rollout_percentage?(flag_name, user_id, flag.percentage)
    when 'whitelist'
      flag.whitelisted_user_ids.include?(user_id)
    when 'attribute_match'
      evaluate_attribute_rules(user, flag.attribute_rules)
    else
      false  # Safe default
    end
  end
  
  private
  
  def user_in_rollout_percentage?(flag_name, user_id, percentage)
    # Consistent hashing ensures same user always gets same result
    hash = Digest::MD5.hexdigest("#{flag_name}:#{user_id}")
    hash_int = hash[0..7].to_i(16)  # First 8 hex digits
    bucket = hash_int % 100
    bucket < percentage
  end
  
  def evaluate_attribute_rules(user, rules)
    # Example: { "country" => "US", "plan" => "premium" }
    rules.all? do |attribute, expected_value|
      user.send(attribute) == expected_value
    end
  end
end

# Example usage in controller
class FeaturesController < ApplicationController
  before_action :check_feature_access
  
  def new_dashboard
    if @feature_flags.enabled?('new_dashboard', current_user.id)
      render 'dashboards/new'
    else
      render 'dashboards/legacy'
    end
  end
  
  private
  
  def check_feature_access
    @feature_flags = TargetedFeatureFlag.new(FeatureFlagService.new, 
                                             UserService.new)
  end
end

| Rollout Phase | Target | Duration | Rollback Risk |
|---|---|---|---|
| 1. Internal | Employees only | 1 week | Low (small user base) |
| 2. Beta | Opt-in testers | 2 weeks | Medium (friendly users) |
| 3. Canary | 5% random users | 3 days | Medium (small percentage) |
| 4. Gradual | 25% → 50% → 100% | 1 week | High (large user base) |

Common Mistakes ⚠️

1. ❌ No Default Behavior

# WRONG: Crashes if flag service is down
def process_order(order):
    if flag_service.is_enabled('new_processor'):  # What if this throws?
        return new_processor.process(order)
    else:
        return old_processor.process(order)

# RIGHT: Safe defaults and error handling
def process_order(order):
    try:
        use_new = flag_service.is_enabled('new_processor')
    except Exception as e:
        logger.error(f"Flag check failed: {e}")
        use_new = False  # Default to stable code path
    
    if use_new:
        return new_processor.process(order)
    else:
        return old_processor.process(order)

Why it matters: Your flag service becomes a single point of failure. Always have a fallback.

2. ❌ Forgetting to Remove Old Flags

// After 6 months, you might have:
if (flags.get('new_ui_2023')) { }        // Still needed?
if (flags.get('experiment_checkout')) { } // Experiment ended 5 months ago!
if (flags.get('beta_search')) { }        // This is production now!

// This creates:
// - Technical debt
// - Confusing code paths
// - Performance overhead
// - Maintenance burden

Solution: Add expiry dates and periodic cleanup:

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class FeatureFlag:
    name: str
    enabled: bool
    created_at: datetime
    expires_at: Optional[datetime]  # Reminder to clean up
    flag_type: str  # 'release', 'ops', 'experiment'

# Automated cleanup reminder
def check_stale_flags():
    stale_flags = db.query(
        "SELECT name FROM feature_flags "
        "WHERE expires_at < NOW() AND still_in_code = true"
    )
    for flag in stale_flags:
        alert_team(f"Remove stale flag from code: {flag.name}")

3. ❌ Slow Flag Evaluation

// WRONG: Network call every time (could be thousands of times/second)
func ShowNewFeature(userID string) bool {
    resp, _ := http.Get(fmt.Sprintf("https://flags-api.com/check?user=%s", userID))
    defer resp.Body.Close()
    // This is TERRIBLE for performance!
    return parseResponse(resp)
}

// RIGHT: Local evaluation with cached rules
var flagCache = sync.Map{}

func ShowNewFeature(userID string) bool {
    // Fast lookup from memory
    if val, ok := flagCache.Load("new_feature"); ok {
        return val.(bool)
    }
    return false  // Safe default
}

// Background goroutine refreshes cache every 30s
func refreshFlagsInBackground() {
    for {
        flags := fetchAllFlagsFromAPI()  // One batch call
        for name, enabled := range flags {
            flagCache.Store(name, enabled)
        }
        time.Sleep(30 * time.Second)
    }
}

4. ❌ Complex Flag Logic

// WRONG: Too complex to understand and debug
if ((flags.newUI && user.isPremium) || 
    (flags.betaAccess && user.country === 'US' && !user.hasOptedOut) ||
    (flags.forceNewUI && user.email.endsWith('@company.com'))) {
  // What conditions actually enabled this?!
}

// RIGHT: Clear, testable, traceable
function shouldShowNewUI(user: User): boolean {
  if (flags.newUI && user.isPremium) {
    logger.debug('New UI shown: premium_user');
    return true;
  }
  
  if (flags.betaAccess && user.country === 'US' && !user.hasOptedOut) {
    logger.debug('New UI shown: us_beta_user');
    return true;
  }
  
  if (flags.forceNewUI && user.email.endsWith('@company.com')) {
    logger.debug('New UI shown: internal_user');
    return true;
  }
  
  logger.debug('New UI not shown: no enabling rule matched');
  return false;
}

5. ❌ Not Monitoring Flag State Changes

# WRONG: No audit trail
def update_flag(name, enabled)
  db.execute("UPDATE flags SET enabled = ? WHERE name = ?", enabled, name)
end

# RIGHT: Full audit trail
def update_flag(name, enabled, changed_by:, reason:)
  old_value = db.query("SELECT enabled FROM flags WHERE name = ?", name).first
  
  db.execute("UPDATE flags SET enabled = ? WHERE name = ?", enabled, name)
  
  # Log every change
  audit_log.create!(
    flag_name: name,
    old_value: old_value,
    new_value: enabled,
    changed_by: changed_by,
    reason: reason,
    timestamp: Time.now
  )
  
  # Alert team
  if name.start_with?('killswitch_')
    slack.post(
      channel: '#ops-alerts',
      message: "🚨 Kill switch #{name} #{enabled ? 'ENABLED' : 'DISABLED'} " +
               "by #{changed_by}. Reason: #{reason}"
    )
  end
end

6. ❌ Using Flags for Configuration

# WRONG: Feature flags are not a configuration system
flag_service.set('database_host', 'db.prod.company.com')
flag_service.set('max_connections', 100)
flag_service.set('api_timeout_seconds', 30)

# These should be in proper configuration management!
# Feature flags are for ENABLING/DISABLING features, not configuration values

# RIGHT: Use flags for booleans, config for values
if feature_flags.is_enabled('use_connection_pooling'):
    pool = ConnectionPool(
        host=config.get('database_host'),
        max_connections=config.get('max_connections')
    )

Emergency Response Workflow 🚨

When production breaks and you need to use a kill switch:

EMERGENCY KILL SWITCH PROCEDURE

1. IDENTIFY THE PROBLEM
   ├─ What feature/code is causing issues?
   ├─ Is there a relevant kill switch?
   └─ What's the blast radius?

2. ACTIVATE KILL SWITCH
   ├─ Update flag in control panel/database
   ├─ Verify flag propagated (check cache refresh)
   └─ Confirm feature is disabled
        │
        ├─ Check metrics dashboard
        ├─ Test affected endpoints
        └─ Monitor error rates

3. VERIFY IMPACT
   ├─ Did errors stop? ✅
   ├─ Is fallback working? ✅
   └─ Are users experiencing issues? ❌

4. COMMUNICATE
   ├─ Post in incident channel
   ├─ Update status page
   └─ Notify stakeholders

5. INVESTIGATE ROOT CAUSE
   ├─ Collect logs/metrics
   ├─ Reproduce issue in staging
   └─ Develop fix

6. RE-ENABLE (GRADUALLY)
   ├─ Deploy fix to staging
   ├─ Test thoroughly
   ├─ Enable for 1% of users
   ├─ Monitor for 1 hour
   └─ Gradual rollout to 100%
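
Step 2 of this procedure is a good candidate for scripting, so that activation and verification always happen together. A sketch in which `set_flag` and `get_flag_from_each_server` are hypothetical stand-ins for your own control plane and per-server introspection endpoint:

```python
import time

def activate_kill_switch(name, set_flag, get_flag_from_each_server,
                         timeout_seconds=60, poll_seconds=5):
    """Disable the flag, then poll until every server reports it off."""
    set_flag(name, False)  # 2a: update the flag in the control plane
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        states = get_flag_from_each_server(name)  # 2b: check propagation
        if all(state is False for state in states):
            return True  # 2c: confirmed disabled everywhere
        time.sleep(poll_seconds)
    return False  # propagation not confirmed in time: escalate
```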

💡 Pro Tip: Practice your kill switch procedure during low-traffic periods. You don't want the first time using it to be during a critical incident.

Key Takeaways 🎯

✅ Feature flags are safety nets – They let you disable code instantly without deployment
✅ Kill switches are permanent – Critical features should always have an emergency shutoff
✅ Default to safety – When flag evaluation fails, fall back to stable code paths
✅ Keep evaluation fast – Cache flags locally, avoid network calls in hot paths
✅ Gradual rollout – Start with 1-5%, monitor metrics, increase slowly
✅ Monitor everything – Track flag state changes, error rates per code path
✅ Clean up regularly – Remove expired flags to reduce technical debt
✅ Audit all changes – Log who changed what flag and why
✅ Test the off switch – Verify your fallback code actually works
✅ Document flag purpose – Future you will forget why this flag exists

📋 Quick Reference Card: Kill Switch Checklist

| Characteristic | Requirement |
|---|---|
| 🎯 Scope | One kill switch per risky feature |
| ⚡ Speed | Takes effect within 30 seconds |
| 🛑 Safety | Graceful fallback to stable behavior |
| 📊 Monitoring | Alert when toggled, track metrics |
| 🔍 Testing | Regularly test that the off state works |
| 📝 Documentation | Clear description and runbook |
| 👀 Access | On-call engineer can toggle 24/7 |

📚 Further Study

  1. Feature Toggles (Martin Fowler) - https://martinfowler.com/articles/feature-toggles.html - Comprehensive guide to feature flag patterns and antipatterns

  2. LaunchDarkly Feature Flag Guide - https://launchdarkly.com/blog/what-are-feature-flags/ - Best practices from a production flag management platform

  3. Site Reliability Engineering Book (Google) - https://sre.google/sre-book/managing-critical-state/ - Chapter on managing rollouts and emergency controls at scale