Feature Flags and Kill Switches
Using runtime toggles to disable problematic features
Master the art of deploying safer code with feature flags and kill switches. This lesson covers implementing reversible deployment controls, designing emergency shutoff mechanisms, and strategic rollback patterns, all essential skills for debugging production systems under pressure.
Welcome to Defensive Deployment
When production breaks at 3 AM, the last thing you want is to be scrambling through a complex rollback procedure. Feature flags and kill switches are your safety net: mechanisms that let you disable problematic code instantly without redeploying. Think of them as circuit breakers for your application: when something goes wrong, you flip a switch and the problem disappears.
In high-pressure debugging scenarios, reversibility is paramount. These tools let you answer "Can we turn it off RIGHT NOW?" with a confident "Yes." This lesson will show you how to build and use these critical safety mechanisms.
Core Concepts
What Are Feature Flags?
Feature flags (also called feature toggles or feature switches) are conditional statements in your code that control whether specific functionality executes. Instead of:
## Without feature flag
def process_payment(order):
return new_payment_system.charge(order) # No way to disable!
You write:
## With feature flag
def process_payment(order):
if feature_flags.is_enabled('use_new_payment_system'):
return new_payment_system.charge(order)
else:
return legacy_payment_system.charge(order)
Now you can toggle between implementations without touching code or redeploying.
Types of Feature Flags
| Flag Type | Lifetime | Purpose | Example |
|---|---|---|---|
| Release Toggles | Temporary (days/weeks) | Hide incomplete features | New dashboard UI |
| Ops Toggles (Kill Switches) | Permanent | Emergency shutoff | Disable expensive query |
| Experiment Toggles | Temporary (test duration) | A/B testing | Button color variation |
| Permission Toggles | Permanent | Access control | Admin-only features |
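A lightweight way to keep these categories from blurring together over time is to record each flag's type, plus an expiry date for the temporary kinds, alongside its name. Here is a minimal sketch; the FlagType enum, register helper, and example flag names are illustrative rather than any specific library's API:
# Illustrative only: a tiny registry that records each flag's type and expiry
# so temporary toggles (release/experiment) don't quietly become permanent.
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional

class FlagType(Enum):
    RELEASE = "release"        # temporary: hide incomplete features
    OPS = "ops"                # long-lived: kill switches
    EXPERIMENT = "experiment"  # temporary: A/B tests
    PERMISSION = "permission"  # long-lived: access control

@dataclass
class FlagDefinition:
    name: str
    flag_type: FlagType
    description: str
    expires_on: Optional[date] = None  # required for temporary flag types

REGISTRY: dict = {}

def register(defn: FlagDefinition) -> None:
    temporary = defn.flag_type in (FlagType.RELEASE, FlagType.EXPERIMENT)
    if temporary and defn.expires_on is None:
        raise ValueError(f"{defn.name}: temporary flags need an expiry date")
    REGISTRY[defn.name] = defn

register(FlagDefinition("new_dashboard_ui", FlagType.RELEASE,
                        "New dashboard UI", expires_on=date(2025, 1, 31)))
register(FlagDefinition("killswitch_expensive_query", FlagType.OPS,
                        "Emergency shutoff for the analytics query"))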
Kill Switches: Your Emergency Brake
A kill switch is a specific type of operational feature flag designed for emergency use. It should:
- Be accessible instantly (no deployment needed)
- Take effect immediately (no cache delays)
- Have minimal dependencies (works even when other systems fail)
- Be easy to operate (simple boolean, no complex configuration)
- Be monitored (alert when toggled)
Golden Rule: Every risky feature should have a kill switch from day one, not added after the first outage.
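Here is a minimal sketch of how those properties can fit together in code. The fetch_state and alert arguments are assumed hooks (for example, a database read and a pager or Slack call), not part of any particular library:
# Minimal sketch of a kill switch with the properties above: cached locally,
# safe default on failure, alert on activation.
import threading
import time

class KillSwitch:
    def __init__(self, name, fetch_state, alert, refresh_seconds=10):
        self.name = name                 # e.g. "killswitch_expensive_query"
        self._fetch_state = fetch_state  # callable -> True if the feature is allowed
        self._alert = alert              # callable(message)
        self._enabled = True             # last known state
        self._lock = threading.Lock()
        self._refresh_seconds = refresh_seconds
        threading.Thread(target=self._refresh_loop, daemon=True).start()

    def allows(self) -> bool:
        """Hot-path check: a local read, never a network call."""
        with self._lock:
            return self._enabled

    def _refresh_loop(self):
        while True:
            try:
                new_state = bool(self._fetch_state())
            except Exception:
                new_state = False  # flag store unreachable: fail to the safe (off) side
            with self._lock:
                if self._enabled and not new_state:
                    self._alert(f"Kill switch {self.name} activated")  # monitored
                self._enabled = new_state
            time.sleep(self._refresh_seconds)
The hot-path check is just a local read, the refresh happens off the request path, and flipping the stored value is the only action needed during an incident.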
Architecture Patterns
1. Configuration-Based (Simple)
// config.json (can be updated without deployment)
{
"features": {
"enable_new_search": true,
"enable_ml_recommendations": false
}
}
// Application code
const config = require('./config.json');
function search(query) {
if (config.features.enable_new_search) {
return elasticSearch(query);
}
return legacySearch(query);
}
Pros: Simple, no external dependencies
Cons: Requires reload/restart to take effect
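If waiting for a restart is too slow, one option is to re-read the file whenever it changes. A rough sketch in Python, assuming the config.json layout shown above:
# Rough sketch: re-read config.json when its mtime changes, so flag flips
# take effect without a restart.
import json
import os

class FileBackedFlags:
    def __init__(self, path="config.json"):
        self.path = path
        self._mtime = 0.0
        self._features = {}
        self._reload_if_changed()

    def _reload_if_changed(self):
        try:
            mtime = os.path.getmtime(self.path)
            if mtime != self._mtime:
                with open(self.path) as f:
                    self._features = json.load(f).get("features", {})
                self._mtime = mtime
        except (OSError, ValueError):
            pass  # keep the last known values if the file is briefly unavailable or mid-write

    def is_enabled(self, name):
        self._reload_if_changed()  # a cheap stat() per check
        return bool(self._features.get(name, False))  # safe default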
2. Database-Based (Dynamic)
## Feature flags stored in database
class FeatureFlag:
def is_enabled(self, flag_name, user_id=None):
flag = db.query("SELECT enabled, rollout_percentage "
"FROM feature_flags WHERE name = ?", flag_name)
if not flag:
return False # Safe default
if flag.rollout_percentage == 100:
return flag.enabled
# Gradual rollout based on user ID
if user_id:
hash_value = hash(f"{flag_name}:{user_id}") % 100
return flag.enabled and hash_value < flag.rollout_percentage
return flag.enabled
## Usage
ff = FeatureFlag()
if ff.is_enabled('new_checkout', user_id=current_user.id):
show_new_checkout()
else:
show_old_checkout()
Pros: Dynamic updates, supports gradual rollout
Cons: Database dependency (what if DB is down?)
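A common mitigation is to hold on to the last value that was successfully read, so a database outage degrades to slightly stale flags rather than failed checks. A sketch that reuses the pseudo-db API from the example above:
# Sketch: wrap the database lookup with a short-TTL, last-known-value cache.
import time

class CachedFeatureFlag:
    def __init__(self, db, ttl_seconds=30):
        self.db = db
        self.ttl = ttl_seconds
        self._cache = {}  # name -> (enabled, fetched_at)

    def is_enabled(self, name):
        entry = self._cache.get(name)
        if entry and time.time() - entry[1] < self.ttl:
            return entry[0]                    # fresh enough: no database hit
        try:
            row = self.db.query(
                "SELECT enabled FROM feature_flags WHERE name = ?", name)
            enabled = bool(row and row.enabled)
        except Exception:
            # Database unavailable: serve the last known value, else the safe default
            return entry[0] if entry else False
        self._cache[name] = (enabled, time.time())
        return enabled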
3. Remote Config Service (Production-Grade)
// Using a service like LaunchDarkly, Split.io, or custom
import "github.com/launchdarkly/go-server-sdk/v6"
ldClient, _ := ld.MakeClient(sdkKey, 5*time.Second)
user := lduser.NewUser(userID)
showFeature, _ := ldClient.BoolVariation("new-dashboard", user, false)
if showFeature {
renderNewDashboard()
} else {
renderOldDashboard()
}
Pros: Instant updates, built-in analytics, gradual rollout, targeting rules
Cons: External dependency, cost
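Both drawbacks are easier to live with if the vendor SDK sits behind a thin interface with a safe default, so a provider outage (or a provider swap) only touches one module. A sketch; the VendorProvider class and its bool_variation call are placeholders, not a real SDK's API:
# Sketch: keep the hosted flag SDK behind a small seam with safe defaults.
from typing import Protocol

class FlagProvider(Protocol):
    def bool_flag(self, name: str, user_id: str, default: bool) -> bool: ...

class VendorProvider:
    """Adapter around a hosted flag SDK; client.bool_variation is a placeholder call."""
    def __init__(self, client):
        self.client = client

    def bool_flag(self, name, user_id, default):
        try:
            return self.client.bool_variation(name, user_id, default)
        except Exception:
            return default  # provider outage -> safe default

class Flags:
    """What the rest of the codebase imports; it never sees the vendor SDK directly."""
    def __init__(self, provider: FlagProvider):
        self.provider = provider

    def is_enabled(self, name: str, user_id: str) -> bool:
        return self.provider.bool_flag(name, user_id, False)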
Implementation Best Practices
1. Fail-Safe Defaults
// ALWAYS provide safe defaults
fn get_feature_flag(name: &str) -> bool {
match flag_service.get(name) {
Ok(value) => value,
Err(e) => {
log::error!("Flag service error: {}", e);
// Default to SAFE behavior (usually false/off)
false
}
}
}
Critical: When the flag system fails, default to the safest behavior (usually the old, stable code path).
2. Fast Evaluation
// BAD: Blocking API call on every check
async function isEnabled(flag: string): Promise<boolean> {
  const response = await fetch(`/api/flags/${flag}`); // SLOW!
  return (await response.json()).enabled;
}
// GOOD: Cached with background refresh
class FeatureFlagCache {
private cache: Map<string, boolean> = new Map();
constructor() {
this.refreshCache(); // Initial load
setInterval(() => this.refreshCache(), 30000); // Refresh every 30s
}
async refreshCache() {
const flags = await fetch('/api/flags').then(r => r.json());
flags.forEach(f => this.cache.set(f.name, f.enabled));
}
isEnabled(flag: string): boolean {
return this.cache.get(flag) ?? false; // Fast lookup, safe default
}
}
Performance Tip: Feature flag checks happen in hot code paths. Keep them sub-millisecond with local caching.
3. Gradual Rollout Strategy
## Percentage-based rollout
def is_enabled_for_user(flag_name: str, user_id: int, rollout_pct: int) -> bool:
# Deterministic: same user always gets same result
user_bucket = hash(f"{flag_name}:{user_id}") % 100
return user_bucket < rollout_pct
## Roll out to 1% -> 5% -> 25% -> 100%
if is_enabled_for_user('new_algorithm', user.id, rollout_pct=5):
return new_algorithm(data)
else:
return stable_algorithm(data)
GRADUAL ROLLOUT VISUALIZATION

Day 1: [█░░░░░░░░░░░░░░░░░░░]   5%  → Monitor metrics
Day 2: [███░░░░░░░░░░░░░░░░░]  15%  → Still stable?
Day 3: [██████░░░░░░░░░░░░░░]  30%  → Metrics look good
Day 4: [██████████░░░░░░░░░░]  50%  → Halfway there
Day 5: [████████████████████] 100%  → Full rollout!

If ANY issues detected → immediately roll back to 0%
4. Monitoring and Alerting
// Instrument flag usage
func CheckFeatureFlag(ctx context.Context, flagName string, userID string) bool {
enabled := flagService.IsEnabled(flagName, userID)
// Metrics
metrics.Increment("feature_flag.check", map[string]string{
"flag": flagName,
"enabled": strconv.FormatBool(enabled),
})
// Alert if kill switch activated
if strings.HasPrefix(flagName, "killswitch_") && !enabled {
alerts.Send(AlertCritical,
fmt.Sprintf("Kill switch %s activated!", flagName))
}
return enabled
}
Track these metrics (a sketch of per-path instrumentation follows this list):
- Flag check frequency
- True/false split ratio
- Error rates per code path
- Performance difference between paths
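A sketch of how that per-path split can be captured, assuming a generic metrics client with increment and timing methods:
# Sketch: tag every outcome with the flag and which path actually ran, so error
# rates and latency can be compared per code path.
import time

def run_with_flag(flag_name, enabled, new_path, old_path, metrics):
    path = "new" if enabled else "old"
    start = time.monotonic()
    try:
        result = new_path() if enabled else old_path()
        metrics.increment("feature.success", tags={"flag": flag_name, "path": path})
        return result
    except Exception:
        metrics.increment("feature.error", tags={"flag": flag_name, "path": path})
        raise
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        metrics.timing("feature.latency_ms", elapsed_ms,
                       tags={"flag": flag_name, "path": path})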
Detailed Examples
Example 1: Emergency Database Query Kill Switch
Scenario: You deploy a new analytics query that's accidentally causing database load spikes. You need to disable it NOW.
-- Feature flags table
CREATE TABLE feature_flags (
name VARCHAR(100) PRIMARY KEY,
enabled BOOLEAN NOT NULL DEFAULT false,
description TEXT,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
INSERT INTO feature_flags (name, enabled, description) VALUES
('enable_analytics_query', true, 'New user behavior analytics');
// Application code with kill switch
public class AnalyticsService {
private final FeatureFlagService flags;
public AnalyticsReport generateReport(String userId) {
// Kill switch check (fast, cached)
if (!flags.isEnabled("enable_analytics_query")) {
logger.info("Analytics query disabled, returning cached data");
return getCachedReport(userId); // Fallback to safe option
}
try {
// Potentially expensive operation
return database.executeComplexAnalytics(userId);
} catch (Exception e) {
logger.error("Analytics query failed", e);
// Automatic fallback
return getCachedReport(userId);
}
}
private AnalyticsReport getCachedReport(String userId) {
// Safe, fast alternative
return cache.get("analytics:" + userId,
() -> generateBasicReport(userId));
}
}
When database load spikes:
-- DBA runs this ONE command to fix it immediately:
UPDATE feature_flags SET enabled = false
WHERE name = 'enable_analytics_query';
Result: Within 30 seconds (one cache refresh), all servers stop running the expensive query. No deployment, no restart, no downtime.
Example 2: Canary Deployment with Automatic Rollback
Scenario: Deploy new payment processing code to 10% of users, automatically rollback if error rate exceeds threshold.
import time
from dataclasses import dataclass
from typing import Optional
@dataclass
class RolloutConfig:
flag_name: str
target_percentage: int
error_threshold: float # e.g., 0.05 = 5%
monitoring_window_seconds: int = 300 # 5 minutes
class AutoRollbackFeatureFlag:
def __init__(self, db, metrics, alerting):
self.db = db
self.metrics = metrics
self.alerting = alerting
def check_and_rollback_if_needed(self, config: RolloutConfig):
"""Background job that monitors and auto-rolls back"""
current_pct = self.get_current_percentage(config.flag_name)
if current_pct == 0:
return # Already disabled
# Check error rate for flag=true path
error_rate = self.metrics.get_error_rate(
flag=config.flag_name,
value=True,
window_seconds=config.monitoring_window_seconds
)
if error_rate > config.error_threshold:
# AUTOMATIC ROLLBACK
self.set_percentage(config.flag_name, 0)
self.alerting.send_critical(
title=f"Auto-rollback triggered: {config.flag_name}",
message=f"Error rate {error_rate:.2%} exceeded threshold "
f"{config.error_threshold:.2%}. Rolled back to 0%."
)
self.db.log_rollback_event(
flag=config.flag_name,
reason="error_rate_exceeded",
error_rate=error_rate,
timestamp=time.time()
)
def gradual_rollout(self, flag_name: str, target_pct: int,
step_pct: int = 10, step_delay_minutes: int = 30):
"""Gradually increase percentage with monitoring"""
current_pct = 0
while current_pct < target_pct:
next_pct = min(current_pct + step_pct, target_pct)
# Increase percentage
self.set_percentage(flag_name, next_pct)
print(f"Rolled out to {next_pct}%")
# Wait and monitor
time.sleep(step_delay_minutes * 60)
# Check if auto-rollback occurred
if self.get_current_percentage(flag_name) == 0:
print("Rollback detected, stopping rollout")
return False
current_pct = next_pct
return True # Successfully reached target
## Usage
rollout = AutoRollbackFeatureFlag(db, metrics, alerting)
## Start gradual rollout with safety net
rollout.gradual_rollout(
flag_name='new_payment_processor',
target_pct=100,
step_pct=10, # Increase by 10% each step
step_delay_minutes=30 # Wait 30 min between steps
)
## Background monitoring job
config = RolloutConfig(
flag_name='new_payment_processor',
target_percentage=100,
error_threshold=0.05, # Auto-rollback if >5% error rate
monitoring_window_seconds=300
)
while True:
rollout.check_and_rollback_if_needed(config)
time.sleep(60) # Check every minute
Example 3: Multi-Level Circuit Breaker
Scenario: Implement defense-in-depth with multiple levels of protection.
interface CircuitBreakerConfig {
flagName: string;
errorThreshold: number; // e.g., 10 errors
errorWindowMs: number; // e.g., 60000 (1 minute)
openCircuitDurationMs: number; // e.g., 30000 (30 seconds)
}
class CircuitBreakerFeatureFlag {
private errorCounts: Map<string, number[]> = new Map();
private circuitOpen: Map<string, number> = new Map();
isEnabled(config: CircuitBreakerConfig): boolean {
const now = Date.now();
// Level 1: Check manual kill switch
const manualFlag = featureFlagService.get(config.flagName);
if (!manualFlag) {
return false; // Manually disabled
}
// Level 2: Check if circuit breaker is open
const openUntil = this.circuitOpen.get(config.flagName);
if (openUntil && now < openUntil) {
console.log(`Circuit open for ${config.flagName}, using fallback`);
return false; // Auto-disabled due to errors
}
// Level 3: Check error rate
const recentErrors = this.getRecentErrors(config.flagName,
config.errorWindowMs);
if (recentErrors.length >= config.errorThreshold) {
// Open the circuit
this.circuitOpen.set(config.flagName,
now + config.openCircuitDurationMs);
alerting.send({
level: 'warning',
message: `Circuit breaker opened: ${config.flagName}. ` +
`${recentErrors.length} errors in ` +
`${config.errorWindowMs}ms window.`
});
return false;
}
return true; // All checks passed, feature is enabled
}
recordError(flagName: string): void {
const errors = this.errorCounts.get(flagName) || [];
errors.push(Date.now());
this.errorCounts.set(flagName, errors);
}
recordSuccess(flagName: string): void {
// If we're getting successes, consider closing circuit
const openUntil = this.circuitOpen.get(flagName);
if (openUntil && Date.now() >= openUntil) {
this.circuitOpen.delete(flagName);
this.errorCounts.delete(flagName); // Reset error count
console.log(`Circuit closed for ${flagName}, resuming normal operation`);
}
}
private getRecentErrors(flagName: string, windowMs: number): number[] {
const errors = this.errorCounts.get(flagName) || [];
const cutoff = Date.now() - windowMs;
return errors.filter(timestamp => timestamp > cutoff);
}
}
// Usage in application code
const circuitBreaker = new CircuitBreakerFeatureFlag();
const config = {
flagName: 'external_api_integration',
errorThreshold: 10,
errorWindowMs: 60000,
openCircuitDurationMs: 30000
};
async function callExternalAPI(data: any) {
if (!circuitBreaker.isEnabled(config)) {
// Fallback to cached data or degraded functionality
return getCachedData(data);
}
try {
const result = await externalAPI.call(data);
circuitBreaker.recordSuccess(config.flagName);
return result;
} catch (error) {
circuitBreaker.recordError(config.flagName);
throw error;
}
}
CIRCUIT BREAKER STATE MACHINE

        ┌───────────────┐
        │    CLOSED     │ ← Normal operation
        │   (Enabled)   │
        └───────┬───────┘
                │
         Errors exceed
           threshold
                │
                ▼
        ┌───────────────┐
        │     OPEN      │ ← Fallback active
        │  (Disabled)   │
        └───────┬───────┘
                │
          Wait timeout
             period
                │
                ▼
        ┌───────────────┐
        │   HALF-OPEN   │ ← Testing recovery
        │  (Cautious)   │
        └───────┬───────┘
                │
           ┌────┴────┐
           │         │
        Success   Failure
           │         │
           ▼         ▼
         CLOSED     OPEN
Example 4: User-Targeted Feature Flags
Scenario: Enable new feature only for internal employees first, then beta users, then everyone.
require 'digest'

class TargetedFeatureFlag
def initialize(flag_service, user_service)
@flag_service = flag_service
@user_service = user_service
end
def enabled?(flag_name, user_id)
flag = @flag_service.get(flag_name)
return false unless flag&.enabled
user = @user_service.find(user_id)
# Targeting rules (evaluated in order)
case flag.targeting_mode
when 'all'
true
when 'employees_only'
user.email.end_with?('@company.com')
when 'beta_users'
user.email.end_with?('@company.com') || user.beta_tester?
when 'percentage_rollout'
user_in_rollout_percentage?(flag_name, user_id, flag.percentage)
when 'whitelist'
flag.whitelisted_user_ids.include?(user_id)
when 'attribute_match'
evaluate_attribute_rules(user, flag.attribute_rules)
else
false # Safe default
end
end
private
def user_in_rollout_percentage?(flag_name, user_id, percentage)
# Consistent hashing ensures same user always gets same result
hash = Digest::MD5.hexdigest("#{flag_name}:#{user_id}")
hash_int = hash[0..7].to_i(16) # First 8 hex digits
bucket = hash_int % 100
bucket < percentage
end
def evaluate_attribute_rules(user, rules)
# Example: { "country" => "US", "plan" => "premium" }
rules.all? do |attribute, expected_value|
user.send(attribute) == expected_value
end
end
end
## Example usage in controller
class FeaturesController < ApplicationController
before_action :check_feature_access
def new_dashboard
if @feature_flags.enabled?('new_dashboard', current_user.id)
render 'dashboards/new'
else
render 'dashboards/legacy'
end
end
private
def check_feature_access
@feature_flags = TargetedFeatureFlag.new(FeatureFlagService.new,
UserService.new)
end
end
| Rollout Phase | Target | Duration | Rollback Risk |
|---|---|---|---|
| 1. Internal | Employees only | 1 week | Low (small user base) |
| 2. Beta | Opt-in testers | 2 weeks | Medium (friendly users) |
| 3. Canary | 5% random users | 3 days | Medium (small percentage) |
| 4. Gradual | 25% → 50% → 100% | 1 week | High (large user base) |
Common Mistakes
1. No Default Behavior
## WRONG: Crashes if flag service is down
def process_order(order):
if flag_service.is_enabled('new_processor'): # What if this throws?
return new_processor.process(order)
else:
return old_processor.process(order)
## RIGHT: Safe defaults and error handling
def process_order(order):
try:
use_new = flag_service.is_enabled('new_processor')
except Exception as e:
logger.error(f"Flag check failed: {e}")
use_new = False # Default to stable code path
if use_new:
return new_processor.process(order)
else:
return old_processor.process(order)
Why it matters: Your flag service becomes a single point of failure. Always have a fallback.
2. Forgetting to Remove Old Flags
// After 6 months, you might have:
if (flags.get('new_ui_2023')) { } // Still needed?
if (flags.get('experiment_checkout')) { } // Experiment ended 5 months ago!
if (flags.get('beta_search')) { } // This is production now!
// This creates:
// - Technical debt
// - Confusing code paths
// - Performance overhead
// - Maintenance burden
Solution: Add expiry dates and periodic cleanup:
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class FeatureFlag:
name: str
enabled: bool
created_at: datetime
expires_at: Optional[datetime] # Reminder to clean up
flag_type: str # 'release', 'ops', 'experiment'
## Automated cleanup reminder
def check_stale_flags():
stale_flags = db.query(
"SELECT name FROM feature_flags "
"WHERE expires_at < NOW() AND still_in_code = true"
)
for flag in stale_flags:
alert_team(f"Remove stale flag from code: {flag.name}")
3. Slow Flag Evaluation
// WRONG: Network call every time (could be thousands of times/second)
func ShowNewFeature(userID string) bool {
resp, _ := http.Get(fmt.Sprintf("https://flags-api.com/check?user=%s", userID))
defer resp.Body.Close()
// This is TERRIBLE for performance!
return parseResponse(resp)
}
// RIGHT: Local evaluation with cached rules
var flagCache = sync.Map{}
func ShowNewFeature(userID string) bool {
// Fast lookup from memory
if val, ok := flagCache.Load("new_feature"); ok {
return val.(bool)
}
return false // Safe default
}
// Background goroutine refreshes cache every 30s
func refreshFlagsInBackground() {
for {
flags := fetchAllFlagsFromAPI() // One batch call
for name, enabled := range flags {
flagCache.Store(name, enabled)
}
time.Sleep(30 * time.Second)
}
}
4. Complex Flag Logic
// WRONG: Too complex to understand and debug
if ((flags.newUI && user.isPremium) ||
(flags.betaAccess && user.country === 'US' && !user.hasOptedOut) ||
(flags.forceNewUI && user.email.endsWith('@company.com'))) {
// What conditions actually enabled this?!
}
// RIGHT: Clear, testable, traceable
function shouldShowNewUI(user: User): boolean {
  if (flags.newUI && user.isPremium) {
    logger.debug('New UI shown: premium_user');
    return true;
  }
  if (flags.betaAccess && user.country === 'US' && !user.hasOptedOut) {
    logger.debug('New UI shown: us_beta_user');
    return true;
  }
  if (flags.forceNewUI && user.email.endsWith('@company.com')) {
    logger.debug('New UI shown: internal_user');
    return true;
  }
  logger.debug('New UI not shown: no targeting rule matched');
  return false;
}
5. Not Monitoring Flag State Changes
## WRONG: No audit trail
def update_flag(name, enabled)
db.execute("UPDATE flags SET enabled = ? WHERE name = ?", enabled, name)
end
## RIGHT: Full audit trail
def update_flag(name, enabled, changed_by:, reason:)
old_value = db.query("SELECT enabled FROM flags WHERE name = ?", name).first
db.execute("UPDATE flags SET enabled = ? WHERE name = ?", enabled, name)
# Log every change
audit_log.create!(
flag_name: name,
old_value: old_value,
new_value: enabled,
changed_by: changed_by,
reason: reason,
timestamp: Time.now
)
# Alert team
if name.start_with?('killswitch_')
slack.post(
channel: '#ops-alerts',
message: "Kill switch #{name} #{enabled ? 'ENABLED' : 'DISABLED'} " +
"by #{changed_by}. Reason: #{reason}"
)
end
end
6. Using Flags for Configuration
## WRONG: Feature flags are not a configuration system
flag_service.set('database_host', 'db.prod.company.com')
flag_service.set('max_connections', 100)
flag_service.set('api_timeout_seconds', 30)
## These should be in proper configuration management!
## Feature flags are for ENABLING/DISABLING features, not configuration values
## RIGHT: Use flags for booleans, config for values
if feature_flags.is_enabled('use_connection_pooling'):
pool = ConnectionPool(
host=config.get('database_host'),
max_connections=config.get('max_connections')
)
Emergency Response Workflow
When production breaks and you need to use a kill switch:
EMERGENCY KILL SWITCH PROCEDURE

1. IDENTIFY THE PROBLEM
   ├── What feature/code is causing issues?
   ├── Is there a relevant kill switch?
   └── What's the blast radius?

2. ACTIVATE KILL SWITCH
   ├── Update flag in control panel/database
   ├── Verify flag propagated (check cache refresh)
   └── Confirm feature is disabled
       ├── Check metrics dashboard
       ├── Test affected endpoints
       └── Monitor error rates

3. VERIFY IMPACT
   ├── Did errors stop?
   ├── Is fallback working?
   └── Are users experiencing issues?

4. COMMUNICATE
   ├── Post in incident channel
   ├── Update status page
   └── Notify stakeholders

5. INVESTIGATE ROOT CAUSE
   ├── Collect logs/metrics
   ├── Reproduce issue in staging
   └── Develop fix

6. RE-ENABLE (GRADUALLY)
   ├── Deploy fix to staging
   ├── Test thoroughly
   ├── Enable for 1% of users
   ├── Monitor for 1 hour
   └── Gradual rollout to 100%
Pro Tip: Practice your kill switch procedure during low-traffic periods. You don't want the first time using it to be during a critical incident.
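One way to run such a drill is a small script that flips the switch exactly as you would during an incident and then confirms propagation. A sketch; the pseudo-db API, the flag_audit table, and the /flags health endpoint are assumptions for illustration:
# Sketch of a kill-switch drill: flip the flag, then poll each server until
# none of them still report the feature as enabled.
import json
import time
import urllib.request

def disable_flag(db, name, changed_by, reason):
    # Flip the switch and leave an audit record (flag_audit is an assumed table).
    db.execute("UPDATE feature_flags SET enabled = false WHERE name = ?", name)
    db.execute("INSERT INTO flag_audit (flag, new_value, changed_by, reason) "
               "VALUES (?, false, ?, ?)", name, changed_by, reason)

def wait_for_propagation(servers, flag_name, timeout_s=60):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        pending = []
        for host in servers:
            with urllib.request.urlopen(f"http://{host}/flags") as resp:
                flags = json.load(resp)
            if flags.get(flag_name, False):
                pending.append(host)   # this host has not picked up the change yet
        if not pending:
            return True                # every host sees the flag as off
        print(f"Still waiting on: {pending}")
        time.sleep(5)
    return False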
Key Takeaways
- Feature flags are safety nets: they let you disable code instantly without deployment
- Kill switches are permanent: critical features should always have an emergency shutoff
- Default to safety: when flag evaluation fails, fall back to stable code paths
- Keep evaluation fast: cache flags locally, avoid network calls in hot paths
- Gradual rollout: start with 1-5%, monitor metrics, increase slowly
- Monitor everything: track flag state changes and error rates per code path
- Clean up regularly: remove expired flags to reduce technical debt
- Audit all changes: log who changed what flag and why
- Test the off switch: verify your fallback code actually works
- Document flag purpose: future you will forget why this flag exists
Quick Reference Card: Kill Switch Checklist
| Characteristic | Requirement |
|---|---|
| Scope | One kill switch per risky feature |
| Speed | Takes effect within 30 seconds |
| Safety | Graceful fallback to stable behavior |
| Monitoring | Alert when toggled, track metrics |
| Testing | Regularly test that the off state works (see the sketch below) |
| Documentation | Clear description and runbook |
| Access | On-call engineer can toggle 24/7 |
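The Testing row is the one most often neglected, so here is a minimal pytest-style sketch of what "test the off state" can look like. The tiny ReportService mirrors the earlier analytics example but is illustrative, not that code verbatim:
# Minimal pytest-style sketch: with the flag forced off, the expensive path
# must never run and the fallback must still produce an answer.
class FakeFlags:
    def __init__(self, overrides):
        self.overrides = overrides
    def is_enabled(self, name):
        return self.overrides.get(name, False)

class ReportService:
    def __init__(self, flags, expensive_query, cached_report):
        self.flags = flags
        self.expensive_query = expensive_query
        self.cached_report = cached_report
    def generate(self, user_id):
        if not self.flags.is_enabled("enable_analytics_query"):
            return self.cached_report(user_id)   # kill switch off -> fallback
        return self.expensive_query(user_id)

def test_kill_switch_uses_fallback_and_skips_expensive_path():
    calls = {"expensive": 0}
    def expensive(user_id):
        calls["expensive"] += 1
        return "full report"
    service = ReportService(
        flags=FakeFlags({"enable_analytics_query": False}),
        expensive_query=expensive,
        cached_report=lambda user_id: "cached report",
    )
    assert service.generate("user-123") == "cached report"
    assert calls["expensive"] == 0   # the off switch really means off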
Further Study
Feature Toggles (Martin Fowler) - https://martinfowler.com/articles/feature-toggles.html - Comprehensive guide to feature flag patterns and antipatterns
LaunchDarkly Feature Flag Guide - https://launchdarkly.com/blog/what-are-feature-flags/ - Best practices from a production flag management platform
Site Reliability Engineering Book (Google) - https://sre.google/sre-book/managing-critical-state/ - Chapter on managing rollouts and emergency controls at scale