Redesign vs Accept
Choosing what to fix versus what to tolerate
Redesign vs Accept: Strategic Technical Decision-Making
Master the critical art of deciding when to redesign versus accepting technical debt with free flashcards and spaced repetition practice. This lesson covers decision frameworks for architecture choices, risk assessment strategies, and practical techniques for evaluating refactoring ROI. These are essential skills for senior engineers, technical leads, and anyone making high-stakes technical decisions under pressure.
Welcome to the Crossroads of Engineering Judgment
💻 Every software engineer eventually faces this moment: staring at legacy code or a flawed system, weighing whether to rebuild from scratch or make peace with imperfection. This isn't just a technical decision; it's a business judgment that can make or break projects.
In the heat of debugging under pressure, you've accumulated battle scars: workarounds, patches, and architectural compromises. Now comes the harder question: Which scars should you accept as part of your system's history, and which demand surgical reconstruction?
This lesson transforms your debugging experiences into strategic architecture decisions. You'll learn frameworks that senior engineers use to make these calls with confidence, even when both options feel risky.
Core Concepts: The Decision Framework
🎯 The Two Paths
When you've identified a systemic issueโwhether through debugging, monitoring, or user feedbackโyou face two fundamental strategies:
| Strategy | Description | Best For | Risk Profile |
|---|---|---|---|
| 🔧 Accept & Mitigate | Keep the current design; add guardrails, monitoring, documentation | Stable systems, low-frequency issues, time-critical delivery | Known, bounded risks |
| 🏗️ Redesign & Replace | Rebuild the problematic component with a new architecture | Cascading failures, scaling blockers, security vulnerabilities | Unknown risks during transition |
The key insight: Neither is universally correct. Context drives the decision.
📊 The Cost-Benefit Matrix
Every redesign vs. accept decision involves four cost dimensions:
DECISION COST ANALYSIS

| Dimension | Accept | Redesign |
|---|---|---|
| 💰 Direct Cost | Low (patches) | High (rebuild) |
| ⏱️ Time | Days to weeks | Months |
| 🎯 Opportunity | Medium | High (blocked) |
| ⚠️ Risk | Known bugs | Unknown issues |
| 📈 Long-term Impact | Accumulating technical debt | Pays down debt (if successful) |
💡 Pro Tip: The "opportunity cost" is often the hidden killer. Every week spent redesigning is a week not building new features. Calculate what you're not doing.
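To make that opportunity cost visible, it helps to put both paths in the same unit (engineer-weeks) before arguing about them. A minimal back-of-the-envelope sketch; every number and name below is an illustrative assumption, not a figure from this lesson:

# Rough comparison of both paths over a planning horizon, in engineer-weeks.
HORIZON_WEEKS = 52  # assumed planning horizon

def accept_cost(patch_weeks=1, weekly_drag=0.2):
    # patch effort plus the ongoing drag of living with the flaw
    return patch_weeks + weekly_drag * HORIZON_WEEKS

def redesign_cost(rebuild_weeks=12, blocked_feature_weeks=12):
    # rebuild effort plus feature work that stalls while the team rebuilds
    return rebuild_weeks + blocked_feature_weeks

print(f"Accept:   {accept_cost():.1f} engineer-weeks over the year")
print(f"Redesign: {redesign_cost():.1f} engineer-weeks, mostly up front")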
🧮 The DORA Metrics Lens
High-performing engineering teams evaluate decisions through four key metrics:
- Deployment Frequency: Will this decision speed up or slow down releases?
- Lead Time for Changes: How long until we can iterate again?
- Time to Restore Service: Does this make recovery easier or harder?
- Change Failure Rate: What's the blast radius if this goes wrong?
DECISION IMPACT ON DORA METRICS

| Metric | Accept | Redesign |
|---|---|---|
| Deployment Frequency | 🟢 Maintained | 🔴 Halted during rebuild |
| Lead Time | 🟡 Slowly degrading | 🟢 Improved after completion |
| Restore Time | 🔴 May worsen | 🟢 Should improve |
| Failure Rate | 🟡 Known rate | ❓ Unknown initially |

🟢 = Positive 🟡 = Neutral 🔴 = Negative ❓ = Uncertain
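If you prefer numbers to gut feel, two of these metrics fall straight out of your deploy history. A minimal sketch, assuming a simple invented list of deploy records rather than any particular CI/CD tool:

from datetime import date

# Illustrative deploy log; replace with data from your own pipeline.
deploys = [
    {"day": date(2024, 3, 1), "failed": False},
    {"day": date(2024, 3, 3), "failed": True},
    {"day": date(2024, 3, 8), "failed": False},
]

days_observed = (deploys[-1]["day"] - deploys[0]["day"]).days + 1
deployment_frequency = len(deploys) / days_observed              # deploys per day
change_failure_rate = sum(d["failed"] for d in deploys) / len(deploys)

print(f"Deployment frequency: {deployment_frequency:.2f}/day")
print(f"Change failure rate:  {change_failure_rate:.0%}")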
🔍 Risk Assessment: The Failure Mode Analysis
Before deciding, map out failure scenarios for both paths:
Accept Failure Modes:
- The workaround fails under unexpected load
- New features compound the underlying issue
- Team morale erodes from working with bad code
- You eventually redesign anywayโbut later, at higher cost
Redesign Failure Modes:
- The new design has its own unforeseen issues
- Migration introduces data corruption or downtime
- Timeline overruns kill business opportunities
- You recreate the same problems in new code
💡 The Second-System Effect: Redesigns often try to fix everything, becoming overengineered. Beware perfectionism.
🌱 The Strangler Fig Pattern: The Third Way
Sometimes you don't have to choose! The Strangler Fig Pattern lets you redesign incrementally:
STRANGLER FIG PATTERN EVOLUTION

- Phase 1: Wrap Legacy. A new facade sits in front of the untouched legacy system, which still handles all traffic.
- Phase 2: Route New Code. The facade sends a growing share of traffic (say 20%) to new modules while the legacy system handles the remaining 80%.
- Phase 3: Complete Migration. New modules handle 100% of traffic and the legacy system is removed.
This approach:
- ✅ Delivers value continuously
- ✅ Allows learning and course correction
- ✅ Reduces big-bang migration risk
- ❌ Requires maintaining two systems temporarily
- ❌ Demands excellent testing and monitoring
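In code, the pattern usually boils down to a thin routing facade that decides, per request, whether the new or the legacy implementation answers. A minimal sketch; the class and attribute names are invented for illustration, and a real rollout would typically key the split on a stable user or request ID rather than randomness:

import random

class CheckoutFacade:
    """Routes traffic between legacy and new implementations (strangler fig)."""

    def __init__(self, legacy, modern, modern_traffic_pct=20):
        self.legacy = legacy
        self.modern = modern
        self.modern_traffic_pct = modern_traffic_pct  # raise phase by phase: 20 -> 50 -> 100

    def process(self, request):
        if random.uniform(0, 100) < self.modern_traffic_pct:
            return self.modern.process(request)   # Phase 2+: new module handles it
        return self.legacy.process(request)       # otherwise the legacy path still serves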
🌳 The Decision Tree
Use this flowchart to guide your decision:
REDESIGN VS ACCEPT DECISION TREE

1. Is this a security or data integrity issue?
   - YES → REDESIGN NOW (no question).
   - NO → go to 2.
2. Can you contain the blast radius?
   - NO → REDESIGN NOW (high cascade risk).
   - YES → go to 3.
3. Does the issue block the strategic roadmap?
   - YES → Calculate ROI (rebuild cost, feature delay, unlocked value); REDESIGN if ROI > 3:1.
   - NO → go to 4.
4. Can you live with workarounds for 6-12 months?
   - YES → ACCEPT and add monitoring.
   - NO → REDESIGN, or use the Strangler Fig pattern.
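The ROI branch of the tree is plain arithmetic once every input is expressed in the same unit (engineer-weeks or dollars). A sketch with made-up inputs, using the 3:1 threshold from the tree:

def redesign_roi(rebuild_cost, feature_delay_cost, unlocked_value):
    # value unlocked by the redesign divided by its total cost
    return unlocked_value / (rebuild_cost + feature_delay_cost)

roi = redesign_roi(rebuild_cost=12, feature_delay_cost=8, unlocked_value=80)
print(f"ROI {roi:.1f}:1 -> {'redesign' if roi >= 3 else 'accept, revisit later'}")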
🧪 The Scientific Method for Architecture
Treat your decision as a hypothesis:
Hypothesis Template:
"If we [ACCEPT/REDESIGN] the [COMPONENT], then [PREDICTED OUTCOME] will occur within [TIMEFRAME], which we'll measure by [METRIC]."
Example (Accept):
"If we accept the current authentication service and add rate limiting, then unauthorized access attempts will decrease by 80% within 2 weeks, measured by failed login attempts in our security logs."
Example (Redesign):
"If we redesign the order processing pipeline using an event-driven architecture, then order throughput will increase 5x within 3 months, measured by orders processed per minute."
💡 Pro Tip: Write down your hypothesis before deciding. This creates accountability and clear success criteria.
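One lightweight way to "write it down" is to store the hypothesis as structured data next to your decision log, so the later success check is mechanical. A sketch; the field names are illustrative, not a prescribed format:

from dataclasses import dataclass
from datetime import date

@dataclass
class DecisionHypothesis:
    decision: str           # "ACCEPT" or "REDESIGN"
    component: str
    predicted_outcome: str
    review_date: date       # when to check the result
    metric: str             # where the number comes from

hypothesis = DecisionHypothesis(
    decision="ACCEPT",
    component="authentication service",
    predicted_outcome="unauthorized access attempts drop 80%",
    review_date=date(2025, 1, 15),
    metric="failed login attempts in security logs",
)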
🎭 Real-World Scenario: The Payment Gateway
Let's walk through a concrete example:
The Problem: Your payment processing system occasionally fails during peak traffic (Black Friday). Error rate spikes to 2% (normally 0.1%). Debug logs reveal the issue: a monolithic payment gateway written 5 years ago that locks a database row for 500ms per transaction.
Option 1: Accept & Mitigate
## Current code (simplified)
class PaymentGateway:
def process_payment(self, order_id, amount):
with db.transaction():
# Locks the order row
order = Order.get_and_lock(order_id)
# External API call (slow!)
result = payment_api.charge(amount)
# Still holding lock...
order.status = 'paid' if result.success else 'failed'
order.save()
return result
## Mitigation: Add async processing
class PaymentGateway:
def process_payment(self, order_id, amount):
# Immediately queue for processing
payment_queue.enqueue({
'order_id': order_id,
'amount': amount,
            'timestamp': time.time()
})
return {'status': 'pending', 'order_id': order_id}
def background_processor(self):
# Process payments without holding locks
while True:
job = payment_queue.dequeue()
result = payment_api.charge(job['amount'])
# Quick update, no long lock
Order.update_status(job['order_id'], result.status)
Cost: 3 engineer-days to implement queue, test, and deploy. Risk: Eventual consistency (users see "pending" briefly). Benefit: Solves 95% of the problem, ships this week.
Option 2: Redesign with Event Sourcing
## New event-driven architecture
class PaymentService:
def initiate_payment(self, order_id, amount):
event = PaymentInitiated(
order_id=order_id,
amount=amount,
            timestamp=time.time()
)
event_store.append(event)
return event.id
class PaymentProcessor:
def handle_payment_initiated(self, event):
result = payment_api.charge(event.amount)
if result.success:
event_store.append(PaymentSucceeded(
order_id=event.order_id,
transaction_id=result.transaction_id
))
else:
event_store.append(PaymentFailed(
order_id=event.order_id,
reason=result.error
))
class OrderProjection:
# Read model built from events
def get_payment_status(self, order_id):
events = event_store.get_for_order(order_id)
# Rebuild state from events
return self._calculate_status(events)
Cost: 8-12 engineer-weeks (new event store, migration, testing). Risk: Event store could have its own bugs; complex migration. Benefit: Highly scalable, full audit trail, enables future features.
The Decision:
For most companies, Option 1 (Accept) is correct because:
- ✅ Problem occurs 2 days per year (Black Friday, Cyber Monday)
- ✅ 3-day fix vs. 3-month project
- ✅ Lets team ship holiday features
- ✅ Proven queue pattern (low risk)
When Option 2 (Redesign) would be right:
- ⚠️ Payment failures happen daily
- ⚠️ Planning to 10x transaction volume
- ⚠️ Need detailed audit trail for compliance
- ⚠️ Monolith blocks other critical improvements
Examples: Real Decisions from the Field
Example 1: The Database Schema Mistake 🗄️
Background: A social media startup's user table has a preferences TEXT column storing JSON. As features grew, queries became slow (no indexing on JSON fields), and data inconsistencies emerged.
Debugging Scars:
- Added validation layer to catch malformed JSON
- Cached frequent queries to reduce DB load
- Created data cleanup scripts for inconsistencies
The Crossroads:
Accept Path:
-- Add materialized columns for common queries
ALTER TABLE users
ADD COLUMN email_notifications_enabled BOOLEAN
GENERATED ALWAYS AS (
CAST(JSON_EXTRACT(preferences, '$.notifications.email') AS BOOLEAN)
) STORED;
CREATE INDEX idx_email_notif ON users(email_notifications_enabled);
-- Cost: 2 days, downtime: 1 hour, risk: low
Redesign Path:
-- New normalized schema
CREATE TABLE user_preferences (
user_id BIGINT,
preference_key VARCHAR(100),
preference_value VARCHAR(500),
PRIMARY KEY (user_id, preference_key),
INDEX (preference_key, preference_value)
);
-- Migration for 10M users
-- Cost: 3 weeks, downtime: 4 hours, risk: medium
What They Chose: Accept with materialized columns. Why?
- Growth was linear, not exponential (no urgent scaling need)
- Migration risk too high for 10M active users
- Materialized columns solved 80% of query performance issues
- Bought time to redesign properly in v2 platform
Outcome: Successfully scaled to 25M users before eventually migrating 18 months later during a planned platform upgrade.
Example 2: The Microservices Monolith 🏗️
Background: An e-commerce company's checkout service became a "distributed monolith": 20 microservices so tightly coupled that deploying one required coordinating 8 others.
Debugging Scars:
- Created deployment runbooks (45 minutes to deploy)
- Added extensive integration testing (2-hour test suite)
- Implemented feature flags to decouple releases
The Crossroads:
Accept Path:
## Improve coordination without restructuring
class DeploymentCoordinator:
def deploy_checkout_change(self, service, version):
# Automated rollout order
dependency_graph = {
'cart': ['inventory', 'pricing'],
'checkout': ['cart', 'payment', 'shipping'],
'order': ['checkout', 'notification']
}
# Deploy dependencies first
for dep in dependency_graph[service]:
self.ensure_compatible_version(dep)
# Canary deployment
self.deploy_canary(service, version, traffic_pct=5)
self.monitor_metrics(duration='10m')
if self.metrics_healthy():
self.deploy_full(service, version)
## Cost: 2 weeks for automation, ongoing: 20min deploys
Redesign Path:
## Merge related services, define clear boundaries
## BEFORE: 20 services
## cart-service, cart-validator, cart-pricer...
## AFTER: 6 bounded contexts
class CheckoutDomain:
"""Encapsulates entire checkout flow"""
# Internal modules (not separate services)
cart_module: CartManager
pricing_module: PricingEngine
inventory_module: InventoryChecker
def process_checkout(self, cart_id):
# All coordination happens in-process
# External calls only for payment, shipping
pass
## Cost: 6 months to consolidate, test, migrate traffic
What They Chose: Redesign (with Strangler Fig approach). Why?
- Deployment friction was killing velocity (1 feature = 3 weeks)
- Every bug required 5 teams to coordinate
- New engineers took 6 months to understand the architecture
- The "microservices" were actually modules forced into services
Outcome: Over 9 months, consolidated to 6 services. Deployment time dropped to 10 minutes, feature velocity increased 3x, and on-call burden decreased 60%.
Example 3: The Caching Layer That Lied 🗞️
Background: A news website added Redis caching to speed up article pages. Cached data occasionally became stale, showing outdated headlines. Debugging revealed race conditions in cache invalidation logic.
Debugging Scars:
- Added TTL to force refresh every 5 minutes
- Implemented cache warming on article updates
- Created monitoring for cache hit rates
The Crossroads:
Accept Path:
## Current: Cache-aside pattern with manual invalidation
class ArticleCache:
def get_article(self, article_id):
cached = redis.get(f'article:{article_id}')
if cached:
return json.loads(cached)
# Cache miss
article = db.query_article(article_id)
redis.setex(
f'article:{article_id}',
300, # 5 minute TTL
json.dumps(article)
)
return article
def update_article(self, article_id, content):
db.update_article(article_id, content)
# Manual invalidation (race condition prone!)
redis.delete(f'article:{article_id}')
## Mitigation: Add versioning
class ArticleCache:
def update_article(self, article_id, content):
version = redis.incr(f'article:{article_id}:version')
db.update_article(article_id, content, version)
redis.delete(f'article:{article_id}')
# Next read will get new version
## Cost: 3 days
Redesign Path:
## Event-driven cache invalidation
class ArticleEventStream:
def update_article(self, article_id, content):
# Write to DB
article = db.update_article(article_id, content)
# Publish event (guaranteed delivery)
        event_bus.publish(ArticleUpdated(
            article_id=article_id,
            author_id=article.author_id,    # included so the worker can invalidate author caches
            category=article.category,      # included so the worker can invalidate category caches
            version=article.version,
            timestamp=time.time()
        ))
class CacheInvalidationWorker:
def handle_article_updated(self, event):
# Invalidate all affected caches
redis.delete(f'article:{event.article_id}')
redis.delete(f'author:{event.author_id}:articles')
redis.delete(f'category:{event.category}:latest')
# Warm cache immediately
article = db.query_article(event.article_id)
redis.setex(
f'article:{event.article_id}',
3600,
json.dumps(article)
)
## Cost: 3 weeks (event bus, workers, testing)
What They Chose: Accept with versioning. Why?
- Stale cache only affected <0.1% of page views
- 5-minute TTL acceptable for news content
- Event bus infrastructure didn't exist yet
- Versioning eliminated race conditions for <5% of the cost
Outcome: Stale cache incidents dropped 95%. Team later added event bus for other features, then migrated caching to use it, but only after proving its value elsewhere.
Example 4: The Authentication Time Bomb ⏰
Background: A SaaS platform used home-built JWT authentication with a custom encryption scheme. A security audit flagged it as vulnerable to timing attacks and lacking key rotation.
Debugging Scars:
- Added rate limiting to slow down attack attempts
- Implemented anomaly detection for suspicious auth patterns
- Created incident response runbooks
The Crossroads:
Accept Path:
// Current: Custom JWT implementation
function verifyToken(token) {
const [header, payload, signature] = token.split('.');
const expectedSig = customHMAC(header + payload, SECRET_KEY);
// Vulnerable to timing attack!
if (signature !== expectedSig) {
throw new Error('Invalid signature');
}
return JSON.parse(base64Decode(payload));
}
// Mitigation: Add constant-time comparison
function verifyToken(token) {
const [header, payload, signature] = token.split('.');
const expectedSig = customHMAC(header + payload, SECRET_KEY);
// Constant-time comparison
if (!crypto.timingSafeEqual(
Buffer.from(signature),
Buffer.from(expectedSig)
)) {
throw new Error('Invalid signature');
}
return JSON.parse(base64Decode(payload));
}
// Still missing: key rotation, standard algorithms
// Cost: 1 day
Redesign Path:
// Replace with industry-standard library
const jwt = require('jsonwebtoken');
const jwksClient = require('jwks-rsa');
const client = jwksClient({
jwksUri: 'https://your-domain/.well-known/jwks.json',
cache: true,
rateLimit: true
});
function verifyToken(token) {
return new Promise((resolve, reject) => {
jwt.verify(token, getKey, {
algorithms: ['RS256'],
issuer: 'your-domain',
audience: 'your-api'
}, (err, decoded) => {
if (err) reject(err);
else resolve(decoded);
});
});
}
function getKey(header, callback) {
client.getSigningKey(header.kid, (err, key) => {
if (err) return callback(err);
callback(null, key.getPublicKey());
});
}
// Includes: key rotation, standard algorithms, tested library
// Cost: 2 weeks (migration, testing, rollout)
What They Chose: Redesign immediately. Why?
- 🚨 Security vulnerabilities are non-negotiable
- Custom crypto is almost always wrong ("don't roll your own")
- Standard libraries are battle-tested by millions
- The mitigation didn't address underlying flaws (no key rotation)
- Risk of breach far exceeded cost of 2-week project
Outcome: Completed migration in 10 days with zero downtime using a dual-verification period where both old and new tokens were accepted. Security audit passed on retest.
💡 Key Lesson: Security and data integrity issues collapse the decision tree; redesign wins almost always.
Common Mistakes to Avoid
❌ Mistake 1: The Sunk Cost Fallacy
The Error: "We've already spent 3 months on this design, we can't change now."
Why It's Wrong: Past time is gone whether you accept or redesign. Only future costs and benefits matter.
Example:
## Spent 3 months building a custom message queue
import threading

class CustomQueue:
def __init__(self):
self.storage = [] # In-memory only!
self.lock = threading.Lock()
def enqueue(self, item):
with self.lock:
self.storage.append(item)
def dequeue(self):
with self.lock:
return self.storage.pop(0) if self.storage else None
## Discovery: It's not durable (loses data on restart)
## Fixing this properly requires... rebuilding it
## WRONG THINKING: "We've invested 3 months, let's add persistence"
## RIGHT THINKING: "Should we spend 1 more month patching, or
## 2 weeks migrating to Redis/RabbitMQ?"
Better Approach: Use Redis or RabbitMQ from day 1. The shortcut meant to "save 2 weeks" turned into 3 months of wasted work.
❌ Mistake 2: Perfect Is the Enemy of Done
The Error: Redesigning to fix everything instead of the actual problem.
Why It's Wrong: Scope creep kills projects. The redesign takes 3x longer than estimated and ships with its own bugs.
Example:
## Original problem: Slow database queries
## Targeted fix (Accept path):
## - Add indexes: 1 day
## - Query optimization: 2 days
## - Total: 3 days, solves the problem
## Overengineered redesign:
class UltimateCachingFramework:
"""Let's solve caching forever!"""
def __init__(self):
self.l1_cache = {} # In-memory
self.l2_cache = RedisCache() # Redis
self.l3_cache = MemcachedCache() # Memcached backup
self.cdn_cache = CloudFlarePurger() # CDN layer
# Automatic cache warming
self.predictor = MLCachePredictor() # ML to predict access!
# Distributed invalidation
self.event_bus = KafkaEventBus()
# Monitoring
self.metrics = PrometheusExporter()
# 47 more methods...
## Result: 4 months, still not done, original problem still exists
💡 Fix: Solve the problem you have, not the problem you imagine. Add complexity only when needed.
❌ Mistake 3: Ignoring Team Capacity
The Error: Choosing redesign without considering your team's actual bandwidth.
Why It's Wrong: Junior engineers maintaining critical systems during a redesign leads to outages. Senior engineers context-switching between new and old systems deliver neither well.
Reality Check:
TEAM CAPACITY CALCULATION
5 engineers × 40 hours/week = 200 hours
Minus:
- On-call rotation: -20 hours/week
- Meetings/planning: -30 hours/week
- Bug fixes/support: -25 hours/week
- Code review: -15 hours/week
---------------------------------
Available for new work: 110 hours/week
Redesign estimate: 400 hours
Realistic timeline: 4 weeks (not 2!)
During those 4 weeks:
- Feature work stops
- Technical debt accumulates elsewhere
- Opportunity cost: What could you ship instead?
💡 Fix: Be brutally honest about capacity. Consider hiring contractors for the redesign or postponing other work.
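The capacity reality check above is easy to script, so estimates get sanity-checked before anyone commits to a date. A sketch using the illustrative overhead numbers from this section:

def realistic_weeks(engineers, redesign_hours, overhead_hours_per_week=90):
    # gross hours minus on-call, meetings, support, and review overhead
    available = engineers * 40 - overhead_hours_per_week
    return redesign_hours / available

print(f"{realistic_weeks(5, 400):.1f} weeks")  # ~3.6 weeks, i.e. about 4, not 2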
❌ Mistake 4: No Rollback Plan
The Error: Committing to a redesign without a way to revert if it fails.
Why It's Wrong: Murphy's Law applies. Production issues emerge that testing missed.
Brittle Approach:
## Big-bang migration
def migrate_to_new_system():
print("Shutting down old system...")
old_system.stop()
print("Starting migration...")
migrate_all_data() # 4 hours, no going back
print("Starting new system...")
new_system.start()
# If new_system fails, you're offline for hours
Robust Approach:
## Parallel run with feature flag
def process_request(request):
if feature_flags.new_system_enabled():
# Primary: new system
result_new = new_system.process(request)
# Shadow: old system (compare results)
result_old = old_system.process(request)
metrics.compare(result_new, result_old)
return result_new
else:
# Rollback: instant, just flip flag
return old_system.process(request)
## Gradual rollout:
## Week 1: 5% traffic to new system
## Week 2: 25% if metrics look good
## Week 3: 50%
## Week 4: 100% or rollback if issues found
💡 Fix: Always have a rollback plan that takes <5 minutes to execute.
❌ Mistake 5: Redesigning Without Understanding Why It Failed
The Error: Rebuilding the same thing in a different language/framework without fixing the root cause.
Why It's Wrong: You'll recreate the same problems.
Example:
## Original Ruby service (slow):
class OrderProcessor
def process(order)
# Synchronous API calls
inventory.reserve(order.items)
payment.charge(order.total)
shipping.create_label(order)
order.complete!
end
end
## "Redesigned" in Go (still slow!):
func processOrder(order Order) error {
// Still synchronous!
if err := inventory.Reserve(order.Items); err != nil {
return err
}
if err := payment.Charge(order.Total); err != nil {
return err
}
if err := shipping.CreateLabel(order); err != nil {
return err
}
return order.Complete()
}
// Problem wasn't Ruby; it was synchronous design!
// Should have been async regardless of language
💡 Fix: Conduct a blameless postmortem to understand why the current design struggles before redesigning.
Key Takeaways
📋 Quick Reference: Redesign vs Accept Decision Framework
| Factor | Accept | Redesign |
|---|---|---|
| Security/Data Integrity | ❌ Never for critical issues | ✅ Always for vulnerabilities |
| Frequency | ✅ Rare problems (< monthly) | ✅ Daily pain points |
| Blast Radius | ✅ Isolated, containable | ✅ Cascading failures |
| Timeline | ✅ Days to weeks | ⚠️ Months (use Strangler Fig) |
| Opportunity Cost | ✅ Low (doesn't block features) | ⚠️ High (delays roadmap) |
| Team Capacity | ✅ Can be done alongside other work | ❌ Requires dedicated focus |
| Risk | ✅ Known, bounded | ⚠️ Unknown during transition |
🎯 Golden Rules
- Security first: Always redesign for security/data integrity issues
- Measure twice, cut once: Write down your hypothesis and success metrics
- Consider the third way: Strangler Fig lets you redesign incrementally
- ROI threshold: Redesign should provide 3:1 value over cost minimum
- Rollback always: Never commit to a one-way door without an escape plan
- Sunk costs don't matter: Only future value counts
- Solve the problem you have: Not the problem you imagine
- Team capacity is real: Account for it honestly
🧠 Decision Shortcuts
Redesign immediately if:
- 🚨 Security vulnerability
- 🚨 Data corruption risk
- 🚨 Cascading failures across systems
- 🚨 Regulatory compliance requirement
Accept if:
- ✅ Problem occurs < once per month
- ✅ Workaround takes < 1 week to implement
- ✅ No blocking effect on roadmap
- ✅ Team bandwidth is constrained
- ✅ Redesign cost > 3x the mitigation cost
Use Strangler Fig if:
- 🌱 Need continuous delivery during transition
- 🌱 System is too large for big-bang migration
- 🌱 Want to validate new design incrementally
- 🌱 Can't afford extended feature freeze
📚 Further Study
Martin Fowler's Refactoring Catalog - https://refactoring.com/catalog/ - Comprehensive guide to code-level redesign decisions with before/after examples
The Strangler Fig Application Pattern - https://martinfowler.com/bliki/StranglerFigApplication.html - Detailed explanation of incremental migration strategies from the pattern's creator
Technology Radar: Adopt, Trial, Assess, Hold - https://www.thoughtworks.com/radar - Framework for evaluating when to adopt new technologies vs. accepting current tools, updated quarterly by ThoughtWorks