Redesign vs Accept
Choosing what to fix versus what to tolerate
Redesign vs Accept: Strategic Technical Decision-Making
Master the critical art of deciding when to redesign versus accepting technical debt with free flashcards and spaced repetition practice. This lesson covers decision frameworks for architecture choices, risk assessment strategies, and practical techniques for evaluating refactoring ROI. These are essential skills for senior engineers, technical leads, and anyone making high-stakes technical decisions under pressure.
Welcome to the Crossroads of Engineering Judgment
💻 Every software engineer eventually faces this moment: staring at legacy code or a flawed system, weighing whether to rebuild from scratch or make peace with imperfection. This isn't just a technical decision; it's a business judgment that can make or break projects.
In the heat of debugging under pressure, you've accumulated battle scars: workarounds, patches, and architectural compromises. Now comes the harder question: Which scars should you accept as part of your system's history, and which demand surgical reconstruction?
This lesson transforms your debugging experiences into strategic architecture decisions. You'll learn frameworks that senior engineers use to make these calls with confidence, even when both options feel risky.
Core Concepts: The Decision Framework
🎯 The Two Paths
When you've identified a systemic issueโwhether through debugging, monitoring, or user feedbackโyou face two fundamental strategies:
| Strategy | Description | Best For | Risk Profile |
|---|---|---|---|
| 🔧 Accept & Mitigate | Keep the current design; add guardrails, monitoring, documentation | Stable systems, low-frequency issues, time-critical delivery | Known, bounded risks |
| 🏗️ Redesign & Replace | Rebuild the problematic component with a new architecture | Cascading failures, scaling blockers, security vulnerabilities | Unknown risks during transition |
The key insight: Neither is universally correct. Context drives the decision.
📊 The Cost-Benefit Matrix
Every redesign vs. accept decision involves four cost dimensions:
DECISION COST ANALYSIS

| Dimension | Accept | Redesign |
|---|---|---|
| 💰 Direct Cost | Low (patches) | High (rebuild) |
| ⏱️ Time | Days to weeks | Months |
| 🎯 Opportunity | Medium | High (blocked) |
| ⚠️ Risk | Known bugs | Unknown issues |
| 📈 Long-term Impact | Accumulating technical debt | Pays down debt (if successful) |
💡 Pro Tip: The "opportunity cost" is often the hidden killer. Every week spent redesigning is a week not building new features. Calculate what you're not doing.
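To make that opportunity cost visible, it helps to put both paths in the same unit (engineer-weeks) before arguing about them. A minimal back-of-the-envelope sketch; every number and name below is an illustrative assumption, not a figure from this lesson:

# Rough comparison of both paths over a planning horizon, in engineer-weeks.
HORIZON_WEEKS = 52  # assumed planning horizon

def accept_cost(patch_weeks=1, weekly_drag=0.2):
    # patch effort plus the ongoing drag of living with the flaw
    return patch_weeks + weekly_drag * HORIZON_WEEKS

def redesign_cost(rebuild_weeks=12, blocked_feature_weeks=12):
    # rebuild effort plus feature work that stalls while the team rebuilds
    return rebuild_weeks + blocked_feature_weeks

print(f"Accept:   {accept_cost():.1f} engineer-weeks over the year")
print(f"Redesign: {redesign_cost():.1f} engineer-weeks, mostly up front")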
🧮 The DORA Metrics Lens
High-performing engineering teams evaluate decisions through four key metrics:
- Deployment Frequency: Will this decision speed up or slow down releases?
- Lead Time for Changes: How long until we can iterate again?
- Time to Restore Service: Does this make recovery easier or harder?
- Change Failure Rate: What's the blast radius if this goes wrong?
DECISION IMPACT ON DORA METRICS

| Metric | Accept | Redesign |
|---|---|---|
| Deployment Frequency | 🟢 Maintained | 🔴 Halted during rebuild |
| Lead Time | 🟡 Slowly degrading | 🟢 Improved after completion |
| Restore Time | 🔴 May worsen | 🟢 Should improve |
| Failure Rate | 🟡 Known rate | ❓ Unknown initially |

🟢 = Positive 🟡 = Neutral 🔴 = Negative ❓ = Uncertain
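If you prefer numbers to gut feel, two of these metrics fall straight out of your deploy history. A minimal sketch, assuming a simple invented list of deploy records rather than any particular CI/CD tool:

from datetime import date

# Illustrative deploy log; replace with data from your own pipeline.
deploys = [
    {"day": date(2024, 3, 1), "failed": False},
    {"day": date(2024, 3, 3), "failed": True},
    {"day": date(2024, 3, 8), "failed": False},
]

days_observed = (deploys[-1]["day"] - deploys[0]["day"]).days + 1
deployment_frequency = len(deploys) / days_observed              # deploys per day
change_failure_rate = sum(d["failed"] for d in deploys) / len(deploys)

print(f"Deployment frequency: {deployment_frequency:.2f}/day")
print(f"Change failure rate:  {change_failure_rate:.0%}")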
🔍 Risk Assessment: The Failure Mode Analysis
Before deciding, map out failure scenarios for both paths:
Accept Failure Modes:
- The workaround fails under unexpected load
- New features compound the underlying issue
- Team morale erodes from working with bad code
- You eventually redesign anywayโbut later, at higher cost
Redesign Failure Modes:
- The new design has its own unforeseen issues
- Migration introduces data corruption or downtime
- Timeline overruns kill business opportunities
- You recreate the same problems in new code
💡 The Second-System Effect: Redesigns often try to fix everything, becoming overengineered. Beware perfectionism.
🌱 The Strangler Fig Pattern: The Third Way
Sometimes you don't have to choose! The Strangler Fig Pattern lets you redesign incrementally:
STRANGLER FIG PATTERN EVOLUTION

- Phase 1: Wrap Legacy. A new facade sits in front of the untouched legacy system, which still handles all traffic.
- Phase 2: Route New Code. The facade sends a growing share of traffic (say 20%) to new modules while the legacy system handles the remaining 80%.
- Phase 3: Complete Migration. New modules handle 100% of traffic and the legacy system is removed.
This approach:
- ✅ Delivers value continuously
- ✅ Allows learning and course correction
- ✅ Reduces big-bang migration risk
- ❌ Requires maintaining two systems temporarily
- ❌ Demands excellent testing and monitoring
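In code, the pattern usually boils down to a thin routing facade that decides, per request, whether the new or the legacy implementation answers. A minimal sketch; the class and attribute names are invented for illustration, and a real rollout would typically key the split on a stable user or request ID rather than randomness:

import random

class CheckoutFacade:
    """Routes traffic between legacy and new implementations (strangler fig)."""

    def __init__(self, legacy, modern, modern_traffic_pct=20):
        self.legacy = legacy
        self.modern = modern
        self.modern_traffic_pct = modern_traffic_pct  # raise phase by phase: 20 -> 50 -> 100

    def process(self, request):
        if random.uniform(0, 100) < self.modern_traffic_pct:
            return self.modern.process(request)   # Phase 2+: new module handles it
        return self.legacy.process(request)       # otherwise the legacy path still serves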
🌳 The Decision Tree
Use this flowchart to guide your decision:
REDESIGN VS ACCEPT DECISION TREE

1. Is this a security or data integrity issue?
   - YES → REDESIGN NOW (no question).
   - NO → go to 2.
2. Can you contain the blast radius?
   - NO → REDESIGN NOW (high cascade risk).
   - YES → go to 3.
3. Does the issue block the strategic roadmap?
   - YES → Calculate ROI (rebuild cost, feature delay, unlocked value); REDESIGN if ROI > 3:1.
   - NO → go to 4.
4. Can you live with workarounds for 6-12 months?
   - YES → ACCEPT and add monitoring.
   - NO → REDESIGN, or use the Strangler Fig pattern.
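The ROI branch of the tree is plain arithmetic once every input is expressed in the same unit (engineer-weeks or dollars). A sketch with made-up inputs, using the 3:1 threshold from the tree:

def redesign_roi(rebuild_cost, feature_delay_cost, unlocked_value):
    # value unlocked by the redesign divided by its total cost
    return unlocked_value / (rebuild_cost + feature_delay_cost)

roi = redesign_roi(rebuild_cost=12, feature_delay_cost=8, unlocked_value=80)
print(f"ROI {roi:.1f}:1 -> {'redesign' if roi >= 3 else 'accept, revisit later'}")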
🧪 The Scientific Method for Architecture
Treat your decision as a hypothesis:
Hypothesis Template:
"If we [ACCEPT/REDESIGN] the [COMPONENT], then [PREDICTED OUTCOME] will occur within [TIMEFRAME], which we'll measure by [METRIC]."
Example (Accept):
"If we accept the current authentication service and add rate limiting, then unauthorized access attempts will decrease by 80% within 2 weeks, measured by failed login attempts in our security logs."
Example (Redesign):
"If we redesign the order processing pipeline using an event-driven architecture, then order throughput will increase 5x within 3 months, measured by orders processed per minute."
💡 Pro Tip: Write down your hypothesis before deciding. This creates accountability and clear success criteria.
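One lightweight way to "write it down" is to store the hypothesis as structured data next to your decision log, so the later success check is mechanical. A sketch; the field names are illustrative, not a prescribed format:

from dataclasses import dataclass
from datetime import date

@dataclass
class DecisionHypothesis:
    decision: str           # "ACCEPT" or "REDESIGN"
    component: str
    predicted_outcome: str
    review_date: date       # when to check the result
    metric: str             # where the number comes from

hypothesis = DecisionHypothesis(
    decision="ACCEPT",
    component="authentication service",
    predicted_outcome="unauthorized access attempts drop 80%",
    review_date=date(2025, 1, 15),
    metric="failed login attempts in security logs",
)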
🎭 Real-World Scenario: The Payment Gateway
Let's walk through a concrete example:
The Problem: Your payment processing system occasionally fails during peak traffic (Black Friday). Error rate spikes to 2% (normally 0.1%). Debug logs reveal the issue: a monolithic payment gateway written 5 years ago that locks a database row for 500ms per transaction.
Option 1: Accept & Mitigate
## Current code (simplified)
class PaymentGateway:
def process_payment(self, order_id, amount):
with db.transaction():
# Locks the order row
order = Order.get_and_lock(order_id)
# External API call (slow!)
result = payment_api.charge(amount)
# Still holding lock...
order.status = 'paid' if result.success else 'failed'
order.save()
return result
## Mitigation: Add async processing
class PaymentGateway:
def process_payment(self, order_id, amount):
# Immediately queue for processing
payment_queue.enqueue({
'order_id': order_id,
'amount': amount,
            'timestamp': time.time()
})
return {'status': 'pending', 'order_id': order_id}
def background_processor(self):
# Process payments without holding locks
while True:
job = payment_queue.dequeue()
result = payment_api.charge(job['amount'])
# Quick update, no long lock
Order.update_status(job['order_id'], result.status)
Cost: 3 engineer-days to implement queue, test, and deploy. Risk: Eventual consistency (users see "pending" briefly). Benefit: Solves 95% of the problem, ships this week.
Option 2: Redesign with Event Sourcing
## New event-driven architecture
class PaymentService:
def initiate_payment(self, order_id, amount):
event = PaymentInitiated(
order_id=order_id,
amount=amount,
            timestamp=time.time()
)
event_store.append(event)
return event.id
class PaymentProcessor:
def handle_payment_initiated(self, event):
result = payment_api.charge(event.amount)
if result.success:
event_store.append(PaymentSucceeded(
order_id=event.order_id,
transaction_id=result.transaction_id
))
else:
event_store.append(PaymentFailed(
order_id=event.order_id,
reason=result.error
))
class OrderProjection:
# Read model built from events
def get_payment_status(self, order_id):
events = event_store.get_for_order(order_id)
# Rebuild state from events
return self._calculate_status(events)
Cost: 8-12 engineer-weeks (new event store, migration, testing). Risk: Event store could have its own bugs; complex migration. Benefit: Highly scalable, full audit trail, enables future features.
The Decision:
For most companies, Option 1 (Accept) is correct because:
- ✅ Problem occurs 2 days per year (Black Friday, Cyber Monday)
- ✅ 3-day fix vs. 3-month project
- ✅ Lets team ship holiday features
- ✅ Proven queue pattern (low risk)
When Option 2 (Redesign) would be right:
- ⚠️ Payment failures happen daily
- ⚠️ Planning to 10x transaction volume
- ⚠️ Need detailed audit trail for compliance
- ⚠️ Monolith blocks other critical improvements
Examples: Real Decisions from the Field
Example 1: The Database Schema Mistake 🗄️
Background: A social media startup's user table has a preferences TEXT column storing JSON. As features grew, queries became slow (no indexing on JSON fields), and data inconsistencies emerged.
Debugging Scars:
- Added validation layer to catch malformed JSON
- Cached frequent queries to reduce DB load
- Created data cleanup scripts for inconsistencies
The Crossroads:
Accept Path:
-- Add materialized columns for common queries
ALTER TABLE users
ADD COLUMN email_notifications_enabled BOOLEAN
GENERATED ALWAYS AS (
CAST(JSON_EXTRACT(preferences, '$.notifications.email') AS BOOLEAN)
) STORED;
CREATE INDEX idx_email_notif ON users(email_notifications_enabled);
-- Cost: 2 days, downtime: 1 hour, risk: low
Redesign Path:
-- New normalized schema
CREATE TABLE user_preferences (
user_id BIGINT,
preference_key VARCHAR(100),
preference_value VARCHAR(500),
PRIMARY KEY (user_id, preference_key),
INDEX (preference_key, preference_value)
);
-- Migration for 10M users
-- Cost: 3 weeks, downtime: 4 hours, risk: medium
What They Chose: Accept with materialized columns. Why?
- Growth was linear, not exponential (no urgent scaling need)
- Migration risk too high for 10M active users
- Materialized columns solved 80% of query performance issues
- Bought time to redesign properly in v2 platform
Outcome: Successfully scaled to 25M users before eventually migrating 18 months later during a planned platform upgrade.
Example 2: The Microservices Monolith 🏗️
Background: An e-commerce company's checkout service became a "distributed monolith": 20 microservices so tightly coupled that deploying one required coordinating 8 others.
Debugging Scars:
- Created deployment runbooks (45 minutes to deploy)
- Added extensive integration testing (2-hour test suite)
- Implemented feature flags to decouple releases
The Crossroads:
Accept Path:
## Improve coordination without restructuring
class DeploymentCoordinator:
def deploy_checkout_change(self, service, version):
# Automated rollout order
dependency_graph = {
'cart': ['inventory', 'pricing'],
'checkout': ['cart', 'payment', 'shipping'],
'order': ['checkout', 'notification']
}
# Deploy dependencies first
for dep in dependency_graph[service]:
self.ensure_compatible_version(dep)
# Canary deployment
self.deploy_canary(service, version, traffic_pct=5)
self.monitor_metrics(duration='10m')
if self.metrics_healthy():
self.deploy_full(service, version)
## Cost: 2 weeks for automation, ongoing: 20min deploys
Redesign Path:
## Merge related services, define clear boundaries
## BEFORE: 20 services
## cart-service, cart-validator, cart-pricer...
## AFTER: 6 bounded contexts
class CheckoutDomain:
"""Encapsulates entire checkout flow"""
# Internal modules (not separate services)
cart_module: CartManager
pricing_module: PricingEngine
inventory_module: InventoryChecker
def process_checkout(self, cart_id):
# All coordination happens in-process
# External calls only for payment, shipping
pass
## Cost: 6 months to consolidate, test, migrate traffic
What They Chose: Redesign (with Strangler Fig approach). Why?
- Deployment friction was killing velocity (1 feature = 3 weeks)
- Every bug required 5 teams to coordinate
- New engineers took 6 months to understand the architecture
- The "microservices" were actually modules forced into services
Outcome: Over 9 months, consolidated to 6 services. Deployment time dropped to 10 minutes, feature velocity increased 3x, and on-call burden decreased 60%.
Example 3: The Caching Layer That Lied 🗞️
Background: A news website added Redis caching to speed up article pages. Cached data occasionally became stale, showing outdated headlines. Debugging revealed race conditions in cache invalidation logic.
Debugging Scars:
- Added TTL to force refresh every 5 minutes
- Implemented cache warming on article updates
- Created monitoring for cache hit rates
The Crossroads:
Accept Path:
## Current: Cache-aside pattern with manual invalidation
class ArticleCache:
def get_article(self, article_id):
cached = redis.get(f'article:{article_id}')
if cached:
return json.loads(cached)
# Cache miss
article = db.query_article(article_id)
redis.setex(
f'article:{article_id}',
300, # 5 minute TTL
json.dumps(article)
)
return article
def update_article(self, article_id, content):
db.update_article(article_id, content)
# Manual invalidation (race condition prone!)
redis.delete(f'article:{article_id}')
## Mitigation: Add versioning
class ArticleCache:
def update_article(self, article_id, content):
version = redis.incr(f'article:{article_id}:version')
db.update_article(article_id, content, version)
redis.delete(f'article:{article_id}')
# Next read will get new version
## Cost: 3 days
Redesign Path:
## Event-driven cache invalidation
class ArticleEventStream:
def update_article(self, article_id, content):
# Write to DB
article = db.update_article(article_id, content)
# Publish event (guaranteed delivery)
        event_bus.publish(ArticleUpdated(
            article_id=article_id,
            author_id=article.author_id,    # included so the worker can invalidate author caches
            category=article.category,      # included so the worker can invalidate category caches
            version=article.version,
            timestamp=time.time()
        ))
class CacheInvalidationWorker:
def handle_article_updated(self, event):
# Invalidate all affected caches
redis.delete(f'article:{event.article_id}')
redis.delete(f'author:{event.author_id}:articles')
redis.delete(f'category:{event.category}:latest')
# Warm cache immediately
article = db.query_article(event.article_id)
redis.setex(
f'article:{event.article_id}',
3600,
json.dumps(article)
)
## Cost: 3 weeks (event bus, workers, testing)
What They Chose: Accept with versioning. Why?
- Stale cache only affected <0.1% of page views
- 5-minute TTL acceptable for news content
- Event bus infrastructure didn't exist yet
- Versioning eliminated race conditions for <5% of the cost
Outcome: Stale cache incidents dropped 95%. Team later added event bus for other features, then migrated caching to use it, but only after proving its value elsewhere.
Example 4: The Authentication Time Bomb ⏰
Background: A SaaS platform used home-built JWT authentication with a custom encryption scheme. A security audit flagged it as vulnerable to timing attacks and lacking key rotation.
Debugging Scars:
- Added rate limiting to slow down attack attempts
- Implemented anomaly detection for suspicious auth patterns
- Created incident response runbooks
The Crossroads:
Accept Path:
// Current: Custom JWT implementation
function verifyToken(token) {
const [header, payload, signature] = token.split('.');
const expectedSig = customHMAC(header + payload, SECRET_KEY);
// Vulnerable to timing attack!
if (signature !== expectedSig) {
throw new Error('Invalid signature');
}
return JSON.parse(base64Decode(payload));
}
// Mitigation: Add constant-time comparison
function verifyToken(token) {
const [header, payload, signature] = token.split('.');
const expectedSig = customHMAC(header + payload, SECRET_KEY);
// Constant-time comparison
if (!crypto.timingSafeEqual(
Buffer.from(signature),
Buffer.from(expectedSig)
)) {
throw new Error('Invalid signature');
}
return JSON.parse(base64Decode(payload));
}
// Still missing: key rotation, standard algorithms
// Cost: 1 day
Redesign Path:
// Replace with industry-standard library
const jwt = require('jsonwebtoken');
const jwksClient = require('jwks-rsa');
const client = jwksClient({
jwksUri: 'https://your-domain/.well-known/jwks.json',
cache: true,
rateLimit: true
});
function verifyToken(token) {
return new Promise((resolve, reject) => {
jwt.verify(token, getKey, {
algorithms: ['RS256'],
issuer: 'your-domain',
audience: 'your-api'
}, (err, decoded) => {
if (err) reject(err);
else resolve(decoded);
});
});
}
function getKey(header, callback) {
client.getSigningKey(header.kid, (err, key) => {
if (err) return callback(err);
callback(null, key.getPublicKey());
});
}
// Includes: key rotation, standard algorithms, tested library
// Cost: 2 weeks (migration, testing, rollout)
What They Chose: Redesign immediately. Why?
- 🚨 Security vulnerabilities are non-negotiable
- Custom crypto is almost always wrong ("don't roll your own")
- Standard libraries are battle-tested by millions
- The mitigation didn't address underlying flaws (no key rotation)
- Risk of breach far exceeded cost of 2-week project
Outcome: Completed migration in 10 days with zero downtime using a dual-verification period where both old and new tokens were accepted. Security audit passed on retest.
💡 Key Lesson: Security and data integrity issues collapse the decision tree; redesign wins almost always.
Common Mistakes to Avoid
❌ Mistake 1: The Sunk Cost Fallacy
The Error: "We've already spent 3 months on this design, we can't change now."
Why It's Wrong: Past time is gone whether you accept or redesign. Only future costs and benefits matter.
Example:
## Spent 3 months building a custom message queue
import threading

class CustomQueue:
def __init__(self):
self.storage = [] # In-memory only!
self.lock = threading.Lock()
def enqueue(self, item):
with self.lock:
self.storage.append(item)
def dequeue(self):
with self.lock:
return self.storage.pop(0) if self.storage else None
## Discovery: It's not durable (loses data on restart)
## Fixing this properly requires... rebuilding it
## WRONG THINKING: "We've invested 3 months, let's add persistence"
## RIGHT THINKING: "Should we spend 1 more month patching, or
## 2 weeks migrating to Redis/RabbitMQ?"
Better Approach: Use Redis or RabbitMQ from day 1. The shortcut meant to "save 2 weeks" turned into 3 months of wasted work.
❌ Mistake 2: Perfect Is the Enemy of Done
The Error: Redesigning to fix everything instead of the actual problem.
Why It's Wrong: Scope creep kills projects. The redesign takes 3x longer than estimated and ships with its own bugs.
Example:
## Original problem: Slow database queries
## Targeted fix (Accept path):
## - Add indexes: 1 day
## - Query optimization: 2 days
## - Total: 3 days, solves the problem
## Overengineered redesign:
class UltimateCachingFramework:
"""Let's solve caching forever!"""
def __init__(self):
self.l1_cache = {} # In-memory
self.l2_cache = RedisCache() # Redis
self.l3_cache = MemcachedCache() # Memcached backup
self.cdn_cache = CloudFlarePurger() # CDN layer
# Automatic cache warming
self.predictor = MLCachePredictor() # ML to predict access!
# Distributed invalidation
self.event_bus = KafkaEventBus()
# Monitoring
self.metrics = PrometheusExporter()
# 47 more methods...
## Result: 4 months, still not done, original problem still exists
💡 Fix: Solve the problem you have, not the problem you imagine. Add complexity only when needed.
❌ Mistake 3: Ignoring Team Capacity
The Error: Choosing redesign without considering your team's actual bandwidth.
Why It's Wrong: Junior engineers maintaining critical systems during a redesign leads to outages. Senior engineers context-switching between new and old systems deliver neither well.
Reality Check:
TEAM CAPACITY CALCULATION
5 engineers × 40 hours/week = 200 hours
Minus:
- On-call rotation: -20 hours/week
- Meetings/planning: -30 hours/week
- Bug fixes/support: -25 hours/week
- Code review: -15 hours/week
---------------------------------
Available for new work: 110 hours/week
Redesign estimate: 400 hours
Realistic timeline: 4 weeks (not 2!)
During those 4 weeks:
- Feature work stops
- Technical debt accumulates elsewhere
- Opportunity cost: What could you ship instead?
💡 Fix: Be brutally honest about capacity. Consider hiring contractors for the redesign or postponing other work.
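The capacity reality check above is easy to script, so estimates get sanity-checked before anyone commits to a date. A sketch using the illustrative overhead numbers from this section:

def realistic_weeks(engineers, redesign_hours, overhead_hours_per_week=90):
    # gross hours minus on-call, meetings, support, and review overhead
    available = engineers * 40 - overhead_hours_per_week
    return redesign_hours / available

print(f"{realistic_weeks(5, 400):.1f} weeks")  # ~3.6 weeks, i.e. about 4, not 2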
❌ Mistake 4: No Rollback Plan
The Error: Committing to a redesign without a way to revert if it fails.
Why It's Wrong: Murphy's Law applies. Production issues emerge that testing missed.
Brittle Approach:
## Big-bang migration
def migrate_to_new_system():
print("Shutting down old system...")
old_system.stop()
print("Starting migration...")
migrate_all_data() # 4 hours, no going back
print("Starting new system...")
new_system.start()
# If new_system fails, you're offline for hours
Robust Approach:
## Parallel run with feature flag
def process_request(request):
if feature_flags.new_system_enabled():
# Primary: new system
result_new = new_system.process(request)
# Shadow: old system (compare results)
result_old = old_system.process(request)
metrics.compare(result_new, result_old)
return result_new
else:
# Rollback: instant, just flip flag
return old_system.process(request)
## Gradual rollout:
## Week 1: 5% traffic to new system
## Week 2: 25% if metrics look good
## Week 3: 50%
## Week 4: 100% or rollback if issues found
💡 Fix: Always have a rollback plan that takes <5 minutes to execute.
❌ Mistake 5: Redesigning Without Understanding Why It Failed
The Error: Rebuilding the same thing in a different language/framework without fixing the root cause.
Why It's Wrong: You'll recreate the same problems.
Example:
## Original Ruby service (slow):
class OrderProcessor
def process(order)
# Synchronous API calls
inventory.reserve(order.items)
payment.charge(order.total)
shipping.create_label(order)
order.complete!
end
end
## "Redesigned" in Go (still slow!):
func processOrder(order Order) error {
// Still synchronous!
if err := inventory.Reserve(order.Items); err != nil {
return err
}
if err := payment.Charge(order.Total); err != nil {
return err
}
if err := shipping.CreateLabel(order); err != nil {
return err
}
return order.Complete()
}
// Problem wasn't Ruby; it was synchronous design!
// Should have been async regardless of language
💡 Fix: Conduct a blameless postmortem to understand why the current design struggles before redesigning.
Key Takeaways
📋 Quick Reference: Redesign vs Accept Decision Framework
| Factor | Accept | Redesign |
|---|---|---|
| Security/Data Integrity | ❌ Never for critical issues | ✅ Always for vulnerabilities |
| Frequency | ✅ Rare problems (< monthly) | ✅ Daily pain points |
| Blast Radius | ✅ Isolated, containable | ✅ Cascading failures |
| Timeline | ✅ Days to weeks | ⚠️ Months (use Strangler Fig) |
| Opportunity Cost | ✅ Low (doesn't block features) | ⚠️ High (delays roadmap) |
| Team Capacity | ✅ Can be done alongside other work | ❌ Requires dedicated focus |
| Risk | ✅ Known, bounded | ⚠️ Unknown during transition |
🎯 Golden Rules
- Security first: Always redesign for security/data integrity issues
- Measure twice, cut once: Write down your hypothesis and success metrics
- Consider the third way: Strangler Fig lets you redesign incrementally
- ROI threshold: Redesign should provide 3:1 value over cost minimum
- Rollback always: Never commit to a one-way door without an escape plan
- Sunk costs don't matter: Only future value counts
- Solve the problem you have: Not the problem you imagine
- Team capacity is real: Account for it honestly
🧠 Decision Shortcuts
Redesign immediately if:
- 🚨 Security vulnerability
- 🚨 Data corruption risk
- 🚨 Cascading failures across systems
- 🚨 Regulatory compliance requirement
Accept if:
- ✅ Problem occurs < once per month
- ✅ Workaround takes < 1 week to implement
- ✅ No blocking effect on roadmap
- ✅ Team bandwidth is constrained
- ✅ Redesign cost > 3x the mitigation cost
Use Strangler Fig if:
- 🌱 Need continuous delivery during transition
- 🌱 System is too large for big-bang migration
- 🌱 Want to validate new design incrementally
- 🌱 Can't afford extended feature freeze
📚 Further Study
Martin Fowler's Refactoring Catalog - https://refactoring.com/catalog/ - Comprehensive guide to code-level redesign decisions with before/after examples
The Strangler Fig Application Pattern - https://martinfowler.com/bliki/StranglerFigApplication.html - Detailed explanation of incremental migration strategies from the pattern's creator
Technology Radar: Adopt, Trial, Assess, Hold - https://www.thoughtworks.com/radar - Framework for evaluating when to adopt new technologies vs. accepting current tools, updated quarterly by ThoughtWorks