Observability as a Design Property
Build systems that are inherently observable rather than adding observability as an afterthought
This lesson covers the shift from monitoring to observability, designing systems for debuggability, and embedding telemetry at the architecture level: essential concepts for building production systems that reveal their internal states.
Welcome
Traditional monitoring focuses on known failures: dashboards showing CPU, memory, and request rates. But what happens when your system fails in ways you never anticipated? When customer complaints describe issues you can't reproduce in your metrics? This is where the observability mindset transforms how we build software.
Observability isn't a tool you bolt on after deployment; it's a design property you architect into systems from day one. Like security or performance, observability must be embedded in your design decisions, not sprinkled on as an afterthought. This fundamental shift changes how you think about data structures, API contracts, error handling, and even team organization.
The Core Insight: You can't observe what you didn't design to be observable. Every architectural decision either reveals or conceals system behavior.
The Paradigm Shift: From Monitoring to Observability
Let's clarify what makes observability fundamentally different from traditional monitoring:
| Aspect | Traditional Monitoring | Observability |
|---|---|---|
| Question Answering | Known questions only ("Is CPU above 80%?") | Unknown questions ("Why are checkout failures spiking for Android users in Germany?") |
| Failure Mode | Predefined alerts, threshold-based | Arbitrary investigation, pattern exploration |
| Data Structure | Aggregated metrics, pre-computed statistics | High-cardinality events, raw structured data |
| Time to Insight | Hours to days (requires code deployment) | Seconds to minutes (query production directly) |
| Cost Model | Cost of collection | Cost of retention + query |
Why This Matters in Production
Imagine your e-commerce platform experiences a 15% drop in conversions. Traditional monitoring shows:
- ✅ All services healthy (200 OK responses)
- ✅ Error rates normal (<0.1%)
- ✅ Latency within SLA (p95 < 500ms)
Yet revenue is bleeding. With observability as a design property, your system can answer:
- "Show me all checkout attempts where
payment_provider=stripeANDuser_country=DEANDclient_version=3.2.1" - "What percentage of these had
validation_errorsin the last hour?" - "How does cart value distribution differ between failing and succeeding requests?"
You discover a recently deployed client version has a locale bug affecting German users: a failure mode you never anticipated, thus never monitored.
💡 Key Insight: Monitoring asks "Is the system broken?" Observability asks "Why is the system behaving this way?"
Designing for Observability: Core Principles
Observability as a design property means making explicit architectural decisions that enable runtime investigation. Here are the foundational principles:
1. Structured Events Over Metrics
Traditional approach (limited observability):
# Increment a counter
checkout_counter.inc()
if failed:
    error_counter.inc()
Observability-first approach:
# Emit rich structured event
log_structured_event({
    "event_type": "checkout_attempted",
    "user_id": user.id,
    "cart_value": cart.total,
    "payment_method": payment.method,
    "user_country": user.country,
    "client_version": request.headers["X-App-Version"],
    "items_count": len(cart.items),
    "session_duration_sec": session.duration,
    "outcome": "failure" if failed else "success",
    "error_code": error.code if failed else None,
    "duration_ms": elapsed_time,
    "trace_id": context.trace_id
})
Why this matters: The second approach lets you slice data arbitrarily after the fact. You can discover that "iOS users with carts over $500 see 3x higher failure rates" without having created a metric for that specific combination.
⚠️ Design consideration: Each event attribute increases cardinality. Design your event schema with dimensions that matter for debugging: user segments, feature flags, deployment versions, geographic regions.
2. Context Propagation is Architecture
Every request spawns work across multiple services. Without context propagation, you have disconnected logs, making root cause analysis nearly impossible.
                REQUEST CONTEXT PROPAGATION

Client Request
  │  trace_id: abc123
  │  user_id: user_456
  │  session_id: sess_789
  ▼
API Gateway
  │  Extracts context from headers
  │  Adds: service=gateway, node=gw-3
  ▼
Order Service
  │  Inherits trace_id
  │  Adds: service=orders, order_id=ord_321
  │
  ├── Payment Service
  │      Inherits full context
  │      Adds: transaction_id=txn_654
  │
  └── Inventory Service
         Inherits full context
         Adds: warehouse=warehouse_west

All logs/spans share trace_id → reconstruct the entire flow
Design decision: Your framework/library must propagate context automatically. Manually passing trace IDs through every function call doesn't scale.
# Bad: Manual context threading (fragile)
def process_order(order_id, trace_id, user_id, session_id):
result = payment_service.charge(
order_id, trace_id, user_id, session_id
)
# Easy to forget propagating context!
# Good: Context as thread-local or async context
with observation_context(trace_id=trace_id, user_id=user_id):
# All downstream calls inherit context automatically
result = payment_service.charge(order_id)
inventory_service.reserve(items)
# Context flows implicitly through execution
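The observation_context helper above is not a standard library API. Here is a minimal sketch of how such a helper could be built on Python's contextvars, which propagates values correctly across threads and asyncio tasks; all names here are illustrative assumptions, not the lesson's reference implementation:

```python
import contextvars
import json
from contextlib import contextmanager

# Holds the ambient observability context for the current execution flow.
_obs_context = contextvars.ContextVar("obs_context", default={})

@contextmanager
def observation_context(**fields):
    """Merge fields into the ambient context for the duration of the block."""
    merged = {**_obs_context.get(), **fields}
    token = _obs_context.set(merged)
    try:
        yield merged
    finally:
        _obs_context.reset(token)

def emit_event(event_type, **attributes):
    """Emit a structured event enriched with the ambient context."""
    event = {**_obs_context.get(), "event_type": event_type, **attributes}
    print(json.dumps(event))  # stand-in for a real event pipeline

# Usage: downstream code never threads trace_id by hand
with observation_context(trace_id="abc123", user_id="user_456"):
    emit_event("payment_charged", amount_usd=42.50)
```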
3. High-Cardinality Dimensions
Traditional metrics aggregate away the details that matter. Observability preserves high-cardinality dimensions, attributes with many unique values:
Low-cardinality (traditional metrics can handle):
- service_name (10-100 unique values)
- endpoint (100-1,000 unique values)
- status_code (10-50 unique values)
High-cardinality (observability systems required):
- user_id (millions of unique values)
- trace_id (every request unique)
- session_id (millions active)
- feature_flag_combination (2^n possibilities)
- cart_composition (infinite combinations)
Real-world example: A social media platform needs to debug why video uploads fail for specific users. High-cardinality queries enable:
-- Find common attributes among failing uploads
SELECT
device_model,
os_version,
app_version,
COUNT(*) as failure_count
FROM upload_events
WHERE outcome = 'failure'
AND timestamp > NOW() - INTERVAL 1 HOUR
GROUP BY device_model, os_version, app_version
ORDER BY failure_count DESC
LIMIT 10
This reveals: "Samsung Galaxy S21 on Android 13 with app version 4.5.2 accounts for 87% of failures", a pattern invisible to traditional metrics.
💡 Design tip: Add dimensions that segment your user base and system topology: every attribute that might explain "why this subset behaves differently."
4. Design for Debuggability
Every architectural choice either helps or hinders future debugging. Consider these design decisions:
Data structures that expose state:
# Opaque (hard to observe)
class OrderProcessor:
def __init__(self):
self._state = {} # Hidden internal state
def process(self, order):
# State changes invisible
self._state["stage"] = "validating"
# ...
# Observable (designed for debugging)
class OrderProcessor:
def __init__(self, observability_context):
self.state = ObservableState(observability_context)
def process(self, order):
# Every state transition logged with context
with self.state.transition("validating"):
self._validate(order)
# Automatically logs:
# - timestamp
# - previous state
# - new state
# - duration
# - outcome
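ObservableState is a hypothetical helper rather than a library class. A minimal sketch of what its transition context manager might look like, simplified to take a plain emit callback instead of a full observability context:

```python
import time
from contextlib import contextmanager

class ObservableState:
    """Tracks a named state and emits a structured event for every transition."""

    def __init__(self, emit):
        self.emit = emit          # callback that ships a structured event
        self.current = "initial"

    @contextmanager
    def transition(self, new_state):
        previous, self.current = self.current, new_state
        start = time.monotonic()
        outcome = "success"
        try:
            yield
        except Exception:
            outcome = "error"
            raise
        finally:
            self.emit({
                "event_type": "state_transition",
                "previous_state": previous,
                "new_state": new_state,
                "duration_ms": int((time.monotonic() - start) * 1000),
                "outcome": outcome,
            })

# Usage
state = ObservableState(emit=print)
with state.transition("validating"):
    pass  # validation work happens here
```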
Error handling with context:
# Low observability
try:
result = external_api.call()
except Exception as e:
log.error(f"API failed: {e}")
raise
# High observability
try:
result = external_api.call()
except Exception as e:
# Capture rich context
log_exception(e, {
"operation": "external_api_call",
"endpoint": external_api.endpoint,
"retry_count": current_retry,
"timeout_ms": timeout_config,
"request_payload_size": len(payload),
"user_tier": user.subscription_tier,
"feature_flags": active_flags,
"upstream_trace_id": response_headers.get("trace-id")
})
raise
⚠️ Common mistake: Logging the exception message alone. Exceptions need context from the surrounding system state to be debuggable.
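The log_exception call used above is a hypothetical helper, not a standard API. One way to sketch it, assuming events go to a structured logger:

```python
import json
import logging
import traceback

logger = logging.getLogger("observability")

def log_exception(exc, context):
    """Log an exception as a structured event with surrounding system context.

    Intended to be called from inside an except block so the traceback is available.
    """
    logger.error(json.dumps({
        "event_type": "exception",
        "exception_type": type(exc).__name__,
        "exception_message": str(exc),
        "stack_trace": traceback.format_exc(),
        **context,  # operation, retry_count, user_tier, feature_flags, ...
    }, default=str))
```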
Real-World Examples
Example 1: E-Commerce Checkout Flow
Scenario: Your checkout service handles 10,000 transactions/hour. You need to understand failure patterns without degrading performance.
Observable design:
class CheckoutService:
def __init__(self, events_emitter):
self.events = events_emitter
def process_checkout(self, cart, user, payment_method):
# Start with rich context
checkout_context = {
"checkout_id": generate_id(),
"user_id": user.id,
"user_segment": user.segment, # "premium", "regular", "new"
"cart_value_usd": cart.total,
"items_count": len(cart.items),
"payment_method": payment_method.type,
"user_country": user.country,
"client_version": request.client_version,
"feature_flags": self.active_flags(user),
"session_age_minutes": (now() - user.session_start).minutes
}
        start = time.monotonic()
try:
# Each step adds context
self._validate_cart(cart, checkout_context)
self._apply_promotions(cart, checkout_context)
result = self._charge_payment(payment_method, checkout_context)
self._fulfill_order(result, checkout_context)
# Emit success event
self.events.emit({
**checkout_context,
"outcome": "success",
"duration_ms": (time.now() - start).milliseconds,
"promotion_discount_usd": cart.discount_applied
})
return result
except PaymentDeclinedError as e:
self.events.emit({
**checkout_context,
"outcome": "payment_declined",
"decline_reason": e.reason,
"decline_code": e.code,
"duration_ms": (time.now() - start).milliseconds
})
raise
except ValidationError as e:
self.events.emit({
**checkout_context,
"outcome": "validation_failed",
"validation_stage": e.stage,
"validation_rule": e.rule_id,
"duration_ms": (time.now() - start).milliseconds
})
raise
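The events_emitter passed into CheckoutService is assumed to be any sink that accepts a dict and ships it to your event store. A minimal stand-in for local experimentation (names are illustrative; in production this would write to Kafka, an OTLP endpoint, or similar):

```python
import json
import sys
from datetime import datetime, timezone

class StdoutEventEmitter:
    """Writes structured events as JSON lines to stdout."""

    def emit(self, event):
        enriched = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            **event,
        }
        sys.stdout.write(json.dumps(enriched, default=str) + "\n")

# checkout_service = CheckoutService(events_emitter=StdoutEventEmitter())
```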
Query power this enables:
-- "Why are premium users in France seeing higher decline rates?"
SELECT
payment_method,
decline_reason,
COUNT(*) as occurrences,
AVG(cart_value_usd) as avg_cart_value
FROM checkout_events
WHERE user_segment = 'premium'
AND user_country = 'FR'
AND outcome = 'payment_declined'
AND timestamp > NOW() - INTERVAL 24 HOURS
GROUP BY payment_method, decline_reason
ORDER BY occurrences DESC
Discovery: Premium French users with payment_method=SEPA and cart_value > 1000 EUR hit a fraud detection rule that's misconfigured, a failure mode you never anticipated monitoring.
Example 2: Microservices with Distributed Tracing
Scenario: A user reports "payment page hangs for 10 seconds then shows error." Your system has 12 microservices involved in payment processing.
Observable architecture:
                DISTRIBUTED TRACE STRUCTURE

Trace ID: abc-123-def
│
└─ Span: api-gateway                 [120ms]
   ├─ Span: auth-service             [15ms]
   │  └─ Span: redis-get             [2ms]
   │
   └─ Span: checkout-service         [105ms]
      ├─ Span: inventory-check       [8ms]
      │  └─ Span: postgres-query     [6ms]
      │
      ├─ Span: pricing-service       [12ms]
      │  └─ Span: cache-lookup       [1ms]
      │
      ├─ Span: payment-service       [10,500ms]  ⚠️
      │  ├─ Span: fraud-check        [50ms]
      │  ├─ Span: stripe-api         [10,400ms]  ⚠️
      │  │     (external, timeout occurred)
      │  └─ Span: fallback-handling  [40ms]
      │
      └─ Span: response-formatting   [5ms]

Root cause visible: the stripe-api span shows 10.4s
- Stripe's webhook endpoint was down
- Our timeout was 15s (too long!)
- No circuit breaker implemented
Design elements enabling this:
- Automatic instrumentation: Frameworks inject tracing context
- Span attributes carry debugging info:
span.set_attributes({
    "payment.amount": amount,
    "payment.currency": "USD",
    "payment.provider": "stripe",
    "user.tier": user.tier,
    "retry.attempt": retry_count,
    "timeout.configured_ms": timeout_config
})
- Error recording with context:
except StripeTimeout as e:
    span.record_exception(e)
    span.set_status(Status(StatusCode.ERROR))
    span.set_attribute("error.type", "timeout")
    span.set_attribute("timeout.duration_ms", e.duration)
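For concreteness, here is a hedged sketch of the same pattern using the OpenTelemetry Python SDK. The span name, attribute keys, and the charge_payment and call_stripe functions are illustrative assumptions, not the lesson's reference implementation:

```python
# Requires: pip install opentelemetry-api opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
from opentelemetry.trace import Status, StatusCode

# Minimal SDK setup: print spans to stdout (swap in an OTLP exporter for production)
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer(__name__)

def call_stripe(amount, timeout_ms):
    # Stand-in for the real external call; simulates the outage described above
    raise TimeoutError(f"stripe api timed out after {timeout_ms}ms")

def charge_payment(amount, user_tier, retry_count, timeout_ms):
    with tracer.start_as_current_span("payment.charge") as span:
        # Attributes that later let you slice traces by tier, retry count, etc.
        span.set_attributes({
            "payment.amount": amount,
            "payment.currency": "USD",
            "payment.provider": "stripe",
            "user.tier": user_tier,
            "retry.attempt": retry_count,
            "timeout.configured_ms": timeout_ms,
        })
        try:
            return call_stripe(amount, timeout_ms)
        except TimeoutError as e:
            span.record_exception(e)
            span.set_status(Status(StatusCode.ERROR))
            span.set_attribute("error.type", "timeout")
            raise
```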
Example 3: Feature Flag Impact Analysis
Scenario: You're rolling out a new recommendation algorithm to 10% of users. You need to measure impact on engagement without biasing the experiment.
Observable design:
class RecommendationService:
def get_recommendations(self, user):
# Feature flag decision becomes observable dimension
algorithm_variant = self.feature_flags.evaluate(
flag="new_recommendation_algo",
user=user,
default="control"
)
        start = time.monotonic()
if algorithm_variant == "treatment":
recommendations = self._ml_based_recommendations(user)
else:
recommendations = self._rule_based_recommendations(user)
# Emit event with variant as dimension
self.events.emit({
"event_type": "recommendations_generated",
"user_id": user.id,
"algorithm_variant": algorithm_variant, # Key dimension!
"recommendations_count": len(recommendations),
"computation_time_ms": (time.now() - start).milliseconds,
"user_history_size": len(user.view_history),
"recommendations": [r.id for r in recommendations]
})
return recommendations
def record_interaction(self, user, recommendation_id, action):
# Connect interaction back to variant
self.events.emit({
"event_type": "recommendation_interaction",
"user_id": user.id,
"recommendation_id": recommendation_id,
"action": action, # "clicked", "dismissed", "purchased"
"algorithm_variant": self._get_user_variant(user)
})
Analysis query:
-- Compare click-through rates by variant
WITH impressions AS (
SELECT
algorithm_variant,
COUNT(*) as shown_count
FROM recommendations_generated
WHERE timestamp > experiment_start_time
  GROUP BY algorithm_variant
),
clicks AS (
SELECT
algorithm_variant,
COUNT(*) as click_count
FROM recommendation_interaction
WHERE action = 'clicked'
AND timestamp > experiment_start_time
  GROUP BY algorithm_variant
)
SELECT
i.algorithm_variant,
i.shown_count,
c.click_count,
(c.click_count * 100.0 / i.shown_count) as ctr_percent
FROM impressions i
JOIN clicks c ON i.algorithm_variant = c.algorithm_variant
Result: Treatment group shows 23% higher CTR, but also 15% higher computation time. Trade-off is now visible and quantifiable.
Example 4: Database Query Performance Patterns
Scenario: Your database is experiencing intermittent slowdowns. Traditional metrics show average query time is acceptable, but users report occasional hangs.
Observable design:
class DatabaseRepository:
def __init__(self, db_connection, observability):
self.db = db_connection
self.obs = observability
def find_user_orders(self, user_id, filters):
# Capture query context
query_context = {
"query_type": "find_user_orders",
"user_id": user_id,
"filter_count": len(filters),
"has_date_range": "date_range" in filters,
"has_status_filter": "status" in filters,
"expected_rows": self._estimate_rows(user_id, filters)
}
with self.obs.span("database.query", query_context) as span:
query = self._build_query(user_id, filters)
span.set_attribute("query.sql", query.to_sql())
result = self.db.execute(query)
# Capture result characteristics
span.set_attribute("result.row_count", len(result))
span.set_attribute("query.execution_plan", result.explain())
return result
Discovery through query:
-- Find queries with high variance (p99 >> p50)
SELECT
  query_type,
  has_date_range,
  p50,
  p99,
  (p99 / p50) as variance_ratio
FROM (
  SELECT
    query_type,
    has_date_range,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY duration_ms) as p50,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY duration_ms) as p99
  FROM database_query_spans
  WHERE timestamp > NOW() - INTERVAL 1 HOUR
  GROUP BY query_type, has_date_range
) stats
WHERE (p99 / p50) > 10
ORDER BY variance_ratio DESC
Finding: Queries with has_date_range=true and expected_rows > 10000 have p99 latency 50x higher than p50. Missing index on created_at column for large result sets.
Common Mistakes to Avoid ⚠️
1. Logging Instead of Structured Events
❌ Wrong:
logger.info(f"User {user_id} checked out with cart value ${cart.total}")
✅ Right:
events.emit({
"event_type": "checkout_completed",
"user_id": user_id,
"cart_value_usd": cart.total,
"timestamp": now()
})
Why: String logs require regex parsing. Structured events enable direct querying by any dimension.
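To make the difference concrete, a small illustrative sketch: pulling the cart value back out of a string log requires a brittle regex, while a structured event can be filtered directly by any field (the names and formats here are assumptions):

```python
import re

# String log: the data must be re-parsed with a regex that breaks on format changes
log_line = "User 42 checked out with cart value $123.50"
match = re.search(r"User (\d+) checked out with cart value \$([\d.]+)", log_line)
user_id, cart_value = int(match.group(1)), float(match.group(2))

# Structured event: every dimension is directly queryable
event = {"event_type": "checkout_completed", "user_id": 42, "cart_value_usd": 123.50}
is_large_checkout = event["cart_value_usd"] > 100
```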
2. Adding Observability After Design
β Wrong: "We'll add logging once the feature works."
β Right: "What questions will we need to answer when this fails? Let's emit events that enable those questions."
Impact: Retrofitting observability requires redesigning data flows, which is expensive and often incomplete.
3. Over-Aggregating Data
❌ Wrong: Computing averages/percentiles in the application and emitting only those.
✅ Right: Emit raw event data; aggregate at query time.
Reason: Pre-aggregation destroys the ability to slice data by unforeseen dimensions. You lose the "unknown questions" superpower.
4. Ignoring Cardinality Costs
❌ Wrong: Adding full_user_agent_string (millions of unique values) to every event without considering storage costs.
✅ Right: Include high-cardinality dimensions strategically. Parse the user agent to extract browser, os, device_type (lower cardinality), as sketched below.
Balance: High cardinality enables deep investigation but increases storage costs. Design your schema with cost awareness.
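A minimal sketch of reducing a raw user-agent string to low-cardinality dimensions. This naive string matching is only for illustration; a real system would use a proper UA-parsing library:

```python
def reduce_user_agent(ua: str) -> dict:
    """Reduce a high-cardinality UA string to a few low-cardinality dimensions."""
    ua_lower = ua.lower()

    if "firefox" in ua_lower:
        browser = "firefox"
    elif "edg" in ua_lower:
        browser = "edge"
    elif "chrome" in ua_lower:
        browser = "chrome"
    elif "safari" in ua_lower:
        browser = "safari"
    else:
        browser = "other"

    if "android" in ua_lower:
        os_name, device_type = "android", "mobile"
    elif "iphone" in ua_lower or "ipad" in ua_lower:
        os_name, device_type = "ios", "mobile"
    elif "windows" in ua_lower:
        os_name, device_type = "windows", "desktop"
    elif "mac os" in ua_lower:
        os_name, device_type = "macos", "desktop"
    else:
        os_name, device_type = "other", "unknown"

    return {"browser": browser, "os": os_name, "device_type": device_type}

# Usage: attach the reduced dimensions to events instead of the raw string
print(reduce_user_agent(
    "Mozilla/5.0 (Linux; Android 13; SM-G991B) Chrome/119.0 Mobile Safari/537.36"
))  # {'browser': 'chrome', 'os': 'android', 'device_type': 'mobile'}
```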
5. Missing Context Propagation
❌ Wrong: Each service logs independently without a shared trace ID.
✅ Right: Context (trace ID, user ID, session ID) flows automatically through all services.
Result: Without context propagation, you have disconnected log fragments. With it, you reconstruct entire request flows.
6. Sampling Too Aggressively
β Wrong: "We'll sample 1% of traces to save costs."
✅ Right: Sample based on trace characteristics: keep all errors, sample successful requests, always keep traces for specific user segments.
Risk: Aggressive uniform sampling means rare bugs disappear from your data. Use intelligent sampling (sketched after this list):
- Keep 100% of errors
- Keep 100% of slow requests (p95+)
- Keep 100% of specific user tiers
- Sample 10% of fast successful requests
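A minimal sketch of such a tail-sampling decision, assuming each finished trace is summarized as a dict with outcome, duration, and user tier; all names and thresholds are illustrative:

```python
import random

SLOW_THRESHOLD_MS = 800        # roughly the p95 latency for this service
ALWAYS_KEEP_TIERS = {"premium", "enterprise"}
BASELINE_SAMPLE_RATE = 0.10    # 10% of fast, successful requests

def should_keep_trace(trace: dict) -> bool:
    """Decide after the fact whether a completed trace is worth retaining."""
    if trace["outcome"] != "success":
        return True                                   # keep 100% of errors
    if trace["duration_ms"] >= SLOW_THRESHOLD_MS:
        return True                                   # keep 100% of slow requests
    if trace.get("user_tier") in ALWAYS_KEEP_TIERS:
        return True                                   # keep specific user segments
    return random.random() < BASELINE_SAMPLE_RATE     # sample the fast, boring rest

# Usage
trace = {"outcome": "success", "duration_ms": 120, "user_tier": "regular"}
print(should_keep_trace(trace))  # True roughly 10% of the time
```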
7. Forgetting the "Why" Context
❌ Wrong: Logging what happened without capturing why decisions were made.
log.info("Using cache for user data")
✅ Right: Capture the decision context.
events.emit({
"decision": "cache_used",
"reason": "user_tier_premium",
"cache_hit": True,
"cache_age_seconds": 45,
"fallback_available": True
})
Value: Understanding why the system behaved a certain way is often more valuable than knowing what it did.
Key Takeaways
Observability Design Principles
Core Philosophy:
- Observability is a design property, not a tool
- Design systems to answer unknown questions, not just known failures
- Every architectural choice either reveals or conceals behavior
Implementation Checklist:
- ✅ Emit structured events with high-cardinality dimensions
- ✅ Propagate context (trace ID, user ID) automatically through all services
- ✅ Preserve raw data: aggregate at query time, not collection time
- ✅ Instrument decisions, not just outcomes (capture "why")
- ✅ Design error handling to include rich context
- ✅ Make state changes observable through explicit events
- ✅ Use intelligent sampling: keep errors, slow requests, specific user segments
- ✅ Balance cardinality against storage costs
Mental Model:
Traditional: "Did something break?"
Observability: "Why did the system behave this way?"
Success Metric:
Can you answer: "Show me all requests where [any combination of dimensions] exhibited [any behavior]" without deploying new code?
Memory Aid: The DECADE Framework
Remember observability design principles with DECADE:
- Dimensions: High-cardinality attributes that segment behavior
- Events: Structured data over plain text logs
- Context: Propagate trace ID, user ID, session ID everywhere
- Arbitrary: Enable unknown questions, not just predefined dashboards
- Decisions: Log why decisions were made, not just outcomes
- Errors: Rich context in error handling (who, what, when, why)
Try This: Audit Your Current System
Pick one critical user flow in your system (checkout, signup, search). Ask:
- Can you answer: "Show me all failures where the user was on mobile, had items from category X, and used payment method Y"?
- Can you reconstruct: The entire flow across all services for a specific request ID?
- Can you discover: What common attributes exist among failing requests that don't exist in successful ones?
If any answer is "no," you have an observability design gap.
💡 Did You Know?
Google's Dapper (2010) pioneered distributed tracing at scale, tracing billions of requests across Google's services with negligible performance overhead thanks to built-in sampling. The key insight: observability must be built into infrastructure from day one, not bolted on afterward. Modern systems like OpenTelemetry inherit this philosophy.
Further Study
- OpenTelemetry Documentation - https://opentelemetry.io/docs/concepts/observability-primer/ - Industry-standard observability framework and conceptual foundations
- Charity Majors on Observability - https://www.honeycomb.io/blog/observability-a-manifesto - Seminal writing on observability as practice from a pioneer in the field
- Google SRE Book: Monitoring Distributed Systems - https://sre.google/sre-book/monitoring-distributed-systems/ - Production observability patterns at scale from Google's SRE practices
Next Steps: Now that you understand observability as a design property, the next lesson explores The Three Pillars: Logs, Metrics, and Traces, covering how these signal types work together to create comprehensive system understanding.