
Instrumentation Strategy

Learn what to instrument, how to name spans, and how to make design decisions that scale as your system evolves


This lesson covers instrumentation patterns, signal collection strategies, and context propagation techniques: essential concepts for building maintainable distributed systems with effective root cause analysis capabilities.

Welcome to Instrumentation Strategy

When production systems fail at 3 AM, the difference between a 5-minute fix and a 5-hour investigation often comes down to one thing: instrumentation strategy. 💻 Think of instrumentation as your system's nervous system: it needs to be comprehensive enough to feel everything important, but efficient enough not to overwhelm the organism.

In this lesson, you'll learn how to design instrumentation that captures meaningful signals without drowning your team in noise or tanking your application's performance. We'll explore the three pillars of observability (metrics, logs, and traces), strategic placement of instrumentation points, and how to propagate context across distributed boundaries.

Core Concepts

The Three Pillars of Observability

Modern observability rests on three complementary signal types, each serving distinct purposes:

| Signal Type | Purpose | Example | Best For |
| --- | --- | --- | --- |
| 📊 Metrics | Aggregated numerical data over time | Request rate, error percentage, latency percentiles | Alerting, dashboards, trend analysis |
| 📝 Logs | Discrete event records with context | "User 12345 failed authentication from IP 10.2.3.4" | Debugging specific issues, audit trails |
| 🔗 Traces | Request flow across service boundaries | API call → Database query → Cache lookup sequence | Understanding distributed system behavior |

The key insight: These signals work together, not in isolation. A metric spike alerts you there's a problem, logs help you understand what happened to specific requests, and traces show you where in your distributed system the failure occurred.

💡 Pro Tip: Start with metrics for detection, use traces for localization, and leverage logs for root cause confirmation. This "funnel approach" prevents analysis paralysis.

Strategic Instrumentation Placement

Not all code paths deserve equal instrumentation attention. Strategic placement follows the 80/20 rule: instrument the 20% of code paths that handle 80% of your business value and failure scenarios.

┌─────────────────────────────────────────────────┐
│        INSTRUMENTATION PRIORITY MATRIX          │
└─────────────────────────────────────────────────┘

   High Business Impact
         ↑
         │   🔴 CRITICAL          🟡 IMPORTANT
         │   - Auth flows         - Background jobs
         │   - Payment paths      - Report generation
         │   - Data mutations     - Cache operations
         │
         │   🟢 MONITOR           ⚪ MINIMAL
         │   - Read-only APIs     - Health checks
         │   - Static content     - Internal tooling
         │
         └──────────────────────────────→
                High Failure Risk

Critical instrumentation points you should never skip:

  1. Service boundaries (HTTP handlers, message consumers, RPC endpoints)
  2. External dependencies (database calls, API clients, queue operations)
  3. Authentication/authorization checkpoints
  4. State transitions (order placement, payment processing, user registration)
  5. Resource exhaustion points (connection pools, rate limiters, circuit breakers)

⚠️ Common Mistake: Over-instrumenting low-value code paths (like health checks) while under-instrumenting critical business logic. This creates signal noise that obscures real problems.
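
To make point 1 above concrete, here is a minimal sketch of instrumenting a service boundary with an OpenTelemetry span; the handler, route, and attribute values are illustrative, not a prescribed implementation.

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def handle_create_order(request):
    # One span per inbound request, opened at the service boundary
    with tracer.start_as_current_span(
        "POST /orders",
        attributes={
            "http.method": "POST",
            "http.route": "/orders",
        },
    ) as span:
        order = create_order(request)  # hypothetical business logic
        span.set_attribute("http.status_code", 201)
        return order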

Context Propagation Fundamentals

Context propagation is the mechanism that links related signals across service boundaries. Without it, distributed tracing becomes impossible and you're left piecing together unrelated log entries like a detective with no forensic tools.

┌──────────────────────────────────────────────────┐
│     CONTEXT PROPAGATION FLOW                     │
└──────────────────────────────────────────────────┘

  Client Request
       │
       ↓
  ┌─────────────────┐
  │  API Gateway    │ ← Generate trace_id + span_id
  │  trace: abc123  │
  └────────┬────────┘
           │ Inject headers: X-Trace-Id: abc123
           ↓
  ┌─────────────────┐
  │  Auth Service   │ ← Extract + create child span
  │  trace: abc123  │
  │  span: xyz789   │
  └────────┬────────┘
           │ Propagate trace_id
           ↓
  ┌─────────────────┐
  │  Database       │ ← Extract + tag queries
  │  trace: abc123  │
  │  span: def456   │
  └─────────────────┘

  All signals share trace_id = abc123
  Query: "Show me all logs/spans for abc123"

Key context elements to propagate:

  • Trace ID: Unique identifier for the entire request flow
  • Span ID: Identifier for this specific operation
  • Parent Span ID: Links child operations to their caller
  • Baggage: Key-value pairs (user_id, tenant_id, feature_flags)
  • Sampling decision: Whether to collect detailed telemetry

💡 Memory Device: Think of context as a baton in a relay race. Each service receives it, adds its contribution (timing, errors, metadata), and passes it to the next runner. Drop the baton, and you lose continuity.
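
Of the elements listed above, baggage is the one teams most often hand-roll. A minimal sketch using the OpenTelemetry baggage API, assuming the default W3C TraceContext + Baggage propagators are configured; the key and value are illustrative.

from opentelemetry import baggage, context

def handle_request():
    # Upstream: attach a key-value pair to the current context
    ctx = baggage.set_baggage("tenant_id", "acme-corp")
    token = context.attach(ctx)
    try:
        # Any inject() call made here carries the baggage along with the
        # trace headers to downstream services
        call_downstream()  # hypothetical outbound call
    finally:
        context.detach(token)

def downstream_handler():
    # Downstream (after extract()): read the propagated value
    tenant = baggage.get_baggage("tenant_id")  # "acme-corp" if propagated
    return tenant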

Instrumentation Patterns

Successful instrumentation follows proven patterns. Here are the most impactful:

1. Semantic Conventions

Use standardized naming and attributes across your organization:

| Convention | ❌ Bad | ✅ Good |
| --- | --- | --- |
| HTTP attributes | response_code, url, verb | http.status_code, http.url, http.method |
| Database attributes | db, query_time, sql | db.system, db.operation, db.statement |
| Error attributes | error, err_msg, failed | error.type, error.message, error.stack |

Why this matters: Standardized attributes enable cross-service queries like "show me all database errors with http.status_code=500 across all services."
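
If you use OpenTelemetry, you don't have to hand-type these names. A minimal sketch using the published constants, assuming the opentelemetry-semantic-conventions package is installed; constant names vary somewhat between releases, so treat these as illustrative.

from opentelemetry import trace
from opentelemetry.semconv.trace import SpanAttributes

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("GET /users/{id}") as span:
    # Shared constants keep every service spelling the attribute the same way
    span.set_attribute(SpanAttributes.HTTP_METHOD, "GET")
    span.set_attribute(SpanAttributes.HTTP_STATUS_CODE, 200)
    span.set_attribute(SpanAttributes.DB_SYSTEM, "postgresql")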

2. Structured Logging

Log events should be machine-parseable with consistent structure:

// ❌ Unstructured: Hard to query
"User login failed for john@example.com from 10.2.3.4"

// ✅ Structured: Queryable fields
{
  "timestamp": "2026-01-15T10:23:45Z",
  "level": "error",
  "message": "authentication_failed",
  "user_email": "john@example.com",
  "source_ip": "10.2.3.4",
  "trace_id": "abc123",
  "error_code": "invalid_credentials"
}
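
With structlog (used in the examples below), output like this comes from the processor pipeline. A minimal configuration sketch; the exact processor list is an assumption rather than this lesson's prescribed setup, and note that structlog names the message field "event" rather than "message".

import structlog

structlog.configure(
    processors=[
        structlog.processors.add_log_level,           # adds "level"
        structlog.processors.TimeStamper(fmt="iso"),  # adds ISO "timestamp"
        structlog.processors.JSONRenderer(),          # one JSON object per line
    ]
)

log = structlog.get_logger()
log.error(
    "authentication_failed",
    user_email="john@example.com",
    source_ip="10.2.3.4",
    trace_id="abc123",
    error_code="invalid_credentials",
)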

3. Cardinality Management

Cardinality = number of unique values for a dimension. High cardinality metrics (like user_id tags) can explode storage costs and query performance.

| Cardinality Level | Examples | Safe For |
| --- | --- | --- |
| 🟢 Low (< 100) | environment, region, service_name | Metrics, aggregations |
| 🟡 Medium (100-10K) | endpoint, error_type, customer_tier | Metrics with care |
| 🔴 High (> 10K) | user_id, session_id, request_id | Logs/traces only, never metrics |

⚠️ Critical Rule: Never use unbounded identifiers (user IDs, request IDs) as metric dimensions. Use them in logs and traces where they belong.
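
A minimal sketch of this rule in practice with the OpenTelemetry metrics API; the counter name, route, and tier values are illustrative.

from opentelemetry import metrics, trace

meter = metrics.get_meter(__name__)
request_counter = meter.create_counter("http.requests")

def record_request(route, tier, user_id):
    # Bounded dimensions only on the metric
    request_counter.add(1, {"endpoint": route, "customer_tier": tier})
    # Unbounded identifiers go on the current span (and logs) instead
    trace.get_current_span().set_attribute("user.id", user_id)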

4. Sampling Strategies

Collecting 100% of traces is expensive and often unnecessary. Smart sampling reduces costs while maintaining visibility:

┌─────────────────────────────────────────────┐
│        SAMPLING DECISION TREE               │
└─────────────────────────────────────────────┘

              Request arrives
                    │
        ┌───────────┴───────────┐
        │                       │
    Has error?              Slow?
    (status ≥ 400)         (> P95)
        │                       │
       YES                     YES
        │                       │
        └───────────┬───────────┘
                    ↓
              ✅ Sample 100%
                    │
              ─────────────
                    │
                   NO
                    ↓
            Random sample?
            (1% probability)
                    │
        ┌───────────┴───────────┐
       YES                     NO
        │                       │
   ✅ Sample                ❌ Drop

💡 Tail-based sampling: Make sampling decisions AFTER seeing the full trace. Keep all traces with errors or high latency, sample normal traffic at 1%.
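
Tail-based sampling normally runs in a collector, after the spans arrive. Inside the application process, the closest built-in compromise is head-based probability sampling; a minimal sketch using the OpenTelemetry SDK samplers, with the 1% ratio as an illustrative value.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Honor the caller's sampling decision if there is one;
# otherwise keep ~1% of new traces
sampler = ParentBased(root=TraceIdRatioBased(0.01))
trace.set_tracer_provider(TracerProvider(sampler=sampler))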

Instrumentation Anti-Patterns

Avoid these common mistakes that create technical debt:

❌ Anti-Pattern 1: Logging in Loops

# ❌ BAD: Creates log explosion
for item in items:  # 10,000 items
    logger.info(f"Processing item {item.id}")
    process(item)

# ✅ GOOD: Log summary metrics
logger.info("batch_processing_started", item_count=len(items))
for item in items:
    process(item)
logger.info("batch_processing_completed",
    processed=success_count,
    failed=error_count,
    duration_ms=elapsed,
)

❌ Anti-Pattern 2: Sensitive Data in Logs

# ❌ BAD: Leaks PII and credentials
logger.info(f"User {user.email} authenticated with password {password}")

# ✅ GOOD: Redact sensitive fields
logger.info("user_authenticated",
    user_id=user.id,          # Use ID, not email
    auth_method="password",   # No actual password
    source_ip=request.ip,
)

❌ Anti-Pattern 3: Synchronous Instrumentation

# ❌ BAD: Blocks request path
def handle_request(request):
    result = process(request)
    send_metric_to_collector(result)  # Network I/O blocks response
    return result

# ✅ GOOD: Async or buffered
def handle_request(request):
    result = process(request)
    metric_buffer.record(result)  # In-memory, flushed async
    return result

Examples

Example 1: Instrumenting an HTTP Service

Let's instrument a payment processing endpoint with all three signal types:

import time
from opentelemetry import trace, metrics
from opentelemetry.trace import Status, StatusCode
import structlog

logger = structlog.get_logger()
tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

# Define metrics
payment_counter = meter.create_counter(
    "payments.processed",
    description="Number of payment attempts"
)
payment_duration = meter.create_histogram(
    "payments.duration",
    description="Payment processing time in ms",
    unit="ms"
)

def process_payment(request):
    # Start trace span
    with tracer.start_as_current_span(
        "process_payment",
        attributes={
            "payment.amount": request.amount,
            "payment.currency": request.currency,
            "payment.method": request.method
        }
    ) as span:
        start_time = time.time()
        
        try:
            # Bind request context into logs (trace_id hex-encoded to match
            # how tracing backends display it)
            log = logger.bind(
                trace_id=format(span.get_span_context().trace_id, "032x"),
                user_id=request.user_id,
                amount=request.amount
            )
            
            log.info("payment_started")
            
            # Business logic
            result = charge_payment_provider(request)
            
            # Record success metrics
            duration = (time.time() - start_time) * 1000
            payment_counter.add(1, {
                "status": "success",
                "method": request.method,
                "currency": request.currency
            })
            payment_duration.record(duration, {
                "status": "success"
            })
            
            # Add trace attributes
            span.set_attribute("payment.provider_id", result.provider_id)
            span.set_status(Status(StatusCode.OK))
            
            log.info("payment_completed", 
                    provider_id=result.provider_id,
                    duration_ms=duration)
            
            return result
            
        except PaymentDeclinedError as e:
            # Record business error (not system failure)
            duration = (time.time() - start_time) * 1000
            payment_counter.add(1, {
                "status": "declined",
                "method": request.method,
                "reason": e.decline_reason
            })
            
            span.set_attribute("payment.decline_reason", e.decline_reason)
            span.set_status(Status(StatusCode.OK))  # Not a trace error
            
            log.warning("payment_declined",
                       decline_reason=e.decline_reason,
                       duration_ms=duration)
            raise
            
        except Exception as e:
            # Record system failure
            duration = (time.time() - start_time) * 1000
            payment_counter.add(1, {
                "status": "error",
                "method": request.method,
                "error_type": type(e).__name__
            })
            
            # Mark trace as error
            span.set_status(Status(StatusCode.ERROR))
            span.record_exception(e)
            
            log.error("payment_failed",
                     error_type=type(e).__name__,
                     error_message=str(e),
                     duration_ms=duration)
            raise

What makes this instrumentation effective:

  1. ✅ Unified context: trace_id flows from span to logs
  2. ✅ Semantic attributes: Uses standard naming (payment.*, error.type)
  3. ✅ Appropriate signals: Metrics for aggregation, traces for flow, logs for details
  4. ✅ Business vs system errors: Declined payments ≠ system failures
  5. ✅ Low cardinality: Metric dimensions are bounded (method, currency, status)

Example 2: Context Propagation in Microservices

How context flows through a distributed system:

# Service A: API Gateway
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer(__name__)

def handle_checkout(request):
    with tracer.start_as_current_span("checkout") as span:
        # Add business context
        span.set_attribute("user_id", request.user_id)
        span.set_attribute("cart_total", request.total)
        
        # Create headers with trace context
        headers = {}
        inject(headers)  # Injects traceparent, tracestate
        
        # Call downstream service
        response = requests.post(
            "http://payment-service/process",
            json=request.payment_data,
            headers=headers  # Propagates context!
        )
        
        return response

# Service B: Payment Service
import requests
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)

def process_payment_handler(flask_request):
    # Extract context from headers
    context = extract(flask_request.headers)
    
    # Start child span with extracted context
    with tracer.start_as_current_span(
        "process_payment",
        context=context  # Links to parent!
    ) as span:
        span.set_attribute("payment.method", "credit_card")
        
        # Further downstream call
        headers = {}
        inject(headers)
        result = requests.post(
            "http://fraud-service/check",
            headers=headers
        )
        
        return result

# Service C: Fraud Detection
from opentelemetry import trace
from opentelemetry.propagate import extract

tracer = trace.get_tracer(__name__)

def check_fraud_handler(flask_request):
    context = extract(flask_request.headers)
    
    with tracer.start_as_current_span(
        "fraud_check",
        context=context
    ) as span:
        # All spans share the same trace_id!
        score = calculate_fraud_score()
        span.set_attribute("fraud.score", score)
        
        return {"score": score}

Trace visualization from this flow:

┌─────────────────────────────────────────────┐
│  Trace ID: abc123 (120ms total)             │
└─────────────────────────────────────────────┘

├─ checkout (Service A)               [0-120ms]
│  └─ process_payment (Service B)    [10-110ms]
│     ├─ fraud_check (Service C)     [15-45ms]
│     └─ charge_provider (Service B) [50-105ms]
└─ update_inventory (Service A)      [115-120ms]

Now you can query: "Show me all operations for trace abc123" and see the entire request flow with timing breakdowns.

Example 3: Dynamic Sampling Configuration

Implementing intelligent sampling that adapts to traffic patterns:

import random
from opentelemetry.sdk.trace import sampling

class AdaptiveSampler(sampling.Sampler):
    """
    Samples based on request characteristics:
    - 100% of errors and slow requests
    - 10% of authenticated user requests
    - 1% of anonymous requests
    """

    def should_sample(self, parent_context, trace_id, name, kind=None,
                      attributes=None, links=None, trace_state=None):
        attributes = attributes or {}

        # Always sample errors
        if attributes.get("http.status_code", 0) >= 400:
            return sampling.SamplingResult(sampling.Decision.RECORD_AND_SAMPLE)

        # Always sample slow requests (> 1s); head samplers run at span start,
        # so this only applies when the caller supplies duration_ms up front
        duration = attributes.get("duration_ms", 0)
        if duration > 1000:
            return sampling.SamplingResult(sampling.Decision.RECORD_AND_SAMPLE)

        # Sample authenticated users at 10%
        if attributes.get("user.authenticated") is True:
            if random.random() < 0.10:
                return sampling.SamplingResult(sampling.Decision.RECORD_AND_SAMPLE)

        # Sample anonymous traffic at 1%
        if random.random() < 0.01:
            return sampling.SamplingResult(sampling.Decision.RECORD_AND_SAMPLE)

        # Drop everything else
        return sampling.SamplingResult(sampling.Decision.DROP)

    def get_description(self):
        return "AdaptiveSampler"

# Configure tracer with adaptive sampler
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

provider = TracerProvider(sampler=AdaptiveSampler())
trace.set_tracer_provider(provider)

Result: You capture 100% of problems (errors, slow requests) while sampling only 1-10% of normal traffic, reducing telemetry costs by 90%+ without losing visibility into issues.

Example 4: Instrumentation Testing

How to verify your instrumentation works correctly:

import pytest
from opentelemetry import trace
from opentelemetry.trace import StatusCode
from opentelemetry.sdk.trace import TracerProvider, export
from opentelemetry.sdk.trace.export.in_memory_span_exporter import \
    InMemorySpanExporter

@pytest.fixture
def span_exporter():
    """Capture spans in memory for assertions"""
    exporter = InMemorySpanExporter()
    provider = TracerProvider()
    provider.add_span_processor(
        export.SimpleSpanProcessor(exporter)
    )
    trace.set_tracer_provider(provider)  # global provider can only be set once per process
    return exporter

def test_payment_instrumentation(span_exporter):
    # Execute instrumented code
    request = PaymentRequest(
        amount=100.00,
        currency="USD",
        method="credit_card"
    )
    process_payment(request)
    
    # Verify spans were created
    spans = span_exporter.get_finished_spans()
    assert len(spans) == 1
    
    span = spans[0]
    assert span.name == "process_payment"
    
    # Verify attributes
    attrs = span.attributes
    assert attrs["payment.amount"] == 100.00
    assert attrs["payment.currency"] == "USD"
    assert attrs["payment.method"] == "credit_card"
    
    # Verify status
    assert span.status.status_code == StatusCode.OK

def test_payment_error_instrumentation(span_exporter):
    request = PaymentRequest(amount=-100)  # Invalid
    
    with pytest.raises(ValidationError):
        process_payment(request)
    
    spans = span_exporter.get_finished_spans()
    span = spans[0]
    
    # Verify error was recorded
    assert span.status.status_code == StatusCode.ERROR
    assert len(span.events) > 0  # Exception recorded
    
    event = span.events[0]
    assert event.name == "exception"
    assert "ValidationError" in event.attributes["exception.type"]

Testing your instrumentation ensures it won't fail silently in production.

Common Mistakes

⚠️ Mistake 1: Over-relying on a Single Signal Type

Problem: Teams instrument only logs, or only metrics, missing the full picture.

Example: You see error_rate metrics spike, but have no logs explaining what failed or traces showing where.

Solution: Implement all three pillars with unified context. Use metrics for detection, traces for localization, logs for explanation.


⚠️ Mistake 2: Forgetting to Propagate Context

Problem: Each service logs independently without shared trace IDs.

Example:

Service A logs: "Request failed" (no trace_id)
Service B logs: "Database timeout" (no trace_id)

You can't prove they're related!

Solution: Always use inject() when making calls and extract() when receiving them. Add trace_id to all log entries.
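
A minimal sketch of stamping the current trace_id onto log entries inside a request handler, so lines from different services can be joined; the hex formatting matches how most tracing backends display trace IDs.

from opentelemetry import trace
import structlog

def handle_request(request):
    ctx = trace.get_current_span().get_span_context()
    log = structlog.get_logger().bind(
        trace_id=format(ctx.trace_id, "032x") if ctx.is_valid else None
    )
    log.info("request_received", path=request.path)
    # Every later log line from this handler now carries the trace_id
    return log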


⚠️ Mistake 3: High-Cardinality Metric Dimensions

Problem: Adding user_id, request_id, or session_id as metric tags explodes storage.

Example:

# ❌ Creates millions of unique metric series
request_counter.add(1, {"user_id": user_id})  # 1M users = 1M series!

Solution: Use high-cardinality IDs only in logs and trace attributes. Metric dimensions should have < 100 unique values.


⚠️ Mistake 4: Blocking on Instrumentation

Problem: Synchronous metric/log shipping adds latency to request path.

Example:

# ❌ HTTP call blocks response
def handler():
    result = process()
    requests.post("http://metrics-collector", json=result)
    return result

Solution: Buffer telemetry in memory and flush async. Use client libraries with built-in batching.
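
With OpenTelemetry, this is what the SDK's BatchSpanProcessor provides: spans are buffered in memory and exported on a background thread, keeping export I/O off the request path. A minimal sketch, assuming the OTLP gRPC exporter package is installed; the collector endpoint is illustrative.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://collector:4317"))
)
trace.set_tracer_provider(provider)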


⚠️ Mistake 5: Insufficient Error Context

Problem: Errors logged without enough information to debug.

Example:

logger.error("Payment failed")  # Which user? What amount? Why?

Solution: Include request context, error type, and relevant business data:

logger.error("payment_failed", 
            user_id=user.id,
            amount=payment.amount,
            error_type=type(e).__name__,
            provider_response=e.provider_message)

⚠️ Mistake 6: Not Instrumenting External Dependencies

Problem: You instrument your code but not database calls, HTTP clients, or queue operations.

Result: Blind spots when third-party services fail.

Solution: Use auto-instrumentation libraries for common frameworks, or manually wrap external calls with spans.
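
For example, outbound calls made with the requests library can be wrapped automatically; a minimal sketch, assuming the opentelemetry-instrumentation-requests package is installed (the URL is illustrative).

import requests
from opentelemetry.instrumentation.requests import RequestsInstrumentor

RequestsInstrumentor().instrument()  # every requests call now produces a client span

# This call gets a span with http.* attributes without touching our code
requests.get("http://payment-provider.internal/health")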


⚠️ Mistake 7: Treating Business Errors as System Failures

Problem: Declined payments, invalid inputs, or "not found" results marked as trace errors.

Example:

if payment_declined:
    span.set_status(Status(StatusCode.ERROR))  # ❌ Not a system error!

Solution: Reserve ERROR status for actual system failures (crashes, timeouts). Log business errors with OK status and descriptive attributes.

Key Takeaways

🎯 Core Principles:

  1. Three pillars work together: Metrics detect, traces localize, logs explain
  2. Context is king: Always propagate trace_id across service boundaries
  3. Instrument strategically: Focus on high-value, high-risk code paths first
  4. Semantic consistency: Use standard naming conventions across services
  5. Mind cardinality: Never use unbounded IDs as metric dimensions
  6. Sample intelligently: Keep all errors/slow requests, sample normal traffic
  7. Test your instrumentation: Verify signals are collected correctly

🧠 Remember:

  • MILD = Metrics, Instrumentation, Logs, Distributed traces
  • PEP = Propagate, Extract, Process (context flow pattern)
  • SAS = Semantic, Asynchronous, Sampled (instrumentation best practices)

💡 Quick Decision Tree:

  • Need alerting? → Use metrics
  • Need to see request flow? → Use traces
  • Need specific event details? → Use logs
  • All of the above? → Use all three with shared context!

🔧 Action Items:

  1. Audit existing instrumentation for the three pillars
  2. Implement context propagation in your inter-service calls
  3. Review metric dimensions for high-cardinality issues
  4. Add sampling to reduce telemetry costs
  5. Write tests that verify instrumentation correctness

📋 Quick Reference Card

| Concept | Key Point | Watch Out For |
| --- | --- | --- |
| Metrics | Aggregated numbers, low cardinality | Don't tag with user_id or request_id |
| Logs | Structured JSON with trace_id | Avoid logging in tight loops |
| Traces | Request flow across services | Must propagate context with inject/extract |
| Context | trace_id + span_id + baggage | Links all signals together |
| Sampling | 100% errors, 1% normal traffic | Never drop error traces |
| Attributes | Use semantic conventions (http.*, db.*) | Enables cross-service queries |

📚 Further Study

  1. OpenTelemetry Documentation - https://opentelemetry.io/docs/concepts/observability-primer/ - Comprehensive guide to observability concepts and instrumentation patterns

  2. Observability Engineering (O'Reilly) - https://www.oreilly.com/library/view/observability-engineering/9781492076438/ - Deep dive into building observable systems by Charity Majors, Liz Fong-Jones, and George Miranda

  3. Distributed Tracing in Practice - https://www.cncf.io/blog/2021/09/07/distributed-tracing-in-practice/ - CNCF guide with real-world examples and anti-patterns to avoid