OpenTelemetry Mental Model

Understand OTel as a specification and SDK ecosystem for vendor-neutral observability

This lesson covers the three core telemetry signals, the semantic conventions framework, and the unified collection architecture: essential concepts for building production observability systems in 2026 and beyond.

Welcome 🎯

OpenTelemetry (OTel) represents a paradigm shift in how we think about observability. Rather than treating metrics, logs, and traces as separate systems with different collection mechanisms, OpenTelemetry provides a unified mental model that treats all telemetry as interconnected signals flowing through a standardized pipeline.

Think of OpenTelemetry as the "USB standard" of observability. Before USB, every device had its own proprietary connector; USB replaced them all with one universal plug. OpenTelemetry does the same for telemetry data: one instrumentation approach, one collection pipeline, many backends. This mental model shift is crucial because it changes how you architect observability from the ground up.

In this lesson, you'll build an intuitive understanding of:

  • 📊 The three pillars of observability as complementary signals
  • 🔗 How context propagation connects signals across distributed systems
  • 📦 The collector architecture as a processing pipeline
  • 🏷️ Semantic conventions as the shared vocabulary
  • 🌊 The flow of telemetry from instrumentation to backend

Core Concepts: The Signal-Centric Mental Model 💡

1. Three Signals, One Unified System 📡

Traditional observability treats metrics, logs, and traces as separate data types requiring different tools. OpenTelemetry's mental model flips this: they're all telemetry signals that describe different aspects of the same underlying system behavior.

Signal Type | What It Captures | Mental Model | Key Strength
Traces 🔗 | Request flow through a distributed system | "The journey of a request" | Shows causality & timing
Metrics 📊 | Aggregated measurements over time | "The health dashboard" | Efficient for trends & alerting
Logs 📝 | Discrete events with rich context | "The detailed narrative" | Debugging specific instances

The key insight: These signals are complementary views of the same reality. A single user request generates:

  • A trace showing the request path (API → Auth → Database → Cache)
  • Metrics incrementing counters (request_count, latency_bucket)
  • Logs recording specific events ("User 12345 authenticated", "Cache miss for key xyz")

All three share the same context: trace ID, span ID, service name, timestamp. This shared context is what makes them navigable together.
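
You can see this shared context directly in code. A minimal sketch with the OpenTelemetry Python API (it assumes some span is currently active): the IDs read here are the same ones the trace backend and correlated log lines will show.

from opentelemetry import trace

# Read the active span's context; these IDs are shared by all signals
# emitted while this span is current.
span = trace.get_current_span()
ctx = span.get_span_context()
print(f"trace_id={ctx.trace_id:032x} span_id={ctx.span_id:016x}")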

2. Context Propagation: The Invisible Thread 🧵

The most powerful concept in OpenTelemetry is context propagation: the mechanism that links telemetry signals across service boundaries.

┌────────────────────────────────────────┐
│        CONTEXT PROPAGATION FLOW        │
└────────────────────────────────────────┘

  Client Request
      │
      ├─ trace_id: abc123
      ├─ span_id: span001
      └─ baggage: user_id=5678
      │
      ▼
  ┌─────────────┐   HTTP Headers    ┌─────────────┐
  │  Service A  │──────────────────▶│  Service B  │
  │ (Frontend)  │  W3C traceparent  │  (Auth API) │
  └─────────────┘                   └─────────────┘
      │                                     │
      │ Creates span002                     │ Creates span003
      │ (child of span001)                  │ (child of span002)
      │                                     │
      ▼                                     ▼
  All signals inherit:                  All signals inherit:
  - trace_id: abc123                    - trace_id: abc123
  - parent_span_id: span001             - parent_span_id: span002
  - baggage: user_id=5678               - baggage: user_id=5678

Mental model: Think of context like a backpack that travels with the request. Every service:

  1. Receives the backpack (extracts context from headers)
  2. Uses the backpack (adds context to its telemetry)
  3. Passes the backpack forward (injects context into outgoing calls)

This is why you can click a trace ID in a log line and see the full distributed trace, or jump from a metric exemplar to the trace behind it: all the signals share the same context.

💡 Pro tip: Context propagation is automatic when you use OpenTelemetry auto-instrumentation libraries. Manual instrumentation requires explicit context passing.
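
For manual instrumentation, the backpack steps look like this. A minimal sketch with the Python SDK; call_downstream and handle_request are hypothetical names, and the HTTP client is elided:

from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)

def call_downstream(http_client, url):
    headers = {}
    inject(headers)  # pass the backpack: writes traceparent/baggage headers
    return http_client.get(url, headers=headers)

def handle_request(incoming_headers):
    ctx = extract(incoming_headers)  # receive the backpack from the wire
    with tracer.start_as_current_span("handle_request", context=ctx):
        pass  # use the backpack: spans created here join the caller's trace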

3. The Collector: Your Telemetry Router 🚦

The OpenTelemetry Collector is a vendor-neutral telemetry gateway. Think of it as a specialized proxy that sits between your applications and your observability backends.

┌────────────────────────────────────────┐
│      OTEL COLLECTOR ARCHITECTURE       │
└────────────────────────────────────────┘

  Applications (instrumented with OTel SDKs)
     │        │        │
     ▼        ▼        ▼
  ┌─────┐  ┌─────┐  ┌─────┐
  │App A│  │App B│  │App C│
  └──┬──┘  └──┬──┘  └──┬──┘
     │        │        │
     └────────┼────────┘
              │ OTLP (OpenTelemetry Protocol)
              ▼
    ┌─────────────────────┐
    │   OTEL COLLECTOR    │
    ├─────────────────────┤
    │ 📥 RECEIVERS        │ ← Ingest data (OTLP, Jaeger, Prometheus)
    │ ⚙️ PROCESSORS       │ ← Transform (filter, sample, enrich)
    │ 📤 EXPORTERS        │ ← Send to backends (Jaeger, Prometheus, etc.)
    └─────────────────────┘
              │
     ┌────────┼────────┐
     ▼        ▼        ▼
  ┌──────┐ ┌──────┐ ┌──────┐
  │Jaeger│ │ Prom │ │ Loki │
  └──────┘ └──────┘ └──────┘
   Traces   Metrics    Logs

Mental model: The collector is a data processing pipeline with three stages:

  1. Receivers 📥: Accept telemetry in various formats

    • OTLP receiver (native OpenTelemetry protocol)
    • Jaeger receiver (backward compatibility)
    • Prometheus receiver (scrapes metrics)
    • Zipkin receiver (migration path)
  2. Processors ⚙️: Transform data in flight

    • Batch processor: Groups data for efficient export
    • Filter processor: Drops unwanted telemetry (reduce costs)
    • Attributes processor: Adds/removes/modifies tags
    • Sampling processor: Keeps representative subset of traces
  3. Exporters 📤: Send to observability backends

    • Jaeger exporter (traces)
    • Prometheus exporter (metrics)
    • OTLP exporter (vendor-neutral)
    • Debug exporter (inspecting the collector's own pipeline; replaces the older logging exporter)

Why this matters: The collector decouples your instrumentation (in applications) from your backend choice. Change backends? Reconfigure the collector. Applications stay unchanged.
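
As a sketch, a minimal collector configuration that wires the three stages into pipelines (the backend endpoint is a placeholder):

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  otlp:
    endpoint: backend.example.com:4317  # placeholder backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]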

4. Semantic Conventions: The Shared Language 🏷️

Semantic conventions are standardized naming rules for attributes, metrics, and spans. They're like a dictionary that ensures everyone describes the same things the same way.

Concept | Without Conventions | With Semantic Conventions
HTTP method | method, http_method, verb, request_type | http.request.method
HTTP status | status, status_code, response_code, http_status | http.response.status_code
Service name | service, app, application_name, svc | service.name
Database system | db, database, db_type, database_system | db.system

Mental model: Semantic conventions are like design patterns for telemetry. Instead of inventing attribute names, you follow conventions that make your telemetry:

  • Queryable: Dashboards work across services
  • Comparable: Metrics from different teams use the same labels
  • Interoperable: Third-party tools understand your data

Key convention categories:

  • Resource conventions: Describe the entity producing telemetry (service.name, host.name, k8s.pod.name)
  • Span conventions: Describe operations (http.*, db.*, rpc.*)
  • Metric conventions: Define standard measurements (http.server.request.duration, system.cpu.utilization)
  • Event conventions: Structure log events (exception.type, exception.message)

💡 Best practice: Always use semantic conventions when they exist. Only create custom attributes for domain-specific concepts.
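
One way to avoid typos is to set attributes through the published constants instead of hand-typed strings. A sketch assuming the opentelemetry-semantic-conventions Python package is installed:

from opentelemetry import trace
from opentelemetry.semconv.attributes import http_attributes

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("GET /api/checkout") as span:
    # These constants resolve to "http.request.method" and
    # "http.response.status_code", so the names can't drift.
    span.set_attribute(http_attributes.HTTP_REQUEST_METHOD, "GET")
    span.set_attribute(http_attributes.HTTP_RESPONSE_STATUS_CODE, 200)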

5. The Resource Model: Who's Producing This Signal? 🏢

Every telemetry signal is produced by a resource: a logical entity like a service, container, or serverless function. The resource model provides identity to your telemetry.

┌────────────────────────────────────────┐
│           RESOURCE HIERARCHY           │
└────────────────────────────────────────┘

  🏢 Organization: acme-corp
       │
       ├─ 🌍 Region: us-east-1
       │    │
       │    ├─ ☁️ Cluster: prod-k8s-01
       │    │    │
       │    │    ├─ 📦 Namespace: payments
       │    │    │    │
       │    │    │    ├─ 🎯 Service: checkout-api
       │    │    │    │    │
       │    │    │    │    ├─ 🔲 Pod: checkout-api-7d8f9
       │    │    │    │    │    │
       │    │    │    │    │    └─ 📊 Instance: checkout-api-7d8f9-container
       │    │    │    │    │
       │    │    │    │    └─ 🔲 Pod: checkout-api-a3b2c

Mental model: Resources are like mailing addresses for telemetry. A signal without resource attributes is like mail without a return address: you can't tell where it came from.

Common resource attributes:

service.name = "checkout-api"
service.version = "2.3.1"
service.namespace = "payments"
deployment.environment = "production"
k8s.cluster.name = "prod-k8s-01"
k8s.pod.name = "checkout-api-7d8f9"
host.name = "ip-10-0-45-123.ec2.internal"
cloud.provider = "aws"
cloud.region = "us-east-1"

Why this matters: Resources enable multi-dimensional aggregation. You can slice telemetry by service, environment, cluster, region, or any combination. This is how you answer questions like "How's the checkout-api performing in us-east-1 vs eu-west-1?"

6. The Span Model: Anatomy of Work 🔬

A span represents a single unit of work. Understanding spans deeply is crucial because they're the building blocks of distributed traces.

┌────────────────────────────────────────┐
│             SPAN STRUCTURE             │
└────────────────────────────────────────┘

┌──────────────────────────────────────────────────┐
│ Span: "GET /api/checkout"                        │
├──────────────────────────────────────────────────┤
│ trace_id:       abc123def456                     │
│ span_id:        span001                          │
│ parent_span_id: null (root span)                 │
│                                                  │
│ start_time:     2026-01-15T10:30:00.000Z         │
│ end_time:       2026-01-15T10:30:00.245Z         │
│ duration:       245ms                            │
│                                                  │
│ attributes:                                      │
│   http.request.method = "GET"                    │
│   http.route = "/api/checkout"                   │
│   http.response.status_code = 200                │
│   user.id = "user_5678"                          │
│                                                  │
│ events: [                                        │
│   {time: 10:30:00.050, name: "cache_miss"},      │
│   {time: 10:30:00.180, name: "payment_validated"}│
│ ]                                                │
│                                                  │
│ status: OK                                       │
└──────────────────────────────────────────────────┘

Span types (by kind attribute):

  • SERVER: Receives requests (API endpoints)
  • CLIENT: Makes outgoing calls (HTTP client, DB query)
  • INTERNAL: In-process work (function calls, business logic)
  • PRODUCER: Publishes messages (Kafka producer)
  • CONSUMER: Processes messages (Kafka consumer)

Mental model: A span is like a stopwatch entry with metadata. It records:

  1. What was done (operation name, attributes)
  2. When it happened (timestamps)
  3. How long it took (duration)
  4. Where in the request flow (parent/child relationships)
  5. How it went (status: OK, ERROR, UNSET)

💡 Critical insight: Spans form a tree structure (the trace tree). Parent spans represent higher-level operations, children represent sub-operations. Following parent_span_id links reconstructs the execution flow.
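
The anatomy above maps directly onto the API. A minimal sketch with the OpenTelemetry Python SDK, reusing the operation names from the example span:

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("GET /api/checkout") as parent:  # root span
    parent.set_attribute("http.route", "/api/checkout")
    with tracer.start_as_current_span("validate_card") as child:   # child span
        child.add_event("cache_miss")            # timestamped event on the span
        child.set_status(Status(StatusCode.OK))  # explicit status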

Examples: Mental Models in Action 🎬

Example 1: Trace-First Debugging Mental Model

Scenario: Users report "checkout is slow." How does the OpenTelemetry mental model guide investigation?

Traditional approach (siloed):

  1. Check API metrics → "Average latency is 2.3s"
  2. Search logs → "Payment service mentioned in errors"
  3. Check payment service metrics → "Database queries slow"
  4. Try to correlate timestamps → "Maybe this log matches that metric?"

OpenTelemetry mental model:

  1. Start with a trace of a slow checkout request
  2. Follow the critical path (spans on the timeline)
  3. Identify the bottleneck span (payment_service.validate_card = 1.8s)
  4. Jump to that span's logs (same trace_id) → "connection pool exhausted"
  5. Check correlated metrics (db.connections.active by service.name) → "Payment service at 100/100 connections"
  6. Root cause identified: Payment service connection pool too small for load

┌────────────────────────────────────────┐
│       TRACE-FIRST INVESTIGATION        │
└────────────────────────────────────────┘

  Trace Timeline (total: 2.1s)
  ┌─────────────────────────────────────────────┐
  │ GET /checkout              [2.1s]           │
  └─────────────────────────────────────────────┘
       │
       ├─ validate_session      [50ms]  ✅
       │
       ├─ fetch_cart            [100ms] ✅
       │
       ├─ validate_card         [1.8s]  🔴 BOTTLENECK
       │    │
       │    └─ db.query(SELECT...) [1.75s] 🔴
       │         └─ "connection_pool_exhausted" event
       │
       └─ create_order          [150ms] ✅

  Action: Scale payment service DB pool from 100 → 200

Key insight: The mental model shifts from "search for clues" to "follow the data flow." Traces provide the narrative structure, metrics quantify the problem, logs explain why.

Example 2: The Sampling Mental Model

Challenge: Recording 100% of traces in high-traffic systems is expensive. How do you balance visibility with cost?

Mental model: Think of sampling like journalism: you don't interview every person to understand public opinion, you sample strategically.

Sampling strategies in OpenTelemetry:

Strategy | Mental Model | Use Case | Trade-off
Head sampling | "Flip a coin at the start" | Predictable cost | May miss rare errors
Tail sampling | "Decide after seeing the full story" | Keep all errors, sample successes | Requires buffering (complexity)
Adaptive sampling | "Adjust rate based on traffic" | Handle traffic spikes | Rate fluctuates
Priority sampling | "VIP requests always recorded" | Never lose important traces | Needs priority logic
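
Head sampling, by contrast, is usually configured in the SDK at startup; a minimal sketch with the Python SDK's built-in samplers:

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 10% of new traces at the root; downstream services follow
# the parent's decision so traces are never half-recorded.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.10)))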

Example configuration (tail sampling in collector):

processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow_requests
        type: latency
        latency: {threshold_ms: 1000}
      - name: sample_successes
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

Mental model in action:

  • All errors → Keep (100% of errors for debugging)
  • Slow requests (>1s) → Keep (performance investigation)
  • Fast successes → Keep 10% (represent normal behavior)

Result: Full error visibility, representative performance data, 70-90% cost reduction.

Example 3: The Cardinality Mental Model

Challenge: You add user_id as a metric label. Suddenly your metrics backend bill explodes. Why?

Mental model: Think of metric labels as database indexes. Each unique combination of label values creates a new time series.

┌────────────────────────────────────────┐
│         CARDINALITY EXPLOSION          │
└────────────────────────────────────────┘

Metric: http_requests_total

Low cardinality (safe):
  Labels: {method, status, endpoint}
  method: 5 values (GET, POST, PUT, DELETE, PATCH)
  status: 6 values (2xx, 3xx, 4xx, 5xx, timeout, unknown)
  endpoint: 20 values (/api/users, /api/checkout, ...)

  Total series = 5 × 6 × 20 = 600 time series ✅

High cardinality (dangerous):
  Labels: {method, status, endpoint, user_id}
  method: 5 values
  status: 6 values
  endpoint: 20 values
  user_id: 1,000,000 values (unique users)

  Total series = 5 × 6 × 20 × 1,000,000
               = 600,000,000 time series 💥

The OpenTelemetry mental model solution:

  • High-cardinality data (user_id, trace_id) → Put in span attributes (traces)
  • Low-cardinality data (endpoint, status) → Put in metric labels
  • Need user-level metrics? → Use exemplars (link metrics to traces)

Example: Metric with exemplar

http_requests_total{method="GET", endpoint="/api/checkout"} = 1523
exemplar: {value=0.234, trace_id="abc123", span_id="span001", user_id="5678"}

Mental model: Metrics give you the aggregate view (1523 requests), exemplars give you representative samples with high-cardinality context ("here's one request with full trace context").
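
You don't attach exemplars by hand. In SDKs that support them (recent opentelemetry-python releases, for example), recording a measurement while a sampled span is active lets the SDK capture that span's context; a hedged sketch:

from opentelemetry import metrics, trace

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)
duration = meter.create_histogram("http.server.request.duration", unit="s")

with tracer.start_as_current_span("GET /api/checkout"):
    # Recorded inside a sampled span: the SDK can attach an exemplar
    # carrying this trace_id/span_id to the aggregated histogram.
    duration.record(0.234, {"http.request.method": "GET"})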

Example 4: The Context Baggage Mental Model

Scenario: You need to track which A/B test variant a user saw, across all services in the request path.

Wrong approach: Add ab_test_variant as a tag to every span manually.

OpenTelemetry mental model: Use baggage, the context mechanism that propagates values automatically.

┌────────────────────────────────────────┐
│          BAGGAGE PROPAGATION           │
└────────────────────────────────────────┘

  Frontend Service
       │
       ├─ User enters: ab_test_variant = "new_checkout"
       ├─ Added to baggage (in context)
       │
       ▼
  ┌───────────────┐
  │ Baggage:      │
  │ ab_test =     │ ──────HTTP Request──────▶
  │ new_checkout  │
  └───────────────┘
       │
       ▼
  Payment Service (receives baggage automatically)
       │
       ├─ Reads: baggage.get("ab_test") = "new_checkout"
       ├─ Adds to span attributes (optional)
       ├─ Adds to metric labels (optional)
       │
       ▼
  ┌───────────────┐
  │ Baggage:      │
  │ ab_test =     │ ──────HTTP Request──────▶
  │ new_checkout  │
  └───────────────┘
       │
       ▼
  Inventory Service (receives baggage automatically)
       │
       └─ All telemetry can access the ab_test variant
Code example:

# Frontend service
from opentelemetry import baggage, context

variant = run_ab_test(user_id)
ctx = baggage.set_baggage("ab_test_variant", variant)
token = context.attach(ctx)  # set_baggage returns a new Context; attach it to make it current

# Payment service (no code needed: propagation is automatic)

# Inventory service (optionally read the baggage)
variant = baggage.get_baggage("ab_test_variant")
span.set_attribute("ab_test_variant", variant)

Mental model: Baggage is like sticky notes on the context backpack. Add it once, every service downstream can read it. Use for:

  • Feature flags
  • User segments
  • Experiment variants
  • Request priorities

⚠️ Warning: Baggage adds data to every request header. Keep it small (<1KB) and don't put sensitive data (it's visible in headers).
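
On the wire, baggage travels in the W3C baggage header as comma-separated key=value pairs (for example, baggage: ab_test_variant=new_checkout), which is also why any hop between your services can read it.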

Common Mistakes ⚠️

Mistake 1: Treating Signals as Independent Systems

❌ Wrong: "We'll send traces to Jaeger, metrics to Prometheus, logs to Elasticsearch. They're separate."

✅ Right: "All signals share trace_id and resource attributes. We'll configure the collector to route each signal type, but ensure context propagation works across all."

Why it matters: Independent signals lose the correlation power. You can't jump from an alert (metric) to a trace to relevant logs if they don't share context.

Mistake 2: Over-Instrumenting with Spans

❌ Wrong: Creating spans for every function call.

@trace_span("calculate_total")  # Too granular!
def calculate_total(items):
    return sum(item.price for item in items)

✅ Right: Create spans around meaningful units of work: operations that cross boundaries or have meaningful duration.

@trace_span("process_order")  # Meaningful operation
def process_order(order):
    total = calculate_total(order.items)  # No span
    payment = charge_card(total)  # Child span (external call)
    return create_receipt(order, payment)  # Child span (DB write)

Rule of thumb: If it takes <1ms and doesn't cross a boundary (network, process, thread), it probably doesn't need a span.

Mistake 3: Ignoring Semantic Conventions

❌ Wrong: Inventing your own attribute names.

span.set_attribute("request_method", "GET")  # Non-standard
span.set_attribute("response_status", 200)   # Non-standard

✅ Right: Use semantic conventions.

span.set_attribute("http.method", "GET")
span.set_attribute("http.status_code", 200)

Why it matters: Dashboards, queries, and tooling expect standard names. Custom names break interoperability.

Mistake 4: High-Cardinality Metric Labels

❌ Wrong:

requests_counter.add(1, {"user_id": user_id})  # πŸ’₯ Million+ time series

✅ Right:

# Keep metric labels low-cardinality
requests_counter.add(1, {"endpoint": "/api/checkout", "status": "200"})
# Put high-cardinality data in span attributes
span.set_attribute("user.id", user_id)

Mistake 5: Not Using the Collector

❌ Wrong: Applications export directly to backends.

App → Jaeger
App → Prometheus
App → Elasticsearch

Problem: Change backend = change every app. No central processing.

✅ Right: Route through the collector.

Apps → Collector → (Jaeger, Prometheus, Elasticsearch)

Benefit: Decouple apps from backends. Add sampling, filtering, enrichment in one place.

Mistake 6: Forgetting Resource Attributes

❌ Wrong: Spans have no resource context.

# Missing resource setup
tracer = trace.get_tracer(__name__)

✅ Right: Configure resources at SDK initialization.

from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

resource = Resource.create({
    "service.name": "checkout-api",
    "service.version": "2.3.1",
    "deployment.environment": "production"
})
tracer_provider = TracerProvider(resource=resource)

Why it matters: Without resource attributes, you can't tell which service produced telemetry.
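
The same identity can also be supplied without code via the SDK's standard environment variables, for example OTEL_SERVICE_NAME=checkout-api and OTEL_RESOURCE_ATTRIBUTES=service.version=2.3.1,deployment.environment=production.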

Key Takeaways 🎯

  1. Unified Signals: Metrics, traces, and logs are complementary views of the same system, connected by shared context (trace_id, span_id, resource).

  2. Context is King: Context propagation (via the W3C traceparent header and baggage) is what makes distributed tracing work. It's the invisible thread connecting all telemetry.

  3. Collector = Flexibility: The OpenTelemetry Collector decouples instrumentation from backends. Change vendors without changing code.

  4. Semantic Conventions = Interoperability: Use standardized attribute names. Custom attributes break dashboards and queries.

  5. Resources Provide Identity: Every signal needs resource attributes (service.name, deployment.environment) to be useful.

  6. Cardinality Matters: High-cardinality data (user_id, trace_id) goes in span attributes, not metric labels.

  7. Sampling is Strategic: Use tail sampling to keep all errors while sampling successes. Balance visibility with cost.

  8. Spans Represent Work: Create spans for meaningful operations (network calls, DB queries, significant processing), not every function.

  9. Baggage for Cross-Cutting Context: Use baggage to propagate cross-cutting context (feature flags, user segments) automatically.

  10. Trace-First Debugging: When debugging, start with a trace to see the request flow, then use correlated metrics and logs to understand details.

📋 OpenTelemetry Mental Model Quick Reference

Three Signals | Traces (request flow), Metrics (aggregates), Logs (events)
Context Propagation | trace_id + span_id travel across services via HTTP headers
Collector Stages | Receivers → Processors → Exporters
Semantic Conventions | http.request.method, http.response.status_code, db.system (standard names)
Resource Attributes | service.name, deployment.environment (who produced this?)
Span Anatomy | trace_id, span_id, parent_span_id, timestamps, attributes, events
Cardinality Rule | Low-cardinality → metric labels; high-cardinality → span attributes
Baggage Use | Feature flags, experiment variants, user segments (propagates automatically)
Sampling Strategy | Tail sampling: keep 100% of errors, sample 10% of successes
Debugging Flow | Alert (metric) → Trace (identify bottleneck) → Logs (explain why)

📚 Further Study

  1. OpenTelemetry Official Documentation: https://opentelemetry.io/docs/concepts/ - Comprehensive coverage of concepts, specifications, and semantic conventions.

  2. W3C Trace Context Specification: https://www.w3.org/TR/trace-context/ - Deep dive into how context propagation works at the protocol level.

  3. OpenTelemetry Semantic Conventions: https://github.com/open-telemetry/semantic-conventions - Complete reference for standardized attribute names across all signals.


🧠 Memory Device: Remember "CRISS" for the OpenTelemetry mental model:

  • Context propagation (links everything)
  • Resources (identity)
  • Instrumentation (SDKs)
  • Signals (traces, metrics, logs)
  • Semantic conventions (shared language)

Master this mental model, and you'll understand not just how to use OpenTelemetry, but why it's designed the way it is, and how to architect observability that scales from prototype to production. 🚀