OpenTelemetry Mental Model
Understand OTel as a specification and SDK ecosystem for vendor-neutral observability
Master the OpenTelemetry mental model with free flashcards and spaced repetition practice. This lesson covers the three core telemetry signals, the semantic conventions framework, and the unified collection architecture: essential concepts for building production observability systems in 2026 and beyond.
Welcome
OpenTelemetry (OTel) represents a paradigm shift in how we think about observability. Rather than treating metrics, logs, and traces as separate systems with different collection mechanisms, OpenTelemetry provides a unified mental model that treats all telemetry as interconnected signals flowing through a standardized pipeline.
Think of OpenTelemetry as the "USB standard" of observability. Before USB, every device had its own proprietary connector. OpenTelemetry does the same for telemetry data: one instrumentation approach, one collection pipeline, many backends. This mental model shift is crucial because it changes how you architect observability from the ground up.
In this lesson, you'll build an intuitive understanding of:
- The three pillars of observability as complementary signals
- How context propagation connects signals across distributed systems
- The collector architecture as a processing pipeline
- Semantic conventions as the shared vocabulary
- The flow of telemetry from instrumentation to backend
Core Concepts: The Signal-Centric Mental Model
1. Three Signals, One Unified System
Traditional observability treats metrics, logs, and traces as separate data types requiring different tools. OpenTelemetry's mental model flips this: they're all telemetry signals that describe different aspects of the same underlying system behavior.
| Signal Type | What It Captures | Mental Model | Key Strength |
|---|---|---|---|
| Traces | Request flow through a distributed system | "The journey of a request" | Shows causality & timing |
| Metrics | Aggregated measurements over time | "The health dashboard" | Efficient for trends & alerting |
| Logs | Discrete events with rich context | "The detailed narrative" | Debugging specific instances |
The key insight: These signals are complementary views of the same reality. A single user request generates:
- A trace showing the request path (API → Auth → Database → Cache)
- Metrics incrementing counters (request_count, latency_bucket)
- Logs recording specific events ("User 12345 authenticated", "Cache miss for key xyz")
All three share the same context: trace ID, span ID, service name, timestamp. This shared context is what makes them navigable together.
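The shared-context idea can be sketched in plain Python. These are illustrative dictionaries, not the OpenTelemetry API; the field names are assumptions for the sketch:

```python
# One request emits three signals; all carry (or link to) the same trace_id.
trace_id = "abc123"

span   = {"trace_id": trace_id, "name": "GET /api/checkout", "duration_ms": 245}
log    = {"trace_id": trace_id, "message": "Cache miss for key xyz"}
metric = {"name": "request_count", "value": 1,
          "exemplar_trace_id": trace_id}  # metrics link to traces via exemplars

# Shared context is what makes the signals navigable together:
assert span["trace_id"] == log["trace_id"] == metric["exemplar_trace_id"]
```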
2. Context Propagation: The Invisible Thread
The most powerful concept in OpenTelemetry is context propagation: the mechanism that links telemetry signals across service boundaries.
+------------------------------------------------------+
|               CONTEXT PROPAGATION FLOW               |
+------------------------------------------------------+

Client Request
  |
  +-- trace_id: abc123
  +-- span_id:  span001
  +-- baggage:  user_id=5678
  |
  v
+-------------+      HTTP Headers      +-------------+
|  Service A  | ---------------------> |  Service B  |
| (Frontend)  |    W3C traceparent     | (Auth API)  |
+-------------+                        +-------------+
      |                                      |
      | Creates span002                      | Creates span003
      | (child of span001)                   | (child of span002)
      v                                      v
All signals inherit:                  All signals inherit:
- trace_id: abc123                    - trace_id: abc123
- parent_span_id: span001             - parent_span_id: span002
- baggage: user_id=5678               - baggage: user_id=5678
Mental model: Think of context like a backpack that travels with the request. Every service:
- Receives the backpack (extracts context from headers)
- Uses the backpack (adds context to its telemetry)
- Passes the backpack forward (injects context into outgoing calls)
This is why you can click a trace ID in logs and see the full distributed trace, or filter metrics by trace ID: they all share the same context.
💡 Pro tip: Context propagation is automatic when you use OpenTelemetry auto-instrumentation libraries. Manual instrumentation requires explicit context passing.
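The inject/extract cycle can be sketched in a few lines of plain Python. This is a toy model of the W3C traceparent header, not the real propagator API:

```python
# traceparent format: version-trace_id-parent_span_id-flags
def inject(ctx, headers):
    """Write the current span's context into outgoing HTTP headers."""
    headers["traceparent"] = f"00-{ctx['trace_id']}-{ctx['span_id']}-01"

def extract(headers):
    """Read the propagated context from incoming HTTP headers."""
    _version, trace_id, parent_span_id, _flags = headers["traceparent"].split("-")
    return {"trace_id": trace_id, "parent_span_id": parent_span_id}

# Service A injects its context into the outgoing request...
headers = {}
inject({"trace_id": "0af7651916cd43dd8448eb211c80319c",
        "span_id": "b7ad6b7169203331"}, headers)

# ...and Service B extracts it, so its spans join the same trace.
ctx = extract(headers)
assert ctx["trace_id"] == "0af7651916cd43dd8448eb211c80319c"
assert ctx["parent_span_id"] == "b7ad6b7169203331"
```

In real services, OpenTelemetry's configured propagators run this inject/extract cycle automatically on every instrumented HTTP client and server call.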
3. The Collector: Your Telemetry Router
The OpenTelemetry Collector is a vendor-neutral telemetry gateway. Think of it as a specialized proxy that sits between your applications and your observability backends.
+----------------------------------------------------+
|             OTEL COLLECTOR ARCHITECTURE            |
+----------------------------------------------------+

Applications (instrumented with OTel SDKs)

 +-------+    +-------+    +-------+
 | App A |    | App B |    | App C |
 +---+---+    +---+---+    +---+---+
     |            |            |
     +------------+------------+
                  |
                  | OTLP (OpenTelemetry Protocol)
                  v
     +---------------------------+
     |       OTEL COLLECTOR      |
     +---------------------------+
     | RECEIVERS                 |  <- Ingest data (OTLP, Jaeger, Prometheus)
     | PROCESSORS                |  <- Transform (filter, sample, enrich)
     | EXPORTERS                 |  <- Send to backends (Jaeger, Prometheus, etc.)
     +---------------------------+
                  |
     +------------+------------+
     |            |            |
     v            v            v
+--------+   +--------+   +--------+
| Jaeger |   |  Prom  |   |  Loki  |
+--------+   +--------+   +--------+
  Traces      Metrics       Logs
Mental model: The collector is a data processing pipeline with three stages:
Receivers: Accept telemetry in various formats
- OTLP receiver (native OpenTelemetry protocol)
- Jaeger receiver (backward compatibility)
- Prometheus receiver (scrapes metrics)
- Zipkin receiver (migration path)
Processors: Transform data in flight
- Batch processor: Groups data for efficient export
- Filter processor: Drops unwanted telemetry (reduce costs)
- Attributes processor: Adds/removes/modifies tags
- Sampling processor: Keeps representative subset of traces
Exporters: Send to observability backends
- Jaeger exporter (traces)
- Prometheus exporter (metrics)
- OTLP exporter (vendor-neutral)
- Logging exporter (debugging the collector itself)
Why this matters: The collector decouples your instrumentation (in applications) from your backend choice. Change backends? Reconfigure the collector. Applications stay unchanged.
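As a sketch, a minimal collector configuration wiring the three stages into pipelines might look like the following. The backend endpoints (`jaeger:4317`, port `8889`) are placeholders, and a real deployment would tune TLS and batching:

```yaml
receivers:
  otlp:                      # ingest OTLP over gRPC and HTTP
    protocols:
      grpc:
      http:

processors:
  batch:                     # group telemetry for efficient export

exporters:
  otlp/jaeger:               # traces to a Jaeger backend (placeholder endpoint)
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheus:                # expose metrics for Prometheus to scrape
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```

Swapping backends means editing the `exporters` section and the pipeline wiring; the applications keep emitting OTLP unchanged.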
4. Semantic Conventions: The Shared Language
Semantic conventions are standardized naming rules for attributes, metrics, and spans. They're like a dictionary that ensures everyone describes the same things the same way.
| Concept | Without Conventions | With Semantic Conventions |
|---|---|---|
| HTTP method | method, http_method, verb, request_type | http.method |
| HTTP status | status, status_code, response_code, http_status | http.status_code |
| Service name | service, app, application_name, svc | service.name |
| Database system | db, database, db_type, database_system | db.system |
Mental model: Semantic conventions are like design patterns for telemetry. Instead of inventing attribute names, you follow conventions that make your telemetry:
- Queryable: Dashboards work across services
- Comparable: Metrics from different teams use same labels
- Interoperable: Third-party tools understand your data
Key convention categories:
- Resource conventions: Describe the entity producing telemetry (service.name, host.name, k8s.pod.name)
- Span conventions: Describe operations (http.*, db.*, rpc.*)
- Metric conventions: Define standard measurements (http.server.duration, system.cpu.utilization)
- Event conventions: Structure log events (exception.type, exception.message)
💡 Best practice: Always use semantic conventions when they exist. Only create custom attributes for domain-specific concepts.
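A tiny illustration of why shared names pay off: with consistent keys, one query works across every service. This is plain Python over hypothetical span data, standing in for a dashboard query:

```python
# Three services, all using the same conventional attribute names.
spans = [
    {"service.name": "frontend",  "http.method": "GET",  "http.status_code": 500},
    {"service.name": "payments",  "http.method": "POST", "http.status_code": 200},
    {"service.name": "inventory", "http.method": "GET",  "http.status_code": 500},
]

# One "dashboard query" finds errors everywhere - no per-team name mapping.
errors = [s["service.name"] for s in spans if s["http.status_code"] >= 500]
assert errors == ["frontend", "inventory"]
```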
5. The Resource Model: Who's Producing This Signal?
Every telemetry signal is produced by a resource: a logical entity like a service, container, or serverless function. The resource model provides identity to your telemetry.
+----------------------------------------------------+
|                 RESOURCE HIERARCHY                 |
+----------------------------------------------------+

Organization: acme-corp
|
+-- Region: us-east-1
    |
    +-- Cluster: prod-k8s-01
        |
        +-- Namespace: payments
            |
            +-- Service: checkout-api
                |
                +-- Pod: checkout-api-7d8f9
                |   |
                |   +-- Instance: checkout-api-7d8f9-container
                |
                +-- Pod: checkout-api-a3b2c
Mental model: Resources are like mailing addresses for telemetry. A signal without resource attributes is like mail without a return address: you can't tell where it came from.
Common resource attributes:
service.name           = "checkout-api"
service.version        = "2.3.1"
service.namespace      = "payments"
deployment.environment = "production"
k8s.cluster.name       = "prod-k8s-01"
k8s.pod.name           = "checkout-api-7d8f9"
host.name              = "ip-10-0-45-123.ec2.internal"
cloud.provider         = "aws"
cloud.region           = "us-east-1"
Why this matters: Resources enable multi-dimensional aggregation. You can slice telemetry by service, environment, cluster, region, or any combination. This is how you answer questions like "How's the checkout-api performing in us-east-1 vs eu-west-1?"
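The slicing idea in plain Python, with hypothetical data points standing in for a metrics backend:

```python
from statistics import mean

# Latency points tagged with resource attributes.
points = [
    {"service.name": "checkout-api", "cloud.region": "us-east-1", "latency_ms": 120},
    {"service.name": "checkout-api", "cloud.region": "eu-west-1", "latency_ms": 340},
    {"service.name": "checkout-api", "cloud.region": "us-east-1", "latency_ms": 80},
]

def avg_latency(region):
    """Aggregate along one resource dimension."""
    return mean(p["latency_ms"] for p in points if p["cloud.region"] == region)

assert avg_latency("us-east-1") == 100   # (120 + 80) / 2
assert avg_latency("eu-west-1") == 340
```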
6. The Span Model: Anatomy of Work
A span represents a single unit of work. Understanding spans deeply is crucial because they're the building blocks of distributed traces.
+----------------------------------------------------+
|                   SPAN STRUCTURE                   |
+----------------------------------------------------+

+---------------------------------------------------+
| Span: "GET /api/checkout"                         |
+---------------------------------------------------+
| trace_id:       abc123def456                      |
| span_id:        span001                           |
| parent_span_id: null (root span)                  |
|                                                   |
| start_time: 2026-01-15T10:30:00.000Z              |
| end_time:   2026-01-15T10:30:00.245Z              |
| duration:   245ms                                 |
|                                                   |
| attributes:                                       |
|   http.method      = "GET"                        |
|   http.route       = "/api/checkout"              |
|   http.status_code = 200                          |
|   user.id          = "user_5678"                  |
|                                                   |
| events:                                           |
|   {time: 10:30:00.050, name: "cache_miss"}        |
|   {time: 10:30:00.180, name: "payment_validated"} |
|                                                   |
| status: OK                                        |
+---------------------------------------------------+
Span types (by kind attribute):
- SERVER: Receives requests (API endpoints)
- CLIENT: Makes outgoing calls (HTTP client, DB query)
- INTERNAL: In-process work (function calls, business logic)
- PRODUCER: Publishes messages (Kafka producer)
- CONSUMER: Processes messages (Kafka consumer)
Mental model: A span is like a stopwatch entry with metadata. It records:
- What was done (operation name, attributes)
- When it happened (timestamps)
- How long it took (duration)
- Where in the request flow (parent/child relationships)
- How it went (status: OK, ERROR, UNSET)
💡 Critical insight: Spans form a tree structure (the trace tree). Parent spans represent higher-level operations; children represent sub-operations. Following parent_span_id links reconstructs the execution flow.
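Reconstructing the tree from parent links can be sketched in plain Python, using toy span dictionaries rather than SDK objects:

```python
# Four spans from one trace; parent_span_id encodes the tree.
spans = [
    {"span_id": "span001", "parent_span_id": None,      "name": "GET /api/checkout"},
    {"span_id": "span002", "parent_span_id": "span001", "name": "validate_card"},
    {"span_id": "span003", "parent_span_id": "span002", "name": "db.query"},
    {"span_id": "span004", "parent_span_id": "span001", "name": "create_order"},
]

def children(parent_id):
    """Names of spans whose parent is parent_id."""
    return [s["name"] for s in spans if s["parent_span_id"] == parent_id]

assert children(None) == ["GET /api/checkout"]                 # the root span
assert children("span001") == ["validate_card", "create_order"]
assert children("span002") == ["db.query"]
```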
Examples: Mental Models in Action
Example 1: Trace-First Debugging Mental Model
Scenario: Users report "checkout is slow." How does the OpenTelemetry mental model guide investigation?
Traditional approach (siloed):
- Check API metrics → "Average latency is 2.3s"
- Search logs → "Payment service mentioned in errors"
- Check payment service metrics → "Database queries slow"
- Try to correlate timestamps → "Maybe this log matches that metric?"
OpenTelemetry mental model:
- Start with a trace of a slow checkout request
- Follow the critical path (spans on the timeline)
- Identify the bottleneck span (payment_service.validate_card = 1.8s)
- Jump to that span's logs (same trace_id) → "connection pool exhausted"
- Check correlated metrics (db.connections.active by service.name) → "Payment service at 100/100 connections"
- Root cause identified: Payment service connection pool too small for load
+----------------------------------------------------+
|              TRACE-FIRST INVESTIGATION             |
+----------------------------------------------------+

Trace Timeline (total: 2.1s)

GET /checkout [2.1s]
|
+-- validate_session [50ms]            ✓
|
+-- fetch_cart [100ms]                 ✓
|
+-- validate_card [1.8s]               <-- BOTTLENECK
|   |
|   +-- db.query(SELECT...) [1.75s]    <-- slow
|   +-- "connection_pool_exhausted" event
|
+-- create_order [150ms]               ✓

Action: Scale the payment service DB pool from 100 → 200
Key insight: The mental model shifts from "search for clues" to "follow the data flow." Traces provide the narrative structure, metrics quantify the problem, logs explain why.
Example 2: The Sampling Mental Model
Challenge: Recording 100% of traces in high-traffic systems is expensive. How do you balance visibility with cost?
Mental model: Think of sampling like journalism: you don't interview every person to understand public opinion, you sample strategically.
Sampling strategies in OpenTelemetry:
| Strategy | Mental Model | Use Case | Trade-off |
|---|---|---|---|
| Head sampling | "Flip coin at start" | Predictable cost | May miss rare errors |
| Tail sampling | "Decide after seeing full story" | Keep all errors, sample successes | Requires buffering (complexity) |
| Adaptive sampling | "Adjust rate based on traffic" | Handle traffic spikes | Rate fluctuates |
| Priority sampling | "VIP requests always recorded" | Never lose important traces | Needs priority logic |
Example configuration (tail sampling in collector):
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow_requests
        type: latency
        latency: {threshold_ms: 1000}
      - name: sample_successes
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
Mental model in action:
- All errors → keep (100% of errors for debugging)
- Slow requests (>1s) → keep (performance investigation)
- Fast successes → keep 10% (represent normal behavior)
Result: Full error visibility, representative performance data, 70-90% cost reduction.
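The three policies amount to one decision function. Here is that logic as a plain-Python sketch; in production the collector's tail_sampling processor makes this decision after buffering the full trace:

```python
import random

def keep_trace(has_error, duration_ms, sample_rate=0.10):
    """Tail-sampling decision, made after the whole trace has been seen."""
    if has_error:              # errors: always keep
        return True
    if duration_ms > 1000:     # slow requests: always keep
        return True
    return random.random() < sample_rate  # fast successes: keep ~10%

assert keep_trace(has_error=True, duration_ms=50)      # error -> kept
assert keep_trace(has_error=False, duration_ms=1500)   # slow -> kept
assert not keep_trace(False, 10, sample_rate=0.0)      # fast success at 0% -> dropped
```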
Example 3: The Cardinality Mental Model
Challenge: You add user_id as a metric label. Suddenly your metrics backend bill explodes. Why?
Mental model: Think of metric labels as database indexes. Each unique combination of label values creates a new time series.
+----------------------------------------------------+
|                CARDINALITY EXPLOSION               |
+----------------------------------------------------+

Metric: http_requests_total

Low cardinality (safe):
  Labels: {method, status, endpoint}
    method:   5 values (GET, POST, PUT, DELETE, PATCH)
    status:   6 values (2xx, 3xx, 4xx, 5xx, timeout, unknown)
    endpoint: 20 values (/api/users, /api/checkout, ...)

  Total series = 5 × 6 × 20 = 600 time series          OK

High cardinality (dangerous):
  Labels: {method, status, endpoint, user_id}
    method:   5 values
    status:   6 values
    endpoint: 20 values
    user_id:  1,000,000 values (unique users)

  Total series = 5 × 6 × 20 × 1,000,000
               = 600,000,000 time series               EXPLOSION
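The series arithmetic is worth sanity-checking in plain Python:

```python
from math import prod

# Each unique combination of label values is one time series.
low  = {"method": 5, "status": 6, "endpoint": 20}
high = dict(low, user_id=1_000_000)   # one added high-cardinality label

assert prod(low.values()) == 600               # safe
assert prod(high.values()) == 600_000_000      # explosion
```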
The OpenTelemetry mental model solution:
- High-cardinality data (user_id, trace_id) → put in span attributes (traces)
- Low-cardinality data (endpoint, status) → put in metric labels
- Need user-level metrics? → use exemplars (link metrics to traces)
Example: Metric with exemplar
http_requests_total{method="GET", endpoint="/api/checkout"} = 1523
exemplar: {value=0.234, trace_id="abc123", span_id="span001", user_id="5678"}
Mental model: Metrics give you the aggregate view (1523 requests), exemplars give you representative samples with high-cardinality context ("here's one request with full trace context").
Example 4: The Context Baggage Mental Model
Scenario: You need to track which A/B test variant a user saw, across all services in the request path.
Wrong approach: Add ab_test_variant as a tag to every span manually.
OpenTelemetry mental model: Use baggage, context that propagates automatically.
+----------------------------------------------------+
|                 BAGGAGE PROPAGATION                |
+----------------------------------------------------+

Frontend Service
|
+-- User enters: ab_test_variant = "new_checkout"
+-- Added to baggage (in context)
|
v
+---------------+
| Baggage:      |
| ab_test =     |  ----- HTTP Request ----->
| new_checkout  |
+---------------+
|
v
Payment Service (receives baggage automatically)
|
+-- Reads: baggage.get("ab_test") = "new_checkout"
+-- Adds to span attributes (optional)
+-- Adds to metric labels (optional)
|
v
+---------------+
| Baggage:      |  ----- HTTP Request ----->
| ab_test =     |
| new_checkout  |
+---------------+
|
v
Inventory Service (receives baggage automatically)
|
+-- All telemetry can access the ab_test variant
Code example:
# Frontend service
from opentelemetry import baggage, context

variant = run_ab_test(user_id)
# set_baggage returns a new Context; attach it so it becomes current
token = context.attach(baggage.set_baggage("ab_test_variant", variant))

# Payment service: no code needed - propagation is automatic

# Inventory service (optional: read the baggage)
variant = baggage.get_baggage("ab_test_variant")
span.set_attribute("ab_test_variant", variant)
Mental model: Baggage is like sticky notes on the context backpack. Add it once, every service downstream can read it. Use for:
- Feature flags
- User segments
- Experiment variants
- Request priorities
⚠️ Warning: Baggage adds data to every request header. Keep it small (<1KB) and don't put sensitive data in it (it's visible in headers).
Common Mistakes ⚠️
Mistake 1: Treating Signals as Independent Systems
❌ Wrong: "We'll send traces to Jaeger, metrics to Prometheus, logs to Elasticsearch. They're separate."
✅ Right: "All signals share trace_id and resource attributes. We'll configure the collector to route each signal type, but ensure context propagation works across all."
Why it matters: Independent signals lose the correlation power. You can't jump from an alert (metric) to a trace to relevant logs if they don't share context.
Mistake 2: Over-Instrumenting with Spans
❌ Wrong: Creating spans for every function call.
@trace_span("calculate_total") # Too granular!
def calculate_total(items):
return sum(item.price for item in items)
✅ Right: Create spans for meaningful units of work: operations that cross boundaries or have meaningful duration.
@trace_span("process_order") # Meaningful operation
def process_order(order):
total = calculate_total(order.items) # No span
payment = charge_card(total) # Child span (external call)
return create_receipt(order, payment) # Child span (DB write)
Rule of thumb: If it takes <1ms and doesn't cross a boundary (network, process, thread), it probably doesn't need a span.
Mistake 3: Ignoring Semantic Conventions
❌ Wrong: Inventing your own attribute names.
span.set_attribute("request_method", "GET") # Non-standard
span.set_attribute("response_status", 200) # Non-standard
✅ Right: Use semantic conventions.
span.set_attribute("http.method", "GET")
span.set_attribute("http.status_code", 200)
Why it matters: Dashboards, queries, and tooling expect standard names. Custom names break interoperability.
Mistake 4: High-Cardinality Metric Labels
❌ Wrong:
requests_counter.add(1, {"user_id": user_id})  # Million+ time series!
✅ Right:
# Keep metric labels low-cardinality
requests_counter.add(1, {"endpoint": "/api/checkout", "status": "200"})

# Put high-cardinality data in span attributes
span.set_attribute("user.id", user_id)
Mistake 5: Not Using the Collector
❌ Wrong: Applications export directly to backends.
App → Jaeger
App → Prometheus
App → Elasticsearch
Problem: changing a backend means changing every app, and there is no central processing.
✅ Right: Route through the collector.
Apps → Collector → (Jaeger, Prometheus, Elasticsearch)
Benefit: apps are decoupled from backends; sampling, filtering, and enrichment happen in one place.
Mistake 6: Forgetting Resource Attributes
❌ Wrong: Spans have no resource context.
# Missing resource setup
tracer = trace.get_tracer(__name__)
✅ Right: Configure resources at SDK initialization.
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

resource = Resource.create({
    "service.name": "checkout-api",
    "service.version": "2.3.1",
    "deployment.environment": "production",
})
tracer_provider = TracerProvider(resource=resource)
Why it matters: Without resource attributes, you can't tell which service produced telemetry.
Key Takeaways
Unified Signals: Metrics, traces, and logs are complementary views of the same system, connected by shared context (trace_id, span_id, resource).
Context is King: Context propagation (via W3C Traceparent headers and baggage) is what makes distributed tracing work. It's the invisible thread connecting all telemetry.
Collector = Flexibility: The OpenTelemetry Collector decouples instrumentation from backends. Change vendors without changing code.
Semantic Conventions = Interoperability: Use standardized attribute names. Custom attributes break dashboards and queries.
Resources Provide Identity: Every signal needs resource attributes (service.name, deployment.environment) to be useful.
Cardinality Matters: High-cardinality data (user_id, trace_id) goes in span attributes, not metric labels.
Sampling is Strategic: Use tail sampling to keep all errors while sampling successes. Balance visibility with cost.
Spans Represent Work: Create spans for meaningful operations (network calls, DB queries, significant processing), not every function.
Baggage for Cross-Cutting Context: Use baggage to propagate cross-cutting context (feature flags, user segments) automatically.
Trace-First Debugging: When debugging, start with a trace to see the request flow, then use correlated metrics and logs to understand details.
OpenTelemetry Mental Model Quick Reference

| Concept | Summary |
|---|---|
| Three signals | Traces (request flow), metrics (aggregates), logs (events) |
| Context propagation | trace_id + span_id travel across services via HTTP headers |
| Collector stages | Receivers → Processors → Exporters |
| Semantic conventions | http.method, http.status_code, db.system (standard names) |
| Resource attributes | service.name, deployment.environment (who produced this?) |
| Span anatomy | trace_id, span_id, parent_span_id, timestamps, attributes, events |
| Cardinality rule | Low cardinality → metric labels; high cardinality → span attributes |
| Baggage use | Feature flags, experiment variants, user segments (propagates automatically) |
| Sampling strategy | Tail sampling: keep 100% of errors, sample successes at 10% |
| Debugging flow | Alert (metric) → trace (identify bottleneck) → logs (explain why) |
Further Study
OpenTelemetry Official Documentation: https://opentelemetry.io/docs/concepts/ - Comprehensive coverage of concepts, specifications, and semantic conventions.
W3C Trace Context Specification: https://www.w3.org/TR/trace-context/ - Deep dive into how context propagation works at the protocol level.
OpenTelemetry Semantic Conventions: https://github.com/open-telemetry/semantic-conventions - Complete reference for standardized attribute names across all signals.
Memory Device: Remember "CRISS" for the OpenTelemetry mental model:
- Context propagation (links everything)
- Resources (identity)
- Instrumentation (SDKs)
- Signals (traces, metrics, logs)
- Semantic conventions (shared language)
Master this mental model, and you'll understand not just how to use OpenTelemetry, but why it's designed the way it is, and how to architect observability that scales from prototype to production.