
Observability Foundations

Master the conceptual shift from monitoring to true observability and understand why traditional approaches fail during critical incidents

Master production observability with free flashcards and spaced repetition practice. This lesson covers core observability pillars, signal types, telemetry collection, and the shift from traditional monitoring to modern observability: essential concepts for DevOps engineers, SREs, and platform teams building reliable distributed systems.

Welcome to Observability

💻 In today's world of microservices, containers, and cloud-native architectures, understanding what's happening inside your production systems is more critical, and more challenging, than ever. Observability is the practice of understanding the internal state of a system by examining its external outputs. Unlike traditional monitoring that asks "is this specific thing broken?", observability asks "what is broken and why?"

Think of observability as the difference between checking if your car's engine light is on (monitoring) versus having a diagnostic tool that shows you exactly which cylinder is misfiring, why, and what cascading effects it's causing (observability). 🚗

🎯 What You'll Learn

  • The Three Pillars: Metrics, logs, and traces, and why we need all three
  • Signal Types: Understanding telemetry data and its characteristics
  • Cardinality: The hidden challenge that breaks observability systems
  • Correlation: Connecting signals across systems to find root causes
  • Instrumentation: How to capture meaningful data from your applications

Core Concepts

The Three Pillars of Observability

Modern observability rests on three foundational signal types, often called the "three pillars":

| Pillar | What It Captures | Best For | Example |
|---|---|---|---|
| 📊 Metrics | Numerical measurements over time | Trends, aggregations, alerts | CPU usage, request rate, error count |
| 📝 Logs | Discrete event records | Debugging specific events | "User 123 failed login at 10:34" |
| 🔗 Traces | Request paths through distributed systems | Understanding latency and dependencies | API call → database → cache → response |

Why you need all three: Each pillar answers different questions. Metrics tell you that something is wrong, logs tell you what happened, and traces tell you where in your system the problem occurred.

💡 Tip: Think of them as complementary tools in a diagnostic toolkit. You wouldn't try to fix a car with only a wrench; you need multiple tools for different jobs.

Understanding Metrics

Metrics are aggregated numerical data points collected at regular intervals. They're efficient to store and query, making them perfect for:

  • Dashboards showing system health at a glance
  • Alerts triggering when thresholds are crossed
  • Trend analysis over days, weeks, or months

METRIC TYPES

  📈 Counter: Only increases
     Example: total_requests = 1,245,892
     (resets to 0 on restart)

  📊 Gauge: Can go up or down
     Example: active_connections = 42
     (current state snapshot)

  ⏱️ Histogram: Distribution of values
     Example: request_duration_ms
     Buckets: <100ms, <200ms, <500ms, 1s+

  📉 Summary: Like histogram + percentiles
     Example: p50=120ms, p95=450ms, p99=2s
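
In code, these types map onto concrete objects in whatever metrics SDK you use. Below is a minimal sketch assuming the Python prometheus_client library; the metric names are illustrative, and other SDKs expose equivalent primitives.

from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

REQUESTS = Counter('app_requests_total', 'Total requests handled')        # only goes up
ACTIVE = Gauge('app_active_connections', 'Connections currently open')    # up and down
LATENCY = Histogram('app_request_duration_seconds', 'Request latency',
                    buckets=(0.1, 0.2, 0.5, 1.0))                         # bucketed distribution
PAYLOAD = Summary('app_payload_bytes', 'Request payload size')            # running count + sum of observations

def handle_request(payload: bytes) -> None:
    REQUESTS.inc()
    ACTIVE.inc()
    with LATENCY.time():               # records the duration into the histogram buckets
        PAYLOAD.observe(len(payload))
    ACTIVE.dec()

if __name__ == '__main__':
    start_http_server(8000)            # exposes a /metrics endpoint for scraping
    handle_request(b'hello world')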

Metric dimensions (also called labels or tags) add context:

http_requests_total{method="GET", endpoint="/api/users", status="200"} = 15420
http_requests_total{method="POST", endpoint="/api/users", status="201"} = 892
http_requests_total{method="GET", endpoint="/api/users", status="404"} = 23

⚠️ Cardinality warning: Each unique combination of label values creates a new time series. With 10 endpoints, 5 methods, and 10 status codes, you have 500 time series. Add a user_id label with 100,000 users, and suddenly you have 50,000,000 time series, enough to overwhelm most metric systems!

Understanding Logs

Logs are timestamped text records of discrete events. They're the oldest form of telemetry and remain invaluable for:

  • Detailed debugging of specific failures
  • Audit trails and compliance
  • Understanding exact sequences of events

| Log Level | Purpose | Example |
|---|---|---|
| TRACE | Very detailed, for development | "Entering method processPayment()" |
| DEBUG | Detailed information for debugging | "SQL query took 45ms" |
| INFO | Normal operational events | "User logged in successfully" |
| WARN | Potentially problematic situations | "Retry attempt 3 of 5" |
| ERROR | Error events that allow continuation | "Failed to send email notification" |
| FATAL | Severe errors causing shutdown | "Cannot connect to database, exiting" |

Structured logging transforms logs from raw text into queryable data:

## Unstructured (harder to query)
User john@example.com failed login from 192.168.1.100 at 2026-01-15 10:34:22

## Structured (easily queryable)
{
  "timestamp": "2026-01-15T10:34:22Z",
  "level": "WARN",
  "event": "login_failed",
  "user_email": "john@example.com",
  "source_ip": "192.168.1.100",
  "reason": "invalid_password",
  "attempt_count": 3
}

💡 Best practice: Always include correlation IDs (request IDs, trace IDs) in your logs so you can connect related events across multiple services.
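
To emit structured logs like the JSON example above, one approach is to attach a JSON formatter to the standard library logger. This is a minimal sketch using only the Python standard library; the logger and field names are illustrative.

import json
import logging
import time

class JsonFormatter(logging.Formatter):
    converter = time.gmtime  # emit UTC timestamps so the trailing "Z" is accurate

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        payload.update(getattr(record, "fields", {}))   # structured fields passed via extra=
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("auth")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.warning("login_failed", extra={"fields": {
    "user_email": "john@example.com",
    "source_ip": "192.168.1.100",
    "reason": "invalid_password",
    "attempt_count": 3,
}})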

Understanding Traces

Distributed tracing tracks requests as they flow through multiple services in a system. Each request becomes a trace, and each operation within that trace is a span.

DISTRIBUTED TRACE ANATOMY

Trace ID: abc123xyz (represents entire request journey)

Span: API Gateway (125ms total)
├─ Span: User Service (45ms)
│  └─ Span: Database Query (15ms)
└─ Span: Order Service (80ms)
   ├─ Span: Cache (5ms)
   └─ Span: Payment API (60ms)

Each span contains:
- Operation name
- Start time & duration
- Parent span ID (forms the tree)
- Tags/attributes (metadata)
- Logs/events within the span

Why traces matter: Without tracing, debugging a slow request in a 20-service architecture means checking logs in 20 different places. With tracing, you immediately see that 80% of the time was spent in the Payment API's database call.
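
As a concrete sketch, here is how a parent span and its children can be created, assuming the OpenTelemetry Python SDK; the span and attribute names are made up for illustration.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))  # print spans locally
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("order-service")

def place_order(order_id: str) -> None:
    with tracer.start_as_current_span("place_order") as span:        # parent span
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("db.query") as child:      # child inherits the trace context
            child.set_attribute("db.statement", "SELECT ... FROM orders")

place_order("ORD-78945")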

πŸ” Sampling: Recording every single trace creates massive data volumes. Most systems use sampling strategies:

  • Head-based sampling: Decide to record immediately (e.g., 1% of all requests)
  • Tail-based sampling: Decide after seeing the complete trace (e.g., keep all errors and slow requests)

Cardinality: The Hidden Challenge

Cardinality refers to the number of unique values in a dataset. In observability, high cardinality degrades query performance and drives up storage costs.

| Dimension Type | Examples | Cardinality | Impact |
|---|---|---|---|
| Low cardinality | environment, region, service | 3-100 values | ✅ Efficient, inexpensive |
| Medium cardinality | endpoint, host, container | 100-1,000 values | ⚠️ Manageable with care |
| High cardinality | user_id, trace_id, session_id | 1,000s-millions | ❌ Expensive, slow queries |

The cardinality explosion:

LABEL COMBINATION EXPLOSION

  3 services × 20 endpoints × 8 status codes
  = 480 time series ✅ (manageable)

  Add user_id (50,000 users):
  = 24,000,000 time series ❌ (disaster!)

  Storage & query costs increase linearly with the number of unique time series.

💡 Solution: Use high-cardinality dimensions in traces and logs (designed for it), not in metrics. If you need to filter metrics by user, aggregate first or use exemplars (sample traces linked from metrics).

Signal Correlation

Correlation is connecting different signals to understand the complete picture. This is where observability becomes more than the sum of its parts.

CORRELATION FLOW: DEBUG SLOW API

1️⃣ Metric shows spike
    ↓
   📊 "p95 latency jumped from 200ms to 5s"
   http_request_duration{endpoint="/checkout"}
    ↓
    
2️⃣ Use time range to query traces
    ↓
   🔗 Find slow traces in that time window
   trace_duration > 4s AND endpoint="/checkout"
    ↓
    
3️⃣ Examine trace spans
    ↓
   "Payment Service β†’ Database span = 4.8s"
   (Identified the slow component)
    ↓
    
4️⃣ Get span's trace_id, search logs
    ↓
   πŸ“ trace_id="xyz789" shows:
   "ERROR: Connection pool exhausted, waited 4.7s"
    ↓
    
5️⃣ Root cause found!
   Database connection pool too small

Correlation IDs make this possible:

{
  "trace_id": "xyz789",
  "span_id": "abc123",
  "request_id": "req-456",
  "user_id": "u-789",
  "session_id": "sess-012"
}

Include these in all three pillars:

  • Metrics: Use exemplars to link to sample traces
  • Logs: Always log trace_id and span_id (one way to do this is sketched below)
  • Traces: Include request_id and session context
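
One way to satisfy the "always log trace_id and span_id" item above is a logging filter that copies the active span context onto every record. This sketch assumes the OpenTelemetry Python API; the format string and handler wiring are illustrative.

import logging
from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x")   # all zeros when no span is active
        record.span_id = format(ctx.span_id, "016x")
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"))
handler.addFilter(TraceContextFilter())
logging.getLogger().addHandler(handler)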

Instrumentation Approaches

Instrumentation is the process of adding observability code to your applications. There are three main approaches:

| Approach | How It Works | Pros | Cons |
|---|---|---|---|
| Manual | Write telemetry code yourself | Full control, custom business metrics | Time-consuming, inconsistent |
| Auto-instrumentation | Agent/library instruments automatically | Zero code changes, quick setup | Generic, limited customization |
| Hybrid | Auto + manual for critical paths | Balance of ease and customization | Requires both skill sets |

Instrumentation levels:

  🎯 Business Metrics (manual instrumentation)
     "checkout_completed", "cart_abandoned"
     ↑ Custom, domain-specific

  📦 Application Metrics (framework instrumentation)
     HTTP requests, DB queries
     ↑ Language/framework

  ⚙️ System Metrics (agent instrumentation)
     CPU, memory, disk, network
     ↑ Infrastructure layer

💡 Best practice: Start with auto-instrumentation for the foundation, then add manual instrumentation for business-critical flows and domain-specific metrics.
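
A hybrid setup might look like the sketch below: auto-instrument the web framework, then layer one manual business metric on top. It assumes Flask, the opentelemetry-instrumentation-flask package, and prometheus_client; the route and metric names are illustrative.

from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from prometheus_client import Counter

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)      # auto: an HTTP server span for every request

CHECKOUTS = Counter('checkout_completed_total', 'Completed checkouts')   # manual: business metric

@app.route('/checkout', methods=['POST'])
def checkout():
    # ...business logic...
    CHECKOUTS.inc()
    return {'status': 'ok'}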

From Monitoring to Observability

The shift from monitoring to observability reflects the evolution from simple to complex systems:

| Aspect | Traditional Monitoring | Modern Observability |
|---|---|---|
| Question | "Is X broken?" | "What is broken and why?" |
| Approach | Predefined dashboards & alerts | Exploratory investigation |
| Known vs unknown | Known problems only | Unknown problems discoverable |
| Data | Metrics and some logs | Metrics, logs, traces correlated |
| Query pattern | "Show me dashboard #5" | "Show me all requests where X..." |
| System complexity | Monoliths, simple architectures | Microservices, distributed systems |

🤔 Did you know? The term "observability" comes from control theory, where a system is observable if you can determine its internal state by examining its outputs. The concept was borrowed by software engineering around 2017-2018 as microservices made traditional monitoring insufficient.

Telemetry Pipeline

Understanding how signals flow from your application to your observability platform:

TELEMETRY PIPELINE

📱 APPLICATION
  └─→ Instrumentation library (OpenTelemetry SDK)
       Generates: spans, metrics, logs
  ↓
🔄 LOCAL AGENT/COLLECTOR
  ├─→ Buffers, batches, samples
  └─→ Adds metadata (host, cluster, etc.)
  ↓
🌐 TELEMETRY BACKEND
  ├─→ Metrics → time-series database (Prometheus)
  ├─→ Logs → log aggregator (Elasticsearch)
  └─→ Traces → trace store (Jaeger, Tempo)
  ↓
📊 ANALYSIS & VISUALIZATION
  └─→ Dashboards, alerts, queries (Grafana)

Key pipeline concepts:

  • Buffering: Temporarily store data to handle bursts
  • Batching: Send multiple signals together for efficiency (see the SDK sketch after this list)
  • Sampling: Reduce volume while maintaining statistical validity
  • Enrichment: Add context (environment, version, region)
  • Routing: Send different signals to different backends
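
On the application side, several of these concepts show up directly as SDK configuration. The sketch below assumes the opentelemetry-sdk and opentelemetry-exporter-otlp packages; the collector endpoint and resource attributes are illustrative.

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

resource = Resource.create({                      # enrichment: attached to every exported span
    "service.name": "checkout",
    "deployment.environment": "prod",
})
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(   # buffering + batching before export
    OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True),
    max_queue_size=2048,                          # buffer size for traffic bursts
    max_export_batch_size=512,                    # spans sent per batch
))
trace.set_tracer_provider(provider)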

Real-World Examples

Example 1: Debugging a Latency Spike

Scenario: Your e-commerce site's checkout endpoint suddenly becomes slow. Users are complaining. How do observability pillars work together?

Step 1 - Metrics detect the problem:

Metric: http_request_duration_p95{endpoint="/checkout"}
Baseline: 250ms
Current:  4.2s (16x increase!) 🔴
Alert triggered at 10:34 AM

Step 2 - Traces identify the component:

Query for slow traces: trace_duration > 3s AND endpoint="/checkout"

Trace ID: abc-123-xyz
Total: 4.1s

├─ API Gateway              50ms   ░░ (1%)
├─ Checkout Service        100ms   ░░░ (2%)
│  ├─ Validate Cart         30ms
│  └─ Calculate Tax         70ms
├─ Payment Service        3,900ms  ████████████████ (95%)
│  ├─ Create Transaction    100ms
│  └─ Database Query      3,800ms  ███████████████
└─ Notification Service     50ms   ░░ (1%)

🔍 Payment Service DB query is the bottleneck!

Step 3 - Logs reveal the root cause:

Query logs with trace_id="abc-123-xyz" and service="payment":

{
  "timestamp": "2026-01-15T10:34:18Z",
  "level": "ERROR",
  "trace_id": "abc-123-xyz",
  "span_id": "span-payment-db",
  "message": "Database connection pool exhausted",
  "details": {
    "pool_size": 10,
    "active_connections": 10,
    "wait_time_ms": 3780,
    "query": "SELECT * FROM transactions WHERE..."
  }
}

Root cause: Connection pool is too small for current load. Fix: Increase pool size from 10 to 50 connections.

💡 Key insight: Each pillar played a role. Metrics detected that something was wrong, traces identified where, and logs explained why.

Example 2: Cardinality Disaster

Scenario: A well-meaning engineer adds user_id as a metric label to track per-user request rates.

The code:

from flask import Flask, request
from flask_login import current_user          # assumes the app tracks the logged-in user
from prometheus_client import Counter

app = Flask(__name__)

request_counter = Counter(
    'api_requests_total',
    'Total API requests',
    ['endpoint', 'method', 'status', 'user_id']  # ❌ DANGER: user_id is unbounded!
)

@app.route('/api/<endpoint>')
def handle_request(endpoint):
    request_counter.labels(
        endpoint=endpoint,
        method=request.method,
        status='200',
        user_id=current_user.id  # Creates a unique time series per user!
    ).inc()
    return 'ok'

The math:

  • 50 endpoints × 5 methods × 10 status codes = 2,500 base time series ✅
  • Add 100,000 active users = 250,000,000 time series ❌

The consequences:

  • Prometheus server runs out of memory
  • Query times jump from 100ms to 30+ seconds
  • Monthly observability bill increases by 40x
  • System becomes unusable

The fix:

## Option 1: Remove the high-cardinality label
request_counter = Counter(
    'api_requests_total',
    'Total API requests',
    ['endpoint', 'method', 'status']  # ✅ Low cardinality only
)

## Option 2: Use traces/logs for high-cardinality data
span = trace.get_current_span()                     # OpenTelemetry: attach to the current span
span.set_attribute('user_id', current_user.id)      # ✅ Traces handle high cardinality
logger.info('Request processed', extra={'user_id': current_user.id})

## Option 3: Aggregate before recording
if current_user.plan == 'premium':
    premium_counter.inc()  # Track by plan type, not individual user

🧠 Memory device: METRIC = MATH, TRACE = TRACKING. If you need to do calculations (aggregations, averages), use metrics with low cardinality. If you need to track individual items (users, sessions), use traces.

Example 3: Trace Sampling Strategy

Scenario: Your system processes 10,000 requests/second. Recording all traces generates 500 GB/day of data, costing $15,000/month. You need a smarter sampling strategy.

Head-based sampling (decide immediately):

## Sample 1% of all traffic
import random

def should_sample_request() -> bool:
    return random.random() < 0.01  # 1% sampling rate

if should_sample_request():
    ...  # start and record a trace for this request
else:
    ...  # skip tracing entirely

Pros: Simple, predictable volume reduction (99% less data).
Cons: Might miss the exact trace that had the error.

Tail-based sampling (decide after completion):

## Smart sampling: keep interesting traces
import random

SLOW_THRESHOLD_MS = 1000

def should_keep_trace(trace) -> bool:
    # `trace` is whatever completed-trace object your pipeline hands you
    # Always keep errors
    if trace.has_error():
        return True

    # Always keep slow requests
    if trace.duration_ms > SLOW_THRESHOLD_MS:
        return True

    # Keep 1% of normal traffic for baseline
    if random.random() < 0.01:
        return True

    return False

Result comparison:

| Metric | No Sampling | Head (1%) | Tail (smart) |
|---|---|---|---|
| Traces/day | 864M | 8.64M | 12M |
| Data size/day | 500 GB | 5 GB | 7 GB |
| Cost/month | $15,000 | $150 | $210 |
| Error coverage | 100% | ~1% | 100% |
| Slow request coverage | 100% | ~1% | 100% |

💡 Best practice: Use tail-based sampling in production. The slightly higher cost (7 GB vs 5 GB) is worth having 100% of errors and performance issues.

Example 4: Effective Correlation

Scenario: A user reports "my payment failed" but you have millions of requests to search through. Correlation IDs save the day.

User provides: "I tried to pay around 2:30 PM, order #ORD-78945"

Investigation flow:

1. Start with structured logs:

Query: order_id="ORD-78945" AND timestamp>="2026-01-15T14:25:00Z"

Result:
{
  "timestamp": "2026-01-15T14:32:18Z",
  "order_id": "ORD-78945",
  "trace_id": "xyz-789-abc",  ← Got it!
  "user_id": "U-12345",
  "event": "payment_failed",
  "reason": "gateway_timeout"
}

2. Use trace_id to examine the full request:

Query traces: trace_id="xyz-789-abc"

Trace spans show:
- Checkout Service: 150ms ✅
- Payment Gateway API: 30,000ms ❌ (30 second timeout!)
- Response: 504 Gateway Timeout

3. Check metrics for pattern:

Query: payment_gateway_duration{time="14:30-14:35"}

Shows: Spike in latency across ALL payment requests
Indicates: External payment provider issue, not our code

4. Correlate with external status: Check payment provider's status page → "Degraded performance 14:28-14:41 UTC"

Complete picture:

  • User's payment failed due to an external provider outage ✅
  • Our system correctly timed out and didn't charge the user ✅
  • We need a better error message telling the user to retry ✅

🔧 Try this: In your next project, generate a UUID at the start of each request and include it in every log, metric exemplar, and trace. You'll be amazed how much easier debugging becomes.
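
Here is one way that could look in a Flask service; this is an illustrative sketch, and the header name, logger, and route are assumptions.

import logging
import uuid

from flask import Flask, g, request

app = Flask(__name__)
logger = logging.getLogger("app")

@app.before_request
def assign_request_id():
    # Reuse an inbound ID if an upstream proxy already set one, otherwise mint a new UUID
    g.request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))

@app.after_request
def echo_request_id(response):
    response.headers["X-Request-ID"] = g.request_id   # return it so users can report it
    return response

@app.route("/api/orders")
def list_orders():
    logger.info("listing orders", extra={"request_id": g.request_id})
    return {"orders": []}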

Common Mistakes

⚠️ Mistake #1: Using metrics for high-cardinality data

  • Wrong: Adding user_id, session_id, or transaction_id as metric labels
  • Why it's bad: Creates millions of time series, crashes metric systems
  • Fix: Use these IDs in traces and logs, use aggregated categories in metrics

⚠️ Mistake #2: Logging everything at DEBUG level in production

  • Wrong: Leaving verbose DEBUG logs enabled in production
  • Why it's bad: Generates terabytes of mostly useless data, increases costs 10-100x
  • Fix: Use INFO for production, enable DEBUG temporarily when investigating issues

⚠️ Mistake #3: No correlation between signals

  • Wrong: Metrics, logs, and traces exist in isolation with no shared IDs
  • Why it's bad: Impossible to connect "this metric spike" to "these specific traces"
  • Fix: Always include trace_id in logs, use exemplars in metrics to link to sample traces

⚠️ Mistake #4: Ignoring sampling

  • Wrong: Recording 100% of traces "to be safe"
  • Why it's bad: Overwhelming data volume, unsustainable costs
  • Fix: Implement tail-based sampling to keep all interesting traces (errors, slow requests) while reducing normal traffic

⚠️ Mistake #5: Monitoring symptoms instead of causes

  • Wrong: Alert fires: "Disk usage is 95%" but you don't know which service is filling it
  • Why it's bad: Alerts without context require manual investigation every time
  • Fix: Add rich context to metrics (service, pod, process) and correlate with logs showing what's writing to disk

⚠️ Mistake #6: Treating observability as an afterthought

  • Wrong: Building the entire system, then trying to "add observability" before launch
  • Why it's bad: Retrofitting instrumentation is 10x harder than building it in from the start
  • Fix: Make observability a requirement in every story, instrument as you build

⚠️ Mistake #7: Not defining SLIs/SLOs

  • Wrong: Collecting metrics without clear definitions of "good" performance
  • Why it's bad: Can't tell if your system is meeting user expectations
  • Fix: Define Service Level Indicators (e.g., "99% of requests < 500ms") and track them

⚠️ Mistake #8: Alert fatigue

  • Wrong: Creating alerts for every possible metric anomaly
  • Why it's bad: Engineers ignore alerts when 95% are false positives
  • Fix: Alert only on user-impacting issues, use warnings/dashboards for everything else

Key Takeaways

📋 Quick Reference Card: Observability Foundations

| Concept | Key Points |
|---|---|
| Three Pillars | 📊 Metrics (what's wrong) + 📝 Logs (what happened) + 🔗 Traces (where it happened) |
| Cardinality | Keep metrics LOW cardinality (<1,000 series). Use traces/logs for high-cardinality data. |
| Correlation | Always include trace_id in logs. Use correlation IDs to connect signals across pillars. |
| Sampling | Tail-based sampling: keep 100% of errors/slow requests, 1% of normal traffic. |
| Instrumentation | Start with auto-instrumentation, add manual for business metrics. |
| Metric Types | Counter (always up), Gauge (up/down), Histogram (distribution), Summary (percentiles) |
| Log Levels | TRACE < DEBUG < INFO < WARN < ERROR < FATAL. Production = INFO level. |
| Spans | Individual operations in a trace. Form a parent-child tree showing request flow. |

🧠 Memory Device - The Three Questions:

  • Metrics: "Is something wrong?" (trends, aggregates)
  • Traces: "Where is it wrong?" (distributed system flow)
  • Logs: "Why is it wrong?" (detailed event context)

⚡ Quick Decision Tree:

Need to store data about...
  ├─ Numbers that change over time? → Metrics
  ├─ Individual event details? → Logs
  ├─ Request flow across services? → Traces
  └─ User/session-specific data? → Traces + Logs (never Metrics!)

Summary

Observability transforms how we understand and debug production systems. Unlike traditional monitoring that asks "is service X down?", observability lets you ask arbitrary questions: "show me all requests from mobile users in Europe that took longer than 2 seconds and touched the payment service." This exploratory capability is essential for modern distributed systems where failures are complex and unpredictable.

The three pillars (metrics, logs, and traces) work together synergistically. Metrics provide efficient aggregation and alerting. Logs offer detailed event context. Traces map request flows through distributed systems. Used together with proper correlation IDs, they enable you to move from "alert fired" to "root cause identified" in minutes instead of hours.

The hidden challenges of observability are cardinality (which destroys metric systems) and data volume (which destroys budgets). Smart sampling strategies, careful metric label selection, and using the right signal type for the right data are essential for sustainable observability at scale.

As you build your observability practice, remember: instrumentation is not a one-time task but an ongoing part of software development. Every new feature should include observability from day one. The time invested upfront pays dividends when production issues arise, and they always do.

📚 Further Study