Observability Foundations
Master the conceptual shift from monitoring to true observability and understand why traditional approaches fail during critical incidents
Master production observability with free flashcards and spaced repetition practice. This lesson covers core observability pillars, signal types, telemetry collection, and the shift from traditional monitoring to modern observability: essential concepts for DevOps engineers, SREs, and platform teams building reliable distributed systems.
Welcome to Observability
In today's world of microservices, containers, and cloud-native architectures, understanding what's happening inside your production systems is more critical, and more challenging, than ever. Observability is the practice of understanding the internal state of a system by examining its external outputs. Unlike traditional monitoring, which asks "is this specific thing broken?", observability asks "what is broken, and why?"
Think of observability as the difference between checking whether your car's engine light is on (monitoring) versus having a diagnostic tool that shows you exactly which cylinder is misfiring, why, and what cascading effects it's causing (observability).
What You'll Learn
- The Three Pillars: Metrics, logs, and traces (and why you need all three)
- Signal Types: Understanding telemetry data and its characteristics
- Cardinality: The hidden challenge that breaks observability systems
- Correlation: Connecting signals across systems to find root causes
- Instrumentation: How to capture meaningful data from your applications
Core Concepts
The Three Pillars of Observability
Modern observability rests on three foundational signal types, often called the "three pillars":
| Pillar | What It Captures | Best For | Example |
|---|---|---|---|
| Metrics | Numerical measurements over time | Trends, aggregations, alerts | CPU usage, request rate, error count |
| Logs | Discrete event records | Debugging specific events | "User 123 failed login at 10:34" |
| Traces | Request paths through distributed systems | Understanding latency and dependencies | API call → database → cache → response |
Why you need all three: Each pillar answers different questions. Metrics tell you that something is wrong, logs tell you what happened, and traces tell you where in your system the problem occurred.
Tip: Think of them as complementary tools in a diagnostic toolkit. You wouldn't try to fix a car with only a wrench; you need multiple tools for different jobs.
Understanding Metrics
Metrics are aggregated numerical data points collected at regular intervals. They're efficient to store and query, making them perfect for:
- Dashboards showing system health at a glance
- Alerts triggering when thresholds are crossed
- Trend analysis over days, weeks, or months
Metric types:
- Counter: only increases (resets to 0 on restart). Example: total_requests = 1,245,892
- Gauge: can go up or down; a snapshot of current state. Example: active_connections = 42
- Histogram: distribution of values across buckets. Example: request_duration_ms with buckets <100ms, <200ms, <500ms, 1s+
- Summary: like a histogram plus precomputed percentiles. Example: p50=120ms, p95=450ms, p99=2s
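To make the four metric types concrete, here is a minimal sketch using the Python prometheus_client library; the metric names and recorded values are illustrative, not part of any particular system:

```python
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

# Counter: only increases (resets to 0 on process restart)
requests_total = Counter('app_requests_total', 'Total requests handled')

# Gauge: can go up or down (current state snapshot)
active_connections = Gauge('app_active_connections', 'Currently open connections')

# Histogram: distribution of observations across configurable buckets
request_duration = Histogram(
    'app_request_duration_seconds', 'Request duration in seconds',
    buckets=(0.1, 0.2, 0.5, 1.0),
)

# Summary: tracks the count and sum of observations
payload_size = Summary('app_payload_size_bytes', 'Size of request payloads')

start_http_server(8000)         # expose /metrics for scraping
requests_total.inc()            # count one request
active_connections.set(42)      # record the current value
request_duration.observe(0.12)  # a 120 ms request
payload_size.observe(2048)
```

Note that the Python client's Summary records only count and sum; percentiles such as p95 are usually computed at query time from histogram buckets.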
Metric dimensions (also called labels or tags) add context:
http_requests_total{method="GET", endpoint="/api/users", status="200"} = 15420
http_requests_total{method="POST", endpoint="/api/users", status="201"} = 892
http_requests_total{method="GET", endpoint="/api/users", status="404"} = 23
Cardinality warning: Each unique combination of label values creates a new time series. With 10 endpoints, 5 methods, and 10 status codes, you have 500 time series. Add a user_id label with 100,000 users, and suddenly you have 50,000,000 time series, enough to overwhelm most metric systems!
Understanding Logs
Logs are timestamped text records of discrete events. They're the oldest form of telemetry and remain invaluable for:
- Detailed debugging of specific failures
- Audit trails and compliance
- Understanding exact sequences of events
| Log Level | Purpose | Example |
|---|---|---|
| TRACE | Very detailed, for development | "Entering method processPayment()" |
| DEBUG | Detailed information for debugging | "SQL query took 45ms" |
| INFO | Normal operational events | "User logged in successfully" |
| WARN | Potentially problematic situations | "Retry attempt 3 of 5" |
| ERROR | Error events that allow continuation | "Failed to send email notification" |
| FATAL | Severe errors causing shutdown | "Cannot connect to database, exiting" |
Structured logging transforms logs from raw text into queryable data:
Unstructured (harder to query):
User john@example.com failed login from 192.168.1.100 at 2026-01-15 10:34:22
Structured (easily queryable):
{
"timestamp": "2026-01-15T10:34:22Z",
"level": "WARN",
"event": "login_failed",
"user_email": "john@example.com",
"source_ip": "192.168.1.100",
"reason": "invalid_password",
"attempt_count": 3
}
Best practice: Always include correlation IDs (request IDs, trace IDs) in your logs so you can connect related events across multiple services.
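Here is a minimal sketch of emitting structured, correlation-ID-aware logs with Python's standard library; the field names mirror the JSON example above, and a ready-made JSON formatter package would work just as well:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Merge any structured fields passed via extra={"fields": {...}}
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("auth")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.warning(
    "login_failed",
    extra={"fields": {
        "user_email": "john@example.com",
        "source_ip": "192.168.1.100",
        "trace_id": "xyz789",   # correlation ID ties this log line to a trace
        "attempt_count": 3,
    }},
)
```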
Understanding Traces
Distributed tracing tracks requests as they flow through multiple services in a system. Each request becomes a trace, and each operation within that trace is a span.
Distributed trace anatomy for Trace ID abc123xyz (representing the entire request journey):
- Span: API Gateway (125ms total)
  - Span: User Service (45ms)
    - Span: Database Query (15ms)
  - Span: Order Service (80ms)
    - Span: Cache (5ms)
    - Span: Payment API (60ms)

Each span contains an operation name, start time and duration, a parent span ID (which forms the tree), tags/attributes (metadata), and logs/events recorded within the span.
Why traces matter: Without tracing, debugging a slow request in a 20-service architecture means checking logs in 20 different places. With tracing, you immediately see that 80% of the time was spent in the Payment API's database call.
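A minimal sketch of creating nested spans with the OpenTelemetry Python SDK, printing finished spans to stdout; the service, span, and attribute names here are illustrative:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Minimal setup: send finished spans to the console
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

def handle_checkout(order_id: str) -> None:
    # Parent span for the whole operation
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("order.id", order_id)
        # Child span: the nesting forms the parent/child tree shown above
        with tracer.start_as_current_span("db.query") as db_span:
            db_span.set_attribute("db.statement", "SELECT ...")

handle_checkout("ORD-78945")
```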
Sampling: Recording every single trace creates massive data volumes, so most systems use one of two sampling strategies (a configuration sketch follows this list):
- Head-based sampling: Decide to record immediately (e.g., 1% of all requests)
- Tail-based sampling: Decide after seeing the complete trace (e.g., keep all errors and slow requests)
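For head-based sampling, the OpenTelemetry Python SDK ships ratio-based samplers; a minimal configuration sketch using the 1% example rate from above:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly 1% of traces; child spans follow their parent's decision
sampler = ParentBased(root=TraceIdRatioBased(0.01))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```

Tail-based sampling, by contrast, is typically implemented in a collector tier rather than in the application SDK, because the keep/drop decision needs the completed trace.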
Cardinality: The Hidden Challenge
Cardinality refers to the number of unique values in a dataset. In observability, high cardinality destroys query performance and drives up costs.
| Dimension Type | Cardinality | Impact |
|---|---|---|
| Low cardinality (e.g., environment, region, service) | 3-100 values | Efficient, inexpensive |
| Medium cardinality (e.g., endpoint, host, container) | 100-1,000 values | Manageable with care |
| High cardinality (e.g., user_id, trace_id, session_id) | Thousands to millions | Expensive, slow queries |
The cardinality explosion:
Label combination explosion:
- 3 services × 20 endpoints × 8 status codes = 480 time series (manageable)
- Add user_id with 50,000 users: 480 × 50,000 = 24,000,000 time series (disaster!)
- Storage and query costs increase linearly with the number of unique time series
Solution: Use high-cardinality dimensions in traces and logs (which are designed for them), not in metrics. If you need to filter metrics by user, aggregate first or use exemplars (sample traces linked from metrics).
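If your stack supports exemplars (for example, Prometheus scraping the OpenMetrics exposition format), linking a metric observation to a sample trace can look roughly like the sketch below; treat the exact keyword argument as an assumption to verify against your client library version:

```python
from prometheus_client import Histogram

request_duration = Histogram(
    'http_request_duration_seconds', 'Request duration', ['endpoint'])

def record_request(endpoint: str, seconds: float, trace_id: str) -> None:
    # Attach the trace ID as an exemplar rather than a label, so metric
    # cardinality stays low but a sample trace is still one click away.
    request_duration.labels(endpoint=endpoint).observe(
        seconds, exemplar={'trace_id': trace_id})
```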
Signal Correlation
Correlation is connecting different signals to understand the complete picture. This is where observability becomes more than the sum of its parts.
Correlation flow: debugging a slow API
1. A metric shows a spike: "p95 latency jumped from 200ms to 5s" on http_request_duration{endpoint="/checkout"}.
2. Use that time range to query traces: find slow traces in the window with trace_duration > 4s AND endpoint="/checkout".
3. Examine the trace spans: the Payment Service → Database span took 4.8s, which identifies the slow component.
4. Take that span's trace_id and search the logs: trace_id="xyz789" shows "ERROR: Connection pool exhausted, waited 4.7s".
5. Root cause found: the database connection pool is too small.
Correlation IDs make this possible:
{
"trace_id": "xyz789",
"span_id": "abc123",
"request_id": "req-456",
"user_id": "u-789",
"session_id": "sess-012"
}
Include these in all three pillars:
- Metrics: Use exemplars to link to sample traces
- Logs: Always log trace_id and span_id (see the sketch after this list)
- Traces: Include request_id and session context
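A small sketch of pulling the active trace and span IDs from OpenTelemetry so they can be attached to every log line; the helper name is our own:

```python
from opentelemetry import trace

def current_trace_fields() -> dict:
    """Return trace/span IDs for the active span, formatted for logs."""
    ctx = trace.get_current_span().get_span_context()
    if not ctx.is_valid:
        return {}
    return {
        "trace_id": format(ctx.trace_id, "032x"),  # 128-bit ID as hex
        "span_id": format(ctx.span_id, "016x"),    # 64-bit ID as hex
    }

# e.g. logger.info("payment_processed", extra={"fields": current_trace_fields()})
```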
Instrumentation Approaches
Instrumentation is the process of adding observability code to your applications. There are three main approaches:
| Approach | How It Works | Pros | Cons |
|---|---|---|---|
| Manual | Write telemetry code yourself | Full control, custom business metrics | Time-consuming, inconsistent |
| Auto-instrumentation | Agent/library instruments automatically | Zero code changes, quick setup | Generic, limited customization |
| Hybrid | Auto + manual for critical paths | Balance of ease and customization | Requires both skill sets |
Instrumentation levels:
- Business metrics (manual instrumentation): custom, domain-specific events such as "checkout_completed" and "cart_abandoned"
- Application metrics (framework instrumentation): HTTP requests, DB queries, and other language/framework-level signals
- System metrics (agent instrumentation): CPU, memory, disk, and network at the infrastructure layer
Best practice: Start with auto-instrumentation for the foundation, then add manual instrumentation for business-critical flows and domain-specific metrics.
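As a sketch of that hybrid approach, assuming a Flask app and the OpenTelemetry Flask instrumentation package; the business metric name and route are illustrative:

```python
from flask import Flask
from opentelemetry import metrics, trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)

# Auto-instrumentation: HTTP server spans with no code changes
FlaskInstrumentor().instrument_app(app)

# Manual instrumentation: a domain-specific business metric
meter = metrics.get_meter("shop")
checkouts_completed = meter.create_counter(
    "checkout_completed", description="Successfully completed checkouts")

tracer = trace.get_tracer("shop")

@app.route("/checkout", methods=["POST"])
def checkout():
    with tracer.start_as_current_span("process_checkout"):
        # ... business logic ...
        checkouts_completed.add(1, {"plan": "premium"})  # low-cardinality attribute
    return "ok", 200
```

This assumes a tracer and meter provider are configured elsewhere (see the pipeline sketch later in this lesson); without one, the calls are harmless no-ops.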
From Monitoring to Observability
The shift from monitoring to observability reflects the evolution from simple to complex systems:
| Aspect | Traditional Monitoring | Modern Observability |
|---|---|---|
| Question | "Is X broken?" | "What is broken and why?" |
| Approach | Predefined dashboards & alerts | Exploratory investigation |
| Known vs Unknown | Known problems only | Unknown problems discoverable |
| Data | Metrics and some logs | Metrics, logs, traces correlated |
| Query pattern | "Show me dashboard #5" | "Show me all requests where X..." |
| System complexity | Monoliths, simple architectures | Microservices, distributed systems |
Did you know? The term "observability" comes from control theory, where a system is observable if you can determine its internal state by examining its outputs. The concept was borrowed by software engineering around 2017-2018, as microservices made traditional monitoring insufficient.
Telemetry Pipeline
Understanding how signals flow from your application to your observability platform:
1. Application: an instrumentation library (e.g., the OpenTelemetry SDK) generates spans, metrics, and logs.
2. Local agent/collector: buffers, batches, and samples the data, and adds metadata (host, cluster, etc.).
3. Telemetry backend: metrics go to a time-series database (e.g., Prometheus), logs to a log aggregator (e.g., Elasticsearch), and traces to a trace store (e.g., Jaeger, Tempo).
4. Analysis & visualization: dashboards, alerts, and queries (e.g., Grafana).
Key pipeline concepts:
- Buffering: Temporarily store data to handle bursts
- Batching: Send multiple signals together for efficiency
- Sampling: Reduce volume while maintaining statistical validity
- Enrichment: Add context (environment, version, region)
- Routing: Send different signals to different backends
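To ground the batching, buffering, and enrichment steps above, here is a minimal sketch of configuring the OpenTelemetry Python SDK to batch spans and ship them to a local collector over OTLP; the endpoint and service name are assumptions for illustration:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Enrichment: attach service-level metadata to every span
resource = Resource.create({"service.name": "checkout-service",
                            "deployment.environment": "production"})

provider = TracerProvider(resource=resource)

# Batching + buffering: spans are queued in memory and exported in batches
exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
provider.add_span_processor(BatchSpanProcessor(exporter))

trace.set_tracer_provider(provider)
```

Metrics and logs follow the same pattern with their own exporters and processors.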
Real-World Examples
Example 1: Debugging a Latency Spike
Scenario: Your e-commerce site's checkout endpoint suddenly becomes slow. Users are complaining. How do observability pillars work together?
Step 1 - Metrics detect the problem:
Metric: http_request_duration_p95{endpoint="/checkout"}
Baseline: 250ms
Current: 4.2s (a 16x increase!)
Alert triggered at 10:34 AM
Step 2 - Traces identify the component:
Query for slow traces: trace_duration > 3s AND endpoint="/checkout"
Trace ID: abc-123-xyz (total: 4.1s)
- API Gateway: 50ms (1%)
- Checkout Service: 100ms (2%)
  - Validate Cart: 30ms
  - Calculate Tax: 70ms
- Payment Service: 3,900ms (95%)
  - Create Transaction: 100ms
  - Database Query: 3,800ms
- Notification Service: 50ms (1%)

The Payment Service database query is the bottleneck!
Step 3 - Logs reveal the root cause:
Query logs with trace_id="abc-123-xyz" and service="payment":
{
"timestamp": "2026-01-15T10:34:18Z",
"level": "ERROR",
"trace_id": "abc-123-xyz",
"span_id": "span-payment-db",
"message": "Database connection pool exhausted",
"details": {
"pool_size": 10,
"active_connections": 10,
"wait_time_ms": 3780,
"query": "SELECT * FROM transactions WHERE..."
}
}
Root cause: Connection pool is too small for current load. Fix: Increase pool size from 10 to 50 connections.
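A sketch of what that fix might look like, assuming the service uses SQLAlchemy for its database connections; the connection URL is a placeholder and the pool numbers come from this example:

```python
from sqlalchemy import create_engine

# Before: a pool_size of 10 was exhausted under load
engine = create_engine(
    "postgresql://payments:***@db:5432/payments",
    pool_size=50,      # raised from 10 to match current concurrency
    max_overflow=10,   # temporary extra connections during bursts
    pool_timeout=5,    # fail fast instead of queueing for seconds
)
```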
Key insight: Each pillar played a role. Metrics detected that something was wrong, traces identified where, and logs explained why.
Example 2: Cardinality Disaster
Scenario: A well-meaning engineer adds user_id as a metric label to track per-user request rates.
The code:
from flask import Flask, request
from flask_login import current_user  # assumed source of current_user
from prometheus_client import Counter

app = Flask(__name__)

request_counter = Counter(
    'api_requests_total',
    'Total API requests',
    ['endpoint', 'method', 'status', 'user_id']  # DANGER: high-cardinality label!
)

@app.route('/api/<endpoint>')
def handle_request(endpoint):
    request_counter.labels(
        endpoint=endpoint,
        method=request.method,
        status=200,
        user_id=current_user.id  # creates a unique time series per user!
    ).inc()
    return 'ok'
The math:
- 50 endpoints × 5 methods × 10 status codes = 2,500 base time series (fine)
- Add 100,000 active users: 2,500 × 100,000 = 250,000,000 time series (disaster!)
The consequences:
- Prometheus server runs out of memory
- Query times jump from 100ms to 30+ seconds
- Monthly observability bill increases by 40x
- System becomes unusable
The fix:
# Option 1: Remove the high-cardinality label
request_counter = Counter(
    'api_requests_total',
    'Total API requests',
    ['endpoint', 'method', 'status']  # low cardinality only
)

# Option 2: Use traces/logs for high-cardinality data
trace.get_current_span().set_attribute('user_id', current_user.id)  # traces handle it
logger.info('Request processed', extra={'user_id': current_user.id})

# Option 3: Aggregate before recording
if current_user.plan == 'premium':
    premium_counter.inc()  # track by plan type, not individual user
Memory device: METRIC = MATH, TRACE = TRACKING. If you need to do calculations (aggregations, averages), use metrics with low cardinality. If you need to track individual items (users, sessions), use traces.
Example 3: Trace Sampling Strategy
Scenario: Your system processes 10,000 requests/second. Recording all traces generates 500 GB/day of data, costing $15,000/month. You need a smarter sampling strategy.
Head-based sampling (decide immediately):
# Sample 1% of all traffic
# (illustrative decision logic; real SDKs expose this as a configurable sampler)
import random

if random.random() < 0.01:  # 1% sampling rate
    tracer.start_trace()
else:
    tracer.skip_trace()
- Pros: simple, predictable volume reduction (99% less data)
- Cons: might miss the exact trace that had the error
Tail-based sampling (decide after completion):
# Smart sampling: keep interesting traces
import random

def should_keep_trace(trace):
    # Always keep errors
    if trace.has_error():
        return True
    # Always keep slow requests
    if trace.duration_ms > 1000:
        return True
    # Keep 1% of normal traffic for a baseline
    if random.random() < 0.01:
        return True
    return False
Result comparison:
| Metric | No Sampling | Head (1%) | Tail (smart) |
|---|---|---|---|
| Traces/day | 864M | 8.64M | 12M |
| Data size/day | 500 GB | 5 GB | 7 GB |
| Cost/month | $15,000 | $150 | $210 |
| Error coverage | 100% | ~1% | 100% |
| Slow request coverage | 100% | ~1% | 100% |
Best practice: Use tail-based sampling in production. The slightly higher cost (7 GB vs 5 GB per day) is worth having 100% of errors and performance issues.
Example 4: Effective Correlation
Scenario: A user reports "my payment failed" but you have millions of requests to search through. Correlation IDs save the day.
User provides: "I tried to pay around 2:30 PM, order #ORD-78945"
Investigation flow:
1. Start with structured logs:
Query: order_id="ORD-78945" AND timestamp>="2026-01-15T14:25:00Z"
Result:
{
"timestamp": "2026-01-15T14:32:18Z",
"order_id": "ORD-78945",
"trace_id": "xyz-789-abc", β Got it!
"user_id": "U-12345",
"event": "payment_failed",
"reason": "gateway_timeout"
}
2. Use trace_id to examine the full request:
Query traces: trace_id="xyz-789-abc"
Trace spans show:
- Checkout Service: 150ms (fine)
- Payment Gateway API: 30,000ms (hit the 30-second timeout!)
- Response: 504 Gateway Timeout
3. Check metrics for pattern:
Query: payment_gateway_duration{time="14:30-14:35"}
Shows: Spike in latency across ALL payment requests
Indicates: External payment provider issue, not our code
4. Correlate with external status: Check the payment provider's status page → "Degraded performance 14:28-14:41 UTC"
Complete picture:
- The user's payment failed due to an external provider outage
- Our system correctly timed out and didn't charge the user
- We need a clearer error message telling the user to retry
Try this: In your next project, generate a UUID at the start of each request and include it in every log, metric exemplar, and trace. You'll be amazed how much easier debugging becomes.
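A minimal sketch of that idea for a Flask service; the header and field names are our own choices:

```python
import uuid
from flask import Flask, g, request

app = Flask(__name__)

@app.before_request
def assign_request_id():
    # Reuse an inbound ID if an upstream service already set one
    g.request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))

@app.after_request
def propagate_request_id(response):
    # Echo the ID back so clients and downstream logs can correlate
    response.headers["X-Request-ID"] = g.request_id
    return response

# In every log line: logger.info("...", extra={"fields": {"request_id": g.request_id}})
```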
Common Mistakes
Mistake #1: Using metrics for high-cardinality data
- Wrong: Adding user_id, session_id, or transaction_id as metric labels
- Why it's bad: Creates millions of time series and crashes metric systems
- Fix: Use these IDs in traces and logs, use aggregated categories in metrics
Mistake #2: Logging everything at DEBUG level in production
- Wrong: Leaving verbose DEBUG logs enabled in production
- Why it's bad: Generates terabytes of mostly useless data, increases costs 10-100x
- Fix: Use INFO for production, enable DEBUG temporarily when investigating issues
Mistake #3: No correlation between signals
- Wrong: Metrics, logs, and traces exist in isolation with no shared IDs
- Why it's bad: Impossible to connect "this metric spike" to "these specific traces"
- Fix: Always include trace_id in logs, use exemplars in metrics to link to sample traces
Mistake #4: Ignoring sampling
- Wrong: Recording 100% of traces "to be safe"
- Why it's bad: Overwhelming data volume, unsustainable costs
- Fix: Implement tail-based sampling to keep all interesting traces (errors, slow requests) while reducing normal traffic
Mistake #5: Monitoring symptoms instead of causes
- Wrong: Alert fires: "Disk usage is 95%" but you don't know which service is filling it
- Why it's bad: Alerts without context require manual investigation every time
- Fix: Add rich context to metrics (service, pod, process) and correlate with logs showing what's writing to disk
Mistake #6: Treating observability as an afterthought
- Wrong: Building the entire system, then trying to "add observability" before launch
- Why it's bad: Retrofitting instrumentation is 10x harder than building it in from the start
- Fix: Make observability a requirement in every story, instrument as you build
Mistake #7: Not defining SLIs/SLOs
- Wrong: Collecting metrics without clear definitions of "good" performance
- Why it's bad: Can't tell if your system is meeting user expectations
- Fix: Define Service Level Indicators (e.g., "99% of requests < 500ms") and track them
Mistake #8: Alert fatigue
- Wrong: Creating alerts for every possible metric anomaly
- Why it's bad: Engineers ignore alerts when 95% are false positives
- Fix: Alert only on user-impacting issues, use warnings/dashboards for everything else
Key Takeaways
Quick Reference Card: Observability Foundations
| Concept | Key Points |
|---|---|
| Three Pillars | Metrics (what's wrong) + Logs (what happened) + Traces (where it happened) |
| Cardinality | Keep metrics LOW cardinality (<1000 series). Use traces/logs for high cardinality data. |
| Correlation | Always include trace_id in logs. Use correlation IDs to connect signals across pillars. |
| Sampling | Tail-based sampling: Keep 100% of errors/slow requests, 1% of normal traffic. |
| Instrumentation | Start with auto-instrumentation, add manual for business metrics. |
| Metric Types | Counter (always up), Gauge (up/down), Histogram (distribution), Summary (percentiles) |
| Log Levels | TRACE < DEBUG < INFO < WARN < ERROR < FATAL. Production = INFO level. |
| Spans | Individual operations in a trace. Form parent-child tree showing request flow. |
Memory Device - The Three Questions:
- Metrics: "Is something wrong?" (trends, aggregates)
- Traces: "Where is it wrong?" (distributed system flow)
- Logs: "Why is it wrong?" (detailed event context)
Quick Decision Tree:
Need to store data about...
- Numbers that change over time? → Metrics
- Individual event details? → Logs
- Request flow across services? → Traces
- User/session-specific data? → Traces + Logs (never metrics!)
Summary
Observability transforms how we understand and debug production systems. Unlike traditional monitoring that asks "is service X down?", observability lets you ask arbitrary questions: "show me all requests from mobile users in Europe that took longer than 2 seconds and touched the payment service." This exploratory capability is essential for modern distributed systems where failures are complex and unpredictable.
The three pillarsβmetrics, logs, and tracesβwork together synergistically. Metrics provide efficient aggregation and alerting. Logs offer detailed event context. Traces map request flows through distributed systems. Used together with proper correlation IDs, they enable you to move from "alert fired" to "root cause identified" in minutes instead of hours.
The hidden challenges of observability are cardinality (which destroys metric systems) and data volume (which destroys budgets). Smart sampling strategies, careful metric label selection, and using the right signal type for the right data are essential for sustainable observability at scale.
As you build your observability practice, remember: instrumentation is not a one-time task but an ongoing part of software development. Every new feature should include observability from day one. The time invested upfront pays dividends when production issues ariseβand they always do.
Further Study
- OpenTelemetry Documentation: https://opentelemetry.io/docs/ - The vendor-neutral standard for instrumentation, metrics, logs, and traces
- Google SRE Book - Monitoring Distributed Systems: https://sre.google/sre-book/monitoring-distributed-systems/ - Chapter covering observability best practices from Google's SRE team
- Charity Majors - Observability Engineering: https://www.honeycomb.io/blog - Co-author of "Observability Engineering" book, frequent writer on modern observability practices