Logs vs Metrics vs Traces

Choosing the right observability tool under time pressure

Master the three pillars of observability with free flashcards and spaced repetition practice. This lesson covers structured logging practices, time-series metrics collection, and distributed tracing fundamentals: essential concepts for debugging production systems under pressure.

Welcome to Observability Fundamentals 👁️

When your production system crashes at 3 AM, you need data fast. But not just any data: you need the right kind of observability data that tells you exactly what went wrong, where, and why. The three pillars of observability (logs, metrics, and traces) each serve distinct purposes in your debugging arsenal.

Think of observability like investigating a crime scene: 🔍

  • Logs are the witness statements (detailed accounts of what happened)
  • Metrics are the security camera timestamps (numerical patterns over time)
  • Traces are the footprints connecting locations (requests flowing through services)

Understanding when to use each, and how they complement each other, separates engineers who panic under pressure from those who methodically isolate root causes.

Core Concepts: The Three Pillars Explained 🏛️

Logs: The Detective's Notebook 📝

Logs are immutable, timestamped records of discrete events that happened in your system. They're your most detailed observability signal: the raw material for understanding what happened.

Characteristics:

  • Discrete events: Each log represents a single occurrence
  • High cardinality: Can contain unlimited unique values (user IDs, error messages, stack traces)
  • Text-based: Usually structured (JSON) or semi-structured (key=value)
  • Storage-intensive: Growing linearly with system activity

Common use cases:

  • Debugging specific user issues ("Why did order #12345 fail?")
  • Audit trails and compliance
  • Error stack traces and exception details
  • Application state at specific moments

Example structured log:

{
  "timestamp": "2024-01-15T14:32:18.123Z",
  "level": "ERROR",
  "service": "payment-processor",
  "trace_id": "a1b2c3d4e5f6",
  "user_id": "usr_98765",
  "event": "payment_failed",
  "error": "CardDeclinedException",
  "message": "Payment declined: insufficient funds",
  "amount": 149.99,
  "currency": "USD",
  "card_last4": "4242"
}

💡 Pro tip: Always include correlation IDs (trace_id, request_id) in logs so you can connect related events across services.

⚠️ Common mistake: Logging too much sensitive data (PII, passwords, credit cards). Use log sanitization libraries and configure log levels appropriately.
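
To make the correlation-ID tip concrete, here is a minimal sketch using only the standard library: a logging.Filter that stamps every record with a correlation ID so log lines can be joined with traces. The trace ID value and JSON field names are illustrative; a real service would pull the ID from its tracing context.

# Minimal sketch (stdlib logging only): stamp every record with a correlation ID
# so log lines can be joined with traces. Values and field names are illustrative.
import logging

class CorrelationIdFilter(logging.Filter):
    def __init__(self, trace_id: str):
        super().__init__()
        self.trace_id = trace_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = self.trace_id  # exposed to formatters as %(trace_id)s
        return True  # never drop the record, just enrich it

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    '{"ts": "%(asctime)s", "level": "%(levelname)s", '
    '"trace_id": "%(trace_id)s", "message": "%(message)s"}'
))
logger = logging.getLogger("payment-processor")
logger.addHandler(handler)
logger.addFilter(CorrelationIdFilter(trace_id="a1b2c3d4e5f6"))
logger.setLevel(logging.INFO)

logger.info("Payment declined: insufficient funds")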

Log Levels Hierarchy

Level | Purpose                       | Production Volume  | Example
------|-------------------------------|--------------------|--------
TRACE | Extremely detailed debugging  | ❌ Never           | "Entering function validateInput()"
DEBUG | Diagnostic information        | ⚠️ Rarely          | "Cache hit for key: user:123"
INFO  | Normal operations             | ✅ Moderate        | "Payment processed successfully"
WARN  | Unusual but handled           | ✅ Low             | "API rate limit approaching"
ERROR | Failures requiring attention  | ✅ Very low        | "Database connection failed"
FATAL | System-crashing errors        | 🚨 Extremely rare  | "Out of memory, shutting down"
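
One practical way to apply this hierarchy is to derive the active level from the environment, so TRACE/DEBUG output never reaches production handlers. A small sketch; the LOG_LEVEL and APP_ENV variable names are assumptions, not a standard.

# Sketch: choose the log level from the environment so verbose levels stay out
# of production. LOG_LEVEL and APP_ENV are illustrative variable names.
import logging
import os

default = "INFO" if os.getenv("APP_ENV") == "production" else "DEBUG"
level_name = os.getenv("LOG_LEVEL", default)
logging.basicConfig(level=getattr(logging, level_name.upper(), logging.INFO))

logging.getLogger(__name__).debug("Cache hit for key: user:123")  # suppressed in production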

Metrics: The System's Vital Signs 📊

Metrics are numerical measurements aggregated over time intervals. They answer "how much" and "how often" questions with minimal storage overhead.

Characteristics:

  • Time-series data: Values measured at regular intervals
  • Low cardinality: Limited set of dimensions (tags/labels)
  • Aggregatable: Can be summed, averaged, and rolled up into percentiles
  • Storage-efficient: Pre-aggregated, constant size regardless of traffic

Metric types:

Type      | Description                             | Example                  | Use Case
----------|-----------------------------------------|--------------------------|---------
Counter   | Monotonically increasing value          | http_requests_total      | Request counts, errors
Gauge     | Value that goes up and down             | memory_usage_bytes       | CPU, memory, queue size
Histogram | Distribution of values in buckets       | response_time_ms         | Latency percentiles
Summary   | Similar to histogram, client-calculated | request_duration_summary | Pre-computed quantiles

Example Prometheus metrics:

# Counter - total HTTP requests
http_requests_total{service="api", endpoint="/users", status="200"} 15847

# Gauge - current active connections
active_connections{service="database", instance="db-1"} 42

# Histogram - request duration buckets
http_request_duration_seconds_bucket{le="0.1"} 9234
http_request_duration_seconds_bucket{le="0.5"} 11532
http_request_duration_seconds_bucket{le="1.0"} 11890
http_request_duration_seconds_sum 5847.32
http_request_duration_seconds_count 11932
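
For context, series in this exposition format are usually produced by an instrumentation library rather than written by hand. A sketch with the Python prometheus_client package; the label names and values mirror the example above and are illustrative.

# Sketch using prometheus_client; label names and values are illustrative.
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("http_requests", "Total HTTP requests",
                   ["service", "endpoint", "status"])  # exposed as http_requests_total
ACTIVE = Gauge("active_connections", "Current active connections",
               ["service", "instance"])
LATENCY = Histogram("http_request_duration_seconds", "HTTP request duration",
                    buckets=(0.1, 0.5, 1.0))

start_http_server(8000)  # serves /metrics for Prometheus to scrape

REQUESTS.labels(service="api", endpoint="/users", status="200").inc()
ACTIVE.labels(service="database", instance="db-1").set(42)
LATENCY.observe(0.237)  # lands in the le="0.5" bucket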

💡 Golden Signals (Google SRE methodology):

  1. Latency: How long requests take
  2. Traffic: How much demand on your system
  3. Errors: Rate of failed requests
  4. Saturation: How "full" your service is

When to use metrics:

  • Real-time dashboards and alerts
  • Capacity planning and trends
  • SLO/SLA monitoring (99.9% uptime; see the quick error-budget calculation after this list)
  • Resource utilization tracking
  • High-level system health
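
A toy error-budget calculation for the SLO bullet above; the request and failure counts are made-up illustrative numbers, not real data.

# Toy error-budget arithmetic for a 99.9% availability SLO; numbers are illustrative.
slo = 0.999
total_requests = 1_000_000
failed_requests = 650

allowed_failures = (1 - slo) * total_requests         # ~1,000 failures allowed
budget_consumed = failed_requests / allowed_failures  # 0.65 -> 65% of budget burned
print(f"Error budget consumed: {budget_consumed:.0%}")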

⚠️ Cardinality explosion danger: Adding high-cardinality dimensions (user_id, session_id) to metrics can create millions of time series, overwhelming your metrics backend.

# ❌ BAD - creates millions of unique metrics
request_counter.labels(user_id=user_id, session_id=session_id).inc()

# ✅ GOOD - bounded cardinality
request_counter.labels(endpoint=endpoint, status_code=status).inc()

Traces: Following the Request Journey 🗺️

Traces show the path of a single request as it flows through distributed systems. They answer "where did the time go?" by breaking down latency across service boundaries.

Characteristics:

  • Request-scoped: One trace per user request
  • Tree structure: Parent-child relationships between spans
  • Cross-service: Follows requests across network boundaries
  • Sampling required: Can't trace 100% of traffic at scale

Trace anatomy:

Trace: User checkout request (trace_id: abc123)
├─ Span 1: API Gateway [120ms]
│  ├─ Span 2: Auth Service [15ms]
│  ├─ Span 3: Inventory Service [45ms]
│  │  └─ Span 4: Database query [40ms]
│  └─ Span 5: Payment Service [55ms]
│     ├─ Span 6: Fraud check API [20ms]
│     └─ Span 7: Stripe API [30ms]
└─ Total duration: 120ms
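
A tree like this is produced by nesting spans. A minimal in-process sketch with the OpenTelemetry Python API; the span names mirror the diagram, and a real setup would also configure the SDK with an exporter.

# Minimal sketch: nested spans create the parent/child structure shown above.
# A real service would also configure the OpenTelemetry SDK and an exporter.
from opentelemetry import trace

tracer = trace.get_tracer("checkout")

with tracer.start_as_current_span("API Gateway"):              # Span 1
    with tracer.start_as_current_span("Auth Service"):          # Span 2
        pass
    with tracer.start_as_current_span("Inventory Service"):     # Span 3
        with tracer.start_as_current_span("Database query"):    # Span 4
            pass
    with tracer.start_as_current_span("Payment Service"):       # Span 5
        pass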

Span attributes:

{
  "trace_id": "abc123def456",
  "span_id": "span789",
  "parent_span_id": "span456",
  "name": "POST /checkout",
  "service": "payment-service",
  "start_time": "2024-01-15T14:32:18.100Z",
  "duration_ms": 55,
  "status": "OK",
  "attributes": {
    "http.method": "POST",
    "http.url": "/api/v1/checkout",
    "http.status_code": 200,
    "user.id": "usr_98765",
    "payment.amount": 149.99
  },
  "events": [
    {"timestamp": "...", "name": "fraud_check_passed"},
    {"timestamp": "...", "name": "payment_authorized"}
  ]
}

Distributed tracing context propagation:

# Service A (Python) generates the trace context
import requests

from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", order_id)
    
    # Inject trace context into HTTP headers
    headers = {}
    inject(headers)  # Adds traceparent header
    
    # Call Service B with context
    response = requests.post(
        "http://inventory-service/reserve",
        headers=headers,
        json={"items": items}
    )

// Service B (Go) extracts the trace context and continues the span
func reserveInventory(w http.ResponseWriter, r *http.Request) {
    // Extract parent trace context from headers
    ctx := otel.GetTextMapPropagator().Extract(r.Context(), 
        propagation.HeaderCarrier(r.Header))
    
    // Create child span
    ctx, span := tracer.Start(ctx, "reserve_inventory")
    defer span.End()
    
    // Your business logic here
    checkInventory(ctx, items)
}

💡 Sampling strategies (a configuration sketch follows this list):

  • Head-based sampling: Decide at trace start (e.g., sample 1% of requests)
  • Tail-based sampling: Decide after trace completes (keep all errors, slow requests)
  • Adaptive sampling: Adjust rate based on traffic volume
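
A head-based configuration sketch with the OpenTelemetry Python SDK; tail-based and adaptive strategies usually live in a collector rather than in-process.

# Sketch: head-based sampling with the OpenTelemetry Python SDK.
# Tail-based/adaptive sampling is typically configured in a collector instead.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep ~1% of new root traces; follow the parent's decision for propagated requests.
sampler = ParentBased(root=TraceIdRatioBased(0.01))
trace.set_tracer_provider(TracerProvider(sampler=sampler))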

When to use traces:

  • Debugging latency issues across services
  • Understanding service dependencies
  • Finding bottlenecks in request flows
  • Root cause analysis for specific user requests

Choosing the Right Tool for the Job 🔧

Decision Matrix

Question                              | Use Logs                       | Use Metrics                    | Use Traces
--------------------------------------|--------------------------------|--------------------------------|-----------
"Why did order #12345 fail?"          | ✅ Search logs for order_id    | ❌                             | ✅ View trace for request_id
"Is error rate increasing?"           | ⚠️ Expensive aggregation       | ✅ Alert on error_rate metric  | ❌
"Which service is slowest?"           | ❌                             | ⚠️ Shows symptoms only         | ✅ Trace breakdown shows bottleneck
"What was the exact error message?"   | ✅ Full stack trace in logs    | ❌                             | ⚠️ Limited span attributes
"Is CPU usage trending up?"           | ❌                             | ✅ CPU gauge metric over time  | ❌
"What's our p99 latency?"             | ⚠️ Can calculate but expensive | ✅ Histogram metrics           | ⚠️ Only sampled requests

The Complementary Nature of Observability 🔗

┌─────────────────────────────────────────────────────┐
│           OBSERVABILITY DEBUGGING FLOW              │
└─────────────────────────────────────────────────────┘

    📊 METRICS (Discovery)
    "Error rate spiking!"
           │
           ↓
    🗺️ TRACES (Isolation)
    "Payment service taking 5 seconds"
           │
           ↓
    📝 LOGS (Root Cause)
    "CardDeclinedException: Gateway timeout"

Real-world debugging scenario:

  1. Alert fires: Metrics show http_request_duration_seconds{p99} jumped from 200ms to 3000ms
  2. Narrow down: Traces reveal that /checkout endpoint's payment-service span is slow
  3. Find root cause: Logs from payment-service show "error": "upstream timeout connecting to payment gateway"
  4. Confirm: Check payment gateway's status page (external dependency issue)

# Unified instrumentation example (Python + OpenTelemetry)
import logging
import time

from opentelemetry import trace, metrics
from opentelemetry.instrumentation.logging import LoggingInstrumentor

# Auto-inject trace context into logs
LoggingInstrumentor().instrument()

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

# Define metrics
checkout_counter = meter.create_counter(
    "checkouts_total",
    description="Total checkout attempts"
)
checkout_duration = meter.create_histogram(
    "checkout_duration_ms",
    description="Checkout processing time"
)

def process_checkout(user_id, items):
    # Create trace span
    with tracer.start_as_current_span("process_checkout") as span:
        span.set_attribute("user.id", user_id)
        
        start_time = time.time()
        
        try:
            # Business logic
            result = charge_payment(user_id, items)
            
            # Success metrics
            checkout_counter.add(1, {"status": "success"})
            
            # Structured log with trace context (auto-injected)
            logging.info(
                "Checkout completed",
                extra={
                    "user_id": user_id,
                    "amount": result.amount,
                    "order_id": result.order_id
                }
            )
            
            return result
            
        except PaymentException as e:
            # Error metrics
            checkout_counter.add(1, {"status": "failed", "error": type(e).__name__})
            
            # Error log with stack trace
            logging.error(
                f"Checkout failed: {str(e)}",
                extra={"user_id": user_id},
                exc_info=True
            )
            
            # Mark span as error
            span.set_status(trace.Status(trace.StatusCode.ERROR))
            span.record_exception(e)
            
            raise
        
        finally:
            # Record duration metric
            duration_ms = (time.time() - start_time) * 1000
            checkout_duration.record(duration_ms)

Storage and Cost Considerations 💰

Relative Costs

Storage Cost Comparison (per GB ingested)

πŸ“ LOGS:      β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ $$$$ (Highest)
               - Full-text indexing
               - Retention: 7-30 days typical
               - Grows linearly with traffic

πŸ—ΊοΈ TRACES:    β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ $$$ (Medium-High)
               - Sampling reduces volume
               - Complex relationships to store
               - Retention: 7-14 days typical

📊 METRICS:   ████ $ (Lowest)
               - Pre-aggregated time series
               - Constant size per cardinality
               - Retention: 30-365+ days typical

Optimization strategies:

# Log optimization
log_levels:
  production: INFO  # Not DEBUG
  sampling: 0.1  # Sample 10% of INFO logs
  structured: true  # Enables efficient querying
  retention:
    error: 30 days
    info: 7 days
    debug: 1 day

# Metric optimization
metrics:
  cardinality_limit: 1000  # per metric name
  aggregation_interval: 15s
  retention:
    raw: 7 days
    5min_avg: 30 days
    1hour_avg: 365 days

# Trace optimization
tracing:
  sampling_strategy: tail_based
  sample_rate: 0.01  # 1% of successful requests
  always_sample:
    - errors: true
    - slow_requests: true  # p95+ latency
  retention: 7 days

Common Mistakes to Avoid ⚠️

1. Logging in Metrics Clothing

# ❌ WRONG - Treating logs like metrics
for user_id in user_ids:
    logger.info(f"User {user_id} logged in")  # Creates log spam

# ✅ RIGHT - Use metrics for aggregation, logs for details
login_counter.labels(status="success").inc()
logger.info("User logged in", extra={"user_id": user_id})  # Only when needed

2. High-Cardinality Metrics

# ❌ WRONG - Unbounded cardinality
http_requests{user_id="usr_12345", session_id="sess_67890"}  # Millions of series!

# ✅ RIGHT - Bounded dimensions
http_requests{endpoint="/api/users", status="200", method="GET"}

3. Missing Trace Context

// ❌ WRONG - Logs and traces disconnected
logger.error('Payment failed');

// ✅ RIGHT - Correlate with trace_id
const span = trace.getActiveSpan();
logger.error('Payment failed', {
  trace_id: span.spanContext().traceId,
  span_id: span.spanContext().spanId
});

4. Over-Sampling Traces

// ❌ WRONG - Trace 100% in production
sampler := trace.AlwaysSample()

// ✅ RIGHT - Use adaptive sampling
sampler := trace.ParentBased(
    trace.TraceIDRatioBased(0.01),  // 1% base rate
)

5. Log Level Confusion

# ❌ WRONG - Everything at ERROR level
logger.error("User logged in")  # This is INFO!
logger.error("Cache miss")       # This is DEBUG!

# ✅ RIGHT - Appropriate levels
logger.info("User logged in", extra={"user_id": user_id})
logger.debug("Cache miss", extra={"key": cache_key})
logger.error("Database connection failed", exc_info=True)

6. Synchronous Observability Calls

# ❌ WRONG - Blocking on telemetry
def process_request():
    log_to_remote_service("Request started")  # Network call!
    # ... business logic ...

# ✅ RIGHT - Async/buffered telemetry
def process_request():
    logger.info("Request started")  # Buffered by logging library
    # ... business logic ...

Example Scenarios 🎯

Scenario 1: Database Slowdown Investigation

Problem: Users reporting slow page loads

Step 1 - Check metrics dashboard:

# Query Prometheus for average request latency
rate(http_request_duration_seconds_sum[5m]) / 
  rate(http_request_duration_seconds_count[5m])
# Result: Average latency jumped from 150ms to 2500ms

Step 2 - Sample traces:

-- Query trace backend for slow requests
SELECT trace_id, duration_ms 
FROM traces 
WHERE duration_ms > 2000 
ORDER BY start_time DESC 
LIMIT 10

Trace visualization shows:

GET /products (2400ms)
├─ API Gateway (50ms)
├─ Auth middleware (30ms)
└─ Database query (2300ms)  ← BOTTLENECK!
   SELECT * FROM products WHERE category = ?

Step 3 - Check database logs:

{
  "timestamp": "2024-01-15T14:45:23Z",
  "level": "WARN",
  "service": "postgres",
  "message": "Slow query detected",
  "query": "SELECT * FROM products WHERE category = 'electronics'",
  "duration_ms": 2300,
  "rows_scanned": 1500000,
  "rows_returned": 150,
  "trace_id": "abc123"
}

Root cause: Missing index on category column, causing full table scan.

Scenario 2: Intermittent API Errors

Problem: Alert fires: error_rate > 5% for payment API

Metrics show:

sum(rate(http_requests_total{status=~"5.."}[5m])) by (endpoint)
# /api/payment/charge: 8% error rate (above threshold)

Trace filtering:

-- Find failed payment traces
SELECT trace_id, error_message 
FROM spans 
WHERE service = 'payment-service' 
  AND status = 'ERROR' 
  AND timestamp > NOW() - INTERVAL '1 hour'

Example trace:

POST /payment/charge (failed)
├─ Validate card details (5ms) ✅
├─ Check fraud rules (12ms) ✅
└─ Call Stripe API (timeout after 30s) ❌
   Error: ConnectionTimeout

Logs reveal pattern:

[
  {"timestamp": "14:32:15", "error": "stripe.timeout", "duration_ms": 30000},
  {"timestamp": "14:35:42", "error": "stripe.timeout", "duration_ms": 30000},
  {"timestamp": "14:38:09", "error": "stripe.timeout", "duration_ms": 30000}
]

Root cause: Stripe API experiencing intermittent timeouts (check their status page).
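
The upstream outage is outside your control, but the traces also show each call hanging for the full 30 s. A hedged mitigation sketch: put an explicit, short timeout on the outbound call so failures surface quickly. The URL and function name are hypothetical.

# Sketch: cap the outbound gateway call with an explicit timeout so a flaky
# upstream fails fast instead of hanging for 30 s. URL and names are hypothetical.
import requests

def charge_via_gateway(payload: dict) -> dict:
    try:
        resp = requests.post("https://payment-gateway.example.com/charge",
                             json=payload, timeout=5)
        resp.raise_for_status()
        return resp.json()
    except requests.Timeout:
        # Surface quickly; the caller can retry, queue, or fail the checkout cleanly.
        raise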

Scenario 3: Memory Leak Detection

Metrics alert: memory_usage_bytes steadily increasing

# Memory gauge showing growth
process_resident_memory_bytes{service="user-service"}
# 500MB → 800MB → 1.2GB → 1.8GB over 4 hours

Check heap profile traces:

// Enable runtime profiling
import _ "net/http/pprof"

// Capture heap snapshot
// curl http://localhost:6060/debug/pprof/heap > heap.prof
// go tool pprof heap.prof

Logs show repeated warnings:

{
  "level": "WARN",
  "message": "Large object allocation",
  "size_bytes": 10485760,
  "source": "cache.Store()",
  "trace_id": "xyz789"
}

Trace shows caching behavior:

GET /users/:id
├─ Check cache (cache miss)
├─ Query database (user data: 10MB)
└─ Store in cache (10MB added, never evicted)

Root cause: Cache has no eviction policy, accumulating data indefinitely.
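
The class of fix is to bound the cache so entries get evicted. The service's actual caching layer isn't shown here, so this sketch stands in with functools.lru_cache and a hypothetical lookup.

# Sketch of the fix class: give the cache a size bound so old entries are evicted.
# The real caching layer is unknown; functools.lru_cache stands in here.
from functools import lru_cache

def query_database(user_id: str) -> dict:
    # Placeholder for the real, expensive lookup
    return {"id": user_id, "profile": "..."}

@lru_cache(maxsize=1024)  # least-recently-used entries beyond 1024 are dropped
def get_user(user_id: str) -> dict:
    return query_database(user_id)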

Scenario 4: Cross-Service Debugging

Problem: Checkout flow failing sporadically

Distributed trace reveals dependency chain:

POST /checkout (trace_id: abc123) [FAILED]
├─ Frontend service (20ms) ✅
├─ API Gateway (15ms) ✅
├─ Order service (80ms) ✅
│  ├─ Inventory service (50ms) ✅
│  │  └─ Database read (45ms) ✅
│  └─ Pricing service (25ms) ✅
│     └─ Redis cache (5ms) ✅
└─ Payment service (ERROR) ❌
   ├─ Fraud detection (30ms) ✅
   └─ Payment gateway (timeout 10s) ❌
      Error: CircuitBreakerOpenException

Payment service logs:

{
  "trace_id": "abc123",
  "service": "payment-service",
  "error": "CircuitBreakerOpenException",
  "message": "Circuit breaker open after 10 consecutive failures",
  "upstream": "payment-gateway",
  "failure_count": 10,
  "last_success": "2024-01-15T14:20:00Z"
}

Root cause: Payment gateway is down, circuit breaker protecting system from cascading failures.
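
For context, the exception in these logs comes from the circuit-breaker pattern: after repeated upstream failures the breaker opens and fails fast instead of piling up 10-second timeouts. A toy sketch of the idea; thresholds and timing are illustrative, not the payment service's actual implementation.

# Toy sketch of the circuit-breaker pattern behind CircuitBreakerOpenException.
# Thresholds and timing are illustrative, not the service's real configuration.
import time

class CircuitBreakerOpenException(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 10, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failure_count = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                raise CircuitBreakerOpenException("circuit open; failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open the circuit
            raise
        self.failure_count = 0  # success closes the circuit again
        return result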

Key Takeaways 🎓

📋 Quick Reference Card

Pillar      | Best For                                | Cardinality    | Cost  | Retention
------------|-----------------------------------------|----------------|-------|----------
📝 Logs     | Debugging specific events, audit trails | Unlimited      | $$$$  | 7-30 days
📊 Metrics  | Alerts, trends, dashboards              | Low (bounded)  | $     | 30-365 days
🗺️ Traces   | Latency analysis, dependencies          | Request-scoped | $$$   | 7-14 days

🧠 Memory Device: "LMT" = "Let Me Trace"

  • Logs = Lookup specific events
  • Metrics = Monitor trends & alert
  • Traces = Track requests across services

🔑 Golden Rules

  1. Correlate everything: Always include trace_id in logs so every signal can be joined back to a trace (see the sketch after this list)
  2. Sample intelligently: 100% tracing is wasteful; sample errors and slow requests
  3. Mind cardinality: Never use user_id or session_id as metric labels
  4. Structure logs: Use JSON with consistent field names
  5. Use the right tool: Metrics for "what", traces for "where", logs for "why"
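
Golden rule 1 in Python form: a sketch that reads the active span's IDs with the OpenTelemetry API so they can be attached to log records. The hex formatting follows the usual W3C convention; the log message is illustrative.

# Sketch: read the current trace/span IDs so they can be attached to log records.
import logging
from opentelemetry import trace

ctx = trace.get_current_span().get_span_context()
logging.getLogger(__name__).error(
    "Payment failed",
    extra={
        "trace_id": format(ctx.trace_id, "032x"),  # 128-bit trace id as hex
        "span_id": format(ctx.span_id, "016x"),    # 64-bit span id as hex
    },
)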

🔧 Try This: Instrument a Simple Service

Challenge: Add observability to this basic HTTP handler

# Before: No observability
def handle_user_request(user_id):
    user = database.get_user(user_id)
    if not user:
        return {"error": "User not found"}, 404
    return {"user": user.to_dict()}, 200

Your task: add the following:

  1. Structured logging with trace context
  2. Metrics for request count and duration
  3. Distributed tracing span
  4. Error handling with proper observability

💡 Solution:
import logging
import time
from opentelemetry import trace, metrics

logger = logging.getLogger(__name__)
tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

request_counter = meter.create_counter("user_requests_total")
request_duration = meter.create_histogram("user_request_duration_ms")

def handle_user_request(user_id):
    with tracer.start_as_current_span("handle_user_request") as span:
        span.set_attribute("user.id", user_id)
        start_time = time.time()
        
        try:
            logger.info(
                "Fetching user",
                extra={"user_id": user_id}
            )
            
            user = database.get_user(user_id)
            
            if not user:
                request_counter.add(1, {"status": "not_found"})
                logger.warning(
                    "User not found",
                    extra={"user_id": user_id}
                )
                return {"error": "User not found"}, 404
            
            request_counter.add(1, {"status": "success"})
            logger.info(
                "User fetched successfully",
                extra={"user_id": user_id}
            )
            return {"user": user.to_dict()}, 200
            
        except DatabaseException as e:
            request_counter.add(1, {"status": "error"})
            logger.error(
                f"Database error: {str(e)}",
                extra={"user_id": user_id},
                exc_info=True
            )
            span.set_status(trace.Status(trace.StatusCode.ERROR))
            span.record_exception(e)
            return {"error": "Internal server error"}, 500
            
        finally:
            duration_ms = (time.time() - start_time) * 1000
            request_duration.record(duration_ms)

📚 Further Study

Essential resources for deeper learning:

  1. OpenTelemetry Documentation - Industry standard for observability instrumentation
    https://opentelemetry.io/docs/

  2. Google SRE Book - Monitoring Distributed Systems - Foundational monitoring principles
    https://sre.google/sre-book/monitoring-distributed-systems/

  3. Charity Majors - Observability Engineering - Modern observability practices from Honeycomb CTO
    https://www.honeycomb.io/blog/observability-engineering-101


Ready to test your knowledge? The practice questions below will challenge you with real-world debugging scenarios using logs, metrics, and traces. Remember: under pressure, the right observability signal makes all the difference between panic and precision. 🎯