Logs vs Metrics vs Traces

Choosing the right observability tool under time pressure

Master the three pillars of observability with free flashcards and spaced repetition practice. This lesson covers structured logging practices, time-series metrics collection, and distributed tracing fundamentals: essential concepts for debugging production systems under pressure.

Welcome to Observability Fundamentals 👁️

When your production system crashes at 3 AM, you need data fast. But not just any data: you need the right kind of observability data that tells you exactly what went wrong, where, and why. The three pillars of observability (logs, metrics, and traces) each serve distinct purposes in your debugging arsenal.

Think of observability like investigating a crime scene: 🔍

  • Logs are the witness statements (detailed accounts of what happened)
  • Metrics are the security camera timestamps (numerical patterns over time)
  • Traces are the footprints connecting locations (requests flowing through services)

Understanding when to use each, and how they complement each other, separates engineers who panic under pressure from those who methodically isolate root causes.

Core Concepts: The Three Pillars Explained 🏛️

Logs: The Detective's Notebook 📝

Logs are immutable, timestamped records of discrete events that happened in your system. They're your most detailed observability signal: the raw material for understanding what happened.

Characteristics:

  • Discrete events: Each log represents a single occurrence
  • High cardinality: Can contain unlimited unique values (user IDs, error messages, stack traces)
  • Text-based: Usually structured (JSON) or semi-structured (key=value)
  • Storage-intensive: Growing linearly with system activity

Common use cases:

  • Debugging specific user issues ("Why did order #12345 fail?")
  • Audit trails and compliance
  • Error stack traces and exception details
  • Application state at specific moments

Example structured log:

{
  "timestamp": "2024-01-15T14:32:18.123Z",
  "level": "ERROR",
  "service": "payment-processor",
  "trace_id": "a1b2c3d4e5f6",
  "user_id": "usr_98765",
  "event": "payment_failed",
  "error": "CardDeclinedException",
  "message": "Payment declined: insufficient funds",
  "amount": 149.99,
  "currency": "USD",
  "card_last4": "4242"
}

💡 Pro tip: Always include correlation IDs (trace_id, request_id) in logs so you can connect related events across services.

⚠️ Common mistake: Logging too much sensitive data (PII, passwords, credit cards). Use log sanitization libraries and configure log levels appropriately.
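
To make the correlation-ID tip concrete, here is a minimal sketch using only the standard library: a logging.Filter that stamps every record with a correlation ID so log lines can be joined with traces. The trace ID value and JSON field names are illustrative; a real service would pull the ID from its tracing context.

# Minimal sketch (stdlib logging only): stamp every record with a correlation ID
# so log lines can be joined with traces. Values and field names are illustrative.
import logging

class CorrelationIdFilter(logging.Filter):
    def __init__(self, trace_id: str):
        super().__init__()
        self.trace_id = trace_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = self.trace_id  # exposed to formatters as %(trace_id)s
        return True  # never drop the record, just enrich it

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    '{"ts": "%(asctime)s", "level": "%(levelname)s", '
    '"trace_id": "%(trace_id)s", "message": "%(message)s"}'
))
logger = logging.getLogger("payment-processor")
logger.addHandler(handler)
logger.addFilter(CorrelationIdFilter(trace_id="a1b2c3d4e5f6"))
logger.setLevel(logging.INFO)

logger.info("Payment declined: insufficient funds")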

Log Levels Hierarchy

Level | Purpose                       | Production Volume  | Example
------|-------------------------------|--------------------|--------
TRACE | Extremely detailed debugging  | ❌ Never           | "Entering function validateInput()"
DEBUG | Diagnostic information        | ⚠️ Rarely          | "Cache hit for key: user:123"
INFO  | Normal operations             | ✅ Moderate        | "Payment processed successfully"
WARN  | Unusual but handled           | ✅ Low             | "API rate limit approaching"
ERROR | Failures requiring attention  | ✅ Very low        | "Database connection failed"
FATAL | System-crashing errors        | 🚨 Extremely rare  | "Out of memory, shutting down"
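
One practical way to apply this hierarchy is to derive the active level from the environment, so TRACE/DEBUG output never reaches production handlers. A small sketch; the LOG_LEVEL and APP_ENV variable names are assumptions, not a standard.

# Sketch: choose the log level from the environment so verbose levels stay out
# of production. LOG_LEVEL and APP_ENV are illustrative variable names.
import logging
import os

default = "INFO" if os.getenv("APP_ENV") == "production" else "DEBUG"
level_name = os.getenv("LOG_LEVEL", default)
logging.basicConfig(level=getattr(logging, level_name.upper(), logging.INFO))

logging.getLogger(__name__).debug("Cache hit for key: user:123")  # suppressed in production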

Metrics: The System's Vital Signs 📊

Metrics are numerical measurements aggregated over time intervals. They answer "how much" and "how often" questions with minimal storage overhead.

Characteristics:

  • Time-series data: Values measured at regular intervals
  • Low cardinality: Limited set of dimensions (tags/labels)
  • Aggregatable: Can be summed, averaged, and rolled up into percentiles
  • Storage-efficient: Pre-aggregated, constant size regardless of traffic

Metric types:

Type      | Description                             | Example                  | Use Case
----------|-----------------------------------------|--------------------------|---------
Counter   | Monotonically increasing value          | http_requests_total      | Request counts, errors
Gauge     | Value that goes up and down             | memory_usage_bytes       | CPU, memory, queue size
Histogram | Distribution of values in buckets       | response_time_ms         | Latency percentiles
Summary   | Similar to histogram, client-calculated | request_duration_summary | Pre-computed quantiles

Example Prometheus metrics:

# Counter - total HTTP requests
http_requests_total{service="api", endpoint="/users", status="200"} 15847

# Gauge - current active connections
active_connections{service="database", instance="db-1"} 42

# Histogram - request duration buckets
http_request_duration_seconds_bucket{le="0.1"} 9234
http_request_duration_seconds_bucket{le="0.5"} 11532
http_request_duration_seconds_bucket{le="1.0"} 11890
http_request_duration_seconds_sum 5847.32
http_request_duration_seconds_count 11932
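
For context, series in this exposition format are usually produced by an instrumentation library rather than written by hand. A sketch with the Python prometheus_client package; the label names and values mirror the example above and are illustrative.

# Sketch using prometheus_client; label names and values are illustrative.
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("http_requests", "Total HTTP requests",
                   ["service", "endpoint", "status"])  # exposed as http_requests_total
ACTIVE = Gauge("active_connections", "Current active connections",
               ["service", "instance"])
LATENCY = Histogram("http_request_duration_seconds", "HTTP request duration",
                    buckets=(0.1, 0.5, 1.0))

start_http_server(8000)  # serves /metrics for Prometheus to scrape

REQUESTS.labels(service="api", endpoint="/users", status="200").inc()
ACTIVE.labels(service="database", instance="db-1").set(42)
LATENCY.observe(0.237)  # lands in the le="0.5" bucket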

💡 Golden Signals (Google SRE methodology):

  1. Latency: How long requests take
  2. Traffic: How much demand on your system
  3. Errors: Rate of failed requests
  4. Saturation: How "full" your service is

When to use metrics:

  • Real-time dashboards and alerts
  • Capacity planning and trends
  • SLO/SLA monitoring (99.9% uptime; see the quick error-budget calculation after this list)
  • Resource utilization tracking
  • High-level system health
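
A toy error-budget calculation for the SLO bullet above; the request and failure counts are made-up illustrative numbers, not real data.

# Toy error-budget arithmetic for a 99.9% availability SLO; numbers are illustrative.
slo = 0.999
total_requests = 1_000_000
failed_requests = 650

allowed_failures = (1 - slo) * total_requests         # ~1,000 failures allowed
budget_consumed = failed_requests / allowed_failures  # 0.65 -> 65% of budget burned
print(f"Error budget consumed: {budget_consumed:.0%}")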

⚠️ Cardinality explosion danger: Adding high-cardinality dimensions (user_id, session_id) to metrics can create millions of time series, overwhelming your metrics backend.

# ❌ BAD - creates millions of unique metrics
request_counter.labels(user_id=user_id, session_id=session_id).inc()

# ✅ GOOD - bounded cardinality
request_counter.labels(endpoint=endpoint, status_code=status).inc()

Traces: Following the Request Journey 🗺️

Traces show the path of a single request as it flows through distributed systems. They answer "where did the time go?" by breaking down latency across service boundaries.

Characteristics:

  • Request-scoped: One trace per user request
  • Tree structure: Parent-child relationships between spans
  • Cross-service: Follows requests across network boundaries
  • Sampling required: Can't trace 100% of traffic at scale

Trace anatomy:

Trace: User checkout request (trace_id: abc123)
├─ Span 1: API Gateway [120ms]
│  ├─ Span 2: Auth Service [15ms]
│  ├─ Span 3: Inventory Service [45ms]
│  │  └─ Span 4: Database query [40ms]
│  └─ Span 5: Payment Service [55ms]
│     ├─ Span 6: Fraud check API [20ms]
│     └─ Span 7: Stripe API [30ms]
└─ Total duration: 120ms
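
A tree like this is produced by nesting spans. A minimal in-process sketch with the OpenTelemetry Python API; the span names mirror the diagram, and a real setup would also configure the SDK with an exporter.

# Minimal sketch: nested spans create the parent/child structure shown above.
# A real service would also configure the OpenTelemetry SDK and an exporter.
from opentelemetry import trace

tracer = trace.get_tracer("checkout")

with tracer.start_as_current_span("API Gateway"):              # Span 1
    with tracer.start_as_current_span("Auth Service"):          # Span 2
        pass
    with tracer.start_as_current_span("Inventory Service"):     # Span 3
        with tracer.start_as_current_span("Database query"):    # Span 4
            pass
    with tracer.start_as_current_span("Payment Service"):       # Span 5
        pass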

Span attributes:

{
  "trace_id": "abc123def456",
  "span_id": "span789",
  "parent_span_id": "span456",
  "name": "POST /checkout",
  "service": "payment-service",
  "start_time": "2024-01-15T14:32:18.100Z",
  "duration_ms": 55,
  "status": "OK",
  "attributes": {
    "http.method": "POST",
    "http.url": "/api/v1/checkout",
    "http.status_code": 200,
    "user.id": "usr_98765",
    "payment.amount": 149.99
  },
  "events": [
    {"timestamp": "...", "name": "fraud_check_passed"},
    {"timestamp": "...", "name": "payment_authorized"}
  ]
}

Distributed tracing context propagation:

# Service A (Python) generates the trace context
import requests

from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", order_id)
    
    # Inject trace context into HTTP headers
    headers = {}
    inject(headers)  # Adds traceparent header
    
    # Call Service B with context
    response = requests.post(
        "http://inventory-service/reserve",
        headers=headers,
        json={"items": items}
    )

// Service B (Go) extracts the trace context and continues the span
func reserveInventory(w http.ResponseWriter, r *http.Request) {
    // Extract parent trace context from headers
    ctx := otel.GetTextMapPropagator().Extract(r.Context(), 
        propagation.HeaderCarrier(r.Header))
    
    // Create child span
    ctx, span := tracer.Start(ctx, "reserve_inventory")
    defer span.End()
    
    // Your business logic here
    checkInventory(ctx, items)
}

💡 Sampling strategies (a configuration sketch follows this list):

  • Head-based sampling: Decide at trace start (e.g., sample 1% of requests)
  • Tail-based sampling: Decide after trace completes (keep all errors, slow requests)
  • Adaptive sampling: Adjust rate based on traffic volume
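
A head-based configuration sketch with the OpenTelemetry Python SDK; tail-based and adaptive strategies usually live in a collector rather than in-process.

# Sketch: head-based sampling with the OpenTelemetry Python SDK.
# Tail-based/adaptive sampling is typically configured in a collector instead.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep ~1% of new root traces; follow the parent's decision for propagated requests.
sampler = ParentBased(root=TraceIdRatioBased(0.01))
trace.set_tracer_provider(TracerProvider(sampler=sampler))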

When to use traces:

  • Debugging latency issues across services
  • Understanding service dependencies
  • Finding bottlenecks in request flows
  • Root cause analysis for specific user requests

Choosing the Right Tool for the Job 🔧

Decision Matrix

Question                              | Use Logs                       | Use Metrics                    | Use Traces
--------------------------------------|--------------------------------|--------------------------------|-----------
"Why did order #12345 fail?"          | ✅ Search logs for order_id    | ❌                             | ✅ View trace for request_id
"Is error rate increasing?"           | ⚠️ Expensive aggregation       | ✅ Alert on error_rate metric  | ❌
"Which service is slowest?"           | ❌                             | ⚠️ Shows symptoms only         | ✅ Trace breakdown shows bottleneck
"What was the exact error message?"   | ✅ Full stack trace in logs    | ❌                             | ⚠️ Limited span attributes
"Is CPU usage trending up?"           | ❌                             | ✅ CPU gauge metric over time  | ❌
"What's our p99 latency?"             | ⚠️ Can calculate but expensive | ✅ Histogram metrics           | ⚠️ Only sampled requests

The Complementary Nature of Observability 🔗

┌─────────────────────────────────────────────────────┐
│           OBSERVABILITY DEBUGGING FLOW              │
└─────────────────────────────────────────────────────┘

    📊 METRICS (Discovery)
    "Error rate spiking!"
           │
           ↓
    🗺️ TRACES (Isolation)
    "Payment service taking 5 seconds"
           │
           ↓
    📝 LOGS (Root Cause)
    "CardDeclinedException: Gateway timeout"

Real-world debugging scenario:

  1. Alert fires: Metrics show http_request_duration_seconds{p99} jumped from 200ms to 3000ms
  2. Narrow down: Traces reveal that /checkout endpoint's payment-service span is slow
  3. Find root cause: Logs from payment-service show "error": "upstream timeout connecting to payment gateway"
  4. Confirm: Check payment gateway's status page (external dependency issue)

# Unified instrumentation example (Python + OpenTelemetry)
import logging
import time

from opentelemetry import trace, metrics
from opentelemetry.instrumentation.logging import LoggingInstrumentor

# Auto-inject trace context into logs
LoggingInstrumentor().instrument()

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

# Define metrics
checkout_counter = meter.create_counter(
    "checkouts_total",
    description="Total checkout attempts"
)
checkout_duration = meter.create_histogram(
    "checkout_duration_ms",
    description="Checkout processing time"
)

def process_checkout(user_id, items):
    # Create trace span
    with tracer.start_as_current_span("process_checkout") as span:
        span.set_attribute("user.id", user_id)
        
        start_time = time.time()
        
        try:
            # Business logic
            result = charge_payment(user_id, items)
            
            # Success metrics
            checkout_counter.add(1, {"status": "success"})
            
            # Structured log with trace context (auto-injected)
            logging.info(
                "Checkout completed",
                extra={
                    "user_id": user_id,
                    "amount": result.amount,
                    "order_id": result.order_id
                }
            )
            
            return result
            
        except PaymentException as e:
            # Error metrics
            checkout_counter.add(1, {"status": "failed", "error": type(e).__name__})
            
            # Error log with stack trace
            logging.error(
                f"Checkout failed: {str(e)}",
                extra={"user_id": user_id},
                exc_info=True
            )
            
            # Mark span as error
            span.set_status(trace.Status(trace.StatusCode.ERROR))
            span.record_exception(e)
            
            raise
        
        finally:
            # Record duration metric
            duration_ms = (time.time() - start_time) * 1000
            checkout_duration.record(duration_ms)

Storage and Cost Considerations 💰

Relative Costs

Storage Cost Comparison (per GB ingested)

πŸ“ LOGS:      β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ $$$$ (Highest)
               - Full-text indexing
               - Retention: 7-30 days typical
               - Grows linearly with traffic

πŸ—ΊοΈ TRACES:    β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ $$$ (Medium-High)
               - Sampling reduces volume
               - Complex relationships to store
               - Retention: 7-14 days typical

📊 METRICS:   ████ $ (Lowest)
               - Pre-aggregated time series
               - Constant size per cardinality
               - Retention: 30-365+ days typical

Optimization strategies:

# Log optimization
log_levels:
  production: INFO  # Not DEBUG
  sampling: 0.1  # Sample 10% of INFO logs
  structured: true  # Enables efficient querying
  retention:
    error: 30 days
    info: 7 days
    debug: 1 day

# Metric optimization
metrics:
  cardinality_limit: 1000  # per metric name
  aggregation_interval: 15s
  retention:
    raw: 7 days
    5min_avg: 30 days
    1hour_avg: 365 days

# Trace optimization
tracing:
  sampling_strategy: tail_based
  sample_rate: 0.01  # 1% of successful requests
  always_sample:
    - errors: true
    - slow_requests: true  # p95+ latency
  retention: 7 days

Common Mistakes to Avoid ⚠️

1. Logging in Metrics Clothing

# ❌ WRONG - Treating logs like metrics
for user_id in user_ids:
    logger.info(f"User {user_id} logged in")  # Creates log spam

# ✅ RIGHT - Use metrics for aggregation, logs for details
login_counter.labels(status="success").inc()
logger.info("User logged in", extra={"user_id": user_id})  # Only when needed

2. High-Cardinality Metrics

# ❌ WRONG - Unbounded cardinality
http_requests{user_id="usr_12345", session_id="sess_67890"}  # Millions of series!

# ✅ RIGHT - Bounded dimensions
http_requests{endpoint="/api/users", status="200", method="GET"}

3. Missing Trace Context

// ❌ WRONG - Logs and traces disconnected
logger.error('Payment failed');

// ✅ RIGHT - Correlate with trace_id
const span = trace.getActiveSpan();
logger.error('Payment failed', {
  trace_id: span.spanContext().traceId,
  span_id: span.spanContext().spanId
});

4. Over-Sampling Traces

// ❌ WRONG - Trace 100% in production
sampler := trace.AlwaysSample()

// ✅ RIGHT - Use adaptive sampling
sampler := trace.ParentBased(
    trace.TraceIDRatioBased(0.01),  // 1% base rate
)

5. Log Level Confusion

# ❌ WRONG - Everything at ERROR level
logger.error("User logged in")  # This is INFO!
logger.error("Cache miss")       # This is DEBUG!

# ✅ RIGHT - Appropriate levels
logger.info("User logged in", extra={"user_id": user_id})
logger.debug("Cache miss", extra={"key": cache_key})
logger.error("Database connection failed", exc_info=True)

6. Synchronous Observability Calls

# ❌ WRONG - Blocking on telemetry
def process_request():
    log_to_remote_service("Request started")  # Network call!
    # ... business logic ...

# ✅ RIGHT - Async/buffered telemetry
def process_request():
    logger.info("Request started")  # Buffered by logging library
    # ... business logic ...

Example Scenarios 🎯

Scenario 1: Database Slowdown Investigation

Problem: Users reporting slow page loads

Step 1 - Check metrics dashboard:

# Query Prometheus for average request latency
rate(http_request_duration_seconds_sum[5m]) / 
  rate(http_request_duration_seconds_count[5m])
# Result: Average latency jumped from 150ms to 2500ms

Step 2 - Sample traces:

-- Query trace backend for slow requests
SELECT trace_id, duration_ms 
FROM traces 
WHERE duration_ms > 2000 
ORDER BY start_time DESC 
LIMIT 10

Trace visualization shows:

GET /products (2400ms)
├─ API Gateway (50ms)
├─ Auth middleware (30ms)
└─ Database query (2300ms)  ← BOTTLENECK!
   SELECT * FROM products WHERE category = ?

Step 3 - Check database logs:

{
  "timestamp": "2024-01-15T14:45:23Z",
  "level": "WARN",
  "service": "postgres",
  "message": "Slow query detected",
  "query": "SELECT * FROM products WHERE category = 'electronics'",
  "duration_ms": 2300,
  "rows_scanned": 1500000,
  "rows_returned": 150,
  "trace_id": "abc123"
}

Root cause: Missing index on category column, causing full table scan.

Scenario 2: Intermittent API Errors

Problem: Alert fires: error_rate > 5% for payment API

Metrics show:

sum(rate(http_requests_total{status=~"5.."}[5m])) by (endpoint)
# /api/payment/charge: 8% error rate (above threshold)

Trace filtering:

-- Find failed payment traces
SELECT trace_id, error_message 
FROM spans 
WHERE service = 'payment-service' 
  AND status = 'ERROR' 
  AND timestamp > NOW() - INTERVAL '1 hour'

Example trace:

POST /payment/charge (failed)
├─ Validate card details (5ms) ✅
├─ Check fraud rules (12ms) ✅
└─ Call Stripe API (timeout after 30s) ❌
   Error: ConnectionTimeout

Logs reveal pattern:

[
  {"timestamp": "14:32:15", "error": "stripe.timeout", "duration_ms": 30000},
  {"timestamp": "14:35:42", "error": "stripe.timeout", "duration_ms": 30000},
  {"timestamp": "14:38:09", "error": "stripe.timeout", "duration_ms": 30000}
]

Root cause: Stripe API experiencing intermittent timeouts (check their status page).
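
The upstream outage is outside your control, but the traces also show each call hanging for the full 30 s. A hedged mitigation sketch: put an explicit, short timeout on the outbound call so failures surface quickly. The URL and function name are hypothetical.

# Sketch: cap the outbound gateway call with an explicit timeout so a flaky
# upstream fails fast instead of hanging for 30 s. URL and names are hypothetical.
import requests

def charge_via_gateway(payload: dict) -> dict:
    try:
        resp = requests.post("https://payment-gateway.example.com/charge",
                             json=payload, timeout=5)
        resp.raise_for_status()
        return resp.json()
    except requests.Timeout:
        # Surface quickly; the caller can retry, queue, or fail the checkout cleanly.
        raise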

Scenario 3: Memory Leak Detection

Metrics alert: memory_usage_bytes steadily increasing

# Memory gauge showing growth
process_resident_memory_bytes{service="user-service"}
# 500MB → 800MB → 1.2GB → 1.8GB over 4 hours

Check heap profile traces:

// Enable runtime profiling
import _ "net/http/pprof"

// Capture heap snapshot
// curl http://localhost:6060/debug/pprof/heap > heap.prof
// go tool pprof heap.prof

Logs show repeated warnings:

{
  "level": "WARN",
  "message": "Large object allocation",
  "size_bytes": 10485760,
  "source": "cache.Store()",
  "trace_id": "xyz789"
}

Trace shows caching behavior:

GET /users/:id
├─ Check cache (cache miss)
├─ Query database (user data: 10MB)
└─ Store in cache (10MB added, never evicted)

Root cause: Cache has no eviction policy, accumulating data indefinitely.
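
The class of fix is to bound the cache so entries get evicted. The service's actual caching layer isn't shown here, so this sketch stands in with functools.lru_cache and a hypothetical lookup.

# Sketch of the fix class: give the cache a size bound so old entries are evicted.
# The real caching layer is unknown; functools.lru_cache stands in here.
from functools import lru_cache

def query_database(user_id: str) -> dict:
    # Placeholder for the real, expensive lookup
    return {"id": user_id, "profile": "..."}

@lru_cache(maxsize=1024)  # least-recently-used entries beyond 1024 are dropped
def get_user(user_id: str) -> dict:
    return query_database(user_id)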

Scenario 4: Cross-Service Debugging

Problem: Checkout flow failing sporadically

Distributed trace reveals dependency chain:

POST /checkout (trace_id: abc123) [FAILED]
├─ Frontend service (20ms) ✅
├─ API Gateway (15ms) ✅
├─ Order service (80ms) ✅
│  ├─ Inventory service (50ms) ✅
│  │  └─ Database read (45ms) ✅
│  └─ Pricing service (25ms) ✅
│     └─ Redis cache (5ms) ✅
└─ Payment service (ERROR) ❌
   ├─ Fraud detection (30ms) ✅
   └─ Payment gateway (timeout 10s) ❌
      Error: CircuitBreakerOpenException

Payment service logs:

{
  "trace_id": "abc123",
  "service": "payment-service",
  "error": "CircuitBreakerOpenException",
  "message": "Circuit breaker open after 10 consecutive failures",
  "upstream": "payment-gateway",
  "failure_count": 10,
  "last_success": "2024-01-15T14:20:00Z"
}

Root cause: Payment gateway is down, circuit breaker protecting system from cascading failures.
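
For context, the exception in these logs comes from the circuit-breaker pattern: after repeated upstream failures the breaker opens and fails fast instead of piling up 10-second timeouts. A toy sketch of the idea; thresholds and timing are illustrative, not the payment service's actual implementation.

# Toy sketch of the circuit-breaker pattern behind CircuitBreakerOpenException.
# Thresholds and timing are illustrative, not the service's real configuration.
import time

class CircuitBreakerOpenException(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 10, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failure_count = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                raise CircuitBreakerOpenException("circuit open; failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open the circuit
            raise
        self.failure_count = 0  # success closes the circuit again
        return result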

Key Takeaways 🎓

📋 Quick Reference Card

Pillar      | Best For                                | Cardinality    | Cost  | Retention
------------|-----------------------------------------|----------------|-------|----------
📝 Logs     | Debugging specific events, audit trails | Unlimited      | $$$$  | 7-30 days
📊 Metrics  | Alerts, trends, dashboards              | Low (bounded)  | $     | 30-365 days
🗺️ Traces   | Latency analysis, dependencies          | Request-scoped | $$$   | 7-14 days

🧠 Memory Device: "LMT" = "Let Me Trace"

  • Logs = Lookup specific events
  • Metrics = Monitor trends & alert
  • Traces = Track requests across services

🔑 Golden Rules

  1. Correlate everything: Always include trace_id in logs so every signal can be joined back to a trace (see the sketch after this list)
  2. Sample intelligently: 100% tracing is wasteful; sample errors and slow requests
  3. Mind cardinality: Never use user_id or session_id as metric labels
  4. Structure logs: Use JSON with consistent field names
  5. Use the right tool: Metrics for "what", traces for "where", logs for "why"
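
Golden rule 1 in Python form: a sketch that reads the active span's IDs with the OpenTelemetry API so they can be attached to log records. The hex formatting follows the usual W3C convention; the log message is illustrative.

# Sketch: read the current trace/span IDs so they can be attached to log records.
import logging
from opentelemetry import trace

ctx = trace.get_current_span().get_span_context()
logging.getLogger(__name__).error(
    "Payment failed",
    extra={
        "trace_id": format(ctx.trace_id, "032x"),  # 128-bit trace id as hex
        "span_id": format(ctx.span_id, "016x"),    # 64-bit span id as hex
    },
)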

🔧 Try This: Instrument a Simple Service

Challenge: Add observability to this basic HTTP handler

# Before: No observability
def handle_user_request(user_id):
    user = database.get_user(user_id)
    if not user:
        return {"error": "User not found"}, 404
    return {"user": user.to_dict()}, 200

Your task: add the following:

  1. Structured logging with trace context
  2. Metrics for request count and duration
  3. Distributed tracing span
  4. Error handling with proper observability

💡 Solution:
import logging
import time
from opentelemetry import trace, metrics

logger = logging.getLogger(__name__)
tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

request_counter = meter.create_counter("user_requests_total")
request_duration = meter.create_histogram("user_request_duration_ms")

def handle_user_request(user_id):
    with tracer.start_as_current_span("handle_user_request") as span:
        span.set_attribute("user.id", user_id)
        start_time = time.time()
        
        try:
            logger.info(
                "Fetching user",
                extra={"user_id": user_id}
            )
            
            user = database.get_user(user_id)
            
            if not user:
                request_counter.add(1, {"status": "not_found"})
                logger.warning(
                    "User not found",
                    extra={"user_id": user_id}
                )
                return {"error": "User not found"}, 404
            
            request_counter.add(1, {"status": "success"})
            logger.info(
                "User fetched successfully",
                extra={"user_id": user_id}
            )
            return {"user": user.to_dict()}, 200
            
        except DatabaseException as e:
            request_counter.add(1, {"status": "error"})
            logger.error(
                f"Database error: {str(e)}",
                extra={"user_id": user_id},
                exc_info=True
            )
            span.set_status(trace.Status(trace.StatusCode.ERROR))
            span.record_exception(e)
            return {"error": "Internal server error"}, 500
            
        finally:
            duration_ms = (time.time() - start_time) * 1000
            request_duration.record(duration_ms)

📚 Further Study

Essential resources for deeper learning:

  1. OpenTelemetry Documentation - Industry standard for observability instrumentation
    https://opentelemetry.io/docs/

  2. Google SRE Book - Monitoring Distributed Systems - Foundational monitoring principles
    https://sre.google/sre-book/monitoring-distributed-systems/

  3. Charity Majors - Observability Engineering - Modern observability practices from Honeycomb CTO
    https://www.honeycomb.io/blog/observability-engineering-101


Ready to test your knowledge? The practice questions below will challenge you with real-world debugging scenarios using logs, metrics, and traces. Remember: under pressure, the right observability signal makes all the difference between panic and precision. 🎯