Logs vs Metrics vs Traces
Choosing the right observability tool under time pressure
Master the three pillars of observability with free flashcards and spaced repetition practice. This lesson covers structured logging practices, time-series metrics collection, and distributed tracing fundamentals: essential concepts for debugging production systems under pressure.
Welcome to Observability Fundamentals
When your production system crashes at 3 AM, you need data, fast. But not just any data. You need the right kind of observability data that tells you exactly what went wrong, where, and why. The three pillars of observability (logs, metrics, and traces) each serve distinct purposes in your debugging arsenal.
Think of observability like investigating a crime scene:
- Logs are the witness statements (detailed accounts of what happened)
- Metrics are the security camera timestamps (numerical patterns over time)
- Traces are the footprints connecting locations (requests flowing through services)
Understanding when to use each, and how they complement each other, separates engineers who panic under pressure from those who methodically isolate root causes.
Core Concepts: The Three Pillars Explained
Logs: The Detective's Notebook
Logs are immutable, timestamped records of discrete events that happened in your system. They're your most detailed observability signal: the raw material for understanding what happened.
Characteristics:
- Discrete events: Each log represents a single occurrence
- High cardinality: Can contain unlimited unique values (user IDs, error messages, stack traces)
- Text-based: Usually structured (JSON) or semi-structured (key=value)
- Storage-intensive: Growing linearly with system activity
Common use cases:
- Debugging specific user issues ("Why did order #12345 fail?")
- Audit trails and compliance
- Error stack traces and exception details
- Application state at specific moments
Example structured log:
{
  "timestamp": "2024-01-15T14:32:18.123Z",
  "level": "ERROR",
  "service": "payment-processor",
  "trace_id": "a1b2c3d4e5f6",
  "user_id": "usr_98765",
  "event": "payment_failed",
  "error": "CardDeclinedException",
  "message": "Payment declined: insufficient funds",
  "amount": 149.99,
  "currency": "USD",
  "card_last4": "4242"
}
💡 Pro tip: Always include correlation IDs (trace_id, request_id) in logs so you can connect related events across services.
⚠️ Common mistake: Logging too much sensitive data (PII, passwords, credit cards). Use log sanitization libraries and configure log levels appropriately.
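One lightweight approach is a logging filter that redacts sensitive fields before records are written. A minimal sketch using Python's standard logging module (the field names and card-number pattern are illustrative, not a complete PII solution):
import logging
import re

SENSITIVE_KEYS = {"password", "card_number", "ssn"}   # illustrative field names to scrub
CARD_RE = re.compile(r"\b\d{13,16}\b")                # naive card-number pattern

class RedactingFilter(logging.Filter):
    """Masks sensitive values in the message and in extra fields before they are emitted."""
    def filter(self, record):
        record.msg = CARD_RE.sub("[REDACTED]", str(record.msg))
        for key in SENSITIVE_KEYS & set(record.__dict__):
            setattr(record, key, "[REDACTED]")
        return True  # keep the record, just sanitized

logger = logging.getLogger("payment-processor")
logger.addFilter(RedactingFilter())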
Log Levels Hierarchy
| Level | Purpose | Production Volume | Example |
|---|---|---|---|
| TRACE | Extremely detailed debugging | Never | "Entering function validateInput()" |
| DEBUG | Diagnostic information | Rarely | "Cache hit for key: user:123" |
| INFO | Normal operations | Moderate | "Payment processed successfully" |
| WARN | Unusual but handled | Low | "API rate limit approaching" |
| ERROR | Failures requiring attention | Very low | "Database connection failed" |
| FATAL | System-crashing errors | Extremely rare | "Out of memory, shutting down" |
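To make the table concrete, here is a minimal Python logging setup that keeps production at INFO while still emitting each event at its proper level (note that Python's standard logging has no TRACE level and maps FATAL to CRITICAL):
import logging

# Production configuration: INFO and above; DEBUG stays out of production
logging.basicConfig(
    level=logging.INFO,
    format='{"timestamp": "%(asctime)s", "level": "%(levelname)s", "message": "%(message)s"}',
)
log = logging.getLogger("payment-processor")

log.debug("Cache hit for key: user:123")       # suppressed in production (below INFO)
log.info("Payment processed successfully")     # normal operations
log.warning("API rate limit approaching")      # unusual but handled
log.error("Database connection failed")        # failure requiring attention
log.critical("Out of memory, shutting down")   # system-crashing error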
Metrics: The System's Vital Signs
Metrics are numerical measurements aggregated over time intervals. They answer "how much" and "how often" questions with minimal storage overhead.
Characteristics:
- Time-series data: Values measured at regular intervals
- Low cardinality: Limited set of dimensions (tags/labels)
- Aggregatable: Can be summed, averaged, percentiled
- Storage-efficient: Pre-aggregated, constant size regardless of traffic
Metric types:
| Type | Description | Example | Use Case |
|---|---|---|---|
| Counter | Monotonically increasing value | http_requests_total | Request counts, errors |
| Gauge | Value that goes up and down | memory_usage_bytes | CPU, memory, queue size |
| Histogram | Distribution of values in buckets | response_time_ms | Latency percentiles |
| Summary | Similar to histogram, client-calculated | request_duration_summary | Pre-computed quantiles |
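If you instrument in Python, the four types map directly onto classes in the prometheus_client library (assumed here; the metric names are illustrative):
from prometheus_client import Counter, Gauge, Histogram, Summary

# Counter: only ever goes up
http_requests_total = Counter(
    "http_requests_total", "Total HTTP requests", ["endpoint", "status"])

# Gauge: can go up or down
active_connections = Gauge("active_connections", "Current open connections")

# Histogram: observations land in configurable buckets (good for latency percentiles)
request_duration = Histogram(
    "http_request_duration_seconds", "Request duration", buckets=(0.1, 0.5, 1.0))

# Summary: quantiles computed on the client
payload_size = Summary("request_payload_bytes", "Request payload size")

http_requests_total.labels(endpoint="/users", status="200").inc()
active_connections.set(42)
request_duration.observe(0.23)
payload_size.observe(512)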
Example Prometheus metrics:
# Counter - total HTTP requests
http_requests_total{service="api", endpoint="/users", status="200"} 15847
# Gauge - current active connections
active_connections{service="database", instance="db-1"} 42
# Histogram - request duration buckets
http_request_duration_seconds_bucket{le="0.1"} 9234
http_request_duration_seconds_bucket{le="0.5"} 11532
http_request_duration_seconds_bucket{le="1.0"} 11890
http_request_duration_seconds_sum 5847.32
http_request_duration_seconds_count 11932
💡 Golden Signals (Google SRE methodology; instrumented in the sketch after this list):
- Latency: How long requests take
- Traffic: How much demand on your system
- Errors: Rate of failed requests
- Saturation: How "full" your service is
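A rough sketch of wiring all four signals into a request handler with prometheus_client (the metric names and the in-flight gauge as a saturation proxy are illustrative):
import time
from prometheus_client import Counter, Gauge, Histogram

REQUEST_LATENCY = Histogram("request_latency_seconds", "Request latency")        # Latency
REQUEST_COUNT = Counter("requests_total", "Traffic", ["endpoint"])               # Traffic
ERROR_COUNT = Counter("request_errors_total", "Errors", ["endpoint"])            # Errors
IN_FLIGHT = Gauge("requests_in_flight", "Saturation proxy")                      # Saturation

def handle(endpoint, work):
    REQUEST_COUNT.labels(endpoint=endpoint).inc()
    IN_FLIGHT.inc()
    start = time.time()
    try:
        return work()
    except Exception:
        ERROR_COUNT.labels(endpoint=endpoint).inc()
        raise
    finally:
        IN_FLIGHT.dec()
        REQUEST_LATENCY.observe(time.time() - start)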
When to use metrics:
- Real-time dashboards and alerts
- Capacity planning and trends
- SLO/SLA monitoring (99.9% uptime)
- Resource utilization tracking
- High-level system health
⚠️ Cardinality explosion danger: Adding high-cardinality dimensions (user_id, session_id) to metrics can create millions of time series, overwhelming your metrics backend.
# ❌ BAD - creates millions of unique metrics
request_counter.labels(user_id=user_id, session_id=session_id).inc()

# ✅ GOOD - bounded cardinality
request_counter.labels(endpoint=endpoint, status_code=status).inc()
Traces: Following the Request Journey
Traces show the path of a single request as it flows through distributed systems. They answer "where did the time go?" by breaking down latency across service boundaries.
Characteristics:
- Request-scoped: One trace per user request
- Tree structure: Parent-child relationships between spans
- Cross-service: Follows requests across network boundaries
- Sampling required: Can't trace 100% of traffic at scale
Trace anatomy:
Trace: User checkout request (trace_id: abc123)
└─ Span 1: API Gateway [120ms]
   ├─ Span 2: Auth Service [15ms]
   ├─ Span 3: Inventory Service [45ms]
   │  └─ Span 4: Database query [40ms]
   └─ Span 5: Payment Service [55ms]
      ├─ Span 6: Fraud check API [20ms]
      └─ Span 7: Stripe API [30ms]
Total duration: 120ms
Span attributes:
{
  "trace_id": "abc123def456",
  "span_id": "span789",
  "parent_span_id": "span456",
  "name": "POST /checkout",
  "service": "payment-service",
  "start_time": "2024-01-15T14:32:18.100Z",
  "duration_ms": 55,
  "status": "OK",
  "attributes": {
    "http.method": "POST",
    "http.url": "/api/v1/checkout",
    "http.status_code": 200,
    "user.id": "usr_98765",
    "payment.amount": 149.99
  },
  "events": [
    {"timestamp": "...", "name": "fraud_check_passed"},
    {"timestamp": "...", "name": "payment_authorized"}
  ]
}
Distributed tracing context propagation:
# Service A (Python) generates trace context
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", order_id)
    # Inject trace context into HTTP headers
    headers = {}
    inject(headers)  # Adds the traceparent header
    # Call Service B with context
    response = requests.post(
        "http://inventory-service/reserve",
        headers=headers,
        json={"items": items}
    )
// Service B (Go) extracts the trace context and continues the span
func reserveInventory(w http.ResponseWriter, r *http.Request) {
    // Extract parent trace context from headers
    ctx := otel.GetTextMapPropagator().Extract(r.Context(),
        propagation.HeaderCarrier(r.Header))
    // Create child span
    ctx, span := tracer.Start(ctx, "reserve_inventory")
    defer span.End()
    // Your business logic here
    checkInventory(ctx, items)
}
💡 Sampling strategies (a configuration sketch follows this list):
- Head-based sampling: Decide at trace start (e.g., sample 1% of requests)
- Tail-based sampling: Decide after trace completes (keep all errors, slow requests)
- Adaptive sampling: Adjust rate based on traffic volume
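With the OpenTelemetry Python SDK, head-based sampling looks roughly like this (the 1% ratio is an example; guaranteeing that errors and slow requests are always kept requires tail-based sampling in a collector, configured outside the application):
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head-based: keep ~1% of traces, but always follow the parent's decision
# so a sampled request stays sampled across every downstream service.
sampler = ParentBased(root=TraceIdRatioBased(0.01))
trace.set_tracer_provider(TracerProvider(sampler=sampler))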
When to use traces:
- Debugging latency issues across services
- Understanding service dependencies
- Finding bottlenecks in request flows
- Root cause analysis for specific user requests
Choosing the Right Tool for the Job
Decision Matrix
| Question | Use Logs | Use Metrics | Use Traces |
|---|---|---|---|
| "Why did order #12345 fail?" | ✅ Search logs for order_id | ❌ | ✅ View trace for request_id |
| "Is error rate increasing?" | ⚠️ Expensive aggregation | ✅ Alert on error_rate metric | ❌ |
| "Which service is slowest?" | ❌ | ⚠️ Shows symptoms only | ✅ Trace breakdown shows bottleneck |
| "What was the exact error message?" | ✅ Full stack trace in logs | ❌ | ⚠️ Limited span attributes |
| "Is CPU usage trending up?" | ❌ | ✅ CPU gauge metric over time | ❌ |
| "What's our p99 latency?" | ⚠️ Can calculate but expensive | ✅ Histogram metrics | ⚠️ Only sampled requests |
The Complementary Nature of Observability
OBSERVABILITY DEBUGGING FLOW

METRICS (Discovery)
   "Error rate spiking!"
        ↓
TRACES (Isolation)
   "Payment service taking 5 seconds"
        ↓
LOGS (Root Cause)
   "CardDeclinedException: Gateway timeout"
Real-world debugging scenario:
1. Alert fires: Metrics show http_request_duration_seconds{p99} jumped from 200ms to 3000ms
2. Narrow down: Traces reveal that the /checkout endpoint's payment-service span is slow
3. Find root cause: Logs from payment-service show "error": "upstream timeout connecting to payment gateway"
4. Confirm: Check the payment gateway's status page (external dependency issue)
# Unified instrumentation example (Python + OpenTelemetry)
import logging
import time
from opentelemetry import trace, metrics
from opentelemetry.instrumentation.logging import LoggingInstrumentor

# Auto-inject trace context into logs
LoggingInstrumentor().instrument()

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

# Define metrics
checkout_counter = meter.create_counter(
    "checkouts_total",
    description="Total checkout attempts"
)
checkout_duration = meter.create_histogram(
    "checkout_duration_ms",
    description="Checkout processing time"
)
def process_checkout(user_id, items):
    # Create trace span
    with tracer.start_as_current_span("process_checkout") as span:
        span.set_attribute("user.id", user_id)
        start_time = time.time()
        try:
            # Business logic
            result = charge_payment(user_id, items)
            # Success metrics
            checkout_counter.add(1, {"status": "success"})
            # Structured log with trace context (auto-injected)
            logging.info(
                "Checkout completed",
                extra={
                    "user_id": user_id,
                    "amount": result.amount,
                    "order_id": result.order_id
                }
            )
            return result
        except PaymentException as e:
            # Error metrics
            checkout_counter.add(1, {"status": "failed", "error": type(e).__name__})
            # Error log with stack trace
            logging.error(
                f"Checkout failed: {str(e)}",
                extra={"user_id": user_id},
                exc_info=True
            )
            # Mark span as error
            span.set_status(trace.Status(trace.StatusCode.ERROR))
            span.record_exception(e)
            raise
        finally:
            # Record duration metric
            duration_ms = (time.time() - start_time) * 1000
            checkout_duration.record(duration_ms)
Storage and Cost Considerations
Relative Costs
Storage Cost Comparison (per GB ingested)
LOGS:    ████████████████████  $$$$ (Highest)
- Full-text indexing
- Retention: 7-30 days typical
- Grows linearly with traffic
TRACES:  ████████████  $$$ (Medium-High)
- Sampling reduces volume
- Complex relationships to store
- Retention: 7-14 days typical
METRICS: ████  $ (Lowest)
- Pre-aggregated time series
- Constant size per cardinality
- Retention: 30-365+ days typical
Optimization strategies:
# Log optimization
log_levels:
  production: INFO      # Not DEBUG
sampling: 0.1           # Sample 10% of INFO logs
structured: true        # Enables efficient querying
retention:
  error: 30 days
  info: 7 days
  debug: 1 day

# Metric optimization
metrics:
  cardinality_limit: 1000     # per metric name
  aggregation_interval: 15s
  retention:
    raw: 7 days
    5min_avg: 30 days
    1hour_avg: 365 days

# Trace optimization
tracing:
  sampling_strategy: tail_based
  sample_rate: 0.01           # 1% of successful requests
  always_sample:
    - errors: true
    - slow_requests: true     # p95+ latency
  retention: 7 days
Common Mistakes to Avoid ⚠️
1. Logging in Metrics Clothing
# ❌ WRONG - Treating logs like metrics
for user_id in user_ids:
    logger.info(f"User {user_id} logged in")  # Creates log spam

# ✅ RIGHT - Use metrics for aggregation, logs for details
login_counter.labels(status="success").inc()
logger.info("User logged in", extra={"user_id": user_id})  # Only when needed
2. High-Cardinality Metrics
# ❌ WRONG - Unbounded cardinality
http_requests{user_id="usr_12345", session_id="sess_67890"}  # Millions of series!

# ✅ RIGHT - Bounded dimensions
http_requests{endpoint="/api/users", status="200", method="GET"}
3. Missing Trace Context
// ❌ WRONG - Logs and traces disconnected
logger.error('Payment failed');

// ✅ RIGHT - Correlate with trace_id
const span = trace.getActiveSpan();
logger.error('Payment failed', {
  trace_id: span.spanContext().traceId,
  span_id: span.spanContext().spanId
});
4. Over-Sampling Traces
// ❌ WRONG - Trace 100% in production
sampler := trace.AlwaysSample()

// ✅ RIGHT - Sample a small fraction (head-based)
sampler := trace.ParentBased(
    trace.TraceIDRatioBased(0.01), // 1% base rate
)
5. Log Level Confusion
# ❌ WRONG - Everything at ERROR level
logger.error("User logged in")  # This is INFO!
logger.error("Cache miss")  # This is DEBUG!

# ✅ RIGHT - Appropriate levels
logger.info("User logged in", extra={"user_id": user_id})
logger.debug("Cache miss", extra={"key": cache_key})
logger.error("Database connection failed", exc_info=True)
6. Synchronous Observability Calls
# ❌ WRONG - Blocking on telemetry
def process_request():
    log_to_remote_service("Request started")  # Network call!
    # ... business logic ...

# ✅ RIGHT - Async/buffered telemetry
def process_request():
    logger.info("Request started")  # Buffered by logging library
    # ... business logic ...
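If part of your logging pipeline genuinely must ship records over the network, Python's standard library can push that work off the request path with a queue; a minimal sketch (the log-ingest host and path are hypothetical):
import logging
import logging.handlers
import queue

log_queue = queue.Queue(maxsize=10000)

# Handlers attached to the listener run on a background thread,
# so the hot path only pays for an in-memory enqueue.
remote_handler = logging.handlers.HTTPHandler(
    "logs.example.internal", "/ingest", method="POST")  # hypothetical endpoint
listener = logging.handlers.QueueListener(log_queue, remote_handler)
listener.start()

logger = logging.getLogger("checkout")
logger.addHandler(logging.handlers.QueueHandler(log_queue))

logger.info("Request started")  # returns immediately; shipping happens off-thread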
Example Scenarios
Scenario 1: Database Slowdown Investigation
Problem: Users reporting slow page loads
Step 1 - Check metrics dashboard:
# Query Prometheus
rate(http_request_duration_seconds_sum[5m]) /
rate(http_request_duration_seconds_count[5m])
# Result: Average latency jumped from 150ms to 2500ms
Step 2 - Sample traces:
-- Query trace backend for slow requests
SELECT trace_id, duration_ms
FROM traces
WHERE duration_ms > 2000
ORDER BY start_time DESC
LIMIT 10
Trace visualization shows:
GET /products (2400ms)
├─ API Gateway (50ms)
├─ Auth middleware (30ms)
└─ Database query (2300ms)  ← BOTTLENECK!
   SELECT * FROM products WHERE category = ?
Step 3 - Check database logs:
{
  "timestamp": "2024-01-15T14:45:23Z",
  "level": "WARN",
  "service": "postgres",
  "message": "Slow query detected",
  "query": "SELECT * FROM products WHERE category = 'electronics'",
  "duration_ms": 2300,
  "rows_scanned": 1500000,
  "rows_returned": 150,
  "trace_id": "abc123"
}
Root cause: Missing index on category column, causing full table scan.
Scenario 2: Intermittent API Errors
Problem: Alert fires: error_rate > 5% for payment API
Metrics show:
sum(rate(http_requests_total{status=~"5.."}[5m])) by (endpoint)
  / sum(rate(http_requests_total[5m])) by (endpoint)
# /api/payment/charge: 8% error rate (above threshold)
Trace filtering:
-- Find failed payment traces
SELECT trace_id, error_message
FROM spans
WHERE service = 'payment-service'
AND status = 'ERROR'
AND timestamp > NOW() - INTERVAL '1 hour'
Example trace:
POST /payment/charge (failed)
├─ Validate card details (5ms) ✓
├─ Check fraud rules (12ms) ✓
└─ Call Stripe API (timeout after 30s) ✗
   Error: ConnectionTimeout
Logs reveal pattern:
[
  {"timestamp": "14:32:15", "error": "stripe.timeout", "duration_ms": 30000},
  {"timestamp": "14:35:42", "error": "stripe.timeout", "duration_ms": 30000},
  {"timestamp": "14:38:09", "error": "stripe.timeout", "duration_ms": 30000}
]
Root cause: Stripe API experiencing intermittent timeouts (check their status page).
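While the provider recovers, a common mitigation is to cap the outbound timeout well below 30 seconds and retry with backoff so request threads are not tied up; a rough sketch with the requests library (the URL, payload, and retry budget are illustrative):
import time
import requests

def charge_with_retry(payload, attempts=3):
    # Fail fast: a 3-second cap beats waiting out a 30-second upstream timeout.
    for attempt in range(attempts):
        try:
            return requests.post(
                "https://api.stripe.example/v1/charges",  # illustrative URL
                json=payload,
                timeout=3,
            )
        except requests.Timeout:
            if attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, ...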
Scenario 3: Memory Leak Detection
Metrics alert: memory_usage_bytes steadily increasing
# Memory gauge showing growth
process_resident_memory_bytes{service="user-service"}
# 500MB → 800MB → 1.2GB → 1.8GB over 4 hours
Check a heap profile:
// Enable runtime profiling
import _ "net/http/pprof"
// Capture heap snapshot
// curl http://localhost:6060/debug/pprof/heap > heap.prof
// go tool pprof heap.prof
Logs show repeated warnings:
{
  "level": "WARN",
  "message": "Large object allocation",
  "size_bytes": 10485760,
  "source": "cache.Store()",
  "trace_id": "xyz789"
}
Trace shows caching behavior:
GET /users/:id
├─ Check cache (cache miss)
├─ Query database (user data: 10MB)
└─ Store in cache (10MB added, never evicted)
Root cause: Cache has no eviction policy, accumulating data indefinitely.
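The fix is a bounded cache with an eviction policy. One possible sketch uses the cachetools library (assumed here; query_database is a hypothetical helper), though functools.lru_cache also works for simple cases:
from cachetools import TTLCache

# At most 1,000 entries, each evicted after 5 minutes,
# so memory stays bounded regardless of traffic.
user_cache = TTLCache(maxsize=1000, ttl=300)

def get_user(user_id):
    user = user_cache.get(user_id)
    if user is None:
        user = query_database(user_id)  # hypothetical DB helper
        user_cache[user_id] = user
    return user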
Scenario 4: Cross-Service Debugging
Problem: Checkout flow failing sporadically
Distributed trace reveals dependency chain:
POST /checkout (trace_id: abc123) [FAILED]
├─ Frontend service (20ms) ✓
├─ API Gateway (15ms) ✓
├─ Order service (80ms) ✓
│  ├─ Inventory service (50ms) ✓
│  │  └─ Database read (45ms) ✓
│  └─ Pricing service (25ms) ✓
│     └─ Redis cache (5ms) ✓
└─ Payment service (ERROR) ✗
   ├─ Fraud detection (30ms) ✓
   └─ Payment gateway (timeout 10s) ✗
      Error: CircuitBreakerOpenException
Payment service logs:
{
  "trace_id": "abc123",
  "service": "payment-service",
  "error": "CircuitBreakerOpenException",
  "message": "Circuit breaker open after 10 consecutive failures",
  "upstream": "payment-gateway",
  "failure_count": 10,
  "last_success": "2024-01-15T14:20:00Z"
}
Root cause: Payment gateway is down, circuit breaker protecting system from cascading failures.
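For context, the protective pattern at work is straightforward to sketch: count consecutive failures and short-circuit calls while the breaker is open (thresholds and structure are illustrative, not the payment service's actual implementation):
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=10, reset_timeout=30):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        # While open, reject immediately instead of piling onto a dead upstream.
        if self.opened_at and time.time() - self.opened_at < self.reset_timeout:
            raise RuntimeError("CircuitBreakerOpen: skipping call to upstream")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()  # trip the breaker
            raise
        else:
            self.failures = 0       # a healthy call resets the count
            self.opened_at = None
            return result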
Key Takeaways
Quick Reference Card
| Pillar | Best For | Cardinality | Cost | Retention |
|---|---|---|---|---|
| Logs | Debugging specific events, audit trails | Unlimited | $$$$ | 7-30 days |
| Metrics | Alerts, trends, dashboards | Low (bounded) | $ | 30-365 days |
| Traces | Latency analysis, dependencies | Request-scoped | $$$ | 7-14 days |
Memory Device: "LMT" = "Let Me Trace"
- Logs = Lookup specific events
- Metrics = Monitor trends & alert
- Traces = Track requests across services
Golden Rules
- Correlate everything: Always include trace_id in your logs so every signal can be tied back to a single request (see the sketch after this list)
- Sample intelligently: 100% tracing is wasteful; sample errors and slow requests
- Mind cardinality: Never use user_id or session_id as metric labels
- Structure logs: Use JSON with consistent field names
- Use the right tool: Metrics for "what", traces for "where", logs for "why"
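Rule 1 in practice: a small logging filter can stamp the active trace and span IDs onto every record; a sketch against the OpenTelemetry Python API (the formatter layout is illustrative):
import logging
from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    """Attach the current trace_id/span_id to every log record."""
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else "-"
        record.span_id = format(ctx.span_id, "016x") if ctx.is_valid else "-"
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"))
handler.addFilter(TraceContextFilter())
logging.getLogger().addHandler(handler)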
Try This: Instrument a Simple Service
Challenge: Add observability to this basic HTTP handler
# Before: No observability
def handle_user_request(user_id):
    user = database.get_user(user_id)
    if not user:
        return {"error": "User not found"}, 404
    return {"user": user.to_dict()}, 200
Your task: Add:
- Structured logging with trace context
- Metrics for request count and duration
- Distributed tracing span
- Error handling with proper observability
💡 Solution
import logging
import time
from opentelemetry import trace, metrics
logger = logging.getLogger(__name__)
tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)
request_counter = meter.create_counter("user_requests_total")
request_duration = meter.create_histogram("user_request_duration_ms")
def handle_user_request(user_id):
    with tracer.start_as_current_span("handle_user_request") as span:
        span.set_attribute("user.id", user_id)
        start_time = time.time()
        try:
            logger.info(
                "Fetching user",
                extra={"user_id": user_id}
            )
            user = database.get_user(user_id)
            if not user:
                request_counter.add(1, {"status": "not_found"})
                logger.warning(
                    "User not found",
                    extra={"user_id": user_id}
                )
                return {"error": "User not found"}, 404
            request_counter.add(1, {"status": "success"})
            logger.info(
                "User fetched successfully",
                extra={"user_id": user_id}
            )
            return {"user": user.to_dict()}, 200
        except DatabaseException as e:
            request_counter.add(1, {"status": "error"})
            logger.error(
                f"Database error: {str(e)}",
                extra={"user_id": user_id},
                exc_info=True
            )
            span.set_status(trace.Status(trace.StatusCode.ERROR))
            span.record_exception(e)
            return {"error": "Internal server error"}, 500
        finally:
            duration_ms = (time.time() - start_time) * 1000
            request_duration.record(duration_ms)
Further Study
Essential resources for deeper learning:
- OpenTelemetry Documentation - the industry standard for observability instrumentation: https://opentelemetry.io/docs/
- Google SRE Book, "Monitoring Distributed Systems" - foundational monitoring principles: https://sre.google/sre-book/monitoring-distributed-systems/
- Charity Majors, "Observability Engineering" - modern observability practices from Honeycomb's CTO: https://www.honeycomb.io/blog/observability-engineering-101
Ready to test your knowledge? The practice questions below will challenge you with real-world debugging scenarios using logs, metrics, and traces. Remember: under pressure, the right observability signal makes all the difference between panic and precision.