The Three Pillars
Deep dive into metrics, logs, and traces as complementary signals for understanding system behavior
The Three Pillars of Observability
This lesson covers metrics, logs, and traces: the three fundamental signal types that power modern observability systems. Understanding these pillars helps engineers diagnose production issues faster and build more reliable systems.
Welcome to Observability Fundamentals
When your application crashes at 3 AM, you need answers fast. Observability is the practice of understanding what's happening inside your systems by examining the signals they emit. Unlike traditional monitoring that tells you when something breaks, observability helps you understand why it broke and where to look.
The Three Pillars framework (metrics, logs, and traces) has become the industry standard for structuring observability data. Each pillar provides a different lens for understanding system behavior:
- Metrics give you the numerical pulse of your system
- Logs tell the story of what happened
- Traces map the journey of individual requests
Think of it like investigating a car problem: metrics are your dashboard gauges (speed, fuel, temperature), logs are the maintenance records, and traces are the GPS history showing exactly where you drove.
💡 Key Insight: These pillars work best together. Metrics alert you to problems, logs provide context, and traces show you the exact path through your system.
Pillar 1: Metrics
Metrics are numerical measurements collected at regular intervals over time. They're aggregated, time-series data that answer questions like "How many?" "How fast?" and "How much?"
What Makes Metrics Special
Metrics are incredibly efficient because they're just numbers. Instead of storing every detail about every request, you aggregate them into counts, rates, and statistical distributions.
Common metric types:
| Type | Description | Example |
|---|---|---|
| Counter | Always increases (resets on restart) | Total HTTP requests: 1,547,293 |
| Gauge | Point-in-time value that goes up/down | Active connections: 42 |
| Histogram | Distribution across buckets | Response times: 95% under 200ms |
| Summary | Pre-calculated quantiles | p50: 150ms, p99: 800ms |
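To make these types concrete, here is a minimal sketch using the Python prometheus_client library. The metric names, labels, and port are illustrative, not taken from a specific service.

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Counter: only ever increases (resets when the process restarts)
http_requests = Counter(
    "http_requests_total", "Total HTTP requests",
    ["method", "status", "endpoint"],
)

# Gauge: a point-in-time value that can go up or down
active_connections = Gauge("active_connections", "Currently open connections")

# Histogram: observations bucketed into a distribution (seconds here)
request_duration = Histogram(
    "http_request_duration_seconds", "HTTP request latency",
    buckets=(0.05, 0.1, 0.2, 0.5, 1.0, 2.0),
)

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    http_requests.labels(method="GET", status="200", endpoint="/api/users").inc()
    active_connections.set(42)
    request_duration.observe(0.15)  # one request that took 150ms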
Why Metrics Matter
🎯 Fast alerting: You can evaluate "Is CPU > 80%?" instantly across thousands of servers
🎯 Long-term trends: Store data efficiently for months or years to spot patterns
🎯 Dashboards: Build real-time visualizations showing system health at a glance
The Four Golden Signals
Google's Site Reliability Engineering (SRE) book popularized four critical metrics every service should track:
1. Latency: how long requests take. Track p50, p95, and p99 percentiles.
2. Traffic: how many requests you're serving. Requests per second, sessions per second.
3. Errors: the rate of failed requests. 4xx and 5xx responses, exceptions.
4. Saturation: how "full" your service is. CPU, memory, disk, queue depth.
💡 Pro Tip: Start with these four. You can always add more metrics later, but these give you 80% of what you need.
Metric Naming and Labels
Modern metric systems use labels (also called tags or dimensions) to add context:
http_requests_total{method="GET", status="200", endpoint="/api/users"}
http_requests_total{method="POST", status="500", endpoint="/api/orders"}
This lets you slice and dice data: "Show me all POST requests" or "What's the error rate for /api/orders?"
⚠️ Cardinality Warning: Each unique combination of labels creates a new time series. If you add a label like user_id with millions of values, you'll explode your metric storage!
Pillar 2: Logs
Logs are timestamped text records describing discrete events. They're the oldest observability signal; you've probably written console.log() or print() statements to debug code.
Structured vs. Unstructured Logs
Unstructured logs (traditional):
2026-01-15 14:32:17 INFO User john@example.com logged in from 192.168.1.1
2026-01-15 14:32:18 ERROR Payment processing failed: insufficient funds
These are human-readable but hard to query programmatically.
Structured logs (modern best practice):
{
"timestamp": "2026-01-15T14:32:17Z",
"level": "INFO",
"event": "user_login",
"user_email": "john@example.com",
"source_ip": "192.168.1.1"
}
Structured logs are machine-parseable, making them searchable and aggregatable.
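If you work in Python, one way to emit structured logs with only the standard library is a small JSON formatter. This is a minimal sketch; the field names follow the example schema above, and the "fields" convention is an assumption, not a standard API.

import json, logging, sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        payload.update(getattr(record, "fields", {}))  # extra structured fields
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("auth")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Pass structured fields via `extra`; they land as top-level JSON keys
logger.info("user_login", extra={"fields": {"user_email": "john@example.com",
                                            "source_ip": "192.168.1.1"}})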
Log Levels
Standard severity levels help filter noise:
| Level | Use Case | Production Volume |
|---|---|---|
| DEBUG | Detailed diagnostic info | Usually disabled (too verbose) |
| INFO | Normal operations, milestones | Moderate |
| WARN | Unexpected but handled | Low |
| ERROR | Failures requiring attention | Very low (should be rare!) |
| FATAL | Application crash | Extremely rare |
What to Log
✅ Do log:
- Request starts and completions
- State changes (order created, payment processed)
- Error conditions with context
- Security events (authentication, authorization)
- External service calls
❌ Don't log:
- Passwords, API keys, tokens
- Credit card numbers, SSNs
- Personal health information (PHI)
- High-frequency events in tight loops (will crush storage)
The Cost of Logs
Logs are expensive! A busy application might generate gigabytes per hour. That's why you:
- Sample high-volume logs (keep 1%, discard 99%; a sketch follows this list)
- Set retention policies (keep 7 days hot, 30 days cold, then delete)
- Use appropriate levels (don't INFO log every function call)
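As a sketch of the sampling idea, assuming Python's standard logging module, a filter can keep a fixed fraction of low-severity records while always passing warnings and errors. The 1% rate is illustrative.

import logging, random

class SampleFilter(logging.Filter):
    """Drop all but a fraction of records at or below a given level."""
    def __init__(self, rate=0.01, max_level=logging.INFO):
        super().__init__()
        self.rate, self.max_level = rate, max_level

    def filter(self, record):
        if record.levelno > self.max_level:
            return True                       # always keep WARN/ERROR/FATAL
        return random.random() < self.rate    # keep ~1% of everything else

noisy_logger = logging.getLogger("request")
noisy_logger.addFilter(SampleFilter(rate=0.01))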
🧠 Memory Trick: "DIWEF" - Debug, Info, Warn, Error, Fatal (ascending severity)
Pillar 3: Traces
Distributed traces track individual requests as they flow through multiple services. In microservices architectures, a single user action might touch 10+ services; traces show you the complete journey.
Anatomy of a Trace
A trace consists of spans, where each span represents a unit of work:
Distributed trace: order checkout request

API Gateway [span: root]: 1200ms
├─ Cart Service [span: get_cart]: 150ms
├─ Payment Service [span: process_payment]: 800ms
│  └─ Bank API [span: charge_card]: 650ms
└─ Notification Service [span: send_email]: 100ms

Trace ID: a1b2c3d4-e5f6-7890-abcd-ef1234567890 | Total duration: 1200ms | Span count: 5 | Services touched: 4
Key Trace Concepts
Trace ID: Unique identifier for the entire request (follows it everywhere)
Span ID: Unique identifier for each operation within the trace
Parent Span ID: Links child spans to their parent (builds the tree structure)
Tags/Attributes: Key-value metadata (HTTP method, status code, database query)
Baggage: Data propagated to all child spans (user ID, tenant ID, feature flags)
Why Traces Are Powerful
Traces answer questions metrics and logs can't:
π "Why is this request slow?" β See which service is the bottleneck
π "Where did this error originate?" β Follow the span tree to the source
π "Do these services interact efficiently?" β Visualize dependencies and latencies
Trace Sampling
You can't trace every request; it's too expensive. Common strategies:
| Strategy | Description | Use Case |
|---|---|---|
| Head-based | Decide at request start (1% sample rate) | High-volume, mostly healthy traffic |
| Tail-based | Decide after completion (keep all errors) | Ensure you capture interesting requests |
| Adaptive | Adjust rate based on volume/error rate | Optimize cost vs. coverage dynamically |
OpenTelemetry Standard
OpenTelemetry (OTel) is the industry-standard framework for generating traces, metrics, and logs. It provides:
- APIs for instrumenting code
- SDKs for popular languages
- Exporters to send data to various backends
- Auto-instrumentation for common frameworks
💡 Best Practice: Use OTel from the start. It's vendor-neutral, so you can switch observability backends without rewriting instrumentation.
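To give a feel for what that instrumentation looks like, here is a minimal Python sketch with the OTel SDK: a tracer provider configured with a 1% parent-based sampler, plus a parent and child span with attributes. The service name and console exporter are placeholders; in production you would typically export over OTLP to your backend.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head-based sampling: keep ~1% of new traces, follow the parent's decision otherwise
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.01)))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("process_order") as span:      # root span
    span.set_attribute("http.method", "POST")
    span.set_attribute("http.route", "/api/checkout")
    with tracer.start_as_current_span("charge_card") as child:   # child span
        child.set_attribute("payment.provider", "bank-api")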
Bringing the Pillars Together
The Observability Workflow
Here's how the three pillars work together in a real incident:
1. Alert fires.
2. Check metrics: "Error rate spiked from 0.1% to 5% at 14:32." Which service? payment-service. Which endpoint? POST /api/checkout.
3. Search logs: "Show ERROR logs from payment-service after 14:32." Pattern found: "Database connection timeout" (happened 147 times).
4. Examine traces: "Show slow traces touching the payment-service DB." Span analysis: the database_query span takes 30 seconds (normally 50ms); query: SELECT * FROM transactions WHERE...
5. Root cause identified: missing index on the transactions.user_id column; recent data growth made the full table scan too slow.
6. Fix applied: added the index, the query now runs in 45ms, and the error rate is back to 0.1%.
Correlation is Key
The magic happens when you correlate signals:
- Logs include trace IDs → click from a log line to the full trace
- Traces include metric tags → see metrics for a specific service/endpoint
- Metrics link to example traces → jump from a dashboard to specific requests
Example correlation in practice:
{
"timestamp": "2026-01-15T14:32:17Z",
"level": "ERROR",
"message": "Database timeout",
"trace_id": "a1b2c3d4-e5f6-7890",
"span_id": "1234567890abcdef",
"service.name": "payment-service",
"http.method": "POST",
"http.route": "/api/checkout"
}
With this structured log, you can:
- Filter logs by service
- Jump to the full trace (via trace_id)
- Aggregate error counts into metrics (by service.name and http.route)
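Here is one way that correlation might be wired up in Python with the OpenTelemetry API: a small helper that stamps the active span's IDs onto each structured log record, reusing the "fields" convention from the logging sketch earlier. The helper name and field names are assumptions, not a standard API.

import logging
from opentelemetry import trace

def log_with_trace_context(logger, level, message, fields=None):
    """Attach the active span's trace/span IDs to a structured log record."""
    fields = dict(fields or {})
    ctx = trace.get_current_span().get_span_context()
    if ctx.is_valid:
        fields["trace_id"] = format(ctx.trace_id, "032x")  # 128-bit id as hex
        fields["span_id"] = format(ctx.span_id, "016x")    # 64-bit id as hex
    logger.log(level, message, extra={"fields": fields})

logger = logging.getLogger("payment-service")
log_with_trace_context(logger, logging.ERROR, "Database timeout",
                       fields={"service.name": "payment-service",
                               "http.method": "POST",
                               "http.route": "/api/checkout"})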
Beyond the Three Pillars
Modern observability is expanding:
- Continuous Profiling: CPU/memory profiles to find performance bottlenecks
- Snapshots: Capture variable state at specific code lines (without debuggers)
- Chaos Engineering: Intentionally break things to test observability coverage
- AIOps: Machine learning to detect anomalies and predict incidents
But the three pillars remain fundamental; everything else builds on metrics, logs, and traces.
Example 1: E-commerce Checkout Investigation
Scenario: Your e-commerce site's checkout is slow. Customers complain it takes 10+ seconds.
Step 1 - Metrics: Check your latency dashboard
- http_request_duration_seconds{endpoint="/checkout"} p99: 12 seconds (was 800ms yesterday)
- http_requests_total{endpoint="/checkout", status="200"}: Still returning success
- Timeline: Slowdown started at 08:00 UTC
Step 2 - Logs: Search for checkout-related warnings/errors after 08:00
{
"timestamp": "2026-01-15T08:03:42Z",
"level": "WARN",
"message": "Inventory service response time: 9847ms",
"service": "checkout-service",
"trace_id": "xyz789"
}
Pattern: Inventory service calls are slow!
Step 3 - Traces: Pull up trace xyz789
Checkout Request: 12.3s total
├─ Validate Cart: 50ms ✅
├─ Check Inventory: 9.8s ⚠️ ← BOTTLENECK
│  └─ Database Query: 9.7s
│     Query: SELECT * FROM inventory WHERE sku IN (...500 items...)
├─ Process Payment: 400ms ✅
└─ Send Confirmation: 150ms ✅
Root Cause: Someone added 500 items to their cart. The inventory check queries 500 SKUs individually (N+1 query problem).
Fix: Batch inventory queries OR add a cart item limit.
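As a rough sketch of the batching fix; the db.execute helper and its parameter style are assumptions standing in for the real data-access layer.

MAX_CART_ITEMS = 200  # illustrative cart limit; reject larger carts up front

def check_inventory_n_plus_one(db, skus):
    # Before: one query per SKU, so 500 items means 500 round trips
    return {sku: db.execute("SELECT qty FROM inventory WHERE sku = %s", (sku,))
            for sku in skus}

def check_inventory_batched(db, skus):
    # After: a single query covers the whole cart
    placeholders = ", ".join(["%s"] * len(skus))
    rows = db.execute(
        f"SELECT sku, qty FROM inventory WHERE sku IN ({placeholders})",
        tuple(skus))
    return {sku: qty for sku, qty in rows}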
Example 2: Memory Leak Detection
Scenario: Your Node.js API starts fast but gets slower over hours, then crashes.
Step 1 - Metrics: Graph memory usage over time
- process_resident_memory_bytes: Steadily climbing from 200MB to 2GB over 6 hours
- nodejs_heap_size_total_bytes: Also climbing linearly
- nodejs_gc_duration_seconds: GC pauses getting longer (100ms → 5 seconds)
Diagnosis: Classic memory leak pattern; the heap grows until GC can't keep up.
Step 2 - Logs: Search for clues about what's growing
{
"timestamp": "2026-01-15T14:32:17Z",
"level": "INFO",
"message": "WebSocket connection established",
"connection_id": "conn_12345",
"total_connections": 8472
}
WebSocket connections keep growing; they're never closed!
Step 3 - Traces: Sample traces show WebSocket initialization spans but no cleanup spans.
Root Cause: WebSocket connections stored in a Map, but disconnect handler never fires due to a bug.
Fix: Add proper cleanup on disconnect + implement connection limit.
Example 3: Cascading Failure
Scenario: Recommendation service goes down, entire site becomes unresponsive.
Step 1 - Metrics: Multiple services showing problems
- recommendation_service: 100% error rate (down)
- product_page_service: Latency spiked to 30 seconds
- search_service: Latency spiked to 25 seconds
- api_gateway: Request queues backing up
Step 2 - Traces: Examine slow product page traces
Product Page Request: 30.2s
├─ Get Product Details: 50ms ✅
├─ Get Recommendations: 30.0s ❌ TIMEOUT
│  └─ (no response from recommendation-service)
└─ Render Page: (never reached)
The product page blocks waiting for recommendations!
Step 3 - Logs: Recommendation service logs
ERROR: Redis connection refused
Connection pool exhausted, all circuits open
Recommendation service's Redis cache died.
Root Cause: No timeout or fallback. When recommendations fail, product pages hang, exhausting connection pools everywhere.
Fix:
- Add 1-second timeout to recommendation calls (see the sketch after this list)
- Implement graceful degradation (show generic recommendations)
- Add circuit breaker pattern
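A minimal sketch of the first two fixes in Python, using the requests library. The service URL and fallback list are placeholders, and a real deployment would usually add a circuit breaker library around this call.

import requests

GENERIC_RECOMMENDATIONS = ["bestsellers", "staff-picks"]  # placeholder fallback

def get_recommendations(user_id):
    """Call the recommendation service with a hard 1-second timeout."""
    try:
        resp = requests.get(
            "http://recommendation-service/api/recommendations",  # placeholder URL
            params={"user_id": user_id},
            timeout=1.0,  # fail fast instead of hanging the product page
        )
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        # Graceful degradation: show generic recommendations rather than blocking
        return GENERIC_RECOMMENDATIONS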
Example 4: Silent Data Corruption
Scenario: Finance team reports yesterday's revenue number seems low.
Step 1 - Metrics: Check business metrics
- orders_completed_total: Normal volume (5,200 orders)
- revenue_total_usd: $62,400 (expected ~$180,000 based on order volume)
- Average order value: $12 (normally ~$35)
Step 2 - Logs: Search for payment-related anomalies
{
"timestamp": "2026-01-14T16:45:12Z",
"level": "WARN",
"message": "Currency conversion rate unavailable, defaulting to 1.0",
"from_currency": "EUR",
"to_currency": "USD"
}
Appears 3,100 times yesterday afternoon!
Step 3 - Traces: Examine order processing traces
Order Processing
├─ Calculate Total: 100ms
├─ Convert Currency (EUR→USD): 5ms
│  └─ Exchange rate: 1.0 ⚠️ (should be ~1.08)
├─ Charge Payment: 200ms
└─ Record Revenue: 50ms
   └─ amount_usd: $100 (should be ~$108)
Root Cause: Currency API quota exceeded. Fallback used 1:1 conversion rate, undercharging European customers by ~8%.
Fix:
- Increase API quota
- Cache exchange rates locally
- Alert when fallback conversion is used
- Never default to 1.0; fail the order if the rate is unavailable (sketched below)
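A sketch of that last rule in Python: cache rates locally and raise instead of silently defaulting. The fetch_fn API client is a placeholder for the real currency service.

import time

_rate_cache = {}          # (from_ccy, to_ccy) -> (rate, fetched_at)
CACHE_TTL_SECONDS = 3600  # illustrative: reuse rates for up to an hour

class ExchangeRateUnavailable(Exception):
    """Raised instead of falling back to a bogus 1.0 conversion."""

def get_exchange_rate(from_ccy, to_ccy, fetch_fn):
    """fetch_fn is a placeholder for the real currency API client."""
    key = (from_ccy, to_ccy)
    cached = _rate_cache.get(key)
    if cached and time.time() - cached[1] < CACHE_TTL_SECONDS:
        return cached[0]
    try:
        rate = fetch_fn(from_ccy, to_ccy)
    except Exception as exc:
        # A real implementation would also emit a metric/alert here (fix 3 above)
        if cached:
            return cached[0]  # a stale cached rate beats failing the order
        raise ExchangeRateUnavailable(f"{from_ccy}->{to_ccy}") from exc
    _rate_cache[key] = (rate, time.time())
    return rate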
💡 Lesson: Metrics caught what logs alone would miss. Order count looked fine, but the revenue metric revealed the issue.
Common Mistakes ⚠️
1. Logging Too Much (or Too Little)
❌ Too Much: Logging every function entry/exit at INFO level
- Costs: Storage explosion, slow searches, important logs buried in noise
- Fix: Use DEBUG level for verbose logging, keep it disabled in production
❌ Too Little: Only logging errors with generic messages
catch(error) {
logger.error("Something went wrong"); // Useless!
}
- Fix: Include context (what operation, what inputs, what state)
catch(error) {
logger.error("Failed to process payment", {
error: error.message,
order_id: order.id,
payment_method: payment.method,
trace_id: currentTraceId()
});
}
2. High-Cardinality Metrics
❌ Bad: Adding user IDs as metric labels
api_requests{user_id="user_12345", endpoint="/api/orders"}
api_requests{user_id="user_12346", endpoint="/api/orders"}
api_requests{user_id="user_12347", endpoint="/api/orders"}
With 1 million users, you create 1 million time series, and metric storage explodes!
✅ Good: Use bounded labels
api_requests{user_tier="free", endpoint="/api/orders"}
api_requests{user_tier="premium", endpoint="/api/orders"}
Only 2 time series (or however many tiers you have).
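In code, the fix is to map each user onto a small, fixed set of label values before recording the metric. A sketch with the Python prometheus_client; the tier lookup and user.plan attribute are hypothetical.

from prometheus_client import Counter

api_requests = Counter("api_requests_total", "API requests",
                       ["user_tier", "endpoint"])

TIERS = {"free", "premium"}  # a small, fixed label set

def tier_for(user):
    # Hypothetical lookup; the point is that it returns a bounded value
    return user.plan if user.plan in TIERS else "other"

def record_request(user, endpoint):
    api_requests.labels(user_tier=tier_for(user), endpoint=endpoint).inc()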
3. Not Propagating Trace Context
❌ Bad: Starting a new trace in each service
## Service A
trace_id = generate_new_trace_id() # abc123
response = requests.post("service-b.com/api", data=payload)
## Service B receives request
trace_id = generate_new_trace_id() # xyz789 (DIFFERENT!)
You lose the connection between services and can't see the full request path.
✅ Good: Propagate trace context in headers
## Service A
headers = {"traceparent": f"00-{trace_id}-{span_id}-01"}
response = requests.post("service-b.com/api", data=payload, headers=headers)
## Service B receives request
trace_id = extract_trace_id(request.headers["traceparent"]) # Same abc123!
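In practice you rarely build the traceparent header by hand; OpenTelemetry's propagation API handles it. Here is a hedged Python sketch where the downstream URL and the request-handling hooks are placeholders.

import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("service-a")

# Service A: let OTel write the traceparent header for the active span
def call_service_b(payload):
    headers = {}
    inject(headers)  # adds "traceparent" (and any baggage) to the dict
    return requests.post("http://service-b/api", json=payload, headers=headers)

# Service B: extract the incoming context and continue the same trace
def handle_request(request_headers, do_work):
    ctx = extract(request_headers)
    with tracer.start_as_current_span("handle_api_request", context=ctx):
        return do_work()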
4. Ignoring Sampling
❌ Bad: Tracing 100% of requests in high-volume production
- Cost: Massive data volumes, storage costs, network overhead
- Reality: You can't afford to trace 10,000 requests/second
✅ Good: Sample intelligently
- 1% of successful requests (statistical sample)
- 100% of errors (you need these for debugging)
- 100% of slow requests (above p95 latency)
5. Metrics Without Context
❌ Bad: Tracking a counter without useful dimensions
orders_total: 1,547,293
You know total orders, but can't answer: "How many orders per region?" "What's our mobile vs. desktop split?"
✅ Good: Add relevant dimensions
orders_total{region="us-east", platform="mobile", payment_method="credit_card"}: 834,192
orders_total{region="eu-west", platform="desktop", payment_method="paypal"}: 423,101
Now you can slice data meaningfully.
6. Logging Sensitive Data
❌ Bad: Logging user passwords, credit cards, API keys
logger.info("User logged in", {
email: user.email,
password: user.password // NEVER!
});
This violates privacy laws and creates security risks.
✅ Good: Redact sensitive fields
logger.info("User logged in", {
email: user.email,
user_id: user.id,
// password intentionally omitted
});
7. No Retention Policies
❌ Bad: Storing all logs/traces forever
- Costs spiral out of control
- Old data rarely accessed
- Compliance issues (GDPR requires data deletion)
✅ Good: Tiered retention
- Logs: 7 days hot (fast search), 30 days cold (archival), then delete
- Metrics: 15 days at 1s resolution, 90 days at 1m resolution, 1 year at 1h resolution
- Traces: 3 days sampled, keep only errors after that
🧠 Remember: "MiLT" - Metrics (aggregate), Logs (events), Traces (requests)
Key Takeaways 🎯
Quick Reference Card
| Pillar | What It Is | Best For | Example Tool |
|---|---|---|---|
| Metrics | Numerical measurements over time | Alerting, dashboards, trends | Prometheus, Datadog |
| Logs | Timestamped event records | Debugging, auditing, context | Elasticsearch, Splunk |
| Traces | Request paths through systems | Performance analysis, dependencies | Jaeger, Zipkin, Honeycomb |
🎯 The Investigation Pattern
- Metrics tell you WHAT is wrong and WHEN
- Logs tell you WHY and provide context
- Traces tell you WHERE in your system
✅ Best Practices Checklist
- ✅ Use structured logging (JSON format)
- ✅ Include trace IDs in logs for correlation
- ✅ Track the Four Golden Signals (latency, traffic, errors, saturation)
- ✅ Keep metric cardinality low (avoid user IDs as labels)
- ✅ Sample traces intelligently (not 100%)
- ✅ Set retention policies to control costs
- ✅ Never log passwords or sensitive data
- ✅ Use OpenTelemetry for vendor-neutral instrumentation
Key Numbers to Remember
- 4 Golden Signals every service needs
- 3 Pillars of observability
- 1-5% Typical trace sampling rate for high-volume systems
- p99 The percentile that matters most (captures tail latency)
Further Study
Ready to dive deeper? Check out these resources:
Google SRE Book - Monitoring Distributed Systems: https://sre.google/sre-book/monitoring-distributed-systems/ - The chapter that defined the Four Golden Signals and modern observability thinking.
OpenTelemetry Documentation: https://opentelemetry.io/docs/ - Official docs for the industry-standard observability framework. Includes getting-started guides for all major languages.
Charity Majors - Observability Engineering (O'Reilly): https://www.oreilly.com/library/view/observability-engineering/9781492076438/ - Comprehensive book covering observability principles, practices, and cultural aspects from one of the field's pioneers.
Congratulations! You now understand the three pillars that form the foundation of modern observability. Next in this path, you'll learn how to instrument applications to emit these signals effectively.