The Three Pillars
Deep dive into metrics, logs, and traces as complementary signals for understanding system behavior
The Three Pillars of Observability
This lesson covers metrics, logs, and traces: the three fundamental signal types that power modern observability systems. Understanding these pillars helps engineers diagnose production issues faster and build more reliable systems.
Welcome to Observability Fundamentals
When your application crashes at 3 AM, you need answers fast. Observability is the practice of understanding what's happening inside your systems by examining the signals they emit. Unlike traditional monitoring that tells you when something breaks, observability helps you understand why it broke and where to look.
The Three Pillars framework (metrics, logs, and traces) has become the industry standard for structuring observability data. Each pillar provides a different lens for understanding system behavior:
- Metrics give you the numerical pulse of your system
- Logs tell the story of what happened
- Traces map the journey of individual requests
Think of it like investigating a car problem: metrics are your dashboard gauges (speed, fuel, temperature), logs are the maintenance records, and traces are the GPS history showing exactly where you drove.
💡 Key Insight: These pillars work best together. Metrics alert you to problems, logs provide context, and traces show you the exact path through your system.
Pillar 1: Metrics
Metrics are numerical measurements collected at regular intervals over time. They're aggregated, time-series data that answer questions like "How many?" "How fast?" and "How much?"
What Makes Metrics Special
Metrics are incredibly efficient because they're just numbers. Instead of storing every detail about every request, you aggregate them into counts, rates, and statistical distributions.
Common metric types:
| Type | Description | Example |
|---|---|---|
| Counter | Always increases (resets on restart) | Total HTTP requests: 1,547,293 |
| Gauge | Point-in-time value that goes up/down | Active connections: 42 |
| Histogram | Distribution across buckets | Response times: 95% under 200ms |
| Summary | Pre-calculated quantiles | p50: 150ms, p99: 800ms |
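To make these types concrete, here is a minimal sketch using the Python prometheus_client library. The metric names, labels, and port are illustrative, not taken from a specific service.

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Counter: only ever increases (resets when the process restarts)
http_requests = Counter(
    "http_requests_total", "Total HTTP requests",
    ["method", "status", "endpoint"],
)

# Gauge: a point-in-time value that can go up or down
active_connections = Gauge("active_connections", "Currently open connections")

# Histogram: observations bucketed into a distribution (seconds here)
request_duration = Histogram(
    "http_request_duration_seconds", "HTTP request latency",
    buckets=(0.05, 0.1, 0.2, 0.5, 1.0, 2.0),
)

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    http_requests.labels(method="GET", status="200", endpoint="/api/users").inc()
    active_connections.set(42)
    request_duration.observe(0.15)  # one request that took 150ms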
Why Metrics Matter
🎯 Fast alerting: You can evaluate "Is CPU > 80%?" instantly across thousands of servers
🎯 Long-term trends: Store data efficiently for months or years to spot patterns
🎯 Dashboards: Build real-time visualizations showing system health at a glance
The Four Golden Signals
Google's Site Reliability Engineering (SRE) book popularized four critical metrics every service should track:
1. Latency: how long requests take. Track p50, p95, and p99 percentiles.
2. Traffic: how many requests you're serving. Requests per second, sessions per second.
3. Errors: the rate of failed requests. 4xx and 5xx responses, exceptions.
4. Saturation: how "full" your service is. CPU, memory, disk, queue depth.
💡 Pro Tip: Start with these four. You can always add more metrics later, but these give you 80% of what you need.
Metric Naming and Labels
Modern metric systems use labels (also called tags or dimensions) to add context:
http_requests_total{method="GET", status="200", endpoint="/api/users"}
http_requests_total{method="POST", status="500", endpoint="/api/orders"}
This lets you slice and dice data: "Show me all POST requests" or "What's the error rate for /api/orders?"
⚠️ Cardinality Warning: Each unique combination of labels creates a new time series. If you add a label like user_id with millions of values, you'll explode your metric storage!
Pillar 2: Logs
Logs are timestamped text records describing discrete events. They're the oldest observability signal; you've probably written console.log() or print() statements to debug code.
Structured vs. Unstructured Logs
Unstructured logs (traditional):
2026-01-15 14:32:17 INFO User john@example.com logged in from 192.168.1.1
2026-01-15 14:32:18 ERROR Payment processing failed: insufficient funds
These are human-readable but hard to query programmatically.
Structured logs (modern best practice):
{
"timestamp": "2026-01-15T14:32:17Z",
"level": "INFO",
"event": "user_login",
"user_email": "john@example.com",
"source_ip": "192.168.1.1"
}
Structured logs are machine-parseable, making them searchable and aggregatable.
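If you work in Python, one way to emit structured logs with only the standard library is a small JSON formatter. This is a minimal sketch; the field names follow the example schema above, and the "fields" convention is an assumption, not a standard API.

import json, logging, sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        payload.update(getattr(record, "fields", {}))  # extra structured fields
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("auth")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Pass structured fields via `extra`; they land as top-level JSON keys
logger.info("user_login", extra={"fields": {"user_email": "john@example.com",
                                            "source_ip": "192.168.1.1"}})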
Log Levels
Standard severity levels help filter noise:
| Level | Use Case | Production Volume |
|---|---|---|
| DEBUG | Detailed diagnostic info | Usually disabled (too verbose) |
| INFO | Normal operations, milestones | Moderate |
| WARN | Unexpected but handled | Low |
| ERROR | Failures requiring attention | Very low (should be rare!) |
| FATAL | Application crash | Extremely rare |
What to Log
✅ Do log:
- Request starts and completions
- State changes (order created, payment processed)
- Error conditions with context
- Security events (authentication, authorization)
- External service calls
❌ Don't log:
- Passwords, API keys, tokens
- Credit card numbers, SSNs
- Personal health information (PHI)
- High-frequency events in tight loops (will crush storage)
The Cost of Logs
Logs are expensive! A busy application might generate gigabytes per hour. That's why you:
- Sample high-volume logs (keep 1%, discard 99%; a sketch follows this list)
- Set retention policies (keep 7 days hot, 30 days cold, then delete)
- Use appropriate levels (don't INFO log every function call)
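As a sketch of the sampling idea, assuming Python's standard logging module, a filter can keep a fixed fraction of low-severity records while always passing warnings and errors. The 1% rate is illustrative.

import logging, random

class SampleFilter(logging.Filter):
    """Drop all but a fraction of records at or below a given level."""
    def __init__(self, rate=0.01, max_level=logging.INFO):
        super().__init__()
        self.rate, self.max_level = rate, max_level

    def filter(self, record):
        if record.levelno > self.max_level:
            return True                       # always keep WARN/ERROR/FATAL
        return random.random() < self.rate    # keep ~1% of everything else

noisy_logger = logging.getLogger("request")
noisy_logger.addFilter(SampleFilter(rate=0.01))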
🧠 Memory Trick: "DIWEF" - Debug, Info, Warn, Error, Fatal (ascending severity)
Pillar 3: Traces
Distributed traces track individual requests as they flow through multiple services. In microservices architectures, a single user action might touch 10+ services; traces show you the complete journey.
Anatomy of a Trace
A trace consists of spans, where each span represents a unit of work:
Distributed trace: order checkout request

API Gateway [span: root]: 1200ms
├─ Cart Service [span: get_cart]: 150ms
├─ Payment Service [span: process_payment]: 800ms
│  └─ Bank API [span: charge_card]: 650ms
└─ Notification Service [span: send_email]: 100ms

Trace ID: a1b2c3d4-e5f6-7890-abcd-ef1234567890 | Total duration: 1200ms | Span count: 5 | Services touched: 4
Key Trace Concepts
Trace ID: Unique identifier for the entire request (follows it everywhere)
Span ID: Unique identifier for each operation within the trace
Parent Span ID: Links child spans to their parent (builds the tree structure)
Tags/Attributes: Key-value metadata (HTTP method, status code, database query)
Baggage: Data propagated to all child spans (user ID, tenant ID, feature flags)
Why Traces Are Powerful
Traces answer questions metrics and logs can't:
π "Why is this request slow?" β See which service is the bottleneck
π "Where did this error originate?" β Follow the span tree to the source
π "Do these services interact efficiently?" β Visualize dependencies and latencies
Trace Sampling
You can't trace every request; it's too expensive. Common strategies:
| Strategy | Description | Use Case |
|---|---|---|
| Head-based | Decide at request start (1% sample rate) | High-volume, mostly healthy traffic |
| Tail-based | Decide after completion (keep all errors) | Ensure you capture interesting requests |
| Adaptive | Adjust rate based on volume/error rate | Optimize cost vs. coverage dynamically |
OpenTelemetry Standard
OpenTelemetry (OTel) is the industry-standard framework for generating traces, metrics, and logs. It provides:
- APIs for instrumenting code
- SDKs for popular languages
- Exporters to send data to various backends
- Auto-instrumentation for common frameworks
💡 Best Practice: Use OTel from the start. It's vendor-neutral, so you can switch observability backends without rewriting instrumentation.
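To give a feel for what that instrumentation looks like, here is a minimal Python sketch with the OTel SDK: a tracer provider configured with a 1% parent-based sampler, plus a parent and child span with attributes. The service name and console exporter are placeholders; in production you would typically export over OTLP to your backend.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head-based sampling: keep ~1% of new traces, follow the parent's decision otherwise
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.01)))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("process_order") as span:      # root span
    span.set_attribute("http.method", "POST")
    span.set_attribute("http.route", "/api/checkout")
    with tracer.start_as_current_span("charge_card") as child:   # child span
        child.set_attribute("payment.provider", "bank-api")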
Bringing the Pillars Together
The Observability Workflow
Here's how the three pillars work together in a real incident:
1. Alert fires.
2. Check metrics: "Error rate spiked from 0.1% to 5% at 14:32." Which service? payment-service. Which endpoint? POST /api/checkout.
3. Search logs: "Show ERROR logs from payment-service after 14:32." Pattern found: "Database connection timeout" (happened 147 times).
4. Examine traces: "Show slow traces touching the payment-service DB." Span analysis: the database_query span takes 30 seconds (normally 50ms); query: SELECT * FROM transactions WHERE...
5. Root cause identified: missing index on the transactions.user_id column; recent data growth made the full table scan too slow.
6. Fix applied: added the index, the query now runs in 45ms, and the error rate is back to 0.1%.
Correlation is Key
The magic happens when you correlate signals:
- Logs include trace IDs → click from a log line to the full trace
- Traces include metric tags → see metrics for a specific service/endpoint
- Metrics link to example traces → jump from a dashboard to specific requests
Example correlation in practice:
{
"timestamp": "2026-01-15T14:32:17Z",
"level": "ERROR",
"message": "Database timeout",
"trace_id": "a1b2c3d4-e5f6-7890",
"span_id": "1234567890abcdef",
"service.name": "payment-service",
"http.method": "POST",
"http.route": "/api/checkout"
}
With this structured log, you can:
- Filter logs by service
- Jump to the full trace (via trace_id)
- Aggregate error counts into metrics (by service.name and http.route)
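Here is one way that correlation might be wired up in Python with the OpenTelemetry API: a small helper that stamps the active span's IDs onto each structured log record, reusing the "fields" convention from the logging sketch earlier. The helper name and field names are assumptions, not a standard API.

import logging
from opentelemetry import trace

def log_with_trace_context(logger, level, message, fields=None):
    """Attach the active span's trace/span IDs to a structured log record."""
    fields = dict(fields or {})
    ctx = trace.get_current_span().get_span_context()
    if ctx.is_valid:
        fields["trace_id"] = format(ctx.trace_id, "032x")  # 128-bit id as hex
        fields["span_id"] = format(ctx.span_id, "016x")    # 64-bit id as hex
    logger.log(level, message, extra={"fields": fields})

logger = logging.getLogger("payment-service")
log_with_trace_context(logger, logging.ERROR, "Database timeout",
                       fields={"service.name": "payment-service",
                               "http.method": "POST",
                               "http.route": "/api/checkout"})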
Beyond the Three Pillars
Modern observability is expanding:
- Continuous Profiling: CPU/memory profiles to find performance bottlenecks
- Snapshots: Capture variable state at specific code lines (without debuggers)
- Chaos Engineering: Intentionally break things to test observability coverage
- AIOps: Machine learning to detect anomalies and predict incidents
But the three pillars remain fundamental; everything else builds on metrics, logs, and traces.
Example 1: E-commerce Checkout Investigation
Scenario: Your e-commerce site's checkout is slow. Customers complain it takes 10+ seconds.
Step 1 - Metrics: Check your latency dashboard
- http_request_duration_seconds{endpoint="/checkout"} p99: 12 seconds (was 800ms yesterday)
- http_requests_total{endpoint="/checkout", status="200"}: Still returning success
- Timeline: Slowdown started at 08:00 UTC
Step 2 - Logs: Search for checkout-related warnings/errors after 08:00
{
"timestamp": "2026-01-15T08:03:42Z",
"level": "WARN",
"message": "Inventory service response time: 9847ms",
"service": "checkout-service",
"trace_id": "xyz789"
}
Pattern: Inventory service calls are slow!
Step 3 - Traces: Pull up trace xyz789
Checkout Request: 12.3s total
├─ Validate Cart: 50ms ✅
├─ Check Inventory: 9.8s ⚠️ ← BOTTLENECK
│  └─ Database Query: 9.7s
│     Query: SELECT * FROM inventory WHERE sku IN (...500 items...)
├─ Process Payment: 400ms ✅
└─ Send Confirmation: 150ms ✅
Root Cause: Someone added 500 items to their cart. The inventory check queries 500 SKUs individually (N+1 query problem).
Fix: Batch inventory queries OR add a cart item limit.
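As a rough sketch of the batching fix; the db.execute helper and its parameter style are assumptions standing in for the real data-access layer.

MAX_CART_ITEMS = 200  # illustrative cart limit; reject larger carts up front

def check_inventory_n_plus_one(db, skus):
    # Before: one query per SKU, so 500 items means 500 round trips
    return {sku: db.execute("SELECT qty FROM inventory WHERE sku = %s", (sku,))
            for sku in skus}

def check_inventory_batched(db, skus):
    # After: a single query covers the whole cart
    placeholders = ", ".join(["%s"] * len(skus))
    rows = db.execute(
        f"SELECT sku, qty FROM inventory WHERE sku IN ({placeholders})",
        tuple(skus))
    return {sku: qty for sku, qty in rows}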
Example 2: Memory Leak Detection
Scenario: Your Node.js API starts fast but gets slower over hours, then crashes.
Step 1 - Metrics: Graph memory usage over time
- process_resident_memory_bytes: Steadily climbing from 200MB to 2GB over 6 hours
- nodejs_heap_size_total_bytes: Also climbing linearly
- nodejs_gc_duration_seconds: GC pauses getting longer (100ms → 5 seconds)
Diagnosis: Classic memory leak pattern; the heap grows until GC can't keep up.
Step 2 - Logs: Search for clues about what's growing
{
"timestamp": "2026-01-15T14:32:17Z",
"level": "INFO",
"message": "WebSocket connection established",
"connection_id": "conn_12345",
"total_connections": 8472
}
WebSocket connections keep growing; they're never closed!
Step 3 - Traces: Sample traces show WebSocket initialization spans but no cleanup spans.
Root Cause: WebSocket connections stored in a Map, but disconnect handler never fires due to a bug.
Fix: Add proper cleanup on disconnect + implement connection limit.
Example 3: Cascading Failure
Scenario: Recommendation service goes down, entire site becomes unresponsive.
Step 1 - Metrics: Multiple services showing problems
- recommendation_service: 100% error rate (down)
- product_page_service: Latency spiked to 30 seconds
- search_service: Latency spiked to 25 seconds
- api_gateway: Request queues backing up
Step 2 - Traces: Examine slow product page traces
Product Page Request: 30.2s
├─ Get Product Details: 50ms ✅
├─ Get Recommendations: 30.0s ❌ TIMEOUT
│  └─ (no response from recommendation-service)
└─ Render Page: (never reached)
The product page blocks waiting for recommendations!
Step 3 - Logs: Recommendation service logs
ERROR: Redis connection refused
Connection pool exhausted, all circuits open
Recommendation service's Redis cache died.
Root Cause: No timeout or fallback. When recommendations fail, product pages hang, exhausting connection pools everywhere.
Fix:
- Add 1-second timeout to recommendation calls (see the sketch after this list)
- Implement graceful degradation (show generic recommendations)
- Add circuit breaker pattern
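A minimal sketch of the first two fixes in Python, using the requests library. The service URL and fallback list are placeholders, and a real deployment would usually add a circuit breaker library around this call.

import requests

GENERIC_RECOMMENDATIONS = ["bestsellers", "staff-picks"]  # placeholder fallback

def get_recommendations(user_id):
    """Call the recommendation service with a hard 1-second timeout."""
    try:
        resp = requests.get(
            "http://recommendation-service/api/recommendations",  # placeholder URL
            params={"user_id": user_id},
            timeout=1.0,  # fail fast instead of hanging the product page
        )
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        # Graceful degradation: show generic recommendations rather than blocking
        return GENERIC_RECOMMENDATIONS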
Example 4: Silent Data Corruption
Scenario: Finance team reports yesterday's revenue number seems low.
Step 1 - Metrics: Check business metrics
- orders_completed_total: Normal volume (5,200 orders)
- revenue_total_usd: $62,400 (expected ~$180,000 based on order volume)
- Average order value: $12 (normally ~$35)
Step 2 - Logs: Search for payment-related anomalies
{
"timestamp": "2026-01-14T16:45:12Z",
"level": "WARN",
"message": "Currency conversion rate unavailable, defaulting to 1.0",
"from_currency": "EUR",
"to_currency": "USD"
}
Appears 3,100 times yesterday afternoon!
Step 3 - Traces: Examine order processing traces
Order Processing
├─ Calculate Total: 100ms
├─ Convert Currency (EUR→USD): 5ms
│  └─ Exchange rate: 1.0 ⚠️ (should be ~1.08)
├─ Charge Payment: 200ms
└─ Record Revenue: 50ms
   └─ amount_usd: $100 (should be ~$108)
Root Cause: Currency API quota exceeded. Fallback used 1:1 conversion rate, undercharging European customers by ~8%.
Fix:
- Increase API quota
- Cache exchange rates locally
- Alert when fallback conversion is used
- Never default to 1.0; fail the order if the rate is unavailable (sketched below)
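A sketch of that last rule in Python: cache rates locally and raise instead of silently defaulting. The fetch_fn API client is a placeholder for the real currency service.

import time

_rate_cache = {}          # (from_ccy, to_ccy) -> (rate, fetched_at)
CACHE_TTL_SECONDS = 3600  # illustrative: reuse rates for up to an hour

class ExchangeRateUnavailable(Exception):
    """Raised instead of falling back to a bogus 1.0 conversion."""

def get_exchange_rate(from_ccy, to_ccy, fetch_fn):
    """fetch_fn is a placeholder for the real currency API client."""
    key = (from_ccy, to_ccy)
    cached = _rate_cache.get(key)
    if cached and time.time() - cached[1] < CACHE_TTL_SECONDS:
        return cached[0]
    try:
        rate = fetch_fn(from_ccy, to_ccy)
    except Exception as exc:
        # A real implementation would also emit a metric/alert here (fix 3 above)
        if cached:
            return cached[0]  # a stale cached rate beats failing the order
        raise ExchangeRateUnavailable(f"{from_ccy}->{to_ccy}") from exc
    _rate_cache[key] = (rate, time.time())
    return rate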
💡 Lesson: Metrics caught what logs alone would miss. Order count looked fine, but the revenue metric revealed the issue.
Common Mistakes ⚠️
1. Logging Too Much (or Too Little)
❌ Too Much: Logging every function entry/exit at INFO level
- Costs: Storage explosion, slow searches, important logs buried in noise
- Fix: Use DEBUG level for verbose logging, keep it disabled in production
❌ Too Little: Only logging errors with generic messages
catch(error) {
logger.error("Something went wrong"); // Useless!
}
- Fix: Include context (what operation, what inputs, what state)
catch(error) {
logger.error("Failed to process payment", {
error: error.message,
order_id: order.id,
payment_method: payment.method,
trace_id: currentTraceId()
});
}
2. High-Cardinality Metrics
❌ Bad: Adding user IDs as metric labels
api_requests{user_id="user_12345", endpoint="/api/orders"}
api_requests{user_id="user_12346", endpoint="/api/orders"}
api_requests{user_id="user_12347", endpoint="/api/orders"}
With 1 million users, you create 1 million time series, and metric storage explodes!
✅ Good: Use bounded labels
api_requests{user_tier="free", endpoint="/api/orders"}
api_requests{user_tier="premium", endpoint="/api/orders"}
Only 2 time series (or however many tiers you have).
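In code, the fix is to map each user onto a small, fixed set of label values before recording the metric. A sketch with the Python prometheus_client; the tier lookup and user.plan attribute are hypothetical.

from prometheus_client import Counter

api_requests = Counter("api_requests_total", "API requests",
                       ["user_tier", "endpoint"])

TIERS = {"free", "premium"}  # a small, fixed label set

def tier_for(user):
    # Hypothetical lookup; the point is that it returns a bounded value
    return user.plan if user.plan in TIERS else "other"

def record_request(user, endpoint):
    api_requests.labels(user_tier=tier_for(user), endpoint=endpoint).inc()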
3. Not Propagating Trace Context
❌ Bad: Starting a new trace in each service
## Service A
trace_id = generate_new_trace_id() # abc123
response = requests.post("service-b.com/api", data=payload)
## Service B receives request
trace_id = generate_new_trace_id() # xyz789 (DIFFERENT!)
You lose the connection between services and can't see the full request path.
✅ Good: Propagate trace context in headers
## Service A
headers = {"traceparent": f"00-{trace_id}-{span_id}-01"}
response = requests.post("service-b.com/api", data=payload, headers=headers)
## Service B receives request
trace_id = extract_trace_id(request.headers["traceparent"]) # Same abc123!
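In practice you rarely build the traceparent header by hand; OpenTelemetry's propagation API handles it. Here is a hedged Python sketch where the downstream URL and the request-handling hooks are placeholders.

import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("service-a")

# Service A: let OTel write the traceparent header for the active span
def call_service_b(payload):
    headers = {}
    inject(headers)  # adds "traceparent" (and any baggage) to the dict
    return requests.post("http://service-b/api", json=payload, headers=headers)

# Service B: extract the incoming context and continue the same trace
def handle_request(request_headers, do_work):
    ctx = extract(request_headers)
    with tracer.start_as_current_span("handle_api_request", context=ctx):
        return do_work()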
4. Ignoring Sampling
❌ Bad: Tracing 100% of requests in high-volume production
- Cost: Massive data volumes, storage costs, network overhead
- Reality: You can't afford to trace 10,000 requests/second
✅ Good: Sample intelligently
- 1% of successful requests (statistical sample)
- 100% of errors (you need these for debugging)
- 100% of slow requests (above p95 latency)
5. Metrics Without Context
❌ Bad: Tracking a counter without useful dimensions
orders_total: 1,547,293
You know total orders, but can't answer: "How many orders per region?" "What's our mobile vs. desktop split?"
✅ Good: Add relevant dimensions
orders_total{region="us-east", platform="mobile", payment_method="credit_card"}: 834,192
orders_total{region="eu-west", platform="desktop", payment_method="paypal"}: 423,101
Now you can slice data meaningfully.
6. Logging Sensitive Data
❌ Bad: Logging user passwords, credit cards, API keys
logger.info("User logged in", {
email: user.email,
password: user.password // NEVER!
});
This violates privacy laws and creates security risks.
✅ Good: Redact sensitive fields
logger.info("User logged in", {
email: user.email,
user_id: user.id,
// password intentionally omitted
});
7. No Retention Policies
❌ Bad: Storing all logs/traces forever
- Costs spiral out of control
- Old data rarely accessed
- Compliance issues (GDPR requires data deletion)
✅ Good: Tiered retention
- Logs: 7 days hot (fast search), 30 days cold (archival), then delete
- Metrics: 15 days at 1s resolution, 90 days at 1m resolution, 1 year at 1h resolution
- Traces: 3 days sampled, keep only errors after that
🧠 Remember: "MiLT" - Metrics (aggregate), Logs (events), Traces (requests)
Key Takeaways 🎯
Quick Reference Card
| Pillar | What It Is | Best For | Example Tool |
|---|---|---|---|
| Metrics | Numerical measurements over time | Alerting, dashboards, trends | Prometheus, Datadog |
| Logs | Timestamped event records | Debugging, auditing, context | Elasticsearch, Splunk |
| Traces | Request paths through systems | Performance analysis, dependencies | Jaeger, Zipkin, Honeycomb |
🎯 The Investigation Pattern
- Metrics tell you WHAT is wrong and WHEN
- Logs tell you WHY and provide context
- Traces tell you WHERE in your system
✅ Best Practices Checklist
- ✅ Use structured logging (JSON format)
- ✅ Include trace IDs in logs for correlation
- ✅ Track the Four Golden Signals (latency, traffic, errors, saturation)
- ✅ Keep metric cardinality low (avoid user IDs as labels)
- ✅ Sample traces intelligently (not 100%)
- ✅ Set retention policies to control costs
- ✅ Never log passwords or sensitive data
- ✅ Use OpenTelemetry for vendor-neutral instrumentation
Key Numbers to Remember
- 4 Golden Signals every service needs
- 3 Pillars of observability
- 1-5% Typical trace sampling rate for high-volume systems
- p99 The percentile that matters most (captures tail latency)
Further Study
Ready to dive deeper? Check out these resources:
Google SRE Book - Monitoring Distributed Systems: https://sre.google/sre-book/monitoring-distributed-systems/ - The chapter that defined the Four Golden Signals and modern observability thinking.
OpenTelemetry Documentation: https://opentelemetry.io/docs/ - Official docs for the industry-standard observability framework. Includes getting-started guides for all major languages.
Charity Majors - Observability Engineering (O'Reilly): https://www.oreilly.com/library/view/observability-engineering/9781492076438/ - Comprehensive book covering observability principles, practices, and cultural aspects from one of the field's pioneers.
Congratulations! You now understand the three pillars that form the foundation of modern observability. Next in this path, you'll learn how to instrument applications to emit these signals effectively.