
The Three Pillars of Observability

This lesson covers metrics, logs, and traces: the three fundamental signal types that power modern observability systems. Understanding these pillars helps engineers diagnose production issues faster and build more reliable systems.

Welcome to Observability Fundamentals 🔍

When your application crashes at 3 AM, you need answers fast. Observability is the practice of understanding what's happening inside your systems by examining the signals they emit. Unlike traditional monitoring that tells you when something breaks, observability helps you understand why it broke and where to look.

The Three Pillars framework of metrics, logs, and traces has become the industry standard for structuring observability data. Each pillar provides a different lens for understanding system behavior:

  • Metrics give you the numerical pulse of your system 📊
  • Logs tell the story of what happened 📝
  • Traces map the journey of individual requests 🗺️

Think of it like investigating a car problem: metrics are your dashboard gauges (speed, fuel, temperature), logs are the maintenance records, and traces are the GPS history showing exactly where you drove.

💡 Key Insight: These pillars work best together. Metrics alert you to problems, logs provide context, and traces show you the exact path through your system.

Pillar 1: Metrics 📊

Metrics are numerical measurements collected at regular intervals over time. They're aggregated, time-series data that answer questions like "How many?" "How fast?" and "How much?"

What Makes Metrics Special

Metrics are incredibly efficient because they're just numbers. Instead of storing every detail about every request, you aggregate them into counts, rates, and statistical distributions.

Common metric types:

Type      | Description                            | Example
Counter   | Always increases (resets on restart)   | Total HTTP requests: 1,547,293
Gauge     | Point-in-time value that goes up/down  | Active connections: 42
Histogram | Distribution across buckets            | Response times: 95% under 200ms
Summary   | Pre-calculated quantiles               | p50: 150ms, p99: 800ms
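
If you use Prometheus, its official Python client (prometheus_client) maps directly onto these types. The sketch below is illustrative rather than a prescribed naming scheme, and it assumes the prometheus_client package is installed:

# Minimal sketch of the four metric types with the prometheus_client package.
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests")
ACTIVE_CONNECTIONS = Gauge("active_connections", "Connections currently open")
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")
PAYLOAD_SIZE = Summary("request_payload_bytes", "Size of request payloads")

def handle_request(payload: bytes) -> None:
    REQUESTS.inc()                          # counter: only ever goes up
    ACTIVE_CONNECTIONS.inc()                # gauge: up on connect...
    with LATENCY.time():                    # histogram: observes duration into buckets
        PAYLOAD_SIZE.observe(len(payload))  # summary: observes raw values (count and sum)
    ACTIVE_CONNECTIONS.dec()                # ...down on disconnect

if __name__ == "__main__":
    start_http_server(8000)                 # exposes /metrics for scraping
    handle_request(b"example")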

Why Metrics Matter

🎯 Fast alerting: You can evaluate "Is CPU > 80%?" instantly across thousands of servers

🎯 Long-term trends: Store data efficiently for months or years to spot patterns

🎯 Dashboards: Build real-time visualizations showing system health at a glance

The Four Golden Signals

Google's Site Reliability Engineering (SRE) book popularized four critical metrics every service should track:

┌─────────────────────────────────────────┐
│        THE FOUR GOLDEN SIGNALS          │
├─────────────────────────────────────────┤
│                                         │
│  1. LATENCY ⏱️                          │
│     How long requests take              │
│     → Track p50, p95, p99 percentiles   │
│                                         │
│  2. TRAFFIC 🚦                          │
│     How many requests you're serving    │
│     → Requests/sec, sessions/sec        │
│                                         │
│  3. ERRORS ❌                           │
│     Rate of failed requests             │
│     → 4xx, 5xx, exceptions              │
│                                         │
│  4. SATURATION 📈                       │
│     How "full" your service is          │
│     → CPU, memory, disk, queue depth    │
│                                         │
└─────────────────────────────────────────┘

💡 Pro Tip: Start with these four. You can always add more metrics later, but these give you 80% of what you need.

Metric Naming and Labels

Modern metric systems use labels (also called tags or dimensions) to add context:

http_requests_total{method="GET", status="200", endpoint="/api/users"}
http_requests_total{method="POST", status="500", endpoint="/api/orders"}

This lets you slice and dice data: "Show me all POST requests" or "What's the error rate for /api/orders?"

⚠️ Cardinality Warning: Each unique combination of labels creates a new time series. If you add a label like user_id with millions of values, you'll explode your metric storage!
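
As a concrete sketch of bounded labels with prometheus_client (names are illustrative), note that every label value comes from a small, fixed set:

# Labeled counter with a deliberately bounded label set (illustrative sketch).
from prometheus_client import Counter

HTTP_REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "status", "endpoint"],  # each label has only a handful of possible values
)

def record_request(method: str, status: int, endpoint: str) -> None:
    # Good: method/status/endpoint are bounded.
    # Bad: adding user_id here would create one time series per user.
    HTTP_REQUESTS.labels(method=method, status=str(status), endpoint=endpoint).inc()

record_request("GET", 200, "/api/users")
record_request("POST", 500, "/api/orders")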

Pillar 2: Logs 📝

Logs are timestamped text records describing discrete events. They're the oldest observability signal; you've probably written console.log() or print() statements to debug code.

Structured vs. Unstructured Logs

Unstructured logs (traditional):

2026-01-15 14:32:17 INFO User john@example.com logged in from 192.168.1.1
2026-01-15 14:32:18 ERROR Payment processing failed: insufficient funds

These are human-readable but hard to query programmatically.

Structured logs (modern best practice):

{
  "timestamp": "2026-01-15T14:32:17Z",
  "level": "INFO",
  "event": "user_login",
  "user_email": "john@example.com",
  "source_ip": "192.168.1.1"
}

Structured logs are machine-parseable, making them searchable and aggregatable.
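
As a minimal sketch of structured logging with only Python's standard library (field names are illustrative; libraries such as structlog or python-json-logger achieve the same result with less code):

# Minimal structured (JSON) logging sketch using only the standard library.
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Extra key/value context passed via `extra=` ends up on the record.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("user_login", extra={"context": {"user_email": "john@example.com", "source_ip": "192.168.1.1"}})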

Log Levels

Standard severity levels help filter noise:

Level | Use Case                      | Production Volume
DEBUG | Detailed diagnostic info      | Usually disabled (too verbose)
INFO  | Normal operations, milestones | Moderate
WARN  | Unexpected but handled        | Low
ERROR | Failures requiring attention  | Very low (should be rare!)
FATAL | Application crash             | Extremely rare
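
In Python's standard logging module the same hierarchy applies: anything below the configured level is simply dropped, which is how DEBUG noise stays out of production. A quick sketch:

# Severity filtering sketch: with the level set to INFO, DEBUG lines are dropped.
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("checkout")

log.debug("cart contents: %s", ["sku-1", "sku-2"])   # suppressed at INFO level
log.info("order created")                            # emitted
log.warning("retrying payment provider call")        # emitted
log.error("payment failed after 3 retries")          # emitted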

What to Log

✅ Do log:

  • Request starts and completions
  • State changes (order created, payment processed)
  • Error conditions with context
  • Security events (authentication, authorization)
  • External service calls

❌ Don't log:

  • Passwords, API keys, tokens
  • Credit card numbers, SSNs
  • Personal health information (PHI)
  • High-frequency events in tight loops (will crush storage)

The Cost of Logs

Logs are expensive! A busy application might generate gigabytes per hour. That's why you:

  1. Sample high-volume logs (keep 1%, discard 99%; see the sketch after this list)
  2. Set retention policies (keep 7 days hot, 30 days cold, then delete)
  3. Use appropriate levels (don't INFO log every function call)
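
A minimal sketch of point 1, sampling, as a logging filter that keeps roughly 1% of low-severity records while always passing warnings and errors (the ratio is illustrative):

# Keep roughly 1% of records from a high-volume logger (sketch).
import logging
import random

class SamplingFilter(logging.Filter):
    def __init__(self, keep_ratio: float = 0.01) -> None:
        super().__init__()
        self.keep_ratio = keep_ratio

    def filter(self, record: logging.LogRecord) -> bool:
        # Always keep warnings and errors; sample everything below that.
        if record.levelno >= logging.WARNING:
            return True
        return random.random() < self.keep_ratio

noisy_logger = logging.getLogger("request.access")
noisy_logger.addFilter(SamplingFilter(0.01))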

🧠 Memory Trick: "DIWEF" - Debug, Info, Warn, Error, Fatal (ascending severity)

Pillar 3: Traces 🗺️

Distributed traces track individual requests as they flow through multiple services. In microservices architectures, a single user action might touch 10+ services; traces show you the complete journey.

Anatomy of a Trace

A trace consists of spans, where each span represents a unit of work:

┌─────────────────────────────────────────────────────────────┐
│  DISTRIBUTED TRACE: Order Checkout Request                  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  🌐 API Gateway [span: root]                                │
│  |━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━| 1200ms              │
│      │                                                      │
│      ├─→ 🛒 Cart Service [span: get_cart]                   │
│      │   |━━━━━━━━━━━| 150ms                                │
│      │                                                      │
│      ├─→ 💳 Payment Service [span: process_payment]         │
│      │   |━━━━━━━━━━━━━━━━━━━━━━━━━━━| 800ms                │
│      │       │                                              │
│      │       └─→ 🏦 Bank API [span: charge_card]            │
│      │           |━━━━━━━━━━━━━━━━━| 650ms                  │
│      │                                                      │
│      └─→ 📧 Notification Service [span: send_email]         │
│          |━━━━━━━━| 100ms                                   │
│                                                             │
│  Trace ID: a1b2c3d4-e5f6-7890-abcd-ef1234567890             │
│  Total Duration: 1200ms                                     │
│  Span Count: 5                                              │
│  Services Touched: 4                                        │
└─────────────────────────────────────────────────────────────┘

Key Trace Concepts

Trace ID: Unique identifier for the entire request (follows it everywhere)

Span ID: Unique identifier for each operation within the trace

Parent Span ID: Links child spans to their parent (builds the tree structure)

Tags/Attributes: Key-value metadata (HTTP method, status code, database query)

Baggage: Data propagated to all child spans (user ID, tenant ID, feature flags)

Why Traces Are Powerful

Traces answer questions metrics and logs can't:

πŸ” "Why is this request slow?" β†’ See which service is the bottleneck

πŸ” "Where did this error originate?" β†’ Follow the span tree to the source

πŸ” "Do these services interact efficiently?" β†’ Visualize dependencies and latencies

Trace Sampling

You can't trace every request; it's too expensive. Common strategies:

Strategy   | Description                               | Use Case
Head-based | Decide at request start (1% sample rate)  | High-volume, mostly healthy traffic
Tail-based | Decide after completion (keep all errors) | Ensure you capture interesting requests
Adaptive   | Adjust rate based on volume/error rate    | Optimize cost vs. coverage dynamically
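
As a rough sketch of how the head-based and tail-based decisions differ (the sample rate and latency threshold here are illustrative):

# Illustrative head-based vs. tail-based sampling decisions.
import random

SAMPLE_RATE = 0.01          # head-based: 1% of traces, decided up front
SLOW_THRESHOLD_MS = 1000    # tail-based: always keep slow or failed requests

def head_based_sample() -> bool:
    # Decided when the request starts, before we know how it will go.
    return random.random() < SAMPLE_RATE

def tail_based_keep(duration_ms: float, had_error: bool) -> bool:
    # Decided after the request finishes, so errors and slow requests are never lost.
    return had_error or duration_ms > SLOW_THRESHOLD_MS or random.random() < SAMPLE_RATE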

OpenTelemetry Standard

OpenTelemetry (OTel) is the industry-standard framework for generating traces, metrics, and logs. It provides:

  • APIs for instrumenting code
  • SDKs for popular languages
  • Exporters to send data to various backends
  • Auto-instrumentation for common frameworks

💡 Best Practice: Use OTel from the start. It's vendor-neutral, so you can switch observability backends without rewriting instrumentation.
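
A minimal OpenTelemetry sketch in Python, assuming the opentelemetry-sdk package is installed: it creates a parent span and a child span that share one trace ID, and prints them with the console exporter.

# Parent and child spans with the OpenTelemetry Python SDK (console exporter for demo).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("checkout") as parent:           # root span
    parent.set_attribute("http.method", "POST")
    parent.set_attribute("http.route", "/api/checkout")
    with tracer.start_as_current_span("process_payment") as child:  # child span, same trace ID
        child.set_attribute("payment.provider", "example")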

Bringing the Pillars Together 🏛️

The Observability Workflow

Here's how the three pillars work together in a real incident:

┌──────────────────────────────────────────────────────────┐
│         INCIDENT INVESTIGATION FLOW                      │
└──────────────────────────────────────────────────────────┘

  ⚠️ ALERT FIRES
      │
      ▼
  📊 CHECK METRICS
  "Error rate spiked from 0.1% to 5% at 14:32"
      │
      ├─→ Which service?
      │   → payment-service
      │
      ├─→ Which endpoint?
      │   → POST /api/checkout
      │
      ▼
  📝 SEARCH LOGS
  "Show ERROR logs from payment-service after 14:32"
      │
      ├─→ Pattern found:
      │   "Database connection timeout"
      │   (happened 147 times)
      │
      ▼
  🗺️ EXAMINE TRACES
  "Show slow traces touching payment-service DB"
      │
      ├─→ Span analysis:
      │   database_query span: 30 seconds (normally 50ms)
      │   query: SELECT * FROM transactions WHERE...
      │
      ▼
  🎯 ROOT CAUSE IDENTIFIED
  "Missing index on transactions.user_id column"
  "Recent data growth made full table scan too slow"
      │
      ▼
  ✅ FIX APPLIED
  "Added index, query now 45ms, error rate back to 0.1%"

Correlation is Key

The magic happens when you correlate signals:

  • Logs include trace IDs → click from log to full trace
  • Traces include metric tags → see metrics for specific service/endpoint
  • Metrics link to example traces → jump from dashboard to specific requests

Example correlation in practice:

{
  "timestamp": "2026-01-15T14:32:17Z",
  "level": "ERROR",
  "message": "Database timeout",
  "trace_id": "a1b2c3d4-e5f6-7890",
  "span_id": "1234567890abcdef",
  "service.name": "payment-service",
  "http.method": "POST",
  "http.route": "/api/checkout"
}

With this structured log, you can:

  1. Filter logs by service
  2. Jump to the full trace (via trace_id)
  3. Aggregate error counts into metrics (by service.name and http.route)
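
To get the trace_id into the log in the first place, you can read it from the current span. A hedged sketch using the OpenTelemetry Python API (it assumes a JSON handler like the one from the logging section is already configured):

# Attach the current trace and span IDs to a structured log line (sketch).
import logging
from opentelemetry import trace

logger = logging.getLogger("payment-service")

def log_error_with_trace(message: str, fields: dict) -> None:
    ctx = trace.get_current_span().get_span_context()
    if ctx.is_valid:
        fields = dict(fields,
                      trace_id=trace.format_trace_id(ctx.trace_id),
                      span_id=trace.format_span_id(ctx.span_id))
    logger.error(message, extra={"context": fields})

log_error_with_trace("Database timeout",
                     {"service.name": "payment-service", "http.route": "/api/checkout"})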

Beyond the Three Pillars

Modern observability is expanding:

🔥 Continuous Profiling: CPU/memory profiles to find performance bottlenecks

📸 Snapshots: Capture variable state at specific code lines (without debuggers)

🧪 Chaos Engineering: Intentionally break things to test observability coverage

🤖 AIOps: Machine learning to detect anomalies and predict incidents

But the three pillars remain fundamental; everything else builds on metrics, logs, and traces.

Example 1: E-commerce Checkout Investigation 🛒

Scenario: Your e-commerce site's checkout is slow. Customers complain it takes 10+ seconds.

Step 1 - Metrics: Check your latency dashboard

  • http_request_duration_seconds{endpoint="/checkout"} p99: 12 seconds (was 800ms yesterday)
  • http_requests_total{endpoint="/checkout", status="200"}: Still returning success
  • Timeline: Slowdown started at 08:00 UTC

Step 2 - Logs: Search for checkout-related warnings/errors after 08:00

{
  "timestamp": "2026-01-15T08:03:42Z",
  "level": "WARN",
  "message": "Inventory service response time: 9847ms",
  "service": "checkout-service",
  "trace_id": "xyz789"
}

Pattern: Inventory service calls are slow!

Step 3 - Traces: Pull up trace xyz789

Checkout Request: 12.3s total
├─ Validate Cart: 50ms ✅
├─ Check Inventory: 9.8s ⚠️  ← BOTTLENECK
│  └─ Database Queries: 500 × ~19ms each
│     Query: SELECT * FROM inventory WHERE sku = ? (one query per cart item)
├─ Process Payment: 400ms ✅
└─ Send Confirmation: 150ms ✅

Root Cause: Someone added 500 items to their cart. The inventory check queries 500 SKUs individually (N+1 query problem).

Fix: Batch inventory queries OR add a cart item limit.
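
As a sketch of the difference, using an in-memory SQLite table as a stand-in for the inventory database (table and column names are made up for illustration):

# N+1 lookups vs. one batched query against a stand-in SQLite inventory table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (sku TEXT PRIMARY KEY, quantity INTEGER)")
conn.executemany("INSERT INTO inventory VALUES (?, ?)",
                 [(f"sku-{i}", 10) for i in range(500)])

cart_skus = [f"sku-{i}" for i in range(500)]

# N+1 pattern: one round trip per cart item.
stock_n_plus_one = {
    sku: conn.execute("SELECT quantity FROM inventory WHERE sku = ?", (sku,)).fetchone()[0]
    for sku in cart_skus
}

# Batched pattern: a single query covers the whole cart.
placeholders = ",".join("?" * len(cart_skus))
rows = conn.execute(
    f"SELECT sku, quantity FROM inventory WHERE sku IN ({placeholders})", cart_skus
).fetchall()
stock_batched = dict(rows)

assert stock_n_plus_one == stock_batched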

Example 2: Memory Leak Detection 💧

Scenario: Your Node.js API starts fast but gets slower over hours, then crashes.

Step 1 - Metrics: Graph memory usage over time

  • process_resident_memory_bytes: Steadily climbing from 200MB to 2GB over 6 hours
  • nodejs_heap_size_total_bytes: Also climbing linearly
  • nodejs_gc_duration_seconds: GC pauses getting longer (100ms → 5 seconds)

Diagnosis: Classic memory leak pattern; heap grows until GC can't keep up.

Step 2 - Logs: Search for clues about what's growing

{
  "timestamp": "2026-01-15T14:32:17Z",
  "level": "INFO",
  "message": "WebSocket connection established",
  "connection_id": "conn_12345",
  "total_connections": 8472
}

WebSocket connections keep growing; they're never closed!

Step 3 - Traces: Sample traces show WebSocket initialization spans but no cleanup spans.

Root Cause: WebSocket connections stored in a Map, but disconnect handler never fires due to a bug.

Fix: Add proper cleanup on disconnect + implement connection limit.

Example 3: Cascading Failure 🌊

Scenario: Recommendation service goes down, entire site becomes unresponsive.

Step 1 - Metrics: Multiple services showing problems

  • recommendation_service: 100% error rate (down)
  • product_page_service: Latency spiked to 30 seconds
  • search_service: Latency spiked to 25 seconds
  • api_gateway: Request queues backing up

Step 2 - Traces: Examine slow product page traces

Product Page Request: 30.2s
├─ Get Product Details: 50ms ✅
├─ Get Recommendations: 30.0s ❌ TIMEOUT
│  └─ (no response from recommendation-service)
└─ Render Page: (never reached)

The product page blocks waiting for recommendations!

Step 3 - Logs: Recommendation service logs

ERROR: Redis connection refused
Connection pool exhausted, all circuits open

Recommendation service's Redis cache died.

Root Cause: No timeout or fallback. When recommendations fail, product pages hang, exhausting connection pools everywhere.

Fix:

  1. Add 1-second timeout to recommendation calls (see the sketch after this list)
  2. Implement graceful degradation (show generic recommendations)
  3. Add circuit breaker pattern
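
Fixes 1 and 2 can be as small as a timeout plus a fallback around the recommendation call. Here's a sketch using the requests library; the URL and fallback list are placeholders:

# Timeout + graceful degradation around the recommendations call (sketch).
import requests

FALLBACK_RECOMMENDATIONS = [{"sku": "bestseller-1"}, {"sku": "bestseller-2"}]  # placeholder

def get_recommendations(user_id: str) -> list:
    try:
        # 1-second timeout: a slow dependency can no longer stall the whole page.
        resp = requests.get(
            "http://recommendation-service.internal/api/recommendations",  # placeholder URL
            params={"user_id": user_id},
            timeout=1.0,
        )
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        # Graceful degradation: show generic recommendations instead of failing the page.
        return FALLBACK_RECOMMENDATIONS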

Example 4: Silent Data Corruption 🐛

Scenario: Finance team reports yesterday's revenue number seems low.

Step 1 - Metrics: Check business metrics

  • orders_completed_total: Normal volume (5,200 orders)
  • revenue_total_usd: $172,000 (expected ~$180,000 based on order volume)
  • Average order value: $33 (normally ~$35)

Step 2 - Logs: Search for payment-related anomalies

{
  "timestamp": "2026-01-14T16:45:12Z",
  "level": "WARN",
  "message": "Currency conversion rate unavailable, defaulting to 1.0",
  "from_currency": "EUR",
  "to_currency": "USD"
}

Appears 3,100 times yesterday afternoon!

Step 3 - Traces: Examine order processing traces

Order Processing
├─ Calculate Total: 100ms
├─ Convert Currency (EUR→USD): 5ms
│  └─ Exchange rate: 1.0 ⚠️  (should be ~1.08)
├─ Charge Payment: 200ms
└─ Record Revenue: 50ms
   └─ amount_usd: $100 (should be ~$108)

Root Cause: Currency API quota exceeded. Fallback used 1:1 conversion rate, undercharging European customers by ~8%.

Fix:

  1. Increase API quota
  2. Cache exchange rates locally
  3. Alert when fallback conversion is used
  4. Never default to 1.0β€”fail the order if rate unavailable

💡 Lesson: Metrics caught what logs alone would miss. Order count looked fine, but the revenue metric revealed the issue.

Common Mistakes ⚠️

1. Logging Too Much (or Too Little)

❌ Too Much: Logging every function entry/exit at INFO level

  • Costs: Storage explosion, slow searches, important logs buried in noise
  • Fix: Use DEBUG level for verbose logging, keep it disabled in production

❌ Too Little: Only logging errors with generic messages

catch(error) {
  logger.error("Something went wrong");  // Useless!
}
  • Fix: Include context (what operation, what inputs, what state)
catch(error) {
  logger.error("Failed to process payment", {
    error: error.message,
    order_id: order.id,
    payment_method: payment.method,
    trace_id: currentTraceId()
  });
}

2. High-Cardinality Metrics

❌ Bad: Adding user IDs as metric labels

api_requests{user_id="user_12345", endpoint="/api/orders"}
api_requests{user_id="user_12346", endpoint="/api/orders"}
api_requests{user_id="user_12347", endpoint="/api/orders"}

With 1 million users, you create 1 million time series, and metric storage explodes!

✅ Good: Use bounded labels

api_requests{user_tier="free", endpoint="/api/orders"}
api_requests{user_tier="premium", endpoint="/api/orders"}

Only 2 time series (or however many tiers you have).

3. Not Propagating Trace Context

❌ Bad: Starting a new trace in each service

## Service A
trace_id = generate_new_trace_id()  # abc123
response = requests.post("service-b.com/api", data=payload)

## Service B receives request
trace_id = generate_new_trace_id()  # xyz789 (DIFFERENT!)

You lose the connection between services, so you can't see the full request path.

✅ Good: Propagate trace context in headers

## Service A
headers = {"traceparent": f"00-{trace_id}-{span_id}-01"}
response = requests.post("service-b.com/api", data=payload, headers=headers)

## Service B receives request
trace_id = extract_trace_id(request.headers["traceparent"])  # Same abc123!

4. Ignoring Sampling

❌ Bad: Tracing 100% of requests in high-volume production

  • Cost: Massive data volumes, storage costs, network overhead
  • Reality: You can't afford to trace 10,000 requests/second

✅ Good: Sample intelligently

  • 1% of successful requests (statistical sample)
  • 100% of errors (you need these for debugging)
  • 100% of slow requests (above p95 latency)

5. Metrics Without Context

❌ Bad: Tracking a counter without useful dimensions

orders_total: 1,547,293

You know total orders, but can't answer: "How many orders per region?" "What's our mobile vs. desktop split?"

✅ Good: Add relevant dimensions

orders_total{region="us-east", platform="mobile", payment_method="credit_card"}: 834,192
orders_total{region="eu-west", platform="desktop", payment_method="paypal"}: 423,101

Now you can slice data meaningfully.

6. Logging Sensitive Data

❌ Bad: Logging user passwords, credit cards, API keys

logger.info("User logged in", {
  email: user.email,
  password: user.password  // NEVER!
});

This violates privacy laws and creates security risks.

✅ Good: Redact sensitive fields

logger.info("User logged in", {
  email: user.email,
  user_id: user.id,
  // password intentionally omitted
});

7. No Retention Policies

❌ Bad: Storing all logs/traces forever

  • Costs spiral out of control
  • Old data rarely accessed
  • Compliance issues (GDPR requires data deletion)

✅ Good: Tiered retention

  • Logs: 7 days hot (fast search), 30 days cold (archival), then delete
  • Metrics: 15 days at 1s resolution, 90 days at 1m resolution, 1 year at 1h resolution
  • Traces: 3 days sampled, keep only errors after that

🧠 Remember: "MiLT" - Metrics (aggregate), Logs (events), Traces (requests)

Key Takeaways 🎯

📋 Quick Reference Card

Pillar     | What It Is                       | Best For                            | Example Tool
📊 Metrics | Numerical measurements over time | Alerting, dashboards, trends        | Prometheus, Datadog
📝 Logs    | Timestamped event records        | Debugging, auditing, context        | Elasticsearch, Splunk
🗺️ Traces  | Request paths through systems    | Performance analysis, dependencies  | Jaeger, Zipkin, Honeycomb

🎯 The Investigation Pattern

  1. Metrics tell you WHAT is wrong and WHEN
  2. Logs tell you WHY and provide context
  3. Traces tell you WHERE in your system

✅ Best Practices Checklist

  • ✓ Use structured logging (JSON format)
  • ✓ Include trace IDs in logs for correlation
  • ✓ Track the Four Golden Signals (latency, traffic, errors, saturation)
  • ✓ Keep metric cardinality low (avoid user IDs as labels)
  • ✓ Sample traces intelligently (not 100%)
  • ✓ Set retention policies to control costs
  • ✓ Never log passwords or sensitive data
  • ✓ Use OpenTelemetry for vendor-neutral instrumentation

🔢 Key Numbers to Remember

  • 4 Golden Signals every service needs
  • 3 Pillars of observability
  • 1-5% Typical trace sampling rate for high-volume systems
  • p99 The percentile that matters most (captures tail latency)

📚 Further Study

Ready to dive deeper? Check out these resources:

  1. Google SRE Book - Monitoring Distributed Systems: https://sre.google/sre-book/monitoring-distributed-systems/ - The chapter that defined the Four Golden Signals and modern observability thinking.

  2. OpenTelemetry Documentation: https://opentelemetry.io/docs/ - Official docs for the industry-standard observability framework. Includes getting-started guides for all major languages.

  3. Charity Majors - Observability Engineering (O'Reilly): https://www.oreilly.com/library/view/observability-engineering/9781492076438/ - Comprehensive book covering observability principles, practices, and cultural aspects from one of the field's pioneers.

Congratulations! You now understand the three pillars that form the foundation of modern observability. Next in this path, you'll learn how to instrument applications to emit these signals effectively. 🚀