Observability Foundations
Master the conceptual shift from monitoring to true observability and understand why traditional approaches fail during critical incidents
Master production observability with free flashcards and spaced repetition practice. This lesson covers core observability pillars, signal types, telemetry collection, and the shift from traditional monitoring to modern observability: essential concepts for DevOps engineers, SREs, and platform teams building reliable distributed systems.
Welcome to Observability
In today's world of microservices, containers, and cloud-native architectures, understanding what's happening inside your production systems is more critical, and more challenging, than ever. Observability is the practice of understanding the internal state of a system by examining its external outputs. Unlike traditional monitoring, which asks "is this specific thing broken?", observability asks "what is broken, and why?"
Think of observability as the difference between checking whether your car's engine light is on (monitoring) versus having a diagnostic tool that shows you exactly which cylinder is misfiring, why, and what cascading effects it's causing (observability).
What You'll Learn
- The Three Pillars: Metrics, logs, and traces (and why you need all three)
- Signal Types: Understanding telemetry data and its characteristics
- Cardinality: The hidden challenge that breaks observability systems
- Correlation: Connecting signals across systems to find root causes
- Instrumentation: How to capture meaningful data from your applications
Core Concepts
The Three Pillars of Observability
Modern observability rests on three foundational signal types, often called the "three pillars":
| Pillar | What It Captures | Best For | Example |
|---|---|---|---|
| Metrics | Numerical measurements over time | Trends, aggregations, alerts | CPU usage, request rate, error count |
| Logs | Discrete event records | Debugging specific events | "User 123 failed login at 10:34" |
| Traces | Request paths through distributed systems | Understanding latency and dependencies | API call → database → cache → response |
Why you need all three: Each pillar answers different questions. Metrics tell you that something is wrong, logs tell you what happened, and traces tell you where in your system the problem occurred.
Tip: Think of them as complementary tools in a diagnostic toolkit. You wouldn't try to fix a car with only a wrench; you need multiple tools for different jobs.
Understanding Metrics
Metrics are aggregated numerical data points collected at regular intervals. They're efficient to store and query, making them perfect for:
- Dashboards showing system health at a glance
- Alerts triggering when thresholds are crossed
- Trend analysis over days, weeks, or months
Metric types:
- Counter: only increases (resets to 0 on restart). Example: total_requests = 1,245,892
- Gauge: can go up or down; a snapshot of current state. Example: active_connections = 42
- Histogram: distribution of values across buckets. Example: request_duration_ms with buckets <100ms, <200ms, <500ms, 1s+
- Summary: like a histogram plus precomputed percentiles. Example: p50=120ms, p95=450ms, p99=2s
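To make the four metric types concrete, here is a minimal sketch using the Python prometheus_client library; the metric names and recorded values are illustrative, not part of any particular system:

```python
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

# Counter: only increases (resets to 0 on process restart)
requests_total = Counter('app_requests_total', 'Total requests handled')

# Gauge: can go up or down (current state snapshot)
active_connections = Gauge('app_active_connections', 'Currently open connections')

# Histogram: distribution of observations across configurable buckets
request_duration = Histogram(
    'app_request_duration_seconds', 'Request duration in seconds',
    buckets=(0.1, 0.2, 0.5, 1.0),
)

# Summary: tracks the count and sum of observations
payload_size = Summary('app_payload_size_bytes', 'Size of request payloads')

start_http_server(8000)         # expose /metrics for scraping
requests_total.inc()            # count one request
active_connections.set(42)      # record the current value
request_duration.observe(0.12)  # a 120 ms request
payload_size.observe(2048)
```

Note that the Python client's Summary records only count and sum; percentiles such as p95 are usually computed at query time from histogram buckets.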
Metric dimensions (also called labels or tags) add context:
http_requests_total{method="GET", endpoint="/api/users", status="200"} = 15420
http_requests_total{method="POST", endpoint="/api/users", status="201"} = 892
http_requests_total{method="GET", endpoint="/api/users", status="404"} = 23
Cardinality warning: Each unique combination of label values creates a new time series. With 10 endpoints, 5 methods, and 10 status codes, you have 500 time series. Add a user_id label with 100,000 users, and suddenly you have 50,000,000 time series, enough to overwhelm most metric systems!
Understanding Logs
Logs are timestamped text records of discrete events. They're the oldest form of telemetry and remain invaluable for:
- Detailed debugging of specific failures
- Audit trails and compliance
- Understanding exact sequences of events
| Log Level | Purpose | Example |
|---|---|---|
| TRACE | Very detailed, for development | "Entering method processPayment()" |
| DEBUG | Detailed information for debugging | "SQL query took 45ms" |
| INFO | Normal operational events | "User logged in successfully" |
| WARN | Potentially problematic situations | "Retry attempt 3 of 5" |
| ERROR | Error events that allow continuation | "Failed to send email notification" |
| FATAL | Severe errors causing shutdown | "Cannot connect to database, exiting" |
Structured logging transforms logs from raw text into queryable data:
Unstructured (harder to query):
User john@example.com failed login from 192.168.1.100 at 2026-01-15 10:34:22
Structured (easily queryable):
{
"timestamp": "2026-01-15T10:34:22Z",
"level": "WARN",
"event": "login_failed",
"user_email": "john@example.com",
"source_ip": "192.168.1.100",
"reason": "invalid_password",
"attempt_count": 3
}
Best practice: Always include correlation IDs (request IDs, trace IDs) in your logs so you can connect related events across multiple services.
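Here is a minimal sketch of emitting structured, correlation-ID-aware logs with Python's standard library; the field names mirror the JSON example above, and a ready-made JSON formatter package would work just as well:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Merge any structured fields passed via extra={"fields": {...}}
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("auth")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.warning(
    "login_failed",
    extra={"fields": {
        "user_email": "john@example.com",
        "source_ip": "192.168.1.100",
        "trace_id": "xyz789",   # correlation ID ties this log line to a trace
        "attempt_count": 3,
    }},
)
```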
Understanding Traces
Distributed tracing tracks requests as they flow through multiple services in a system. Each request becomes a trace, and each operation within that trace is a span.
Distributed trace anatomy for Trace ID abc123xyz (representing the entire request journey):
- Span: API Gateway (125ms total)
  - Span: User Service (45ms)
    - Span: Database Query (15ms)
  - Span: Order Service (80ms)
    - Span: Cache (5ms)
    - Span: Payment API (60ms)

Each span contains an operation name, start time and duration, a parent span ID (which forms the tree), tags/attributes (metadata), and logs/events recorded within the span.
Why traces matter: Without tracing, debugging a slow request in a 20-service architecture means checking logs in 20 different places. With tracing, you immediately see that 80% of the time was spent in the Payment API's database call.
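A minimal sketch of creating nested spans with the OpenTelemetry Python SDK, printing finished spans to stdout; the service, span, and attribute names here are illustrative:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Minimal setup: send finished spans to the console
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

def handle_checkout(order_id: str) -> None:
    # Parent span for the whole operation
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("order.id", order_id)
        # Child span: the nesting forms the parent/child tree shown above
        with tracer.start_as_current_span("db.query") as db_span:
            db_span.set_attribute("db.statement", "SELECT ...")

handle_checkout("ORD-78945")
```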
Sampling: Recording every single trace creates massive data volumes, so most systems use one of two sampling strategies (a configuration sketch follows this list):
- Head-based sampling: Decide to record immediately (e.g., 1% of all requests)
- Tail-based sampling: Decide after seeing the complete trace (e.g., keep all errors and slow requests)
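For head-based sampling, the OpenTelemetry Python SDK ships ratio-based samplers; a minimal configuration sketch using the 1% example rate from above:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly 1% of traces; child spans follow their parent's decision
sampler = ParentBased(root=TraceIdRatioBased(0.01))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```

Tail-based sampling, by contrast, is typically implemented in a collector tier rather than in the application SDK, because the keep/drop decision needs the completed trace.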
Cardinality: The Hidden Challenge
Cardinality refers to the number of unique values in a dataset. In observability, high cardinality destroys query performance and drives up costs.
| Dimension Type | Cardinality | Impact |
|---|---|---|
| Low cardinality (e.g., environment, region, service) | 3-100 values | Efficient, inexpensive |
| Medium cardinality (e.g., endpoint, host, container) | 100-1,000 values | Manageable with care |
| High cardinality (e.g., user_id, trace_id, session_id) | Thousands to millions | Expensive, slow queries |
The cardinality explosion:
Label combination explosion:
- 3 services × 20 endpoints × 8 status codes = 480 time series (manageable)
- Add user_id with 50,000 users: 480 × 50,000 = 24,000,000 time series (disaster!)
- Storage and query costs increase linearly with the number of unique time series
Solution: Use high-cardinality dimensions in traces and logs (which are designed for them), not in metrics. If you need to filter metrics by user, aggregate first or use exemplars (sample traces linked from metrics).
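If your stack supports exemplars (for example, Prometheus scraping the OpenMetrics exposition format), linking a metric observation to a sample trace can look roughly like the sketch below; treat the exact keyword argument as an assumption to verify against your client library version:

```python
from prometheus_client import Histogram

request_duration = Histogram(
    'http_request_duration_seconds', 'Request duration', ['endpoint'])

def record_request(endpoint: str, seconds: float, trace_id: str) -> None:
    # Attach the trace ID as an exemplar rather than a label, so metric
    # cardinality stays low but a sample trace is still one click away.
    request_duration.labels(endpoint=endpoint).observe(
        seconds, exemplar={'trace_id': trace_id})
```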
Signal Correlation
Correlation is connecting different signals to understand the complete picture. This is where observability becomes more than the sum of its parts.
Correlation flow: debugging a slow API
1. A metric shows a spike: "p95 latency jumped from 200ms to 5s" on http_request_duration{endpoint="/checkout"}.
2. Use that time range to query traces: find slow traces in the window with trace_duration > 4s AND endpoint="/checkout".
3. Examine the trace spans: the Payment Service → Database span took 4.8s, which identifies the slow component.
4. Take that span's trace_id and search the logs: trace_id="xyz789" shows "ERROR: Connection pool exhausted, waited 4.7s".
5. Root cause found: the database connection pool is too small.
Correlation IDs make this possible:
{
"trace_id": "xyz789",
"span_id": "abc123",
"request_id": "req-456",
"user_id": "u-789",
"session_id": "sess-012"
}
Include these in all three pillars:
- Metrics: Use exemplars to link to sample traces
- Logs: Always log trace_id and span_id (see the sketch after this list)
- Traces: Include request_id and session context
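A small sketch of pulling the active trace and span IDs from OpenTelemetry so they can be attached to every log line; the helper name is our own:

```python
from opentelemetry import trace

def current_trace_fields() -> dict:
    """Return trace/span IDs for the active span, formatted for logs."""
    ctx = trace.get_current_span().get_span_context()
    if not ctx.is_valid:
        return {}
    return {
        "trace_id": format(ctx.trace_id, "032x"),  # 128-bit ID as hex
        "span_id": format(ctx.span_id, "016x"),    # 64-bit ID as hex
    }

# e.g. logger.info("payment_processed", extra={"fields": current_trace_fields()})
```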
Instrumentation Approaches
Instrumentation is the process of adding observability code to your applications. There are three main approaches:
| Approach | How It Works | Pros | Cons |
|---|---|---|---|
| Manual | Write telemetry code yourself | Full control, custom business metrics | Time-consuming, inconsistent |
| Auto-instrumentation | Agent/library instruments automatically | Zero code changes, quick setup | Generic, limited customization |
| Hybrid | Auto + manual for critical paths | Balance of ease and customization | Requires both skill sets |
Instrumentation levels:
- Business metrics (manual instrumentation): custom, domain-specific events such as "checkout_completed" and "cart_abandoned"
- Application metrics (framework instrumentation): HTTP requests, DB queries, and other language/framework-level signals
- System metrics (agent instrumentation): CPU, memory, disk, and network at the infrastructure layer
Best practice: Start with auto-instrumentation for the foundation, then add manual instrumentation for business-critical flows and domain-specific metrics.
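As a sketch of that hybrid approach, assuming a Flask app and the OpenTelemetry Flask instrumentation package; the business metric name and route are illustrative:

```python
from flask import Flask
from opentelemetry import metrics, trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)

# Auto-instrumentation: HTTP server spans with no code changes
FlaskInstrumentor().instrument_app(app)

# Manual instrumentation: a domain-specific business metric
meter = metrics.get_meter("shop")
checkouts_completed = meter.create_counter(
    "checkout_completed", description="Successfully completed checkouts")

tracer = trace.get_tracer("shop")

@app.route("/checkout", methods=["POST"])
def checkout():
    with tracer.start_as_current_span("process_checkout"):
        # ... business logic ...
        checkouts_completed.add(1, {"plan": "premium"})  # low-cardinality attribute
    return "ok", 200
```

This assumes a tracer and meter provider are configured elsewhere (see the pipeline sketch later in this lesson); without one, the calls are harmless no-ops.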
From Monitoring to Observability
The shift from monitoring to observability reflects the evolution from simple to complex systems:
| Aspect | Traditional Monitoring | Modern Observability |
|---|---|---|
| Question | "Is X broken?" | "What is broken and why?" |
| Approach | Predefined dashboards & alerts | Exploratory investigation |
| Known vs Unknown | Known problems only | Unknown problems discoverable |
| Data | Metrics and some logs | Metrics, logs, traces correlated |
| Query pattern | "Show me dashboard #5" | "Show me all requests where X..." |
| System complexity | Monoliths, simple architectures | Microservices, distributed systems |
Did you know? The term "observability" comes from control theory, where a system is observable if you can determine its internal state by examining its outputs. The concept was borrowed by software engineering around 2017-2018, as microservices made traditional monitoring insufficient.
Telemetry Pipeline
Understanding how signals flow from your application to your observability platform:
1. Application: an instrumentation library (e.g., the OpenTelemetry SDK) generates spans, metrics, and logs.
2. Local agent/collector: buffers, batches, and samples the data, and adds metadata (host, cluster, etc.).
3. Telemetry backend: metrics go to a time-series database (e.g., Prometheus), logs to a log aggregator (e.g., Elasticsearch), and traces to a trace store (e.g., Jaeger, Tempo).
4. Analysis & visualization: dashboards, alerts, and queries (e.g., Grafana).
Key pipeline concepts:
- Buffering: Temporarily store data to handle bursts
- Batching: Send multiple signals together for efficiency
- Sampling: Reduce volume while maintaining statistical validity
- Enrichment: Add context (environment, version, region)
- Routing: Send different signals to different backends
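To ground the batching, buffering, and enrichment steps above, here is a minimal sketch of configuring the OpenTelemetry Python SDK to batch spans and ship them to a local collector over OTLP; the endpoint and service name are assumptions for illustration:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Enrichment: attach service-level metadata to every span
resource = Resource.create({"service.name": "checkout-service",
                            "deployment.environment": "production"})

provider = TracerProvider(resource=resource)

# Batching + buffering: spans are queued in memory and exported in batches
exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
provider.add_span_processor(BatchSpanProcessor(exporter))

trace.set_tracer_provider(provider)
```

Metrics and logs follow the same pattern with their own exporters and processors.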
Real-World Examples
Example 1: Debugging a Latency Spike
Scenario: Your e-commerce site's checkout endpoint suddenly becomes slow. Users are complaining. How do observability pillars work together?
Step 1 - Metrics detect the problem:
Metric: http_request_duration_p95{endpoint="/checkout"}
Baseline: 250ms
Current: 4.2s (a 16x increase!)
Alert triggered at 10:34 AM
Step 2 - Traces identify the component:
Query for slow traces: trace_duration > 3s AND endpoint="/checkout"
Trace ID: abc-123-xyz (total: 4.1s)
- API Gateway: 50ms (1%)
- Checkout Service: 100ms (2%)
  - Validate Cart: 30ms
  - Calculate Tax: 70ms
- Payment Service: 3,900ms (95%)
  - Create Transaction: 100ms
  - Database Query: 3,800ms
- Notification Service: 50ms (1%)

The Payment Service database query is the bottleneck!
Step 3 - Logs reveal the root cause:
Query logs with trace_id="abc-123-xyz" and service="payment":
{
"timestamp": "2026-01-15T10:34:18Z",
"level": "ERROR",
"trace_id": "abc-123-xyz",
"span_id": "span-payment-db",
"message": "Database connection pool exhausted",
"details": {
"pool_size": 10,
"active_connections": 10,
"wait_time_ms": 3780,
"query": "SELECT * FROM transactions WHERE..."
}
}
Root cause: Connection pool is too small for current load. Fix: Increase pool size from 10 to 50 connections.
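A sketch of what that fix might look like, assuming the service uses SQLAlchemy for its database connections; the connection URL is a placeholder and the pool numbers come from this example:

```python
from sqlalchemy import create_engine

# Before: a pool_size of 10 was exhausted under load
engine = create_engine(
    "postgresql://payments:***@db:5432/payments",
    pool_size=50,      # raised from 10 to match current concurrency
    max_overflow=10,   # temporary extra connections during bursts
    pool_timeout=5,    # fail fast instead of queueing for seconds
)
```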
Key insight: Each pillar played a role. Metrics detected that something was wrong, traces identified where, and logs explained why.
Example 2: Cardinality Disaster
Scenario: A well-meaning engineer adds user_id as a metric label to track per-user request rates.
The code:
from flask import Flask, request
from flask_login import current_user  # assumed source of current_user
from prometheus_client import Counter

app = Flask(__name__)

request_counter = Counter(
    'api_requests_total',
    'Total API requests',
    ['endpoint', 'method', 'status', 'user_id']  # DANGER: high-cardinality label!
)

@app.route('/api/<endpoint>')
def handle_request(endpoint):
    request_counter.labels(
        endpoint=endpoint,
        method=request.method,
        status=200,
        user_id=current_user.id  # creates a unique time series per user!
    ).inc()
    return 'ok'
The math:
- 50 endpoints × 5 methods × 10 status codes = 2,500 base time series (fine)
- Add 100,000 active users: 2,500 × 100,000 = 250,000,000 time series (disaster!)
The consequences:
- Prometheus server runs out of memory
- Query times jump from 100ms to 30+ seconds
- Monthly observability bill increases by 40x
- System becomes unusable
The fix:
# Option 1: Remove the high-cardinality label
request_counter = Counter(
    'api_requests_total',
    'Total API requests',
    ['endpoint', 'method', 'status']  # low cardinality only
)

# Option 2: Use traces/logs for high-cardinality data
trace.get_current_span().set_attribute('user_id', current_user.id)  # traces handle it
logger.info('Request processed', extra={'user_id': current_user.id})

# Option 3: Aggregate before recording
if current_user.plan == 'premium':
    premium_counter.inc()  # track by plan type, not individual user
Memory device: METRIC = MATH, TRACE = TRACKING. If you need to do calculations (aggregations, averages), use metrics with low cardinality. If you need to track individual items (users, sessions), use traces.
Example 3: Trace Sampling Strategy
Scenario: Your system processes 10,000 requests/second. Recording all traces generates 500 GB/day of data, costing $15,000/month. You need a smarter sampling strategy.
Head-based sampling (decide immediately):
# Sample 1% of all traffic
# (illustrative decision logic; real SDKs expose this as a configurable sampler)
import random

if random.random() < 0.01:  # 1% sampling rate
    tracer.start_trace()
else:
    tracer.skip_trace()
- Pros: simple, predictable volume reduction (99% less data)
- Cons: might miss the exact trace that had the error
Tail-based sampling (decide after completion):
# Smart sampling: keep interesting traces
import random

def should_keep_trace(trace):
    # Always keep errors
    if trace.has_error():
        return True
    # Always keep slow requests
    if trace.duration_ms > 1000:
        return True
    # Keep 1% of normal traffic for a baseline
    if random.random() < 0.01:
        return True
    return False
Result comparison:
| Metric | No Sampling | Head (1%) | Tail (smart) |
|---|---|---|---|
| Traces/day | 864M | 8.64M | 12M |
| Data size/day | 500 GB | 5 GB | 7 GB |
| Cost/month | $15,000 | $150 | $210 |
| Error coverage | 100% | ~1% | 100% |
| Slow request coverage | 100% | ~1% | 100% |
Best practice: Use tail-based sampling in production. The slightly higher cost (7 GB vs 5 GB per day) is worth having 100% of errors and performance issues.
Example 4: Effective Correlation
Scenario: A user reports "my payment failed" but you have millions of requests to search through. Correlation IDs save the day.
User provides: "I tried to pay around 2:30 PM, order #ORD-78945"
Investigation flow:
1. Start with structured logs:
Query: order_id="ORD-78945" AND timestamp>="2026-01-15T14:25:00Z"
Result:
{
"timestamp": "2026-01-15T14:32:18Z",
"order_id": "ORD-78945",
"trace_id": "xyz-789-abc", β Got it!
"user_id": "U-12345",
"event": "payment_failed",
"reason": "gateway_timeout"
}
2. Use trace_id to examine the full request:
Query traces: trace_id="xyz-789-abc"
Trace spans show:
- Checkout Service: 150ms (fine)
- Payment Gateway API: 30,000ms (hit the 30-second timeout!)
- Response: 504 Gateway Timeout
3. Check metrics for pattern:
Query: payment_gateway_duration{time="14:30-14:35"}
Shows: Spike in latency across ALL payment requests
Indicates: External payment provider issue, not our code
4. Correlate with external status: Check the payment provider's status page → "Degraded performance 14:28-14:41 UTC"
Complete picture:
- The user's payment failed due to an external provider outage
- Our system correctly timed out and didn't charge the user
- We need a clearer error message telling the user to retry
Try this: In your next project, generate a UUID at the start of each request and include it in every log, metric exemplar, and trace. You'll be amazed how much easier debugging becomes.
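A minimal sketch of that idea for a Flask service; the header and field names are our own choices:

```python
import uuid
from flask import Flask, g, request

app = Flask(__name__)

@app.before_request
def assign_request_id():
    # Reuse an inbound ID if an upstream service already set one
    g.request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))

@app.after_request
def propagate_request_id(response):
    # Echo the ID back so clients and downstream logs can correlate
    response.headers["X-Request-ID"] = g.request_id
    return response

# In every log line: logger.info("...", extra={"fields": {"request_id": g.request_id}})
```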
Common Mistakes
Mistake #1: Using metrics for high-cardinality data
- Wrong: Adding user_id, session_id, or transaction_id as metric labels
- Why it's bad: Creates millions of time series and crashes metric systems
- Fix: Use these IDs in traces and logs, use aggregated categories in metrics
Mistake #2: Logging everything at DEBUG level in production
- Wrong: Leaving verbose DEBUG logs enabled in production
- Why it's bad: Generates terabytes of mostly useless data, increases costs 10-100x
- Fix: Use INFO for production, enable DEBUG temporarily when investigating issues
Mistake #3: No correlation between signals
- Wrong: Metrics, logs, and traces exist in isolation with no shared IDs
- Why it's bad: Impossible to connect "this metric spike" to "these specific traces"
- Fix: Always include trace_id in logs, use exemplars in metrics to link to sample traces
Mistake #4: Ignoring sampling
- Wrong: Recording 100% of traces "to be safe"
- Why it's bad: Overwhelming data volume, unsustainable costs
- Fix: Implement tail-based sampling to keep all interesting traces (errors, slow requests) while reducing normal traffic
Mistake #5: Monitoring symptoms instead of causes
- Wrong: Alert fires: "Disk usage is 95%" but you don't know which service is filling it
- Why it's bad: Alerts without context require manual investigation every time
- Fix: Add rich context to metrics (service, pod, process) and correlate with logs showing what's writing to disk
Mistake #6: Treating observability as an afterthought
- Wrong: Building the entire system, then trying to "add observability" before launch
- Why it's bad: Retrofitting instrumentation is 10x harder than building it in from the start
- Fix: Make observability a requirement in every story, instrument as you build
Mistake #7: Not defining SLIs/SLOs
- Wrong: Collecting metrics without clear definitions of "good" performance
- Why it's bad: Can't tell if your system is meeting user expectations
- Fix: Define Service Level Indicators (e.g., "99% of requests < 500ms") and track them
Mistake #8: Alert fatigue
- Wrong: Creating alerts for every possible metric anomaly
- Why it's bad: Engineers ignore alerts when 95% are false positives
- Fix: Alert only on user-impacting issues, use warnings/dashboards for everything else
Key Takeaways
Quick Reference Card: Observability Foundations
| Concept | Key Points |
|---|---|
| Three Pillars | Metrics (what's wrong) + Logs (what happened) + Traces (where it happened) |
| Cardinality | Keep metrics LOW cardinality (<1000 series). Use traces/logs for high cardinality data. |
| Correlation | Always include trace_id in logs. Use correlation IDs to connect signals across pillars. |
| Sampling | Tail-based sampling: Keep 100% of errors/slow requests, 1% of normal traffic. |
| Instrumentation | Start with auto-instrumentation, add manual for business metrics. |
| Metric Types | Counter (always up), Gauge (up/down), Histogram (distribution), Summary (percentiles) |
| Log Levels | TRACE < DEBUG < INFO < WARN < ERROR < FATAL. Production = INFO level. |
| Spans | Individual operations in a trace. Form parent-child tree showing request flow. |
Memory Device - The Three Questions:
- Metrics: "Is something wrong?" (trends, aggregates)
- Traces: "Where is it wrong?" (distributed system flow)
- Logs: "Why is it wrong?" (detailed event context)
Quick Decision Tree:
Need to store data about...
- Numbers that change over time? → Metrics
- Individual event details? → Logs
- Request flow across services? → Traces
- User/session-specific data? → Traces + Logs (never metrics!)
Summary
Observability transforms how we understand and debug production systems. Unlike traditional monitoring that asks "is service X down?", observability lets you ask arbitrary questions: "show me all requests from mobile users in Europe that took longer than 2 seconds and touched the payment service." This exploratory capability is essential for modern distributed systems where failures are complex and unpredictable.
The three pillarsβmetrics, logs, and tracesβwork together synergistically. Metrics provide efficient aggregation and alerting. Logs offer detailed event context. Traces map request flows through distributed systems. Used together with proper correlation IDs, they enable you to move from "alert fired" to "root cause identified" in minutes instead of hours.
The hidden challenges of observability are cardinality (which destroys metric systems) and data volume (which destroys budgets). Smart sampling strategies, careful metric label selection, and using the right signal type for the right data are essential for sustainable observability at scale.
As you build your observability practice, remember: instrumentation is not a one-time task but an ongoing part of software development. Every new feature should include observability from day one. The time invested upfront pays dividends when production issues ariseβand they always do.
Further Study
- OpenTelemetry Documentation: https://opentelemetry.io/docs/ - The vendor-neutral standard for instrumentation, metrics, logs, and traces
- Google SRE Book - Monitoring Distributed Systems: https://sre.google/sre-book/monitoring-distributed-systems/ - Chapter covering observability best practices from Google's SRE team
- Charity Majors - Observability Engineering: https://www.honeycomb.io/blog - Co-author of "Observability Engineering" book, frequent writer on modern observability practices