
Metrics That Matter

Master counters, gauges, histograms, RED/USE methods, and why averages mislead during incidents

Master the critical metrics for production observability with free flashcards and spaced repetition practice. This lesson covers the four golden signals, service-level indicators (SLIs), and metric cardinality, all essential concepts for building reliable distributed systems in 2026 and beyond.

Welcome to Metrics That Matter 💻

In the world of production observability, you can measure almost anything. CPU usage, memory consumption, network packets, disk I/O, request counts, error rates, cache hits; the list goes on endlessly. But here's the challenge: not all metrics are created equal.

In this lesson, you'll learn to distinguish signal from noise. We'll explore the metrics that actually help you understand system health, detect incidents early, and diagnose problems quickly. You'll discover frameworks like the Four Golden Signals and learn how to define meaningful Service Level Indicators (SLIs) that align with user experience.

💡 Pro tip: The best engineers don't monitor everything; they monitor what matters. This lesson will teach you exactly what that means.

Core Concepts: Understanding Essential Metrics 📊

The Four Golden Signals 🔆

Google's Site Reliability Engineering (SRE) book introduced the Four Golden Signals, a fundamental framework for monitoring distributed systems. These four metrics provide a complete picture of system health:

Signal     | What It Measures            | Why It Matters
Latency    | Time to service a request   | Directly impacts user experience
Traffic    | Demand on your system       | Shows utilization and growth
Errors     | Rate of failed requests     | Indicates correctness issues
Saturation | How "full" your service is  | Predicts capacity problems

Let's explore each signal in detail:

1. Latency ⏱️

Latency measures how long it takes to service a request. But there's a critical nuance: you must track successful request latency and failed request latency separately.

Why? A failed request might return instantly (say, a 400 Bad Request), making your average latency look great even when users are experiencing errors. Always separate these:

  • Success latency: P50, P95, P99 of successful requests
  • Error latency: Response time for failed requests

💡 Key insight: Focus on percentiles (P95, P99) rather than averages. An average of 100ms might hide the fact that 5% of your users wait 10 seconds.

LATENCY DISTRIBUTION

    │
100%├───────────────────────────
    │                          ▓▓
    │                       ▓▓▓▓▓
 99%├──────────────────────▓▓▓▓▓
    │                  ▓▓▓▓▓▓▓
    │             ▓▓▓▓▓▓▓▓▓
 95%├──────────▓▓▓▓▓▓▓▓▓
    │      ▓▓▓▓▓▓▓▓▓
    │  ▓▓▓▓▓▓▓▓
 50%├▓▓▓▓▓▓
    └─┴─┴─┴─┴─┴─┴─┴─┴─┴─
     10 50 100 200 500 1000ms

P50 = 50ms (median user)
P95 = 200ms (slower users)
P99 = 800ms (worst experience)
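
To keep success and error latency separate in practice, here is a minimal sketch using the Python prometheus_client library (an assumption; any metrics SDK follows the same pattern). The metric name request_latency_seconds and the outcome label are illustrative, not a standard:

import time
from prometheus_client import Histogram
# One histogram, labeled by outcome, so failed requests never skew the success distribution.
REQUEST_LATENCY = Histogram(
    "request_latency_seconds",
    "Time spent servicing requests",
    ["outcome"],  # only "success" or "error": bounded cardinality
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)
def timed(handler):
    start = time.perf_counter()
    try:
        response = handler()
        REQUEST_LATENCY.labels(outcome="success").observe(time.perf_counter() - start)
        return response
    except Exception:
        REQUEST_LATENCY.labels(outcome="error").observe(time.perf_counter() - start)
        raise

P95 and P99 are then derived from the exported buckets at query time (in Prometheus, via histogram_quantile over the _bucket series) rather than computed in the application.
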
2. Traffic 🚦

Traffic measures the demand on your system. What you measure depends on your system type:

  • Web service: HTTP requests per second
  • Database: Transactions or queries per second
  • Streaming service: Network I/O rate or sessions per second
  • IoT system: Messages per second

Traffic helps you understand:

  • Normal patterns: Daily/weekly cycles
  • Growth trends: Are you scaling appropriately?
  • Incident correlation: Did a traffic spike cause the outage?

3. Errors ❌

The error rate tells you when things break. Track errors as:

  • Rate: Errors per second (absolute)
  • Ratio: Error percentage of total requests (relative)

Both matter! A service handling 1000 req/s with 1% errors is generating 10 errors/second, which is significantly different from a service at 10 req/s with 1% errors (0.1 errors/s).
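
To make the difference concrete, here is a tiny sketch (numbers invented) that turns counter increases over a window into both forms:

def error_rate_and_ratio(errors_delta, requests_delta, window_seconds):
    # Absolute: errors per second; relative: fraction of requests that failed.
    rate = errors_delta / window_seconds
    ratio = errors_delta / requests_delta if requests_delta else 0.0
    return rate, ratio
# 1000 req/s with 1% errors, observed over a 60-second window:
print(error_rate_and_ratio(errors_delta=600, requests_delta=60_000, window_seconds=60))
# -> (10.0, 0.01): ten errors every second, yet "only" a 1% error ratio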

Critical distinction: Explicit vs. implicit errors

Explicit Errors     | Implicit Errors
HTTP 500 responses  | HTTP 200 with wrong content
Exception thrown    | Request timeout (no response)
Failed health check | Slow response violating SLA

💡 Pro tip: Don't ignore 4xx errors completely. While they're client errors, a sudden spike in 400s or 404s might indicate a broken client deployment or API change.

4. Saturation 📈

Saturation measures how "full" your service is. This is the most complex signal because different resources saturate differently:

  • CPU: 80%+ utilization typically problematic
  • Memory: Depends on language (garbage collection impact)
  • Disk I/O: Queue depth and IOPS limits
  • Network: Bandwidth utilization
  • Thread pools: Active/max threads ratio

Saturation is predictive: it tells you problems are coming before they arrive.

SATURATION WARNING ZONES

100% ─━━━━━━━━━━━━━━━━━━━━━  🔴 Critical
     │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
 90% ─▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  🟡 Warning
     │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒
 70% ─▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒  🟢 Healthy
     │░░░░░░░░░░░░░░░░░░░░░
  0% ─░░░░░░░░░░░░░░░░░░░░░
     └────────────────────
         CPU Usage
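
As a sketch (again assuming prometheus_client, with a hypothetical pool object exposing .active and .max_size), saturation is typically exported as a pair of gauges so the "how full" ratio can be graphed and alerted on:

from prometheus_client import Gauge
POOL_ACTIVE = Gauge("db_pool_active_connections", "Connections currently in use")
POOL_MAX = Gauge("db_pool_max_connections", "Configured pool size")
def export_pool_saturation(pool):
    # `pool` is a hypothetical object; substitute your own pool/executor stats here
    POOL_ACTIVE.set(pool.active)
    POOL_MAX.set(pool.max_size)
# Alerting idea: fire when active / max stays above 0.8, i.e. the pool is 80%+ full
# and queueing (true saturation) is about to begin.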

Service Level Indicators (SLIs) 🎯

While the Four Golden Signals provide a framework, Service Level Indicators (SLIs) are the specific metrics you choose to measure service quality. An SLI is a carefully selected metric that represents user experience.

Anatomy of a good SLI:

  1. User-centric: Reflects what users actually care about
  2. Measurable: Can be reliably collected
  3. Proportional: Expressed as a ratio (good events / total events)
  4. Actionable: Changes suggest specific fixes

Common SLI patterns:

SLI Type     | Example             | Formula
Availability | Successful requests | (successful requests) / (total requests)
Latency      | Fast requests       | (requests < 200ms) / (total requests)
Quality      | Valid responses     | (correct results) / (total results)
Durability   | Data retention      | (records retained) / (records written)

Example SLI definition:

"The proportion of HTTP GET requests for /api/users that return a 2xx status code and complete within 300ms, measured at the load balancer."

Notice the specificity:

  • ✅ Specific endpoint (/api/users)
  • ✅ Specific method (GET)
  • ✅ Success criteria (2xx, <300ms)
  • ✅ Measurement point (load balancer)

💡 SLI vs SLO vs SLA:

  • SLI: The metric you measure ("99.5% of requests succeeded")
  • SLO: Your target ("We aim for 99.9% success")
  • SLA: The contract ("We guarantee 99.5% or you get a refund")
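
Because a good SLI is just good events divided by total events, computing one is deliberately simple. A minimal sketch for the availability SLI defined above (the event records are hypothetical):

def availability_sli(events):
    # events: iterable of dicts like {"status": 200, "latency_ms": 143}
    total = good = 0
    for e in events:
        total += 1
        # "Good" per the definition above: 2xx status and under 300 ms
        if 200 <= e["status"] < 300 and e["latency_ms"] < 300:
            good += 1
    return good / total if total else 1.0  # treat "no traffic" as meeting the target
sample = [{"status": 200, "latency_ms": 120},
          {"status": 200, "latency_ms": 450},
          {"status": 500, "latency_ms": 30}]
print(availability_sli(sample))  # 0.333...: only one of three requests was "good"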

RED Method 🔴

An alternative to the Four Golden Signals, the RED method focuses on request-driven services:

  • Rate: Requests per second
  • Errors: Failed requests per second
  • Duration: Request latency distribution

RED is essentially a simplified version of the Golden Signals, perfect for microservices and APIs.

RED METHOD DASHBOARD LAYOUT

┌─────────────────────────────────────┐
│  📊 RATE (Traffic)                  │
│  ▁▂▃▄▅▆▇█▇▆▅▄▃▂▁   1,247 req/s     │
├─────────────────────────────────────┤
│  ❌ ERRORS                          │
│  ▁▁▁▁▁▂▃▄▃▂▁▁▁▁▁   0.3% error rate │
├─────────────────────────────────────┤
│  ⏱️ DURATION (Latency)              │
│  P50: 45ms                          │
│  P95: 180ms                         │
│  P99: 520ms                         │
└─────────────────────────────────────┘
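
A sketch of RED instrumentation around a single handler, again using prometheus_client with illustrative metric names (one counter for rate, one for errors, one histogram for duration):

import time
from prometheus_client import Counter, Histogram
REQUESTS = Counter("svc_requests_total", "Requests received", ["endpoint"])            # Rate
ERRORS = Counter("svc_errors_total", "Requests that failed", ["endpoint"])             # Errors
DURATION = Histogram("svc_request_duration_seconds", "Request latency", ["endpoint"])  # Duration
def call_with_red(endpoint, handler, *args, **kwargs):
    REQUESTS.labels(endpoint=endpoint).inc()
    start = time.perf_counter()
    try:
        return handler(*args, **kwargs)
    except Exception:
        ERRORS.labels(endpoint=endpoint).inc()
        raise
    finally:
        DURATION.labels(endpoint=endpoint).observe(time.perf_counter() - start)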

USE Method 🔧

For resource-oriented monitoring (servers, containers, databases), Brendan Gregg developed the USE method:

  • Utilization: % time resource is busy
  • Saturation: Degree of queued work
  • Errors: Count of error events

Apply USE to every resource:

Resource | Utilization      | Saturation              | Errors
CPU      | CPU usage %      | Run queue length        | CPU errors (rare)
Memory   | Memory used %    | Swap usage, page faults | OOM kills
Disk     | IO utilization % | IO queue depth          | Read/write errors
Network  | Bandwidth used % | Buffer overruns         | Packet loss, retransmits
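
For host-level resources, the Utilization and Saturation columns can be sampled with the standard psutil library, as in this rough sketch; the Errors column usually comes from kernel or hardware counters, so it is omitted here:

import psutil
def use_snapshot():
    # Rough Utilization/Saturation snapshot for CPU and memory
    load1, _, _ = psutil.getloadavg()
    return {
        "cpu_utilization_pct": psutil.cpu_percent(interval=1),    # % of time the CPU was busy
        "cpu_saturation_load1": load1,                            # runnable tasks queued (1-min load)
        "mem_utilization_pct": psutil.virtual_memory().percent,   # memory in use
        "mem_saturation_swap_pct": psutil.swap_memory().percent,  # swapping signals memory pressure
    }
print(use_snapshot())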

Metric Cardinality and Dimensionality 📏

As systems grow complex, you'll want to add dimensions (labels/tags) to metrics:

http_requests_total{method="GET", endpoint="/api/users", status="200"}

Cardinality = the number of unique time series created by dimension combinations.

Low cardinality (good):

  • method: 5-10 values (GET, POST, PUT, DELETE...)
  • status: ~15 values (200, 201, 400, 404, 500...)
  • endpoint: Dozens to hundreds (your API routes)

High cardinality (dangerous):

  • ❌ user_id: Millions of values
  • ❌ session_id: Unbounded
  • ❌ ip_address: Hundreds of thousands
  • ❌ timestamp: Infinite

⚠️ Cardinality explosion: With 10 methods × 100 endpoints × 15 statuses = 15,000 time series. Add a high-cardinality dimension like user_id (1M values) and you get 15 billion time series, which will crash your monitoring system!
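
The arithmetic is worth doing explicitly whenever you add a label; a two-line sketch makes the explosion obvious:

from math import prod
label_values = {"method": 10, "endpoint": 100, "status": 15}
print(prod(label_values.values()))              # 15,000 series: manageable
print(prod(label_values.values()) * 1_000_000)  # add user_id -> 15,000,000,000 series: not manageable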

💡 Cardinality best practices:

  1. Keep dimension values bounded
  2. Use aggregation for high-cardinality data ("top 10 users" not "all users")
  3. Store high-cardinality data in logs/traces, not metrics
  4. Monitor your monitoring system's cardinality

Counter, Gauge, and Histogram 📊

Metric types determine how data is stored and queried:

Counter 🔢: Cumulative value that only increases

  • Examples: total requests, total errors, bytes sent
  • Use for: Rates (calculate rate of change)
  • Resets to 0 on restart

http_requests_total: 1547 → 1548 → 1549 → 0 (restart) → 1 → 2

Gauge 📏: Instantaneous value that can go up or down

  • Examples: CPU usage, memory consumption, queue depth, active connections
  • Use for: Current state snapshots
  • Value is meaningful at any point in time

cpu_usage_percent: 45.2 → 67.8 → 52.1 → 71.5

Histogram 📊: Distribution of values in buckets

  • Examples: request latency, response size
  • Use for: Percentile calculations (P95, P99)
  • Pre-aggregated on client side
http_request_duration_seconds_bucket{le="0.1"} = 850
http_request_duration_seconds_bucket{le="0.5"} = 920
http_request_duration_seconds_bucket{le="1.0"} = 950
http_request_duration_seconds_bucket{le="+Inf"} = 1000
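
Percentiles are then estimated from those cumulative bucket counts at query time; this is roughly what PromQL's histogram_quantile does. A simplified sketch using the counts above, with linear interpolation inside the bucket that contains the target rank:

def quantile_from_buckets(q, buckets):
    # buckets: sorted list of (upper_bound, cumulative_count), ending with +Inf
    total = buckets[-1][1]
    target = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into the +Inf bucket
            fraction = (target - prev_count) / (count - prev_count)
            return prev_bound + fraction * (bound - prev_bound)
        prev_bound, prev_count = bound, count
data = [(0.1, 850), (0.5, 920), (1.0, 950), (float("inf"), 1000)]
print(quantile_from_buckets(0.95, data))  # 1.0 -> the 95th percentile sits at the top of the 0.5s-1.0s bucket
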
METRIC TYPE DECISION TREE

What are you measuring?
│
├─ A total count that only accumulates?
│     → COUNTER    (examples: requests, errors, bytes)
│
└─ A state value?
      ├─ Goes up and down?
      │     → GAUGE      (examples: CPU %, memory, queue depth)
      │
      └─ A distribution of observations?
            → HISTOGRAM  (examples: latency, size, score)

Real-World Examples 🌍

Example 1: E-commerce Checkout Service

Let's design metrics for a critical checkout service:

Golden Signals implementation:

  1. Latency:

    • checkout_duration_seconds (histogram)
    • Buckets: 0.1s, 0.5s, 1s, 2s, 5s
    • Track P95 and P99
    • Alert if P95 > 2 seconds
  2. Traffic:

    • checkout_requests_total (counter)
    • Dimensions: {method, status}
    • Calculate rate: requests/second
  3. Errors:

    • checkout_errors_total (counter)
    • Dimensions: {error_type, payment_provider}
    • Alert if error rate > 1%
  4. Saturation:

    • payment_api_pool_active (gauge): active connections
    • payment_api_pool_max (gauge): max connections
    • Alert if ratio > 0.8 (80% capacity)

SLI definition: "99.5% of checkout requests complete successfully with <2s latency, measured over 30-day windows"
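
Under the same assumption as earlier (prometheus_client as the metrics SDK), the four signals listed above could be declared like this; the names and labels follow the list, everything else is a sketch:

from prometheus_client import Counter, Gauge, Histogram
CHECKOUT_DURATION = Histogram("checkout_duration_seconds", "Checkout latency",
                              buckets=(0.1, 0.5, 1.0, 2.0, 5.0))                          # Latency
CHECKOUT_REQUESTS = Counter("checkout_requests_total", "Checkout requests",
                            ["method", "status"])                                         # Traffic
CHECKOUT_ERRORS = Counter("checkout_errors_total", "Checkout failures",
                          ["error_type", "payment_provider"])                             # Errors
PAYMENT_POOL_ACTIVE = Gauge("payment_api_pool_active", "Active payment API connections")  # Saturation
PAYMENT_POOL_MAX = Gauge("payment_api_pool_max", "Max payment API connections")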

Example 2: Video Streaming Platform

User-centric SLIs:

  1. Availability SLI:

    • "Proportion of video start requests that return a playback URL within 500ms"
    • Metric: (successful_starts < 500ms) / (total_start_requests)
    • Target: 99.9%
  2. Quality SLI:

    • "Proportion of viewing time without buffering events"
    • Metric: (smooth_playback_seconds) / (total_playback_seconds)
    • Target: 99.5%
  3. Latency SLI:

    • "Proportion of seek operations that complete within 1 second"
    • Metric: (seeks < 1s) / (total_seeks)
    • Target: 99%

Metric Name              | Type      | Dimensions          | Purpose
video_start_duration_ms  | Histogram | region, device_type | Latency SLI
playback_seconds_total   | Counter   | quality, codec      | Usage tracking
buffer_events_total      | Counter   | region, cdn         | Quality SLI
cdn_capacity_percent     | Gauge     | region, pop         | Saturation

Example 3: Database Service Metrics

USE Method applied to PostgreSQL:

Utilization:

  • pg_connections_active / pg_connections_max: Connection pool usage
  • pg_cpu_percent: Database CPU usage
  • pg_disk_io_utilization: Disk busy time

Saturation:

  • pg_locks_waiting: Queries waiting for locks
  • pg_replication_lag_bytes: Replica lag
  • pg_disk_queue_depth: Pending disk operations

Errors:

  • pg_deadlocks_total: Deadlock count
  • pg_connection_errors_total: Failed connections
  • pg_transaction_rollbacks_total: Rolled-back transactions

💡 Pro insight: For databases, slow queries are often more critical than total query count. Track:

  • pg_slow_queries_total{threshold="1s"}: Queries exceeding 1 second
  • pg_query_duration_p99: 99th percentile query time

Example 4: Microservices Dashboard

Imagine you manage 50 microservices. You can't monitor everything; focus on service-level metrics using RED:

Per-service metrics:

# Rate
service_requests_per_second{service="auth", method="POST", endpoint="/login"}

# Errors
service_error_rate{service="auth", status="5xx"}

# Duration
service_request_duration_p95{service="auth", endpoint="/login"}

Cross-service dependencies:

# Upstream calls
service_dependency_requests{from="api-gateway", to="auth", status="200"}
service_dependency_latency_p99{from="api-gateway", to="auth"}

SERVICE DEPENDENCY MAP

┌──────────┐
│  Users   │
└─────┬────┘
      │
      ↓
┌──────────────┐
│ API Gateway  │ ← RED metrics
└──┬────┬────┬─┘
   │    │    │
   ↓    ↓    ↓
┌────┐┌────┐┌──────┐
│Auth││Cart││Search│ ← RED metrics each
└──┬─┘└──┬─┘└───┬──┘
   │     │      │
   ↓     ↓      ↓
┌──────────────────┐
│   Database       │ ← USE metrics
└──────────────────┘

Top-level: User-facing latency
Mid-level: Service RED metrics
Bottom: Resource USE metrics

Common Mistakes ⚠️

1. Measuring Everything

❌ Wrong: "Let's collect 500 metrics per service; more data is better!"

✅ Right: Start with the Four Golden Signals, then add metrics only when you have specific questions to answer.

Why it matters: Too many metrics create:

  • Alert fatigue (which metrics actually matter?)
  • High storage costs
  • Slow dashboards
  • Analysis paralysis

2. Ignoring Cardinality

❌ Wrong:

http_requests{user_id="user123", session_id="abc456", trace_id="xyz789"}

✅ Right:

http_requests{endpoint="/api/users", method="GET", status="200"}

Why it matters: High-cardinality dimensions explode your time series database. User IDs belong in logs and traces, not metrics.

3. Averaging Latency

❌ Wrong: "Our average latency is 100ms, we're doing great!"

✅ Right: "P50 is 80ms, P95 is 200ms, P99 is 1.2s; we need to investigate that long tail."

Why it matters: Averages hide outliers. The worst 1% of user experiences matter enormously to user satisfaction.
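
A quick experiment shows how much the average hides. With mostly fast requests and a couple of slow stragglers (numbers invented), the mean looks acceptable while the tail clearly is not:

from math import ceil
from statistics import mean
latencies_ms = [80] * 90 + [200] * 8 + [2000] * 2  # 90 fast, 8 medium, 2 very slow requests
def pctl(values, p):
    # Nearest-rank percentile: the smallest value that at least p% of samples fall at or below
    ordered = sorted(values)
    return ordered[ceil(p / 100 * len(ordered)) - 1]
print(round(mean(latencies_ms)))  # 128  -> "average latency is ~130ms, looks fine"
print(pctl(latencies_ms, 50))     # 80   -> the median user is happy
print(pctl(latencies_ms, 99))     # 2000 -> 1 in 50 users waits two full seconds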

4. Alert on Vanity Metrics

❌ Wrong: Alert when CPU usage > 50%

✅ Right: Alert when P99 latency > 2s OR error rate > 1%

Why it matters: Users don't care about CPU usage; they care about slow responses and errors. Alert on symptoms (user impact), not causes (resource usage).

5. Missing the "Why"

❌ Wrong: Track users_logged_in_total (just a counter)

✅ Right: Track all three:

  • login_requests_total{status="success"}: How many tried
  • login_duration_seconds: How long it took
  • login_errors_total{reason="invalid_password"}: Why it failed

Why it matters: Metrics should help you diagnose problems, not just detect them.

6. No Percentile Targets

❌ Wrong: "Our latency SLO is <500ms"

✅ Right: "Our P95 latency SLO is <500ms"

Why it matters: Without specifying a percentile, your SLO is meaningless. Does "<500ms" mean all requests? Just the median?

7. Forgetting About Data Freshness

❌ Wrong: Using 5-minute average metrics for incident response

✅ Right: Use 10-second granularity for dashboards, 1-minute for alerts

Why it matters: 5-minute averages mean you detect incidents 2.5 minutes late on average. In production, minutes matter.

Key Takeaways 🎯

  1. Start with the Four Golden Signals: Latency, Traffic, Errors, Saturation cover 90% of observability needs for request-driven services.

  2. Define clear SLIs: Service Level Indicators should be user-centric, measurable, and directly tied to user experience.

  3. Use the right method for the job:

    • RED for services and APIs
    • USE for resources (CPU, memory, disk)
    • Four Golden Signals for overall system health
  4. Mind your cardinality: Keep metric dimensions bounded. High-cardinality data belongs in logs and traces, not metrics.

  5. Think in percentiles, not averages: P95 and P99 latency matter more than mean latency for understanding user experience.

  6. Choose the right metric type:

    • Counter for cumulative counts (requests, errors)
    • Gauge for current state (CPU%, memory, connections)
    • Histogram for distributions (latency, sizes)
  7. Alert on symptoms, not causes: Users care about slow responses and errors, not CPU usage or memory consumption.

  8. Less is more: A focused set of meaningful metrics beats hundreds of unused ones. Start small, add deliberately.

📋 Quick Reference: Metrics Framework Selection

System Type             | Recommended Framework | Key Metrics
Web API / Microservice  | RED Method            | Rate, Errors, Duration
Infrastructure / Server | USE Method            | Utilization, Saturation, Errors
User-facing Service     | Four Golden Signals   | Latency, Traffic, Errors, Saturation
Database / Storage      | USE + Custom          | USE + Query latency, Replication lag
Queue / Message Broker  | Custom                | Queue depth, Age of oldest message, Consumer lag

🧠 Memory Device: The Golden Signals

"LATE" helps you remember the Four Golden Signals:

  • Latency - How long requests take
  • Amount (Traffic) - How many requests
  • Troubles (Errors) - How many fail
  • Exhaustion (Saturation) - How full you are

📚 Further Study

  1. Google SRE Book - Monitoring Distributed Systems: https://sre.google/sre-book/monitoring-distributed-systems/ - The definitive guide to the Four Golden Signals and practical monitoring philosophy

  2. Brendan Gregg's USE Method: https://www.brendangregg.com/usemethod.html - Comprehensive explanation of the Utilization, Saturation, Errors framework for resource monitoring

  3. Prometheus Best Practices: https://prometheus.io/docs/practices/naming/ - Industry-standard guidance on metric naming, cardinality, and instrumentation patterns for modern observability