Metrics That Matter
Master counters, gauges, histograms, RED/USE methods, and why averages mislead during incidents
Master the critical metrics for production observability with free flashcards and spaced repetition practice. This lesson covers the four golden signals, service-level indicators (SLIs), and metric cardinality: essential concepts for building reliable distributed systems in 2026 and beyond.
Welcome to Metrics That Matter 👋
In the world of production observability, you can measure almost anything. CPU usage, memory consumption, network packets, disk I/O, request counts, error rates, cache hits; the list goes on endlessly. But here's the challenge: not all metrics are created equal.
In this lesson, you'll learn to distinguish signal from noise. We'll explore the metrics that actually help you understand system health, detect incidents early, and diagnose problems quickly. You'll discover frameworks like the Four Golden Signals and learn how to define meaningful Service Level Indicators (SLIs) that align with user experience.
💡 Pro tip: The best engineers don't monitor everything; they monitor what matters. This lesson will teach you exactly what that means.
Core Concepts: Understanding Essential Metrics 📊
The Four Golden Signals 🥇
Google's Site Reliability Engineering (SRE) book introduced the Four Golden Signals, a fundamental framework for monitoring distributed systems. These four metrics provide a complete picture of system health:
| Signal | What It Measures | Why It Matters |
|---|---|---|
| Latency | Time to service a request | Directly impacts user experience |
| Traffic | Demand on your system | Shows utilization and growth |
| Errors | Rate of failed requests | Indicates correctness issues |
| Saturation | How "full" your service is | Predicts capacity problems |
Let's explore each signal in detail:
1. Latency ⏱️
Latency measures how long it takes to service a request. But there's a critical nuance: you must track successful request latency and failed request latency separately.
Why? A failed request might return instantly (say, a 400 Bad Request), making your average latency look great even when users are experiencing errors. Always separate these:
- Success latency: P50, P95, P99 of successful requests
- Error latency: Response time for failed requests
💡 Key insight: Focus on percentiles (P95, P99) rather than averages. An average near 100ms can look healthy while the slowest 1-2% of your requests take several seconds.
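To make the gap concrete, here is a minimal sketch in plain Python with made-up latency samples (the numbers are illustrative, not from any real service):

```python
import math
import statistics

# Hypothetical latency sample in milliseconds: most requests are fast,
# but a small tail hits a slow dependency.
latencies = [50] * 985 + [3300] * 15

def percentile(values, pct):
    """Nearest-rank percentile: the value below which pct% of samples fall."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

print(f"mean: {statistics.mean(latencies):.0f} ms")  # ~99 ms: looks healthy
print(f"P50:  {percentile(latencies, 50)} ms")       # 50 ms
print(f"P95:  {percentile(latencies, 95)} ms")       # 50 ms
print(f"P99:  {percentile(latencies, 99)} ms")       # 3300 ms: the hidden tail
```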
LATENCY DISTRIBUTION (cumulative, 10ms to 1000ms)
- P50 = 50ms (median user)
- P95 = 200ms (slower users)
- P99 = 800ms (worst experience)
2. Traffic 🚦
Traffic measures the demand on your system. What you measure depends on your system type:
- Web service: HTTP requests per second
- Database: Transactions or queries per second
- Streaming service: Network I/O rate or sessions per second
- IoT system: Messages per second
Traffic helps you understand:
- Normal patterns: Daily/weekly cycles
- Growth trends: Are you scaling appropriately?
- Incident correlation: Did a traffic spike cause the outage?
3. Errors ❌
The error rate tells you when things break. Track errors as:
- Rate: Errors per second (absolute)
- Ratio: Error percentage of total requests (relative)
Both matter! A service handling 1000 req/s with 1% errors is generating 10 errors/second, significantly different from a service at 10 req/s with 1% errors (0.1 errors/s).
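A small sketch of that distinction, assuming you sample two cumulative counters one minute apart (the counter values are hypothetical):

```python
def error_rate_and_ratio(requests_delta: int, errors_delta: int, window_seconds: float):
    """Compute the absolute error rate (errors/s) and the relative error ratio."""
    rate = errors_delta / window_seconds
    ratio = errors_delta / requests_delta if requests_delta else 0.0
    return rate, ratio

# Busy service: 1000 req/s with a 1% failure ratio.
print(error_rate_and_ratio(requests_delta=60_000, errors_delta=600, window_seconds=60))
# -> (10.0, 0.01): ten errors every second

# Quiet service: 10 req/s with the same 1% failure ratio.
print(error_rate_and_ratio(requests_delta=600, errors_delta=6, window_seconds=60))
# -> (0.1, 0.01): one error every ten seconds
```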
Critical distinction: Explicit vs. implicit errors
| Explicit Errors | Implicit Errors |
|---|---|
| HTTP 500 responses | HTTP 200 with wrong content |
| Exception thrown | Request timeout (no response) |
| Failed health check | Slow response violating SLA |
💡 Pro tip: Don't ignore 4xx errors completely. While they're client errors, a sudden spike in 400s or 404s might indicate a broken client deployment or API change.
4. Saturation 🌡️
Saturation measures how "full" your service is. This is the most complex signal because different resources saturate differently:
- CPU: 80%+ utilization typically problematic
- Memory: Depends on language (garbage collection impact)
- Disk I/O: Queue depth and IOPS limits
- Network: Bandwidth utilization
- Thread pools: Active/max threads ratio
Saturation is predictive: it tells you problems are coming before they arrive.
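As one way to expose saturation, here is a minimal sketch using the Python prometheus_client library; the pool size and metric names are assumptions for illustration, not a prescribed scheme:

```python
from prometheus_client import Gauge, start_http_server

# Hypothetical worker pool; in a real service these values come from your runtime.
POOL_MAX = 100

pool_active = Gauge("worker_pool_active", "Threads currently busy")
pool_saturation = Gauge("worker_pool_saturation_ratio", "Active / max threads (0.0-1.0)")

def record_pool_usage(active_threads: int) -> None:
    """Update both gauges so alert rules can fire when saturation crosses ~0.8."""
    pool_active.set(active_threads)
    pool_saturation.set(active_threads / POOL_MAX)

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for scraping
    record_pool_usage(active_threads=65)  # -> saturation 0.65, below the 0.8 alert threshold
```

Alerting when the ratio stays above roughly 0.8 gives you warning time before the pool is actually exhausted.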
SATURATION WARNING ZONES (CPU usage)
- Above 90%: 🔴 Critical
- 70-90%: 🟡 Warning
- Below 70%: 🟢 Healthy
Service Level Indicators (SLIs) 🎯
While the Four Golden Signals provide a framework, Service Level Indicators (SLIs) are the specific metrics you choose to measure service quality. An SLI is a carefully selected metric that represents user experience.
Anatomy of a good SLI:
- User-centric: Reflects what users actually care about
- Measurable: Can be reliably collected
- Proportional: Expressed as a ratio (good events / total events)
- Actionable: Changes suggest specific fixes
Common SLI patterns:
| SLI Type | Example | Formula |
|---|---|---|
| Availability | Successful requests | (successful requests) / (total requests) |
| Latency | Fast requests | (requests < 200ms) / (total requests) |
| Quality | Valid responses | (correct results) / (total results) |
| Durability | Data retention | (records retained) / (records written) |
Example SLI definition:
"The proportion of HTTP GET requests for /api/users that return a 2xx status code and complete within 300ms, measured at the load balancer."
Notice the specificity:
- ✅ Specific endpoint (/api/users)
- ✅ Specific method (GET)
- ✅ Success criteria (2xx, <300ms)
- ✅ Measurement point (load balancer)
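A minimal sketch of turning that definition into a number, assuming you already have per-request records; the field and function names here are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Request:
    method: str
    path: str
    status: int
    duration_ms: float

def users_get_sli(requests: list[Request]) -> float:
    """Proportion of GET /api/users requests that are 2xx and complete within 300ms."""
    eligible = [r for r in requests if r.method == "GET" and r.path == "/api/users"]
    if not eligible:
        return 1.0  # no traffic in the window: treated here as meeting the SLI
    good = [r for r in eligible if 200 <= r.status < 300 and r.duration_ms < 300]
    return len(good) / len(eligible)

sample = [
    Request("GET", "/api/users", 200, 120.0),
    Request("GET", "/api/users", 200, 450.0),   # too slow: not a "good" event
    Request("GET", "/api/users", 500, 30.0),    # error: not a "good" event
    Request("POST", "/api/users", 201, 80.0),   # different method: not eligible
]
print(users_get_sli(sample))  # 1 good out of 3 eligible -> 0.333...
```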
💡 SLI vs SLO vs SLA:
- SLI: The metric you measure ("99.5% of requests succeeded")
- SLO: Your target ("We aim for 99.9% success")
- SLA: The contract ("We guarantee 99.5% or you get a refund")
RED Method 🔴
An alternative to the Four Golden Signals, the RED method focuses on request-driven services:
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Request latency distribution
RED is essentially a simplified version of the Golden Signals, perfect for microservices and APIs.
RED METHOD DASHBOARD LAYOUT
- 📈 RATE (Traffic): 1,247 req/s
- ❌ ERRORS: 0.3% error rate
- ⏱️ DURATION (Latency): P50 45ms, P95 180ms, P99 520ms
USE Method 🔧
For resource-oriented monitoring (servers, containers, databases), Brendan Gregg developed the USE method:
- Utilization: % time resource is busy
- Saturation: Degree of queued work
- Errors: Count of error events
Apply USE to every resource:
| Resource | Utilization | Saturation | Errors |
|---|---|---|---|
| CPU | CPU usage % | Run queue length | CPU errors (rare) |
| Memory | Memory used % | Swap usage, page faults | OOM kills |
| Disk | IO utilization % | IO queue depth | Read/write errors |
| Network | Bandwidth used % | Buffer overruns | Packet loss, retransmits |
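A rough collection sketch for the CPU and memory rows, assuming the third-party psutil package is available; error counts (OOM kills, hardware errors) usually come from kernel logs or a node agent, so they are omitted here:

```python
import psutil  # third-party package, assumed installed: pip install psutil

def use_snapshot() -> dict:
    """Collect Utilization and Saturation signals for CPU and memory."""
    load_1min, _, _ = psutil.getloadavg()
    return {
        "cpu_utilization_pct": psutil.cpu_percent(interval=1),
        "cpu_saturation_runq": load_1min / (psutil.cpu_count() or 1),  # >1.0 means queued work
        "mem_utilization_pct": psutil.virtual_memory().percent,
        "mem_saturation_swap_pct": psutil.swap_memory().percent,
    }

if __name__ == "__main__":
    print(use_snapshot())
```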
Metric Cardinality and Dimensionality 📐
As systems grow complex, you'll want to add dimensions (labels/tags) to metrics:
http_requests_total{method="GET", endpoint="/api/users", status="200"}
Cardinality = the number of unique time series created by dimension combinations.
Low cardinality (good):
- `method`: 5-10 values (GET, POST, PUT, DELETE...)
- `status`: ~15 values (200, 201, 400, 404, 500...)
- `endpoint`: Dozens to hundreds (your API routes)
High cardinality (dangerous):
- ❌ `user_id`: Millions of values
- ❌ `session_id`: Unbounded
- ❌ `ip_address`: Hundreds of thousands
- ❌ `timestamp`: Infinite
⚠️ Cardinality explosion: With 10 methods × 100 endpoints × 15 statuses, you already have 15,000 time series. Add a high-cardinality dimension like user_id (1M values) and you get 15 billion time series, which will crash your monitoring system!
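The arithmetic is just a product of label cardinalities; a back-of-the-envelope sketch:

```python
from math import prod

def worst_case_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of the label cardinalities."""
    return prod(label_value_counts.values())

bounded = {"method": 10, "endpoint": 100, "status": 15}
print(worst_case_series(bounded))             # 15,000 series: fine

unbounded = {**bounded, "user_id": 1_000_000}
print(worst_case_series(unbounded))           # 15,000,000,000 series: not fine
```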
💡 Cardinality best practices:
- Keep dimension values bounded
- Use aggregation for high-cardinality data ("top 10 users" not "all users")
- Store high-cardinality data in logs/traces, not metrics
- Monitor your monitoring system's cardinality
Counter, Gauge, and Histogram 📊
Metric types determine how data is stored and queried:
Counter 🔢: Cumulative value that only increases
- Examples: total requests, total errors, bytes sent
- Use for: Rates (calculate rate of change)
- Resets to 0 on restart
http_requests_total: 1547 → 1548 → 1549 → 0 (restart) → 1 → 2
Gauge 📊: Instantaneous value that can go up or down
- Examples: CPU usage, memory consumption, queue depth, active connections
- Use for: Current state snapshots
- Value is meaningful at any point in time
cpu_usage_percent: 45.2 → 67.8 → 52.1 → 71.5
Histogram 📈: Distribution of values in buckets
- Examples: request latency, response size
- Use for: Percentile calculations (P95, P99)
- Pre-aggregated on client side
http_request_duration_seconds_bucket{le="0.1"} = 850
http_request_duration_seconds_bucket{le="0.5"} = 920
http_request_duration_seconds_bucket{le="1.0"} = 950
http_request_duration_seconds_bucket{le="+Inf"} = 1000
METRIC TYPE DECISION TREE
- Measuring a cumulative total (requests, errors, bytes)? → Counter
- Measuring a state value that goes up and down (CPU %, memory, queue depth)? → Gauge
- Measuring a distribution of values (latency, size, score)? → Histogram
Real-World Examples 🌍
Example 1: E-commerce Checkout Service
Let's design metrics for a critical checkout service:
Golden Signals implementation:
Latency:
- `checkout_duration_seconds` (histogram)
- Buckets: 0.1s, 0.5s, 1s, 2s, 5s
- Track P95 and P99
- Alert if P95 > 2 seconds
Traffic:
- `checkout_requests_total` (counter)
- Dimensions: `{method, status}`
- Calculate rate: requests/second
Errors:
- `checkout_errors_total` (counter)
- Dimensions: `{error_type, payment_provider}`
- Alert if error rate > 1%
Saturation:
- `payment_api_pool_active` (gauge): active connections
- `payment_api_pool_max` (gauge): max connections
- Alert if ratio > 0.8 (80% capacity)
SLI definition: "99.5% of checkout requests complete successfully with <2s latency, measured over 30-day windows"
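A small sketch of checking that SLI against its target over a window; the counts below are hypothetical:

```python
SLO_TARGET = 0.995  # "99.5% of checkout requests complete successfully with <2s latency"

def checkout_sli(good_checkouts: int, total_checkouts: int) -> float:
    """Good events / total events, where 'good' means success AND latency < 2s."""
    return good_checkouts / total_checkouts if total_checkouts else 1.0

# Hypothetical 30-day window counts.
total, good = 2_000_000, 1_993_000
sli = checkout_sli(good, total)
print(f"SLI over window: {sli:.4%}")            # 99.6500%
print(f"Meets 99.5% SLO: {sli >= SLO_TARGET}")  # True
```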
Example 2: Video Streaming Platform
User-centric SLIs:
Availability SLI:
- "Proportion of video start requests that return a playback URL within 500ms"
- Metric: `(successful_starts < 500ms) / (total_start_requests)`
- Target: 99.9%
Quality SLI:
- "Proportion of viewing time without buffering events"
- Metric: `(smooth_playback_seconds) / (total_playback_seconds)`
- Target: 99.5%
Latency SLI:
- "Proportion of seek operations that complete within 1 second"
- Metric: `(seeks < 1s) / (total_seeks)`
- Target: 99%
| Metric Name | Type | Dimensions | Purpose |
|---|---|---|---|
| video_start_duration_ms | Histogram | region, device_type | Latency SLI |
| playback_seconds_total | Counter | quality, codec | Usage tracking |
| buffer_events_total | Counter | region, cdn | Quality SLI |
| cdn_capacity_percent | Gauge | region, pop | Saturation |
Example 3: Database Service Metrics
USE Method applied to PostgreSQL:
Utilization:
- `pg_connections_active / pg_connections_max`: Connection pool usage
- `pg_cpu_percent`: Database CPU usage
- `pg_disk_io_utilization`: Disk busy time
Saturation:
- `pg_locks_waiting`: Queries waiting for locks
- `pg_replication_lag_bytes`: Replica lag
- `pg_disk_queue_depth`: Pending disk operations
Errors:
- `pg_deadlocks_total`: Deadlock count
- `pg_connection_errors_total`: Failed connections
- `pg_transaction_rollbacks_total`: Rolled-back transactions
💡 Pro insight: For databases, slow queries are often more critical than total query count. Track both (see the sketch after this list):
- `pg_slow_queries_total{threshold="1s"}`: Queries exceeding 1 second
- `pg_query_duration_p99`: 99th percentile query time
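One way to capture both, sketched with the Python prometheus_client library; the metric names mirror the list above and the wrapper function is illustrative:

```python
import time

from prometheus_client import Counter, Histogram

SLOW_THRESHOLD_SECONDS = 1.0

QUERY_DURATION = Histogram(
    "pg_query_duration_seconds",
    "Query latency; P99 is derived from the buckets at query time",
    buckets=(0.01, 0.1, 0.5, 1.0, 5.0),
)
SLOW_QUERIES = Counter(
    "pg_slow_queries_total",
    "Queries exceeding the slow-query threshold",
    ["threshold"],
)

def timed_query(run_query) -> None:
    """Wrap a query call so every execution feeds both metrics."""
    start = time.monotonic()
    try:
        run_query()
    finally:
        elapsed = time.monotonic() - start
        QUERY_DURATION.observe(elapsed)
        if elapsed > SLOW_THRESHOLD_SECONDS:
            SLOW_QUERIES.labels(threshold="1s").inc()
```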
Example 4: Microservices Dashboard
Imagine you manage 50 microservices. You can't monitor everything; focus on service-level metrics using RED:
Per-service metrics:
Rate:
service_requests_per_second{service="auth", method="POST", endpoint="/login"}

Errors:
service_error_rate{service="auth", status="5xx"}

Duration:
service_request_duration_p95{service="auth", endpoint="/login"}
Cross-service dependencies:
Upstream calls:
service_dependency_requests{from="api-gateway", to="auth", status="200"}
service_dependency_latency_p99{from="api-gateway", to="auth"}
SERVICE DEPENDENCY MAP
Users → API Gateway (RED metrics) → Auth / Cart / Search (RED metrics each) → Database (USE metrics)
Top-level: User-facing latency
Mid-level: Service RED metrics
Bottom: Resource USE metrics
Common Mistakes ⚠️
1. Measuring Everything
❌ Wrong: "Let's collect 500 metrics per service; more data is better!"
✅ Right: Start with the Four Golden Signals, then add metrics only when you have specific questions to answer.
Why it matters: Too many metrics create:
- Alert fatigue (which metrics actually matter?)
- High storage costs
- Slow dashboards
- Analysis paralysis
2. Ignoring Cardinality
❌ Wrong:
http_requests{user_id="user123", session_id="abc456", trace_id="xyz789"}
✅ Right:
http_requests{endpoint="/api/users", method="GET", status="200"}
Why it matters: High-cardinality dimensions explode your time series database. User IDs belong in logs and traces, not metrics.
3. Averaging Latency
❌ Wrong: "Our average latency is 100ms, we're doing great!"
✅ Right: "P50 is 80ms, P95 is 200ms, P99 is 1.2s; we need to investigate that long tail."
Why it matters: Averages hide outliers. The worst 1% of user experiences matter enormously to user satisfaction.
4. Alert on Vanity Metrics
❌ Wrong: Alert when CPU usage > 50%
✅ Right: Alert when P99 latency > 2s OR error rate > 1%
Why it matters: Users don't care about CPU usage; they care about slow responses and errors. Alert on symptoms (user impact), not causes (resource usage).
5. Missing the "Why"
❌ Wrong: Track users_logged_in_total (just a counter)
✅ Right: Track all three (sketched below):
- `login_requests_total{status="success"}`: How many tried
- `login_duration_seconds`: How long it took
- `login_errors_total{reason="invalid_password"}`: Why it failed
Why it matters: Metrics should help you diagnose problems, not just detect them.
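A sketch of instrumenting all three together, again assuming the Python prometheus_client library; the handler and helper names are hypothetical:

```python
import time

from prometheus_client import Counter, Histogram

LOGIN_REQUESTS = Counter("login_requests_total", "Login attempts", ["status"])
LOGIN_DURATION = Histogram("login_duration_seconds", "Time spent handling a login")
LOGIN_ERRORS = Counter("login_errors_total", "Failed logins by cause", ["reason"])

def handle_login(check_credentials, username: str, password: str) -> bool:
    """Record how many tried, how long it took, and why failures happened."""
    start = time.monotonic()
    try:
        if check_credentials(username, password):
            LOGIN_REQUESTS.labels(status="success").inc()
            return True
        LOGIN_REQUESTS.labels(status="failure").inc()
        # A real handler would distinguish causes (locked account, expired token, ...).
        LOGIN_ERRORS.labels(reason="invalid_password").inc()
        return False
    finally:
        LOGIN_DURATION.observe(time.monotonic() - start)
```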
6. No Percentile Targets
❌ Wrong: "Our latency SLO is <500ms"
✅ Right: "Our P95 latency SLO is <500ms"
Why it matters: Without specifying a percentile, your SLO is meaningless. Does "<500ms" mean all requests? Just the median?
7. Forgetting About Data Freshness
❌ Wrong: Using 5-minute average metrics for incident response
✅ Right: Use 10-second granularity for dashboards, 1-minute for alerts
Why it matters: 5-minute averages mean you detect incidents 2.5 minutes late on average. In production, minutes matter.
Key Takeaways 🎯
Start with the Four Golden Signals: Latency, Traffic, Errors, Saturation cover 90% of observability needs for request-driven services.
Define clear SLIs: Service Level Indicators should be user-centric, measurable, and directly tied to user experience.
Use the right method for the job:
- RED for services and APIs
- USE for resources (CPU, memory, disk)
- Four Golden Signals for overall system health
Mind your cardinality: Keep metric dimensions bounded. High-cardinality data belongs in logs and traces, not metrics.
Think in percentiles, not averages: P95 and P99 latency matter more than mean latency for understanding user experience.
Choose the right metric type:
- Counter for cumulative counts (requests, errors)
- Gauge for current state (CPU%, memory, connections)
- Histogram for distributions (latency, sizes)
Alert on symptoms, not causes: Users care about slow responses and errors, not CPU usage or memory consumption.
Less is more: A focused set of meaningful metrics beats hundreds of unused ones. Start small, add deliberately.
📋 Quick Reference: Metrics Framework Selection
| System Type | Recommended Framework | Key Metrics |
|---|---|---|
| Web API / Microservice | RED Method | Rate, Errors, Duration |
| Infrastructure / Server | USE Method | Utilization, Saturation, Errors |
| User-facing Service | Four Golden Signals | Latency, Traffic, Errors, Saturation |
| Database / Storage | USE + Custom | USE + Query latency, Replication lag |
| Queue / Message Broker | Custom | Queue depth, Age of oldest message, Consumer lag |
🧠 Memory Device: The Golden Signals
"LATE" helps you remember the Four Golden Signals:
- Latency - How long requests take
- Amount (Traffic) - How many requests
- Troubles (Errors) - How many fail
- Exhaustion (Saturation) - How full you are
📚 Further Study
Google SRE Book - Monitoring Distributed Systems: https://sre.google/sre-book/monitoring-distributed-systems/ - The definitive guide to the Four Golden Signals and practical monitoring philosophy
Brendan Gregg's USE Method: https://www.brendangregg.com/usemethod.html - Comprehensive explanation of the Utilization, Saturation, Errors framework for resource monitoring
Prometheus Best Practices: https://prometheus.io/docs/practices/naming/ - Industry-standard guidance on metric naming, cardinality, and instrumentation patterns for modern observability