
Metrics That Matter

Master counters, gauges, histograms, RED/USE methods, and why averages mislead during incidents

Master the critical metrics for production observability with free flashcards and spaced repetition practice. This lesson covers the four golden signals, service-level indicators (SLIs), and metric cardinality, all essential concepts for building reliable distributed systems in 2026 and beyond.

Welcome to Metrics That Matter 💻

In the world of production observability, you can measure almost anything. CPU usage, memory consumption, network packets, disk I/O, request counts, error rates, cache hits; the list goes on endlessly. But here's the challenge: not all metrics are created equal.

In this lesson, you'll learn to distinguish signal from noise. We'll explore the metrics that actually help you understand system health, detect incidents early, and diagnose problems quickly. You'll discover frameworks like the Four Golden Signals and learn how to define meaningful Service Level Indicators (SLIs) that align with user experience.

💡 Pro tip: The best engineers don't monitor everything; they monitor what matters. This lesson will teach you exactly what that means.

Core Concepts: Understanding Essential Metrics 📊

The Four Golden Signals 🔆

Google's Site Reliability Engineering (SRE) book introduced the Four Golden Signals, a fundamental framework for monitoring distributed systems. These four metrics provide a complete picture of system health:

Signal     | What It Measures            | Why It Matters
Latency    | Time to service a request   | Directly impacts user experience
Traffic    | Demand on your system       | Shows utilization and growth
Errors     | Rate of failed requests     | Indicates correctness issues
Saturation | How "full" your service is  | Predicts capacity problems

Let's explore each signal in detail:

1. Latency ⏱️

Latency measures how long it takes to service a request. But there's a critical nuance: you must track successful request latency and failed request latency separately.

Why? A failed request might return instantly (say, a 400 Bad Request), making your average latency look great even when users are experiencing errors. Always separate these:

  • Success latency: P50, P95, P99 of successful requests
  • Error latency: Response time for failed requests

💡 Key insight: Focus on percentiles (P95, P99) rather than averages. An average of 100ms might hide the fact that 5% of your users wait 10 seconds.

LATENCY DISTRIBUTION

    │
100%├───────────────────────────
    │                          ▓▓
    │                       ▓▓▓▓▓
 99%├──────────────────────▓▓▓▓▓
    │                  ▓▓▓▓▓▓▓
    │             ▓▓▓▓▓▓▓▓▓
 95%├──────────▓▓▓▓▓▓▓▓▓
    │      ▓▓▓▓▓▓▓▓▓
    │  ▓▓▓▓▓▓▓▓
 50%├▓▓▓▓▓▓
    └─┴─┴─┴─┴─┴─┴─┴─┴─┴─
     10 50 100 200 500 1000ms

P50 = 50ms (median user)
P95 = 200ms (slower users)
P99 = 800ms (worst experience)
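
To keep success and error latency separate in practice, here is a minimal sketch using the Python prometheus_client library (an assumption; any metrics SDK follows the same pattern). The metric name request_latency_seconds and the outcome label are illustrative, not a standard:

import time
from prometheus_client import Histogram
# One histogram, labeled by outcome, so failed requests never skew the success distribution.
REQUEST_LATENCY = Histogram(
    "request_latency_seconds",
    "Time spent servicing requests",
    ["outcome"],  # only "success" or "error": bounded cardinality
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)
def timed(handler):
    start = time.perf_counter()
    try:
        response = handler()
        REQUEST_LATENCY.labels(outcome="success").observe(time.perf_counter() - start)
        return response
    except Exception:
        REQUEST_LATENCY.labels(outcome="error").observe(time.perf_counter() - start)
        raise

P95 and P99 are then derived from the exported buckets at query time (in Prometheus, via histogram_quantile over the _bucket series) rather than computed in the application.
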
2. Traffic 🚦

Traffic measures the demand on your system. What you measure depends on your system type:

  • Web service: HTTP requests per second
  • Database: Transactions or queries per second
  • Streaming service: Network I/O rate or sessions per second
  • IoT system: Messages per second

Traffic helps you understand:

  • Normal patterns: Daily/weekly cycles
  • Growth trends: Are you scaling appropriately?
  • Incident correlation: Did a traffic spike cause the outage?

3. Errors ❌

The error rate tells you when things break. Track errors as:

  • Rate: Errors per second (absolute)
  • Ratio: Error percentage of total requests (relative)

Both matter! A service handling 1000 req/s with 1% errors is generating 10 errors/second, which is significantly different from a service at 10 req/s with 1% errors (0.1 errors/s).
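
To make the difference concrete, here is a tiny sketch (numbers invented) that turns counter increases over a window into both forms:

def error_rate_and_ratio(errors_delta, requests_delta, window_seconds):
    # Absolute: errors per second; relative: fraction of requests that failed.
    rate = errors_delta / window_seconds
    ratio = errors_delta / requests_delta if requests_delta else 0.0
    return rate, ratio
# 1000 req/s with 1% errors, observed over a 60-second window:
print(error_rate_and_ratio(errors_delta=600, requests_delta=60_000, window_seconds=60))
# -> (10.0, 0.01): ten errors every second, yet "only" a 1% error ratio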

Critical distinction: Explicit vs. implicit errors

Explicit Errors     | Implicit Errors
HTTP 500 responses  | HTTP 200 with wrong content
Exception thrown    | Request timeout (no response)
Failed health check | Slow response violating SLA

💡 Pro tip: Don't ignore 4xx errors completely. While they're client errors, a sudden spike in 400s or 404s might indicate a broken client deployment or API change.

4. Saturation 📈

Saturation measures how "full" your service is. This is the most complex signal because different resources saturate differently:

  • CPU: 80%+ utilization typically problematic
  • Memory: Depends on language (garbage collection impact)
  • Disk I/O: Queue depth and IOPS limits
  • Network: Bandwidth utilization
  • Thread pools: Active/max threads ratio

Saturation is predictive: it tells you problems are coming before they arrive.

SATURATION WARNING ZONES

100% ─━━━━━━━━━━━━━━━━━━━━━  🔴 Critical
     │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
 90% ─▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  🟡 Warning
     │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒
 70% ─▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒  🟢 Healthy
     │░░░░░░░░░░░░░░░░░░░░░
  0% ─░░░░░░░░░░░░░░░░░░░░░
     └────────────────────
         CPU Usage
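
As a sketch (again assuming prometheus_client, with a hypothetical pool object exposing .active and .max_size), saturation is typically exported as a pair of gauges so the "how full" ratio can be graphed and alerted on:

from prometheus_client import Gauge
POOL_ACTIVE = Gauge("db_pool_active_connections", "Connections currently in use")
POOL_MAX = Gauge("db_pool_max_connections", "Configured pool size")
def export_pool_saturation(pool):
    # `pool` is a hypothetical object; substitute your own pool/executor stats here
    POOL_ACTIVE.set(pool.active)
    POOL_MAX.set(pool.max_size)
# Alerting idea: fire when active / max stays above 0.8, i.e. the pool is 80%+ full
# and queueing (true saturation) is about to begin.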

Service Level Indicators (SLIs) 🎯

While the Four Golden Signals provide a framework, Service Level Indicators (SLIs) are the specific metrics you choose to measure service quality. An SLI is a carefully selected metric that represents user experience.

Anatomy of a good SLI:

  1. User-centric: Reflects what users actually care about
  2. Measurable: Can be reliably collected
  3. Proportional: Expressed as a ratio (good events / total events)
  4. Actionable: Changes suggest specific fixes

Common SLI patterns:

SLI Type     | Example             | Formula
Availability | Successful requests | (successful requests) / (total requests)
Latency      | Fast requests       | (requests < 200ms) / (total requests)
Quality      | Valid responses     | (correct results) / (total results)
Durability   | Data retention      | (records retained) / (records written)

Example SLI definition:

"The proportion of HTTP GET requests for /api/users that return a 2xx status code and complete within 300ms, measured at the load balancer."

Notice the specificity:

  • ✅ Specific endpoint (/api/users)
  • ✅ Specific method (GET)
  • ✅ Success criteria (2xx, <300ms)
  • ✅ Measurement point (load balancer)

💡 SLI vs SLO vs SLA:

  • SLI: The metric you measure ("99.5% of requests succeeded")
  • SLO: Your target ("We aim for 99.9% success")
  • SLA: The contract ("We guarantee 99.5% or you get a refund")
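
Because a good SLI is just good events divided by total events, computing one is deliberately simple. A minimal sketch for the availability SLI defined above (the event records are hypothetical):

def availability_sli(events):
    # events: iterable of dicts like {"status": 200, "latency_ms": 143}
    total = good = 0
    for e in events:
        total += 1
        # "Good" per the definition above: 2xx status and under 300 ms
        if 200 <= e["status"] < 300 and e["latency_ms"] < 300:
            good += 1
    return good / total if total else 1.0  # treat "no traffic" as meeting the target
sample = [{"status": 200, "latency_ms": 120},
          {"status": 200, "latency_ms": 450},
          {"status": 500, "latency_ms": 30}]
print(availability_sli(sample))  # 0.333...: only one of three requests was "good"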

RED Method 🔴

An alternative to the Four Golden Signals, the RED method focuses on request-driven services:

  • Rate: Requests per second
  • Errors: Failed requests per second
  • Duration: Request latency distribution

RED is essentially a simplified version of the Golden Signals, perfect for microservices and APIs.

RED METHOD DASHBOARD LAYOUT

┌─────────────────────────────────────┐
│  📊 RATE (Traffic)                  │
│  ▁▂▃▄▅▆▇█▇▆▅▄▃▂▁   1,247 req/s     │
├─────────────────────────────────────┤
│  ❌ ERRORS                          │
│  ▁▁▁▁▁▂▃▄▃▂▁▁▁▁▁   0.3% error rate │
├─────────────────────────────────────┤
│  ⏱️ DURATION (Latency)              │
│  P50: 45ms                          │
│  P95: 180ms                         │
│  P99: 520ms                         │
└─────────────────────────────────────┘
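
A sketch of RED instrumentation around a single handler, again using prometheus_client with illustrative metric names (one counter for rate, one for errors, one histogram for duration):

import time
from prometheus_client import Counter, Histogram
REQUESTS = Counter("svc_requests_total", "Requests received", ["endpoint"])            # Rate
ERRORS = Counter("svc_errors_total", "Requests that failed", ["endpoint"])             # Errors
DURATION = Histogram("svc_request_duration_seconds", "Request latency", ["endpoint"])  # Duration
def call_with_red(endpoint, handler, *args, **kwargs):
    REQUESTS.labels(endpoint=endpoint).inc()
    start = time.perf_counter()
    try:
        return handler(*args, **kwargs)
    except Exception:
        ERRORS.labels(endpoint=endpoint).inc()
        raise
    finally:
        DURATION.labels(endpoint=endpoint).observe(time.perf_counter() - start)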

USE Method 🔧

For resource-oriented monitoring (servers, containers, databases), Brendan Gregg developed the USE method:

  • Utilization: % time resource is busy
  • Saturation: Degree of queued work
  • Errors: Count of error events

Apply USE to every resource:

Resource | Utilization      | Saturation              | Errors
CPU      | CPU usage %      | Run queue length        | CPU errors (rare)
Memory   | Memory used %    | Swap usage, page faults | OOM kills
Disk     | IO utilization % | IO queue depth          | Read/write errors
Network  | Bandwidth used % | Buffer overruns         | Packet loss, retransmits
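
For host-level resources, the Utilization and Saturation columns can be sampled with the standard psutil library, as in this rough sketch; the Errors column usually comes from kernel or hardware counters, so it is omitted here:

import psutil
def use_snapshot():
    # Rough Utilization/Saturation snapshot for CPU and memory
    load1, _, _ = psutil.getloadavg()
    return {
        "cpu_utilization_pct": psutil.cpu_percent(interval=1),    # % of time the CPU was busy
        "cpu_saturation_load1": load1,                            # runnable tasks queued (1-min load)
        "mem_utilization_pct": psutil.virtual_memory().percent,   # memory in use
        "mem_saturation_swap_pct": psutil.swap_memory().percent,  # swapping signals memory pressure
    }
print(use_snapshot())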

Metric Cardinality and Dimensionality 📏

As systems grow complex, you'll want to add dimensions (labels/tags) to metrics:

http_requests_total{method="GET", endpoint="/api/users", status="200"}

Cardinality = the number of unique time series created by dimension combinations.

Low cardinality (good):

  • method: 5-10 values (GET, POST, PUT, DELETE...)
  • status: ~15 values (200, 201, 400, 404, 500...)
  • endpoint: Dozens to hundreds (your API routes)

High cardinality (dangerous):

  • ❌ user_id: Millions of values
  • ❌ session_id: Unbounded
  • ❌ ip_address: Hundreds of thousands
  • ❌ timestamp: Infinite

⚠️ Cardinality explosion: With 10 methods × 100 endpoints × 15 statuses = 15,000 time series. Add a high-cardinality dimension like user_id (1M values) and you get 15 billion time series, which will crash your monitoring system!
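
The arithmetic is worth doing explicitly whenever you add a label; a two-line sketch makes the explosion obvious:

from math import prod
label_values = {"method": 10, "endpoint": 100, "status": 15}
print(prod(label_values.values()))              # 15,000 series: manageable
print(prod(label_values.values()) * 1_000_000)  # add user_id -> 15,000,000,000 series: not manageable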

💡 Cardinality best practices:

  1. Keep dimension values bounded
  2. Use aggregation for high-cardinality data ("top 10 users" not "all users")
  3. Store high-cardinality data in logs/traces, not metrics
  4. Monitor your monitoring system's cardinality

Counter, Gauge, and Histogram 📊

Metric types determine how data is stored and queried:

Counter 🔢: Cumulative value that only increases

  • Examples: total requests, total errors, bytes sent
  • Use for: Rates (calculate rate of change)
  • Resets to 0 on restart

http_requests_total: 1547 → 1548 → 1549 → 0 (restart) → 1 → 2

Gauge 📏: Instantaneous value that can go up or down

  • Examples: CPU usage, memory consumption, queue depth, active connections
  • Use for: Current state snapshots
  • Value is meaningful at any point in time

cpu_usage_percent: 45.2 → 67.8 → 52.1 → 71.5

Histogram 📊: Distribution of values in buckets

  • Examples: request latency, response size
  • Use for: Percentile calculations (P95, P99)
  • Pre-aggregated on client side
http_request_duration_seconds_bucket{le="0.1"} = 850
http_request_duration_seconds_bucket{le="0.5"} = 920
http_request_duration_seconds_bucket{le="1.0"} = 950
http_request_duration_seconds_bucket{le="+Inf"} = 1000
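
Percentiles are then estimated from those cumulative bucket counts at query time; this is roughly what PromQL's histogram_quantile does. A simplified sketch using the counts above, with linear interpolation inside the bucket that contains the target rank:

def quantile_from_buckets(q, buckets):
    # buckets: sorted list of (upper_bound, cumulative_count), ending with +Inf
    total = buckets[-1][1]
    target = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into the +Inf bucket
            fraction = (target - prev_count) / (count - prev_count)
            return prev_bound + fraction * (bound - prev_bound)
        prev_bound, prev_count = bound, count
data = [(0.1, 850), (0.5, 920), (1.0, 950), (float("inf"), 1000)]
print(quantile_from_buckets(0.95, data))  # 1.0 -> the 95th percentile sits at the top of the 0.5s-1.0s bucket
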
METRIC TYPE DECISION TREE

What are you measuring?
│
├─ A total count that only accumulates?
│     → COUNTER    (examples: requests, errors, bytes)
│
└─ A state value?
      ├─ Goes up and down?
      │     → GAUGE      (examples: CPU %, memory, queue depth)
      │
      └─ A distribution of observations?
            → HISTOGRAM  (examples: latency, size, score)

Real-World Examples 🌍

Example 1: E-commerce Checkout Service

Let's design metrics for a critical checkout service:

Golden Signals implementation:

  1. Latency:

    • checkout_duration_seconds (histogram)
    • Buckets: 0.1s, 0.5s, 1s, 2s, 5s
    • Track P95 and P99
    • Alert if P95 > 2 seconds
  2. Traffic:

    • checkout_requests_total (counter)
    • Dimensions: {method, status}
    • Calculate rate: requests/second
  3. Errors:

    • checkout_errors_total (counter)
    • Dimensions: {error_type, payment_provider}
    • Alert if error rate > 1%
  4. Saturation:

    • payment_api_pool_active (gauge): active connections
    • payment_api_pool_max (gauge): max connections
    • Alert if ratio > 0.8 (80% capacity)

SLI definition: "99.5% of checkout requests complete successfully with <2s latency, measured over 30-day windows"
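
Under the same assumption as earlier (prometheus_client as the metrics SDK), the four signals listed above could be declared like this; the names and labels follow the list, everything else is a sketch:

from prometheus_client import Counter, Gauge, Histogram
CHECKOUT_DURATION = Histogram("checkout_duration_seconds", "Checkout latency",
                              buckets=(0.1, 0.5, 1.0, 2.0, 5.0))                          # Latency
CHECKOUT_REQUESTS = Counter("checkout_requests_total", "Checkout requests",
                            ["method", "status"])                                         # Traffic
CHECKOUT_ERRORS = Counter("checkout_errors_total", "Checkout failures",
                          ["error_type", "payment_provider"])                             # Errors
PAYMENT_POOL_ACTIVE = Gauge("payment_api_pool_active", "Active payment API connections")  # Saturation
PAYMENT_POOL_MAX = Gauge("payment_api_pool_max", "Max payment API connections")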

Example 2: Video Streaming Platform

User-centric SLIs:

  1. Availability SLI:

    • "Proportion of video start requests that return a playback URL within 500ms"
    • Metric: (successful_starts < 500ms) / (total_start_requests)
    • Target: 99.9%
  2. Quality SLI:

    • "Proportion of viewing time without buffering events"
    • Metric: (smooth_playback_seconds) / (total_playback_seconds)
    • Target: 99.5%
  3. Latency SLI:

    • "Proportion of seek operations that complete within 1 second"
    • Metric: (seeks < 1s) / (total_seeks)
    • Target: 99%

Metric Name              | Type      | Dimensions          | Purpose
video_start_duration_ms  | Histogram | region, device_type | Latency SLI
playback_seconds_total   | Counter   | quality, codec      | Usage tracking
buffer_events_total      | Counter   | region, cdn         | Quality SLI
cdn_capacity_percent     | Gauge     | region, pop         | Saturation

Example 3: Database Service Metrics

USE Method applied to PostgreSQL:

Utilization:

  • pg_connections_active / pg_connections_max: Connection pool usage
  • pg_cpu_percent: Database CPU usage
  • pg_disk_io_utilization: Disk busy time

Saturation:

  • pg_locks_waiting: Queries waiting for locks
  • pg_replication_lag_bytes: Replica lag
  • pg_disk_queue_depth: Pending disk operations

Errors:

  • pg_deadlocks_total: Deadlock count
  • pg_connection_errors_total: Failed connections
  • pg_transaction_rollbacks_total: Rolled-back transactions

💡 Pro insight: For databases, slow queries are often more critical than total query count. Track:

  • pg_slow_queries_total{threshold="1s"}: Queries exceeding 1 second
  • pg_query_duration_p99: 99th percentile query time

Example 4: Microservices Dashboard

Imagine you manage 50 microservices. You can't monitor everything; focus on service-level metrics using RED:

Per-service metrics:

# Rate
service_requests_per_second{service="auth", method="POST", endpoint="/login"}

# Errors
service_error_rate{service="auth", status="5xx"}

# Duration
service_request_duration_p95{service="auth", endpoint="/login"}

Cross-service dependencies:

# Upstream calls
service_dependency_requests{from="api-gateway", to="auth", status="200"}
service_dependency_latency_p99{from="api-gateway", to="auth"}

SERVICE DEPENDENCY MAP

┌──────────┐
│  Users   │
└─────┬────┘
      │
      ↓
┌──────────────┐
│ API Gateway  │ ← RED metrics
└──┬────┬────┬─┘
   │    │    │
   ↓    ↓    ↓
┌────┐┌────┐┌──────┐
│Auth││Cart││Search│ ← RED metrics each
└──┬─┘└──┬─┘└───┬──┘
   │     │      │
   ↓     ↓      ↓
┌──────────────────┐
│   Database       │ ← USE metrics
└──────────────────┘

Top-level: User-facing latency
Mid-level: Service RED metrics
Bottom: Resource USE metrics

Common Mistakes ⚠️

1. Measuring Everything

❌ Wrong: "Let's collect 500 metrics per service; more data is better!"

✅ Right: Start with the Four Golden Signals, then add metrics only when you have specific questions to answer.

Why it matters: Too many metrics create:

  • Alert fatigue (which metrics actually matter?)
  • High storage costs
  • Slow dashboards
  • Analysis paralysis

2. Ignoring Cardinality

❌ Wrong:

http_requests{user_id="user123", session_id="abc456", trace_id="xyz789"}

✅ Right:

http_requests{endpoint="/api/users", method="GET", status="200"}

Why it matters: High-cardinality dimensions explode your time series database. User IDs belong in logs and traces, not metrics.

3. Averaging Latency

❌ Wrong: "Our average latency is 100ms, we're doing great!"

✅ Right: "P50 is 80ms, P95 is 200ms, P99 is 1.2s; we need to investigate that long tail."

Why it matters: Averages hide outliers. The worst 1% of user experiences matter enormously to user satisfaction.
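
A quick experiment shows how much the average hides. With mostly fast requests and a couple of slow stragglers (numbers invented), the mean looks acceptable while the tail clearly is not:

from math import ceil
from statistics import mean
latencies_ms = [80] * 90 + [200] * 8 + [2000] * 2  # 90 fast, 8 medium, 2 very slow requests
def pctl(values, p):
    # Nearest-rank percentile: the smallest value that at least p% of samples fall at or below
    ordered = sorted(values)
    return ordered[ceil(p / 100 * len(ordered)) - 1]
print(round(mean(latencies_ms)))  # 128  -> "average latency is ~130ms, looks fine"
print(pctl(latencies_ms, 50))     # 80   -> the median user is happy
print(pctl(latencies_ms, 99))     # 2000 -> 1 in 50 users waits two full seconds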

4. Alert on Vanity Metrics

❌ Wrong: Alert when CPU usage > 50%

✅ Right: Alert when P99 latency > 2s OR error rate > 1%

Why it matters: Users don't care about CPU usage; they care about slow responses and errors. Alert on symptoms (user impact), not causes (resource usage).

5. Missing the "Why"

❌ Wrong: Track users_logged_in_total (just a counter)

✅ Right: Track all three:

  • login_requests_total{status="success"}: How many tried
  • login_duration_seconds: How long it took
  • login_errors_total{reason="invalid_password"}: Why it failed

Why it matters: Metrics should help you diagnose problems, not just detect them.

6. No Percentile Targets

❌ Wrong: "Our latency SLO is <500ms"

✅ Right: "Our P95 latency SLO is <500ms"

Why it matters: Without specifying a percentile, your SLO is meaningless. Does "<500ms" mean all requests? Just the median?

7. Forgetting About Data Freshness

❌ Wrong: Using 5-minute average metrics for incident response

✅ Right: Use 10-second granularity for dashboards, 1-minute for alerts

Why it matters: 5-minute averages mean you detect incidents 2.5 minutes late on average. In production, minutes matter.

Key Takeaways 🎯

  1. Start with the Four Golden Signals: Latency, Traffic, Errors, Saturation cover 90% of observability needs for request-driven services.

  2. Define clear SLIs: Service Level Indicators should be user-centric, measurable, and directly tied to user experience.

  3. Use the right method for the job:

    • RED for services and APIs
    • USE for resources (CPU, memory, disk)
    • Four Golden Signals for overall system health
  4. Mind your cardinality: Keep metric dimensions bounded. High-cardinality data belongs in logs and traces, not metrics.

  5. Think in percentiles, not averages: P95 and P99 latency matter more than mean latency for understanding user experience.

  6. Choose the right metric type:

    • Counter for cumulative counts (requests, errors)
    • Gauge for current state (CPU%, memory, connections)
    • Histogram for distributions (latency, sizes)
  7. Alert on symptoms, not causes: Users care about slow responses and errors, not CPU usage or memory consumption.

  8. Less is more: A focused set of meaningful metrics beats hundreds of unused ones. Start small, add deliberately.

📋 Quick Reference: Metrics Framework Selection

System Type             | Recommended Framework | Key Metrics
Web API / Microservice  | RED Method            | Rate, Errors, Duration
Infrastructure / Server | USE Method            | Utilization, Saturation, Errors
User-facing Service     | Four Golden Signals   | Latency, Traffic, Errors, Saturation
Database / Storage      | USE + Custom          | USE + Query latency, Replication lag
Queue / Message Broker  | Custom                | Queue depth, Age of oldest message, Consumer lag

🧠 Memory Device: The Golden Signals

"LATE" helps you remember the Four Golden Signals:

  • Latency - How long requests take
  • Amount (Traffic) - How many requests
  • Troubles (Errors) - How many fail
  • Exhaustion (Saturation) - How full you are

📚 Further Study

  1. Google SRE Book - Monitoring Distributed Systems: https://sre.google/sre-book/monitoring-distributed-systems/ - The definitive guide to the Four Golden Signals and practical monitoring philosophy

  2. Brendan Gregg's USE Method: https://www.brendangregg.com/usemethod.html - Comprehensive explanation of the Utilization, Saturation, Errors framework for resource monitoring

  3. Prometheus Best Practices: https://prometheus.io/docs/practices/naming/ - Industry-standard guidance on metric naming, cardinality, and instrumentation patterns for modern observability