
Signals vs Symptoms

Learn to distinguish between raw telemetry data and the surface manifestations of deeper issues

Signals vs Symptoms in Production Observability

Master the critical distinction between signals and symptoms with free flashcards and spaced repetition practice. This lesson covers signal types, symptom patterns, and root cause analysis techniques: essential concepts for building effective observability strategies in modern production systems.

Welcome

💻 In the world of production systems, confusion between signals and symptoms is one of the most common (and costly) mistakes teams make. When an alert fires at 3 AM, are you looking at the actual problem or just its visible effects? Understanding this distinction transforms how you approach incident response, build monitoring systems, and ultimately keep your services reliable.

Think of it like medicine: a fever is a symptom, but the infection causing it is the underlying signal your body is trying to communicate. In production systems, high latency might be your symptom, but the signal could be anything from memory pressure to network saturation to database lock contention.

Core Concepts

What Are Signals? 🔍

Signals are the fundamental data points emitted by your system that represent its actual state and behavior. They are the raw, objective measurements that tell you what's happening inside your infrastructure.

Think of signals as the primary sources of truth:

  • Metrics: Numerical measurements over time (CPU usage, request count, memory consumption)
  • Logs: Discrete event records with context (application errors, access logs, system events)
  • Traces: End-to-end request flows showing the path through your system
  • Events: State changes or significant occurrences (deployments, configuration changes, scaling events)

💡 Key insight: Signals are what your system actually does. They're the objective reality beneath everything else.

Signal Type | What It Measures | Example
Counter | Cumulative value that only increases | Total HTTP requests: 1,453,892
Gauge | Point-in-time value that can go up or down | Current memory usage: 4.2 GB
Histogram | Distribution of values over time | Request duration: p50=120ms, p99=850ms
Log entry | Timestamped event with structured data | 2026-01-15 14:23:01 ERROR: Connection timeout to db-primary
Span | Single operation within a distributed trace | API call → Database query (duration: 45ms)
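
As a concrete illustration, here is a minimal sketch of emitting the first three signal types with the Python prometheus_client library; the metric names, simulated values, and port are illustrative assumptions, not part of this lesson:

from prometheus_client import Counter, Gauge, Histogram, start_http_server
import random
import time

# Counter: cumulative value that only increases
http_requests_total = Counter("http_requests_total", "Total HTTP requests handled")

# Gauge: point-in-time value that can go up or down
memory_in_use_bytes = Gauge("memory_in_use_bytes", "Current memory usage in bytes")

# Histogram: distribution of observed values (request duration in seconds)
request_duration_seconds = Histogram("request_duration_seconds", "HTTP request duration")

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for a scraper to collect
    while True:
        http_requests_total.inc()                            # one more request served
        memory_in_use_bytes.set(random.uniform(3e9, 5e9))    # simulated reading
        request_duration_seconds.observe(random.uniform(0.05, 0.9))
        time.sleep(1)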

What Are Symptoms? 🌡️

Symptoms are the observable effects or manifestations that indicate something might be wrong. They're what users experience and what your monitoring typically alerts on first.

Symptoms are derived indicators:

  • High response times (derived from request duration metrics)
  • Error rate spikes (aggregated from multiple error signals)
  • Service unavailability (computed from health check failures)
  • Slow page loads (experienced by users, reflected in multiple signals)

⚠️ Critical distinction: Symptoms tell you something is wrong, but not why it's wrong.

THE SYMPTOM → SIGNAL RELATIONSHIP

    👀 User Experience
         │
         ↓
    🌡️ SYMPTOM (What you observe)
    "The website is slow"
         │
         ├───────────┬───────────┬───────────┐
         ↓           ↓           ↓           ↓
    📊 SIGNALS (What's actually happening)
    CPU: 95%    Memory:   Network:  Database:
    sustained   swapping  packet    query
                active    loss      timeouts

The Diagnostic Journey 🔬

When you're troubleshooting a production issue, you typically move through these layers:

Layer 1: Symptom Recognition → "Users are reporting errors"

Layer 2: Signal Collection → Gather metrics, logs, traces related to the symptom

Layer 3: Signal Analysis → Identify patterns, correlations, and anomalies

Layer 4: Root Cause → Determine the underlying reason (often found in signals you weren't initially monitoring)

Here's the relationship visualized:

┌─────────────────────────────────────────────────┐
│  OBSERVABILITY HIERARCHY                        │
├─────────────────────────────────────────────────┤
│                                                 │
│  👥 USER IMPACT (Business metrics)              │
│      ↑ Derived from                             │
│  🌡️ SYMPTOMS (SLIs/SLOs)                        │
│      ↑ Aggregated from                          │
│  📊 SIGNALS (Raw telemetry)                     │
│      ↑ Emitted by                               │
│  ⚙️ SYSTEM BEHAVIOR (Code + infrastructure)     │
│                                                 │
└─────────────────────────────────────────────────┘

Why the Distinction Matters 🎯

1. Alert Fatigue Prevention

When you alert on symptoms without understanding underlying signals, you get:

  • Multiple alerts for the same root cause
  • Confusion about what to fix first
  • Teams chasing their tails

2. Faster Mean Time to Resolution (MTTR)

Understanding signals helps you:

  • Skip directly to relevant data
  • Avoid investigating red herrings
  • Build better runbooks

3. Proactive Problem Detection

Signal-based monitoring catches issues before they become symptoms:

  • Memory slowly leaking → Signal visible hours before crash
  • Disk space growing → Signal visible days before full disk
  • Connection pool exhaustion → Signal visible minutes before timeout spike

💡 Mental model: Symptoms are lagging indicators (they tell you after the fact), while signals are leading indicators (they show you what's developing).
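
One practical way to use a signal as a leading indicator is to extrapolate its trend before it ever becomes a symptom. A minimal sketch, assuming hourly disk-usage samples are already available (the numbers below are made up):

# Hourly disk usage samples in GB (illustrative data)
samples = [412, 418, 425, 431, 438, 444]
capacity_gb = 500

# Estimate the growth rate from the first and last sample
hours_elapsed = len(samples) - 1
growth_per_hour = (samples[-1] - samples[0]) / hours_elapsed   # ~6.4 GB/hour

remaining_gb = capacity_gb - samples[-1]
hours_until_full = remaining_gb / growth_per_hour if growth_per_hour > 0 else float("inf")

# Alert on the trend well before the "disk full" symptom ever appears
if hours_until_full < 48:
    print(f"LEADING INDICATOR: disk projected to fill in ~{hours_until_full:.0f} hours")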

Common Signal-Symptom Pairs 🔗

Understanding typical relationships helps you build better observability:

Symptom (What Users See) | Potential Signals (What's Really Happening) | Root Cause Examples
500 errors increasing | Exception count by type, stack traces, database connection pool metrics | Database connection exhaustion, unhandled null pointers, dependency timeout
Slow API responses | Request duration by endpoint, CPU utilization, GC pause time, query execution time | N+1 database queries, memory pressure causing GC, external API degradation
Service unavailable | Pod restart count, OOM kill events, health check failures, load balancer errors | Memory leak, unhandled panic, misconfigured health check
Data inconsistency | Transaction rollback rate, replication lag, event processing delay | Race condition, failed distributed transaction, message queue backlog

Detailed Examples

Example 1: The Slow Checkout Flow 🛒

The Symptom: Your e-commerce checkout process is taking 5 seconds instead of the usual 500ms. Customers are abandoning carts.

The Surface Investigation (symptom-focused thinking):

  • "The checkout API is slow"
  • "We need to scale up checkout service"
  • "Let's add more cache"

The Signal-Driven Investigation:

Step | Signal Examined | What It Revealed
1 | Distributed trace of slow request | 95% of time spent in payment validation service
2 | Payment service response time metrics | Normal response time (50ms), but high retry count
3 | Network connection metrics | Connection establishment taking 4.8s on every retry
4 | DNS resolution logs | DNS lookups timing out, falling back to secondary resolver
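
Step 1 is the kind of question a few lines of analysis can answer once trace data is exported. A minimal sketch, assuming the spans of one slow request are available as plain dictionaries (the field names and numbers are illustrative, not a specific tracing API):

# One slow checkout request, flattened into (service, duration) spans
spans = [
    {"service": "checkout-api",       "duration_ms": 5050},  # root span
    {"service": "cart-service",       "duration_ms": 80},
    {"service": "payment-validation", "duration_ms": 4800},
    {"service": "inventory-service",  "duration_ms": 60},
]

root = max(spans, key=lambda s: s["duration_ms"])        # the whole request
children = [s for s in spans if s is not root]
worst = max(children, key=lambda s: s["duration_ms"])    # dominant child span

share = worst["duration_ms"] / root["duration_ms"]
print(f"{worst['service']} accounts for {share:.0%} of the slow request")
# -> payment-validation accounts for 95% of the slow request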

The Root Cause: DNS server was overloaded. Not a checkout problem, not a payment service problem, not a scaling problem.

The Fix: Implement DNS caching and add additional DNS servers. Resolved in 10 minutes vs. hours of scaling experiments.

💡 Lesson: If you'd just looked at the symptom ("checkout is slow") and scaled checkout services, you'd have wasted resources and time while users continued suffering.

Example 2: The Mysterious Memory Leak 🧠

The Symptom: Application crashes every 6 hours with Out Of Memory (OOM) errors.

Symptom-Level Response (treating the effect):

  • Restart the service when memory hits 90%
  • Increase memory allocation from 4GB to 8GB
  • Set up auto-restart on crash

Signal-Driven Investigation:

MEMORY USAGE OVER TIME (Signal Analysis)

8GB  ─                            ▲ OOM!
     │                        ╱╱╱╱
6GB  ─                    ╱╱╱╱
     │                ╱╱╱╱
4GB  ─            ╱╱╱╱          ← Steady linear growth
     │        ╱╱╱╱
2GB  ─    ╱╱╱╱
     │╱╱╱╱
0GB  ┼────┬────┬────┬────┬────┬────┬
     0h   1h   2h   3h   4h   5h   6h

Signal: Memory grows ~1.2GB/hour consistently

Deeper Signal Examination:

  1. Heap dump analysis (signal): Large number of HTTP connection objects never released
  2. Connection pool metrics (signal): Pool size growing unbounded
  3. Code review triggered by signals: Connection pool missing maxTotal configuration
  4. Log correlation (signal): Each failed request creates new connection but doesn't return it to pool

The Root Cause: Configuration error + error handling bug. Failed requests didn't properly release connections.

The Real Fix:

  • Set connection pool max size
  • Fix error handling to ensure connection release
  • Memory issue disappeared completely
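
A minimal sketch of the error-handling part of that fix, assuming a generic connection pool with acquire() and release() methods (hypothetical names, not a specific library):

# Hypothetical bounded pool (the missing maxTotal-style setting):
#   pool = ConnectionPool(max_size=50)

def call_payment_service(pool, payload):
    conn = pool.acquire()            # borrow a connection from the bounded pool
    try:
        return conn.post("/validate", payload)
    finally:
        pool.release(conn)           # returned even when the request raises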

⚠️ What symptom-focused approach missed: Increasing memory would just make it crash every 12 hours instead of 6. The leak would continue.

Example 3: The Dashboard That Lied 📊

The Symptom: Your monitoring dashboard shows "All Systems Green" ✅

User Reality: Customers can't log in, support tickets flooding in.

What Happened: Monitoring was tracking symptoms, not signals.

The Dashboard Showed (symptoms):

  • ✅ HTTP 200 response rate: 99.9%
  • ✅ Average response time: 150ms
  • ✅ CPU usage: 45%
  • ✅ Error rate: 0.1%

The Hidden Signals:

Signal | What It Actually Showed | Why the Dashboard Missed It
Authentication success rate | Dropped from 98% to 12% | Not monitored separately from the general success rate
Redis connection errors (logs) | Spiking to 1000/min | Logged but not aggregated into metrics
Session token validation failures | 95% failing | Application handled them gracefully, returned 200 with an error message
Cache hit rate | Dropped from 85% to 2% | Treated as a performance optimization metric, not a health signal

The Root Cause: Redis cluster failover happened, but application fell back to "graceful degradation" that returned HTTP 200 with "Please try again later" messages. Technically not errors, but users couldn't use the system.

💡 Lesson: Symptoms without context are meaningless. You need signals that reflect actual user journeys and business functions, not just technical health.
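
One way out of this trap is to make authentication success a first-class signal instead of letting it hide behind HTTP 200s. A minimal sketch with the Python prometheus_client library; the metric and label names are illustrative:

from prometheus_client import Counter

auth_attempts = Counter(
    "auth_attempts_total",
    "Login attempts by outcome",
    ["outcome"],   # success, bad_password, backend_error, ...
)

def record_login(user_ok: bool, backend_ok: bool) -> None:
    # Even if the HTTP layer returns 200 with a friendly message,
    # the outcome label preserves the real signal.
    if not backend_ok:
        auth_attempts.labels(outcome="backend_error").inc()
    elif not user_ok:
        auth_attempts.labels(outcome="bad_password").inc()
    else:
        auth_attempts.labels(outcome="success").inc()

# A symptom can then be derived and alerted on, e.g. in PromQL:
#   rate(auth_attempts_total{outcome="success"}[5m])
#     / rate(auth_attempts_total[5m]) < 0.9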

Example 4: The Cascading Failure 🌊

The Symptom: Everything is failing simultaneously across multiple services.

The Panic Response (symptom whack-a-mole):

  • Restart service A → Still failing
  • Scale up service B → Still failing
  • Rollback deployment of service C → Still failing
  • Declare major incident → Still failing

The Signal-Driven Approach:

SERVICE DEPENDENCY MAP WITH SIGNALS

    ┌─────────┐  ┌─────────┐  ┌─────────┐
    │Frontend │  │ Auth    │  │ Orders  │
    │ ❌ Fail │  │ ❌ Fail │  │ ❌ Fail │
    └────┬────┘  └────┬────┘  └────┬────┘
         │            │            │
         └────────────┼────────────┘
                      │ (all depend on)
                      ↓
              ┌───────────────┐
              │   Database    │
              │  ⚠️ Signal:   │
              │ Connection    │
              │ pool at 100%  │
              │ Wait time:    │
              │ 30 seconds    │
              └───────────────┘
                      ↓
              🔍 Root Cause:
              Long-running query
              holding locks,
              blocking all other
              transactions

The Critical Signals:

  1. Database connection wait time (signal): Jumped from 0ms to 30,000ms
  2. Active transaction count (signal): One transaction running for 45 minutes
  3. Lock wait time (signal): 200 queries waiting on same table lock
  4. Query logs (signal): Massive JOIN query without proper indexes
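
Signals 2 and 3 usually come straight from the database itself. Purely as an illustration (assuming the database here is PostgreSQL, which the example does not actually state), a sketch using psycopg2 to surface long-running transactions:

import psycopg2

# Connection string is illustrative; point it at your own database
conn = psycopg2.connect("dbname=app user=observer")

with conn.cursor() as cur:
    # pg_stat_activity exposes per-backend signals: transaction age, state, query text
    cur.execute("""
        SELECT pid, now() - xact_start AS xact_age, state, left(query, 80)
        FROM pg_stat_activity
        WHERE xact_start IS NOT NULL
        ORDER BY xact_age DESC
        LIMIT 5
    """)
    for pid, xact_age, state, query in cur.fetchall():
        print(pid, xact_age, state, query)

conn.close()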

The Real Fix: Kill the one problematic query. All services recovered in 30 seconds.

💡 Lesson: In cascading failures, start with shared dependencies and their signals. Most symptoms are downstream effects of a single root cause.

Common Mistakes

❌ Mistake 1: Alerting Only on Symptoms

The Problem: You get notified that something is wrong, but have no idea what's causing it.

🚨 ALERT: API Response Time > 1 second

Now what? You still need to:

  1. Look up what's happening (signals)
  2. Correlate multiple data sources
  3. Form hypotheses
  4. Test them

Better Approach: Alert includes relevant signals automatically.

🚨 ALERT: API Response Time > 1 second

Key Signals:
- Database query time: 950ms (↑ 300% from baseline)
- Cache hit rate: 15% (↓ from usual 85%)
- Redis latency: 12ms (normal)

Likely cause: Cache warming issue after deployment
Runbook: [link]
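
A minimal sketch of how such an enriched alert could be assembled, assuming the current values and baselines are already queryable (the signal names, numbers, and thresholds here are illustrative):

def enrich_alert(symptom: str, signals: dict, baselines: dict) -> str:
    # Attach each related signal plus its deviation from baseline, so the
    # alert answers "what changed?" and not just "something is slow"
    lines = [f"ALERT: {symptom}", "", "Key Signals:"]
    for name, value in signals.items():
        base = baselines.get(name)
        if base:
            change = (value - base) / base * 100
            lines.append(f"- {name}: {value} ({change:+.0f}% vs baseline {base})")
        else:
            lines.append(f"- {name}: {value}")
    return "\n".join(lines)

print(enrich_alert(
    "API response time > 1 second",
    {"db_query_ms": 950, "cache_hit_rate_pct": 15},
    {"db_query_ms": 240, "cache_hit_rate_pct": 85},
))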

❌ Mistake 2: Treating All Signals Equally

The Problem: You're drowning in data, unable to find the signal in the noise.

Not all signals are equally valuable:

Signal Priority | Examples | When to Check
🔴 Critical | User-facing errors, data corruption, security breaches | Alert immediately, wake someone up
🟡 Warning | High resource usage, elevated latency, increasing error rates | Monitor closely, may need intervention
🟢 Informational | Deployment events, scaling events, configuration changes | Check during investigation for context
⚪ Diagnostic | Detailed traces, debug logs, method-level metrics | Use only when debugging specific issues

❌ Mistake 3: Ignoring Signal Context

The Problem: A signal means different things in different contexts.

Example: CPU at 80%

  • During daily batch job at 3 AM: ✅ Normal
  • During low-traffic period on web server: ⚠️ Suspicious
  • After deployment: 🚨 Potential problem
  • On Black Friday: ✅ Expected

Better Approach: Context-aware signals

CPU usage: 80%
Context: Tuesday 2:47 PM (typical usage: 35%)
Change: +45 percentage points in 5 minutes
Correlation: Deployment completed 6 minutes ago
Verdict: 🚨 Investigate immediately
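
A minimal sketch of that kind of context-aware check, assuming the current reading, a typical-for-this-time baseline, and the last deployment time are already available (all values below are illustrative):

from datetime import datetime, timedelta

def assess_cpu(current_pct, baseline_pct, last_deploy, now=None):
    # A raw number is ambiguous; the baseline and recent changes give it meaning
    now = now or datetime.now()
    delta = current_pct - baseline_pct
    recently_deployed = (now - last_deploy) < timedelta(minutes=15)

    if delta > 30 and recently_deployed:
        return "Investigate immediately: large jump right after a deployment"
    if delta > 30:
        return "Suspicious: well above the typical value for this time"
    return "Within normal range for this time of week"

print(assess_cpu(
    current_pct=80,
    baseline_pct=35,
    last_deploy=datetime.now() - timedelta(minutes=6),
))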

❌ Mistake 4: The "Dashboard Operator" Trap

The Problem: Staring at dashboards waiting for symptoms to appear.

Why It Fails:

  • Humans are terrible at sustained attention
  • Symptoms appear only after users are affected
  • You're reactive, not proactive

Better Approach: Automated signal analysis

SIGNAL ANALYSIS WORKFLOW

📊 Signals → 🤖 Automated Analysis → 🎯 Smart Alerts
                    │
                    ├→ Anomaly detection
                    ├→ Pattern recognition
                    ├→ Predictive analysis
                    ├→ Correlation discovery
                    └→ Root cause suggestion
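
Even a crude statistical check covers much of the distance from raw signals to smart alerts. A minimal anomaly-detection sketch using a mean and standard deviation over recent values (the window and threshold are arbitrary choices):

import statistics

def is_anomalous(history, latest, threshold=3.0):
    # Flag a point more than `threshold` standard deviations from the
    # recent mean (the classic three-sigma rule of thumb)
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > threshold

latency_ms = [118, 124, 121, 119, 125, 122, 120, 123]
print(is_anomalous(latency_ms, 127))   # False: within normal variation
print(is_anomalous(latency_ms, 480))   # True: clear outlier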

❌ Mistake 5: Collecting Signals Without Purpose

The Problem: "Let's collect everything just in case!"

Why It Fails:

  • Storage costs explode
  • Query performance degrades
  • Important signals buried in noise
  • Nobody knows what signals mean

Better Approach: Signal taxonomy with purpose

For each signal, document:

  • What it measures (definition)
  • Why it matters (impact)
  • When to check it (context)
  • How to interpret it (thresholds)
  • Where it fits in debugging (common use cases)
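
One lightweight way to keep that documentation next to the code is a small signal catalog. A minimal sketch whose fields mirror the list above (the example entry is illustrative):

from dataclasses import dataclass

@dataclass
class SignalSpec:
    name: str        # what it measures (definition)
    impact: str      # why it matters
    context: str     # when to check it
    thresholds: str  # how to interpret it
    use_cases: str   # where it fits in debugging

CATALOG = [
    SignalSpec(
        name="db_connection_pool_in_use",
        impact="Exhaustion blocks every request that needs the database",
        context="During latency spikes or cascading failures",
        thresholds=">80% sustained is a warning; 100% is critical",
        use_cases="First check when multiple services degrade at once",
    ),
]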

Key Takeaways

🎯 The Core Principle

Signals are data. Symptoms are interpretations.

Your observability strategy should:

  1. Collect comprehensive signals
  2. Aggregate signals into symptoms
  3. Alert on symptoms
  4. Investigate using signals
  5. Resolve root causes (found in signals)

🧠 Mental Models to Remember

The Medical Analogy:

  • Fever (symptom) ← Infection (signal) ← Bacteria (root cause)
  • High latency (symptom) ← CPU saturation (signal) ← Memory leak (root cause)

The Detective Analogy:

  • Crime scene (symptom) ← Evidence (signals) ← Perpetrator (root cause)
  • Service down (symptom) ← Crash logs (signals) ← Null pointer (root cause)

🔧 Practical Implementation

When building observability:

  1. Start with signals (what can the system tell us?)
  2. Define symptoms (what do users care about?)
  3. Map relationships (which signals cause which symptoms?)
  4. Build workflows (symptom detected → relevant signals → guided investigation)

When responding to incidents:

  1. Acknowledge the symptom (what's broken?)
  2. Gather signals (what's the data saying?)
  3. Form hypothesis (what could explain these signals?)
  4. Test hypothesis (do other signals confirm this?)
  5. Find root cause (what signal points to the source?)

📊 The Observability Stack

📋 Quick Reference: Signals vs Symptoms

Aspect | Signals | Symptoms
Nature | Raw data, objective measurements | Derived indicators, interpreted state
Purpose | Diagnosis and root cause analysis | Detection and awareness
Volume | High cardinality, detailed | Low cardinality, aggregated
Examples | Memory bytes used, query duration, error logs | Service degraded, high latency, elevated errors
Alert on? | Rarely (only critical signals) | Yes (user-impacting symptoms)
Who cares? | Engineers during investigation | Everyone (users, business, on-call)
Time horizon | Leading indicator (early warning) | Lagging indicator (problem already exists)
Cardinality | Thousands to millions of unique streams | Dozens to hundreds of indicators

🚀 Next Steps in Your Journey

You've learned to distinguish signals from symptoms. This foundational shift in thinking prepares you for:

  • Building effective SLIs/SLOs (next lesson: translating business impact into measurable symptoms)
  • Designing alert strategies (knowing when to alert on symptoms vs. signals)
  • Root cause analysis techniques (signal correlation, anomaly detection, causation)
  • Observability architecture (collecting, storing, and querying signals at scale)

💡 Remember: The goal isn't to collect more data; it's to collect the right signals and understand which symptoms they produce.

🎓 Lesson complete! Practice with the flashcards above and test your understanding with the quiz questions. The next lesson will build on this foundation to explore how signals aggregate into Service Level Indicators.