
Signals vs Symptoms

Learn to distinguish between raw telemetry data and the surface manifestations of deeper issues

Signals vs Symptoms in Production Observability

Master the critical distinction between signals and symptoms with free flashcards and spaced repetition practice. This lesson covers signal types, symptom patterns, and root cause analysis techniques: essential concepts for building effective observability strategies in modern production systems.

Welcome

💻 In the world of production systems, confusion between signals and symptoms is one of the most common (and costly) mistakes teams make. When an alert fires at 3 AM, are you looking at the actual problem or just its visible effects? Understanding this distinction transforms how you approach incident response, build monitoring systems, and ultimately keep your services reliable.

Think of it like medicine: a fever is a symptom, but the infection causing it is the underlying signal your body is trying to communicate. In production systems, high latency might be your symptom, but the signal could be anything from memory pressure to network saturation to database lock contention.

Core Concepts

What Are Signals? 🔍

Signals are the fundamental data points emitted by your system that represent its actual state and behavior. They are the raw, objective measurements that tell you what's happening inside your infrastructure.

Think of signals as the primary sources of truth:

  • Metrics: Numerical measurements over time (CPU usage, request count, memory consumption)
  • Logs: Discrete event records with context (application errors, access logs, system events)
  • Traces: End-to-end request flows showing the path through your system
  • Events: State changes or significant occurrences (deployments, configuration changes, scaling events)

💡 Key insight: Signals are what your system actually does. They're the objective reality beneath everything else.

Signal Type | What It Measures | Example
Counter | Cumulative value that only increases | Total HTTP requests: 1,453,892
Gauge | Point-in-time value that can go up or down | Current memory usage: 4.2 GB
Histogram | Distribution of values over time | Request duration: p50=120ms, p99=850ms
Log entry | Timestamped event with structured data | 2026-01-15 14:23:01 ERROR: Connection timeout to db-primary
Span | Single operation within a distributed trace | API call → Database query (duration: 45ms)
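
As a concrete illustration, here is a minimal sketch of emitting the first three signal types with the Python prometheus_client library; the metric names, simulated values, and port are illustrative assumptions, not part of this lesson:

from prometheus_client import Counter, Gauge, Histogram, start_http_server
import random
import time

# Counter: cumulative value that only increases
http_requests_total = Counter("http_requests_total", "Total HTTP requests handled")

# Gauge: point-in-time value that can go up or down
memory_in_use_bytes = Gauge("memory_in_use_bytes", "Current memory usage in bytes")

# Histogram: distribution of observed values (request duration in seconds)
request_duration_seconds = Histogram("request_duration_seconds", "HTTP request duration")

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for a scraper to collect
    while True:
        http_requests_total.inc()                            # one more request served
        memory_in_use_bytes.set(random.uniform(3e9, 5e9))    # simulated reading
        request_duration_seconds.observe(random.uniform(0.05, 0.9))
        time.sleep(1)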

What Are Symptoms? 🌡️

Symptoms are the observable effects or manifestations that indicate something might be wrong. They're what users experience and what your monitoring typically alerts on first.

Symptoms are derived indicators:

  • High response times (derived from request duration metrics)
  • Error rate spikes (aggregated from multiple error signals)
  • Service unavailability (computed from health check failures)
  • Slow page loads (experienced by users, reflected in multiple signals)

⚠️ Critical distinction: Symptoms tell you something is wrong, but not why it's wrong.

THE SYMPTOM → SIGNAL RELATIONSHIP

    👀 User Experience
         │
         ↓
    🌡️ SYMPTOM (What you observe)
    "The website is slow"
         │
         ├───────────┬───────────┬───────────┐
         ↓           ↓           ↓           ↓
    📊 SIGNALS (What's actually happening)
    CPU: 95%    Memory:   Network:  Database:
    sustained   swapping  packet    query
                active    loss      timeouts

The Diagnostic Journey 🔬

When you're troubleshooting a production issue, you typically move through these layers:

Layer 1: Symptom Recognition → "Users are reporting errors"

Layer 2: Signal Collection → Gather metrics, logs, traces related to the symptom

Layer 3: Signal Analysis → Identify patterns, correlations, and anomalies

Layer 4: Root Cause → Determine the underlying reason (often found in signals you weren't initially monitoring)

Here's the relationship visualized:

┌─────────────────────────────────────────────────┐
│  OBSERVABILITY HIERARCHY                        │
├─────────────────────────────────────────────────┤
│                                                 │
│  👥 USER IMPACT (Business metrics)              │
│      ↑ Derived from                             │
│  🌡️ SYMPTOMS (SLIs/SLOs)                        │
│      ↑ Aggregated from                          │
│  📊 SIGNALS (Raw telemetry)                     │
│      ↑ Emitted by                               │
│  ⚙️ SYSTEM BEHAVIOR (Code + infrastructure)     │
│                                                 │
└─────────────────────────────────────────────────┘

Why the Distinction Matters 🎯

1. Alert Fatigue Prevention

When you alert on symptoms without understanding underlying signals, you get:

  • Multiple alerts for the same root cause
  • Confusion about what to fix first
  • Teams chasing their tails

2. Faster Mean Time to Resolution (MTTR)

Understanding signals helps you:

  • Skip directly to relevant data
  • Avoid investigating red herrings
  • Build better runbooks

3. Proactive Problem Detection

Signal-based monitoring catches issues before they become symptoms:

  • Memory slowly leaking → Signal visible hours before crash
  • Disk space growing → Signal visible days before full disk
  • Connection pool exhaustion → Signal visible minutes before timeout spike

💡 Mental model: Symptoms are lagging indicators (they tell you after the fact), while signals are leading indicators (they show you what's developing).
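
One practical way to use a signal as a leading indicator is to extrapolate its trend before it ever becomes a symptom. A minimal sketch, assuming hourly disk-usage samples are already available (the numbers below are made up):

# Hourly disk usage samples in GB (illustrative data)
samples = [412, 418, 425, 431, 438, 444]
capacity_gb = 500

# Estimate the growth rate from the first and last sample
hours_elapsed = len(samples) - 1
growth_per_hour = (samples[-1] - samples[0]) / hours_elapsed   # ~6.4 GB/hour

remaining_gb = capacity_gb - samples[-1]
hours_until_full = remaining_gb / growth_per_hour if growth_per_hour > 0 else float("inf")

# Alert on the trend well before the "disk full" symptom ever appears
if hours_until_full < 48:
    print(f"LEADING INDICATOR: disk projected to fill in ~{hours_until_full:.0f} hours")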

Common Signal-Symptom Pairs 🔗

Understanding typical relationships helps you build better observability:

Symptom (What Users See) | Potential Signals (What's Really Happening) | Root Cause Examples
500 errors increasing | Exception count by type, stack traces, database connection pool metrics | Database connection exhaustion, unhandled null pointers, dependency timeout
Slow API responses | Request duration by endpoint, CPU utilization, GC pause time, query execution time | N+1 database queries, memory pressure causing GC, external API degradation
Service unavailable | Pod restart count, OOM kill events, health check failures, load balancer errors | Memory leak, unhandled panic, misconfigured health check
Data inconsistency | Transaction rollback rate, replication lag, event processing delay | Race condition, failed distributed transaction, message queue backlog

Detailed Examples

Example 1: The Slow Checkout Flow 🛒

The Symptom: Your e-commerce checkout process is taking 5 seconds instead of the usual 500ms. Customers are abandoning carts.

The Surface Investigation (symptom-focused thinking):

  • "The checkout API is slow"
  • "We need to scale up checkout service"
  • "Let's add more cache"

The Signal-Driven Investigation:

Step | Signal Examined | What It Revealed
1 | Distributed trace of slow request | 95% of time spent in payment validation service
2 | Payment service response time metrics | Normal response time (50ms), but high retry count
3 | Network connection metrics | Connection establishment taking 4.8s on every retry
4 | DNS resolution logs | DNS lookups timing out, falling back to secondary resolver
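
Step 1 is the kind of question a few lines of analysis can answer once trace data is exported. A minimal sketch, assuming the spans of one slow request are available as plain dictionaries (the field names and numbers are illustrative, not a specific tracing API):

# One slow checkout request, flattened into (service, duration) spans
spans = [
    {"service": "checkout-api",       "duration_ms": 5050},  # root span
    {"service": "cart-service",       "duration_ms": 80},
    {"service": "payment-validation", "duration_ms": 4800},
    {"service": "inventory-service",  "duration_ms": 60},
]

root = max(spans, key=lambda s: s["duration_ms"])        # the whole request
children = [s for s in spans if s is not root]
worst = max(children, key=lambda s: s["duration_ms"])    # dominant child span

share = worst["duration_ms"] / root["duration_ms"]
print(f"{worst['service']} accounts for {share:.0%} of the slow request")
# -> payment-validation accounts for 95% of the slow request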

The Root Cause: DNS server was overloaded. Not a checkout problem, not a payment service problem, not a scaling problem.

The Fix: Implement DNS caching and add additional DNS servers. Resolved in 10 minutes vs. hours of scaling experiments.

💡 Lesson: If you'd just looked at the symptom ("checkout is slow") and scaled checkout services, you'd have wasted resources and time while users continued suffering.

Example 2: The Mysterious Memory Leak 🧠

The Symptom: Application crashes every 6 hours with Out Of Memory (OOM) errors.

Symptom-Level Response (treating the effect):

  • Restart the service when memory hits 90%
  • Increase memory allocation from 4GB to 8GB
  • Set up auto-restart on crash

Signal-Driven Investigation:

MEMORY USAGE OVER TIME (Signal Analysis)

8GB  ─                            ▲ OOM!
     │                        ╱╱╱╱
6GB  ─                    ╱╱╱╱
     │                ╱╱╱╱
4GB  ─            ╱╱╱╱          ← Steady linear growth
     │        ╱╱╱╱
2GB  ─    ╱╱╱╱
     │╱╱╱╱
0GB  ┼────┬────┬────┬────┬────┬────┬
     0h   1h   2h   3h   4h   5h   6h

Signal: Memory grows ~1.2GB/hour consistently

Deeper Signal Examination:

  1. Heap dump analysis (signal): Large number of HTTP connection objects never released
  2. Connection pool metrics (signal): Pool size growing unbounded
  3. Code review triggered by signals: Connection pool missing maxTotal configuration
  4. Log correlation (signal): Each failed request creates new connection but doesn't return it to pool

The Root Cause: Configuration error + error handling bug. Failed requests didn't properly release connections.

The Real Fix:

  • Set connection pool max size
  • Fix error handling to ensure connection release
  • Memory issue disappeared completely
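
A minimal sketch of the error-handling part of that fix, assuming a generic connection pool with acquire() and release() methods (hypothetical names, not a specific library):

# Hypothetical bounded pool (the missing maxTotal-style setting):
#   pool = ConnectionPool(max_size=50)

def call_payment_service(pool, payload):
    conn = pool.acquire()            # borrow a connection from the bounded pool
    try:
        return conn.post("/validate", payload)
    finally:
        pool.release(conn)           # returned even when the request raises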

⚠️ What symptom-focused approach missed: Increasing memory would just make it crash every 12 hours instead of 6. The leak would continue.

Example 3: The Dashboard That Lied 📊

The Symptom: Your monitoring dashboard shows "All Systems Green" ✅

User Reality: Customers can't log in, support tickets flooding in.

What Happened: Monitoring was tracking symptoms, not signals.

The Dashboard Showed (symptoms):

  • ✅ HTTP 200 response rate: 99.9%
  • ✅ Average response time: 150ms
  • ✅ CPU usage: 45%
  • ✅ Error rate: 0.1%

The Hidden Signals:

Signal | What It Actually Showed | Why the Dashboard Missed It
Authentication success rate | Dropped from 98% to 12% | Not monitored separately from the general success rate
Redis connection errors (logs) | Spiking to 1000/min | Logged but not aggregated into metrics
Session token validation failures | 95% failing | Application handled them gracefully, returned 200 with an error message
Cache hit rate | Dropped from 85% to 2% | Treated as a performance optimization metric, not a health signal

The Root Cause: Redis cluster failover happened, but application fell back to "graceful degradation" that returned HTTP 200 with "Please try again later" messages. Technically not errors, but users couldn't use the system.

💡 Lesson: Symptoms without context are meaningless. You need signals that reflect actual user journeys and business functions, not just technical health.
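
One way out of this trap is to make authentication success a first-class signal instead of letting it hide behind HTTP 200s. A minimal sketch with the Python prometheus_client library; the metric and label names are illustrative:

from prometheus_client import Counter

auth_attempts = Counter(
    "auth_attempts_total",
    "Login attempts by outcome",
    ["outcome"],   # success, bad_password, backend_error, ...
)

def record_login(user_ok: bool, backend_ok: bool) -> None:
    # Even if the HTTP layer returns 200 with a friendly message,
    # the outcome label preserves the real signal.
    if not backend_ok:
        auth_attempts.labels(outcome="backend_error").inc()
    elif not user_ok:
        auth_attempts.labels(outcome="bad_password").inc()
    else:
        auth_attempts.labels(outcome="success").inc()

# A symptom can then be derived and alerted on, e.g. in PromQL:
#   rate(auth_attempts_total{outcome="success"}[5m])
#     / rate(auth_attempts_total[5m]) < 0.9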

Example 4: The Cascading Failure 🌊

The Symptom: Everything is failing simultaneously across multiple services.

The Panic Response (symptom whack-a-mole):

  • Restart service A → Still failing
  • Scale up service B → Still failing
  • Rollback deployment of service C → Still failing
  • Declare major incident → Still failing

The Signal-Driven Approach:

SERVICE DEPENDENCY MAP WITH SIGNALS

    ┌─────────┐  ┌─────────┐  ┌─────────┐
    │Frontend │  │ Auth    │  │ Orders  │
    │ ❌ Fail │  │ ❌ Fail │  │ ❌ Fail │
    └────┬────┘  └────┬────┘  └────┬────┘
         │            │            │
         └────────────┼────────────┘
                      │ (all depend on)
                      ↓
              ┌───────────────┐
              │   Database    │
              │  ⚠️ Signal:   │
              │ Connection    │
              │ pool at 100%  │
              │ Wait time:    │
              │ 30 seconds    │
              └───────────────┘
                      ↓
              🔍 Root Cause:
              Long-running query
              holding locks,
              blocking all other
              transactions

The Critical Signals:

  1. Database connection wait time (signal): Jumped from 0ms to 30,000ms
  2. Active transaction count (signal): One transaction running for 45 minutes
  3. Lock wait time (signal): 200 queries waiting on same table lock
  4. Query logs (signal): Massive JOIN query without proper indexes
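
Signals 2 and 3 usually come straight from the database itself. Purely as an illustration (assuming the database here is PostgreSQL, which the example does not actually state), a sketch using psycopg2 to surface long-running transactions:

import psycopg2

# Connection string is illustrative; point it at your own database
conn = psycopg2.connect("dbname=app user=observer")

with conn.cursor() as cur:
    # pg_stat_activity exposes per-backend signals: transaction age, state, query text
    cur.execute("""
        SELECT pid, now() - xact_start AS xact_age, state, left(query, 80)
        FROM pg_stat_activity
        WHERE xact_start IS NOT NULL
        ORDER BY xact_age DESC
        LIMIT 5
    """)
    for pid, xact_age, state, query in cur.fetchall():
        print(pid, xact_age, state, query)

conn.close()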

The Real Fix: Kill the one problematic query. All services recovered in 30 seconds.

💡 Lesson: In cascading failures, start with shared dependencies and their signals. Most symptoms are downstream effects of a single root cause.

Common Mistakes

❌ Mistake 1: Alerting Only on Symptoms

The Problem: You get notified that something is wrong, but have no idea what's causing it.

🚨 ALERT: API Response Time > 1 second

Now what? You still need to:

  1. Look up what's happening (signals)
  2. Correlate multiple data sources
  3. Form hypotheses
  4. Test them

Better Approach: Alert includes relevant signals automatically.

🚨 ALERT: API Response Time > 1 second

Key Signals:
- Database query time: 950ms (↑ 300% from baseline)
- Cache hit rate: 15% (↓ from usual 85%)
- Redis latency: 12ms (normal)

Likely cause: Cache warming issue after deployment
Runbook: [link]
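
A minimal sketch of how such an enriched alert could be assembled, assuming the current values and baselines are already queryable (the signal names, numbers, and thresholds here are illustrative):

def enrich_alert(symptom: str, signals: dict, baselines: dict) -> str:
    # Attach each related signal plus its deviation from baseline, so the
    # alert answers "what changed?" and not just "something is slow"
    lines = [f"ALERT: {symptom}", "", "Key Signals:"]
    for name, value in signals.items():
        base = baselines.get(name)
        if base:
            change = (value - base) / base * 100
            lines.append(f"- {name}: {value} ({change:+.0f}% vs baseline {base})")
        else:
            lines.append(f"- {name}: {value}")
    return "\n".join(lines)

print(enrich_alert(
    "API response time > 1 second",
    {"db_query_ms": 950, "cache_hit_rate_pct": 15},
    {"db_query_ms": 240, "cache_hit_rate_pct": 85},
))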

❌ Mistake 2: Treating All Signals Equally

The Problem: You're drowning in data, unable to find the signal in the noise.

Not all signals are equally valuable:

Signal Priority | Examples | When to Check
🔴 Critical | User-facing errors, data corruption, security breaches | Alert immediately, wake someone up
🟡 Warning | High resource usage, elevated latency, increasing error rates | Monitor closely, may need intervention
🟢 Informational | Deployment events, scaling events, configuration changes | Check during investigation for context
⚪ Diagnostic | Detailed traces, debug logs, method-level metrics | Use only when debugging specific issues

❌ Mistake 3: Ignoring Signal Context

The Problem: A signal means different things in different contexts.

Example: CPU at 80%

  • During daily batch job at 3 AM: ✅ Normal
  • During low-traffic period on web server: ⚠️ Suspicious
  • After deployment: 🚨 Potential problem
  • On Black Friday: ✅ Expected

Better Approach: Context-aware signals

CPU usage: 80%
Context: Tuesday 2:47 PM (typical usage: 35%)
Change: +45 percentage points in 5 minutes
Correlation: Deployment completed 6 minutes ago
Verdict: 🚨 Investigate immediately
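
A minimal sketch of that kind of context-aware check, assuming the current reading, a typical-for-this-time baseline, and the last deployment time are already available (all values below are illustrative):

from datetime import datetime, timedelta

def assess_cpu(current_pct, baseline_pct, last_deploy, now=None):
    # A raw number is ambiguous; the baseline and recent changes give it meaning
    now = now or datetime.now()
    delta = current_pct - baseline_pct
    recently_deployed = (now - last_deploy) < timedelta(minutes=15)

    if delta > 30 and recently_deployed:
        return "Investigate immediately: large jump right after a deployment"
    if delta > 30:
        return "Suspicious: well above the typical value for this time"
    return "Within normal range for this time of week"

print(assess_cpu(
    current_pct=80,
    baseline_pct=35,
    last_deploy=datetime.now() - timedelta(minutes=6),
))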

❌ Mistake 4: The "Dashboard Operator" Trap

The Problem: Staring at dashboards waiting for symptoms to appear.

Why It Fails:

  • Humans are terrible at sustained attention
  • Symptoms appear only after users are affected
  • You're reactive, not proactive

Better Approach: Automated signal analysis

SIGNAL ANALYSIS WORKFLOW

📊 Signals → 🤖 Automated Analysis → 🎯 Smart Alerts
                    │
                    ├→ Anomaly detection
                    ├→ Pattern recognition
                    ├→ Predictive analysis
                    ├→ Correlation discovery
                    └→ Root cause suggestion
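
Even a crude statistical check covers much of the distance from raw signals to smart alerts. A minimal anomaly-detection sketch using a mean and standard deviation over recent values (the window and threshold are arbitrary choices):

import statistics

def is_anomalous(history, latest, threshold=3.0):
    # Flag a point more than `threshold` standard deviations from the
    # recent mean (the classic three-sigma rule of thumb)
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > threshold

latency_ms = [118, 124, 121, 119, 125, 122, 120, 123]
print(is_anomalous(latency_ms, 127))   # False: within normal variation
print(is_anomalous(latency_ms, 480))   # True: clear outlier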

❌ Mistake 5: Collecting Signals Without Purpose

The Problem: "Let's collect everything just in case!"

Why It Fails:

  • Storage costs explode
  • Query performance degrades
  • Important signals buried in noise
  • Nobody knows what signals mean

Better Approach: Signal taxonomy with purpose

For each signal, document:

  • What it measures (definition)
  • Why it matters (impact)
  • When to check it (context)
  • How to interpret it (thresholds)
  • Where it fits in debugging (common use cases)
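
One lightweight way to keep that documentation next to the code is a small signal catalog. A minimal sketch whose fields mirror the list above (the example entry is illustrative):

from dataclasses import dataclass

@dataclass
class SignalSpec:
    name: str        # what it measures (definition)
    impact: str      # why it matters
    context: str     # when to check it
    thresholds: str  # how to interpret it
    use_cases: str   # where it fits in debugging

CATALOG = [
    SignalSpec(
        name="db_connection_pool_in_use",
        impact="Exhaustion blocks every request that needs the database",
        context="During latency spikes or cascading failures",
        thresholds=">80% sustained is a warning; 100% is critical",
        use_cases="First check when multiple services degrade at once",
    ),
]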

Key Takeaways

🎯 The Core Principle

Signals are data. Symptoms are interpretations.

Your observability strategy should:

  1. Collect comprehensive signals
  2. Aggregate signals into symptoms
  3. Alert on symptoms
  4. Investigate using signals
  5. Resolve root causes (found in signals)

🧠 Mental Models to Remember

The Medical Analogy:

  • Fever (symptom) ← Infection (signal) ← Bacteria (root cause)
  • High latency (symptom) ← CPU saturation (signal) ← Memory leak (root cause)

The Detective Analogy:

  • Crime scene (symptom) ← Evidence (signals) ← Perpetrator (root cause)
  • Service down (symptom) ← Crash logs (signals) ← Null pointer (root cause)

🔧 Practical Implementation

When building observability:

  1. Start with signals (what can the system tell us?)
  2. Define symptoms (what do users care about?)
  3. Map relationships (which signals cause which symptoms?)
  4. Build workflows (symptom detected → relevant signals → guided investigation)

When responding to incidents:

  1. Acknowledge the symptom (what's broken?)
  2. Gather signals (what's the data saying?)
  3. Form hypothesis (what could explain these signals?)
  4. Test hypothesis (do other signals confirm this?)
  5. Find root cause (what signal points to the source?)

📊 The Observability Stack

📋 Quick Reference: Signals vs Symptoms

Aspect | Signals | Symptoms
Nature | Raw data, objective measurements | Derived indicators, interpreted state
Purpose | Diagnosis and root cause analysis | Detection and awareness
Volume | High cardinality, detailed | Low cardinality, aggregated
Examples | Memory bytes used, query duration, error logs | Service degraded, high latency, elevated errors
Alert on? | Rarely (only critical signals) | Yes (user-impacting symptoms)
Who cares? | Engineers during investigation | Everyone (users, business, on-call)
Time horizon | Leading indicator (early warning) | Lagging indicator (problem already exists)
Cardinality | Thousands to millions of unique streams | Dozens to hundreds of indicators

🚀 Next Steps in Your Journey

You've learned to distinguish signals from symptoms. This foundational shift in thinking prepares you for:

  • Building effective SLIs/SLOs (next lesson: translating business impact into measurable symptoms)
  • Designing alert strategies (knowing when to alert on symptoms vs. signals)
  • Root cause analysis techniques (signal correlation, anomaly detection, causation)
  • Observability architecture (collecting, storing, and querying signals at scale)

💡 Remember: The goal isn't to collect more data; it's to collect the right signals and understand which symptoms they produce.

🎓 Lesson complete! Practice with the flashcards above and test your understanding with the quiz questions. The next lesson will build on this foundation to explore how signals aggregate into Service Level Indicators.