Signals vs Symptoms
Learn to distinguish between raw telemetry data and the surface manifestations of deeper issues
Signals vs Symptoms in Production Observability
Master the critical distinction between signals and symptoms with free flashcards and spaced repetition practice. This lesson covers signal types, symptom patterns, and root cause analysis techniques: essential concepts for building effective observability strategies in modern production systems.
Welcome
In the world of production systems, confusion between signals and symptoms is one of the most common (and costly) mistakes teams make. When an alert fires at 3 AM, are you looking at the actual problem or just its visible effects? Understanding this distinction transforms how you approach incident response, build monitoring systems, and ultimately keep your services reliable.
Think of it like medicine: a fever is a symptom, but the infection causing it is the underlying signal your body is trying to communicate. In production systems, high latency might be your symptom, but the signal could be anything from memory pressure to network saturation to database lock contention.
Core Concepts
What Are Signals?
Signals are the fundamental data points emitted by your system that represent its actual state and behavior. They are the raw, objective measurements that tell you what's happening inside your infrastructure.
Think of signals as the primary sources of truth:
- Metrics: Numerical measurements over time (CPU usage, request count, memory consumption)
- Logs: Discrete event records with context (application errors, access logs, system events)
- Traces: End-to-end request flows showing the path through your system
- Events: State changes or significant occurrences (deployments, configuration changes, scaling events)
💡 Key insight: Signals are what your system actually does. They're the objective reality beneath everything else.
| Signal Type | What It Measures | Example |
|---|---|---|
| Counter | Cumulative value that only increases | Total HTTP requests: 1,453,892 |
| Gauge | Point-in-time value that can go up or down | Current memory usage: 4.2 GB |
| Histogram | Distribution of values over time | Request duration: p50=120ms, p99=850ms |
| Log entry | Timestamped event with structured data | 2026-01-15 14:23:01 ERROR: Connection timeout to db-primary |
| Span | Single operation within a distributed trace | API call → Database query (duration: 45ms) |
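To make these signal types concrete, here is a minimal sketch of emitting a counter, a gauge, and a histogram from application code. It assumes the Prometheus Python client (prometheus_client); the metric names and the handle_request helper are illustrative, not part of the lesson.

```python
# A minimal sketch of emitting basic signal types with the Prometheus
# Python client. Metric names and the request simulation are illustrative.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS_TOTAL = Counter(
    "http_requests_total", "Total HTTP requests handled (counter: only increases)"
)
MEMORY_IN_USE = Gauge(
    "app_memory_bytes", "Current memory in use (gauge: can go up or down)"
)
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds", "Request duration distribution (histogram)"
)

def handle_request() -> None:
    """Simulate one request and emit the corresponding signals."""
    start = time.perf_counter()
    time.sleep(random.uniform(0.01, 0.2))                   # pretend to do work
    REQUESTS_TOTAL.inc()                                     # counter signal
    MEMORY_IN_USE.set(random.uniform(3.5e9, 4.5e9))          # gauge signal
    REQUEST_DURATION.observe(time.perf_counter() - start)    # histogram signal

if __name__ == "__main__":
    start_http_server(8000)   # exposes the signals on http://localhost:8000/metrics
    while True:
        handle_request()
```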
What Are Symptoms?
Symptoms are the observable effects or manifestations that indicate something might be wrong. They're what users experience and what your monitoring typically alerts on first.
Symptoms are derived indicators:
- High response times (derived from request duration metrics)
- Error rate spikes (aggregated from multiple error signals)
- Service unavailability (computed from health check failures)
- Slow page loads (experienced by users, reflected in multiple signals)
⚠️ Critical distinction: Symptoms tell you something is wrong, but not why it's wrong.
THE SYMPTOM ↔ SIGNAL RELATIONSHIP

User Experience
      │
      ▼
SYMPTOM (what you observe): "The website is slow"
      │
      ├── SIGNAL: CPU at 95%, sustained
      ├── SIGNAL: memory swapping active
      ├── SIGNAL: network packet loss
      └── SIGNAL: database query timeouts

SIGNALS are what's actually happening underneath.
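The relationship also shows up directly in code: symptoms are computed from signals, never the other way around. Below is a minimal sketch of deriving two symptoms (error rate and p99 latency) from raw per-request signals; the RequestSignal records and sample values are illustrative.

```python
# A minimal sketch of deriving symptoms from raw signals.
# The per-request records below are illustrative; in practice they would
# come from your metrics or tracing backend.
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class RequestSignal:
    duration_ms: float    # raw signal: how long this request took
    status_code: int      # raw signal: what the server returned

def derive_symptoms(requests: list) -> dict:
    """Aggregate raw per-request signals into coarse, user-facing symptoms."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status_code >= 500)
    durations = sorted(r.duration_ms for r in requests)
    p99 = quantiles(durations, n=100)[98]    # 99th-percentile latency
    return {
        "error_rate_pct": 100.0 * errors / total,    # symptom: "errors are spiking"
        "p99_latency_ms": p99,                       # symptom: "the site feels slow"
    }

signals = [RequestSignal(120, 200), RequestSignal(4800, 200), RequestSignal(95, 503)]
print(derive_symptoms(signals))
```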
The Diagnostic Journey
When you're troubleshooting a production issue, you typically move through these layers:
Layer 1: Symptom Recognition → "Users are reporting errors"
Layer 2: Signal Collection → Gather metrics, logs, traces related to the symptom
Layer 3: Signal Analysis → Identify patterns, correlations, and anomalies
Layer 4: Root Cause → Determine the underlying reason (often found in signals you weren't initially monitoring)
Here's the relationship visualized:
OBSERVABILITY HIERARCHY

USER IMPACT (business metrics)
     ▲  derived from
SYMPTOMS (SLIs/SLOs)
     ▲  aggregated from
SIGNALS (raw telemetry)
     ▲  emitted by
SYSTEM BEHAVIOR (code + infrastructure)
Why the Distinction Matters
1. Alert Fatigue Prevention
When you alert on symptoms without understanding underlying signals, you get:
- Multiple alerts for the same root cause
- Confusion about what to fix first
- Teams chasing their tails
2. Faster Mean Time to Resolution (MTTR)
Understanding signals helps you:
- Skip directly to relevant data
- Avoid investigating red herrings
- Build better runbooks
3. Proactive Problem Detection
Signal-based monitoring catches issues before they become symptoms:
- Memory slowly leaking → Signal visible hours before crash
- Disk space growing → Signal visible days before full disk
- Connection pool exhaustion → Signal visible minutes before timeout spike
💡 Mental model: Symptoms are lagging indicators (they tell you after the fact), while signals are leading indicators (they show you what's developing).
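To make the leading-indicator idea concrete, here is a minimal sketch that fits a trend line to a slowly growing signal and estimates time-to-exhaustion before any symptom appears. It assumes NumPy, and the disk-usage samples are illustrative.

```python
# A minimal sketch of treating a slowly growing signal as a leading indicator.
# The samples are illustrative; in practice they would come from a metrics
# store (e.g. disk usage sampled every hour).
import numpy as np

hours = np.array([0, 1, 2, 3, 4, 5], dtype=float)           # sample times
disk_used_gb = np.array([410, 418, 427, 435, 444, 452.0])   # raw signal
disk_total_gb = 500.0

# Fit a straight line to the signal: growth rate in GB/hour plus an offset.
slope, intercept = np.polyfit(hours, disk_used_gb, deg=1)

if slope > 0:
    hours_until_full = (disk_total_gb - disk_used_gb[-1]) / slope
    print(f"Disk growing ~{slope:.1f} GB/h; roughly {hours_until_full:.0f}h until full")
    # Alert on the signal's trend now, long before the "disk full" symptom appears.
```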
Common Signal-Symptom Pairs
Understanding typical relationships helps you build better observability:
| Symptom (What Users See) | Potential Signals (What's Really Happening) | Root Cause Examples |
|---|---|---|
| 500 errors increasing | Exception count by type, stack traces, database connection pool metrics | Database connection exhaustion, unhandled null pointers, dependency timeout |
| Slow API responses | Request duration by endpoint, CPU utilization, GC pause time, query execution time | N+1 database queries, memory pressure causing GC, external API degradation |
| Service unavailable | Pod restart count, OOM kill events, health check failures, load balancer errors | Memory leak, unhandled panic, misconfigured health check |
| Data inconsistency | Transaction rollback rate, replication lag, event processing delay | Race condition, failed distributed transaction, message queue backlog |
Detailed Examples
Example 1: The Slow Checkout Flow
The Symptom: Your e-commerce checkout process is taking 5 seconds instead of the usual 500ms. Customers are abandoning carts.
The Surface Investigation (symptom-focused thinking):
- "The checkout API is slow"
- "We need to scale up checkout service"
- "Let's add more cache"
The Signal-Driven Investigation:
| Step | Signal Examined | What It Revealed |
|---|---|---|
| 1 | Distributed trace of slow request | 95% of time spent in payment validation service |
| 2 | Payment service response time metrics | Normal response time (50ms), but high retry count |
| 3 | Network connection metrics | Connection establishment taking 4.8s on every retry |
| 4 | DNS resolution logs | DNS lookups timing out, falling back to secondary resolver |
The Root Cause: DNS server was overloaded. Not a checkout problem, not a payment service problem, not a scaling problem.
The Fix: Implement DNS caching and add additional DNS servers. Resolved in 10 minutes vs. hours of scaling experiments.
💡 Lesson: If you'd just looked at the symptom ("checkout is slow") and scaled checkout services, you'd have wasted resources and time while users continued suffering.
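A signal like the one that cracked this case can be cheap to expose. Here is a minimal sketch that times DNS resolution as its own measurement using only the Python standard library; the hostnames are illustrative.

```python
# A minimal sketch of measuring DNS resolution time as a first-class signal,
# using only the standard library. The hostnames are illustrative.
import socket
import time

def dns_resolution_ms(hostname: str) -> float:
    """Time a single DNS lookup; slow or failing lookups are a signal on
    their own, separate from the 'checkout is slow' symptom."""
    start = time.perf_counter()
    socket.getaddrinfo(hostname, 443)   # performs the DNS resolution
    return (time.perf_counter() - start) * 1000.0

if __name__ == "__main__":
    for host in ["payments.internal.example", "db-primary.internal.example"]:
        try:
            print(f"{host}: {dns_resolution_ms(host):.1f} ms")
        except socket.gaierror as exc:
            print(f"{host}: resolution failed ({exc})")   # also a signal
```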
Example 2: The Mysterious Memory Leak
The Symptom: Application crashes every 6 hours with Out Of Memory (OOM) errors.
Symptom-Level Response (treating the effect):
- Restart the service when memory hits 90%
- Increase memory allocation from 4GB to 8GB
- Set up auto-restart on crash
Signal-Driven Investigation:
MEMORY USAGE OVER TIME (Signal Analysis)

8GB ┤                          ╱╱  ← OOM!
    │                      ╱╱╱╱
6GB ┤                  ╱╱╱╱
    │              ╱╱╱╱
4GB ┤          ╱╱╱╱            ← steady linear growth
    │      ╱╱╱╱
2GB ┤  ╱╱╱╱
    │╱╱
0GB ┼─────┬─────┬─────┬─────┬─────┬─────
    0h    1h    2h    3h    4h    5h    6h
Signal: Memory grows ~1.2GB/hour consistently
Deeper Signal Examination:
- Heap dump analysis (signal): Large number of HTTP connection objects never released
- Connection pool metrics (signal): Pool size growing unbounded
- Code review triggered by signals: Connection pool missing maxTotal configuration
- Log correlation (signal): Each failed request creates new connection but doesn't return it to pool
The Root Cause: Configuration error + error handling bug. Failed requests didn't properly release connections.
The Real Fix:
- Set connection pool max size
- Fix error handling to ensure connection release
- Memory issue disappeared completely
⚠️ What the symptom-focused approach missed: Increasing memory would just make it crash every 12 hours instead of 6. The leak would continue.
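The example describes the missing setting in Apache HttpClient terms (maxTotal). As a language-agnostic illustration, here is a minimal Python sketch of the same two fixes: a bounded pool and a release that happens even when the request fails. The Connection class is a stand-in for a real HTTP or database connection.

```python
# A minimal sketch (in Python, for illustration) of the two fixes: a bounded
# connection pool and a release that is guaranteed even when a request fails.
# The Connection class is a stand-in for a real HTTP or database connection.
import queue
from contextlib import contextmanager

class Connection:
    def send(self, request: str) -> str:
        return f"response to {request}"

class BoundedPool:
    def __init__(self, max_total: int) -> None:
        # Fix 1: cap the pool size (the missing maxTotal-style setting).
        self._free = queue.Queue(maxsize=max_total)
        for _ in range(max_total):
            self._free.put(Connection())

    @contextmanager
    def connection(self, timeout: float = 5.0):
        conn = self._free.get(timeout=timeout)  # blocks rather than growing unbounded
        try:
            yield conn
        finally:
            # Fix 2: always return the connection, even if the caller raised.
            self._free.put(conn)

pool = BoundedPool(max_total=20)
with pool.connection() as conn:
    print(conn.send("GET /checkout"))
```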
Example 3: The Dashboard That Lied
The Symptom: Your monitoring dashboard shows "All Systems Green" ✅
User Reality: Customers can't log in, support tickets flooding in.
What Happened: Monitoring was tracking symptoms, not signals.
The Dashboard Showed (symptoms):
- ✅ HTTP 200 response rate: 99.9%
- ✅ Average response time: 150ms
- ✅ CPU usage: 45%
- ✅ Error rate: 0.1%
The Hidden Signals:
| Signal Type | What It Actually Showed | Why Dashboard Missed It |
|---|---|---|
| Authentication success rate | Dropped from 98% to 12% | Not monitored separately from general success rate |
| Redis connection errors (logs) | Spiking to 1000/min | Logged but not aggregated into metrics |
| Session token validation failures | 95% failing | Application handled gracefully, returned 200 with error message |
| Cache hit rate | Dropped from 85% to 2% | Treated as performance optimization metric, not health signal |
The Root Cause: Redis cluster failover happened, but application fell back to "graceful degradation" that returned HTTP 200 with "Please try again later" messages. Technically not errors, but users couldn't use the system.
💡 Lesson: Symptoms without context are meaningless. You need signals that reflect actual user journeys and business functions, not just technical health.
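One way to close this gap is to count business-level outcomes as their own signal, so "graceful" HTTP 200 responses with an error body still register as failed logins. Here is a minimal sketch assuming the Prometheus Python client; the auth stub and error classes are illustrative stand-ins.

```python
# A minimal sketch of emitting login outcomes as their own signal, so a
# "graceful" HTTP 200 with an error body still counts as a failed login.
# Assumes the Prometheus Python client; names and stubs are illustrative.
from prometheus_client import Counter

class CacheUnavailableError(Exception): ...
class InvalidCredentialsError(Exception): ...

LOGIN_ATTEMPTS = Counter(
    "login_attempts_total", "Login attempts by outcome", ["outcome"]
)

def create_session(username: str, password: str) -> str:
    # Stand-in for the real auth backend; here it fails as if Redis were down.
    raise CacheUnavailableError("redis cluster failover in progress")

def login(username: str, password: str) -> dict:
    try:
        session = create_session(username, password)
    except CacheUnavailableError:
        # Graceful degradation for the user, but a distinct signal for operators.
        LOGIN_ATTEMPTS.labels(outcome="degraded").inc()
        return {"status": 200, "body": "Please try again later"}
    except InvalidCredentialsError:
        LOGIN_ATTEMPTS.labels(outcome="rejected").inc()
        return {"status": 401, "body": "Invalid credentials"}
    LOGIN_ATTEMPTS.labels(outcome="success").inc()
    return {"status": 200, "body": session}

print(login("alice", "hunter2"))
```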
Example 4: The Cascading Failure
The Symptom: Everything is failing simultaneously across multiple services.
The Panic Response (symptom whack-a-mole):
- Restart service A → Still failing
- Scale up service B → Still failing
- Rollback deployment of service C → Still failing
- Declare major incident → Still failing
The Signal-Driven Approach:
SERVICE DEPENDENCY MAP WITH SIGNALS
┌──────────┐  ┌──────────┐  ┌──────────┐
│ Frontend │  │   Auth   │  │  Orders  │
│  ✗ Fail  │  │  ✗ Fail  │  │  ✗ Fail  │
└─────┬────┘  └─────┬────┘  └─────┬────┘
      │             │             │
      └─────────────┼─────────────┘
                    │  (all depend on)
                    ▼
           ┌─────────────────┐
           │    Database     │
           │  ⚠ Signal:      │
           │  connection     │
           │  pool at 100%   │
           │  wait time:     │
           │  30 seconds     │
           └─────────────────┘
                    │
                    ▼
        Root cause: a long-running query
        holding locks, blocking all other
        transactions
The Critical Signals:
- Database connection wait time (signal): Jumped from 0ms to 30,000ms
- Active transaction count (signal): One transaction running for 45 minutes
- Lock wait time (signal): 200 queries waiting on same table lock
- Query logs (signal): Massive JOIN query without proper indexes
The Real Fix: Kill the one problematic query. All services recovered in 30 seconds.
💡 Lesson: In cascading failures, start with shared dependencies and their signals. Most symptoms are downstream effects of a single root cause.
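When the shared dependency is a relational database, the "long-running transaction" signal is often available directly from the database itself. Here is a minimal sketch that reads it from PostgreSQL's pg_stat_activity view, assuming psycopg2; the connection string is illustrative.

```python
# A minimal sketch of surfacing the "long-running transaction" signal from
# PostgreSQL's pg_stat_activity view. Assumes psycopg2 and a reachable
# database; the connection string is illustrative.
import psycopg2

QUERY = """
    SELECT pid,
           now() - xact_start AS transaction_age,
           state,
           left(query, 80)    AS query_preview
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
      AND now() - xact_start > interval '5 minutes'
    ORDER BY xact_start;
"""

def long_running_transactions(dsn: str):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(QUERY)
        return cur.fetchall()

if __name__ == "__main__":
    rows = long_running_transactions("dbname=orders host=db-primary")  # illustrative DSN
    for pid, age, state, preview in rows:
        print(f"pid={pid} age={age} state={state} query={preview!r}")
```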
Common Mistakes
❌ Mistake 1: Alerting Only on Symptoms
The Problem: You get notified that something is wrong, but have no idea what's causing it.
🚨 ALERT: API Response Time > 1 second
Now what? You still need to:
- Look up what's happening (signals)
- Correlate multiple data sources
- Form hypotheses
- Test them
Better Approach: Alert includes relevant signals automatically.
🚨 ALERT: API Response Time > 1 second
Key Signals:
- Database query time: 950ms (↑ 300% from baseline)
- Cache hit rate: 15% (↓ from usual 85%)
- Redis latency: 12ms (normal)
Likely cause: Cache warming issue after deployment
Runbook: [link]
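Here is a minimal sketch of how such an enriched alert might be assembled; query_metric is a hypothetical stand-in for a call to your metrics backend, and the baselines are illustrative.

```python
# A minimal sketch of assembling a signal-enriched alert. query_metric is a
# hypothetical stand-in for a call to your metrics backend; the values and
# baselines are illustrative.

def query_metric(name: str) -> float:
    # Hypothetical: return the latest value of a metric (hard-coded here).
    return {"db_query_ms": 950.0, "cache_hit_rate": 0.15, "redis_latency_ms": 12.0}[name]

BASELINES = {"db_query_ms": 300.0, "cache_hit_rate": 0.85, "redis_latency_ms": 10.0}

def build_alert(symptom: str) -> str:
    lines = [f"ALERT: {symptom}", "Key signals:"]
    for name, baseline in BASELINES.items():
        current = query_metric(name)
        change_pct = (current - baseline) / baseline * 100
        lines.append(f"  - {name}: {current:g} (baseline {baseline:g}, {change_pct:+.0f}%)")
    lines.append("Runbook: [link]")
    return "\n".join(lines)

print(build_alert("API response time > 1 second"))
```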
❌ Mistake 2: Treating All Signals Equally
The Problem: You're drowning in data, unable to find the signal in the noise.
Not all signals are equally valuable:
| Signal Priority | Examples | When to Check |
|---|---|---|
| 🔴 Critical | User-facing errors, data corruption, security breaches | Alert immediately, wake someone up |
| 🟡 Warning | High resource usage, elevated latency, increasing error rates | Monitor closely, may need intervention |
| 🟢 Informational | Deployment events, scaling events, configuration changes | Check during investigation for context |
| ⚪ Diagnostic | Detailed traces, debug logs, method-level metrics | Use only when debugging specific issues |
❌ Mistake 3: Ignoring Signal Context
The Problem: A signal means different things in different contexts.
Example: CPU at 80%
- During daily batch job at 3 AM: ✅ Normal
- During low-traffic period on web server: ⚠️ Suspicious
- After deployment: 🚨 Potential problem
- On Black Friday: ✅ Expected
Better Approach: Context-aware signals
CPU usage: 80%
Context: Tuesday 2:47 PM (typical usage: 35%)
Change: +45 percentage points in 5 minutes
Correlation: Deployment completed 6 minutes ago
Verdict: 🚨 Investigate immediately
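Here is a minimal sketch of that kind of context-aware judgment: the same 80% CPU reading is interpreted against a time-of-day baseline and a recent-deployment check. The baseline table and deployment timestamp are illustrative.

```python
# A minimal sketch of interpreting the same CPU reading differently depending
# on context. Baselines and the deployment timestamp are illustrative.
from datetime import datetime, timedelta

TYPICAL_CPU_BY_HOUR = {hour: 35.0 for hour in range(24)}  # e.g. learned from history
TYPICAL_CPU_BY_HOUR[3] = 85.0                             # nightly batch job window

def assess_cpu(cpu_pct: float, now: datetime, last_deploy: datetime) -> str:
    baseline = TYPICAL_CPU_BY_HOUR[now.hour]
    recently_deployed = now - last_deploy < timedelta(minutes=15)
    if cpu_pct <= baseline + 10:
        return "expected for this time of day"
    if recently_deployed:
        return "investigate immediately: large jump right after a deployment"
    return "suspicious: well above the usual level for this hour"

now = datetime(2026, 1, 15, 14, 47)
print(assess_cpu(80.0, now, last_deploy=now - timedelta(minutes=6)))
```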
❌ Mistake 4: The "Dashboard Operator" Trap
The Problem: Staring at dashboards waiting for symptoms to appear.
Why It Fails:
- Humans are terrible at sustained attention
- Symptoms appear only after users are affected
- You're reactive, not proactive
Better Approach: Automated signal analysis
SIGNAL ANALYSIS WORKFLOW
Signals → Automated Analysis → Smart Alerts
                  │
                  ├─ Anomaly detection
                  ├─ Pattern recognition
                  ├─ Predictive analysis
                  ├─ Correlation discovery
                  └─ Root cause suggestion
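As one small piece of that pipeline, here is a minimal sketch of the anomaly-detection step: flag the newest sample of a signal when it sits far outside its recent window. The latency series is illustrative.

```python
# A minimal sketch of the anomaly-detection step: flag the newest sample when
# it sits far outside the recent window. The series is illustrative.
from statistics import mean, stdev

def is_anomalous(series: list, window: int = 30, threshold: float = 3.0) -> bool:
    history, latest = series[-window - 1:-1], series[-1]
    if len(history) < 2:
        return False
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold   # simple z-score test

latency_ms = [120, 118, 125, 122, 119, 121, 124, 118, 480]   # sudden spike
print(is_anomalous(latency_ms))   # True: worth an automated alert or annotation
```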
❌ Mistake 5: Collecting Signals Without Purpose
The Problem: "Let's collect everything just in case!"
Why It Fails:
- Storage costs explode
- Query performance degrades
- Important signals buried in noise
- Nobody knows what signals mean
Better Approach: Signal taxonomy with purpose
For each signal, document:
- What it measures (definition)
- Why it matters (impact)
- When to check it (context)
- How to interpret it (thresholds)
- Where it fits in debugging (common use cases)
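Here is a minimal sketch of such a taxonomy as structured data, so the documentation lives next to the code that uses it; the field values are illustrative.

```python
# A minimal sketch of a lightweight signal catalog capturing the five points
# above as structured data. Field values are illustrative.
from dataclasses import dataclass

@dataclass
class SignalSpec:
    name: str             # what it measures (definition)
    why_it_matters: str   # impact
    when_to_check: str    # context
    thresholds: str       # how to interpret it
    used_for: str         # where it fits in debugging

CATALOG = [
    SignalSpec(
        name="db_connection_pool_utilization",
        why_it_matters="Exhaustion blocks every request that needs the database",
        when_to_check="Latency spikes, 500s, or cascading failures",
        thresholds="warn > 80%, critical > 95% for 5 minutes",
        used_for="Distinguishing slow queries from pool starvation",
    ),
]

for spec in CATALOG:
    print(f"{spec.name}: {spec.why_it_matters}")
```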
Key Takeaways
The Core Principle
Signals are data. Symptoms are interpretations.
Your observability strategy should:
- Collect comprehensive signals
- Aggregate signals into symptoms
- Alert on symptoms
- Investigate using signals
- Resolve root causes (found in signals)
Mental Models to Remember
The Medical Analogy:
- Fever (symptom) → Infection (signal) → Bacteria (root cause)
- High latency (symptom) → CPU saturation (signal) → Memory leak (root cause)
The Detective Analogy:
- Crime scene (symptom) → Evidence (signals) → Perpetrator (root cause)
- Service down (symptom) → Crash logs (signals) → Null pointer (root cause)
Practical Implementation
When building observability:
- Start with signals (what can the system tell us?)
- Define symptoms (what do users care about?)
- Map relationships (which signals cause which symptoms?)
- Build workflows (symptom detected → relevant signals → guided investigation)
When responding to incidents:
- Acknowledge the symptom (what's broken?)
- Gather signals (what's the data saying?)
- Form hypothesis (what could explain these signals?)
- Test hypothesis (do other signals confirm this?)
- Find root cause (what signal points to the source?)
Quick Reference: Signals vs Symptoms
| Aspect | Signals | Symptoms |
|---|---|---|
| Nature | Raw data, objective measurements | Derived indicators, interpreted state |
| Purpose | Diagnosis and root cause analysis | Detection and awareness |
| Volume | High cardinality, detailed | Low cardinality, aggregated |
| Examples | Memory bytes used, query duration, error logs | Service degraded, high latency, elevated errors |
| Alert on? | Rarely (only critical signals) | Yes (user-impacting symptoms) |
| Who cares? | Engineers during investigation | Everyone (users, business, on-call) |
| Time horizon | Leading indicator (early warning) | Lagging indicator (problem already exists) |
| Cardinality | Thousands to millions of unique streams | Dozens to hundreds of indicators |
Next Steps in Your Journey
You've learned to distinguish signals from symptoms. This foundational shift in thinking prepares you for:
- Building effective SLIs/SLOs (next lesson: translating business impact into measurable symptoms)
- Designing alert strategies (knowing when to alert on symptoms vs. signals)
- Root cause analysis techniques (signal correlation, anomaly detection, causation)
- Observability architecture (collecting, storing, and querying signals at scale)
💡 Remember: The goal isn't to collect more data; it's to collect the right signals and understand which symptoms they produce.
Further Study
- Google SRE Book - Monitoring Distributed Systems - Deep dive into monitoring philosophy
- Charity Majors - Observability Engineering - Modern observability principles
- OpenTelemetry Documentation - Standards for signals collection
Lesson complete! Practice with the flashcards above and test your understanding with the quiz questions. The next lesson will build on this foundation to explore how signals aggregate into Service Level Indicators.