Observability Architecture

Design vendor-neutral telemetry pipelines that support future migrations and tool evolution

This lesson covers signal collection, data pipelines, storage strategies, and query patterns: essential concepts for building production-grade observability systems that help teams diagnose issues and understand system behavior at scale.

Welcome to Observability Architecture 👋

Observability has evolved from simple log files and basic monitoring to sophisticated distributed systems that ingest, process, and analyze billions of data points per second. The architecture you choose determines whether your team quickly diagnoses production incidents or drowns in data without insight.

Why architecture matters: A well-designed observability system provides fast answers during incidents, scales economically, and adapts as your infrastructure grows. Poor architecture leads to slow queries, storage costs that spiral out of control, and blind spots during critical outages.

In this lesson, you'll learn how the pieces fit together, from collecting signals at the edge to querying them during an incident. We'll explore real-world trade-offs, common pitfalls, and practical patterns that production teams rely on every day.

πŸ—οΈ Core Concepts: The Observability Stack

The Three Pillars (And Why That Model Is Evolving)

Traditionally, observability rested on three pillars:

📊 Metrics - Numerical measurements over time (CPU usage, request rate, error count)

📝 Logs - Discrete event records with timestamps and context

🔍 Traces - Request flows showing execution paths across services

But modern architectures recognize these aren't separate pillars; they're different lenses for viewing the same underlying events. A single HTTP request generates metrics (duration, status code), logs (access log entry), and traces (span representing the request). The trend is toward unified observability where all signals share common infrastructure.

Signal Collection: The Edge Problem

Every observability architecture starts with signal collection: capturing data where it originates.

┌─────────────────────────────────────────────┐
│         SIGNAL COLLECTION LAYER             │
└─────────────────────────────────────────────┘

📱 Application          🖥️ Infrastructure
     │                     │
     ▼                     ▼
┌─────────┐           ┌──────────┐
│ SDK/    │           │ Agents   │
│ Library │           │ (node,   │
│ (OTel)  │           │ cAdvisor)│
└────┬────┘           └────┬─────┘
     │                     │
     └─────────┬───────────┘
               ▼
       ┌───────────────┐
       │  Collector    │  ← Aggregation point
       │  (Agent/      │
       │   Gateway)    │
       └───────┬───────┘
               │
               ▼
       [ Pipeline / Backend ]

Key architectural decisions:

1. Push vs. Pull

  • Push: Applications send data to collectors (common for logs, traces)
  • Pull: Collectors scrape endpoints (Prometheus model for metrics)
  • Hybrid: Many systems use both

2. Agent Deployment

  • Sidecar: Agent runs alongside each application container
  • DaemonSet: One agent per node serves all containers
  • Library-based: Instrumentation embedded directly in application

💡 Tip: Start with library-based instrumentation for flexibility, and add agents for the infrastructure signals that applications can't see.
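
Here is a minimal sketch of library-based instrumentation with the OpenTelemetry Python SDK. It exports spans to the console so it runs standalone; in practice you would swap in an OTLP exporter pointed at a collector. The service and span names are illustrative.

```python
# Library-based instrumentation sketch with the OpenTelemetry Python SDK.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# ConsoleSpanExporter stands in for an OTLP exporter pointed at a collector.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout")

def handle_request(order_id: str) -> None:
    # Each request becomes a span; attributes carry request-level context.
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("order.id", order_id)
        # ... business logic ...

handle_request("order-42")
```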

Data Pipelines: From Signal to Storage

Once collected, signals flow through pipelines that transform, route, and enrich them.

| Pipeline Stage | Purpose | Examples |
| --- | --- | --- |
| Ingestion | Receive and parse incoming data | Protocol handlers (HTTP, gRPC), format parsers (JSON, Protobuf) |
| Processing | Transform and enrich | Add metadata, sample, aggregate, filter |
| Routing | Direct to appropriate storage | Send errors to alerting, metrics to TSDB, traces to trace store |
| Buffering | Handle backpressure | Queue for retry, shed load during spikes |

Processing patterns:

Sampling - Reduce volume by keeping a representative subset

  • Head-based: Decide at collection time (keep 10% randomly)
  • Tail-based: Decide after seeing full trace (keep all errors, sample success)

Aggregation - Pre-compute summaries to reduce storage

  • Temporal: Roll up minute-level data into hourly averages
  • Spatial: Combine metrics across replica instances

Enrichment - Add context for easier querying

  • Attach environment labels (prod/staging)
  • Add Kubernetes metadata (pod name, namespace)
  • Correlate with deployment events

Storage: The Cost-Performance Trade-off

Storage architecture determines query speed, retention capabilities, and, critically, cost.

┌──────────────────────────────────────────────┐
│        STORAGE ARCHITECTURE SPECTRUM         │
└──────────────────────────────────────────────┘

    Hot Storage              Warm              Cold
    (Fast queries)        (Balanced)      (Archival)
         │                    │                │
         ▼                    ▼                ▼
    ┌─────────┐          ┌─────────┐      ┌─────────┐
    │  SSD    │          │ Object  │      │ Glacier │
    │  RAM    │          │ Storage │      │ Tape    │
    │ Minutes │          │  Days   │      │ Months  │
    │ to Hours│          │ to Weeks│      │ to Years│
    └─────────┘          └─────────┘      └─────────┘
       $$$                  $$                $
    Milliseconds         Seconds           Minutes

Storage types by signal:

Time-Series Databases (TSDB) - Optimized for metrics

  • Examples: Prometheus, InfluxDB, TimescaleDB, M3DB
  • Compression algorithms exploit time-series patterns
  • Downsampling: Keep high-resolution recent data, lower resolution historical

Log Aggregation Systems - Optimized for text search

  • Examples: Elasticsearch, Loki, ClickHouse
  • Inverted indexes for full-text search
  • Columnar storage for analytical queries

Trace Stores - Optimized for graph queries

  • Examples: Jaeger, Tempo, Zipkin
  • Index by trace ID for fast lookup
  • Store spans with parent-child relationships

Unified Storage - Single backend for all signals

  • Examples: OpenSearch, ClickHouse, Apache Druid
  • Reduces operational complexity
  • May sacrifice specialized optimizations

⚠️ Storage retention strategy:

| Tier | Resolution | Retention | Use Case |
| --- | --- | --- | --- |
| Real-time | Raw (1s) | 6-24 hours | Active incident investigation |
| Recent | Downsampled (1m) | 7-30 days | Recent history, debugging |
| Historical | Aggregated (5m) | 90-365 days | Trend analysis, capacity planning |
| Archive | Summary (1h) | 1+ years | Compliance, long-term patterns |
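
A sketch of the tiering logic implied by the table above: pick a resolution and storage tier from the age of the data. The thresholds mirror the table and are illustrative defaults.

```python
# Tiered-retention sketch: resolution and tier chosen by data age.
from datetime import timedelta

def retention_tier(age: timedelta) -> tuple[str, str]:
    """Return (resolution, tier) for a data point of the given age."""
    if age <= timedelta(hours=24):
        return "raw-1s", "hot"
    if age <= timedelta(days=30):
        return "downsampled-1m", "warm"
    if age <= timedelta(days=365):
        return "aggregated-5m", "warm"
    return "summary-1h", "cold"

print(retention_tier(timedelta(hours=3)))    # ('raw-1s', 'hot')
print(retention_tier(timedelta(days=200)))   # ('aggregated-5m', 'warm')
```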

Query Layer: Making Data Accessible

The query layer sits between storage and users, providing the interface for exploration and alerting.

Query patterns:

1. Time-range queries - "Show me error rate for the last hour"

  • Most common pattern in observability
  • Optimized by time-indexed storage

2. Aggregations - "What's the P95 latency by endpoint?"

  • Group by dimensions (endpoint, region, customer)
  • Apply functions (percentile, average, max)

3. Correlations - "Find traces with high latency AND errors"

  • Join conditions across signal types
  • Critical for root cause analysis

4. Exemplars - "Show me example traces matching this metric spike"

  • Jump from aggregated view to raw examples
  • Bridges metrics and traces

┌─────────────────────────────────────────────┐
│         QUERY ARCHITECTURE LAYERS           │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│  UI / Visualization (Grafana, Custom)       │
└────────────────┬────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────┐
│  Query API (PromQL, LogQL, TraceQL)         │
└────────────────┬────────────────────────────┘
                 │
          ┌──────┴──────┐
          ▼             ▼
    ┌──────────┐   ┌──────────┐
    │ Query    │   │  Cache   │
    │ Engine   │───│  Layer   │
    └────┬─────┘   └──────────┘
         │
         ▼
    ┌──────────┐
    │ Storage  │
    │ Backend  │
    └──────────┘

Query optimization techniques:

Materialized views - Pre-compute common queries

  • Example: Pre-aggregate metrics by service every minute
  • Trade-off: Storage space for query speed

Query result caching - Store recent query results

  • Helps with dashboard refreshes
  • Invalidate on new data or time window shift

Index strategies - Speed up lookups

  • Inverted indexes: Find all logs containing "error"
  • Bitmap indexes: Filter by high-cardinality labels
  • Time-based partitioning: Skip irrelevant time ranges

📖 Architecture Examples

Example 1: Small-Scale Startup Architecture

Scenario: A startup with 5 microservices, 20 containers, 100K requests/day

Architecture:

┌───────────────────────────────────────────┐
│         STARTUP STACK                     │
└───────────────────────────────────────────┘

  📱 Applications (OpenTelemetry SDK)
           │
           ▼
  ┌─────────────────┐
  │ OTel Collector  │  (Single instance)
  │  - Receives all │
  │    signals      │
  │  - Basic filter │
  └────────┬────────┘
           │
     ┌─────┴─────┐
     ▼           ▼
┌──────────┐  ┌──────────┐
│Prometheus│  │ Grafana  │
│(Metrics) │  │  Cloud   │
│          │  │ (Logs/   │
│+ Loki    │  │  Traces) │
│(Logs)    │  │          │
└──────────┘  └──────────┘

Key decisions:

  • Single collector: Simplifies operations, sufficient for this scale
  • Prometheus + Loki: Open-source, runs on same infrastructure
  • Grafana Cloud for traces: Avoid running complex trace storage
  • Retention: 15 days local metrics/logs, 7 days traces

Cost: ~$200/month (mostly Grafana Cloud)

Trade-offs:

  • βœ… Simple to operate
  • βœ… Low cost
  • ⚠️ Single point of failure (collector)
  • ⚠️ Limited scale headroom

Example 2: Mid-Scale SaaS Architecture

Scenario: Growing SaaS with 50 services, 500 containers, 50M requests/day

Architecture:

┌────────────────────────────────────────────────┐
│         MID-SCALE ARCHITECTURE                 │
└────────────────────────────────────────────────┘

  📱 Apps (OTel SDK) + 🖥️ Infra (Node agents)
              │
              ▼
      ┌────────────────┐
      │ OTel Collectors│  (DaemonSet on K8s)
      │  - Per-node    │
      │  - Sampling    │
      │  - Enrichment  │
      └───────┬────────┘
              │
              ▼
      ┌────────────────┐
      │ Load Balancer  │
      └───────┬────────┘
              │
      ┌───────┴────────┐
      │                │
      ▼                ▼
 ┌───────────┐    ┌───────────┐
 │  Gateway  │    │  Gateway  │  (Redundant)
 │ Collector │    │ Collector │
 └─────┬─────┘    └─────┬─────┘
       │                │
       └────────┬───────┘
                │
       ┌────────┴───────┐
       ▼                ▼
 ┌────────────┐   ┌────────────┐
 │    M3DB    │   │ TempoStack │
 │  (Metrics) │   │  (Traces)  │
 │            │   │            │
 │+ ClickHouse│   │ + Grafana  │
 │   (Logs)   │   │            │
 └────────────┘   └────────────┘

Key decisions:

  • Two-tier collectors: Edge (DaemonSet) + Gateway (centralized)
  • M3DB for metrics: Better scalability than Prometheus
  • ClickHouse for logs: Fast analytical queries, cost-effective
  • Tempo for traces: Open-source, S3-backed
  • Sampling: 100% errors, 10% success cases, 1% of high-throughput endpoints

Data flow:

  1. Apps emit signals β†’ Local DaemonSet collector
  2. DaemonSet adds K8s metadata, samples
  3. Gateway collectors aggregate, route by signal type
  4. Storage backends optimized per signal

Cost: ~$3-5K/month (mostly compute and storage)

Trade-offs:

  • βœ… Redundant collection path
  • βœ… Scales to 10x current load
  • βœ… Cost-optimized storage choices
  • ⚠️ More operational complexity
  • ⚠️ Multiple systems to maintain

Example 3: Enterprise Multi-Tenant Architecture

Scenario: Platform serving 100+ internal teams, thousands of services, billions of requests/day

Architecture:

┌─────────────────────────────────────────────────┐
│      ENTERPRISE MULTI-TENANT ARCHITECTURE       │
└─────────────────────────────────────────────────┘

📱 Apps (SDK) → 🔹 Regional Edge Collectors
                     │
                     ▼
              ┌──────────────┐
              │ Kafka Cluster│  (Buffer + Routing)
              │  - Topics by │
              │    tenant    │
              │  - Topics by │
              │    signal    │
              └──────┬───────┘
                     │
        ┌────────────┼────────────┐
        │            │            │
        ▼            ▼            ▼
  ┌───────────┐ ┌───────────┐ ┌───────────┐
  │  Metrics  │ │   Logs    │ │  Traces   │
  │ Processors│ │ Processors│ │ Processors│
  │ (Stateful)│ │(Stateless)│ │ (Stateful)│
  └─────┬─────┘ └─────┬─────┘ └─────┬─────┘
        │             │             │
        ▼             ▼             ▼
  ┌───────────┐ ┌───────────┐ ┌───────────┐
  │  Cortex   │ │OpenSearch │ │  Jaeger   │
  │ (Multi-   │ │ (Multi-   │ │ (Multi-   │
  │  tenant)  │ │  tenant)  │ │  tenant)  │
  └─────┬─────┘ └─────┬─────┘ └─────┬─────┘
        │             │             │
        └─────────────┼─────────────┘
                      ▼
              ┌───────────────┐
              │ Unified Query │
              │      API      │
              └───────┬───────┘
                      │
              ┌───────┴───────┐
              │               │
              ▼               ▼
          ┌────────┐     ┌────────┐
          │ Team A │     │ Team B │
          │ Grafana│     │ Grafana│
          └────────┘     └────────┘

Key decisions:

  • Kafka as backbone: Decouples collection from processing

    • Replay capability for reprocessing
    • Buffer during storage outages
    • Multi-consumer pattern (same data β†’ multiple destinations)
  • Multi-tenant storage: Each team isolated

    • Query limits prevent noisy neighbors
    • Cost attribution per tenant
    • Separate retention policies
  • Processing layer: Stream processors between Kafka and storage

    • Stateless: Simple transformations (parsing, filtering)
    • Stateful: Aggregations requiring memory (rate calculations, cardinality estimation)
  • Regional deployment: Data stays in region for compliance

    • Cross-region federation for global views
    • Local edge collectors reduce latency

Advanced features:

  • Adaptive sampling: ML-based decisions on what to keep
  • Anomaly detection: Statistical models flag unusual patterns
  • Automatic correlation: Link related signals across services
  • Cost controls: Per-tenant budgets with hard limits

Cost: $50K-200K/month (depends on scale)

Trade-offs:

  • βœ… Scales to massive volume
  • βœ… Strong tenant isolation
  • βœ… Flexible routing and processing
  • βœ… Survives component failures
  • ⚠️ High operational complexity
  • ⚠️ Requires dedicated platform team
  • ⚠️ Significant infrastructure cost

Example 4: Hybrid Cloud Architecture

Scenario: Regulated industry with on-prem core systems and cloud-native customer-facing services

Architecture:

┌─────────────────────────────────────────────────┐
│           HYBRID CLOUD ARCHITECTURE             │
└─────────────────────────────────────────────────┘

  ON-PREMISES                    CLOUD (AWS)
  ┌──────────────┐              ┌──────────────┐
  │              │              │              │
  │ 🏛️ Core Apps │              │ ☁️ Services  │
  │              │              │              │
  │      │       │              │      │       │
  │      ▼       │              │      ▼       │
  │ ┌──────────┐ │              │ ┌──────────┐ │
  │ │Collector │ │              │ │Collector │ │
  │ └────┬─────┘ │              │ └────┬─────┘ │
  │      │       │              │      │       │
  │      ▼       │              │      ▼       │
  │ ┌──────────┐ │   Secure     │ ┌──────────┐ │
  │ │  Local   │ │   Tunnel     │ │ Managed  │ │
  │ │ Storage  │ │   (VPN/      │ │ Observ.  │ │
  │ │          │◄┼───Direct     │ │ Platform │ │
  │ │          │ │   Connect)   │ │          │ │
  │ └──────────┘ │              │ └────┬─────┘ │
  │              │              │      │       │
  │      ▲       │              │      │       │
  │      │       │              │      │       │
  │  Replicate   │              │      ▼       │
  │  (filtered)  │              │ ┌──────────┐ │
  │              │              │ │Federated │ │
  └──────────────┘              │ │  Query   │ │
                                │ └──────────┘ │
                                │              │
                                └──────────────┘

Key decisions:

  • Data residency: Sensitive signals stay on-prem

    • Patient data, financial records remain local
    • Aggregated metrics (no PII) replicate to cloud
  • Federated queries: Unified view across environments

    • Query API spans both locations
    • Results merged before presentation
  • Selective replication: Filter before sending to cloud

    • Strip sensitive fields
    • Sample heavily to reduce egress costs
  • Dual storage: Same signal types in both locations

    • Full fidelity on-prem for compliance
    • Subset in cloud for correlation with cloud services

Challenges:

  • Network partitions between sites
  • Latency in federated queries
  • Data synchronization complexity
  • Egress costs (can be $$$)

Cost: Variable (depends on data volumes and egress)

⚠️ Common Mistakes

1. Over-Instrumenting Without Sampling Strategy

Mistake: Emitting every single event at full fidelity

Why it's wrong:

  • Storage costs explode (can easily hit 6-7 figures/year)
  • Query performance degrades with excessive data
  • Signal-to-noise ratio drops

Better approach:

  • Start with coarse sampling (1-10%)
  • Always capture errors and slow requests
  • Implement dynamic sampling based on traffic patterns
  • Use head-based sampling for known high-volume endpoints

💡 Rule of thumb: If you're storing 100% of success cases for an endpoint handling millions of requests, you're probably over-collecting.

2. Single Point of Failure in Collection

Mistake: All signals flow through one collector instance

Why it's wrong:

  • Collector failure = complete observability blackout
  • Restart/upgrade requires downtime
  • No graceful degradation

Better approach:

  • Deploy collectors as redundant sets
  • Use client-side failover (try collector A, then B)
  • Buffer signals in message queue if collectors unavailable
  • Monitor the monitors (meta-observability)

3. Ignoring Cardinality Explosion

Mistake: Adding high-cardinality labels to metrics (user ID, request ID, email)

Why it's wrong:

  • Time-series databases create one series per unique label combination
  • Example: 100 metrics Γ— 1M users = 100M series
  • Queries slow to a crawl, storage exhausted

Better approach:

  • Keep metric labels low-cardinality (service, endpoint, status, region)
  • Put high-cardinality data in traces or logs instead
  • Use exemplars to link metrics β†’ high-cardinality traces
  • If you must have high-cardinality metrics, use specialized storage (M3DB, ClickHouse)

| Cardinality | Example Labels | Appropriate? |
| --- | --- | --- |
| Low (✅) | service="api", endpoint="/users", status="200" | Perfect for metrics |
| Medium (⚠️) | + customer_tier="enterprise" (100s of values) | Acceptable with care |
| High (❌) | + user_id="abc123" (millions of values) | Use traces/logs instead |
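
A back-of-the-envelope sketch of why this matters: the series count is roughly the product of each label's distinct values, so one unbounded label dwarfs everything else. The numbers are illustrative.

```python
# Cardinality back-of-the-envelope: series ~= product of label cardinalities.
def series_count(label_cardinalities: dict[str, int]) -> int:
    total = 1
    for distinct_values in label_cardinalities.values():
        total *= distinct_values
    return total

safe = {"service": 50, "endpoint": 40, "status": 5, "region": 4}
risky = {**safe, "user_id": 1_000_000}

print(series_count(safe))   # 40,000 series: manageable
print(series_count(risky))  # 40,000,000,000 series: unusable
```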

4. No Retention Strategy

Mistake: Keeping all data at full resolution forever

Why it's wrong:

  • Storage costs grow linearly (or worse) over time
  • Old high-resolution data rarely queried
  • Queries scan unnecessary data

Better approach:

  • Implement tiered retention (see storage section)
  • Downsample old metrics (1s β†’ 1m β†’ 5m β†’ 1h)
  • Archive logs to cheap object storage (S3 Glacier)
  • Delete traces older than 30-90 days
  • Keep aggregated summaries for long-term trends

5. Mixing Operational and Business Metrics

Mistake: Using the same observability system for both system health and business analytics

Why it's wrong:

  • Different query patterns (real-time alerts vs. batch analytics)
  • Different retention needs (days vs. years)
  • Business analytics often requires joins, complex aggregations
  • Observability systems optimized for time-series, not analytics

Better approach:

  • Observability platform: System health, performance, errors
  • Data warehouse: Business metrics, revenue, user behavior
  • Stream business events to both if needed
  • Use observability for "what's broken?" and analytics for "how's the business?"

6. Not Planning for Query Performance

Mistake: Assuming queries will always be fast

Why it's wrong:

  • Dashboard with 20 panels = 20+ queries on every refresh
  • Ad-hoc queries during incidents can overwhelm storage
  • No query limits = one bad query impacts everyone

Better approach:

  • Implement query timeouts and result limits
  • Pre-aggregate common queries (materialized views)
  • Cache dashboard results (1-5 minute TTL)
  • Use query sharding for large time ranges
  • Provide query cost estimates before execution

🎯 Key Takeaways

📋 Quick Reference: Observability Architecture Essentials

| Architecture Layer | Key Decisions |
| --- | --- |
| 🔍 Collection | Push vs. Pull, Agent deployment model, Instrumentation approach |
| 🔄 Pipeline | Sampling strategy, Enrichment rules, Routing logic, Buffering capacity |
| 💾 Storage | Signal-specific vs. unified, Retention tiers, Hot/warm/cold strategy |
| 🔎 Query | Caching strategy, Index design, Query limits, Federation approach |

Golden Rules:

1️⃣ Sample intelligently - Not everything needs 100% capture

2️⃣ Cardinality kills - High-cardinality labels destroy time-series databases

3️⃣ Tier your retention - Fresh data hot, old data cold, ancient data gone

4️⃣ Build redundancy - Single points of failure = blind during outages

5️⃣ Right tool, right job - Metrics for aggregates, logs for details, traces for flows

6️⃣ Cost visibility - Track spend per signal type, per team, per service

7️⃣ Query performance matters - Slow queries during incidents = worse outcomes

🧠 Architecture Decision Framework

When designing your observability architecture, ask:

Scale questions:

  • How many services/containers/requests?
  • What's the growth trajectory?
  • What's the event volume per second?

Reliability questions:

  • What's the cost of missing signals during an incident?
  • Can you tolerate collection delays?
  • What's your RTO for observability platform outages?

Cost questions:

  • What's the budget?
  • What's the cost per GB ingested/stored?
  • Can you optimize with sampling/aggregation?

Team questions:

  • How many engineers to operate the platform?
  • Build vs. buy vs. managed service?
  • What's the existing expertise?

Compliance questions:

  • Where must data reside?
  • How long must you retain it?
  • What PII/PHI concerns exist?

📚 Further Study

Essential Resources:

  1. OpenTelemetry Documentation - https://opentelemetry.io/docs/concepts/observability-primer/ Industry-standard instrumentation framework, excellent architecture guides

  2. Observability Engineering (Book) - https://www.oreilly.com/library/view/observability-engineering/9781492076438/ Comprehensive guide to observability principles and practices by Charity Majors, Liz Fong-Jones, George Miranda

  3. Google SRE Book - Monitoring Chapter - https://sre.google/sre-book/monitoring-distributed-systems/ Foundational concepts from Google's SRE practices, especially the "Four Golden Signals"


Next Steps: Now that you understand observability architecture, explore specific implementation patterns: distributed tracing setup, log aggregation strategies, and metric collection best practices. Practice by designing architectures for different scenarios: startup MVP, scaling growth phase, enterprise compliance needs.

Your observability architecture shapes how quickly your team can diagnose and resolve production issues. Invest time in getting it right early, but remain flexible; the best architecture evolves with your system's needs. 🚀