Observability Architecture

Design vendor-neutral telemetry pipelines that support future migrations and tool evolution

This lesson covers signal collection, data pipelines, storage strategies, and query patterns: essential concepts for building production-grade observability systems that help teams diagnose issues and understand system behavior at scale.

Welcome to Observability Architecture 👋

Observability has evolved from simple log files and basic monitoring to sophisticated distributed systems that ingest, process, and analyze billions of data points per second. The architecture you choose determines whether your team quickly diagnoses production incidents or drowns in data without insight.

Why architecture matters: A well-designed observability system provides fast answers during incidents, scales economically, and adapts as your infrastructure grows. Poor architecture leads to slow queries, storage costs that spiral out of control, and blind spots during critical outages.

In this lesson, you'll learn how the pieces fit together, from collecting signals at the edge to querying them during an incident. We'll explore real-world trade-offs, common pitfalls, and practical patterns that production teams rely on every day.

πŸ—οΈ Core Concepts: The Observability Stack

The Three Pillars (And Why That Model Is Evolving)

Traditionally, observability rested on three pillars:

📊 Metrics - Numerical measurements over time (CPU usage, request rate, error count)

📝 Logs - Discrete event records with timestamps and context

🔍 Traces - Request flows showing execution paths across services

But modern architectures recognize these aren't separate pillars; they're different lenses for viewing the same underlying events. A single HTTP request generates metrics (duration, status code), logs (access log entry), and traces (span representing the request). The trend is toward unified observability where all signals share common infrastructure.

Signal Collection: The Edge Problem

Every observability architecture starts with signal collection: capturing data where it originates.

┌─────────────────────────────────────────────┐
│         SIGNAL COLLECTION LAYER             │
└─────────────────────────────────────────────┘

📱 Application          🖥️ Infrastructure
     │                     │
     ▼                     ▼
┌─────────┐           ┌──────────┐
│ SDK/    │           │ Agents   │
│ Library │           │ (node,   │
│ (OTel)  │           │ cAdvisor)│
└────┬────┘           └────┬─────┘
     │                     │
     └─────────┬───────────┘
               ▼
       ┌───────────────┐
       │  Collector    │  ← Aggregation point
       │  (Agent/      │
       │   Gateway)    │
       └───────┬───────┘
               │
               ▼
       [ Pipeline / Backend ]

Key architectural decisions:

1. Push vs. Pull

  • Push: Applications send data to collectors (common for logs, traces)
  • Pull: Collectors scrape endpoints (Prometheus model for metrics)
  • Hybrid: Many systems use both

2. Agent Deployment

  • Sidecar: Agent runs alongside each application container
  • DaemonSet: One agent per node serves all containers
  • Library-based: Instrumentation embedded directly in application

💡 Tip: Start with library-based instrumentation for flexibility, and add agents for the infrastructure signals that applications can't see.
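
Here is a minimal sketch of library-based instrumentation with the OpenTelemetry Python SDK. It exports spans to the console so it runs standalone; in practice you would swap in an OTLP exporter pointed at a collector. The service and span names are illustrative.

```python
# Library-based instrumentation sketch with the OpenTelemetry Python SDK.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# ConsoleSpanExporter stands in for an OTLP exporter pointed at a collector.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout")

def handle_request(order_id: str) -> None:
    # Each request becomes a span; attributes carry request-level context.
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("order.id", order_id)
        # ... business logic ...

handle_request("order-42")
```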

Data Pipelines: From Signal to Storage

Once collected, signals flow through pipelines that transform, route, and enrich them.

| Pipeline Stage | Purpose | Examples |
| --- | --- | --- |
| Ingestion | Receive and parse incoming data | Protocol handlers (HTTP, gRPC), format parsers (JSON, Protobuf) |
| Processing | Transform and enrich | Add metadata, sample, aggregate, filter |
| Routing | Direct to appropriate storage | Send errors to alerting, metrics to TSDB, traces to trace store |
| Buffering | Handle backpressure | Queue for retry, shed load during spikes |

Processing patterns:

Sampling - Reduce volume by keeping a representative subset

  • Head-based: Decide at collection time (keep 10% randomly)
  • Tail-based: Decide after seeing full trace (keep all errors, sample success)

Aggregation - Pre-compute summaries to reduce storage

  • Temporal: Roll up minute-level data into hourly averages
  • Spatial: Combine metrics across replica instances

Enrichment - Add context for easier querying

  • Attach environment labels (prod/staging)
  • Add Kubernetes metadata (pod name, namespace)
  • Correlate with deployment events

Storage: The Cost-Performance Trade-off

Storage architecture determines query speed, retention capabilities, and, critically, cost.

┌──────────────────────────────────────────────┐
│        STORAGE ARCHITECTURE SPECTRUM         │
└──────────────────────────────────────────────┘

    Hot Storage              Warm              Cold
    (Fast queries)        (Balanced)      (Archival)
         │                    │                │
         ▼                    ▼                ▼
    ┌─────────┐          ┌─────────┐      ┌─────────┐
    │  SSD    │          │ Object  │      │ Glacier │
    │  RAM    │          │ Storage │      │ Tape    │
    │ Minutes │          │  Days   │      │ Months  │
    │ to Hours│          │ to Weeks│      │ to Years│
    └─────────┘          └─────────┘      └─────────┘
       $$$                  $$                $
    Milliseconds         Seconds           Minutes

Storage types by signal:

Time-Series Databases (TSDB) - Optimized for metrics

  • Examples: Prometheus, InfluxDB, TimescaleDB, M3DB
  • Compression algorithms exploit time-series patterns
  • Downsampling: Keep high-resolution recent data, lower resolution historical

Log Aggregation Systems - Optimized for text search

  • Examples: Elasticsearch, Loki, ClickHouse
  • Inverted indexes for full-text search
  • Columnar storage for analytical queries

Trace Stores - Optimized for graph queries

  • Examples: Jaeger, Tempo, Zipkin
  • Index by trace ID for fast lookup
  • Store spans with parent-child relationships

Unified Storage - Single backend for all signals

  • Examples: OpenSearch, ClickHouse, Apache Druid
  • Reduces operational complexity
  • May sacrifice specialized optimizations

⚠️ Storage retention strategy:

| Tier | Resolution | Retention | Use Case |
| --- | --- | --- | --- |
| Real-time | Raw (1s) | 6-24 hours | Active incident investigation |
| Recent | Downsampled (1m) | 7-30 days | Recent history, debugging |
| Historical | Aggregated (5m) | 90-365 days | Trend analysis, capacity planning |
| Archive | Summary (1h) | 1+ years | Compliance, long-term patterns |
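
A sketch of the tiering logic implied by the table above: pick a resolution and storage tier from the age of the data. The thresholds mirror the table and are illustrative defaults.

```python
# Tiered-retention sketch: resolution and tier chosen by data age.
from datetime import timedelta

def retention_tier(age: timedelta) -> tuple[str, str]:
    """Return (resolution, tier) for a data point of the given age."""
    if age <= timedelta(hours=24):
        return "raw-1s", "hot"
    if age <= timedelta(days=30):
        return "downsampled-1m", "warm"
    if age <= timedelta(days=365):
        return "aggregated-5m", "warm"
    return "summary-1h", "cold"

print(retention_tier(timedelta(hours=3)))    # ('raw-1s', 'hot')
print(retention_tier(timedelta(days=200)))   # ('aggregated-5m', 'warm')
```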

Query Layer: Making Data Accessible

The query layer sits between storage and users, providing the interface for exploration and alerting.

Query patterns:

1. Time-range queries - "Show me error rate for the last hour"

  • Most common pattern in observability
  • Optimized by time-indexed storage

2. Aggregations - "What's the P95 latency by endpoint?"

  • Group by dimensions (endpoint, region, customer)
  • Apply functions (percentile, average, max)

3. Correlations - "Find traces with high latency AND errors"

  • Join conditions across signal types
  • Critical for root cause analysis

4. Exemplars - "Show me example traces matching this metric spike"

  • Jump from aggregated view to raw examples
  • Bridges metrics and traces

┌─────────────────────────────────────────────┐
│         QUERY ARCHITECTURE LAYERS           │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│  UI / Visualization (Grafana, Custom)       │
└────────────────┬────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────┐
│  Query API (PromQL, LogQL, TraceQL)         │
└────────────────┬────────────────────────────┘
                 │
          ┌──────┴──────┐
          ▼             ▼
    ┌──────────┐   ┌──────────┐
    │ Query    │   │  Cache   │
    │ Engine   │───│  Layer   │
    └────┬─────┘   └──────────┘
         │
         ▼
    ┌──────────┐
    │ Storage  │
    │ Backend  │
    └──────────┘

Query optimization techniques:

Materialized views - Pre-compute common queries

  • Example: Pre-aggregate metrics by service every minute
  • Trade-off: Storage space for query speed

Query result caching - Store recent query results

  • Helps with dashboard refreshes
  • Invalidate on new data or time window shift

Index strategies - Speed up lookups

  • Inverted indexes: Find all logs containing "error"
  • Bitmap indexes: Filter by high-cardinality labels
  • Time-based partitioning: Skip irrelevant time ranges

📖 Architecture Examples

Example 1: Small-Scale Startup Architecture

Scenario: A startup with 5 microservices, 20 containers, 100K requests/day

Architecture:

┌───────────────────────────────────────────┐
│         STARTUP STACK                     │
└───────────────────────────────────────────┘

  📱 Applications (OpenTelemetry SDK)
           │
           ▼
  ┌─────────────────┐
  │ OTel Collector  │  (Single instance)
  │  - Receives all │
  │    signals      │
  │  - Basic filter │
  └────────┬────────┘
           │
     ┌─────┴─────┐
     ▼           ▼
┌──────────┐  ┌──────────┐
│Prometheus│  │ Grafana  │
│(Metrics) │  │  Cloud   │
│          │  │ (Logs/   │
│+ Loki    │  │  Traces) │
│(Logs)    │  │          │
└──────────┘  └──────────┘

Key decisions:

  • Single collector: Simplifies operations, sufficient for this scale
  • Prometheus + Loki: Open-source, runs on same infrastructure
  • Grafana Cloud for traces: Avoid running complex trace storage
  • Retention: 15 days local metrics/logs, 7 days traces

Cost: ~$200/month (mostly Grafana Cloud)

Trade-offs:

  • βœ… Simple to operate
  • βœ… Low cost
  • ⚠️ Single point of failure (collector)
  • ⚠️ Limited scale headroom

Example 2: Mid-Scale SaaS Architecture

Scenario: Growing SaaS with 50 services, 500 containers, 50M requests/day

Architecture:

┌────────────────────────────────────────────────┐
│         MID-SCALE ARCHITECTURE                 │
└────────────────────────────────────────────────┘

  📱 Apps (OTel SDK) + 🖥️ Infra (Node agents)
              │
              ▼
      ┌────────────────┐
      │ OTel Collectors│  (DaemonSet on K8s)
      │  - Per-node    │
      │  - Sampling    │
      │  - Enrichment  │
      └───────┬────────┘
              │
              ▼
      ┌────────────────┐
      │ Load Balancer  │
      └───────┬────────┘
              │
      ┌───────┴────────┐
      │                │
      ▼                ▼
 ┌───────────┐    ┌───────────┐
 │  Gateway  │    │  Gateway  │  (Redundant)
 │ Collector │    │ Collector │
 └─────┬─────┘    └─────┬─────┘
       │                │
       └────────┬───────┘
                │
       ┌────────┴───────┐
       ▼                ▼
 ┌────────────┐   ┌────────────┐
 │    M3DB    │   │ TempoStack │
 │  (Metrics) │   │  (Traces)  │
 │            │   │            │
 │+ ClickHouse│   │ + Grafana  │
 │   (Logs)   │   │            │
 └────────────┘   └────────────┘

Key decisions:

  • Two-tier collectors: Edge (DaemonSet) + Gateway (centralized)
  • M3DB for metrics: Better scalability than Prometheus
  • ClickHouse for logs: Fast analytical queries, cost-effective
  • Tempo for traces: Open-source, S3-backed
  • Sampling: 100% errors, 10% success cases, 1% of high-throughput endpoints

Data flow:

  1. Apps emit signals β†’ Local DaemonSet collector
  2. DaemonSet adds K8s metadata, samples
  3. Gateway collectors aggregate, route by signal type
  4. Storage backends optimized per signal

Cost: ~$3-5K/month (mostly compute and storage)

Trade-offs:

  • βœ… Redundant collection path
  • βœ… Scales to 10x current load
  • βœ… Cost-optimized storage choices
  • ⚠️ More operational complexity
  • ⚠️ Multiple systems to maintain

Example 3: Enterprise Multi-Tenant Architecture

Scenario: Platform serving 100+ internal teams, thousands of services, billions of requests/day

Architecture:

┌─────────────────────────────────────────────────┐
│      ENTERPRISE MULTI-TENANT ARCHITECTURE       │
└─────────────────────────────────────────────────┘

📱 Apps (SDK) → 🔹 Regional Edge Collectors
                     │
                     ▼
              ┌──────────────┐
              │ Kafka Cluster│  (Buffer + Routing)
              │  - Topics by │
              │    tenant    │
              │  - Topics by │
              │    signal    │
              └──────┬───────┘
                     │
        ┌────────────┼────────────┐
        │            │            │
        ▼            ▼            ▼
  ┌───────────┐ ┌───────────┐ ┌───────────┐
  │  Metrics  │ │   Logs    │ │  Traces   │
  │ Processors│ │ Processors│ │ Processors│
  │ (Stateful)│ │(Stateless)│ │ (Stateful)│
  └─────┬─────┘ └─────┬─────┘ └─────┬─────┘
        │             │             │
        ▼             ▼             ▼
  ┌───────────┐ ┌───────────┐ ┌───────────┐
  │  Cortex   │ │OpenSearch │ │  Jaeger   │
  │ (Multi-   │ │ (Multi-   │ │ (Multi-   │
  │  tenant)  │ │  tenant)  │ │  tenant)  │
  └─────┬─────┘ └─────┬─────┘ └─────┬─────┘
        │             │             │
        └─────────────┼─────────────┘
                      ▼
              ┌───────────────┐
              │ Unified Query │
              │      API      │
              └───────┬───────┘
                      │
              ┌───────┴───────┐
              │               │
              ▼               ▼
          ┌────────┐     ┌────────┐
          │ Team A │     │ Team B │
          │ Grafana│     │ Grafana│
          └────────┘     └────────┘

Key decisions:

  • Kafka as backbone: Decouples collection from processing

    • Replay capability for reprocessing
    • Buffer during storage outages
    • Multi-consumer pattern (same data β†’ multiple destinations)
  • Multi-tenant storage: Each team isolated

    • Query limits prevent noisy neighbors
    • Cost attribution per tenant
    • Separate retention policies
  • Processing layer: Stream processors between Kafka and storage

    • Stateless: Simple transformations (parsing, filtering)
    • Stateful: Aggregations requiring memory (rate calculations, cardinality estimation)
  • Regional deployment: Data stays in region for compliance

    • Cross-region federation for global views
    • Local edge collectors reduce latency

Advanced features:

  • Adaptive sampling: ML-based decisions on what to keep
  • Anomaly detection: Statistical models flag unusual patterns
  • Automatic correlation: Link related signals across services
  • Cost controls: Per-tenant budgets with hard limits

Cost: $50K-200K/month (depends on scale)

Trade-offs:

  • βœ… Scales to massive volume
  • βœ… Strong tenant isolation
  • βœ… Flexible routing and processing
  • βœ… Survives component failures
  • ⚠️ High operational complexity
  • ⚠️ Requires dedicated platform team
  • ⚠️ Significant infrastructure cost

Example 4: Hybrid Cloud Architecture

Scenario: Regulated industry with on-prem core systems and cloud-native customer-facing services

Architecture:

┌─────────────────────────────────────────────────┐
│           HYBRID CLOUD ARCHITECTURE             │
└─────────────────────────────────────────────────┘

  ON-PREMISES                    CLOUD (AWS)
  ┌──────────────┐              ┌──────────────┐
  │              │              │              │
  │ 🏛️ Core Apps │              │ ☁️ Services  │
  │              │              │              │
  │      │       │              │      │       │
  │      ▼       │              │      ▼       │
  │ ┌──────────┐ │              │ ┌──────────┐ │
  │ │Collector │ │              │ │Collector │ │
  │ └────┬─────┘ │              │ └────┬─────┘ │
  │      │       │              │      │       │
  │      ▼       │              │      ▼       │
  │ ┌──────────┐ │   Secure     │ ┌──────────┐ │
  │ │  Local   │ │   Tunnel     │ │ Managed  │ │
  │ │ Storage  │ │   (VPN/      │ │ Observ.  │ │
  │ │          │◄┼───Direct     │ │ Platform │ │
  │ │          │ │   Connect)   │ │          │ │
  │ └──────────┘ │              │ └────┬─────┘ │
  │              │              │      │       │
  │      ▲       │              │      │       │
  │      │       │              │      │       │
  │  Replicate   │              │      ▼       │
  │  (filtered)  │              │ ┌──────────┐ │
  │              │              │ │Federated │ │
  └──────────────┘              │ │  Query   │ │
                                │ └──────────┘ │
                                │              │
                                └──────────────┘

Key decisions:

  • Data residency: Sensitive signals stay on-prem

    • Patient data, financial records remain local
    • Aggregated metrics (no PII) replicate to cloud
  • Federated queries: Unified view across environments

    • Query API spans both locations
    • Results merged before presentation
  • Selective replication: Filter before sending to cloud

    • Strip sensitive fields
    • Sample heavily to reduce egress costs
  • Dual storage: Same signal types in both locations

    • Full fidelity on-prem for compliance
    • Subset in cloud for correlation with cloud services

Challenges:

  • Network partitions between sites
  • Latency in federated queries
  • Data synchronization complexity
  • Egress costs (can be $$$)

Cost: Variable (depends on data volumes and egress)

⚠️ Common Mistakes

1. Over-Instrumenting Without Sampling Strategy

Mistake: Emitting every single event at full fidelity

Why it's wrong:

  • Storage costs explode (can easily hit 6-7 figures/year)
  • Query performance degrades with excessive data
  • Signal-to-noise ratio drops

Better approach:

  • Start with coarse sampling (1-10%)
  • Always capture errors and slow requests
  • Implement dynamic sampling based on traffic patterns
  • Use head-based sampling for known high-volume endpoints

💡 Rule of thumb: If you're storing 100% of success cases for an endpoint handling millions of requests, you're probably over-collecting.

2. Single Point of Failure in Collection

Mistake: All signals flow through one collector instance

Why it's wrong:

  • Collector failure = complete observability blackout
  • Restart/upgrade requires downtime
  • No graceful degradation

Better approach:

  • Deploy collectors as redundant sets
  • Use client-side failover (try collector A, then B)
  • Buffer signals in message queue if collectors unavailable
  • Monitor the monitors (meta-observability)

3. Ignoring Cardinality Explosion

Mistake: Adding high-cardinality labels to metrics (user ID, request ID, email)

Why it's wrong:

  • Time-series databases create one series per unique label combination
  • Example: 100 metrics Γ— 1M users = 100M series
  • Queries slow to a crawl, storage exhausted

Better approach:

  • Keep metric labels low-cardinality (service, endpoint, status, region)
  • Put high-cardinality data in traces or logs instead
  • Use exemplars to link metrics β†’ high-cardinality traces
  • If you must have high-cardinality metrics, use specialized storage (M3DB, ClickHouse)

| Cardinality | Example Labels | Appropriate? |
| --- | --- | --- |
| Low (✅) | service="api", endpoint="/users", status="200" | Perfect for metrics |
| Medium (⚠️) | + customer_tier="enterprise" (100s of values) | Acceptable with care |
| High (❌) | + user_id="abc123" (millions of values) | Use traces/logs instead |
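
A back-of-the-envelope sketch of why this matters: the series count is roughly the product of each label's distinct values, so one unbounded label dwarfs everything else. The numbers are illustrative.

```python
# Cardinality back-of-the-envelope: series ~= product of label cardinalities.
def series_count(label_cardinalities: dict[str, int]) -> int:
    total = 1
    for distinct_values in label_cardinalities.values():
        total *= distinct_values
    return total

safe = {"service": 50, "endpoint": 40, "status": 5, "region": 4}
risky = {**safe, "user_id": 1_000_000}

print(series_count(safe))   # 40,000 series: manageable
print(series_count(risky))  # 40,000,000,000 series: unusable
```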

4. No Retention Strategy

Mistake: Keeping all data at full resolution forever

Why it's wrong:

  • Storage costs grow linearly (or worse) over time
  • Old high-resolution data rarely queried
  • Queries scan unnecessary data

Better approach:

  • Implement tiered retention (see storage section)
  • Downsample old metrics (1s β†’ 1m β†’ 5m β†’ 1h)
  • Archive logs to cheap object storage (S3 Glacier)
  • Delete traces older than 30-90 days
  • Keep aggregated summaries for long-term trends

5. Mixing Operational and Business Metrics

Mistake: Using the same observability system for both system health and business analytics

Why it's wrong:

  • Different query patterns (real-time alerts vs. batch analytics)
  • Different retention needs (days vs. years)
  • Business analytics often requires joins, complex aggregations
  • Observability systems optimized for time-series, not analytics

Better approach:

  • Observability platform: System health, performance, errors
  • Data warehouse: Business metrics, revenue, user behavior
  • Stream business events to both if needed
  • Use observability for "what's broken?" and analytics for "how's the business?"

6. Not Planning for Query Performance

Mistake: Assuming queries will always be fast

Why it's wrong:

  • Dashboard with 20 panels = 20+ queries on every refresh
  • Ad-hoc queries during incidents can overwhelm storage
  • No query limits = one bad query impacts everyone

Better approach:

  • Implement query timeouts and result limits
  • Pre-aggregate common queries (materialized views)
  • Cache dashboard results (1-5 minute TTL)
  • Use query sharding for large time ranges
  • Provide query cost estimates before execution

🎯 Key Takeaways

📋 Quick Reference: Observability Architecture Essentials

| Architecture Layer | Key Decisions |
| --- | --- |
| 🔍 Collection | Push vs. Pull, Agent deployment model, Instrumentation approach |
| 🔄 Pipeline | Sampling strategy, Enrichment rules, Routing logic, Buffering capacity |
| 💾 Storage | Signal-specific vs. unified, Retention tiers, Hot/warm/cold strategy |
| 🔎 Query | Caching strategy, Index design, Query limits, Federation approach |

Golden Rules:

1️⃣ Sample intelligently - Not everything needs 100% capture

2️⃣ Cardinality kills - High-cardinality labels destroy time-series databases

3️⃣ Tier your retention - Fresh data hot, old data cold, ancient data gone

4️⃣ Build redundancy - Single points of failure = blind during outages

5️⃣ Right tool, right job - Metrics for aggregates, logs for details, traces for flows

6️⃣ Cost visibility - Track spend per signal type, per team, per service

7️⃣ Query performance matters - Slow queries during incidents = worse outcomes

🧠 Architecture Decision Framework

When designing your observability architecture, ask:

Scale questions:

  • How many services/containers/requests?
  • What's the growth trajectory?
  • What's the event volume per second?

Reliability questions:

  • What's the cost of missing signals during an incident?
  • Can you tolerate collection delays?
  • What's your RTO for observability platform outages?

Cost questions:

  • What's the budget?
  • What's the cost per GB ingested/stored?
  • Can you optimize with sampling/aggregation?

Team questions:

  • How many engineers to operate the platform?
  • Build vs. buy vs. managed service?
  • What's the existing expertise?

Compliance questions:

  • Where must data reside?
  • How long must you retain it?
  • What PII/PHI concerns exist?

📚 Further Study

Essential Resources:

  1. OpenTelemetry Documentation - https://opentelemetry.io/docs/concepts/observability-primer/ Industry-standard instrumentation framework, excellent architecture guides

  2. Observability Engineering (Book) - https://www.oreilly.com/library/view/observability-engineering/9781492076438/ Comprehensive guide to observability principles and practices by Charity Majors, Liz Fong-Jones, George Miranda

  3. Google SRE Book - Monitoring Chapter - https://sre.google/sre-book/monitoring-distributed-systems/ Foundational concepts from Google's SRE practices, especially the "Four Golden Signals"


Next Steps: Now that you understand observability architecture, explore specific implementation patterns: distributed tracing setup, log aggregation strategies, and metric collection best practices. Practice by designing architectures for different scenarios: startup MVP, scaling growth phase, enterprise compliance needs.

Your observability architecture shapes how quickly your team can diagnose and resolve production issues. Invest time in getting it right early, but remain flexible; the best architecture evolves with your system's needs. 🚀