Observability Architecture
Design vendor-neutral telemetry pipelines that support future migrations and tool evolution
Master observability architecture with free flashcards and spaced repetition practice. This lesson covers signal collection, data pipelines, storage strategies, and query patterns: essential concepts for building production-grade observability systems that help teams diagnose issues and understand system behavior at scale.
Welcome to Observability Architecture
Observability has evolved from simple log files and basic monitoring to sophisticated distributed systems that ingest, process, and analyze billions of data points per second. The architecture you choose determines whether your team can quickly diagnose production incidents or drowns in data without insights.
Why architecture matters: A well-designed observability system provides fast answers during incidents, scales economically, and adapts as your infrastructure grows. Poor architecture leads to slow queries, storage costs that spiral out of control, and blind spots during critical outages.
In this lesson, you'll learn how the pieces fit together, from collecting signals at the edge to querying them during an incident. We'll explore real-world trade-offs, common pitfalls, and practical patterns that production teams rely on every day.
Core Concepts: The Observability Stack
The Three Pillars (And Why That Model Is Evolving)
Traditionally, observability rested on three pillars:
- Metrics - Numerical measurements over time (CPU usage, request rate, error count)
- Logs - Discrete event records with timestamps and context
- Traces - Request flows showing execution paths across services
But modern architectures recognize these aren't separate pillars; they're different lenses for viewing the same underlying events. A single HTTP request generates metrics (duration, status code), logs (an access log entry), and traces (a span representing the request). The trend is toward unified observability, where all signals share common infrastructure.
Signal Collection: The Edge Problem
Every observability architecture starts with signal collection: capturing data where it originates.
+-----------------------------------------------+
|            SIGNAL COLLECTION LAYER            |
+-----------------------------------------------+

   Application                 Infrastructure
        |                            |
        v                            v
  +-----------+                +-----------+
  |   SDK /   |                |  Agents   |
  |  Library  |                |  (node,   |
  |  (OTel)   |                |  cAdvisor)|
  +-----+-----+                +-----+-----+
        |                            |
        +-------------+--------------+
                      |
                      v
              +---------------+
              |   Collector   |  <- Aggregation point
              |   (Agent /    |
              |    Gateway)   |
              +-------+-------+
                      |
                      v
             [ Pipeline / Backend ]
Key architectural decisions:
1. Push vs. Pull
- Push: Applications send data to collectors (common for logs, traces)
- Pull: Collectors scrape endpoints (Prometheus model for metrics)
- Hybrid: Many systems use both
2. Agent Deployment
- Sidecar: Agent runs alongside each application container
- DaemonSet: One agent per node serves all containers
- Library-based: Instrumentation embedded directly in application
💡 Tip: Start with library-based instrumentation for flexibility, then add agents for infrastructure signals that applications can't see.
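To make the library-based option concrete, here is a minimal sketch using the OpenTelemetry Python SDK. The service name, route, and use of the console exporter are illustrative; a real deployment would export to a collector (for example over OTLP) instead.

```python
# A minimal sketch of library-based instrumentation with the OpenTelemetry
# Python SDK (pip install opentelemetry-sdk). Names and exporter choice are
# illustrative, not a prescribed setup.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout.instrumentation")

def handle_request(order_id: str) -> None:
    # One span per request; keep attributes low-cardinality where possible.
    with tracer.start_as_current_span("POST /orders") as span:
        span.set_attribute("http.request.method", "POST")
        span.set_attribute("http.route", "/orders")
        # ... business logic goes here ...

handle_request("abc123")
```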
Data Pipelines: From Signal to Storage
Once collected, signals flow through pipelines that transform, route, and enrich them.
| Pipeline Stage | Purpose | Examples |
|---|---|---|
| Ingestion | Receive and parse incoming data | Protocol handlers (HTTP, gRPC), format parsers (JSON, Protobuf) |
| Processing | Transform and enrich | Add metadata, sample, aggregate, filter |
| Routing | Direct to appropriate storage | Send errors to alerting, metrics to TSDB, traces to trace store |
| Buffering | Handle backpressure | Queue for retry, shed load during spikes |
Processing patterns:
Sampling - Reduce volume by keeping a representative subset
- Head-based: Decide at collection time (keep 10% randomly)
- Tail-based: Decide after seeing full trace (keep all errors, sample success)
Aggregation - Pre-compute summaries to reduce storage
- Temporal: Roll up minute-level data into hourly averages
- Spatial: Combine metrics across replica instances
Enrichment - Add context for easier querying
- Attach environment labels (prod/staging)
- Add Kubernetes metadata (pod name, namespace)
- Correlate with deployment events
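The difference between head- and tail-based sampling comes down to when the keep/drop decision is made. Here is a rough sketch in plain Python; the thresholds and span field names are assumptions for illustration.

```python
import random

def head_sample(sample_rate: float = 0.10) -> bool:
    """Head-based: decide at collection time, before the outcome is known."""
    return random.random() < sample_rate

def tail_sample(spans: list, success_rate: float = 0.10) -> bool:
    """Tail-based: decide once the whole trace has been assembled."""
    has_error = any(span.get("status") == "ERROR" for span in spans)
    is_slow = sum(span.get("duration_ms", 0) for span in spans) > 1000
    if has_error or is_slow:
        return True                          # always keep errors and slow traces
    return random.random() < success_rate    # keep a sample of successful traces
```

Tail-based sampling gives better decisions but requires buffering whole traces somewhere, which is one reason gateway-tier collectors exist.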
Storage: The Cost-Performance Trade-off
Storage architecture determines query speed, retention capabilities, and, critically, cost.
+------------------------------------------------+
|          STORAGE ARCHITECTURE SPECTRUM         |
+------------------------------------------------+

  Hot Storage          Warm              Cold
 (Fast queries)     (Balanced)        (Archival)
       |                |                 |
       v                v                 v
 +-----------+    +-----------+    +-----------+
 |    SSD    |    |  Object   |    |  Glacier  |
 |    RAM    |    |  Storage  |    |   Tape    |
 |  Minutes  |    |   Days    |    |  Months   |
 |  to Hours |    |  to Weeks |    |  to Years |
 +-----------+    +-----------+    +-----------+
      $$$              $$                $
  Milliseconds       Seconds          Minutes
Storage types by signal:
Time-Series Databases (TSDB) - Optimized for metrics
- Examples: Prometheus, InfluxDB, TimescaleDB, M3DB
- Compression algorithms exploit time-series patterns
- Downsampling: Keep high-resolution recent data, lower resolution historical
Log Aggregation Systems - Optimized for text search
- Examples: Elasticsearch, Loki, ClickHouse
- Inverted indexes for full-text search
- Columnar storage for analytical queries
Trace Stores - Optimized for graph queries
- Examples: Jaeger, Tempo, Zipkin
- Index by trace ID for fast lookup
- Store spans with parent-child relationships
Unified Storage - Single backend for all signals
- Examples: OpenSearch, ClickHouse, Apache Druid
- Reduces operational complexity
- May sacrifice specialized optimizations
⚠️ Storage retention strategy:
| Tier | Resolution | Retention | Use Case |
|---|---|---|---|
| Real-time | Raw (1s) | 6-24 hours | Active incident investigation |
| Recent | Downsampled (1m) | 7-30 days | Recent history, debugging |
| Historical | Aggregated (5m) | 90-365 days | Trend analysis, capacity planning |
| Archive | Summary (1h) | 1+ years | Compliance, long-term patterns |
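Downsampling is what makes the lower tiers affordable. A small sketch of temporal rollup, assuming raw samples arrive as (timestamp, value) pairs:

```python
from collections import defaultdict

def downsample(points, bucket_seconds=60):
    """Roll raw (timestamp, value) samples up into per-bucket averages."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return sorted((start, sum(vals) / len(vals)) for start, vals in buckets.items())

raw = [(0, 0.2), (15, 0.9), (61, 0.4), (90, 0.6)]   # 1s-resolution CPU samples
print(downsample(raw, bucket_seconds=60))            # [(0, 0.55), (60, 0.5)]
```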
Query Layer: Making Data Accessible
The query layer sits between storage and users, providing the interface for exploration and alerting.
Query patterns:
1. Time-range queries - "Show me error rate for the last hour"
- Most common pattern in observability
- Optimized by time-indexed storage
2. Aggregations - "What's the P95 latency by endpoint?"
- Group by dimensions (endpoint, region, customer)
- Apply functions (percentile, average, max)
3. Correlations - "Find traces with high latency AND errors"
- Join conditions across signal types
- Critical for root cause analysis
4. Exemplars - "Show me example traces matching this metric spike"
- Jump from aggregated view to raw examples
- Bridges metrics and traces
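As a concrete illustration of the aggregation pattern, here is a rough sketch of computing P95 latency grouped by endpoint over a query window. The input shape is an assumption; a real backend does this inside the storage engine.

```python
from collections import defaultdict
from statistics import quantiles

def p95_by_endpoint(samples):
    """samples: iterable of (endpoint, latency_ms) pairs within the query window."""
    groups = defaultdict(list)
    for endpoint, latency_ms in samples:
        groups[endpoint].append(latency_ms)
    # quantiles(..., n=100) returns 99 cut points; index 94 is the 95th percentile.
    return {ep: quantiles(vals, n=100)[94]
            for ep, vals in groups.items() if len(vals) > 1}

print(p95_by_endpoint([("/users", 120), ("/users", 95), ("/users", 480),
                       ("/orders", 40), ("/orders", 55)]))
```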
+-----------------------------------------------+
|           QUERY ARCHITECTURE LAYERS           |
+-----------------------------------------------+

+-----------------------------------------------+
|      UI / Visualization (Grafana, Custom)     |
+-----------------------+-----------------------+
                        |
                        v
+-----------------------------------------------+
|      Query API (PromQL, LogQL, TraceQL)       |
+-----------------------+-----------------------+
                        |
               +--------+--------+
               v                 v
         +-----------+     +-----------+
         |   Query   |     |   Cache   |
         |   Engine  |<--->|   Layer   |
         +-----+-----+     +-----------+
               |
               v
         +-----------+
         |  Storage  |
         |  Backend  |
         +-----------+
Query optimization techniques:
Materialized views - Pre-compute common queries
- Example: Pre-aggregate metrics by service every minute
- Trade-off: Storage space for query speed
Query result caching - Store recent query results
- Helps with dashboard refreshes
- Invalidate on new data or time window shift
Index strategies - Speed up lookups
- Inverted indexes: Find all logs containing "error"
- Bitmap indexes: Filter by high-cardinality labels
- Time-based partitioning: Skip irrelevant time ranges
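A toy version of query result caching might look like the following; the TTL and eviction behavior are simplified assumptions.

```python
import time

class QueryCache:
    """Tiny TTL cache for dashboard queries; real caches also bound memory use."""

    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self._entries = {}   # query string -> (stored_at, result)

    def get_or_compute(self, query: str, compute):
        now = time.monotonic()
        entry = self._entries.get(query)
        if entry and now - entry[0] < self.ttl:
            return entry[1]                    # fresh enough: serve the cached result
        result = compute()                     # miss or expired: run the real query
        self._entries[query] = (now, result)
        return result

cache = QueryCache(ttl_seconds=30)
cache.get_or_compute("rate(http_requests_total[5m])", lambda: {"value": 42})
```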
Architecture Examples
Example 1: Small-Scale Startup Architecture
Scenario: A startup with 5 microservices, 20 containers, 100K requests/day
Architecture:
+---------------------------------------------+
|                STARTUP STACK                |
+---------------------------------------------+

   Applications (OpenTelemetry SDK)
                  |
                  v
        +-------------------+
        |  OTel Collector   |  (Single instance)
        |  - Receives all   |
        |    signals        |
        |  - Basic filter   |
        +---------+---------+
                  |
           +------+------+
           v             v
   +--------------+  +--------------+
   |  Prometheus  |  |   Grafana    |
   |  (Metrics)   |  |    Cloud     |
   |              |  |   (Logs /    |
   |  + Loki      |  |    Traces)   |
   |    (Logs)    |  |              |
   +--------------+  +--------------+
Key decisions:
- Single collector: Simplifies operations, sufficient for this scale
- Prometheus + Loki: Open-source, runs on same infrastructure
- Grafana Cloud for traces: Avoid running complex trace storage
- Retention: 15 days local metrics/logs, 7 days traces
Cost: ~$200/month (mostly Grafana Cloud)
Trade-offs:
- ✅ Simple to operate
- ✅ Low cost
- ⚠️ Single point of failure (collector)
- ⚠️ Limited scale headroom
Example 2: Mid-Scale SaaS Architecture
Scenario: Growing SaaS with 50 services, 500 containers, 50M requests/day
Architecture:
+--------------------------------------------------+
|              MID-SCALE ARCHITECTURE              |
+--------------------------------------------------+

   Apps (OTel SDK)  +  Infra (node agents)
                  |
                  v
        +------------------+
        |  OTel Collectors |  (DaemonSet on K8s)
        |  - Per-node      |
        |  - Sampling      |
        |  - Enrichment    |
        +--------+---------+
                 |
                 v
        +------------------+
        |  Load Balancer   |
        +--------+---------+
                 |
         +-------+--------+
         |                |
         v                v
   +-----------+    +-----------+
   |  Gateway  |    |  Gateway  |   (Redundant)
   | Collector |    | Collector |
   +-----+-----+    +-----+-----+
         |                |
         +-------+--------+
                 |
         +-------+--------+
         v                v
   +--------------+  +--------------+
   |    M3DB      |  |    Tempo     |
   |  (Metrics)   |  |   (Traces)   |
   |              |  |              |
   | + ClickHouse |  |  + Grafana   |
   |   (Logs)     |  |              |
   +--------------+  +--------------+
Key decisions:
- Two-tier collectors: Edge (DaemonSet) + Gateway (centralized)
- M3DB for metrics: Better scalability than Prometheus
- ClickHouse for logs: Fast analytical queries, cost-effective
- Tempo for traces: Open-source, S3-backed
- Sampling: 100% errors, 10% success cases, 1% of high-throughput endpoints
Data flow:
- Apps emit signals → local DaemonSet collector
- DaemonSet adds K8s metadata, samples
- Gateway collectors aggregate, route by signal type
- Storage backends optimized per signal
Cost: ~$3-5K/month (mostly compute and storage)
Trade-offs:
- ✅ Redundant collection path
- ✅ Scales to 10x current load
- ✅ Cost-optimized storage choices
- ⚠️ More operational complexity
- ⚠️ Multiple systems to maintain
Example 3: Enterprise Multi-Tenant Architecture
Scenario: Platform serving 100+ internal teams, thousands of services, billions of requests/day
Architecture:
+---------------------------------------------------+
|        ENTERPRISE MULTI-TENANT ARCHITECTURE       |
+---------------------------------------------------+

   Apps (SDK) -> Regional Edge Collectors
                       |
                       v
              +----------------+
              |  Kafka Cluster |  (Buffer + Routing)
              |  - Topics by   |
              |    tenant      |
              |  - Topics by   |
              |    signal      |
              +-------+--------+
                      |
         +------------+------------+
         |            |            |
         v            v            v
   +-----------+ +-----------+ +-----------+
   |  Metrics  | |   Logs    | |  Traces   |
   | Processors| | Processors| | Processors|
   | (Stateful)| |(Stateless)| | (Stateful)|
   +-----+-----+ +-----+-----+ +-----+-----+
         |            |            |
         v            v            v
   +-----------+ +-----------+ +-----------+
   |  Cortex   | | OpenSearch| |  Jaeger   |
   | (Multi-   | | (Multi-   | | (Multi-   |
   |  tenant)  | |  tenant)  | |  tenant)  |
   +-----+-----+ +-----+-----+ +-----+-----+
         |            |            |
         +------------+------------+
                      |
                      v
              +---------------+
              | Unified Query |
              |      API      |
              +-------+-------+
                      |
               +------+------+
               |             |
               v             v
         +----------+  +----------+
         |  Team A  |  |  Team B  |
         |  Grafana |  |  Grafana |
         +----------+  +----------+
Key decisions:
Kafka as backbone: Decouples collection from processing
- Replay capability for reprocessing
- Buffer during storage outages
- Multi-consumer pattern (the same data flows to multiple destinations; see the routing sketch after these key decisions)
Multi-tenant storage: Each team isolated
- Query limits prevent noisy neighbors
- Cost attribution per tenant
- Separate retention policies
Processing layer: Stream processors between Kafka and storage
- Stateless: Simple transformations (parsing, filtering)
- Stateful: Aggregations requiring memory (rate calculations, cardinality estimation)
Regional deployment: Data stays in region for compliance
- Cross-region federation for global views
- Local edge collectors reduce latency
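A sketch of the Kafka routing decision referenced above, assuming the kafka-python client; the broker address and topic-naming scheme are hypothetical.

```python
# Assumes kafka-python (pip install kafka-python); broker address and topic
# scheme are hypothetical, not a prescribed convention.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

def publish(tenant: str, signal: str, record: dict) -> None:
    """Route each record to a topic keyed by signal type and tenant."""
    topic = f"otel.{signal}.{tenant}"     # e.g. otel.traces.team-a
    producer.send(topic, value=record)    # consumers subscribe per tenant or per signal

publish("team-a", "traces", {"trace_id": "abc", "duration_ms": 87})
```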
Advanced features:
- Adaptive sampling: ML-based decisions on what to keep
- Anomaly detection: Statistical models flag unusual patterns
- Automatic correlation: Link related signals across services
- Cost controls: Per-tenant budgets with hard limits
Cost: $50K-200K/month (depends on scale)
Trade-offs:
- ✅ Scales to massive volume
- ✅ Strong tenant isolation
- ✅ Flexible routing and processing
- ✅ Survives component failures
- ⚠️ High operational complexity
- ⚠️ Requires dedicated platform team
- ⚠️ Significant infrastructure cost
Example 4: Hybrid Cloud Architecture
Scenario: Regulated industry with on-prem core systems and cloud-native customer-facing services
Architecture:
+---------------------------------------------------+
|             HYBRID CLOUD ARCHITECTURE             |
+---------------------------------------------------+

   ON-PREMISES                        CLOUD (AWS)
 +----------------+               +----------------+
 |                |               |                |
 |   Core Apps    |               |    Services    |
 |       |        |               |       |        |
 |       v        |               |       v        |
 |  +---------+   |               |  +---------+   |
 |  |Collector|   |               |  |Collector|   |
 |  +----+----+   |               |  +----+----+   |
 |       |        |               |       |        |
 |       v        |    Secure     |       v        |
 |  +---------+   |    Tunnel     |  +---------+   |
 |  |  Local  |   |    (VPN /     |  | Managed |   |
 |  | Storage |---+----Direct-----+->| Observ. |   |
 |  |         |   |    Connect)   |  | Platform|   |
 |  +---------+   |               |  +----+----+   |
 |       ^        |               |       |        |
 |       |        |               |       v        |
 |   Replicate    |               | +-----------+  |
 |   (filtered)   |               | | Federated |  |
 |                |               | |   Query   |  |
 +----------------+               | +-----------+  |
                                  |                |
                                  +----------------+
Key decisions:
Data residency: Sensitive signals stay on-prem
- Patient data, financial records remain local
- Aggregated metrics (no PII) replicate to cloud
Federated queries: Unified view across environments
- Query API spans both locations
- Results merged before presentation (see the fan-out sketch after these key decisions)
Selective replication: Filter before sending to cloud
- Strip sensitive fields
- Sample heavily to reduce egress costs
Dual storage: Same signal types in both locations
- Full fidelity on-prem for compliance
- Subset in cloud for correlation with cloud services
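To illustrate the federated query idea, here is a rough fan-out-and-merge sketch; the endpoint URLs and the Prometheus-style response shape are assumptions.

```python
import concurrent.futures
import json
import urllib.parse
import urllib.request

# Hypothetical query endpoints, one per site.
BACKENDS = [
    "https://onprem.observability.internal/api/v1/query",
    "https://cloud.observability.example.com/api/v1/query",
]

def query_backend(base_url: str, promql: str, timeout: float = 10.0) -> list:
    url = f"{base_url}?query={urllib.parse.quote(promql)}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp).get("data", {}).get("result", [])

def federated_query(promql: str) -> list:
    """Fan out to every site in parallel, then merge results before presentation."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(BACKENDS)) as pool:
        parts = pool.map(lambda backend: query_backend(backend, promql), BACKENDS)
        return [series for part in parts for series in part]
```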
Challenges:
- Network partitions between sites
- Latency in federated queries
- Data synchronization complexity
- Egress costs (can be $$$)
Cost: Variable (depends on data volumes and egress)
Common Mistakes
1. Over-Instrumenting Without Sampling Strategy
Mistake: Emitting every single event at full fidelity
Why it's wrong:
- Storage costs explode (can easily hit 6-7 figures/year)
- Query performance degrades with excessive data
- Signal-to-noise ratio drops
Better approach:
- Start with coarse sampling (1-10%)
- Always capture errors and slow requests
- Implement dynamic sampling based on traffic patterns
- Use head-based sampling for known high-volume endpoints
💡 Rule of thumb: If you're storing 100% of success cases for an endpoint handling millions of requests, you're probably over-collecting.
2. Single Point of Failure in Collection
Mistake: All signals flow through one collector instance
Why it's wrong:
- Collector failure = complete observability blackout
- Restart/upgrade requires downtime
- No graceful degradation
Better approach:
- Deploy collectors as redundant sets
- Use client-side failover (try collector A, then B)
- Buffer signals in message queue if collectors unavailable
- Monitor the monitors (meta-observability)
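Client-side failover can be as simple as trying each collector endpoint in order. A sketch using OTLP/HTTP-style endpoints; the hostnames are hypothetical, and a production exporter would also buffer and retry.

```python
import urllib.request

# Hypothetical redundant collector endpoints (OTLP/HTTP-style paths).
COLLECTORS = [
    "http://collector-a:4318/v1/traces",
    "http://collector-b:4318/v1/traces",
]

def export_with_failover(payload: bytes) -> bool:
    """Try collector A, then B; report failure so the caller can buffer/queue."""
    for url in COLLECTORS:
        try:
            request = urllib.request.Request(
                url, data=payload, headers={"Content-Type": "application/json"})
            urllib.request.urlopen(request, timeout=2.0)
            return True
        except OSError:
            continue   # endpoint unreachable or erroring: fall through to the next one
    return False
```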
3. Ignoring Cardinality Explosion
Mistake: Adding high-cardinality labels to metrics (user ID, request ID, email)
Why it's wrong:
- Time-series databases create one series per unique label combination
- Example: 100 metrics × 1M users = 100M series (a quick series-count check follows the table below)
- Queries slow to a crawl, storage exhausted
Better approach:
- Keep metric labels low-cardinality (service, endpoint, status, region)
- Put high-cardinality data in traces or logs instead
- Use exemplars to link metrics to high-cardinality traces
- If you must have high-cardinality metrics, use specialized storage (M3DB, ClickHouse)
| Cardinality | Example Labels | Appropriate? |
|---|---|---|
| Low (✅) | service="api", endpoint="/users", status="200" | Perfect for metrics |
| Medium (⚠️) | + customer_tier="enterprise" (100s of values) | Acceptable with care |
| High (❌) | + user_id="abc123" (millions of values) | Use traces/logs instead |
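The quick calculation promised above: worst-case series count is the number of metrics multiplied by the product of each label's cardinality.

```python
def series_count(metric_count: int, *label_cardinalities: int) -> int:
    """Worst-case active series: metrics x product of unique values per label."""
    total = metric_count
    for cardinality in label_cardinalities:
        total *= cardinality
    return total

print(series_count(100, 1_000_000))   # 100 metrics x a user_id label -> 100,000,000 series
print(series_count(100, 50, 40, 5))   # 50 services x 40 endpoints x 5 statuses -> 1,000,000 series
```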
4. No Retention Strategy
Mistake: Keeping all data at full resolution forever
Why it's wrong:
- Storage costs grow linearly (or worse) over time
- Old high-resolution data rarely queried
- Queries scan unnecessary data
Better approach:
- Implement tiered retention (see storage section)
- Downsample old metrics (1s → 1m → 5m → 1h)
- Archive logs to cheap object storage (S3 Glacier)
- Delete traces older than 30-90 days
- Keep aggregated summaries for long-term trends
5. Mixing Operational and Business Metrics
Mistake: Using same observability system for both system health and business analytics
Why it's wrong:
- Different query patterns (real-time alerts vs. batch analytics)
- Different retention needs (days vs. years)
- Business analytics often requires joins, complex aggregations
- Observability systems optimized for time-series, not analytics
Better approach:
- Observability platform: System health, performance, errors
- Data warehouse: Business metrics, revenue, user behavior
- Stream business events to both if needed
- Use observability for "what's broken?" and analytics for "how's the business?"
6. Not Planning for Query Performance
Mistake: Assuming queries will always be fast
Why it's wrong:
- Dashboard with 20 panels = 20+ queries on every refresh
- Ad-hoc queries during incidents can overwhelm storage
- No query limits = one bad query impacts everyone
Better approach:
- Implement query timeouts and result limits
- Pre-aggregate common queries (materialized views)
- Cache dashboard results (1-5 minute TTL)
- Use query sharding for large time ranges
- Provide query cost estimates before execution
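A minimal sketch of client-side query limits; the timeout and row cap are illustrative, and real query engines also cancel work server-side.

```python
import concurrent.futures

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def run_with_limits(execute_query, timeout_s: float = 30.0, max_rows: int = 10_000):
    """Enforce a client-side timeout and result-size cap around a storage query.
    Note: the worker keeps running after a timeout; real engines cancel server-side."""
    future = _pool.submit(execute_query)
    rows = future.result(timeout=timeout_s)   # raises TimeoutError if the query is too slow
    return rows[:max_rows]                    # truncate oversized results
```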
Key Takeaways
Quick Reference: Observability Architecture Essentials
| Architecture Layer | Key Decisions |
|---|---|
| Collection | Push vs. Pull, Agent deployment model, Instrumentation approach |
| Pipeline | Sampling strategy, Enrichment rules, Routing logic, Buffering capacity |
| Storage | Signal-specific vs. unified, Retention tiers, Hot/warm/cold strategy |
| Query | Caching strategy, Index design, Query limits, Federation approach |
Golden Rules:
1. Sample intelligently - Not everything needs 100% capture
2. Cardinality kills - High-cardinality labels destroy time-series databases
3. Tier your retention - Fresh data hot, old data cold, ancient data gone
4. Build redundancy - Single points of failure = blind during outages
5. Right tool, right job - Metrics for aggregates, logs for details, traces for flows
6. Cost visibility - Track spend per signal type, per team, per service
7. Query performance matters - Slow queries during incidents = worse outcomes
Architecture Decision Framework
When designing your observability architecture, ask:
Scale questions:
- How many services/containers/requests?
- What's the growth trajectory?
- What's the event volume per second?
Reliability questions:
- What's the cost of missing signals during an incident?
- Can you tolerate collection delays?
- What's your RTO for observability platform outages?
Cost questions:
- What's the budget?
- What's the cost per GB ingested/stored?
- Can you optimize with sampling/aggregation?
Team questions:
- How many engineers to operate the platform?
- Build vs. buy vs. managed service?
- What's the existing expertise?
Compliance questions:
- Where must data reside?
- How long must you retain it?
- What PII/PHI concerns exist?
Further Study
Essential Resources:
OpenTelemetry Documentation - https://opentelemetry.io/docs/concepts/observability-primer/ - Industry-standard instrumentation framework with excellent architecture guides
Observability Engineering (Book) - https://www.oreilly.com/library/view/observability-engineering/9781492076438/ - Comprehensive guide to observability principles and practices by Charity Majors, Liz Fong-Jones, and George Miranda
Google SRE Book, Monitoring chapter - https://sre.google/sre-book/monitoring-distributed-systems/ - Foundational concepts from Google's SRE practice, including the "Four Golden Signals"
Next Steps: Now that you understand observability architecture, explore specific implementation patterns: distributed tracing setup, log aggregation strategies, and metric collection best practices. Practice by designing architectures for different scenarios: startup MVP, scaling growth phase, enterprise compliance needs.
Your observability architecture shapes how quickly your team can diagnose and resolve production issues. Invest time in getting it right early, but remain flexible; the best architecture evolves with your system's needs.