Collectors and Pipelines
Design telemetry collection architecture with receivers, processors, and exporters
OpenTelemetry Collectors and Pipelines
Master OpenTelemetry collectors and pipelines with free flashcards and spaced repetition practice. This lesson covers collector architecture, pipeline components, and processing patterns: essential concepts for building production-grade observability systems in modern distributed environments.
Welcome
Understanding OpenTelemetry Collectors and their pipeline architecture is fundamental to implementing effective observability. While instrumentation generates telemetry signals (traces, metrics, logs), collectors act as the intelligent middleware that receives, processes, and exports this data to your backend systems. Think of collectors as the postal service of observability: they handle routing, transformation, batching, and delivery of your telemetry data.
This lesson demystifies how collectors work internally, why pipelines matter, and how to configure them for real-world production scenarios. Whether you're running a microservices architecture or monitoring a monolithic application, collectors provide the flexibility and scalability you need.
Core Concepts
What is an OpenTelemetry Collector?
The OpenTelemetry Collector is a vendor-agnostic proxy that receives, processes, and exports telemetry data. It's a standalone binary that runs as a sidecar, daemon, or centralized service in your infrastructure.
Key characteristics:
- Vendor-neutral: Works with any backend (Prometheus, Jaeger, Datadog, New Relic, etc.)
- Language-agnostic: Accepts data from applications in any programming language
- Configurable: Uses YAML configuration for complete pipeline customization
- Extensible: Supports custom receivers, processors, and exporters via plugins
- High-performance: Written in Go, handles millions of spans per second
Pro tip: Start with the OpenTelemetry Collector Contrib distribution, which includes 100+ components. The core distribution contains only basic components.
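To try it quickly, you can run the Contrib image with Docker; the image name below is the publicly published otel/opentelemetry-collector-contrib image, and the config mount path is an arbitrary choice for this sketch.

# Run the Contrib collector with a local config.yaml, exposing the OTLP ports.
docker run --rm \
  -v "$(pwd)/config.yaml:/etc/otelcol/config.yaml" \
  -p 4317:4317 -p 4318:4318 \
  otel/opentelemetry-collector-contrib:latest \
  --config=/etc/otelcol/config.yaml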
Deployment Patterns
Collectors can be deployed in three primary patterns:
| Pattern | Description | Use Case | Trade-offs |
|---|---|---|---|
| Agent Mode | Deployed on each host/node (DaemonSet in Kubernetes) | Local collection from apps on the same host | Pros: low latency. Cons: higher resource usage per node |
| Gateway Mode | Centralized collector receiving from multiple agents | Data aggregation, enrichment, sampling | Pros: centralized processing. Cons: single point of failure (mitigate with load balancing) |
| Sidecar Mode | Container deployed alongside the application container | Per-service isolation, custom processing | Pros: service-specific config. Cons: highest resource overhead |
Real-world analogy: Agent mode is like neighborhood post offices, gateway mode is the regional distribution center, and sidecar mode is a personal assistant handling your mail.
DEPLOYMENT ARCHITECTURE

  APPLICATION LAYER
  ┌────────┐   ┌────────┐   ┌────────┐
  │ App A  │   │ App B  │   │ App C  │
  └───┬────┘   └───┬────┘   └───┬────┘
      │ OTLP       │ OTLP       │ OTLP
  COLLECTOR AGENT LAYER
  ┌───▼────┐   ┌───▼────┐   ┌───▼────┐
  │Agent 1 │   │Agent 2 │   │Agent 3 │
  └───┬────┘   └───┬────┘   └───┬────┘
      └────────────┼────────────┘
                   │
      ┌────────────▼────────────┐
      │    COLLECTOR GATEWAY    │
      │     (Load Balanced)     │
      └────────────┬────────────┘
                   │
      ┌────────────▼────────────┐
      │     BACKEND SYSTEMS     │
      │   Jaeger | Prometheus   │
      │   Loki   | Datadog      │
      └─────────────────────────┘
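To make agent mode concrete, here is a minimal DaemonSet sketch that runs one collector per node and mounts its pipeline configuration from a ConfigMap. All names (otel-agent, observability, otel-agent-config) are placeholders, and a real deployment also needs RBAC and resource limits.

# Agent-mode sketch: one collector per node, config mounted from a ConfigMap.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-agent
  namespace: observability
spec:
  selector:
    matchLabels:
      app: otel-agent
  template:
    metadata:
      labels:
        app: otel-agent
    spec:
      containers:
        - name: otel-agent
          image: otel/opentelemetry-collector-contrib:latest
          args: ["--config=/conf/config.yaml"]
          ports:
            - containerPort: 4317   # OTLP gRPC from apps on this node
            - containerPort: 4318   # OTLP HTTP
          volumeMounts:
            - name: config
              mountPath: /conf
      volumes:
        - name: config
          configMap:
            name: otel-agent-config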
Pipeline Architecture
A pipeline is the core abstraction in the collector. A collector configuration can define pipelines for three signal types:
- Traces pipeline: Processes distributed trace spans
- Metrics pipeline: Handles time-series measurements
- Logs pipeline: Manages structured log records
Each pipeline consists of three component types:
PIPELINE FLOW

 ┌────────────┐       ┌────────────┐       ┌────────────┐
 │ RECEIVERS  │ ────► │ PROCESSORS │ ────► │ EXPORTERS  │
 └────────────┘       └────────────┘       └────────────┘
       │                    │                    │
     Input              Transform              Output
 (OTLP, Jaeger,      (batch, filter,       (Jaeger, Prom,
  Prometheus)         sample, enrich)       OTLP, files)
1. Receivers
Receivers are the entry points for telemetry data. They listen on specific protocols and ports.
Common receivers:
- otlp: Native OpenTelemetry protocol (gRPC or HTTP)
- jaeger: Jaeger Thrift format
- zipkin: Zipkin JSON v1/v2
- prometheus: Scrapes Prometheus metrics
- hostmetrics: Collects system metrics (CPU, memory, disk)
- filelog: Reads logs from files
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
prometheus:
config:
scrape_configs:
- job_name: 'my-app'
scrape_interval: 30s
static_configs:
- targets: ['localhost:8080']
Important: Receivers are push-based (OTLP, Jaeger) or pull-based (Prometheus scraping). Choose based on your application's export method.
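The snippet above shows the push-based OTLP receiver and the pull-based Prometheus receiver. For completeness, here is a sketch of the filelog and hostmetrics receivers from the list; the log path and interval are illustrative.

receivers:
  filelog:
    include: [/var/log/myapp/*.log]   # assumed application log location
    start_at: beginning               # also read content that existed before startup
  hostmetrics:
    collection_interval: 60s
    scrapers:
      cpu:
      memory:
      disk: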
2. Processors
Processors transform, filter, or enrich data as it flows through the pipeline. They run in sequence.
Essential processors:
| Processor | Purpose | Example Use Case |
|---|---|---|
| batch | Groups telemetry before export | Reduce network calls (export every 10s or 8192 spans) |
| memory_limiter | Prevents OOM by applying backpressure | Limit collector to 512MB memory usage |
| resource | Adds/modifies resource attributes | Add environment=production, cluster=us-west |
| attributes | Manipulates span/metric attributes | Remove PII, add derived fields |
| filter | Drops unwanted telemetry | Exclude health check spans |
| probabilistic_sampler | Samples a percentage of traces | Keep only 10% of traces to reduce volume |
| tail_sampling | Smart sampling based on span data | Keep all error traces, sample successful ones |
processors:
batch:
timeout: 10s
send_batch_size: 1024
memory_limiter:
check_interval: 1s
limit_mib: 512
spike_limit_mib: 128
resource:
attributes:
- key: environment
value: production
action: upsert
- key: cluster
value: us-west-2
action: insert
filter:
traces:
span:
- 'attributes["http.target"] == "/health"'
Critical ordering: Place memory_limiter first and batch last:
processors: [memory_limiter, filter, resource, batch]
This ensures memory protection happens before processing, and batching happens right before export.
3. Exporters
Exporters send processed telemetry to backend systems. Multiple exporters can run in parallel.
Popular exporters:
- otlp: Send to any OTLP-compatible backend
- otlphttp: OTLP over HTTP (better for proxies/firewalls)
- jaeger: Export to a Jaeger backend
- prometheus: Expose a metrics endpoint for Prometheus to scrape
- prometheusremotewrite: Push metrics to a Prometheus remote-write endpoint
- logging: Debug exporter (prints to the console)
- file: Write to local files (useful for replay/debugging)
exporters:
otlp:
endpoint: jaeger:4317
tls:
insecure: false
cert_file: /certs/client.crt
key_file: /certs/client.key
prometheus:
endpoint: "0.0.0.0:8889"
namespace: "my_app"
logging:
loglevel: debug
sampling_initial: 5
sampling_thereafter: 200
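Two more exporters from the list, otlphttp and file, look like this; the endpoint URL and file path are assumptions for illustration:

exporters:
  otlphttp:
    endpoint: https://otel-gateway.example.com:4318   # hypothetical HTTPS endpoint behind a proxy
  file:
    path: /var/lib/otelcol/telemetry.json             # local file for replay or debugging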
Complete Pipeline Configuration
Here's how receivers, processors, and exporters connect into pipelines:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
processors:
memory_limiter:
limit_mib: 512
batch:
timeout: 10s
send_batch_size: 1024
resource:
attributes:
- key: service.namespace
value: production
action: insert
exporters:
otlp:
endpoint: tempo:4317
tls:
insecure: true
prometheus:
endpoint: 0.0.0.0:8889
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, resource, batch]
exporters: [otlp]
metrics:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [prometheus]
Key insight: The service section wires everything together. Components are referenced by name from their respective sections.
CONFIGURATION STRUCTURE

┌──────────────────────────────────────┐
│ receivers:  {...}                    │   Define components
│ processors: {...}                    │   (implementation)
│ exporters:  {...}                    │
└──────────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────┐
│ service:                             │   Wire components
│   pipelines:                         │   into pipelines
│     traces:                          │   (configuration)
│       receivers:  [otlp, jaeger]     │
│       processors: [batch, resource]  │
│       exporters:  [otlp]             │
└──────────────────────────────────────┘
Examples
Example 1: Simple Local Development Setup
Scenario: You're developing a microservice locally and want to send traces to Jaeger running in Docker.
receivers:
otlp:
protocols:
grpc:
endpoint: localhost:4317
http:
endpoint: localhost:4318
processors:
batch:
timeout: 1s
exporters:
logging:
loglevel: debug
otlp:
endpoint: localhost:14317  # Jaeger's OTLP gRPC, published on a non-default host port (see the Compose sketch below)
tls:
insecure: true
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch]
exporters: [logging, otlp]
Why this works:
- Accepts OTLP on both gRPC (4317) and HTTP (4318) for flexibility
- Minimal batching (1s) for quick feedback during development
- The logging exporter prints spans to the console for immediate debugging
- Sends to Jaeger's OTLP endpoint on host port 14317 (the collector itself already occupies 4317)
- No memory limiter needed (low volume)
Dev tip: Keep the logging exporter during development; it's invaluable for debugging instrumentation issues.
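For the "Jaeger running in Docker" part of the scenario, a Compose sketch like the following pairs with the config above. Jaeger's OTLP gRPC port is published on host port 14317 so it doesn't collide with the collector's own 4317; the image tag and the COLLECTOR_OTLP_ENABLED flag are assumptions to verify against your Jaeger version.

# docker-compose.yaml: Jaeger only; the collector runs on the host with the config above.
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      COLLECTOR_OTLP_ENABLED: "true"   # required on some older all-in-one versions
    ports:
      - "16686:16686"   # Jaeger UI
      - "14317:4317"    # Jaeger's OTLP gRPC, remapped to avoid the collector's 4317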
Example 2: Production Gateway with Sampling
Scenario: High-traffic production system generating millions of spans. You need cost-effective sampling while keeping all error traces.
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
processors:
memory_limiter:
check_interval: 1s
limit_mib: 2048
spike_limit_mib: 512
tail_sampling:
decision_wait: 10s
num_traces: 100000
expected_new_traces_per_sec: 1000
policies:
# Keep all error traces
- name: error-traces
type: status_code
status_code:
status_codes: [ERROR]
# Keep all slow traces (>2s)
- name: slow-traces
type: latency
latency:
threshold_ms: 2000
# Sample 5% of successful traces
- name: probabilistic-policy
type: probabilistic
probabilistic:
sampling_percentage: 5
batch:
timeout: 10s
send_batch_size: 8192
resource:
attributes:
- key: deployment.environment
value: production
action: insert
exporters:
otlp:
endpoint: tempo-gateway:4317
tls:
insecure: false
cert_file: /etc/collector/certs/client.crt
key_file: /etc/collector/certs/client.key
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, tail_sampling, resource, batch]
exporters: [otlp]
telemetry:
logs:
level: info
metrics:
level: detailed
address: 0.0.0.0:8888
Key decisions explained:
Tail sampling (not probabilistic): Makes decisions after seeing the entire trace
- Keeps 100% of errors and slow requests
- Samples only 5% of fast successful requests
- Waits 10s to collect all spans of a trace before deciding
Memory protection: 2GB limit prevents OOM during traffic spikes
Large batches: 8192 spans per batch reduces network overhead at high volume
TLS: Production requires encrypted communication
Collector telemetry: Exposes metrics on :8888 for monitoring the collector itself
Production warning: Tail sampling requires significant memory (it stores traces while waiting). Size num_traces and decision_wait based on your trace rate.
Example 3: Multi-Backend Fan-Out
Scenario: You want to send traces to both Jaeger (for developers) and a commercial APM vendor (for operations), while keeping metrics in Prometheus.
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
hostmetrics:
collection_interval: 30s
scrapers:
cpu:
memory:
disk:
network:
processors:
memory_limiter:
limit_mib: 1024
batch:
timeout: 10s
# Filter PII before sending to commercial vendor
attributes/strip-pii:
actions:
- key: user.email
action: delete
- key: user.phone
action: delete
- key: credit_card
action: delete
resource:
attributes:
- key: environment
value: staging
action: insert
exporters:
# Internal Jaeger (full data)
otlp/jaeger:
endpoint: jaeger:4317
tls:
insecure: true
# Commercial vendor (PII stripped)
otlp/vendor:
endpoint: vendor-endpoint:443
headers:
api-key: ${VENDOR_API_KEY}
prometheusremotewrite:
endpoint: http://prometheus:9090/api/v1/write
tls:
insecure: true
service:
pipelines:
# Traces to Jaeger (internal, full data)
traces/internal:
receivers: [otlp]
processors: [memory_limiter, resource, batch]
exporters: [otlp/jaeger]
# Traces to vendor (PII stripped)
traces/vendor:
receivers: [otlp]
processors: [memory_limiter, attributes/strip-pii, resource, batch]
exporters: [otlp/vendor]
# Metrics to Prometheus
metrics:
receivers: [otlp, hostmetrics]
processors: [memory_limiter, batch]
exporters: [prometheusremotewrite]
Architecture insights:
- Multiple pipelines: Same signal type (traces) can have multiple pipelines with different processing
- Named exporters: A suffix like /jaeger or /vendor creates a distinct exporter instance
- Differential processing: PII stripping happens only in the vendor pipeline
- Environment variables: Use ${VENDOR_API_KEY} for secrets (pass via env vars, never hardcode)
- Host metrics: The collector monitors itself and the host it runs on
Real-world use: This pattern is common in enterprises with compliance requirements; keep full data internal, sanitize for external vendors.
Example 4: Kubernetes DaemonSet with Service Discovery
Scenario: Collector agents running on every Kubernetes node, automatically discovering and scraping Prometheus metrics from pods.
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
prometheus:
config:
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
namespaces:
names: ['production', 'staging']
relabel_configs:
# Only scrape pods with prometheus.io/scrape=true annotation
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
# Use port from prometheus.io/port annotation
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
# Add pod labels as metric labels
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
processors:
memory_limiter:
limit_mib: 512
batch:
timeout: 10s
k8sattributes:
auth_type: serviceAccount
passthrough: false
extract:
metadata:
- k8s.namespace.name
- k8s.deployment.name
- k8s.pod.name
- k8s.node.name
labels:
- tag_name: app.name
key: app
from: pod
exporters:
otlp:
endpoint: collector-gateway.observability:4317
tls:
insecure: true
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, k8sattributes, batch]
exporters: [otlp]
metrics:
receivers: [prometheus]
processors: [memory_limiter, k8sattributes, batch]
exporters: [otlp]
Kubernetes-specific features:
- Service discovery: Automatically finds pods with the prometheus.io/scrape: "true" annotation
- k8sattributes processor: Enriches telemetry with Kubernetes metadata (namespace, pod name, labels)
- RBAC: Requires ServiceAccount with permissions to list/watch pods and nodes
- DaemonSet deployment: One collector per node ensures local collection
- Gateway forwarding: Agents send to central gateway for aggregation
Deployment tip: Use the community Helm charts or the OpenTelemetry Operator for production deployment; they handle RBAC, ConfigMaps, and upgrades automatically.
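As a rough sketch, installing an agent-mode collector with the community Helm chart looks like this; the repo URL and chart name come from the open-telemetry Helm charts project, but the release name, namespace, and value keys (mode, image.repository) should be verified against the chart's current documentation.

# Install the collector chart in daemonset (agent) mode; names are illustrative.
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install otel-agent open-telemetry/opentelemetry-collector \
  --namespace observability --create-namespace \
  --set mode=daemonset \
  --set image.repository=otel/opentelemetry-collector-contrib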
Common Mistakes
Mistake 1: Not Using memory_limiter
Problem: Collector crashes with OOM under traffic spikes.
Why it happens: Without backpressure, collector accepts unlimited data, overwhelming memory.
Solution: Always configure memory_limiter as the first processor:
processors:
memory_limiter:
check_interval: 1s
limit_mib: 512 # 80% of container limit
spike_limit_mib: 128 # 20% buffer for spikes
Set limit_mib to about 80% of your container's memory limit, leaving headroom for Go's garbage collector.
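For example, pairing the limiter above with a Kubernetes memory limit could look like this; the 640Mi figure is illustrative, the point being that limit_mib is roughly 80% of the container limit.

# Collector container resources paired with memory_limiter (limit_mib: 512, spike_limit_mib: 128).
resources:
  requests:
    memory: 640Mi
  limits:
    memory: 640Mi   # 512 MiB limit plus 128 MiB spike buffer fits under this ceiling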
Mistake 2: Batching Too Aggressively
Problem: High latency or lost data during collector restarts.
Symptoms:
- Traces appear 30+ seconds after generation
- Collector restart loses thousands of spans
Why it happens: Oversized batches (e.g., timeout: 60s, send_batch_size: 100000) hold data too long.
Solution: Use reasonable batch settings:
processors:
batch:
timeout: 10s # Export at least every 10s
send_batch_size: 8192 # Or when 8192 items collected
send_batch_max_size: 10000 # Hard limit
Rule of thumb: timeout should be 5-10 seconds in production and 1-2 seconds in development.
Mistake 3: Wrong Processor Order
Problem: Processors don't work as expected, or memory protection fails.
Bad order:
processors: [batch, filter, memory_limiter] # WRONG!
Why it's wrong:
- Batching happens before filtering (wastes memory on unwanted data)
- Memory limiter runs last (data already consumed memory)
Correct order:
processors: [memory_limiter, filter, resource, attributes, batch] # RIGHT!
Best practice ordering:
1. memory_limiter (protect first)
2. Filters (remove unwanted data early)
3. Enrichment processors (resource, attributes, k8sattributes)
4. Sampling (tail_sampling, probabilistic_sampler)
5. batch (batch last before export)
Mistake 4: Ignoring Collector Self-Monitoring
Problem: Collector silently drops data, no visibility into why.
Solution: Enable collector telemetry and monitor these metrics:
service:
telemetry:
logs:
level: info
metrics:
level: detailed
address: 0.0.0.0:8888
Key metrics to alert on:
- otelcol_receiver_refused_spans: Backpressure from the memory limiter
- otelcol_exporter_send_failed_spans: Export failures
- otelcol_processor_dropped_spans: Sampling or filtering drops
- otelcol_process_memory_rss: Collector memory usage
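A sketch of Prometheus alerting rules on two of these metrics might look like the following; exact metric names can differ slightly between collector versions (some builds append a _total suffix), so confirm them against your collector's /metrics endpoint.

groups:
  - name: otel-collector
    rules:
      - alert: OtelCollectorExportFailures
        expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Collector is failing to export spans"
      - alert: OtelCollectorRefusedSpans
        expr: rate(otelcol_receiver_refused_spans[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Collector is refusing spans (memory limiter backpressure)"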
Mistake 5: Using Tail Sampling Without Understanding Its Cost
Problem: Collector uses 10GB+ memory, crashes randomly.
Why it happens: Tail sampling stores entire traces in memory while waiting for decision_wait period.
Cost calculation:
Memory needed = avg_trace_size × traces_per_second × decision_wait
Example: 50 KB × 1000 traces/sec × 10 s = 500 MB minimum
Solution: Choose one of the following:
- Use head-based sampling (probabilistic_sampler) if you don't need smart decisions
- Size tail sampling carefully:
  tail_sampling:
    decision_wait: 5s    # Shorter wait
    num_traces: 50000    # Fewer buffered traces
- Deploy tail sampling only in gateway collectors (not agents)
Mistake 6: Hardcoding Secrets
Problem: API keys visible in configuration files, committed to Git.
Bad:
exporters:
otlp:
headers:
api-key: "sk_live_abc123..." # NEVER DO THIS!
Good:
exporters:
otlp:
headers:
api-key: ${VENDOR_API_KEY} # Read from environment
Then pass via environment variable:
export VENDOR_API_KEY="sk_live_abc123..."
./otelcol --config=config.yaml
Security tip: Use Kubernetes Secrets, AWS Secrets Manager, or HashiCorp Vault for production secrets.
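In Kubernetes, the same idea is a Secret plus an env entry on the collector container; the names below (vendor-credentials, VENDOR_API_KEY) are placeholders.

# Secret created out-of-band (never committed to Git).
apiVersion: v1
kind: Secret
metadata:
  name: vendor-credentials
stringData:
  VENDOR_API_KEY: "<paste-real-key-here>"
---
# Collector container spec fragment: expose the key as an environment variable.
env:
  - name: VENDOR_API_KEY
    valueFrom:
      secretKeyRef:
        name: vendor-credentials
        key: VENDOR_API_KEY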
Key Takeaways
Quick Reference Card: Collectors and Pipelines
| Concept | Key Points |
|---|---|
| Collector Role | Vendor-agnostic proxy: receives, processes, exports telemetry |
| Deployment Patterns | Agent (per-host), Gateway (centralized), Sidecar (per-service) |
| Pipeline Types | Traces, Metrics, Logs; each has receivers → processors → exporters |
| Receivers | Entry points: otlp, jaeger, prometheus, hostmetrics, filelog |
| Processors | Transform data: batch, memory_limiter, filter, resource, tail_sampling |
| Exporters | Send to backends: otlp, prometheus, jaeger, logging, file |
| Processor Order | memory_limiter → filters → enrichment → sampling → batch |
| Production Essentials | 1. memory_limiter (prevent OOM) 2. Batch reasonably (10s timeout) 3. Enable telemetry (:8888) 4. Use env vars for secrets |
| Sampling Strategies | Head-based (probabilistic): simple, low memory. Tail-based: smart (keep errors), high memory |
| Multi-Backend | Use multiple pipelines with different processors per destination |
Mental Model: Think of collectors as intelligent routers with a three-stage pipeline:
- Receive (accept from multiple protocols)
- Process (filter, enrich, sample, batch)
- Export (deliver to one or more backends)
Remember: Start simple (receiver → batch → exporter), then add processors as needs emerge. Premature optimization leads to complex, brittle configurations.
Further Study
OpenTelemetry Collector Official Docs: https://opentelemetry.io/docs/collector/ - Comprehensive reference for all components and configuration options
OpenTelemetry Collector Contrib Repository: https://github.com/open-telemetry/opentelemetry-collector-contrib - Source code and documentation for 100+ community-contributed receivers, processors, and exporters
Collector Performance Tuning Guide: https://opentelemetry.io/docs/collector/performance/ - Best practices for scaling collectors to millions of spans per second in production environments