
Observability at Scale

CloudWatch, X-Ray, OpenTelemetry, distributed tracing, and log architecture

Observability at Scale on AWS

Master observability at scale with free flashcards and comprehensive examples. This lesson covers distributed tracing architectures, metric aggregation patterns, log centralization strategies, and cost optimization techniques: essential concepts for building resilient, observable systems on AWS that handle production workloads efficiently.

Welcome to Observability at Scale 🔍

Building observable systems is challenging enough at small scale, but when you're managing hundreds of microservices, thousands of servers, or millions of requests per minute, observability becomes a fundamentally different problem. At scale, naive approaches quickly become unworkable: you can't manually check dashboards, logs become impossibly large, and costs spiral out of control.

Observability at scale requires intelligent sampling strategies, efficient aggregation, distributed correlation, and careful cost management. AWS provides powerful tools like CloudWatch, X-Ray, OpenSearch, and Managed Grafana, but using them effectively at scale demands architectural discipline and operational expertise.

In this lesson, you'll learn how to design observability systems that remain useful and affordable even as your infrastructure grows exponentially. We'll explore real-world patterns for handling billions of data points while maintaining the insights you need to debug production issues quickly.

Core Concepts of Observability at Scale 📊

The Three Pillars: Metrics, Logs, and Traces

Metrics are numerical measurements taken over time intervals. They answer "what" questions:

  • Request rate, error rate, latency percentiles (RED metrics)
  • CPU, memory, disk usage (system metrics)
  • Business metrics like orders/minute, revenue/hour

Logs are structured or unstructured event records. They answer "when" and "why" questions:

  • Application events with timestamps and context
  • Error messages, stack traces, debug information
  • Audit trails and security events

Traces show request paths through distributed systems. They answer "where" and "how" questions:

  • End-to-end latency breakdown across services
  • Service dependency maps
  • Bottleneck identification

💡 At scale, you cannot collect everything. A system processing 100,000 requests/second generates massive data volumes. A full trace of every request would cost millions monthly in storage and transfer fees.

Sampling Strategies for Scale 📉

Head-based sampling decides at the request start whether to trace:

import random
from aws_xray_sdk.core import xray_recorder

# Sample 1% of requests
SAMPLE_RATE = 0.01

def should_sample():
    return random.random() < SAMPLE_RATE

if should_sample():
    segment = xray_recorder.begin_segment('api-request')
    # Trace this request
else:
    # Skip tracing
    pass

Tail-based sampling decides after request completion, keeping only interesting traces:

import random
from aws_xray_sdk.core import xray_recorder

def complete_trace(segment):
    # Keep all errors
    if segment.error or segment.fault:
        xray_recorder.end_segment()
        return
    
    # Keep slow requests (p99)
    if segment.duration > 2.0:
        xray_recorder.end_segment()
        return
    
    # Sample 0.1% of normal requests
    if random.random() < 0.001:
        xray_recorder.end_segment()
    else:
        # Discard trace
        segment.sampled = False
        xray_recorder.end_segment()

Adaptive sampling adjusts rates based on traffic patterns and error rates. AWS X-Ray supports this through sampling rules:

Rule Priority  Service       HTTP Method  Fixed Rate  Reservoir
1              checkout-api  POST         100%        N/A
2              *             *            5%          1 req/sec
3              health-check  GET          0%          0
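
These rules can also be created through the API; a minimal sketch of defining rule 2 (the 5% catch-all) with boto3, with an illustrative rule name:

import boto3

xray = boto3.client('xray')

# Catch-all rule: reservoir of 1 traced request/second, then 5% of the rest
xray.create_sampling_rule(
    SamplingRule={
        'RuleName': 'default-5-percent',  # illustrative name
        'ResourceARN': '*',
        'Priority': 2,
        'FixedRate': 0.05,       # 5% once the reservoir is consumed
        'ReservoirSize': 1,      # 1 traced request per second guaranteed
        'ServiceName': '*',
        'ServiceType': '*',
        'Host': '*',
        'HTTPMethod': '*',
        'URLPath': '*',
        'Version': 1
    }
)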

⚠️ Common mistake: Using head-based sampling only. You'll miss rare but critical errors that only appear under specific conditions.

Metric Aggregation Architectures 📈

At scale, individual data points become less important than aggregates. CloudWatch allows you to emit high-resolution metrics (1-second granularity) but stores them as aggregates:

import boto3
from datetime import datetime

cloudwatch = boto3.client('cloudwatch')

# Publishing individual metrics (expensive at scale)
def publish_individual_metrics(values):
    for value in values:
        cloudwatch.put_metric_data(
            Namespace='MyApp',
            MetricData=[{
                'MetricName': 'ResponseTime',
                'Value': value,
                'Timestamp': datetime.utcnow(),
                'Unit': 'Milliseconds'
            }]
        )

This approach has problems:

  • API call per metric (throttling at high volume)
  • High cost: $0.01 per 1,000 metrics beyond free tier
  • Network overhead

Better approach: Use StatisticSets for local aggregation:

from collections import defaultdict
import time

class MetricAggregator:
    def __init__(self, flush_interval=60):
        self.metrics = defaultdict(list)
        self.flush_interval = flush_interval
        self.last_flush = time.time()
    
    def record(self, metric_name, value):
        self.metrics[metric_name].append(value)
        
        if time.time() - self.last_flush > self.flush_interval:
            self.flush()
    
    def flush(self):
        for metric_name, values in self.metrics.items():
            cloudwatch.put_metric_data(
                Namespace='MyApp',
                MetricData=[{
                    'MetricName': metric_name,
                    'StatisticValues': {
                        'SampleCount': len(values),
                        'Sum': sum(values),
                        'Minimum': min(values),
                        'Maximum': max(values)
                    },
                    'Timestamp': datetime.utcnow(),
                    'Unit': 'Milliseconds'
                }]
            )
        
        self.metrics.clear()
        self.last_flush = time.time()

# Usage: aggregate locally, flush periodically
aggregator = MetricAggregator(flush_interval=60)

for response_time in response_times:
    aggregator.record('ResponseTime', response_time)

This reduces API calls by ~1000x and cuts costs proportionally.

Log Aggregation and Indexing 📝

The log volume problem: A microservice generating 100 log lines per request, at 1,000 req/sec, produces 100,000 log lines/sec = 8.64 billion lines/day.

At $0.50/GB ingested into CloudWatch Logs, assuming 500 bytes per line:

  • 100,000 lines/sec × 500 bytes = 50 MB/sec
  • 50 MB/sec × 86,400 sec/day = 4.32 TB/day
  • 4.32 TB × $0.50/GB = $2,160/day = $64,800/month for one service!
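
A few lines of arithmetic are enough to sanity-check figures like these for your own services; a small sketch using the ingestion price quoted above (the function name is illustrative):

def monthly_log_ingest_cost(lines_per_sec, bytes_per_line, price_per_gb=0.50):
    """Rough CloudWatch Logs ingestion cost estimate for a 30-day month."""
    gb_per_day = lines_per_sec * bytes_per_line * 86_400 / 1e9
    return gb_per_day * price_per_gb * 30

# Reproduces the example above: ~$64,800/month
print(f"${monthly_log_ingest_cost(100_000, 500):,.0f}")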

Cost optimization strategies:

1. Structured logging with filtering:

import json
import logging
from pythonjsonlogger import jsonlogger

# Structured JSON logs enable efficient querying
logger = logging.getLogger()
loghandler = logging.StreamHandler()
formatter = jsonlogger.JsonFormatter()
loghandler.setFormatter(formatter)
logger.addHandler(loghandler)

def process_request(request_id, user_id, action):
    # Log with structured fields
    logger.info(
        "Request processed",
        extra={
            "request_id": request_id,
            "user_id": user_id,
            "action": action,
            "duration_ms": 245,
            "status": "success"
        }
    )

2. Log level management:

import os

# Production: only warnings and errors
LOG_LEVEL = os.environ.get('LOG_LEVEL', 'WARNING')
logger.setLevel(LOG_LEVEL)

# DEBUG logs only for specific high-value requests
if request.headers.get('X-Debug-Token') == SECRET_DEBUG_TOKEN:
    logger.setLevel('DEBUG')

3. Subscription filters for critical logs only:

Don't send all logs to OpenSearch. Use CloudWatch Logs subscription filters to forward only errors:

{
  "filterPattern": "{ $.level = \"ERROR\" || $.status_code >= 500 }",
  "destinationArn": "arn:aws:lambda:us-east-1:123456789012:function:LogProcessor"
}
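
The same filter can be attached with boto3; a sketch assuming the log group from the earlier examples and that the destination Lambda already permits invocation by CloudWatch Logs:

import boto3

logs = boto3.client('logs')

# Forward only error-level events (or 5xx responses) to the processing Lambda
logs.put_subscription_filter(
    logGroupName='/aws/lambda/high-volume-api',
    filterName='errors-to-opensearch',
    filterPattern='{ $.level = "ERROR" || $.status_code >= 500 }',
    destinationArn='arn:aws:lambda:us-east-1:123456789012:function:LogProcessor'
)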

4. Log sampling:

import random

class SampledLogger:
    def __init__(self, logger, sample_rate=0.1):
        self.logger = logger
        self.sample_rate = sample_rate
    
    def info(self, message, **kwargs):
        # Sample routine info logs
        if random.random() < self.sample_rate:
            self.logger.info(message, extra=kwargs)
    
    def error(self, message, **kwargs):
        # Always log errors in full
        self.logger.error(message, extra=kwargs)

# Log only 10% of successful requests; errors are never dropped
sampled_logger = SampledLogger(logger, sample_rate=0.1)
sampled_logger.info("Request completed", status="success")

Distributed Tracing Architecture 🔗

In microservices, a single user request might touch 10+ services. Distributed tracing tracks requests across service boundaries.

AWS X-Ray tracing flow:

┌──────────────────────────────────────────────────┐
│              Distributed Trace Flow              │
└──────────────────────────────────────────────────┘

  Client Request
       │
       ↓
  ┌─────────┐    Trace ID: 1-67890-abc123
  │ API GW  │    Segment ID: seg-001
  └────┬────┘    Start: T+0ms
       │
       ├───────→ ┌──────────┐
       │         │ Lambda 1 │  Subsegment: sub-001
       │         │ (Auth)   │  Parent: seg-001
       │         └────┬─────┘  Duration: 50ms
       │              │
       │              ↓
       ├───────→ ┌──────────┐
       │         │ Lambda 2 │  Subsegment: sub-002
       │         │ (Logic)  │  Parent: seg-001
       │         └────┬─────┘  Duration: 200ms
       │              │
       │              ├───→ ┌─────────┐
       │              │     │DynamoDB │ Subsegment: sub-003
       │              │     └─────────┘ Duration: 45ms
       │              │
       │              └───→ ┌─────────┐
       │                    │   SQS   │ Subsegment: sub-004
       │                    └─────────┘ Duration: 15ms
       │
       ↓
  Response (Total: 250ms)

Implementing X-Ray in Python:

from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.core import patch_all
import boto3

# Automatically instrument AWS SDK calls
patch_all()

def lambda_handler(event, context):
    # Lambda automatically creates segment
    # Add custom subsegments for business logic
    
    with xray_recorder.in_subsegment('validate_input') as subsegment:
        user_id = event['user_id']
        subsegment.put_annotation('user_id', user_id)
        subsegment.put_metadata('payload', event)
        validate(user_id)
    
    with xray_recorder.in_subsegment('database_query'):
        dynamodb = boto3.resource('dynamodb')
        table = dynamodb.Table('Users')
        response = table.get_item(Key={'user_id': user_id})
    
    return {'statusCode': 200, 'body': response}

Key X-Ray concepts:

  • Trace ID: Unique identifier for entire request (propagated via HTTP headers)
  • Segment: Work done by a single service (e.g., Lambda function execution)
  • Subsegment: Finer-grained work within a segment (e.g., DB call)
  • Annotations: Indexed key-value pairs for filtering traces (up to 50 per trace)
  • Metadata: Non-indexed additional context (not searchable)

💡 Use annotations for filterable data (user_id, customer_tier, region) and metadata for debug details (full request body, configuration).
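
Annotations pay off when you query traces. For instance, a filter expression like annotation.customer_tier = "premium" AND error returns only failed premium-tier requests; a sketch of running that query with boto3 (time window and annotation key are illustrative):

import boto3
from datetime import datetime, timedelta

xray = boto3.client('xray')

# Find failed premium-tier requests from the last 15 minutes
summaries = xray.get_trace_summaries(
    StartTime=datetime.utcnow() - timedelta(minutes=15),
    EndTime=datetime.utcnow(),
    FilterExpression='annotation.customer_tier = "premium" AND error'
)

for summary in summaries['TraceSummaries']:
    print(summary['Id'], summary.get('ResponseTime'))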

Service Map Generation 🗺️

X-Ray automatically builds service maps showing dependencies:

┌──────────────────────────────────────────────────┐
│              Service Dependency Map              │
└──────────────────────────────────────────────────┘

         ┌──────────┐
         │ Client   │
         └────┬─────┘
              │ 10k req/min
              ↓
         ┌──────────┐
    ┌────┤ API GW   ├────┐
    │    └──────────┘    │
    │                    │
    ↓                    ↓
┌─────────┐         ┌─────────┐
│ Auth    │         │ Order   │
│ Service │         │ Service │
│ 5k req  │         │ 8k req  │
│ 2% err  │         │ 0.1% err│
└────┬────┘         └────┬────┘
     │                   │
     ↓                   ↓
┌─────────┐         ┌─────────┐
│ User DB │         │Product  │
│DynamoDB │         │Database │
│ 45ms p99│         │ RDS     │
└─────────┘         │ 120ms   │
                    └─────────┘

Cardinality Management 🎯

Cardinality is the number of unique values in a dimension. High cardinality kills observability performance.

High cardinality examples (avoid):

  • User IDs as metric dimensions (millions of users = millions of time series)
  • Request IDs as labels
  • Full URLs with query parameters
  • Timestamps as tags

Low cardinality examples (use these):

  • Service name (dozens of services)
  • Region (handful of AWS regions)
  • HTTP status code buckets (2xx, 4xx, 5xx)
  • Customer tier (free, premium, enterprise)

Bad metric:

# Creates millions of time series!
cloudwatch.put_metric_data(
    Namespace='MyApp',
    MetricData=[{
        'MetricName': 'RequestDuration',
        'Dimensions': [
            {'Name': 'UserID', 'Value': user_id},  # ❌ High cardinality
            {'Name': 'RequestID', 'Value': request_id}  # ❌ Unbounded
        ],
        'Value': duration
    }]
)

Good metric:

# Creates a manageable number of time series
cloudwatch.put_metric_data(
    Namespace='MyApp',
    MetricData=[{
        'MetricName': 'RequestDuration',
        'Dimensions': [
            {'Name': 'Service', 'Value': 'checkout-api'},  # ✅ Low cardinality
            {'Name': 'Region', 'Value': 'us-east-1'},  # ✅ Low cardinality
            {'Name': 'Tier', 'Value': customer_tier}  # ✅ Low cardinality (3-5 values)
        ],
        'Value': duration
    }]
)

Cardinality calculation:

Total time series = Product of all dimension cardinalities

Example:
- Service: 20 values
- Region: 4 values  
- Tier: 3 values
- Status: 5 values (2xx, 3xx, 4xx, 5xx, timeout)

Total: 20 × 4 × 3 × 5 = 1,200 time series ✅ Manageable

If you add UserID with 1M users:
1,200 × 1,000,000 = 1.2 billion time series ❌ Unworkable!
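
The same check in code: multiply the dimension cardinalities before shipping a new metric (the counts below mirror the example above):

from math import prod

dimension_cardinality = {'Service': 20, 'Region': 4, 'Tier': 3, 'Status': 5}
print(prod(dimension_cardinality.values()))   # 1,200 time series

dimension_cardinality['UserID'] = 1_000_000
print(prod(dimension_cardinality.values()))   # 1,200,000,000 - don't do this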

Cost Optimization Patterns 💰

CloudWatch Logs cost breakdown:

Component            Cost               Optimization
Ingestion            $0.50/GB           Log sampling, level filtering
Storage              $0.03/GB-month     Short retention, archive to S3
Analysis (Insights)  $0.005/GB scanned  Narrow time ranges, use filters
Data transfer        $0.01-0.09/GB out  Process in-region

Cost optimization strategies:

1. Tiered retention:

import boto3
from datetime import datetime, timedelta

logs = boto3.client('logs')

# Recent logs: short retention
logs.put_retention_policy(
    logGroupName='/aws/lambda/high-volume-api',
    retentionInDays=7  # Only 7 days in CloudWatch
)

# Archive to S3 for long-term storage (much cheaper)
logs.create_export_task(
    logGroupName='/aws/lambda/high-volume-api',
    fromTime=int((datetime.now() - timedelta(days=1)).timestamp() * 1000),
    to=int(datetime.now().timestamp() * 1000),
    destination='observability-archive-bucket',
    destinationPrefix='logs/2024/01/'
)

S3 Standard storage costs $0.023/GB-month versus $0.03/GB-month in CloudWatch Logs; S3 Glacier drops to around $0.004/GB-month for rarely accessed logs.
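
A lifecycle rule on the archive bucket completes the tiering; a sketch assuming the bucket and prefix used in the export above:

import boto3

s3 = boto3.client('s3')

# Move exported logs to Glacier after 30 days, delete after a year
s3.put_bucket_lifecycle_configuration(
    Bucket='observability-archive-bucket',
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'archive-exported-logs',
            'Filter': {'Prefix': 'logs/'},
            'Status': 'Enabled',
            'Transitions': [{'Days': 30, 'StorageClass': 'GLACIER'}],
            'Expiration': {'Days': 365}
        }]
    }
)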

2. Metric filtering:

# Create a metric filter only for errors (not all logs)
logs.put_metric_filter(
    logGroupName='/aws/lambda/api',
    filterName='ErrorCount',
    filterPattern='[time, request_id, level = ERROR*, ...]',
    metricTransformations=[{
        'metricName': 'Errors',
        'metricNamespace': 'MyApp',
        'metricValue': '1',
        'defaultValue': 0
    }]
)

3. Embedded Metric Format (EMF) for zero-overhead metrics:

import json
import time

def emit_emf_metric(metric_name, value, dimensions):
    # EMF: metrics embedded in log structure, extracted automatically
    emf_log = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "MyApp",
                "Dimensions": [list(dimensions.keys())],
                "Metrics": [{"Name": metric_name, "Unit": "Milliseconds"}]
            }]
        },
        metric_name: value,
        **dimensions
    }
    
    # Print to stdout - Lambda/ECS sends to CloudWatch Logs
    # CloudWatch automatically extracts metrics (no put_metric_data calls!)
    print(json.dumps(emf_log))

# Usage
emit_emf_metric(
    "ProcessingTime",
    245,
    {"Service": "checkout", "Region": "us-east-1"}
)

EMF benefits:

  • No additional API calls (reduces throttling)
  • Lower cost (uses existing log ingestion)
  • Single source of truth (metrics derived from logs)
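
If you would rather not hand-roll the EMF JSON, the aws-embedded-metrics library wraps the same format; a sketch based on that library's documented decorator and method names (worth verifying against the version you install):

# pip install aws-embedded-metrics
from aws_embedded_metrics import metric_scope

@metric_scope
def handler(event, context, metrics):
    # The decorator injects a metrics logger and flushes EMF JSON to stdout
    metrics.set_namespace("MyApp")
    metrics.put_dimensions({"Service": "checkout", "Region": "us-east-1"})
    metrics.put_metric("ProcessingTime", 245, "Milliseconds")
    return {"statusCode": 200}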

Real-World Examples 🛠️

Example 1: High-Throughput API Observability

Scenario: E-commerce API handling 50,000 requests/second during peak traffic. Need to monitor latency, errors, and throughput without breaking the bank.

Challenge:

  • 50k req/sec × 86,400 sec/day = 4.32 billion requests/day
  • Full tracing: 4.32B × $0.000001 per trace = $4,320/day = $130k/month ❌
  • Full logs: Hundreds of TB/month ❌

Solution architecture:

import random
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.ext.flask.middleware import XRayMiddleware
from flask import Flask, request
import time
import json

app = Flask(__name__)
XRayMiddleware(app, xray_recorder)

class AdaptiveSampler:
    def __init__(self):
        self.base_sample_rate = 0.001  # 0.1% baseline
        self.error_count = 0
        self.total_count = 0
    
    def should_sample(self, is_error=False, latency_ms=0):
        self.total_count += 1
        
        # Always sample errors
        if is_error:
            self.error_count += 1
            return True
        
        # Always sample slow requests (p99)
        if latency_ms > 1000:
            return True
        
        # Increase sampling if error rate rises
        if self.total_count > 1000:
            error_rate = self.error_count / self.total_count
            if error_rate > 0.01:  # >1% errors
                return random.random() < 0.1  # Sample 10%
        
        # Normal sampling
        return random.random() < self.base_sample_rate

sampler = AdaptiveSampler()

@app.route('/api/checkout')
def checkout():
    start = time.time()
    is_error = False
    
    try:
        # Business logic here
        process_checkout()
        status_code = 200
    except Exception as e:
        is_error = True
        status_code = 500
    
    duration_ms = (time.time() - start) * 1000
    
    # Decide whether to create detailed trace
    if sampler.should_sample(is_error, duration_ms):
        segment = xray_recorder.current_segment()
        segment.put_annotation('checkout_type', request.args.get('type'))
        segment.put_annotation('customer_tier', get_customer_tier())
    
    # Always emit aggregated metrics via EMF
    emit_emf_metric('CheckoutLatency', duration_ms, {
        'Service': 'checkout-api',
        'Region': 'us-east-1',
        'Status': f'{status_code // 100}xx'
    })
    
    return {'status': 'success'}, status_code

def emit_emf_metric(name, value, dimensions):
    print(json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "EcommerceAPI",
                "Dimensions": [list(dimensions.keys())],
                "Metrics": [{"Name": name, "Unit": "Milliseconds"}]
            }]
        },
        name: value,
        **dimensions
    }))

Results:

  • Traces: 0.1% baseline + all errors = ~43M traces/day
  • Cost: 43M × $0.000001 = $43/day = $1,290/month ✅ (99% reduction)
  • Full metrics coverage via EMF (no additional cost)
  • Still catch all errors and performance outliers

Example 2: Multi-Region Observability

Scenario: Global application deployed in 6 AWS regions. Need unified observability across all regions.

Challenge:

  • CloudWatch metrics are region-specific
  • X-Ray traces don't cross regions
  • Need single pane of glass for global view

Solution: Cross-region aggregation with Amazon Managed Grafana

# CloudFormation/CDK for cross-region metric streams
import aws_cdk as cdk
from aws_cdk import aws_iam as iam
from aws_cdk import aws_kinesis as kinesis
from aws_cdk import aws_kinesisfirehose as firehose
from aws_cdk import aws_s3 as s3

class CrossRegionObservability(cdk.Stack):
    def __init__(self, scope, id, **kwargs):
        super().__init__(scope, id, **kwargs)
        
        # Central Kinesis Data Stream for metric aggregation
        central_stream = kinesis.Stream(
            self, 'CentralMetricStream',
            shard_count=10,
            retention_period=cdk.Duration.hours(24)
        )
        
        # S3 bucket for long-term metric storage
        metric_bucket = s3.Bucket(self, 'MetricArchive')
        
        # Role that lets Firehose read the stream and write to the bucket
        firehose_role = iam.Role(
            self, 'FirehoseDeliveryRole',
            assumed_by=iam.ServicePrincipal('firehose.amazonaws.com')
        )
        central_stream.grant_read(firehose_role)
        metric_bucket.grant_write(firehose_role)
        
        # Firehose delivers the central stream to S3
        delivery_stream = firehose.CfnDeliveryStream(
            self, 'MetricDeliveryStream',
            delivery_stream_type='KinesisStreamAsSource',
            kinesis_stream_source_configuration=firehose.CfnDeliveryStream.KinesisStreamSourceConfigurationProperty(
                kinesis_stream_arn=central_stream.stream_arn,
                role_arn=firehose_role.role_arn
            ),
            s3_destination_configuration=firehose.CfnDeliveryStream.S3DestinationConfigurationProperty(
                bucket_arn=metric_bucket.bucket_arn,
                role_arn=firehose_role.role_arn,
                prefix='metrics/',
                compression_format='GZIP'
            )
        )

Regional metric forwarder Lambda:

import boto3
import json
import os
from datetime import datetime, timedelta

kinesis = boto3.client('kinesis')
cloudwatch = boto3.client('cloudwatch')

CENTRAL_STREAM = os.environ['CENTRAL_STREAM_NAME']
REGION = os.environ['AWS_REGION']

def lambda_handler(event, context):
    # Query regional metrics
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/Lambda',
        MetricName='Duration',
        Dimensions=[{'Name': 'FunctionName', 'Value': 'checkout-api'}],
        StartTime=datetime.utcnow() - timedelta(minutes=5),
        EndTime=datetime.utcnow(),
        Period=60,
        Statistics=['Average', 'Maximum', 'Minimum']
    )
    
    # Forward to central stream with region tag
    for datapoint in response['Datapoints']:
        record = {
            'region': REGION,
            'metric': 'Lambda.Duration',
            'timestamp': datapoint['Timestamp'].isoformat(),
            'average': datapoint['Average'],
            'maximum': datapoint['Maximum'],
            'minimum': datapoint['Minimum']
        }
        
        kinesis.put_record(
            StreamName=CENTRAL_STREAM,
            Data=json.dumps(record),
            PartitionKey=REGION
        )

Grafana dashboard configuration:

{
  "dashboard": {
    "title": "Global API Performance",
    "panels": [
      {
        "title": "Requests by Region",
        "type": "graph",
        "targets": [
          {
            "datasource": "CloudWatch",
            "namespace": "AWS/ApiGateway",
            "metricName": "Count",
            "dimensions": {"ApiName": "checkout-api"},
            "statistic": "Sum",
            "period": 60,
            "region": "us-east-1",
            "alias": "US East"
          },
          {
            "region": "eu-west-1",
            "alias": "EU West"
          },
          {
            "region": "ap-southeast-1",
            "alias": "Asia Pacific"
          }
        ]
      }
    ]
  }
}

Example 3: Cost-Optimized Log Pipeline

Scenario: Microservices platform with 200 services generating 10 TB of logs daily. Current CloudWatch Logs bill: $150k/month.

Challenge: Need to reduce costs by 80% without losing critical debugging capabilities.

Solution: Hybrid log architecture

┌──────────────────────────────────────────────────┐
│         Tiered Log Pipeline Architecture         │
└──────────────────────────────────────────────────┘

  Application Logs
        │
        ↓
  ┌─────────────┐
  │ Log Router  │ (Lambda@Edge or Fluent Bit)
  └──────┬──────┘
         │
    ┌────┼────┬────────────┐
    │    │    │            │
    ↓    ↓    ↓            ↓
 ┌─────┐ ┌─────┐ ┌──────┐ ┌──────┐
 │ERROR│ │WARN │ │INFO  │ │DEBUG │
 └──┬──┘ └──┬──┘ └───┬──┘ └───┬──┘
    │       │        │        │
    ↓       ↓        ↓        ↓
 ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐
 │  CW  │ │  CW  │ │  S3  │ │  /   │
 │ Logs │ │ Logs │ │Direct│ │ dev/ │
 │30day │ │7day  │ │      │ │ null │
 └──┬───┘ └──┬───┘ └───┬──┘ └──────┘
    │        │         │
    ↓        ↓         ↓
 ┌──────────────────────────┐
 │   OpenSearch (errors +   │
 │   warnings only)         │
 └──────────────────────────┘
          │
          ↓
     Alerting & Analysis

Implementation with Fluent Bit:

# fluent-bit.conf
[SERVICE]
    Flush         5
    Log_Level     info

[INPUT]
    Name              tail
    Path              /var/log/app/*.log
    Parser            json
    Tag               app.raw
    Refresh_Interval  5

# Re-tag records by level so each tier can be routed independently
[FILTER]
    Name    rewrite_tag
    Match   app.raw
    Rule    $level ^(ERROR|CRITICAL)$ app.error false
    Rule    $level ^WARN$             app.warn  false
    Rule    $level ^INFO$             app.info  false

# Route ERROR logs to CloudWatch + OpenSearch
[OUTPUT]
    Name                cloudwatch_logs
    Match               app.error
    region              us-east-1
    log_group_name      /app/errors
    log_stream_prefix   error-
    auto_create_group   true

[OUTPUT]
    Name            es
    Match           app.error
    Host            search-mydomain.us-east-1.es.amazonaws.com
    Port            443
    TLS             On
    Index           app-errors
    Type            _doc

# Route WARN logs to CloudWatch only (shorter retention)
[OUTPUT]
    Name                cloudwatch_logs
    Match               app.warn
    region              us-east-1
    log_group_name      /app/warnings
    log_stream_prefix   warn-
    auto_create_group   true

# Route INFO logs directly to S3 (bypass CloudWatch)
[OUTPUT]
    Name            s3
    Match           app.info
    bucket          app-logs-archive
    region          us-east-1
    store_dir       /tmp/fluent-bit/s3
    s3_key_format   /logs/$TAG[1]/%Y/%m/%d/%H_%M_%S_$UUID.gz
    compression     gzip

# DEBUG logs match no rule, keep the app.raw tag, and are discarded
[OUTPUT]
    Name    null
    Match   app.raw

Cost breakdown before vs after:

Tier   Volume      Before (CW)  After                     Savings
ERROR  0.1 TB/day  $1,500/mo    $1,500/mo (CW)            $0
WARN   0.5 TB/day  $7,500/mo    $3,750/mo (7d retention)  $3,750
INFO   5 TB/day    $75,000/mo   $3,600/mo (S3 direct)     $71,400
DEBUG  4 TB/day    $60,000/mo   $0 (dropped)              $60,000
Total  9.6 TB/day  $144,000/mo  $8,850/mo                 $135,150 (94%)

Example 4: Real-Time Anomaly Detection

Scenario: Detect unusual patterns in API metrics (latency spikes, error rate increases) and alert within 60 seconds.

Solution: CloudWatch Anomaly Detection with Lambda-based alerting

import boto3
import json
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')
sns = boto3.client('sns')

# Enable anomaly detection on key metrics
def enable_anomaly_detection():
    cloudwatch.put_anomaly_detector(
        Namespace='AWS/ApiGateway',
        MetricName='Latency',
        Dimensions=[{'Name': 'ApiName', 'Value': 'checkout-api'}],
        Stat='Average'
    )
    
    # Create alarm for anomalies
    cloudwatch.put_metric_alarm(
        AlarmName='API-Latency-Anomaly',
        ComparisonOperator='LessThanLowerOrGreaterThanUpperThreshold',
        EvaluationPeriods=2,
        Metrics=[
            {
                'Id': 'm1',
                'ReturnData': True,
                'MetricStat': {
                    'Metric': {
                        'Namespace': 'AWS/ApiGateway',
                        'MetricName': 'Latency',
                        'Dimensions': [{'Name': 'ApiName', 'Value': 'checkout-api'}]
                    },
                    'Period': 60,
                    'Stat': 'Average'
                }
            },
            {
                'Id': 'ad1',
                'Expression': 'ANOMALY_DETECTION_BAND(m1, 2)',  # 2 std deviations
                'ReturnData': True
            }
        ],
        ThresholdMetricId='ad1',
        AlarmActions=['arn:aws:sns:us-east-1:123456789012:alerts']
    )

# Lambda function to enrich alerts with context
def lambda_handler(event, context):
    # Triggered by CloudWatch Alarm via SNS
    alarm = json.loads(event['Records'][0]['Sns']['Message'])
    
    metric_name = alarm['Trigger']['MetricName']
    namespace = alarm['Trigger']['Namespace']
    
    # Fetch recent data for context
    response = cloudwatch.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric_name,
        Dimensions=alarm['Trigger']['Dimensions'],
        StartTime=datetime.utcnow() - timedelta(minutes=15),
        EndTime=datetime.utcnow(),
        Period=60,
        Statistics=['Average', 'Maximum', 'Minimum', 'SampleCount']
    )
    
    # Get anomaly band for comparison
    anomaly_response = cloudwatch.get_metric_data(
        MetricDataQueries=[
            {
                'Id': 'm1',
                'MetricStat': {
                    'Metric': {
                        'Namespace': namespace,
                        'MetricName': metric_name,
                        'Dimensions': alarm['Trigger']['Dimensions']
                    },
                    'Period': 60,
                    'Stat': 'Average'
                }
            },
            {
                'Id': 'ad1',
                'Expression': 'ANOMALY_DETECTION_BAND(m1)'
            }
        ],
        StartTime=datetime.utcnow() - timedelta(minutes=15),
        EndTime=datetime.utcnow()
    )
    
    # Build enriched alert (datapoints are not guaranteed to be time-ordered)
    datapoints = sorted(response['Datapoints'], key=lambda d: d['Timestamp'])
    recent = datapoints[-5:]
    baseline = datapoints[:-5]
    recent_avg = sum(d['Average'] for d in recent) / max(len(recent), 1)
    baseline_avg = sum(d['Average'] for d in baseline) / max(len(baseline), 1)
    
    enriched_message = f"""
    🚨 ANOMALY DETECTED: {metric_name}
    
    Current Average: {recent_avg:.2f}ms
    Baseline Average: {baseline_avg:.2f}ms
    Change: {((recent_avg - baseline_avg) / baseline_avg * 100):.1f}%
    
    Recent Datapoints:
    {json.dumps(datapoints[-5:], indent=2, default=str)}
    
    Runbook: https://wiki.company.com/runbooks/api-latency
    Dashboard: https://console.aws.amazon.com/cloudwatch/...
    """
    
    # Send to incident management system
    sns.publish(
        TopicArn='arn:aws:sns:us-east-1:123456789012:incidents',
        Subject=f'Anomaly: {metric_name}',
        Message=enriched_message
    )

Common Mistakes and How to Avoid Them ⚠️

Mistake 1: Logging Everything at DEBUG Level in Production

Problem: DEBUG logs contain extensive detail useful during development but generate massive volumes in production, causing:

  • High CloudWatch Logs costs ($50k+/month)
  • Storage bloat
  • Difficult to find critical information in noise

Solution:

import logging
import os

# Use environment-based log levels
LOG_LEVEL = os.environ.get('LOG_LEVEL', 'INFO')
logger = logging.getLogger()
logger.setLevel(LOG_LEVEL)

# Selectively enable DEBUG for troubleshooting
if request.headers.get('X-Debug-Mode') == 'true':
    # Only for authenticated debug requests
    logger.setLevel('DEBUG')
    logger.debug("Detailed debugging info here")

Mistake 2: High-Cardinality Metric Dimensions

Problem: Using unbounded dimensions (user IDs, request IDs, timestamps) creates millions of unique metric time series:

# ❌ WRONG: Creates millions of time series
cloudwatch.put_metric_data(
    Namespace='MyApp',
    MetricData=[{
        'MetricName': 'RequestDuration',
        'Dimensions': [
            {'Name': 'UserID', 'Value': user_id},  # 1M+ values
            {'Name': 'URL', 'Value': request.url}   # Thousands of values
        ],
        'Value': duration
    }]
)

Result:

  • CloudWatch throttling
  • Expensive metric storage
  • Slow dashboard queries

Solution:

# ✅ RIGHT: Low-cardinality dimensions
cloudwatch.put_metric_data(
    Namespace='MyApp',
    MetricData=[{
        'MetricName': 'RequestDuration',
        'Dimensions': [
            {'Name': 'Service', 'Value': 'api'},
            {'Name': 'Endpoint', 'Value': '/checkout'},  # Grouped endpoint
            {'Name': 'StatusClass', 'Value': '2xx'}  # Not individual codes
        ],
        'Value': duration
    }]
)

# Store high-cardinality data in logs, not metrics
logger.info("Request completed", extra={
    'user_id': user_id,  # High cardinality OK in logs
    'request_id': request_id,
    'duration_ms': duration
})

Mistake 3: Not Using Sampling for Traces

Problem: Tracing every request at scale:

  • 100k req/sec × 86,400 sec/day = 8.64B traces/day
  • 8.64B × $0.000001 = $8,640/day = $259k/month

Solution: Intelligent sampling

# ✅ Sample strategically
import random

def should_trace(request, response, duration_ms):
    # Always trace errors
    if response.status_code >= 400:
        return True
    
    # Always trace slow requests
    if duration_ms > 1000:
        return True
    
    # Sample 0.1% of normal traffic
    return random.random() < 0.001

Result: 99.9% cost reduction while catching all critical issues.

Mistake 4: Not Setting Log Retention Policies

Problem: CloudWatch Logs default retention is indefinite. Logs accumulate forever, costing $0.03/GB-month.

  • 1 TB logs = $30/month ongoing
  • After 1 year: $360 for old logs you'll never access
  • After 5 years: $1,800 wasted

Solution:

import boto3

logs = boto3.client('logs')

# Set retention on all log groups (paginate: each call returns at most 50 groups)
paginator = logs.get_paginator('describe_log_groups')
for page in paginator.paginate():
    for log_group in page['logGroups']:
        logs.put_retention_policy(
            logGroupName=log_group['logGroupName'],
            retentionInDays=30  # Adjust based on compliance needs
        )

Mistake 5: Synchronous Logging in Critical Path

Problem: Blocking on log writes adds latency:

# ❌ WRONG: Synchronous logging slows request
def process_request():
    start = time.time()
    
    # Each log write takes 5-10ms
    logger.info("Starting processing")
    result = do_work()
    logger.info("Work completed")
    logger.info("Finalizing")
    
    # Total request latency increased by 15-30ms!
    return result

Solution: Asynchronous logging

import queue
import threading
import time

log_queue = queue.Queue(maxsize=10000)

def async_log_worker():
    while True:
        log_entry = log_queue.get()
        logger.info(log_entry['message'], extra=log_entry['extra'])
        log_queue.task_done()

# Start background thread
threading.Thread(target=async_log_worker, daemon=True).start()

def process_request():
    start = time.time()
    
    # Non-blocking log
    log_queue.put({
        'message': 'Starting processing',
        'extra': {'request_id': request_id}
    })
    
    result = do_work()
    
    log_queue.put({
        'message': 'Work completed',
        'extra': {'duration_ms': (time.time() - start) * 1000}
    })
    
    return result
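
The standard library ships this pattern as logging.handlers.QueueHandler and QueueListener, which avoids the hand-rolled worker loop; a minimal sketch:

import logging
import logging.handlers
import queue

log_queue = queue.Queue(maxsize=10000)

# Application threads enqueue records without blocking on handler I/O
queue_handler = logging.handlers.QueueHandler(log_queue)
root = logging.getLogger()
root.addHandler(queue_handler)
root.setLevel(logging.INFO)

# A background listener drains the queue and writes to the real handler
listener = logging.handlers.QueueListener(log_queue, logging.StreamHandler())
listener.start()

logging.info("Request completed")  # returns immediately
# listener.stop() at shutdown flushes any remaining records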

Key Takeaways 🎯

💡 Observability at scale is fundamentally about trade-offs: You cannot collect, store, and analyze everything. Successful strategies balance completeness, cost, and performance.

Critical principles:

  1. Sample intelligently: Use tail-based sampling to capture 100% of errors and outliers while sampling only 0.1-1% of normal traffic

  2. Manage cardinality ruthlessly: Keep metric dimensions low-cardinality. Use logs for high-cardinality data (user IDs, request IDs)

  3. Tier your data: Hot data (recent errors) → CloudWatch. Warm data (recent info logs) → S3. Cold data (old logs) → Glacier

  4. Aggregate locally: Use StatisticSets and EMF to reduce API calls by 100-1000x

  5. Set retention policies: Don't pay to store logs you'll never access. 30-90 days is often sufficient

  6. Use structured logging: JSON logs enable efficient filtering and analysis. Unstructured logs are expensive to query

  7. Optimize for your access patterns: If you query errors frequently, store them separately from info logs

  8. Monitor your monitoring costs: Set CloudWatch billing alarms (see the sketch below). Observability shouldn't cost more than the infrastructure it monitors!
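
A hedged sketch of such a billing alarm; EstimatedCharges is published only in us-east-1 and requires billing alerts to be enabled, and the threshold and topic ARN here are illustrative:

import boto3

# Billing metrics only exist in us-east-1
cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

cloudwatch.put_metric_alarm(
    AlarmName='CloudWatch-Spend-Alert',
    Namespace='AWS/Billing',
    MetricName='EstimatedCharges',
    Dimensions=[
        {'Name': 'Currency', 'Value': 'USD'},
        {'Name': 'ServiceName', 'Value': 'AmazonCloudWatch'}
    ],
    Statistic='Maximum',
    Period=21600,              # 6 hours; billing data updates infrequently
    EvaluationPeriods=1,
    Threshold=5000.0,          # alert past $5,000 month-to-date (illustrative)
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:alerts']
)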


📋 Quick Reference Card

Concept            Key Points
Sampling Rates     Baseline: 0.1-1% | Errors: 100% | Slow requests: 100%
Cardinality Limit  Keep total time series under 10,000 per metric
Log Retention      Errors: 30-90 days | Info: 7-30 days | Debug: 1-7 days
Cost Optimization  Use EMF, aggregate locally, sample logs, set retention
X-Ray Concepts     Trace ID (entire request) | Segment (service) | Subsegment (operation)
Good Dimensions    Service, Region, Tier, StatusClass (2xx/4xx/5xx)
Bad Dimensions     UserID, RequestID, Timestamp, full URLs
CloudWatch Costs   Ingestion: $0.50/GB | Storage: $0.03/GB-month | Analysis: $0.005/GB scanned