Observability at Scale
CloudWatch, X-Ray, OpenTelemetry, distributed tracing, and log architecture
Observability at Scale on AWS
Master observability at scale with free flashcards and comprehensive examples. This lesson covers distributed tracing architectures, metric aggregation patterns, log centralization strategies, and cost optimization techniques: essential concepts for building resilient, observable systems on AWS that handle production workloads efficiently.
Welcome to Observability at Scale
Building observable systems is challenging enough at small scale, but when you're managing hundreds of microservices, thousands of servers, or millions of requests per minute, observability becomes a fundamentally different problem. At scale, naive approaches quickly become unworkable: you can't manually check dashboards, logs become impossibly large, and costs spiral out of control.
Observability at scale requires intelligent sampling strategies, efficient aggregation, distributed correlation, and careful cost management. AWS provides powerful tools like CloudWatch, X-Ray, OpenSearch, and Managed Grafana, but using them effectively at scale demands architectural discipline and operational expertise.
In this lesson, you'll learn how to design observability systems that remain useful and affordable even as your infrastructure grows exponentially. We'll explore real-world patterns for handling billions of data points while maintaining the insights you need to debug production issues quickly.
Core Concepts of Observability at Scale
The Three Pillars: Metrics, Logs, and Traces
Metrics are numerical measurements taken over time intervals. They answer "what" questions:
- Request rate, error rate, latency percentiles (RED metrics)
- CPU, memory, disk usage (system metrics)
- Business metrics like orders/minute, revenue/hour
Logs are structured or unstructured event records. They answer "when" and "why" questions:
- Application events with timestamps and context
- Error messages, stack traces, debug information
- Audit trails and security events
Traces show request paths through distributed systems. They answer "where" and "how" questions:
- End-to-end latency breakdown across services
- Service dependency maps
- Bottleneck identification
💡 At scale, you cannot collect everything. A system processing 100,000 requests/second generates massive data volumes. A full trace of every request would cost millions monthly in storage and transfer fees.
Sampling Strategies for Scale
Head-based sampling decides at the request start whether to trace:
import random
from aws_xray_sdk.core import xray_recorder
## Sample 1% of requests
SAMPLE_RATE = 0.01
def should_sample():
return random.random() < SAMPLE_RATE
if should_sample():
segment = xray_recorder.begin_segment('api-request')
# Trace this request
else:
# Skip tracing
pass
Tail-based sampling decides after request completion, keeping only interesting traces:
import random
from aws_xray_sdk.core import xray_recorder
def complete_trace(segment):
# Keep all errors
if segment.error or segment.fault:
xray_recorder.end_segment()
return
# Keep slow requests (p99)
if segment.duration > 2.0:
xray_recorder.end_segment()
return
# Sample 0.1% of normal requests
if random.random() < 0.001:
xray_recorder.end_segment()
else:
# Discard trace
segment.sampled = False
xray_recorder.end_segment()
Adaptive sampling adjusts rates based on traffic patterns and error rates. AWS X-Ray supports this through sampling rules, which can be defined in the console or as code (see the sketch after the table):
| Rule Priority | Service | HTTP Method | Fixed Rate | Reservoir |
|---|---|---|---|---|
| 1 | checkout-api | POST | 100% | N/A |
| 2 | * | * | 5% | 1 req/sec |
| 3 | health-check | GET | 0% | 0 |
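These rules can be managed as code instead of being clicked together in the console. A minimal boto3 sketch that mirrors the first two rows of the table (rule names are illustrative):

import boto3

xray = boto3.client('xray')

# Rule 1: always trace checkout POSTs
xray.create_sampling_rule(
    SamplingRule={
        'RuleName': 'checkout-full-trace',
        'Priority': 1,
        'FixedRate': 1.0,        # 100% of matching requests
        'ReservoirSize': 0,
        'ServiceName': 'checkout-api',
        'ServiceType': '*',
        'Host': '*',
        'HTTPMethod': 'POST',
        'URLPath': '*',
        'ResourceARN': '*',
        'Version': 1
    }
)

# Rule 2: catch-all at 5% with a 1 req/sec reservoir
xray.create_sampling_rule(
    SamplingRule={
        'RuleName': 'default-five-percent',
        'Priority': 2,
        'FixedRate': 0.05,
        'ReservoirSize': 1,      # always trace at least 1 req/sec
        'ServiceName': '*',
        'ServiceType': '*',
        'Host': '*',
        'HTTPMethod': '*',
        'URLPath': '*',
        'ResourceARN': '*',
        'Version': 1
    }
)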
⚠️ Common mistake: Using head-based sampling only. You'll miss rare but critical errors that only appear under specific conditions.
Metric Aggregation Architectures
At scale, individual data points become less important than aggregates. CloudWatch allows you to emit high-resolution metrics (1-second granularity) but stores them as aggregates:
import boto3
from datetime import datetime
cloudwatch = boto3.client('cloudwatch')
## Publishing individual metrics (expensive at scale)
def publish_individual_metrics(values):
for value in values:
cloudwatch.put_metric_data(
Namespace='MyApp',
MetricData=[{
'MetricName': 'ResponseTime',
'Value': value,
'Timestamp': datetime.utcnow(),
'Unit': 'Milliseconds'
}]
)
This approach has problems:
- API call per metric (throttling at high volume)
- High cost: $0.01 per 1,000 metrics beyond free tier
- Network overhead
Better approach: Use StatisticSets for local aggregation:
from collections import defaultdict
import time
class MetricAggregator:
def __init__(self, flush_interval=60):
self.metrics = defaultdict(list)
self.flush_interval = flush_interval
self.last_flush = time.time()
def record(self, metric_name, value):
self.metrics[metric_name].append(value)
if time.time() - self.last_flush > self.flush_interval:
self.flush()
def flush(self):
for metric_name, values in self.metrics.items():
cloudwatch.put_metric_data(
Namespace='MyApp',
MetricData=[{
'MetricName': metric_name,
'StatisticValues': {
'SampleCount': len(values),
'Sum': sum(values),
'Minimum': min(values),
'Maximum': max(values)
},
'Timestamp': datetime.utcnow(),
'Unit': 'Milliseconds'
}]
)
self.metrics.clear()
self.last_flush = time.time()
## Usage: aggregate locally, flush periodically
aggregator = MetricAggregator(flush_interval=60)
for response_time in response_times:
aggregator.record('ResponseTime', response_time)
This reduces API calls by ~1000x and cuts costs proportionally.
Log Aggregation and Indexing
The log volume problem: A microservice generating 100 log lines per request, at 1,000 req/sec, produces 100,000 log lines/sec = 8.64 billion lines/day.
At $0.50/GB ingested into CloudWatch Logs, assuming 500 bytes per line:
- 100,000 lines/sec × 500 bytes = 50 MB/sec
- 50 MB/sec × 86,400 sec/day = 4.32 TB/day
- 4.32 TB × $0.50/GB = $2,160/day = $64,800/month for one service!
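A small helper makes this arithmetic repeatable when sizing new services; the sketch below simply encodes the figures used above (check the $0.50/GB rate against current pricing):

def monthly_ingest_cost(req_per_sec, lines_per_req, bytes_per_line,
                        price_per_gb=0.50, days=30):
    """Rough CloudWatch Logs ingestion cost using the assumptions above."""
    bytes_per_sec = req_per_sec * lines_per_req * bytes_per_line
    gb_per_day = bytes_per_sec * 86_400 / 1e9
    return gb_per_day * price_per_gb * days

# The example above: 1,000 req/sec, 100 lines/request, 500 bytes/line
print(monthly_ingest_cost(1_000, 100, 500))   # ~64,800 USD/month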
Cost optimization strategies:
1. Structured logging with filtering:
import json
import logging
from pythonjsonlogger import jsonlogger
## Structured JSON logs enable efficient querying
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # default root level is WARNING; raise it so info events are emitted
loghandler = logging.StreamHandler()
formatter = jsonlogger.JsonFormatter()
loghandler.setFormatter(formatter)
logger.addHandler(loghandler)
def process_request(request_id, user_id, action):
# Log with structured fields
logger.info(
"Request processed",
extra={
"request_id": request_id,
"user_id": user_id,
"action": action,
"duration_ms": 245,
"status": "success"
}
)
2. Log level management:
import os
## Production: only warnings and errors
LOG_LEVEL = os.environ.get('LOG_LEVEL', 'WARNING')
logger.setLevel(LOG_LEVEL)
## DEBUG logs only for specific high-value requests
if request.headers.get('X-Debug-Token') == SECRET_DEBUG_TOKEN:
logger.setLevel('DEBUG')
3. Subscription filters for critical logs only:
Don't send all logs to OpenSearch. Use CloudWatch Logs subscription filters to forward only errors:
{
"filterPattern": "{ $.level = \"ERROR\" || $.status_code >= 500 }",
"destinationArn": "arn:aws:lambda:us-east-1:123456789012:function:LogProcessor"
}
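The same filter can be attached with boto3, reusing the pattern and destination from the JSON above:

import boto3

logs = boto3.client('logs')

# Forward only error-level events to the processing Lambda
# (the Lambda must also allow invocation by logs.amazonaws.com)
logs.put_subscription_filter(
    logGroupName='/aws/lambda/api',
    filterName='errors-to-processor',
    filterPattern='{ $.level = "ERROR" || $.status_code >= 500 }',
    destinationArn='arn:aws:lambda:us-east-1:123456789012:function:LogProcessor'
)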
4. Log sampling:
import random
class SampledLogger:
    def __init__(self, logger, sample_rate=0.1):
        self.logger = logger
        self.sample_rate = sample_rate
    def info(self, message, **kwargs):
        # Sample routine info-level events
        if random.random() < self.sample_rate:
            self.logger.info(message, extra=kwargs)
    def error(self, message, **kwargs):
        # Always log errors in full
        self.logger.error(message, extra=kwargs)
## Log only 10% of successful requests; errors always get through
sampled_logger = SampledLogger(logger, sample_rate=0.1)
sampled_logger.info("Request completed", status="success")
Distributed Tracing Architecture
In microservices, a single user request might touch 10+ services. Distributed tracing tracks requests across service boundaries.
AWS X-Ray tracing flow:
              Distributed Trace Flow
              ----------------------

Client Request
      |
      v
+-----------+   Trace ID: 1-67890-abc123
|  API GW   |   Segment ID: seg-001
+-----+-----+   Start: T+0ms
      |
      +------> +-----------+
      |        | Lambda 1  |   Subsegment: sub-001
      |        | (Auth)    |   Parent: seg-001
      |        +-----------+   Duration: 50ms
      |
      +------> +-----------+
               | Lambda 2  |   Subsegment: sub-002
               | (Logic)   |   Parent: seg-001
               +-----+-----+   Duration: 200ms
                     |
                     +-----> +-----------+
                     |       | DynamoDB  |   Subsegment: sub-003
                     |       +-----------+   Duration: 45ms
                     |
                     +-----> +-----------+
                             |    SQS    |   Subsegment: sub-004
                             +-----------+   Duration: 15ms

Response (Total: 250ms)
Implementing X-Ray in Python:
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.core import patch_all
import boto3
## Automatically instrument AWS SDK calls
patch_all()
def lambda_handler(event, context):
# Lambda automatically creates segment
# Add custom subsegments for business logic
with xray_recorder.capture('validate_input') as subsegment:
user_id = event['user_id']
subsegment.put_annotation('user_id', user_id)
subsegment.put_metadata('payload', event)
validate(user_id)
with xray_recorder.capture('database_query'):
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Users')
response = table.get_item(Key={'user_id': user_id})
return {'statusCode': 200, 'body': response}
Key X-Ray concepts:
- Trace ID: Unique identifier for entire request (propagated via HTTP headers)
- Segment: Work done by a single service (e.g., Lambda function execution)
- Subsegment: Finer-grained work within a segment (e.g., DB call)
- Annotations: Indexed key-value pairs for filtering traces (max 50 per segment)
- Metadata: Non-indexed additional context (not searchable)
💡 Use annotations for filterable data (user_id, customer_tier, region) and metadata for debug details (full request body, configuration).
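Annotations are what make traces searchable after the fact. As a sketch, assuming annotations like the ones above have been recorded, you can pull matching traces back with an X-Ray filter expression:

import boto3
from datetime import datetime, timedelta

xray = boto3.client('xray')

# Find premium-tier traces slower than 2 seconds in the last hour
summaries = xray.get_trace_summaries(
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    FilterExpression='annotation.customer_tier = "premium" AND responsetime > 2'
)
for summary in summaries['TraceSummaries']:
    print(summary['Id'], summary.get('ResponseTime'))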
Service Map Generation
X-Ray automatically builds service maps showing dependencies:
              Service Dependency Map
              ----------------------

                +----------+
                |  Client  |
                +----+-----+
                     | 10k req/min
                     v
                +----------+
         +------|  API GW  |------+
         |      +----------+      |
         v                        v
   +-----------+            +-----------+
   |   Auth    |            |   Order   |
   |  Service  |            |  Service  |
   |  5k req   |            |  8k req   |
   |  2% err   |            |  0.1% err |
   +-----+-----+            +-----+-----+
         |                        |
         v                        v
   +-----------+            +-----------+
   |  User DB  |            |  Product  |
   |  DynamoDB |            |  Database |
   |  45ms p99 |            |  (RDS)    |
   +-----------+            |  120ms    |
                            +-----------+
Cardinality Management
Cardinality is the number of unique values in a dimension. High cardinality kills observability performance.
High cardinality examples (avoid):
- User IDs as metric dimensions (millions of users = millions of time series)
- Request IDs as labels
- Full URLs with query parameters
- Timestamps as tags
Low cardinality examples (use these):
- Service name (dozens of services)
- Region (handful of AWS regions)
- HTTP status code buckets (2xx, 4xx, 5xx)
- Customer tier (free, premium, enterprise)
Bad metric:
## Creates millions of time series!
cloudwatch.put_metric_data(
Namespace='MyApp',
MetricData=[{
'MetricName': 'RequestDuration',
'Dimensions': [
{'Name': 'UserID', 'Value': user_id},        # High cardinality (avoid)
{'Name': 'RequestID', 'Value': request_id}   # Unbounded (avoid)
],
'Value': duration
}]
)
Good metric:
## Creates manageable number of time series
cloudwatch.put_metric_data(
Namespace='MyApp',
MetricData=[{
'MetricName': 'RequestDuration',
'Dimensions': [
{'Name': 'Service', 'Value': 'checkout-api'},  # Low cardinality
{'Name': 'Region', 'Value': 'us-east-1'},      # Low cardinality
{'Name': 'Tier', 'Value': customer_tier}       # Low cardinality (3-5 values)
],
'Value': duration
}]
)
Cardinality calculation:
Total time series = Product of all dimension cardinalities
Example:
- Service: 20 values
- Region: 4 values
- Tier: 3 values
- Status: 5 values (2xx, 3xx, 4xx, 5xx, timeout)
Total: 20 × 4 × 3 × 5 = 1,200 time series (manageable)
If you add UserID with 1M users:
1,200 × 1,000,000 = 1.2 billion time series (unworkable!)
Cost Optimization Patterns
CloudWatch Logs cost breakdown:
| Component | Cost | Optimization |
|---|---|---|
| Ingestion | $0.50/GB | Log sampling, level filtering |
| Storage | $0.03/GB-month | Short retention, archive to S3 |
| Analysis (Insights) | $0.005/GB scanned | Narrow time ranges, use filters |
| Data transfer | $0.01-0.09/GB out | Process in-region |
Cost optimization strategies:
1. Tiered retention:
import boto3
from datetime import datetime, timedelta
logs = boto3.client('logs')
## Recent logs: short retention
logs.put_retention_policy(
logGroupName='/aws/lambda/high-volume-api',
retentionInDays=7 # Only 7 days in CloudWatch
)
## Archive to S3 for long-term (much cheaper)
logs.create_export_task(
logGroupName='/aws/lambda/high-volume-api',
fromTime=int((datetime.now() - timedelta(days=1)).timestamp() * 1000),
to=int(datetime.now().timestamp() * 1000),
destination='observability-archive-bucket',
destinationPrefix='logs/2024/01/'
)
S3 Standard storage costs $0.023/GB-month vs. $0.03/GB-month in CloudWatch Logs; S3 Glacier drops to $0.004/GB-month for rarely accessed logs.
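Getting to the Glacier price means adding a lifecycle rule to the archive bucket; a sketch with assumed transition and expiration windows:

import boto3

s3 = boto3.client('s3')

# Transition exported logs to Glacier after 90 days, delete after a year
s3.put_bucket_lifecycle_configuration(
    Bucket='observability-archive-bucket',
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'archive-old-logs',
            'Filter': {'Prefix': 'logs/'},
            'Status': 'Enabled',
            'Transitions': [{'Days': 90, 'StorageClass': 'GLACIER'}],
            'Expiration': {'Days': 365}
        }]
    }
)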
2. Metric filtering:
## Create metric filter only for errors (not all logs)
logs.put_metric_filter(
logGroupName='/aws/lambda/api',
filterName='ErrorCount',
filterPattern='[time, request_id, level = ERROR*, ...]',
metricTransformations=[{
'metricName': 'Errors',
'metricNamespace': 'MyApp',
'metricValue': '1',
'defaultValue': 0
}]
)
3. Embedded Metric Format (EMF) for zero-overhead metrics:
import json
import time
def emit_emf_metric(metric_name, value, dimensions):
# EMF: metrics embedded in log structure, extracted automatically
emf_log = {
"_aws": {
"Timestamp": int(time.time() * 1000),
"CloudWatchMetrics": [{
"Namespace": "MyApp",
"Dimensions": [list(dimensions.keys())],
"Metrics": [{"Name": metric_name, "Unit": "Milliseconds"}]
}]
},
metric_name: value,
**dimensions
}
# Print to stdout - Lambda/ECS sends to CloudWatch Logs
# CloudWatch automatically extracts metrics (no put_metric_data calls!)
print(json.dumps(emf_log))
## Usage
emit_emf_metric(
"ProcessingTime",
245,
{"Service": "checkout", "Region": "us-east-1"}
)
EMF benefits:
- No additional API calls (reduces throttling)
- Lower cost (uses existing log ingestion)
- Single source of truth (metrics derived from logs; see the query sketch below)
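Because the raw values live in the log stream, they can also be queried directly with CloudWatch Logs Insights; a sketch, assuming a hypothetical log group name:

import time
import boto3

logs = boto3.client('logs')

# Query the raw EMF records behind the ProcessingTime metric
query = logs.start_query(
    logGroupName='/aws/lambda/checkout',   # assumed log group
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString=(
        'filter ispresent(ProcessingTime) '
        '| stats avg(ProcessingTime), max(ProcessingTime) by Service'
    )
)

# Logs Insights queries are asynchronous; poll for completion
while True:
    results = logs.get_query_results(queryId=query['queryId'])
    if results['status'] in ('Complete', 'Failed', 'Cancelled'):
        break
    time.sleep(1)
print(results['results'])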
Real-World Examples
Example 1: High-Throughput API Observability
Scenario: E-commerce API handling 50,000 requests/second during peak traffic. Need to monitor latency, errors, and throughput without breaking the bank.
Challenge:
- 50k req/sec × 86,400 sec/day = 4.32 billion requests/day
- Full tracing: 4.32B × $0.000001 per trace = $4,320/day ≈ $130k/month (far too expensive)
- Full logs: hundreds of TB/month (not viable)
Solution architecture:
import random
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.ext.flask.middleware import XRayMiddleware
from flask import Flask, request
import time
import json
app = Flask(__name__)
XRayMiddleware(app, xray_recorder)
class AdaptiveSampler:
def __init__(self):
self.base_sample_rate = 0.001 # 0.1% baseline
self.error_count = 0
self.total_count = 0
def should_sample(self, is_error=False, latency_ms=0):
self.total_count += 1
# Always sample errors
if is_error:
self.error_count += 1
return True
# Always sample slow requests (p99)
if latency_ms > 1000:
return True
# Increase sampling if error rate rises
if self.total_count > 1000:
error_rate = self.error_count / self.total_count
if error_rate > 0.01: # >1% errors
return random.random() < 0.1 # Sample 10%
# Normal sampling
return random.random() < self.base_sample_rate
sampler = AdaptiveSampler()
@app.route('/api/checkout')
def checkout():
start = time.time()
is_error = False
try:
# Business logic here
process_checkout()
status_code = 200
except Exception as e:
is_error = True
status_code = 500
duration_ms = (time.time() - start) * 1000
# Decide whether to create detailed trace
if sampler.should_sample(is_error, duration_ms):
segment = xray_recorder.current_segment()
segment.put_annotation('checkout_type', request.args.get('type'))
segment.put_annotation('customer_tier', get_customer_tier())
# Always emit aggregated metrics via EMF
emit_emf_metric('CheckoutLatency', duration_ms, {
'Service': 'checkout-api',
'Region': 'us-east-1',
'Status': f'{status_code // 100}xx'
})
return {'status': 'success'}, status_code
def emit_emf_metric(name, value, dimensions):
print(json.dumps({
"_aws": {
"Timestamp": int(time.time() * 1000),
"CloudWatchMetrics": [{
"Namespace": "EcommerceAPI",
"Dimensions": [list(dimensions.keys())],
"Metrics": [{"Name": name, "Unit": "Milliseconds"}]
}]
},
name: value,
**dimensions
}))
Results:
- Traces: 0.1% baseline plus all errors and slow requests ≈ 43M traces/day
- Cost: 43M × $0.000001 = $43/day = $1,290/month (a 99% reduction)
- Full metrics coverage via EMF (no additional cost)
- Still catch all errors and performance outliers
Example 2: Multi-Region Observability
Scenario: Global application deployed in 6 AWS regions. Need unified observability across all regions.
Challenge:
- CloudWatch metrics are region-specific
- X-Ray traces don't cross regions
- Need single pane of glass for global view
Solution: Cross-region aggregation with Amazon Managed Grafana
## CloudFormation/CDK for cross-region metric streams
import aws_cdk as cdk
from aws_cdk import aws_cloudwatch as cloudwatch
from aws_cdk import aws_iam as iam
from aws_cdk import aws_kinesis as kinesis
from aws_cdk import aws_kinesisfirehose as firehose
from aws_cdk import aws_s3 as s3
class CrossRegionObservability(cdk.Stack):
def __init__(self, scope, id, **kwargs):
super().__init__(scope, id, **kwargs)
# Central Kinesis Data Stream for metric aggregation
central_stream = kinesis.Stream(
self, 'CentralMetricStream',
shard_count=10,
retention_period=cdk.Duration.hours(24)
)
# Firehose to S3 for long-term storage
metric_bucket = s3.Bucket(self, 'MetricArchive')
# Role that lets Firehose read the stream and write to the bucket
firehose_role = iam.Role(
self, 'FirehoseRole',
assumed_by=iam.ServicePrincipal('firehose.amazonaws.com')
)
central_stream.grant_read(firehose_role)
metric_bucket.grant_write(firehose_role)
delivery_stream = firehose.CfnDeliveryStream(
self, 'MetricDeliveryStream',
delivery_stream_type='KinesisStreamAsSource',
kinesis_stream_source_configuration={
'kinesisStreamARN': central_stream.stream_arn,
'roleARN': firehose_role.role_arn
},
s3_destination_configuration={
'bucketARN': metric_bucket.bucket_arn,
'prefix': 'metrics/',
'compressionFormat': 'GZIP'
}
)
Regional metric forwarder Lambda:
import boto3
import json
import os
from datetime import datetime, timedelta
kinesis = boto3.client('kinesis')
cloudwatch = boto3.client('cloudwatch')
CENTRAL_STREAM = os.environ['CENTRAL_STREAM_NAME']
REGION = os.environ['AWS_REGION']
def lambda_handler(event, context):
# Query regional metrics
response = cloudwatch.get_metric_statistics(
Namespace='AWS/Lambda',
MetricName='Duration',
Dimensions=[{'Name': 'FunctionName', 'Value': 'checkout-api'}],
StartTime=datetime.utcnow() - timedelta(minutes=5),
EndTime=datetime.utcnow(),
Period=60,
Statistics=['Average', 'Maximum', 'Minimum']
)
# Forward to central stream with region tag
for datapoint in response['Datapoints']:
record = {
'region': REGION,
'metric': 'Lambda.Duration',
'timestamp': datapoint['Timestamp'].isoformat(),
'average': datapoint['Average'],
'maximum': datapoint['Maximum'],
'minimum': datapoint['Minimum']
}
kinesis.put_record(
StreamName=CENTRAL_STREAM,
Data=json.dumps(record),
PartitionKey=REGION
)
Grafana dashboard configuration:
{
"dashboard": {
"title": "Global API Performance",
"panels": [
{
"title": "Requests by Region",
"type": "graph",
"targets": [
{
"datasource": "CloudWatch",
"namespace": "AWS/ApiGateway",
"metricName": "Count",
"dimensions": {"ApiName": "checkout-api"},
"statistic": "Sum",
"period": 60,
"region": "us-east-1",
"alias": "US East"
},
{
"region": "eu-west-1",
"alias": "EU West"
},
{
"region": "ap-southeast-1",
"alias": "Asia Pacific"
}
]
}
]
}
}
Example 3: Cost-Optimized Log Pipeline
Scenario: Microservices platform with 200 services generating 10 TB of logs daily. Current CloudWatch Logs bill: $150k/month.
Challenge: Need to reduce costs by 80% without losing critical debugging capabilities.
Solution: Hybrid log architecture
              Tiered Log Pipeline Architecture
              --------------------------------

                    Application Logs
                           |
                           v
                   +---------------+
                   |  Log Router   |   (Lambda@Edge or Fluent Bit)
                   +-------+-------+
                           |
        +----------+-------+--------+-----------+
        |          |                |           |
        v          v                v           v
    +-------+  +-------+       +--------+  +-----------+
    | ERROR |  | WARN  |       |  INFO  |  |  DEBUG    |
    +---+---+  +---+---+       +---+----+  +----+------+
        |          |               |            |
        v          v               v            v
    +-------+  +-------+       +--------+  +-----------+
    |  CW   |  |  CW   |       |   S3   |  | /dev/null |
    | Logs  |  | Logs  |       | Direct |  | (dropped) |
    | 30day |  | 7day  |       +--------+  +-----------+
    +---+---+  +---+---+
        |          |
        +----+-----+
             |
             v
    +---------------------------+
    |  OpenSearch (errors +     |
    |  warnings only)           |
    +---------------------------+
             |
             v
      Alerting & Analysis
Implementation with Fluent Bit:
## fluent-bit.conf
[SERVICE]
    Flush        5
    Log_Level    info

[INPUT]
    Name              tail
    Path              /var/log/app/*.log
    Parser            json
    Tag               app.raw
    Refresh_Interval  5

## Re-tag records by log level so each tier can be routed independently
## (chained grep filters on a single tag would not give per-level routing)
[FILTER]
    Name   rewrite_tag
    Match  app.raw
    Rule   $level ^(ERROR|CRITICAL)$ app.error false
    Rule   $level ^WARN$ app.warn false
    Rule   $level ^INFO$ app.info false

## Route ERROR logs to CloudWatch + OpenSearch
[OUTPUT]
    Name              cloudwatch_logs
    Match             app.error
    region            us-east-1
    log_group_name    /app/errors
    log_stream_prefix error-
    auto_create_group true

[OUTPUT]
    Name   es
    Match  app.error
    Host   search-mydomain.us-east-1.es.amazonaws.com
    Port   443
    TLS    On
    Index  app-errors
    Type   _doc

## Route WARN logs to CloudWatch only (shorter retention)
[OUTPUT]
    Name              cloudwatch_logs
    Match             app.warn
    region            us-east-1
    log_group_name    /app/warnings
    log_stream_prefix warn-
    auto_create_group true

## Route INFO logs directly to S3 (bypass CloudWatch)
[OUTPUT]
    Name          s3
    Match         app.info
    bucket        app-logs-archive
    region        us-east-1
    store_dir     /tmp/fluent-bit/s3
    s3_key_format /logs/$TAG[1]/%Y/%m/%d/%H_%M_%S_$UUID.gz
    compression   gzip

## DEBUG and unmatched records keep the app.raw tag; with no matching output they are dropped
Cost breakdown before vs after:
| Tier | Volume | Before (CW) | After | Savings |
|---|---|---|---|---|
| ERROR | 0.1 TB/day | $1,500/mo | $1,500/mo (CW) | $0 |
| WARN | 0.5 TB/day | $7,500/mo | $3,750/mo (7d retention) | $3,750 |
| INFO | 5 TB/day | $75,000/mo | $3,600/mo (S3 direct) | $71,400 |
| DEBUG | 4 TB/day | $60,000/mo | $0 (dropped) | $60,000 |
| Total | 9.6 TB/day | $144,000/mo | $8,850/mo | $135,150 (94%) |
Example 4: Real-Time Anomaly Detection
Scenario: Detect unusual patterns in API metrics (latency spikes, error rate increases) and alert within 60 seconds.
Solution: CloudWatch Anomaly Detection with Lambda-based alerting
import boto3
import json
from datetime import datetime, timedelta
cloudwatch = boto3.client('cloudwatch')
sns = boto3.client('sns')
## Enable anomaly detection on key metrics
def enable_anomaly_detection():
cloudwatch.put_anomaly_detector(
Namespace='AWS/ApiGateway',
MetricName='Latency',
Dimensions=[{'Name': 'ApiName', 'Value': 'checkout-api'}],
Stat='Average'
)
# Create alarm for anomalies
cloudwatch.put_metric_alarm(
AlarmName='API-Latency-Anomaly',
ComparisonOperator='LessThanLowerOrGreaterThanUpperThreshold',
EvaluationPeriods=2,
Metrics=[
{
'Id': 'm1',
'ReturnData': True,
'MetricStat': {
'Metric': {
'Namespace': 'AWS/ApiGateway',
'MetricName': 'Latency',
'Dimensions': [{'Name': 'ApiName', 'Value': 'checkout-api'}]
},
'Period': 60,
'Stat': 'Average'
}
},
{
'Id': 'ad1',
'Expression': 'ANOMALY_DETECTION_BAND(m1, 2)', # 2 std deviations
'ReturnData': True
}
],
ThresholdMetricId='ad1',
AlarmActions=['arn:aws:sns:us-east-1:123456789012:alerts']
)
## Lambda function to enrich alerts with context
def lambda_handler(event, context):
# Triggered by CloudWatch Alarm via SNS
alarm = json.loads(event['Records'][0]['Sns']['Message'])
metric_name = alarm['Trigger']['MetricName']
namespace = alarm['Trigger']['Namespace']
# Fetch recent data for context
response = cloudwatch.get_metric_statistics(
Namespace=namespace,
MetricName=metric_name,
Dimensions=alarm['Trigger']['Dimensions'],
StartTime=datetime.utcnow() - timedelta(minutes=15),
EndTime=datetime.utcnow(),
Period=60,
Statistics=['Average', 'Maximum', 'Minimum', 'SampleCount']
)
# Get anomaly band for comparison
anomaly_response = cloudwatch.get_metric_data(
MetricDataQueries=[
{
'Id': 'm1',
'MetricStat': {
'Metric': {
'Namespace': namespace,
'MetricName': metric_name,
'Dimensions': alarm['Trigger']['Dimensions']
},
'Period': 60,
'Stat': 'Average'
}
},
{
'Id': 'ad1',
'Expression': 'ANOMALY_DETECTION_BAND(m1)'
}
],
StartTime=datetime.utcnow() - timedelta(minutes=15),
EndTime=datetime.utcnow()
)
# Build enriched alert (sort first: get_metric_statistics returns datapoints unordered)
datapoints = sorted(response['Datapoints'], key=lambda d: d['Timestamp'])
recent = datapoints[-5:]
baseline = datapoints[:-5] or recent
recent_avg = sum(d['Average'] for d in recent) / len(recent)
baseline_avg = sum(d['Average'] for d in baseline) / len(baseline)
enriched_message = f"""
ANOMALY DETECTED: {metric_name}
Current Average: {recent_avg:.2f}ms
Baseline Average: {baseline_avg:.2f}ms
Change: {((recent_avg - baseline_avg) / baseline_avg * 100):.1f}%
Recent Datapoints:
{json.dumps(datapoints[-5:], indent=2, default=str)}
Runbook: https://wiki.company.com/runbooks/api-latency
Dashboard: https://console.aws.amazon.com/cloudwatch/...
"""
# Send to incident management system
sns.publish(
TopicArn='arn:aws:sns:us-east-1:123456789012:incidents',
Subject=f'Anomaly: {metric_name}',
Message=enriched_message
)
Common Mistakes and How to Avoid Them ⚠️
Mistake 1: Logging Everything at DEBUG Level in Production
Problem: DEBUG logs contain extensive detail useful during development but generate massive volumes in production, causing:
- High CloudWatch Logs costs ($50k+/month)
- Storage bloat
- Difficult to find critical information in noise
Solution:
import logging
import os
## Use environment-based log levels
LOG_LEVEL = os.environ.get('LOG_LEVEL', 'INFO')
logger = logging.getLogger()
logger.setLevel(LOG_LEVEL)
## Selectively enable DEBUG for troubleshooting
if request.headers.get('X-Debug-Mode') == 'true':
# Only for authenticated debug requests
logger.setLevel('DEBUG')
logger.debug("Detailed debugging info here")
Mistake 2: High-Cardinality Metric Dimensions
Problem: Using unbounded dimensions (user IDs, request IDs, timestamps) creates millions of unique metric time series:
## WRONG: Creates millions of time series
cloudwatch.put_metric_data(
Namespace='MyApp',
MetricData=[{
'MetricName': 'RequestDuration',
'Dimensions': [
{'Name': 'UserID', 'Value': user_id}, # 1M+ values
{'Name': 'URL', 'Value': request.url} # Thousands of values
],
'Value': duration
}]
)
Result:
- CloudWatch throttling
- Expensive metric storage
- Slow dashboard queries
Solution:
## RIGHT: Low-cardinality dimensions
cloudwatch.put_metric_data(
Namespace='MyApp',
MetricData=[{
'MetricName': 'RequestDuration',
'Dimensions': [
{'Name': 'Service', 'Value': 'api'},
{'Name': 'Endpoint', 'Value': '/checkout'}, # Grouped endpoint
{'Name': 'StatusClass', 'Value': '2xx'} # Not individual codes
],
'Value': duration
}]
)
## Store high-cardinality data in logs, not metrics
logger.info("Request completed", extra={
'user_id': user_id, # High cardinality OK in logs
'request_id': request_id,
'duration_ms': duration
})
Mistake 3: Not Using Sampling for Traces
Problem: Tracing every request at scale:
- 100k req/sec × 86,400 sec/day = 8.64B traces/day
- 8.64B × $0.000001 = $8,640/day ≈ $259k/month
Solution: Intelligent sampling
## Sample strategically
def should_trace(request, response, duration_ms):
# Always trace errors
if response.status_code >= 400:
return True
# Always trace slow requests
if duration_ms > 1000:
return True
# Sample 0.1% of normal traffic
return random.random() < 0.001
Result: 99.9% cost reduction while catching all critical issues.
Mistake 4: Not Setting Log Retention Policies
Problem: CloudWatch Logs default retention is indefinite. Logs accumulate forever, costing $0.03/GB-month.
- 1 TB logs = $30/month ongoing
- After 1 year: $360 for old logs you'll never access
- After 5 years: $1,800 wasted
Solution:
import boto3
logs = boto3.client('logs')
## Set retention on all log groups (paginate; describe_log_groups returns at most 50 groups per call)
paginator = logs.get_paginator('describe_log_groups')
for page in paginator.paginate():
    for log_group in page['logGroups']:
        logs.put_retention_policy(
            logGroupName=log_group['logGroupName'],
            retentionInDays=30  # Adjust based on compliance needs
        )
Mistake 5: Synchronous Logging in Critical Path
Problem: Blocking on log writes adds latency:
## WRONG: Synchronous logging slows request
def process_request():
start = time.time()
# Each synchronous write to a network-backed log handler can take 5-10ms
logger.info("Starting processing")
result = do_work()
logger.info("Work completed")
logger.info("Finalizing")
# Total request latency increased by 15-30ms!
return result
Solution: Asynchronous logging
import queue
import threading
import time
import uuid
log_queue = queue.Queue(maxsize=10000)
def async_log_worker():
while True:
log_entry = log_queue.get()
logger.info(log_entry['message'], extra=log_entry['extra'])
log_queue.task_done()
## Start background thread
threading.Thread(target=async_log_worker, daemon=True).start()
def process_request():
    start = time.time()
    request_id = str(uuid.uuid4())
    # Non-blocking log: enqueue instead of writing synchronously
    log_queue.put({
        'message': 'Starting processing',
        'extra': {'request_id': request_id}
    })
    result = do_work()
    log_queue.put({
        'message': 'Work completed',
        'extra': {'duration_ms': (time.time() - start) * 1000}
    })
    return result
Key Takeaways
💡 Observability at scale is fundamentally about trade-offs: you cannot collect, store, and analyze everything. Successful strategies balance completeness, cost, and performance.
Critical principles:
Sample intelligently: Use tail-based sampling to capture 100% of errors and outliers while sampling only 0.1-1% of normal traffic
Manage cardinality ruthlessly: Keep metric dimensions low-cardinality. Use logs for high-cardinality data (user IDs, request IDs)
Tier your data: Hot data (recent errors) → CloudWatch. Warm data (recent info logs) → S3. Cold data (old logs) → Glacier
Aggregate locally: Use StatisticSets and EMF to reduce API calls by 100-1000x
Set retention policies: Don't pay to store logs you'll never access. 30-90 days is often sufficient
Use structured logging: JSON logs enable efficient filtering and analysis. Unstructured logs are expensive to query
Optimize for your access patterns: If you query errors frequently, store them separately from info logs
Monitor your monitoring costs: Set CloudWatch billing alarms (a sketch follows below). Observability shouldn't cost more than the infrastructure it monitors!
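One way to do that is an alarm on the AWS/Billing EstimatedCharges metric, which is published only in us-east-1 and only once billing alerts are enabled for the account; a minimal sketch with a placeholder budget and the alert topic used earlier:

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

# Alarm when the month-to-date CloudWatch charge exceeds the budget
cloudwatch.put_metric_alarm(
    AlarmName='cloudwatch-spend-guardrail',
    Namespace='AWS/Billing',
    MetricName='EstimatedCharges',
    Dimensions=[
        {'Name': 'ServiceName', 'Value': 'AmazonCloudWatch'},
        {'Name': 'Currency', 'Value': 'USD'}
    ],
    Statistic='Maximum',
    Period=21600,                 # billing data updates every few hours
    EvaluationPeriods=1,
    Threshold=10000.0,            # placeholder monthly budget in USD
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:alerts']
)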
Further Study
- AWS Observability Best Practices Guide
- AWS X-Ray Developer Guide - Sampling
- CloudWatch Embedded Metric Format Specification
Quick Reference Card
| Concept | Key Points |
|---|---|
| Sampling Rates | Baseline: 0.1-1%; errors: 100%; slow requests: 100% |
| Cardinality Limit | Keep total time series under 10,000 per metric |
| Log Retention | Errors: 30-90 days; info: 7-30 days; debug: 1-7 days |
| Cost Optimization | Use EMF, aggregate locally, sample logs, set retention |
| X-Ray Concepts | Trace ID (entire request); segment (service); subsegment (operation) |
| Good Dimensions | Service, Region, Tier, StatusClass (2xx/4xx/5xx) |
| Bad Dimensions | UserID, RequestID, Timestamp, full URLs |
| CloudWatch Costs | Ingestion: $0.50/GB; storage: $0.03/GB-month; analysis: $0.005/GB scanned |