Span Boundary Design
Choose instrumentation points that survive refactors and provide meaningful debugging signal
This lesson covers span granularity principles, boundary selection strategies, context propagation patterns, and anti-patterns: essential concepts for building production-grade distributed tracing systems.
Welcome to Span Boundary Design
In distributed tracing, spans are the fundamental units of work that represent operations in your system. But where should one span end and another begin? This seemingly simple question has profound implications for your observability strategy. Poor span boundaries create noise, miss critical insights, and make root cause analysis nearly impossible. Great span boundaries illuminate system behavior, enable precise performance optimization, and make debugging feel like turning on the lights in a dark room.
Think of spans like chapters in a book. Too many short chapters (over-instrumentation) make the story choppy and hard to follow. Too few long chapters (under-instrumentation) lose important plot details. The art of span boundary design is finding the right narrative structure for your system's story.
Core Concepts: Understanding Span Boundaries
What Is a Span Boundary?
A span boundary marks the beginning and end of a discrete unit of work in your distributed system. Each span represents:
- A temporal scope: when the operation started and finished
- A logical scope: what the operation accomplished
- A contextual scope: where in the system the operation occurred
- A causality link: how this operation relates to parent and child operations
The boundary decision determines what gets traced as a single atomic operation versus what gets broken into multiple observable steps.
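To make the four scopes concrete, here is a minimal sketch of a span record using only the Python standard library. The `Span` class and its field names are hypothetical illustrations, not the data model of any particular tracing SDK.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Toy span record; real tracing SDKs carry many more fields."""
    name: str                         # logical scope: what the operation does
    service: str                      # contextual scope: where it ran
    trace_id: str                     # shared by every span in one trace
    parent_id: Optional[str] = None   # causality link to the parent span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start_ns: int = field(default_factory=time.monotonic_ns)  # temporal scope
    end_ns: Optional[int] = None

    def end(self) -> None:
        """Close the boundary by recording the end timestamp once."""
        if self.end_ns is None:
            self.end_ns = time.monotonic_ns()

    @property
    def duration_ms(self) -> float:
        if self.end_ns is None:
            raise ValueError("span has not ended")
        return (self.end_ns - self.start_ns) / 1e6


# A parent span and one child that shares its trace_id
root = Span(name="POST /api/orders", service="api-gateway", trace_id=uuid.uuid4().hex)
child = Span(name="getUserFromDatabase", service="api-gateway",
             trace_id=root.trace_id, parent_id=root.span_id)
child.end()
root.end()
```

Where `start_ns` is taken and where `end()` is called are the boundary decision: everything that happens between them is reported as one operation.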
The Span Granularity Spectrum
| | Too Coarse | Optimal | Too Fine |
|---|---|---|---|
| Example spans | Entire request (500ms) | HTTP handler, DB query, cache check | Variable assignment, function call, loop iteration |
| Consequences | ❌ Loses detail, can't optimize, vague problems | ✅ Actionable, clear narrative, root cause ready | ❌ Noise overload, performance cost, storage burden |
The Four Principles of Span Boundary Design
1. Semantic Significance Principle
A span should represent an operation that has business or technical meaning on its own. Ask: "Would an engineer investigating an issue care about this operation independently?"
✅ Good boundaries:
- POST /api/orders - meaningful HTTP endpoint
- getUserFromDatabase - clear data operation
- validatePaymentMethod - distinct business logic
- publishOrderEvent - observable integration point
❌ Poor boundaries:
- parseJSON - too low-level, internal detail
- for loop iteration 47 - implementation noise
- variableAssignment - not independently meaningful
- logStatement - meta-operation, not real work
2. Actionability Principle
Span boundaries should enable concrete actions when problems occur. If a span shows high latency or errors, you should be able to:
- Identify the specific system component involved
- Understand what operation failed
- Know where to look in the codebase
- Determine potential fixes
💡 Tip: If your span name is "doWork" or "processData", it's probably not actionable enough!
3. Performance-Cost Balance Principle
Every span has overhead:
- CPU cost: creating span objects, recording timestamps
- Memory cost: storing span data before export
- Network cost: transmitting span data to collectors
- Storage cost: persisting spans for analysis
| Span Frequency | Overhead Impact | When Appropriate |
|---|---|---|
| 1-10 per request | Negligible (<1ms) | Most applications |
| 11-100 per request | Noticeable (1-5ms) | Complex workflows |
| 101-1000 per request | Significant (5-20ms) | High-value paths only |
| >1000 per request | Prohibitive (>20ms) | ⚠️ Redesign needed |
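The tiers in the table can be turned into a quick check during code review or in a test. This is a small sketch; the function name `overhead_tier` is hypothetical, and the thresholds are the rough figures from the table rather than measured values for any specific tracer.

```python
def overhead_tier(spans_per_request: int) -> str:
    """Map a per-request span count to the rough tiers in the table above."""
    if spans_per_request < 0:
        raise ValueError("span count cannot be negative")
    if spans_per_request <= 10:
        return "negligible"
    if spans_per_request <= 100:
        return "noticeable"
    if spans_per_request <= 1000:
        return "significant"
    return "prohibitive"


# Example: a request that emits 250 spans lands in the "high-value paths only" tier
tier = overhead_tier(250)
```

Actual overhead depends on the SDK, exporter, and sampling configuration, so treat the tier as a prompt to measure, not as a measurement.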
4. Context Propagation Principle
Span boundaries should align with context propagation boundaries in your system:
- Network calls (HTTP, gRPC, message queues)
- Thread boundaries (async operations, thread pools)
- Process boundaries (child processes, containers)
- System boundaries (database, cache, external APIs)
When context crosses these boundaries, you need a span to capture:
- The transition itself
- Timing of the boundary crossing
- Success/failure of the propagation
- Metadata about the destination
Span Boundary Patterns
Pattern 1: Synchronous Call Boundaries
For synchronous operations, create spans around complete request-response cycles:
HandleRequest (parent span, timeline 0ms to 50ms)
├─ ValidateInput (child span, 5ms)
└─ QueryDatabase (child span, 45ms)
Implementation guidance:
# ✅ Good: Span per meaningful operation
# (OpenTelemetry-style API: start_as_current_span makes nested spans children automatically)
with tracer.start_as_current_span("handle_order_request"):
    with tracer.start_as_current_span("validate_order"):
        validate_order_data(order)
    with tracer.start_as_current_span("check_inventory"):
        inventory = db.query_inventory(order.items)
    with tracer.start_as_current_span("calculate_total"):
        total = calculate_order_total(order, inventory)

# ❌ Bad: Too granular, implementation details
with tracer.start_as_current_span("handle_order_request"):
    with tracer.start_as_current_span("parse_json"):  # Too low-level
        order = json.loads(request.body)
    with tracer.start_as_current_span("for_loop_iteration"):  # Noise
        for item in order.items:
            ...
Pattern 2: Asynchronous Operation Boundaries
For async operations, spans must capture the full lifecycle including queuing time:
Span: ProcessOrder (async operation)
Enqueued (t=0) → Picked Up (t=50ms) → Completed (t=200ms)
- Queue time: t=0 to t=50ms
- Processing time: t=50ms to t=200ms
- Total span duration: t=0 to t=200ms
Key consideration: Distinguish between:
- Queuing latency: time waiting for worker availability
- Execution latency: time actively processing
- Total latency: end-to-end duration
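The three latencies follow directly from three timestamps. The sketch below uses a hypothetical `AsyncTimings` helper and the timestamps from the ProcessOrder illustration above (enqueued at 0ms, picked up at 50ms, completed at 200ms).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AsyncTimings:
    """Timestamps (in ms) for one asynchronous operation."""
    enqueued_ms: float
    picked_up_ms: float
    completed_ms: float

    @property
    def queue_ms(self) -> float:
        # Time spent waiting for a worker
        return self.picked_up_ms - self.enqueued_ms

    @property
    def execution_ms(self) -> float:
        # Time spent actively processing
        return self.completed_ms - self.picked_up_ms

    @property
    def total_ms(self) -> float:
        # End-to-end duration seen by the caller
        return self.completed_ms - self.enqueued_ms


timings = AsyncTimings(enqueued_ms=0, picked_up_ms=50, completed_ms=200)
```

Recording the pick-up timestamp is what makes the split possible; with only start and end you can report total latency but cannot tell a saturated queue from slow processing.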
💡 Tip: Add span events to mark state transitions:
// ✅ Good: Track async lifecycle
const span = tracer.startSpan('process_payment');
span.addEvent('enqueued', { queue: 'payments', position: 12 });
// ... later when picked up by worker ...
span.addEvent('processing_started', { worker_id: 'worker-7' });
// ... after processing ...
span.addEvent('processing_completed', { status: 'approved' });
span.end();
Pattern 3: Network Boundary Spans
Every network call should have spans on both sides of the boundary:
Service A (client span, caller) ── HTTP request ──> Service B (server span, handler)
Service A (client span, caller) <── HTTP response ── Service B (server span, handler)
Both spans share the same trace_id (abc123), and the client span's ID is propagated as the server span's parent_span_id.
| Client span captures | Server span captures |
|---|---|
| Request serialization time | Request deserialization time |
| Network transmission | Handler execution |
| Response deserialization time | Response serialization |
| Full round-trip (RTT) latency | Processing latency |
Critical practice: Always propagate trace context across network boundaries using standard headers:
- traceparent (W3C Trace Context standard)
- tracestate (vendor-specific data)
- Custom headers (legacy systems)
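For reference, a version-00 `traceparent` header has four dash-separated fields: version, a 32-hex-character trace ID, a 16-hex-character parent span ID, and 2-hex-character flags whose lowest bit marks the trace as sampled. The sketch below formats and parses that shape with the standard library. The `SpanContext` tuple and function names are hypothetical, and only version `00` is handled; in production code you would normally rely on your tracing SDK's propagator instead.

```python
import re
from typing import NamedTuple, Optional

# version 00, 32-hex trace-id, 16-hex parent-id, 2-hex flags (all lowercase)
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def format_traceparent(ctx: SpanContext) -> str:
    """Serialize a span context into a version-00 traceparent value."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the caller's span context, or None if the header is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid per the W3C Trace Context specification
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
incoming = parse_traceparent(header)
```

The server span is then created with `incoming.trace_id` as its trace ID and `incoming.span_id` as its parent, which is what joins the two sides of the network boundary into one trace.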
Pattern 4: Database Operation Boundaries
Database operations warrant spans when they represent logical queries, not internal implementation:
| Span Level | Example | When to Use |
|---|---|---|
| ✅ Query Span | getUserOrders | Logical database operation |
| ✅ Transaction Span | createOrderTransaction | Multi-query atomic operation |
| ⚠️ Connection Span | getConnection | Only if connection pooling is a bottleneck |
| ❌ Internal Operation | prepareStatement | Too low-level, internal detail |
Best practice: Capture query details as span attributes, not separate spans:
// ✅ Good: One span with rich attributes
// tracer.Start returns both the derived context and the span
ctx, span := tracer.Start(ctx, "query_user_orders")
defer span.End()
span.SetAttributes(
    attribute.String("db.system", "postgresql"),
    attribute.String("db.statement", "SELECT * FROM orders WHERE user_id = ?"),
    attribute.Int("db.rows_affected", rowCount),
)
Strategic Span Boundary Selection
The Business Logic vs Technical Implementation Divide
One of the most critical decisions: should your spans represent business operations or technical operations?
BUSINESS-ORIENTED SPAN STRUCTURE
CreateOrder
├── ValidateCustomerCredit
├── ReserveInventory
├── CalculateShipping
└── ConfirmOrder
✅ Matches business workflow
✅ Product managers understand traces
✅ Aligns with business metrics
❌ May hide technical bottlenecks

TECHNICAL-ORIENTED SPAN STRUCTURE
POST /api/orders
├── PostgresQuery: SELECT users
├── RedisGet: inventory:item123
├── HttpPost: shipping-service/calculate
└── KafkaPublish: order-confirmed-topic
✅ Engineers quickly identify components
✅ Clear technical bottlenecks
✅ Easy to correlate with infrastructure metrics
❌ Business stakeholders need translation
Recommendation: Use a hybrid approach with two span layers:
- Outer business spans: High-level operations ("PlaceOrder", "ProcessPayment")
- Inner technical spans: Implementation details (database queries, API calls)
This gives you both business visibility and technical precision.
Dynamic Span Boundary Decisions
Sometimes span boundaries should adapt based on context:
Sampling-based boundaries:
# Create detailed spans only for sampled traces
if trace_context.is_sampled():
    with tracer.start_span("detailed_validation"):
        # Expensive instrumentation
        validate_with_full_details()
else:
    # Just do the work without extra spans
    validate_with_full_details()
Error-triggered boundaries:
// Add detailed spans when errors occur
try {
    processPayment(order);
} catch (PaymentException e) {
    // Create additional diagnostic spans
    Span debugSpan = tracer.spanBuilder("payment_failure_debug")
        .startSpan();
    try {
        capturePaymentState();
        validatePaymentGatewayConnection();
    } finally {
        debugSpan.end();
    }
    throw e;
}
Performance-triggered boundaries:
// Add granular spans only for slow operations
const startTime = Date.now();
const result = await expensiveOperation();
const duration = Date.now() - startTime;
if (duration > SLOW_THRESHOLD_MS) {
  // Retrospectively create detailed spans
  await analyzeSlowOperation(result, duration);
}
Span Boundary Alignment with System Architecture
Your span boundaries should mirror your system's conceptual architecture:
MICROSERVICES ARCHITECTURE
API Gateway Span
├── Auth Service Span
├── Order Service Span
│   ├── Inventory Service Span
│   └── Payment Service Span
└── Notification Service Span
✅ One span per service boundary
✅ Clear service responsibility
✅ Service-level SLO tracking

LAYERED ARCHITECTURE
Controller Layer Span
└── Service Layer Span
    └── Repository Layer Span
        └── Database Span
✅ One span per layer transition
✅ Layer-specific performance analysis
✅ Architectural compliance verification
💡 Tip: If your traces don't match your architecture diagrams, one of them is wrong!
Common Mistakes and Anti-Patterns
1. The "Instrumentation Everywhere" Anti-Pattern
Symptom: Every function call becomes a span
❌ Bad example:
def process_order(order)
  span1 = tracer.start_span("process_order")
  span2 = tracer.start_span("validate_order")  # OK
  result = validate(order)
  span2.finish
  span3 = tracer.start_span("log_validation")  # ❌ Logging is not work
  logger.info("Validated order #{order.id}")
  span3.finish
  span4 = tracer.start_span("variable_assignment")  # ❌ Absurd
  validated = result.success?
  span4.finish
  span1.finish
end
Why it's bad:
- 90% of spans provide no value
- Obscures the 10% that matter
- Significant performance overhead
- Massive storage costs
✅ Better approach: Instrument only meaningful operations
2. The "Transaction Script" Anti-Pattern
Symptom: One giant span for entire request
❌ Bad example:
@trace_route
def handle_checkout():
    # Everything happens in one 500ms span
    user = get_user()
    cart = get_cart()
    payment = process_payment()  # 400ms - but hidden!
    order = create_order()
    send_confirmation()
    return order
Why it's bad:
- Can't identify which operation is slow
- No visibility into failure points
- Can't optimize specific components
✅ Better approach: Break down into logical operations
3. The "Span Soup" Anti-Pattern
Symptom: Flat span structure with no parent-child relationships
❌ BAD: Flat structure (span soup)
span1: handle_request (50ms)
span2: query_database (30ms)
span3: call_api (20ms)
span4: format_response (5ms)
⚠️ Can't tell which operations are sequential vs parallel
⚠️ Can't identify causal relationships
⚠️ Timeline reconstruction is impossible

✅ GOOD: Hierarchical structure
span1: handle_request (50ms)
├─ span2: query_database (30ms)
├─ span3: call_api (20ms) [started after span2]
└─ span4: format_response (5ms)
✅ Clear causality and sequencing
✅ Accurate timeline visualization
✅ Easy to identify parallel vs sequential work
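The hierarchy is only recoverable when every span records its parent. As a sketch, the function below rebuilds an indented tree from a flat list of span records; the dictionary keys (`id`, `parent_id`, `start_ms`, `duration_ms`) and the timings are hypothetical values chosen for illustration.

```python
from collections import defaultdict


def render_tree(spans):
    """Rebuild an indented span tree from flat records that carry parent_id."""
    children = defaultdict(list)
    for s in spans:
        children[s["parent_id"]].append(s)

    lines = []

    def walk(parent_id, depth):
        # Siblings are ordered by start time so the output reads as a timeline
        for s in sorted(children[parent_id], key=lambda s: s["start_ms"]):
            lines.append(f"{'  ' * depth}{s['name']} ({s['duration_ms']}ms)")
            walk(s["id"], depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


flat_spans = [
    {"id": "s4", "parent_id": "s1", "name": "format_response", "start_ms": 53, "duration_ms": 5},
    {"id": "s2", "parent_id": "s1", "name": "query_database", "start_ms": 2, "duration_ms": 30},
    {"id": "s1", "parent_id": None, "name": "handle_request", "start_ms": 0, "duration_ms": 60},
    {"id": "s3", "parent_id": "s1", "name": "call_api", "start_ms": 32, "duration_ms": 20},
]
tree = render_tree(flat_spans)
```

If the `parent_id` values were all `None`, as in span soup, every span would render at depth 0 and the causal structure would be lost. Note that a span whose parent is missing from the list is silently skipped by this sketch.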
4. The "Boundary Mismatch" Anti-Pattern
Symptom: Span boundaries don't align with actual work boundaries
❌ Bad example:
const span = tracer.startSpan('database_operation');
const connection = await pool.getConnection();
span.end(); // ❌ Ended too early!
const result = await connection.query('SELECT * FROM users');
// ❌ Actual database work happens outside span
Why it's bad:
- Span timings are meaningless
- Doesn't capture actual operation duration
- Misleading performance data
✅ Correct approach:
const span = tracer.startSpan('database_operation');
try {
  const connection = await pool.getConnection();
  const result = await connection.query('SELECT * FROM users');
  return result;
} finally {
  span.end(); // ✅ Captures complete operation
}
5. The "Context Loss" Anti-Pattern
Symptom: Spans created but trace context not propagated
❌ Bad example:
def handle_request():
    span = tracer.start_span("handle_request")
    # ❌ Context not passed to background task
    task_queue.enqueue(process_async, data)
    span.end()

def process_async(data):
    # ❌ This span becomes orphaned - no parent!
    span = tracer.start_span("process_async")
    # ...
Result: Trace fragments, broken causality chains
✅ Correct approach (OpenTelemetry-style propagation API):
from opentelemetry import propagate

def handle_request():
    with tracer.start_as_current_span("handle_request"):
        # ✅ Inject the current trace context into a carrier and send it with the task
        carrier = {}
        propagate.inject(carrier)
        task_queue.enqueue(process_async, data, carrier)

def process_async(data, carrier):
    # ✅ Restore context and create child span
    parent_context = propagate.extract(carrier)
    with tracer.start_as_current_span("process_async", context=parent_context):
        ...
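To show the mechanics without depending on a tracing library, here is a self-contained sketch that carries a W3C-style `traceparent` string through a standard-library queue. The `finished` list, the carrier dictionary, and the function bodies are hypothetical stand-ins for what an SDK's propagator and exporter would do.

```python
import queue
import secrets

task_queue = queue.Queue()
finished = []  # (span name, trace_id, parent_span_id) records for inspection


def handle_request(data):
    """Producer side: start a root span and ship its context with the task."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    finished.append(("handle_request", trace_id, None))
    # Inject: serialize the context into a plain dict that travels with the task
    carrier = {"traceparent": f"00-{trace_id}-{span_id}-01"}
    task_queue.put((data, carrier))
    return trace_id, span_id


def process_async():
    """Worker side: restore the context so the new span has a parent."""
    data, carrier = task_queue.get()
    # Extract: recover the trace ID and the caller's span ID from the carrier
    _version, trace_id, parent_span_id, _flags = carrier["traceparent"].split("-")
    finished.append(("process_async", trace_id, parent_span_id))
    return data


trace_id, span_id = handle_request({"order_id": "ord_789"})
process_async()
```

The important design point is that the context travels as data inside the task itself. Thread-local or task-local "current span" state does not survive a queue, so anything not written into the carrier is lost at the boundary.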
Real-World Examples
Example 1: E-Commerce Checkout Flow
Let's design span boundaries for a realistic checkout operation:
Trace: checkout-flow (trace_id: abc123)
Duration: 847ms

POST /api/checkout (847ms)
│    span_id: span-001, service: api-gateway
├─ authenticate_user (23ms)
│    span_id: span-002, service: auth-service
│    attributes: user_id: user_456, auth_method: jwt
├─ validate_cart (67ms)
│  │  span_id: span-003, service: cart-service
│  ├─ db.query.get_cart_items (45ms)
│  │    span_id: span-004, db.system: postgresql, db.rows: 3
│  └─ check_inventory (18ms)
│       span_id: span-005, service: inventory-service
├─ process_payment (687ms) ⚠️ SLOW!
│  │  span_id: span-006, service: payment-service, status: error
│  ├─ validate_payment_method (12ms)
│  │    span_id: span-007
│  ├─ call_stripe_api (623ms) 🔴 ROOT CAUSE
│  │    span_id: span-008, http.url: api.stripe.com, http.status: 429, error: rate_limit_exceeded
│  └─ retry_payment (52ms)
│       span_id: span-009, http.status: 200
├─ create_order (45ms)
│    span_id: span-010, service: order-service
│    attributes: order_id: ord_789, order_total: 127.50
└─ send_confirmation (25ms)
     span_id: span-011, service: notification-service
     messaging.destination: order-events, messaging.system: kafka
Key decisions:
- ✅ Each service boundary gets a span
- ✅ Long-running payment operation broken into sub-steps
- ✅ Database and external API calls instrumented
- ✅ Error context preserved (Stripe rate limit)
- ✅ Business attributes (order_id, user_id) attached
- ✅ Clear root cause identification possible
Example 2: Async Message Processing Pipeline
Span boundaries for event-driven architecture:
Trace: order-fulfillment (trace_id: xyz789)
Total Duration: 3,245ms (including queue time)

[t=0ms] Producer Service
publish_order_created_event (15ms)
     span_id: span-100
     span.kind: PRODUCER
     messaging.system: kafka
     messaging.destination: order-events
     messaging.message_id: msg-456
  │
  │  [context propagated via message headers]
  │  Queue Time = 850ms ⚠️
  ▼
[t=850ms] Consumer Service
process_order_event (2,380ms)
│    span_id: span-101
│    span.kind: CONSUMER
│    parent_span_id: span-100 (✅ linked)
│    events:
│    - [t=850ms] message_received
│    - [t=855ms] processing_started
├─ allocate_warehouse_inventory (145ms)
│    span_id: span-102, warehouse_id: wh-east-1
├─ generate_shipping_label (2,100ms) 🔴
│  │  span_id: span-103
│  └─ call_fedex_api (2,050ms)
│       span_id: span-104, http.url: api.fedex.com, http.status: 200
│       http.duration: 2050ms ⚠️ SLOW
├─ update_order_status (89ms)
│    span_id: span-105, db.statement: UPDATE orders SET...
└─ publish_shipping_notification (46ms)
     span_id: span-106, span.kind: PRODUCER
     messaging.destination: shipping-events
Key decisions:
- ✅ Separate spans for PRODUCER and CONSUMER sides
- ✅ Context propagated through message headers
- ✅ Parent-child relationship maintained across async boundary
- ✅ Queue time visible (850ms - potential bottleneck!)
- ✅ Span events mark lifecycle transitions
- ✅ External API slowness clearly identified
Example 3: Database Transaction with Retry Logic
How to handle complex database operations:
# Span structure for retryable database transaction
import time
from opentelemetry.trace import StatusCode

@trace_operation("create_user_with_profile")
def create_user_with_profile(user_data, profile_data):
    """Business operation span - outermost"""
    max_retries = 3
    for attempt in range(1, max_retries + 1):
        # ✅ Each retry attempt gets its own span
        with tracer.start_as_current_span(f"transaction_attempt_{attempt}") as attempt_span:
            attempt_span.set_attribute("retry.attempt", attempt)
            try:
                with database.transaction() as tx:
                    # ✅ Individual operations within transaction
                    with tracer.start_as_current_span("insert_user") as user_span:
                        user_id = tx.execute(
                            "INSERT INTO users (name, email) VALUES (?, ?)",
                            user_data
                        )
                        user_span.set_attribute("user_id", user_id)
                        user_span.set_attribute("db.rows_affected", 1)
                    with tracer.start_as_current_span("insert_profile") as profile_span:
                        tx.execute(
                            "INSERT INTO profiles (user_id, bio) VALUES (?, ?)",
                            (user_id, profile_data)
                        )
                        profile_span.set_attribute("db.rows_affected", 1)
                    # ✅ Transaction commit is a meaningful operation
                    with tracer.start_as_current_span("commit_transaction"):
                        tx.commit()
                attempt_span.set_status(StatusCode.OK)
                return user_id
            except DeadlockError as e:
                # ✅ Record retry reason
                attempt_span.record_exception(e)
                attempt_span.set_attribute("retry.reason", "deadlock")
                if attempt == max_retries:
                    attempt_span.set_status(StatusCode.ERROR, "max_retries_exceeded")
                    raise
                attempt_span.set_status(StatusCode.ERROR, "retry_scheduled")
        # Back off outside the attempt span so its duration reflects only database work
        time.sleep(0.02 * 2 ** attempt)  # Exponential backoff: 40ms, 80ms, ...
Resulting trace structure:
| Span Name | Duration | Status | Notes |
|---|---|---|---|
| create_user_with_profile | 156ms | OK | Business operation |
| ├─ transaction_attempt_1 | 23ms | ERROR | Deadlock occurred |
| │  ├─ insert_user | 12ms | OK | User inserted |
| │  └─ insert_profile | 8ms | ERROR | Deadlock here |
| └─ transaction_attempt_2 | 89ms | OK | Success! |
|    ├─ insert_user | 15ms | OK | User inserted |
|    ├─ insert_profile | 11ms | OK | Profile inserted |
|    └─ commit_transaction | 63ms | OK | Commit was slow! |
Benefits of this structure:
- See exactly which retry succeeded
- Identify which operation caused deadlock
- Measure retry overhead separately
- Track commit performance (often overlooked!)
Example 4: Parallel Operations with Context Propagation
Handling concurrent operations correctly:
// Processing multiple items in parallel
const { context, trace, SpanStatusCode } = require('@opentelemetry/api');

async function processOrders(orderIds) {
  const parentSpan = tracer.startSpan('process_multiple_orders');
  parentSpan.setAttribute('order_count', orderIds.length);
  // Context that carries parentSpan, so each child links to it explicitly
  const parentContext = trace.setSpan(context.active(), parentSpan);
  try {
    // ✅ Each parallel operation gets its own span
    const results = await Promise.all(
      orderIds.map(async (orderId) => {
        // ✅ Create child span with proper parent
        const childSpan = tracer.startSpan(
          'process_single_order',
          {},
          parentContext
        );
        childSpan.setAttribute('order_id', orderId);
        try {
          const result = await processOrder(orderId);
          childSpan.setStatus({ code: SpanStatusCode.OK });
          return result;
        } catch (error) {
          childSpan.recordException(error);
          childSpan.setStatus({
            code: SpanStatusCode.ERROR,
            message: error.message
          });
          throw error;
        } finally {
          childSpan.end();
        }
      })
    );
    parentSpan.setAttribute('success_count', results.length);
    return results;
  } finally {
    parentSpan.end();
  }
}
Trace visualization shows parallelism:
process_multiple_orders (234ms)
├─ process_single_order [ord_1] (198ms)
├─ process_single_order [ord_2] (145ms)
├─ process_single_order [ord_3] (234ms)
└─ process_single_order [ord_4] (167ms)
Time: 0ms to 234ms
✅ Spans show parallel execution (overlapping bars)
✅ Total time = max(child durations), not sum
✅ Each order's performance independently visible
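The "max, not sum" observation can be checked with a few lines of arithmetic. In this sketch each child is a hypothetical `(start_ms, end_ms)` pair, using the four durations from the visualization and assuming all children start at 0ms.

```python
def wall_clock_ms(children):
    """Elapsed time covered by a set of (start_ms, end_ms) child spans."""
    earliest_start = min(start for start, _ in children)
    latest_end = max(end for _, end in children)
    return latest_end - earliest_start


# The four parallel process_single_order spans, all started at t=0
children = [(0, 198), (0, 145), (0, 234), (0, 167)]

parallel_ms = wall_clock_ms(children)                        # what the parent span reports
sequential_ms = sum(end - start for start, end in children)  # cost if run one after another
```

Comparing the two numbers is a quick way to confirm from trace data that work which should be concurrent is in fact overlapping: if the parent's duration approaches the sum instead of the maximum, the children are running sequentially.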
Key Takeaways
Span Boundary Design Principles
| Principle | What it means |
|---|---|
| Semantic Significance | Spans represent operations with independent meaning |
| Actionability | Span data enables concrete debugging actions |
| Performance Balance | Overhead justified by observability value |
| Context Propagation | Boundaries align with context transitions |
✅ Do This
- Create spans for network calls (both client and server side)
- Instrument database queries as logical operations
- Capture async operation full lifecycle (queue + execution)
- Propagate context across all boundaries
- Use span events for state transitions within long operations
- Align span structure with system architecture
- Include business attributes (user_id, order_id)
- Record errors and retry logic
❌ Avoid This
- Instrumenting every function call (over-instrumentation)
- Creating one giant span per request (under-instrumentation)
- Flat span structures without parent-child relationships
- Ending spans before work completes
- Failing to propagate context across async boundaries
- Spans for logging, variable assignment, control flow
- Mixing span granularity levels inconsistently
Remember
"Spans should tell a story, not list every sentence."
Your traces are a narrative about what your system does. Good span boundaries create chapters and paragraphs that make the story comprehensible. Bad boundaries either lose the plot or drown readers in unnecessary detail.
Further Study
OpenTelemetry Tracing Specification: https://opentelemetry.io/docs/specs/otel/trace/api/ - Official specification for span semantics and API
Distributed Tracing in Practice (O'Reilly): https://www.oreilly.com/library/view/distributed-tracing-in/9781492056621/ - Comprehensive guide to real-world tracing patterns
Google SRE Book - Monitoring Distributed Systems: https://sre.google/sre-book/monitoring-distributed-systems/ - Fundamental principles that inform good span design
Next Steps: Practice applying these principles to your own systems. Start with high-traffic critical paths, instrument thoughtfully, and iterate based on what actually helps you debug production issues. Remember: observability is about building understanding, not just collecting data.