You are viewing a preview of this lesson. Sign in to start learning
Back to Kafka for .NET Developers 2026

Production Patterns & Observability

Production readiness means retry topics, dead-letter queues, idempotent consumers, outbox pattern, consumer lag alerting, and a real observability dashboard.

Last generated

Why Production Kafka Is Different From Development Kafka

Your local Kafka setup is a polite fiction. The broker starts reliably, messages arrive in order, your consumer processes them without interruption, and the whole stack hums along on a single machine with predictable latency. It's a useful fiction — it lets you focus on business logic without fighting infrastructure — but it trains you to expect a world that production will never deliver. In production, brokers restart under you mid-consumer, partitions rebalance while you're mid-batch, network timeouts turn a five-millisecond round-trip into a thirty-second stall, and messages arrive that your deserializer has never seen and cannot parse. The gap between local and production is not a configuration gap. It is a design gap — and closing it requires understanding precisely where and why things break.

This section maps that gap. The patterns covered in the rest of this lesson — idempotent consumers, dead-letter topics, the outbox pattern, lag alerting — all exist because of the failure modes introduced here.


The Happy Path Is a Special Case

In development, the implicit contract looks something like this:

Produce message → Broker stores it → Consumer reads it → Processing succeeds → Done

Every component cooperates. In production, any arrow in that chain can fail, delay, or produce partial results:

  Producer          Broker             Consumer             Downstream
     │                │                   │                     │
     │──produce()────►│                   │                     │
     │                │ ← broker restart  │                     │
     │                │   or leader       │                     │
     │                │   election        │                     │
     │                │──deliver─────────►│                     │
     │                │                   │ ← rebalance fires   │
     │                │                   │   mid-batch         │
     │                │                   │──downstream ───────►│
     │                │                   │   write             │ ← timeout,
     │                │                   │                     │   conflict,
     │                │                   │                     │   or crash

Each failure point has a different character. A broker restart is transient — wait and retry, and the broker comes back. A malformed message payload is permanent — no amount of retrying will make unparseable JSON deserialize correctly. Understanding which failure you're facing is the first design decision your consumer must make.

💡 Mental Model: Think of your Kafka pipeline the way a civil engineer thinks about a bridge: each individual component can be well-built, but the system's integrity is determined by the weakest load path under stress. A producer that reliably publishes messages cannot compensate for a consumer that silently swallows errors. Production readiness belongs to the full pipeline, not to any single component.


Brokers Restart, Partitions Rebalance, Timeouts Happen

These three events are not edge cases in a well-run Kafka cluster. They are routine operational events. Rolling upgrades restart brokers one at a time. Consumer group membership changes trigger rebalances. Cloud networking introduces variable latency that occasionally crosses your client's timeout thresholds.

The critical implication: your consumer will be interrupted mid-processing. The message your consumer is currently working on may have its partition reassigned to a different consumer instance before you commit the offset. If you don't commit at the right moment, that message will be redelivered — possibly many times.

Here's a minimal consumer loop in C# that exposes where the interruption risk lives:

// Simplified to illustrate timing — production patterns shown in later sections
using Confluent.Kafka;

var config = new ConsumerConfig
{
    BootstrapServers = "kafka:9092",
    GroupId = "order-processor",
    AutoOffsetReset = AutoOffsetReset.Earliest,
    EnableAutoCommit = false  // Manual commit gives us control over timing
};

using var consumer = new ConsumerBuilder<string, string>(config).Build();
consumer.Subscribe("orders");

while (!cancellationToken.IsCancellationRequested)
{
    // ① Message received
    var result = consumer.Consume(cancellationToken);

    // ② Processing happens here — rebalance or crash can interrupt at any point
    await ProcessOrderAsync(result.Message.Value);

    // ③ Offset committed AFTER successful processing
    //    If the process crashes between ② and ③, the message is redelivered
    //    Both behaviors are intentional — they preserve at-least-once delivery
    consumer.Commit(result);
}

A crash between receiving the message and committing its offset means Kafka will redeliver the message when the consumer restarts or when the partition is reassigned. That's not a bug — it's the at-least-once delivery guarantee in action. But it means your processing logic must be safe to run more than once on the same message.

🤔 Did you know? A partition rebalance invokes the PartitionsRevoked callback on your consumer — the last opportunity to commit any in-flight offsets before they're handed to another instance. If your consumer doesn't implement this callback, those in-flight offsets may be lost, leading to reprocessing on the new owner. Confluent.Kafka exposes it via ConsumerBuilder.SetPartitionsRevokedHandler.


At-Least-Once Is the Default — and the Starting Design Constraint

At-least-once delivery means Kafka guarantees that every produced message will be consumed at least once, but potentially more than once. This is the right default for most use cases — the alternative (at-most-once) means accepting data loss on failure.

But at-least-once puts the burden of duplicate safety on your consumer.

📋 Delivery Semantics Comparison

┌─────────────────────┬──────────────────────┬───────────────────────┐
│ Semantic            │ Data Loss Risk       │ Duplication Risk      │
├─────────────────────┼──────────────────────┼───────────────────────┤
│ At-most-once        │ 🔴 Yes (on failure)   │ ✅ None               │
│ At-least-once       │ ✅ None               │ 🟡 Yes (on retry)     │
│ Exactly-once        │ ✅ None               │ ✅ None (within Kafka)│
└─────────────────────┴──────────────────────┴───────────────────────┘

Note: Exactly-once semantics apply within the Kafka broker and transactional
producer/consumer pair. They do NOT automatically extend to your external
downstream systems (databases, APIs, caches).

Wrong thinking: "I'll configure enable.idempotence=true on the producer and exactly-once will take care of itself."

Correct thinking: Producer-side idempotence prevents duplicate broker writes from retried produce calls, but your consumer processing the same message twice — due to a rebalance, a crash, or a retry — is a separate concern that requires deliberate consumer-side design.


Transient Errors vs. Poison-Pill Messages

Not all failures in a consumer loop are the same kind of problem. The distinction that matters most is between a transient error and a poison-pill message.

A transient error is a failure likely to resolve on its own — a temporarily overloaded database, a network timeout, a momentary loss of connectivity. The right response is to retry with backoff.

A poison-pill message will never succeed regardless of how many times you retry it — malformed JSON, an incompatible schema version, a reference to an entity ID that doesn't exist. Retrying a poison pill doesn't fix anything; it blocks the partition and stops all forward progress on every subsequent message behind it.

// Illustrating the two different failure branches — full DLT routing
// is covered in 'Idempotent Consumers and the Dead-Letter Queue Contract'.
try
{
    var order = JsonSerializer.Deserialize<OrderEvent>(result.Message.Value);
    await SaveToDatabase(order);
    consumer.Commit(result);
}
catch (JsonException ex)
{
    // Poison pill: no retry will fix a malformed payload.
    _logger.LogError(ex, "Undeserializable message at {Offset}", result.TopicPartitionOffset);
    await RouteToDeadLetterTopicAsync(result);
    consumer.Commit(result);
}
catch (DbException ex) when (IsTransient(ex))
{
    // Transient error: retry with backoff. Do NOT commit.
    _logger.LogWarning(ex, "Transient DB error, will retry");
    await Task.Delay(retryDelay, cancellationToken);
}

🎯 Key Principle: Route transient errors toward retry (don't commit, let the message be redelivered). Route permanent errors toward dead-letter topics (commit, preserve the message for inspection). Confusing the two produces either an infinite retry loop or silent data loss — both are production outages.


Production Readiness Is a Pipeline Property

The failure modes that cause real incidents almost always span component boundaries:

┌──────────────────────────────────────────────────────────────────────────┐
│                   Cross-Boundary Failure Scenarios                       │
├─────────────────────┬────────────────────────────────────────────────────┤
│ 🔒 Producer + DB    │ Produce succeeds, DB transaction rolls back →      │
│                     │ event published for business state that never      │
│                     │ committed (dual-write race condition)              │
├─────────────────────┼────────────────────────────────────────────────────┤
│ 🔁 Broker + Consumer│ Broker delivers message, consumer processes it,    │
│                     │ broker triggers rebalance before commit →          │
│                     │ message redelivered and processed twice            │
├─────────────────────┼────────────────────────────────────────────────────┤
│ 📉 Consumer + Down- │ Consumer commits offset, downstream write fails    │
│    stream           │ silently → message consumed but effect lost        │
├─────────────────────┼────────────────────────────────────────────────────┤
│ 🕳️ All components   │ Lag grows silently, no alert fires, backlog of     │
│                     │ unprocessed messages accumulates undetected        │
└─────────────────────┴────────────────────────────────────────────────────┘

Each row is a scenario where individual components function correctly but the system as a whole is failing. The outbox pattern addresses the producer-to-database boundary. Consumer lag alerting addresses the detection gap where a pipeline can fail silently without any single component throwing an error.

The rest of this lesson builds directly on these failure modes:

  • 🔁 Redelivery and at-least-once semantics → idempotent consumer design and offset management
  • 💀 Poison-pill messages → dead-letter topics that don't block partition progress
  • 📉 Silent pipeline failures → consumer lag measurement and alerting
  • 🔒 Dual-write race conditions → the outbox pattern

Idempotent Consumers and the Dead-Letter Queue Contract

With those failure modes in mind, two mechanisms close the gap between Kafka's delivery guarantee and the safety your downstream systems require: idempotent consumers and the dead-letter topic (DLT). Neither is complicated in isolation, but the interaction between them — particularly around offset commits — is where most production incidents originate.

The Idempotency Key: Your Defense Against Duplicate Processing

An idempotent consumer produces the same observable outcome whether it processes a given message once or a hundred times. The mechanism: before acting on a message, check whether you have already processed it; if you have, skip it and commit the offset as if you had processed it normally. The check depends on an idempotency key — a value that uniquely identifies the message across deliveries.

You have two options for sourcing this key:

  • 🔑 Natural key: a domain-level identifier embedded in the message payload — an OrderId, a PaymentTransactionId, a DeduplicationToken. Preferred when the upstream system can guarantee a stable unique identifier per business event.
  • 🔧 Synthetic key: the Kafka coordinates themselves — topic + partition + offset. Because Kafka guarantees a message always has the same topic-partition-offset triple, this tuple is a reliable key even when the payload carries no domain identifier.

The key must be stored in a durable store — a database, Redis, or any system that survives a consumer restart. An in-memory set works locally but is useless in production because a pod restart empties it, and the very restart that triggers redelivery also clears your deduplication record.

┌─────────────────────────────────────────────────────┐
│                  Consumer Process                   │
│                                                     │
│  Receive message (topic, partition, offset, payload)│
│          │                                          │
│          ▼                                          │
│  ┌───────────────┐   key found   ┌───────────────┐  │
│  │ Check durable │──────────────►│ Skip + Commit │  │
│  │ idempotency   │               │    offset     │  │
│  │    store      │               └───────────────┘  │
│  └───────┬───────┘                                  │
│          │ key NOT found                            │
│          ▼                                          │
│  ┌───────────────┐               ┌───────────────┐  │
│  │ Execute       │──────────────►│ Write key to  │  │
│  │ downstream    │    success    │ store, Commit │  │
│  │ business      │               │    offset     │  │
│  │ logic         │               └───────────────┘  │
│  └───────┬───────┘                                  │
│          │ failure                                  │
│          ▼                                          │
│  ┌───────────────┐                                  │
│  │ Retry or      │                                  │
│  │ route to DLT  │                                  │
│  └───────────────┘                                  │
└─────────────────────────────────────────────────────┘

⚠️ Critical ordering: Write the idempotency key in the same atomic operation as the downstream write, or after it — never before. If you write the key first and then crash before the downstream write completes, a retry will find the key, believe the message was processed, and silently discard the event.

The Commit Tradeoff: At-Least-Once vs. At-Most-Once

The position of the commit relative to the downstream write determines your delivery semantics:

Commit timing Semantics Risk
Commit before downstream write At-most-once Crash between commit and write → message lost forever
Commit after downstream write At-least-once Crash after write, before commit → message redelivered

At-least-once with idempotent consumers is the pattern almost all production Kafka systems target. The Confluent .NET client exposes two methods for precise control:

  • StoreOffset(consumeResult) — marks the offset ready to commit on the next cycle without immediately contacting the broker. Useful for buffering commits for throughput.
  • Commit(consumeResult) — synchronously sends the offset to the broker. Use when you need a hard durability guarantee before proceeding.

Set EnableAutoCommit = false to use these methods meaningfully. With auto-commit enabled, the client advances the offset on a background timer regardless of whether processing succeeded.

var config = new ConsumerConfig
{
    BootstrapServers = "localhost:9092",
    GroupId = "order-processor",
    AutoOffsetReset = AutoOffsetReset.Earliest,
    EnableAutoCommit = false
};

using var consumer = new ConsumerBuilder<string, string>(config).Build();
consumer.Subscribe("orders");

while (!cancellationToken.IsCancellationRequested)
{
    ConsumeResult<string, string> result;
    try
    {
        result = consumer.Consume(cancellationToken);
    }
    catch (ConsumeException ex)
    {
        // Transport-level error — log and continue; do NOT commit
        logger.LogError(ex, "Kafka transport error: {Reason}", ex.Error.Reason);
        continue;
    }

    var idempotencyKey = $"{result.Topic}-{result.Partition}-{result.Offset}";

    if (await idempotencyStore.ExistsAsync(idempotencyKey))
    {
        consumer.StoreOffset(result);
        continue;
    }

    await ProcessOrderAsync(result.Message.Value);
    await idempotencyStore.SetAsync(idempotencyKey, TimeSpan.FromDays(7));

    consumer.Commit(result);
}

The Dead-Letter Topic: Quarantine Without Blocking

A poison-pill message — malformed payload, invalid schema, unrecoverable business logic exception — will fail on every retry. Without a safety valve, your consumer stalls at that offset indefinitely, blocking every message behind it on that partition. This is partition starvation, one of the most disruptive failure modes in production Kafka.

The dead-letter topic (DLT) is the safety valve. The contract: after a configured number of retry attempts, route the unprocessable message to a separate topic (conventionally <original-topic>.DLT) and commit the original offset, allowing the main consumer to advance. The message is preserved for operator inspection and eventual replay.

Main Topic: orders
─────────────────────────────────────────────────────
  [msg A] [msg B] [msg C ← poison pill] [msg D] [msg E]
                       │
                       │  fails 3 times
                       ▼
             ┌─────────────────┐
             │  Retry counter  │
             │  exhausted (3)  │
             └────────┬────────┘
                      │
          ┌───────────┴───────────┐
          ▼                       ▼
  DLT Topic:               Original offset
  orders.DLT               committed → msg D
  [msg C + headers:         and E can now
   error reason,            be processed
   retry count,
   original offset]

The ordering constraint is strict: produce to the DLT and flush before committing the original offset. If you commit first and the DLT produce fails, the message vanishes from both the main consumer and the DLT. If the DLT write succeeds but the commit crashes, you may write to the DLT twice on replay — that is recoverable.

private async Task RouteToDeadLetterAsync(
    IProducer<string, string> dltProducer,
    IConsumer<string, string> consumer,
    ConsumeResult<string, string> failed,
    Exception reason,
    int retryCount)
{
    var dltMessage = new Message<string, string>
    {
        Key = failed.Message.Key,
        Value = failed.Message.Value,
        Headers = new Headers
        {
            { "dlt.original.topic",     Encoding.UTF8.GetBytes(failed.Topic) },
            { "dlt.original.partition", Encoding.UTF8.GetBytes(failed.Partition.Value.ToString()) },
            { "dlt.original.offset",    Encoding.UTF8.GetBytes(failed.Offset.Value.ToString()) },
            { "dlt.error.reason",       Encoding.UTF8.GetBytes(reason.Message) },
            { "dlt.retry.count",        Encoding.UTF8.GetBytes(retryCount.ToString()) },
            { "dlt.timestamp.utc",      Encoding.UTF8.GetBytes(DateTime.UtcNow.ToString("O")) }
        }
    };

    // Produce and flush BEFORE committing the original offset
    dltProducer.Produce($"{failed.Topic}.DLT", dltMessage);
    dltProducer.Flush(TimeSpan.FromSeconds(10));

    consumer.Commit(failed);
}

The headers are critical — without them the DLT is a graveyard rather than a quarantine: you have the message but no cause of death.

The Outbox Pattern: Eliminating the Dual-Write Race

Neither idempotency nor the DLT addresses the race condition between writing to your database and producing to Kafka. The naive pattern:

1. BEGIN database transaction
2. Write business record to DB
3. COMMIT database transaction
4. Produce event to Kafka

If the process crashes after step 3 and before step 4, the database reflects a state change that Kafka never announced. This is a dual-write race condition — insidious because it produces no error. Both operations appear successful up to the moment of the crash.

The outbox pattern eliminates the race by writing the event to a table in the same database transaction as the business record. Both writes share an ACID transaction — they either both commit or both roll back.

┌─────────────────────────────────────────────────────┐
│               Application Database                  │
│                                                     │
│  ┌──────────────────┐    same transaction           │
│  │  orders table    │◄──────────────────────┐       │
│  │  (business data) │                       │       │
│  └──────────────────┘            ┌──────────┴─────┐ │
│                                  │  Application   │ │
│  ┌──────────────────┐            │  writes both   │ │
│  │  outbox table    │◄──────────►│  atomically    │ │
│  │  (pending events)│            └────────────────┘ │
│  └────────┬─────────┘                               │
└───────────┼─────────────────────────────────────────┘
            │ polled by relay worker
            ▼
┌─────────────────────┐        ┌──────────────────┐
│  Outbox Relay       │──────► │  Kafka Broker    │
│  BackgroundService  │  ack   │  (orders topic)  │
│  (polls + produces) │◄───────│                  │
└─────────────────────┘        └──────────────────┘
     deletes row only
     after ack received

The outbox relay worker polls the outbox table, produces each unpublished row to Kafka, and deletes the row only after receiving a broker acknowledgment. If it crashes after producing but before deleting, it will produce the event again on restart — that duplication is handled by the idempotent consumer on the receiving end, completing the reliability loop.

The relay producer should be configured with EnableIdempotence = true, which activates producer-side deduplication (sequence numbers + in-flight tracking) to prevent the broker from recording the same event twice on retried produces. This is distinct from consumer-side idempotency and complements rather than replaces it.

💡 Real-World Example: An e-commerce checkout service writes the Order record to a relational database and must emit an OrderPlaced event to Kafka. Without the outbox pattern, a container eviction between the DB commit and the Kafka produce means the order exists in the database but no downstream service knows about it — the warehouse never packs the box, the customer never gets a confirmation email. With the outbox pattern, the OrderPlaced row survives any crash and the relay eventually delivers it.

Connecting the Mechanisms

Idempotency, offset management, the DLT contract, and the outbox pattern form a pipeline of guarantees:

🔧 Mechanism 🎯 Problem Solved ⚠️ Without It
🔑 Idempotency key + durable store Duplicate processing on redelivery Duplicate side effects (double charges, double inserts)
🕐 Process-before-commit ordering Crash between process and commit Silent data loss
☠️ Dead-letter topic Poison pills stalling the partition Partition starvation, cascading lag
📬 Outbox pattern Dual-write race between DB and Kafka Events lost on producer-side crash

The practical C# code that assembles these mechanisms into a working consumer loop is walked through in the next section.


Consumer Lag: What It Measures and How to Act on It

With the reliability mechanisms in place, you need a way to know when the pipeline is under stress before it causes a customer-facing incident. Consumer lag is that signal. It is the earliest warning that your pipeline is struggling, and it is the metric most likely to give you time to act before a backlog becomes an outage. But lag is also routinely misread — teams set alerts on raw offset counts and either panic unnecessarily or miss the slow growth that signals genuine capacity exhaustion.

What Lag Actually Measures

Consumer lag is the difference between the log-end offset (the offset of the next message that will be written to a partition) and the committed offset (the last offset a consumer group has acknowledged processing). That subtraction happens at the partition level, not the topic level — a topic with 12 partitions has 12 independent lag values, each potentially telling a different story.

Partition 0:  log-end offset = 10,450   committed offset = 10,448   lag = 2
Partition 1:  log-end offset = 10,501   committed offset = 9,870    lag = 631
Partition 2:  log-end offset = 10,389   committed offset = 10,387   lag = 2

An aggregate "topic lag" of 635 sounds alarming, but partition 1 is doing all the work — which immediately tells you to look at the consumer instance assigned to that partition, not the consumer group as a whole. Per-partition lag is the unit of diagnosis; a rolled-up sum is useful only as a high-level health indicator.

Reading the Shape of Lag, Not Just the Number

The pattern of lag over time is more diagnostic than any single reading. Two shapes appear repeatedly in production:

Steady, monotonically growing lag means the consumer is processing messages more slowly than the producer is writing them. This is a capacity problem: the consumer group lacks throughput to keep up, and the solution involves horizontal scaling, throughput optimization, or reducing per-message processing cost.

Time →
Lag:  100 ... 400 ... 900 ... 1,800 ... 3,200
                ↑ steady growth = capacity exhaustion

Spike-and-recover lag looks different — lag jumps sharply, then returns to near-zero. This pattern almost never indicates capacity problems. It is the signature of a consumer group rebalance (all consumers pause during assignment) or a processing slowdown on a single expensive message.

|         ___
|        /   \
|       /     \
|______/       \_____________
time→   ^ rebalance  ^ recovery
        starts

A stuck partition whose lag never recovers after a spike is the clearest indicator of a message that needs routing to a dead-letter topic. As covered in the previous section, the correct response to a stalled partition is to route the unprocessable message to the DLT — not to increase max.poll.interval.ms, which only postpones the rebalance trigger.

Where Lag Numbers Come From

Every lag monitoring tool reads from the same two authoritative sources: the __consumer_offsets internal topic (committed offsets) and the broker's metadata API (log-end offsets per partition). Third-party tools are all building on these primitives.

In .NET, Confluent.Kafka exposes this through IAdminClient:

using Confluent.Kafka;
using Confluent.Kafka.Admin;

var config = new AdminClientConfig { BootstrapServers = "localhost:9092" };
using var adminClient = new AdminClientBuilder(config).Build();

// Fetch committed offsets for a consumer group
var committedOffsets = await adminClient.ListConsumerGroupOffsetsAsync(
    new List<ConsumerGroupTopicPartitions>
    {
        new ConsumerGroupTopicPartitions(
            "my-consumer-group",
            new List<TopicPartition>
            {
                new TopicPartition("orders", new Partition(0)),
                new TopicPartition("orders", new Partition(1)),
            })
    });

// Fetch log-end offsets for the same partitions
// Note: verify exact overload against the version of Confluent.Kafka you are targeting
var logEndOffsets = await adminClient.ListOffsetsAsync(
    committedOffsets
        .SelectMany(g => g.Partitions)
        .Select(tp => new TopicPartitionOffsetSpec(tp.TopicPartition, OffsetSpec.Latest))
        .ToList());

// Compute lag per partition
foreach (var committed in committedOffsets.SelectMany(g => g.Partitions))
{
    var logEnd = logEndOffsets
        .FirstOrDefault(x => x.TopicPartition == committed.TopicPartition);

    if (logEnd != null)
    {
        var lag = logEnd.Offset.Value - committed.Offset.Value;
        Console.WriteLine($"{committed.TopicPartition}: lag = {lag}");
    }
}

In a real monitoring service you would emit these values as metrics (Prometheus gauges, OpenTelemetry observability signals) on a polling interval rather than printing them.

🤔 Did you know? When a consumer group has no live members — paused, crashed, or stopped — its committed offsets remain frozen in __consumer_offsets while the log-end offset keeps advancing. Lag grows silently with no consumer running. A lag alert will fire correctly in this case, but only if it watches the committed offset, not just consumer heartbeat health. Both signals belong on your dashboard.

Alerting in Time-to-Catch-Up, Not Raw Offsets

A threshold of "alert if lag exceeds 10,000 messages" immediately runs into a problem: 10,000 offsets represents two seconds of work on a high-throughput topic and three hours of backlog on a low-throughput one. The same number is both too sensitive and not sensitive enough.

The fix is to express alerting thresholds in time-to-catch-up: how long would it take, at current consume throughput, to drain the existing lag to zero?

time_to_catch_up_seconds = current_lag / consume_rate_messages_per_second

A lag of 50,000 messages with a consume rate of 5,000 msg/s gives 10 seconds — probably fine. The same lag with a consume rate of 50 msg/s gives 1,000 seconds — nearly 17 minutes behind and growing.

// Illustrative lag metric enrichment
public record LagMetric(
    TopicPartition TopicPartition,
    long CurrentLag,
    double ConsumeRatePerSecond)
{
    // Returns null if consume rate is zero (consumer is stalled)
    public double? TimeToCatchUpSeconds =>
        ConsumeRatePerSecond > 0
            ? CurrentLag / ConsumeRatePerSecond
            : null;
}

// Alert when TimeToCatchUpSeconds exceeds your SLO tolerance
// e.g., > 300 seconds = warning; > 900 seconds = page on-call

From Alert to Action: The Runbook Contract

A lag alert with no defined response is noise. The value of consumer lag as a signal comes from pairing it with concrete, pre-agreed responses tied to the pattern of lag:

📊 Pattern 🔍 Likely Cause 🔧 First Action
🔺 Steady monotonic growth Insufficient consumer throughput Scale consumer instances horizontally
⚡ Sharp spike, full recovery Rebalance or transient slowdown Check rebalance logs; no action if self-resolving
📌 Spike, partial recovery, plateau Poison-pill message in one partition Inspect DLT; investigate stalled offset range
🔴 Lag growing, consume rate = 0 Consumer group down Restart consumer service; check health endpoints
🐢 Slow creep over hours Processing cost increased Profile per-message processing time; check downstream latency

For horizontal scaling, the partition count is the ceiling on parallelism — you cannot usefully add more consumer instances than there are partitions. If you are already at partition count and lag is still growing, the next lever is reducing per-message processing cost or adding partitions (which requires careful planning, as it changes key routing).

Increasing max.poll.interval.ms is legitimate only when lag spikes correlate with genuinely long-running processing on non-poisoned messages. It also means a crashed consumer takes longer to be detected as dead — set it only as high as your worst-case legitimate processing time.

💡 Real-World Example: A team running an order-enrichment consumer noticed lag on partition 3 growing every Tuesday afternoon, then recovering by end of day. Investigation revealed a downstream inventory API ran a heavy report job every Tuesday, adding ~400ms of latency per call. The fix was adding a circuit-breaker timeout and routing to a retry topic when the API was slow — keeping per-message processing time bounded. The time-to-catch-up metric made the pattern visible; the runbook directed the team to check downstream latency rather than immediately scaling.

Instrumenting Lag in Your .NET Service

Embedding lag emission directly into your consumer service gives you lower-latency metrics and lets you co-locate them with business context. The following pattern uses a BackgroundService that periodically computes lag and emits it as an OpenTelemetry observable gauge:

using System.Diagnostics.Metrics;
using Confluent.Kafka;
using Confluent.Kafka.Admin;

public class LagReporterService : BackgroundService
{
    private static readonly Meter Meter = new("kafka.consumer");
    private readonly IAdminClient _adminClient;
    private readonly string _groupId;
    private readonly string _topic;
    private readonly int _partitionCount;
    // Thread-safe cache updated by the polling loop
    private readonly ConcurrentDictionary<int, long> _lagCache = new();

    public LagReporterService(
        IAdminClient adminClient, string groupId, string topic, int partitionCount)
    {
        _adminClient = adminClient;
        _groupId = groupId;
        _topic = topic;
        _partitionCount = partitionCount;

        // Observable gauge reads from cache; polling loop keeps it fresh
        Meter.CreateObservableGauge(
            "kafka_consumer_lag",
            ObserveLag,
            unit: "messages",
            description: "Consumer lag per partition");
    }

    private IEnumerable<Measurement<long>> ObserveLag()
    {
        foreach (var (partition, lag) in _lagCache)
        {
            yield return new Measurement<long>(
                lag,
                new KeyValuePair<string, object?>("partition", partition),
                new KeyValuePair<string, object?>("topic", _topic),
                new KeyValuePair<string, object?>("group", _groupId));
        }
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            // Use ListConsumerGroupOffsetsAsync + ListOffsetsAsync (shown above)
            // to compute per-partition lag and populate _lagCache here
            await Task.Delay(TimeSpan.FromSeconds(15), stoppingToken);
        }
    }
}

⚠️ Common Mistake: Emitting only a topic-level aggregate lag metric. When partition 1 is stalled and partitions 0 and 2 are healthy, an aggregate metric averages the problem away. Always emit per-partition lag, and sum at the dashboard level if you need a health overview.


Practical Patterns in .NET: Putting It Together

The concepts covered so far — idempotency, dead-letter routing, offset management, and the outbox pattern — only pay off when implemented correctly. Theory diverges from practice most sharply in the consume loop: the single while block where every production failure mode eventually surfaces. This section walks through the concrete C# patterns that translate those concepts into working Confluent.Kafka code, with particular attention to the ordering of operations.

Manual Offset Control: The Foundation Everything Else Builds On

Auto-commit is Confluent.Kafka's default: the library periodically flushes committed offsets to the broker on a timer, independently of whether processing succeeded. A message can fail and still have its offset advanced before you notice. Manual offset commit (EnableAutoCommit = false combined with explicit consumer.Commit(consumeResult)) makes the commit a deliberate act you control — the prerequisite for every other pattern in this section.

var config = new ConsumerConfig
{
    BootstrapServers = "localhost:9092",
    GroupId = "order-processor",
    AutoOffsetReset = AutoOffsetReset.Earliest,
    EnableAutoCommit = false
};

using var consumer = new ConsumerBuilder<string, string>(config).Build();
consumer.Subscribe("orders");

while (!stoppingToken.IsCancellationRequested)
{
    var result = consumer.Consume(TimeSpan.FromSeconds(1));
    if (result is null) continue;

    try
    {
        await ProcessMessageAsync(result);
        consumer.Commit(result); // commit only after successful processing
    }
    catch (Exception ex)
    {
        // Do NOT commit — let error routing decide what happens next
        logger.LogError(ex, "Processing failed for {TopicPartitionOffset}",
            result.TopicPartitionOffset);
        throw;
    }
}

💡 Pro Tip: consumer.Commit(consumeResult) commits the specific offset you pass it. Passing the ConsumeResult directly is safer than relying on the overload that commits all currently-assigned partitions, because a rebalance mid-batch can change your partition assignments between the consume call and the commit.

The Idempotent Consumer Loop

Manual commit prevents lost offsets but does not prevent duplicate side effects. Every consumer restart replays from the last committed offset — if processing is not idempotent, those replays corrupt downstream state.

async Task ProcessWithIdempotencyAsync(
    ConsumeResult<string, string> result,
    IDatabase redis,
    IConsumer<string, string> consumer)
{
    var idempotencyKey = $"{result.Topic}:{result.Partition}:{result.Offset}";

    bool alreadyProcessed = await redis.KeyExistsAsync(idempotencyKey);
    if (alreadyProcessed)
    {
        logger.LogDebug("Skipping duplicate message at {Key}", idempotencyKey);
        consumer.Commit(result);
        return;
    }

    var order = JsonSerializer.Deserialize<OrderEvent>(result.Message.Value);
    await orderRepository.UpsertAsync(order!);

    // Record idempotency key AFTER the downstream write succeeds
    await redis.StringSetAsync(idempotencyKey, "1", TimeSpan.FromHours(72));

    consumer.Commit(result);
}

Correct operation order:

  Consume message
       │
       ▼
  Check idempotency store ──► Already exists? ──► Commit & skip
       │
       ▼ (not seen)
  Process (downstream write)
       │
       ▼
  Record idempotency key
       │
       ▼
  Commit offset

Error Routing: Distinguishing Transport Failures from Application Failures

A ConsumeException signals a Kafka transport-level problem — network timeout, broker unavailability, offset out of range. These are typically transient; the right response is to log and retry with backoff, not route to a DLT.

An application exceptionJsonException, DbException, domain validation failure — means the message or environment is the problem and may warrant DLT routing.

while (!stoppingToken.IsCancellationRequested)
{
    ConsumeResult<string, string>? result = null;
    try
    {
        result = consumer.Consume(TimeSpan.FromSeconds(1));
        if (result is null) continue;

        await ProcessWithIdempotencyAsync(result, redis, consumer);
    }
    catch (ConsumeException ex)
    {
        // Transport error: log and backoff; do NOT attempt to DLT
        logger.LogError(ex, "Kafka transport error: {Reason}", ex.Error.Reason);
        await Task.Delay(TimeSpan.FromSeconds(5), stoppingToken);
    }
    catch (Exception ex) when (result is not null)
    {
        // Application error: route the original message to the DLT
        logger.LogWarning(ex, "Application error — routing to DLT: {TopicPartitionOffset}",
            result.TopicPartitionOffset);
        await RouteToDeadLetterAsync(result, ex, dltProducer, consumer);
        // commit happens inside RouteToDeadLetterAsync after the DLT flush
    }
}

The when (result is not null) guard is not cosmetic — if Consume itself throws a ConsumeException, result remains null and there is nothing to route to the DLT. The two catch branches handle genuinely different situations and must not be collapsed.

Dead-Letter Routing: Order of Operations Under Failure

The ordering constraint is strict: produce to the DLT and flush before committing the original offset. Commit first and crash before the DLT write completes, and the message disappears entirely — neither processed nor inspectable.

async Task RouteToDeadLetterAsync(
    ConsumeResult<string, string> failed,
    Exception reason,
    IProducer<string, string> dltProducer,
    IConsumer<string, string> consumer)
{
    var dltMessage = new Message<string, string>
    {
        Key = failed.Message.Key,
        Value = failed.Message.Value,
        Headers = new Headers
        {
            { "dlt-original-topic",     Encoding.UTF8.GetBytes(failed.Topic) },
            { "dlt-original-partition", BitConverter.GetBytes(failed.Partition.Value) },
            { "dlt-original-offset",    BitConverter.GetBytes(failed.Offset.Value) },
            { "dlt-exception-type",     Encoding.UTF8.GetBytes(reason.GetType().Name) },
            { "dlt-exception-message",  Encoding.UTF8.GetBytes(reason.Message) }
        }
    };

    var delivery = await dltProducer.ProduceAsync($"{failed.Topic}.dlt", dltMessage);
    dltProducer.Flush(TimeSpan.FromSeconds(10));

    logger.LogInformation("DLT write confirmed at {TopicPartitionOffset}",
        delivery.TopicPartitionOffset);

    consumer.Commit(failed); // only after DLT write is confirmed
}

The Outbox Relay Worker in .NET

The outbox pattern solves the dual-write problem on the producer side. The .NET implementation is a BackgroundService that continuously polls an outbox table and relays pending rows to Kafka. The critical constraint: delete the outbox row only after the broker has acknowledged the produce.

public class OutboxRelayWorker : BackgroundService
{
    private readonly IServiceScopeFactory _scopeFactory;
    private readonly IProducer<string, string> _producer;
    private readonly ILogger<OutboxRelayWorker> _logger;

    public OutboxRelayWorker(IServiceScopeFactory scopeFactory, ILogger<OutboxRelayWorker> logger)
    {
        _scopeFactory = scopeFactory;
        _logger = logger;

        // EnableIdempotence ensures broker deduplicates retried produces
        var producerConfig = new ProducerConfig
        {
            BootstrapServers = "localhost:9092",
            EnableIdempotence = true,
            Acks = Acks.All,
            MessageSendMaxRetries = 5
        };
        _producer = new ProducerBuilder<string, string>(producerConfig).Build();
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            using var scope = _scopeFactory.CreateScope();
            var db = scope.ServiceProvider.GetRequiredService<AppDbContext>();

            var pendingRows = await db.OutboxMessages
                .OrderBy(m => m.CreatedAt)
                .Take(50)
                .ToListAsync(stoppingToken);

            foreach (var row in pendingRows)
            {
                try
                {
                    await _producer.ProduceAsync(
                        row.Topic,
                        new Message<string, string> { Key = row.AggregateId, Value = row.Payload },
                        stoppingToken);

                    // Delete only after broker ACK — separate SaveChangesAsync from business write
                    db.OutboxMessages.Remove(row);
                    await db.SaveChangesAsync(stoppingToken);
                }
                catch (ProduceException<string, string> ex)
                {
                    _logger.LogWarning(ex, "Produce failed for outbox row {Id}", row.Id);
                    // Leave row in place; next poll iteration will retry
                }
            }

            if (!pendingRows.Any())
                await Task.Delay(TimeSpan.FromMilliseconds(500), stoppingToken);
        }
    }
}

EnableIdempotence = true means that if ProduceAsync times out and the worker retries, the broker deduplicates the second attempt using the producer's sequence number — preventing duplicate Kafka messages even on relay retries.

Putting the Pieces Together: The Full Flow

Producer side (Outbox Worker):

  Business operation
  + DB write
  + Outbox row write  ──► Single DB transaction (atomic)
         │
         ▼
  Poll outbox rows
         │
         ▼
  ProduceAsync ──► Broker ACK
         │
         ▼
  Delete outbox row

Consumer side (BackgroundService):

  Consume message
         │
         ├── ConsumeException ──► log + backoff
         │
         ▼
  Check idempotency store
         ├── Seen ──► Commit & skip
         │
         ▼ (unseen)
  Process message
         │
         ├── Application exception ──► Produce to DLT
         │                            Flush
         │                            Commit original offset
         ▼
  Record idempotency key
         │
         ▼
  Commit offset

The ordering at every handoff — produce before delete, DLT before commit, idempotency key after downstream write — is what makes the full pipeline safe under process crashes, broker restarts, and rebalances.

📋 Pattern Ordering Rules

🔒 Pattern ✅ Correct Order ❌ Wrong Order
🔧 Idempotency Write downstream → record key → commit Record key → write downstream
📨 Dead-letter routing DLT produce → flush → commit original Commit → DLT produce
📤 Outbox relay ProduceAsync → await ACK → delete row Delete row → ProduceAsync
🎯 Manual commit Process → commit Commit → process

🧠 Mnemonic: "ACK before advance" — whether advancing a database row out of the outbox or advancing a Kafka offset, wait for acknowledgment from the next system before removing proof that the work needs doing.


Common Mistakes and What They Cost

Every mistake covered here has the same underlying shape: a gap between what the code appears to guarantee and what it actually guarantees under failure. Locally, these gaps are invisible — the happy path never exercises them. In production, they surface as silent data loss, phantom consistency, or monitoring that lies about system health.

Mistake 1: Committing the Offset Before the Downstream Write Completes

Premature offset commitment is the single most consequential sequencing error in consumer code, and it is easy to introduce accidentally — especially when enable.auto.commit=true is left in place or when a Commit() call is placed at the top of a retry block.

Message received (offset 47)
        │
        ▼
  consumer.Commit()  ◄── offset 47 now durable in __consumer_offsets
        │
        ▼
  downstream write   ◄── CRASH HERE
        │
        ▼
  (never completes)

After the crash, the consumer group restarts from offset 48. Offset 47 is gone — the broker will not redeliver a committed offset to the same group. If the downstream write was an order fulfillment record or a payment event, that event simply did not happen, with nothing in your system flagging it as missing.

The correct ordering is: receive → process → write downstream → commit. Auto-commit fires on a timer (auto.commit.interval.ms, default 5000 ms) — if your downstream write takes 3 seconds and a crash occurs at second 4, the auto-commit has already fired mid-batch. EnableAutoCommit = false removes this ambiguity entirely.

Mistake 2: Swallowing All Exceptions and Always Committing

This mistake usually starts with good intentions: keep the consumer moving so it doesn't fall behind.

// ❌ DO NOT DO THIS — silent data loss pattern
while (!stoppingToken.IsCancellationRequested)
{
    try
    {
        var result = consumer.Consume(stoppingToken);
        await ProcessMessageAsync(result.Message);
    }
    catch (Exception ex)
    {
        _logger.LogError(ex, "Error processing message, skipping.");
    }
    finally
    {
        consumer.Commit(); // always commit so we don't reprocess
    }
}

This code has three compounding problems. The commit always fires even when ProcessMessageAsync threw because the downstream API was temporarily unavailable — that message is now permanently gone. The DLT becomes unreachable because failed messages are never routed there before committing. And transient errors and poison pills look identical in the logs, destroying the signal that distinguishes a bad message from a bad environment.

🎯 Key Principle: A consumer that never stops but also never correctly processes is not more reliable than one that stops — it is less reliable, because it produces no error signal and silently discards work.

Mistake 3: Not Testing Consumer Behavior Under Partition Rebalance

A rebalance occurs when consumers join or leave a group, or when a consumer fails to poll within max.poll.interval.ms. During a rebalance, partition assignments are revoked and redistributed. Consider a consumer that accumulates 500 messages in memory, processes them all, and commits once at the end:

Consumer A owns partition 3
  - Reads messages 100–599
  - Processes messages 100–299
  - Rebalance triggered (another instance joined)
  - Partition 3 revoked from Consumer A — NO COMMIT YET
  - Consumer B assigned partition 3, resumes from offset 100
  - Messages 100–299 processed again ✓ (if idempotent) or duplicated ✗

The practical fix registers a rebalance handler that commits on revocation:

var consumer = new ConsumerBuilder<string, string>(config)
    .SetPartitionsRevokedHandler((c, partitions) =>
    {
        c.Commit();
        Console.WriteLine($"Partitions revoked: {string.Join(", ", partitions)}");
    })
    .SetPartitionsAssignedHandler((c, partitions) =>
    {
        Console.WriteLine($"Partitions assigned: {string.Join(", ", partitions)}");
    })
    .Build();

⚠️ Common Mistake: Relying on integration tests that run a single consumer instance against a single-partition topic. Rebalance behavior requires at least two consumer instances and a partition count that allows redistribution. Single-instance tests will never trigger a rebalance.

Mistake 4: Treating Zero Lag as Proof the Consumer Is Healthy

This mistake lives in the monitoring layer, but its consequences are just as severe.

Scenario A — healthy consumer:
  Latest offset produced:   1,000
  Last committed offset:    1,000
  Lag:                          0  ✓ Consumer is processing

Scenario B — crashed consumer group:
  Latest offset produced:   1,000
  Last committed offset:    1,000  (from before the crash)
  Lag:                          0  ✗ Consumer is NOT running

  [New messages produced: 1,001 ... 1,050]
  Latest offset produced:   1,050
  Last committed offset:    1,000  (unchanged — nobody is consuming)
  Lag:                         50  ← lag appears after a delay

There is a window — potentially minutes — where lag reads zero while the consumer group is completely dead and messages accumulate unprocessed. The additional signal you need is consumer group activity: is the group actively polling? The Kafka AdminClient exposes group state (Stable, Empty, Dead) alongside lag. The complete alert condition is: lag is growing or the consumer group is not in a Stable state with active members.

Mistake 5: Skipping the Outbox Pattern and Producing Inside a Database Transaction

This introduces a distributed consistency gap that is invisible in local development and intermittent in production.

// ❌ Dual-write anti-pattern
public async Task PlaceOrderAsync(Order order)
{
    using var transaction = await _db.Database.BeginTransactionAsync();

    _db.Orders.Add(order);
    await _db.SaveChangesAsync();

    // If this fails AFTER the DB commit, the event is lost
    await _producer.ProduceAsync("orders", new Message<string, string>
    {
        Key = order.Id.ToString(),
        Value = JsonSerializer.Serialize(order)
    });

    await transaction.CommitAsync();
    // DB row exists; Kafka event may not.
}

This gap cannot be closed by wrapping the produce in the same database transaction, because Kafka and relational databases do not share a transaction coordinator. The correct solution is the outbox pattern shown in the previous section: write the event to a database table in the same transaction as the business write, then have a separate relay worker produce to Kafka, committing the row as processed only after receiving a successful broker acknowledgment.

// ✅ Outbox pattern — event write is atomic with the business write
public async Task PlaceOrderAsync(Order order)
{
    using var transaction = await _db.Database.BeginTransactionAsync();

    _db.Orders.Add(order);

    _db.OutboxMessages.Add(new OutboxMessage
    {
        Id = Guid.NewGuid(),
        Topic = "orders",
        Key = order.Id.ToString(),
        Payload = JsonSerializer.Serialize(order),
        CreatedAt = DateTime.UtcNow,
        ProcessedAt = null
    });

    await _db.SaveChangesAsync();
    await transaction.CommitAsync();
    // Both Order and OutboxMessage are committed atomically.
    // The relay BackgroundService will produce OutboxMessages to Kafka.
}

Summary

Production Kafka demands that you reason about failure at the pipeline level, design for delivery semantics deliberately, and distinguish between failure types before deciding how to respond. The patterns in this lesson are the concrete application of that mindset.

📋 The Reliability Stack

🔧 Mechanism 🎯 Problem Solved ⚠️ Without It
🔑 Idempotency key + durable store Duplicate processing on redelivery Duplicate side effects
🕐 Process-before-commit ordering Crash between process and commit Silent data loss
☠️ Dead-letter topic Poison pills stalling the partition Partition starvation
📬 Outbox pattern Dual-write race between DB and Kafka Events lost on producer crash
📊 Per-partition lag alerting (time-to-catch-up) Silent pipeline degradation No warning before backlog becomes outage

📋 The Five Costly Mistakes

⚠️ Mistake 💥 Immediate Cost 🔍 How It Hides
🔒 Commit before downstream write Permanent message loss on crash Works perfectly until a crash
🔕 Swallow all exceptions + always commit Silent data loss, DLT bypassed Consumer stays healthy in metrics
🔄 No rebalance testing Unexpected duplicates in production Never triggered in single-instance tests
📊 Zero lag = healthy assumption Crashed consumers go undetected Lag metric is technically correct
🔀 Direct produce inside DB transaction DB and Kafka diverge silently Only fails on broker errors

Three of these five mistakes produce no error logs and no metric alerts under normal conditions. They are only visible when specific failure modes are exercised — crashes, broker unavailability, partition rebalances. Testing for them requires deliberately inducing those conditions.

Practical next steps:

🎯 Audit your offset commit placement. Verify every Commit() or StoreOffset() call is placed after the downstream write is acknowledged — not before, not in a finally block, not delegated to auto-commit.

🎯 Add consumer group state to your lag dashboard. Lag-only alerting has a blind spot for dead or paused consumer groups. Include group state alongside lag and write a runbook entry for the zero-lag-but-dead-group scenario.

🎯 Introduce a chaos step in your CI pipeline. Run consumer integration tests with a mid-test broker restart or a forced rebalance. If messages are lost or duplicated and your idempotency layer does not catch it, the test surface is incomplete.