KRaft & Modern Kafka
Understand KRaft mode: Kafka 4.x is post-ZooKeeper. New clusters should be KRaft-first. Also understand tiered storage splitting local hot and remote cold storage like S3.
Why Modern Kafka Changed the Rules for .NET Developers
If you've ever run docker-compose up on a Kafka project and watched two separate containers fight for memory before your producer could send a single message, you've felt the tax that ZooKeeper used to charge every Kafka deployment. For most of Kafka's history, a Kafka cluster was never really "a cluster" — it was two distributed systems duct-taped together, and .NET developers writing Confluent.Kafka code had to know at least a little about both. Why did the Kafka project decide an entire second coordination system was expendable? What actually breaks — and what stays exactly the same — in your C# code when that system disappears? And why does a change that sounds purely operational end up touching your docker-compose.yml, your CI pipeline, and your IAdminClient calls? This lesson answers those questions by walking through KRaft (Kafka Raft), the ZooKeeper-free architecture that is now the only way to run a new Kafka cluster, and by tracing exactly where that shift shows up in .NET application and tooling code.
The Two-System Problem ZooKeeper Created
In the original Kafka architecture, brokers didn't just talk to each other — they depended on an external coordination service, Apache ZooKeeper, for three critical jobs: electing a controller broker (the one responsible for managing partition leadership and cluster metadata), tracking which brokers were alive and registered, and storing the metadata describing every topic and partition in the cluster. ZooKeeper itself is a general-purpose distributed coordination service, not something purpose-built for Kafka, which meant every Kafka deployment was really an exercise in running and tuning two distributed systems side by side.
That had real, practical costs. ZooKeeper needed its own quorum of nodes, its own JVM tuning, its own monitoring, and its own upgrade cadence, often out of sync with the Kafka broker upgrade cycle. Controller failover went through a ZooKeeper-mediated election that involved watches and ephemeral znodes — a mechanism that worked, but added a layer of indirection between "a broker died" and "a new controller is active and metadata is consistent again." For a .NET developer this rarely showed up directly in producer or consumer code, but it showed up in the surrounding ecosystem: local dev tooling that shelled out to zkCli.sh, health checks that pinged ZooKeeper's client port, and runbooks with a whole section titled something like "if ZooKeeper is unhealthy, do not touch the brokers yet."
KRaft: One System Instead of Two
KRaft replaces ZooKeeper's job with a built-in Raft-based metadata quorum made up of Kafka nodes acting as controllers: either dedicated controller nodes or, in small deployments, nodes that serve as broker and controller at once. Instead of an external system tracking cluster metadata in znodes, the metadata lives in a replicated log inside Kafka, managed by the Raft consensus protocol. As of the Kafka 4.0 release, ZooKeeper support has been removed entirely from Kafka. KRaft is no longer an opt-in alternative for the adventurous; it is the only supported mode for standing up a new cluster. If you're provisioning a fresh cluster today, whether that's a local container for integration tests or a production deployment, there is no ZooKeeper configuration to reach for.
🎯 Key Principle: KRaft doesn't change what Kafka's metadata is (which broker leads which partition, which topics exist, which configs apply) — it changes how that metadata is agreed upon and stored, replacing an external coordination service with a Raft log that is native to Kafka itself.
Here's a simplified before-and-after of where cluster metadata lived and how a new controller was chosen:
ZooKeeper-based cluster (removed in Kafka 4.0):
    Kafka Broker 1 ↔ ZooKeeper ensemble ↔ Kafka Broker 2
    Controller election, broker registration, and topic metadata
    all live in ZooKeeper znodes, external to the brokers.

KRaft-based cluster (the only supported mode going forward):
    Kafka Broker/Controller 1 ↔ Raft metadata log ↔ Kafka Broker/Controller 2
    Controller election and metadata live in a Kafka-native
    replicated log; no external coordination service exists.
This is a simplified picture that flattens a real distinction Kafka makes between broker nodes, controller nodes, and combined nodes that play both roles — the wiring of process.roles and controller.quorum.voters that makes this concrete for a docker-compose.yml file is covered in "Spinning Up and Targeting a KRaft Cluster from a .NET Project," and the deeper mechanics of how the Raft quorum elects a leader and replicates its log are the subject of a dedicated KRaft Architecture lesson.
What Actually Changes for a .NET Developer
Here's the part that matters most for your day job: the shift from ZooKeeper to KRaft is, in one sense, remarkably boring. If you open up a Confluent.Kafka producer or consumer loop, nothing about the message-passing code changes at all. Producing a message, consuming a batch, committing an offset — all of that talks to Kafka brokers over the Kafka wire protocol, and it never talked to ZooKeeper directly even in the old architecture. .NET client libraries were always insulated from ZooKeeper by design; only the brokers and a narrow set of admin-adjacent operations touched it.
What does change is everything around that message-passing core — the seams where your .NET project meets cluster infrastructure:
🔧 Cluster bootstrap: how a cluster is initialized and assigned an identity now happens through a Kafka-native storage format step rather than an auto-assignment from ZooKeeper.
🔧 AdminClient behavior: calls like DescribeClusterAsync() report the cluster ID, broker membership, and a controller node over the Kafka protocol, covering questions that used to require reaching into ZooKeeper-adjacent tooling.
🔧 Local dev tooling: Docker Compose files and Testcontainers definitions used for integration tests need different environment variables and a different startup sequence than a ZooKeeper-plus-broker pair.
🔧 Operational runbooks: the failure modes, health checks, and recovery steps operators write down need to reflect a single coordinated system instead of two independently monitored ones.
A minimal way to see this insulation in action is to look at how little a basic .NET consumer cares about any of this. The following configuration and consume loop is identical whether the brokers behind it are running KRaft or, on an older deployment, still coordinating through ZooKeeper — which is exactly the point:
using Confluent.Kafka;

var config = new ConsumerConfig
{
    // Only the Kafka protocol endpoint matters here.
    // There has never been a ZooKeeper setting on this config object.
    BootstrapServers = "localhost:9092",
    GroupId = "orders-processor",
    AutoOffsetReset = AutoOffsetReset.Earliest
};

using var consumer = new ConsumerBuilder<string, string>(config).Build();
consumer.Subscribe("orders");

try
{
    while (true)
    {
        var result = consumer.Consume(TimeSpan.FromSeconds(1));
        if (result is not null)
        {
            Console.WriteLine($"Received: {result.Message.Value}");
        }
    }
}
finally
{
    consumer.Close();
}
What is new is where cluster-level questions get answered. Listing the brokers that make up the cluster, for instance, is something your .NET code can ask the cluster directly, with no ZooKeeper client involved:

using Confluent.Kafka;
using Confluent.Kafka.Admin;

var adminConfig = new AdminClientConfig
{
    BootstrapServers = "localhost:9092"
};

using var admin = new AdminClientBuilder(adminConfig).Build();

// GetMetadata() fetches broker and topic metadata through the
// Kafka protocol itself; no external ZooKeeper client is required.
var metadata = admin.GetMetadata(TimeSpan.FromSeconds(5));
foreach (var broker in metadata.Brokers)
{
    Console.WriteLine($"Broker {broker.BrokerId} at {broker.Host}:{broker.Port}");
}

This is a preview, not the full treatment. The DescribeClusterAsync() call, the cluster ID and controller fields it returns, and a worked end-to-end producer/consumer/admin example against a live KRaft container are the focus of "Spinning Up and Targeting a KRaft Cluster from a .NET Project." The point to take from it here is narrower: operations that used to require stepping outside the Kafka client entirely, into ZooKeeper CLI tools, now happen through the same IAdminClient your application already uses.
💡 Mental Model: Think of the ZooKeeper-to-KRaft transition less as "Kafka got a new feature" and more as "Kafka internalized a dependency it used to outsource." The job didn't disappear, it moved inside the system you're already talking to — which is exactly why your producer and consumer code doesn't need to change, but your infrastructure code does.
What This Lesson Covers, and What It Deliberately Doesn't
Given how much surface area this transition touches, it's worth being explicit about scope before going further. This lesson stays focused on the .NET-facing consequences: the configuration keys that replaced ZooKeeper settings, how to point local and CI tooling at a KRaft cluster, resilience patterns for AdminClient calls during topology changes, and the mistakes teams commonly carry over from ZooKeeper-era habits.
Two related topics are important enough to deserve their own dedicated treatment rather than a rushed summary here. The internal mechanics of the Raft-based controller quorum — how voters and observers are structured, how leader election within the quorum actually works, and how the metadata log is replicated — belongs to a dedicated KRaft Architecture lesson, because those internals matter for cluster operators more than for application developers, but they explain why the client-visible behaviors in this lesson hold. Similarly, tiered storage, which splits a topic's data between fast local disks and cheaper remote object storage such as S3, is a related-but-separate modernization in Kafka's architecture; it changes how retention and disk sizing work but not how your producer or consumer code is written, and it gets its own Tiered Storage Concepts lesson later in this course.
⚠️ Common Mistake: It's tempting to treat "ZooKeeper is gone" and "tiered storage exists" as the same kind of upgrade, since both get bundled into the phrase "modern Kafka." They're independent: a cluster can run KRaft with no tiered storage configured at all, and the disk-filling consequences of forgetting that distinction are covered later, in "Common Mistakes .NET Teams Make with Modern Kafka Clusters."
The route from here follows the seams just described: first the concrete configuration and behavioral differences KRaft introduces at the client and admin layer, then hands-on cluster setup from a .NET project, then resilience patterns for production code, then the mistakes teams make when old ZooKeeper-era assumptions linger, and finally a consolidated checklist tying it together. None of that requires you to become a cluster-internals expert — it requires knowing precisely where the ground shifted under the .NET code and tooling you already write.
From ZooKeeper Habits to KRaft Reality: What Client Code Needs to Know
If you've maintained a .NET service talking to Kafka for a few years, some of your instincts were formed by ZooKeeper's presence — even if your application code never called ZooKeeper directly. Config files, docker-compose definitions, admin scripts, and mental models about "where cluster state lives" all quietly assumed a second distributed system sitting behind the brokers. KRaft removes that assumption entirely, and the differences show up in exactly the places a .NET developer touches: connection strings, admin tooling, container definitions, and the error behavior your client sees during topology changes.
Connection Strings: bootstrap.servers Is the Only Door
Under the ZooKeeper-era architecture, two connection strings coexisted in most Kafka projects. Application producers and consumers used bootstrap.servers to reach brokers, but administrative and operational tooling frequently used zookeeper.connect — pointing at the ZooKeeper ensemble instead of the brokers themselves. Tools like zkCli.sh, and broker-side scripts invoked with a --zookeeper flag, could read and write cluster metadata by talking to ZooKeeper's znode tree directly, bypassing the Kafka protocol altogether.
That second door is gone. In a KRaft cluster, zookeeper.connect is not a deprecated-but-tolerated setting — it has no meaning at all, because there is no ZooKeeper ensemble to connect to. Every operation, whether it originates from a producer, a consumer, or an administrative script, goes through the Kafka protocol against bootstrap.servers. This is a simplification worth stating precisely: it doesn't just mean "use a different connection string," it means an entire class of tooling that read cluster state by inspecting znodes has no equivalent path anymore and must be rewritten against the Kafka Admin API.
// ZooKeeper-era .NET tooling often needed BOTH of these,
// because some operations only existed via ZK-aware scripts:
// var zkConnect = "zk1:2181,zk2:2181,zk3:2181"; // no longer applicable in KRaft

// KRaft-mode configuration: only the Kafka protocol endpoint matters
var adminConfig = new AdminClientConfig
{
    BootstrapServers = "broker1:9092,broker2:9092,broker3:9092"
};

using var admin = new AdminClientBuilder(adminConfig).Build();
This snippet looks almost too simple to be the point — and that's exactly the point. The AdminClientConfig for a KRaft cluster carries no ZooKeeper-related properties at all, because Confluent.Kafka's admin client has only ever spoken the Kafka wire protocol; what changed is that every administrative capability formerly available only through ZK-aware scripts is now guaranteed to be reachable this same way.
Where Cluster Identity Comes From Now
In the ZooKeeper architecture, a cluster's identity was assigned automatically: the first broker to start would register itself in ZooKeeper, and a cluster ID would be generated and stored there for other brokers to discover. There was no explicit provisioning step — the cluster "became" a cluster the moment brokers connected to a shared ZooKeeper ensemble.
KRaft inverts this. Before any broker can start in KRaft mode, someone must run kafka-storage.sh format against each node's storage directory, supplying a cluster ID that was generated once (typically with kafka-storage.sh random-uuid) and reused across every broker and controller in that cluster. This is a deliberate, one-time provisioning act rather than an emergent side effect of brokers finding each other. If you skip it, brokers refuse to start rather than silently forming an ad hoc cluster — a meaningful behavioral difference for anyone scripting cluster bring-up.
The practical consequence for .NET tooling: anything that used to confirm "which cluster am I talking to" by reading the /cluster/id znode has no ZooKeeper tree to inspect anymore. The Kafka protocol itself now exposes this information, and the correct replacement is a call your AdminClient already supports:

// Confirm cluster identity and membership via the Kafka protocol,
// replacing any tooling that used to read ZooKeeper's /cluster/id znode.
// DescribeClusterOptions lives in the Confluent.Kafka.Admin namespace.
var description = await admin.DescribeClusterAsync(
    new DescribeClusterOptions { RequestTimeout = TimeSpan.FromSeconds(10) });

Console.WriteLine($"Cluster ID: {description.ClusterId}");
Console.WriteLine($"Controller: {description.Controller?.Host}:{description.Controller?.Port}");
foreach (var node in description.Nodes)
{
    Console.WriteLine($"Broker/Node {node.Id}: {node.Host}:{node.Port}");
}

DescribeClusterAsync() returns the cluster ID that was baked in at kafka-storage.sh format time, along with the broker list and a controller node, all delivered over the same protocol your producers and consumers already use. One caveat: when a client bootstraps through brokers in KRaft mode, the controller it is shown is a broker the cluster designates to receive controller-bound requests, not necessarily the active leader of the Raft quorum, so treat that field as a routing hint rather than proof of who leads the quorum. Any health-check or diagnostic .NET tooling that previously shelled out to ZooKeeper CLI commands to answer "is this the cluster I expect, and which brokers are in it" should be rewritten around this single call.
process.roles and controller.quorum.voters: Why This Matters for Your docker-compose.yml
ZooKeeper-era Kafka had a clean separation of concerns at the process level: ZooKeeper processes handled coordination, and Kafka broker processes handled data. KRaft collapses coordination into Kafka itself, but it does so by introducing an explicit role a given Kafka process plays, set via process.roles. A node can be started with process.roles=broker, process.roles=controller, or — common in small and local deployments — process.roles=broker,controller, meaning a single JVM process does both jobs.
Alongside process.roles, every node needs controller.quorum.voters, a list identifying which nodes participate in the Raft-based controller quorum, formatted as nodeId@host:port pairs. This setting is what lets a controller-role or combined-role node find the rest of the quorum at startup, in the same way zookeeper.connect used to let brokers find the ZooKeeper ensemble.
This distinction matters concretely the moment you write a docker-compose file or a Testcontainers definition for integration tests, because you now have to decide and declare a role rather than just pointing every broker at a shared coordination service:
# docker-compose.yml excerpt: a single combined broker+controller node,
# typical for local .NET integration test environments
services:
  kafka:
    image: apache/kafka:latest
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: "1@kafka:9093"
      KAFKA_LISTENERS: PLAINTEXT://:9092,CONTROLLER://:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
    ports:
      - "9092:9092"
The two environment variables that didn't exist in a ZooKeeper-based compose file — KAFKA_PROCESS_ROLES and KAFKA_CONTROLLER_QUORUM_VOTERS — are exactly the ones replacing what ZooKeeper used to provide implicitly. Building and targeting a fuller version of this kind of setup from a .NET solution, including wiring it into Testcontainers and verifying the container is ready before tests run, is worked through in "Spinning Up and Targeting a KRaft Cluster from a .NET Project." The quorum voter mechanics — how many voters you need, how leader election among them actually works — belong to the dedicated KRaft Architecture lesson; what matters here is simply that these two settings now occupy the space zookeeper.connect used to fill in your infrastructure definitions.
Metadata Propagation: Fewer Stale-Metadata Surprises
One behavioral difference .NET developers notice even without changing a line of application code is how quickly the cluster's view of partition leadership becomes consistent after a change. In the ZooKeeper architecture, metadata changes (a new partition leader after a broker failure, for instance) were written to ZooKeeper and picked up through ZooKeeper watches, with the controller broker responsible for pushing updates out to the rest of the cluster. This worked, but it introduced multiple hops and occasional windows where a client's cached metadata, which is refreshed on error and otherwise only as often as settings like metadata.max.age.ms allow, didn't yet reflect reality, producing NOT_LEADER_OR_FOLLOWER or similar errors on the next produce or fetch request.
KRaft replaces the ZooKeeper watch mechanism with a single Raft-replicated metadata log that all brokers consume directly. Instead of a controller broker reading from ZooKeeper and republishing changes, every broker tails the same log of metadata records and applies them in order. Because there's one authoritative log rather than a separate coordination store plus a controller relay step, the propagation path is shorter and more uniform across the cluster, and disagreements between what a broker believes and what the controller has just committed tend to resolve faster.
The practical upshot for a .NET client library like Confluent.Kafka: you may notice fewer occurrences of the stale-metadata class of errors following a leadership change, and error windows around controller transitions tend to close more quickly than they did against ZooKeeper-coordinated clusters. This doesn't mean such errors disappear — a client can still hold cached metadata that's momentarily behind the cluster's true state, and correctly handling that is still the job of retry and timeout configuration, covered in depth in "Writing Resilient .NET Clients for KRaft-Based Clusters." What changes here is the frequency and duration of the underlying condition, not the existence of the condition itself; treating this improvement as a reason to skip resilience configuration entirely would be a mistake.
⚠️ Common Mistake: assuming that because KRaft's metadata propagation is faster, a .NET service no longer needs sensible metadata.max.age.ms or retry handling around admin calls. Faster convergence reduces the odds of hitting a stale-metadata window, particularly during routine leadership changes, but it does not eliminate transient errors during an active controller election, which is exactly the edge case explored later in this lesson.
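As a rough illustration of what "sensible" metadata settings look like in Confluent.Kafka, the sketch below sets the two refresh-related properties on a consumer configuration. The numeric values are illustrative assumptions, not recommendations from this lesson. In librdkafka, which Confluent.Kafka wraps, the periodic refresh is driven by topic.metadata.refresh.interval.ms, while metadata.max.age.ms caps how long cached entries are trusted.

```csharp
using System;
using Confluent.Kafka;

var consumerConfig = new ConsumerConfig
{
    BootstrapServers = "localhost:9092",
    GroupId = "orders-processor",

    // How often the client proactively refreshes topic and leader metadata.
    // 60 seconds is an illustrative value, not a recommendation.
    TopicMetadataRefreshIntervalMs = 60_000,

    // Upper bound on how long cached metadata is trusted.
    // 180 seconds is likewise illustrative.
    MetadataMaxAgeMs = 180_000
};

Console.WriteLine($"Refresh every {consumerConfig.TopicMetadataRefreshIntervalMs} ms, " +
                  $"max age {consumerConfig.MetadataMaxAgeMs} ms");
```

Shorter intervals narrow the stale-metadata window at the cost of more metadata requests; they complement retry handling rather than replace it.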
What This Section Does Not Cover
It's worth being explicit about scope, because it's tempting to read "single Raft log" and want the full mechanics right away. How the controller quorum actually elects a leader, what distinguishes a voter from an observer, and how the Raft log itself achieves consensus among controller nodes are all substantial topics addressed in the dedicated KRaft Architecture lesson. What matters for this section is narrower and more immediately practical: the settings you type into a config file or a compose YAML, the API calls your .NET tooling now needs to use, and the class of errors you should expect to see less often as a result of the architecture change underneath your client.
Spinning Up and Targeting a KRaft Cluster from a .NET Project
Once you understand that a KRaft cluster no longer has a ZooKeeper ensemble sitting behind it, the next question is practical: how do you actually stand one up locally, and how does a .NET producer, consumer, or admin client find it? The good news is that the wiring on the client side barely changes. The work that does change is entirely in how you configure the broker container itself, because a KRaft node now has to know things ZooKeeper used to track on its behalf — its own identity, its role, and who else is voting on cluster metadata.
Configuring a Combined Broker+Controller Node for Local Development
For integration tests and local development, you don't need a multi-node Raft quorum — a single node acting as both broker and controller is the standard pattern, configured via process.roles=broker,controller. This single-process setup is a deliberate simplification for dev/test convenience; it gives you no controller fault tolerance at all, which is exactly why production clusters split these roles across multiple dedicated controller nodes (a topic the dedicated KRaft Architecture lesson covers in depth).
A combined node needs three pieces of identity information that ZooKeeper used to hand out automatically: a node ID (this broker/controller's unique integer identifier within the cluster), a cluster ID (a UUID stamped once into the node's storage directory), and controller.quorum.voters, which lists which node IDs are eligible to vote in controller elections along with their controller-listener addresses.
Here's a docker-compose.yml fragment for a single combined KRaft node suitable for a .NET integration test suite:
services:
  kafka:
    image: apache/kafka:latest
    container_name: kraft-broker
    ports:
      - "9092:9092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_LISTENERS: PLAINTEXT://:9092,CONTROLLER://:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka:9093
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT
      # Single-node cluster: no replication headroom, fine for local dev/test only
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
The format 1@kafka:9093 in controller.quorum.voters means "node ID 1, reachable at host kafka on port 9093 for controller traffic." Notice that the broker's client-facing listener (PLAINTEXT on 9092) and its controller listener (CONTROLLER on 9093) are deliberately separate — application traffic and Raft consensus traffic never share a listener.
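To make the nodeId@host:port shape concrete, here is a minimal sketch of a parser that test tooling could use to sanity-check a controller.quorum.voters value before starting a container. QuorumVoter and QuorumVoters.Parse are hypothetical names invented for this example; they are not part of Kafka or Confluent.Kafka.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Example usage with a two-voter value (host names are illustrative).
foreach (var voter in QuorumVoters.Parse("1@kafka:9093,2@kafka2:9093"))
{
    Console.WriteLine($"Node {voter.NodeId} votes via {voter.Host}:{voter.Port}");
}

// Hypothetical type for this sketch: one entry of controller.quorum.voters.
public sealed record QuorumVoter(int NodeId, string Host, int Port);

public static class QuorumVoters
{
    // Splits a comma-separated voter list into typed entries.
    public static IReadOnlyList<QuorumVoter> Parse(string value) =>
        value
            .Split(',', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries)
            .Select(ParseEntry)
            .ToList();

    private static QuorumVoter ParseEntry(string entry)
    {
        // Each entry has the shape nodeId@host:port.
        var at = entry.IndexOf('@');
        var colon = entry.LastIndexOf(':');
        if (at <= 0 || colon <= at + 1 || colon == entry.Length - 1)
        {
            throw new FormatException($"Expected nodeId@host:port but got '{entry}'.");
        }

        var nodeId = int.Parse(entry[..at]);
        var host = entry[(at + 1)..colon];
        var port = int.Parse(entry[(colon + 1)..]);
        return new QuorumVoter(nodeId, host, port);
    }
}
```

A check like this catches a malformed voter string in the test harness itself, instead of as a broker that exits during startup.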
If you're wiring this into a .NET test project rather than a standalone compose file, Testcontainers gives you the same shape programmatically:
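using DotNet.Testcontainers.Builders;

var kafkaContainer = new ContainerBuilder()
    .WithImage("apache/kafka:latest")
    // Fixed host port: the advertised listener below tells clients to use
    // localhost:9092, so a randomly assigned host port would be unreachable
    // once the client follows the advertised address after bootstrapping.
    .WithPortBinding(9092, 9092)
    .WithEnvironment("KAFKA_NODE_ID", "1")
    .WithEnvironment("KAFKA_PROCESS_ROLES", "broker,controller")
    .WithEnvironment("KAFKA_LISTENERS",
        "PLAINTEXT://:9092,CONTROLLER://:9093")
    .WithEnvironment("KAFKA_ADVERTISED_LISTENERS",
        "PLAINTEXT://localhost:9092")
    .WithEnvironment("KAFKA_CONTROLLER_LISTENER_NAMES", "CONTROLLER")
    .WithEnvironment("KAFKA_CONTROLLER_QUORUM_VOTERS", "1@localhost:9093")
    .WithEnvironment("KAFKA_LISTENER_SECURITY_PROTOCOL_MAP",
        "PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT")
    // Single node: internal topics cannot be replicated to other brokers.
    .WithEnvironment("KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR", "1")
    .WithWaitStrategy(Wait.ForUnixContainer().UntilPortIsAvailable(9092))
    .Build();

await kafkaContainer.StartAsync();

// Resolves to localhost:9092 because of the fixed binding above.
var bootstrapServers = $"localhost:{kafkaContainer.GetMappedPublicPort(9092)}";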
The WaitStrategy here only checks that the port is open, not that the broker has finished self-formatting its storage and completed leader election for internal topics; for a test suite that hits Kafka immediately after startup, pair this with a short retry loop on the first admin call rather than assuming readiness the instant the port responds.
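The short retry loop described above could look something like the following sketch. Readiness.WaitUntilReadyAsync is a hypothetical helper written for this example. It takes any probe delegate, so the demonstration uses a fake probe that fails twice; in a real test the probe would be the first admin call against the container.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Demonstration with a fake probe that fails twice before succeeding.
var attempts = 0;
await Readiness.WaitUntilReadyAsync(
    probe: () =>
    {
        attempts++;
        return attempts < 3
            ? Task.FromException(new InvalidOperationException("broker not ready"))
            : Task.CompletedTask;
    },
    timeout: TimeSpan.FromSeconds(5),
    delay: TimeSpan.FromMilliseconds(50));

Console.WriteLine($"Ready after {attempts} attempts");

public static class Readiness
{
    // Hypothetical helper: runs the probe until it succeeds. Once another
    // delay would overrun the timeout, the probe's exception propagates.
    public static async Task WaitUntilReadyAsync(
        Func<Task> probe,
        TimeSpan timeout,
        TimeSpan delay,
        CancellationToken cancellationToken = default)
    {
        var elapsed = Stopwatch.StartNew();
        while (true)
        {
            try
            {
                await probe();
                return;
            }
            catch (Exception) when (elapsed.Elapsed + delay < timeout)
            {
                await Task.Delay(delay, cancellationToken);
            }
        }
    }
}
```

In a test fixture the probe might be `() => admin.DescribeClusterAsync(new DescribeClusterOptions { RequestTimeout = TimeSpan.FromSeconds(2) })`, assuming an IAdminClient named admin has been built against the container's bootstrap address.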
Producing and Consuming Against a KRaft Cluster
This is the reassuring part: to locate the cluster, Confluent.Kafka's ProducerConfig and ConsumerConfig only need BootstrapServers (a consumer that subscribes to topics also needs a GroupId, exactly as before). There is no zookeeper.connect equivalent to set and no separate discovery step. The client asks whatever broker it can reach for the current cluster metadata, and that broker answers using information it learned from the Raft metadata log instead of from ZooKeeper watches. From the client's point of view, this is the same bootstrap-and-discover protocol flow Kafka has always used; KRaft only changes how the broker itself learned that metadata.
using Confluent.Kafka;

var bootstrapServers = "localhost:9092"; // from Testcontainers or docker-compose

// Produce a single message
var producerConfig = new ProducerConfig { BootstrapServers = bootstrapServers };
using (var producer = new ProducerBuilder<string, string>(producerConfig).Build())
{
    var result = await producer.ProduceAsync("orders",
        new Message<string, string> { Key = "order-42", Value = "placed" });
    Console.WriteLine($"Produced to {result.TopicPartitionOffset}");
}

// Consume it back
var consumerConfig = new ConsumerConfig
{
    BootstrapServers = bootstrapServers,
    GroupId = "orders-test-group",
    AutoOffsetReset = AutoOffsetReset.Earliest
};
using (var consumer = new ConsumerBuilder<string, string>(consumerConfig).Build())
{
    consumer.Subscribe("orders");
    var consumeResult = consumer.Consume(TimeSpan.FromSeconds(10));
    Console.WriteLine($"Consumed: {consumeResult?.Message.Value}");
    consumer.Close();
}
Nothing in this snippet is KRaft-specific — which is precisely the point the earlier section made: the message-passing surface area is untouched. What KRaft does change is what happens before this code runs (how the broker itself came up) and what tooling you use to inspect the cluster's state, which is where IAdminClient comes in.
Verifying Cluster State with IAdminClient.DescribeClusterAsync()
Once your container is running, it's worth confirming, rather than assuming, that your .NET client can actually see the cluster the way you expect. IAdminClient.DescribeClusterAsync() returns the current broker list, the cluster ID, and a controller node. This matters more under KRaft than it did under ZooKeeper-era tooling, because there's no external zkCli session to peek at separately; the AdminClient call is your window into cluster metadata now.

using Confluent.Kafka;
using Confluent.Kafka.Admin;

var adminConfig = new AdminClientConfig { BootstrapServers = bootstrapServers };
using var admin = new AdminClientBuilder(adminConfig).Build();

var clusterInfo = await admin.DescribeClusterAsync(
    new DescribeClusterOptions { RequestTimeout = TimeSpan.FromSeconds(5) });

Console.WriteLine($"Cluster ID: {clusterInfo.ClusterId}");
Console.WriteLine($"Controller: {clusterInfo.Controller.Id} " +
    $"({clusterInfo.Controller.Host}:{clusterInfo.Controller.Port})");
foreach (var node in clusterInfo.Nodes)
{
    Console.WriteLine($"Node {node.Id}: {node.Host}:{node.Port}");
}

Against the single-node combined setup above, you'd expect Nodes to contain exactly one entry and Controller.Id to match that same node's ID, since it is the only node there is. If you scale up to a multi-node cluster later, this same call is how you'd confirm, from application code, in a CI health check, or in a diagnostic script, which cluster you are connected to and which brokers it currently contains. Keep in mind that in KRaft mode the Controller reported to a client that bootstraps through brokers is not guaranteed to be the active quorum leader; Kafka's own kafka-metadata-quorum.sh tool is the authoritative view of quorum leadership.
Checking Client/Broker API Version Compatibility
Confluent.Kafka wraps librdkafka, and librdkafka negotiates a supported API version range with the broker on connect; this is how the client and broker agree on which request/response formats to speak. ⚠️ Common Mistake: pinning an old Confluent.Kafka NuGet package (and the librdkafka version bundled with it) that predates broker-side changes introduced for KRaft-era clusters. An outdated client can often still connect, but it may fall back to older protocol assumptions and mishandle newer metadata responses or error codes, producing errors that look unrelated to versioning at first glance. Kafka 4.0 also removed a number of very old protocol API versions, so a sufficiently old client may fail to connect at all.
You can inspect what actually got negotiated by turning on librdkafka's debug logging rather than guessing:
var debugConfig = new AdminClientConfig
{
    BootstrapServers = bootstrapServers,
    Debug = "broker,protocol"
};

using var debugAdmin = new AdminClientBuilder(debugConfig)
    .SetLogHandler((_, message) => Console.WriteLine($"[{message.Level}] {message.Message}"))
    .Build();

// Any request works here; it only needs to force a broker connection
// so the version negotiation shows up in the debug log.
await debugAdmin.DescribeClusterAsync(
    new DescribeClusterOptions { RequestTimeout = TimeSpan.FromSeconds(5) });
The protocol debug context logs the ApiVersionsRequest/Response exchange, including the version ranges each side advertises. Treat mismatches here as a signal to check your NuGet package version rather than assuming a network or configuration problem — a stale package pin is a far more common root cause than an actual protocol incompatibility with a current broker.
One Connection Configuration, Two Environments
Because BootstrapServers is the only mandatory setting, the cleanest pattern is to keep it — and any security settings layered on top of it — entirely in configuration, never hardcoded, so the same compiled service binary can point at a local KRaft container in dev and a managed KRaft-based cluster in production. This isn't unique to KRaft, but KRaft's simplified bootstrap model (no separate ZooKeeper connection string to also externalize) makes it easier to get right, since there's only one moving part instead of two.
{
  "Kafka": {
    "BootstrapServers": "localhost:9092",
    "SecurityProtocol": "Plaintext"
  }
}

using Confluent.Kafka;
using Microsoft.Extensions.Options;

public class KafkaOptions
{
    public string BootstrapServers { get; set; } = string.Empty;
    public string SecurityProtocol { get; set; } = "Plaintext";
}

// Program.cs / composition root
builder.Services.Configure<KafkaOptions>(
    builder.Configuration.GetSection("Kafka"));

builder.Services.AddSingleton<IProducer<string, string>>(sp =>
{
    var options = sp.GetRequiredService<IOptions<KafkaOptions>>().Value;
    var config = new ProducerConfig
    {
        BootstrapServers = options.BootstrapServers,
        // Throws at startup if the configured value is not a valid enum name.
        SecurityProtocol = Enum.Parse<SecurityProtocol>(options.SecurityProtocol)
    };
    return new ProducerBuilder<string, string>(config).Build();
});
In production, an environment variable or secrets store overrides Kafka:BootstrapServers with the managed cluster's endpoints and typically flips SecurityProtocol to SaslSsl with accompanying credentials — the application code that builds the producer never changes. The resilience-oriented settings that matter once you're actually talking to a live multi-node quorum — timeouts, retry policy, idempotence — are addressed in "Writing Resilient .NET Clients for KRaft-Based Clusters"; here the goal is simply that your connection configuration is structured so environment is the only thing that varies.
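As a rough illustration of that override behavior, the sketch below layers two in-memory configuration sources the way appsettings.json and environment variables are layered in a typical host: sources added later win. It assumes the Microsoft.Extensions.Configuration package; the dictionaries stand in for the real JSON file and environment (where the key would usually be written Kafka__BootstrapServers), and the production hostnames are placeholders.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Extensions.Configuration;

// Stand-in for appsettings.json (local dev values).
var devDefaults = new Dictionary<string, string?>
{
    ["Kafka:BootstrapServers"] = "localhost:9092",
    ["Kafka:SecurityProtocol"] = "Plaintext"
};

// Stand-in for production environment variables or a secrets store.
// These hostnames are hypothetical.
var productionOverrides = new Dictionary<string, string?>
{
    ["Kafka:BootstrapServers"] = "broker1.example.internal:9093,broker2.example.internal:9093",
    ["Kafka:SecurityProtocol"] = "SaslSsl"
};

// Sources added later override keys from earlier sources.
IConfiguration configuration = new ConfigurationBuilder()
    .AddInMemoryCollection(devDefaults)
    .AddInMemoryCollection(productionOverrides)
    .Build();

Console.WriteLine(configuration["Kafka:BootstrapServers"]);
Console.WriteLine(configuration["Kafka:SecurityProtocol"]);
```

The code that binds KafkaOptions and builds the producer reads the same keys either way, which is why it never has to know which environment it is running in.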
Writing Resilient .NET Clients for KRaft-Based Clusters
A cluster that fails over its controller in well under a second sounds like it should make client-side resilience code less necessary. In practice the opposite happens: because the failure window is short, it's tempting to skip defensive timeout tuning and retry logic entirely, and then a deployment pipeline or a production producer trips over the one request that lands during that narrow window. The goal of this section is to make that window survivable by construction — through producer configuration, AdminClient retry wrappers, and metadata-refresh tuning that all assume a KRaft-based cluster underneath.
Faster Failover Doesn't Mean No Failover
Under the old ZooKeeper-coordinated design, a controller failover could take several seconds because a new controller had to win a ZooKeeper session-timeout-driven election and then re-read broker and partition state from ZooKeeper before it could act. KRaft's Raft-based controller quorum typically completes leader election in a fraction of that time because the new active controller already has the metadata log locally and doesn't need to rebuild state from an external system. That difference is covered in depth in the dedicated KRaft Architecture lesson, but the client-facing consequence is simple: .NET producers, consumers, and admin clients now see shorter error windows around controller changes, not zero-length ones.
A shorter window still needs a plan. If delivery.timeout.ms (the overall time budget librdkafka gives a message to be acknowledged; in librdkafka it is an alias for message.timeout.ms, and Confluent.Kafka exposes it as MessageTimeoutMs) is left at a value tuned for a network that never hiccups, a message that happens to be in flight during a controller or partition-leader transition can still expire and surface as a delivery failure in your application code. The fix isn't exotic: choose values that assume brief disruptions will happen, giving in-flight work enough room to survive them without also making a genuinely stuck connection take forever to report an error.
| 🔧 Setting | 📉 Too tight | ✅ Defensive default | 💬 Why |
|---|---|---|---|
| message.timeout.ms | 5,000 | 30,000 | Covers a controller/leader transition plus retry backoff |
| socket.connection.setup.timeout.ms | 1,000 | 10,000 | Avoids false failures against a broker still rejoining |
| retries | 0 or 1 | librdkafka default (2147483647, effectively unbounded; bounded in practice by message.timeout.ms) | Lets the timeout, not the retry count, be the real budget |
| retry.backoff.ms | 10 | 250–500 | Avoids hammering a broker mid-election |
🎯 Key Principle: treat message.timeout.ms as the single source of truth for "how long is this operation allowed to take," and let the retry count be effectively unbounded underneath it — that way a transient controller change is absorbed automatically instead of needing every retry knob tuned in lockstep.
Idempotent Producers and a Resilient AdminClient Wrapper
The producer side of resilience starts with the idempotent producer: a mode where the broker deduplicates retried writes using a producer ID and sequence number, so a retry after a timeout doesn't create duplicate messages. Pairing idempotence with acks=all (wait for all in-sync replicas to acknowledge) and a capped max.in.flight.requests.per.connection keeps message ordering intact even when retries fire.
using Confluent.Kafka;
var producerConfig = new ProducerConfig
{
BootstrapServers = "broker1:9092,broker2:9092,broker3:9092",
EnableIdempotence = true, // dedupes retried writes at the broker
Acks = Acks.All, // wait for all in-sync replicas
MaxInFlight = 5, // capped so ordering survives retries
MessageTimeoutMs = 30000, // overall delivery budget per message
RetryBackoffMs = 300 // spacing between retry attempts
};
using var producer = new ProducerBuilder<string, string>(producerConfig).Build();
MaxInFlight caps how many unacknowledged requests can be outstanding on one connection at once. With idempotence enabled it must be 5 or fewer: that is the window within which the broker's sequence-number check can still keep retried batches in order, and librdkafka refuses to create a producer that explicitly sets a larger value. EnableIdempotence and Acks.All together are what make the retries triggered by a brief controller or leader change safe rather than dangerous; without idempotence, the same retry logic that makes a producer resilient could also silently duplicate a message.
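The constraints that idempotence places on the other producer settings can be written down as a small pre-flight check. This is only a sketch: FindIdempotenceConflicts is a hypothetical helper, not part of Confluent.Kafka, and it takes plain values rather than a ProducerConfig so it runs without the client package. The three rules mirror what librdkafka documents for enable.idempotence=true (acks=all, at most 5 in-flight requests, retries greater than 0).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical pre-flight check: returns the reasons a set of producer
// settings conflicts with enable.idempotence=true.
static List<string> FindIdempotenceConflicts(string acks, int maxInFlight, int retries)
{
    var problems = new List<string>();
    if (acks != "all" && acks != "-1")
        problems.Add("acks must be 'all' (-1)");
    if (maxInFlight < 1 || maxInFlight > 5)
        problems.Add("max.in.flight.requests.per.connection must be between 1 and 5");
    if (retries < 1)
        problems.Add("retries must be greater than 0");
    return problems;
}

var validConflicts = FindIdempotenceConflicts("all", 5, int.MaxValue);
var invalidConflicts = FindIdempotenceConflicts("1", 10, 0);

Console.WriteLine($"Conflicts in the defensive config: {validConflicts.Count}");
Console.WriteLine($"Conflicts in the careless config: {invalidConflicts.Count}");
foreach (var problem in invalidConflicts)
    Console.WriteLine($" - {problem}");
```

A check like this is mostly useful in a configuration-review test, where it can flag a bad combination before the producer is ever constructed.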
AdminClient operations need the same defensive posture, but they fail differently than producer sends: instead of a message timing out, a call like CreateTopicsAsync can throw a KafkaException outright when it hits a broker that briefly can't answer authoritatively. A CI provisioning step that lets that exception bubble up and fail the pipeline is treating a recoverable, sub-second condition as a hard error. A small retry wrapper turns it into a non-event:
using System;
using System.Linq;
using System.Threading.Tasks;
using Confluent.Kafka;
using Confluent.Kafka.Admin;
static async Task CreateTopicWithRetryAsync(
IAdminClient adminClient,
TopicSpecification spec,
int maxAttempts = 5)
{
var delay = TimeSpan.FromMilliseconds(500);
for (var attempt = 1; attempt <= maxAttempts; attempt++)
{
try
{
await adminClient.CreateTopicsAsync(new[] { spec });
return; // success
}
catch (CreateTopicsException ex) when (
ex.Results.Any(r => r.Error.Code == ErrorCode.TopicAlreadyExists))
{
return; // idempotent for a provisioning step: already exists is fine
}
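catch (KafkaException ex) when (
attempt < maxAttempts &&
// Per-topic errors arrive in CreateTopicsException.Results; request-level
// errors arrive on the exception itself, so check whichever applies.
(ex is CreateTopicsException failed
? failed.Results.Select(r => r.Error.Code)
: new[] { ex.Error.Code })
.Any(code => code == ErrorCode.NotController ||
code == ErrorCode.RequestTimedOut))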
{
await Task.Delay(delay);
delay *= 2; // exponential backoff before the next attempt
}
}
throw new InvalidOperationException(
$"Failed to create topic '{spec.Name}' after {maxAttempts} attempts.");
}
The two catch clauses do different jobs: the first treats "topic already exists" as success, which matters for CI steps that might run more than once against the same environment; the second retries specifically on the two error codes that show up around a controller transition, rather than swallowing every possible failure. ⚠️ Common Mistake: catching KafkaException broadly and retrying on any error code — that also retries on genuine configuration errors (like an invalid replication factor) that will never succeed no matter how many times you retry them, turning a fast, clear failure into a slow, confusing one.
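To see what the doubling backoff costs a pipeline in the worst case, the delay schedule can be computed on its own. The sketch below is a standalone calculation, not part of the wrapper; BackoffDelays is a hypothetical helper, and it counts only the time spent sleeping between attempts, not the time each request itself takes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Delays slept between attempts when the delay doubles after each failure.
// Nothing is slept after the final attempt, so maxAttempts attempts
// produce maxAttempts - 1 delays.
static List<TimeSpan> BackoffDelays(TimeSpan initialDelay, int maxAttempts)
{
    var schedule = new List<TimeSpan>();
    var delay = initialDelay;
    for (var attempt = 1; attempt < maxAttempts; attempt++)
    {
        schedule.Add(delay);
        delay *= 2;
    }
    return schedule;
}

var delays = BackoffDelays(TimeSpan.FromMilliseconds(500), maxAttempts: 5);
var totalWaitMs = delays.Sum(d => d.TotalMilliseconds);

Console.WriteLine(string.Join(", ", delays.Select(d => $"{d.TotalMilliseconds} ms")));
Console.WriteLine($"Worst-case time spent waiting: {totalWaitMs} ms");
```

With the wrapper's values (500 ms initial delay, five attempts) the schedule is 500, 1000, 2000 and 4000 ms, so a provisioning step waits at most 7.5 seconds before giving up, comfortably longer than a sub-second controller election.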
Tuning Metadata Refresh and Connection Setup for Fast Recovery
Retry logic only helps if the client's view of the cluster is current enough for the retry to succeed. Two settings control how quickly a .NET client notices that broker or controller topology has changed: metadata.max.age.ms, which bounds how long the client will keep using a cached metadata snapshot before forcing a refresh, and socket.connection.setup.timeout.ms, which bounds how long the client waits for a TCP-level connection to a broker before giving up and trying another one.
Left at defaults tuned for a comparatively static ZooKeeper-era topology, these values can make a .NET client slower than the cluster itself to recover from a broker restart or controller change — the broker has already elected a new controller in well under a second, but the client is still holding onto minutes-old metadata or waiting a long time for a socket to a broker that isn't coming back soon. Tightening both gives the client a recovery speed that matches the cluster's:
var adminConfig = new AdminClientConfig
{
BootstrapServers = "broker1:9092,broker2:9092,broker3:9092",
SocketConnectionSetupTimeoutMs = 10000, // fail fast on an unreachable broker
MetadataMaxAgeMs = 10000 // metadata.max.age.ms: refresh cached metadata at least every 10 s
};
using var admin = new AdminClientBuilder(adminConfig).Build();
A shorter metadata.max.age.ms means slightly more background metadata-refresh traffic, which is a reasonable trade for a client that reacts to topology changes within seconds instead of minutes. 💡 Pro Tip: apply the same two settings to producer and consumer configs, not just AdminClient. librdkafka also refreshes metadata when a broker rejects a request with a not-leader error, so the interval is a backstop rather than the only trigger, but a consumer holding stale metadata about partition leadership can still waste time on a broker that's no longer the leader (or no longer reachable) until a refresh catches up.
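One way to keep the two settings consistent across client types is a small shared helper. This is a sketch under the assumption that all three config classes derive from Confluent.Kafka's ClientConfig base class; WithFastRecovery is a hypothetical name, and the 10-second values are the same illustrative choices used above.

```csharp
using System;
using Confluent.Kafka;

// Hypothetical helper: applies the same recovery-related settings to any
// client config (ProducerConfig, ConsumerConfig, AdminClientConfig).
static T WithFastRecovery<T>(T config) where T : ClientConfig
{
    config.MetadataMaxAgeMs = 10000;               // metadata.max.age.ms
    config.SocketConnectionSetupTimeoutMs = 10000; // socket.connection.setup.timeout.ms
    return config;
}

var tunedProducerConfig = WithFastRecovery(new ProducerConfig
{
    BootstrapServers = "localhost:9092"
});
var tunedConsumerConfig = WithFastRecovery(new ConsumerConfig
{
    BootstrapServers = "localhost:9092",
    GroupId = "example-group" // placeholder group id
});

Console.WriteLine(tunedProducerConfig.MetadataMaxAgeMs);
Console.WriteLine(tunedConsumerConfig.SocketConnectionSetupTimeoutMs);
```

No broker connection is made here; the helper only mutates configuration objects, so it can be exercised in a plain unit test.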
Edge Case: Provisioning Topics During an Active Controller Election
The retry wrapper above exists specifically for this situation: a CI or deployment step calls CreateTopicsAsync at the exact moment a controller election is in progress, and the broker it happens to be talking to either isn't the controller anymore or can't answer authoritatively yet. The client sees this as a KafkaException carrying a not-controller or request-timeout error code — not a malformed request, not a permissions problem, just a broker saying "ask me again in a moment."
Without the wrapper, that single unlucky request fails an entire deployment pipeline over a condition that resolves itself in well under a second. With it, the same request is retried after a short backoff and almost always succeeds on the second or third attempt, because by then the election has completed and some broker can answer definitively. This is a case where the fix isn't clever engineering — it's simply refusing to treat a known, transient, well-documented error code as equivalent to a fatal one.
Harder Variant: Provisioning Topics Mid Rolling Upgrade
A tougher version of the same problem shows up when a .NET-based topic-provisioning CI step runs against a cluster that's in the middle of a rolling upgrade or a broker restart cycle — not a single election, but a sustained period where different brokers are at different points in catching up on the metadata log. In that state, CreateTopicsAsync can return success from the broker that answered the request, but a different broker that a subsequent step in the same pipeline talks to (say, one performing a post-deploy sanity check) may not have caught up on the metadata log yet and briefly reports the topic as not found.
The retry-on-error-code pattern from before doesn't fully cover this, because the create call itself can appear to succeed — the danger here is a false negative in a later verification step, not a thrown exception in the creation step. The fix is to make the provisioning step confirm convergence rather than trusting a single success response:
static async Task VerifyTopicVisibleAsync(
IAdminClient adminClient,
string topicName,
int maxAttempts = 6)
{
var delay = TimeSpan.FromMilliseconds(500);
for (var attempt = 1; attempt <= maxAttempts; attempt++)
{
var metadata = adminClient.GetMetadata(topicName, TimeSpan.FromSeconds(5));
var topic = metadata.Topics.FirstOrDefault(t => t.Topic == topicName);
if (topic != null && topic.Error.Code == ErrorCode.NoError)
{
return; // topic is visible with no per-topic error
}
if (attempt < maxAttempts)
{
await Task.Delay(delay); // no point sleeping after the final attempt
delay *= 2;
}
}
throw new InvalidOperationException(
$"Topic '{topicName}' was created but is not yet visible cluster-wide.");
}
Calling CreateTopicWithRetryAsync followed by VerifyTopicVisibleAsync gives the pipeline two separate defenses: the first absorbs errors thrown during creation (the controller-election case), and the second absorbs the case where creation reports success but the cluster hasn't fully converged on that fact yet (the rolling-upgrade case). ⚠️ Skipping the verification step and treating a successful CreateTopicsAsync call as the end of the story is a common way CI pipelines pass even though a subsequent smoke test against a lagging broker fails moments later — the two failures look unrelated unless you know they trace back to the same metadata-log catch-up window.
Taken together, these patterns share one posture: assume the cluster will occasionally answer "not yet" rather than "yes" or "no," and write client code that treats "not yet" as a reason to wait and ask again, not as a reason to fail. That posture costs a handful of retry lines in producer, AdminClient, and CI code, and it's what turns KRaft's fast-but-nonzero failover into something a .NET service or deployment pipeline never even has to notice.
Common Mistakes .NET Teams Make with Modern Kafka Clusters
Migrating a cluster to KRaft is usually the easy part — running kafka-storage.sh format and setting process.roles is a well-documented mechanical step. The mistakes that actually bite .NET teams show up afterward, buried in manifests, package references, and CI pipelines that nobody revisits once the cluster "just works." This section walks through five of the most common failure patterns, each with the concrete fix.
Mistake 1: Zombie ZooKeeper Configuration ⚠️
The most common mistake isn't a KRaft misconfiguration at all — it's leftover ZooKeeper configuration that nobody deleted. Teams migrate bootstrap.servers in their .NET code, confirm producers and consumers still work, and call the migration done. But the docker-compose.yml, Helm chart, or Kubernetes manifest that stood up the old cluster often still contains a zookeeper.connect environment variable, a separate ZooKeeper container, and — critically — a readiness or liveness probe that pings ZooKeeper's port before marking the broker pod healthy.
❌ Wrong thinking: "The broker starts fine, so the leftover ZK block in the manifest is harmless dead weight." ✅ Correct thinking: A stale ZK health check is an active failure mode, not inert clutter — it can block pod readiness, fail CI health-check steps, or cause an orchestrator to restart a perfectly healthy broker because a service it no longer depends on isn't responding.
Concretely, this looks like a Kubernetes manifest with a leftover block like:
## ⚠️ Leftover from pre-KRaft manifest — this probe targets a ZooKeeper
## port that no longer exists once process.roles is broker,controller
readinessProbe:
tcpSocket:
port: 2181 # ZooKeeper's client port
initialDelaySeconds: 10
periodSeconds: 5
If the ZooKeeper container was removed from the compose file or StatefulSet but this probe wasn't, the pod never reports ready, and any .NET integration test or CI step that waits on pod readiness times out with an error that has nothing to do with Kafka itself — it looks like a networking problem. The fix is a deliberate audit: grep every compose file, Helm values file, and Kubernetes manifest for zookeeper, zkClient, and port 2181, and remove them alongside the --zookeeper flags in any shell scripts used by CI. Since the KRaft-vs-ZooKeeper client-facing differences are covered in "From ZooKeeper Habits to KRaft Reality," the point here is operational: these settings live outside your .NET code, in infrastructure files that are easy to forget during a migration.
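The audit described above can also be scripted so it runs in CI rather than relying on someone remembering to grep. The sketch below is one possible shape: FindZooKeeperLeftovers is a hypothetical helper, the sample manifest lines are invented for illustration, and the pattern covers only the terms named above, so extend it for your own files.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical audit helper: flags lines that still mention ZooKeeper.
static List<(int LineNumber, string Text)> FindZooKeeperLeftovers(IEnumerable<string> lines)
{
    var pattern = new Regex(@"zookeeper|zkclient|\b2181\b", RegexOptions.IgnoreCase);
    var hits = new List<(int LineNumber, string Text)>();
    var current = 0;
    foreach (var line in lines)
    {
        current++;
        if (pattern.IsMatch(line))
            hits.Add((current, line.Trim()));
    }
    return hits;
}

// Invented sample content standing in for a real manifest file.
var sampleManifest = new[]
{
    "readinessProbe:",
    "  tcpSocket:",
    "    port: 2181",
    "env:",
    "  - name: KAFKA_PROCESS_ROLES",
    "    value: broker,controller",
    "  - name: KAFKA_ZOOKEEPER_CONNECT",
    "    value: zk:2181"
};

var leftovers = FindZooKeeperLeftovers(sampleManifest);
foreach (var (number, text) in leftovers)
    Console.WriteLine($"line {number}: {text}");
```

Against real files, the same function can be fed System.IO.File.ReadLines(path) for each compose file, Helm values file, and manifest in the repository, failing the build when any hit is returned.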
Mistake 2: Treating controller.quorum.voters as a Formality ⚠️
It's tempting to fill in controller.quorum.voters with whatever number of nodes feels convenient — one for a quick dev setup, or an even number because it "seemed reasonable" — without registering that this setting is the fault-tolerance boundary of the cluster's metadata plane. The quorum mechanics themselves (Raft log, leader election, voters vs. observers) belong to the dedicated KRaft Architecture lesson, but the mistake worth flagging here is purely about the count teams configure.
A single controller voter gives you zero redundancy: if that one process dies, the cluster can't elect a new controller and metadata operations — topic creation, partition reassignment, ISR updates — stall until it's replaced. An even number of voters (say, four) doesn't buy you anything extra over three, because Raft-style quorums need a strict majority to make progress, and an even count only adds a node that raises the number of votes required for a majority without adding a corresponding tolerance improvement.
Concretely, three voters tolerate one failure, five tolerate two, but four voters tolerate only one failure — the same as three — while requiring an extra machine and an extra vote in every round. Teams sizing quorums by "more nodes must be safer" intuition, without checking that the number is odd, end up paying for hardware that doesn't buy them anything.
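The arithmetic behind those numbers is short enough to check directly. This is a minimal sketch of strict-majority math only (the function names are illustrative); it says nothing about how KRaft implements elections.

```csharp
using System;

// Votes needed for a strict majority of the configured voters.
static int MajoritySize(int voters) => voters / 2 + 1;

// Controller failures the quorum can absorb while still reaching a majority.
static int ToleratedFailures(int voters) => voters - MajoritySize(voters);

foreach (var voters in new[] { 1, 3, 4, 5 })
{
    Console.WriteLine(
        $"{voters} voters: majority = {MajoritySize(voters)}, " +
        $"tolerated failures = {ToleratedFailures(voters)}");
}
// 1 voters: majority = 1, tolerated failures = 0
// 3 voters: majority = 2, tolerated failures = 1
// 4 voters: majority = 3, tolerated failures = 1
// 5 voters: majority = 3, tolerated failures = 2
```

Going from three voters to four raises the majority from two votes to three while the tolerated failure count stays at one, which is the whole case against even-sized quorums.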
## ✅ Three-voter quorum: tolerates one controller failure
controller.quorum.voters=1@ctrl-1:9093,2@ctrl-2:9093,3@ctrl-3:9093
## ❌ Single voter: any controller loss halts metadata operations
controller.quorum.voters=1@ctrl-1:9093
💡 Pro Tip: Treat controller.quorum.voters the same way you'd treat a database replica count — pick an odd number (three is the common baseline for most clusters, five for clusters that need to tolerate two simultaneous failures), and revisit it any time you change the number of controller-eligible nodes.
Mistake 3: Pinning a Pre-KRaft Client Library Version ⚠️
.NET teams often pin Confluent.Kafka (and transitively, librdkafka) to a specific version in a .csproj file and leave it there for a long time, especially in services that "work fine" and don't get touched. When that pinned version predates the client library's KRaft-aware protocol negotiation, pointing it at a KRaft-mode broker can produce confusing symptoms: connection failures that look like network issues, UNSUPPORTED_VERSION errors from the AdminClient, or metadata requests that silently return stale or incomplete broker lists.
The underlying issue is API version negotiation. Kafka's wire protocol evolves per-API, and each client and broker advertise which versions of each API they support; the client picks the highest mutually supported version. Older librdkafka builds shipped before controller and admin API changes tied to KRaft-mode clusters existed, and may make assumptions about broker behavior that no longer hold — for example, expecting error codes or timing characteristics tied to ZooKeeper-era controller failover.
<!-- ❌ Old, unmaintained pin left in a .csproj for months -->
<PackageReference Include="Confluent.Kafka" Version="1.4.2" />
<!-- ✅ A currently maintained version, verified against your broker version -->
<PackageReference Include="Confluent.Kafka" Version="2.*" />
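The negotiation rule described above (pick the highest version both sides support, per API) can be sketched in a few lines. This is a simplified model for intuition, not librdkafka's implementation; NegotiateVersion is a hypothetical function and the version ranges are made up rather than taken from any real API.

```csharp
using System;

// Returns the highest API version both sides support, or null when the
// advertised ranges do not overlap. Ranges are inclusive [Min, Max].
static int? NegotiateVersion((int Min, int Max) client, (int Min, int Max) broker)
{
    var lowest = Math.Max(client.Min, broker.Min);
    var highest = Math.Min(client.Max, broker.Max);
    return highest >= lowest ? highest : (int?)null;
}

// Illustrative ranges only.
var overlapping = NegotiateVersion(client: (0, 4), broker: (2, 7));
var disjoint = NegotiateVersion(client: (0, 1), broker: (3, 7));

Console.WriteLine($"Overlapping ranges: {overlapping}");
Console.WriteLine($"Disjoint ranges: {disjoint?.ToString() ?? "no common version"}");
```

The second case is the one an old pin risks: when the broker has dropped the versions an old client still speaks, there is no common version for that API, and the request fails with an unsupported-version style error rather than quietly degrading.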
⚠️ Common Mistake: assuming that because the producer and consumer paths still work, the client library is fully compatible. Simple produce/consume paths are often the most backward-compatible part of the protocol; AdminClient operations and metadata-heavy paths are where version mismatches surface first. The negotiated-API-versions check described in "Spinning Up and Targeting a KRaft Cluster from a .NET Project" is the concrete way to confirm compatibility rather than guessing from symptoms.
💡 Real-World Example: A team running a nightly CI job that calls CreateTopicsAsync against a freshly upgraded KRaft cluster starts seeing intermittent KafkaException failures with obscure error codes, while the same team's produce/consume smoke tests pass without issue. The root cause traces back to a Confluent.Kafka version older than the cluster's Kafka release — the AdminClient's admin protocol assumptions don't match what the newer broker advertises, and updating the package resolves it.
Mistake 4: Sizing Local Retention as If Tiered Storage Is Already On ⚠️
A subtler mistake shows up when teams read about tiered storage — the split between fast local disk for recent segments and cheaper remote object storage (such as S3-compatible storage) for older ones — and assume it changes their retention math automatically. Tiered storage concepts and configuration are covered in their own dedicated lesson, but the .NET-facing mistake worth calling out here is specific: configuring retention.ms or retention.bytes as if remote offload is already happening, when the broker's tiered storage feature was never actually enabled or the remote storage plugin was never configured.
Concretely, this looks like a dev or staging environment where someone set a generous 30-day retention policy for a high-throughput topic, reasoning that "old segments will offload to cheap storage anyway," without checking that tiered storage was actually turned on: remote.log.storage.system.enable at the broker level and remote.storage.enable on the topic (or the equivalent config for the tiered storage implementation in use). Every byte instead accumulates entirely on local disk, and the dev broker's volume fills up. This is often first noticed when a .NET producer starts seeing produce requests fail or time out with storage-related errors, or when the broker takes its log directory offline because the disk is full.
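// A .NET health-check style diagnostic: don't assume tiered storage is
// active just because someone intended to configure it. Check what the
// broker actually reports for the topic before trusting retention math
// that depends on offload.
using System;
using Confluent.Kafka;
using Confluent.Kafka.Admin;
using var admin = new AdminClientBuilder(new AdminClientConfig
{
BootstrapServers = "localhost:9092"
}).Build();
// DescribeConfigsAsync is the async admin call; the timeout goes in the options object
var topicConfigs = await admin.DescribeConfigsAsync(
new[] { new ConfigResource { Type = ResourceType.Topic, Name = "orders" } },
new DescribeConfigsOptions { RequestTimeout = TimeSpan.FromSeconds(10) });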
// Inspect the returned config entries for the tiered-storage-enable flag
// before assuming local retention.ms accounts for remote offload.
foreach (var config in topicConfigs)
{
foreach (var entry in config.Entries.Values)
{
if (entry.Name.Contains("remote.storage", StringComparison.OrdinalIgnoreCase))
{
Console.WriteLine($"{entry.Name} = {entry.Value}");
}
}
}
This snippet doesn't turn tiered storage on — it's a verification step, checking what the broker actually reports rather than trusting what a teammate assumed was configured. The fix for the underlying mistake is procedural: size local retention (or local.retention.ms when tiered storage is genuinely active) based on confirmed broker-side state, not on the theoretical existence of the feature, and treat dev/CI environments — which frequently skip the object storage setup entirely for simplicity — as local-disk-only unless proven otherwise.
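A back-of-the-envelope estimate makes the disk-fill risk concrete. The sketch below assumes a hypothetical workload (1 MB/s of writes, 30-day retention) and deliberately ignores compression, index files, and segment-rolling granularity; it estimates one replica's footprint when nothing is offloaded.

```csharp
using System;

// Rough per-replica estimate of the bytes that accumulate on a broker's
// local disk for one topic when nothing is offloaded to remote storage.
static double LocalBytesRetained(double ingressBytesPerSecond, TimeSpan retention) =>
    ingressBytesPerSecond * retention.TotalSeconds;

// Hypothetical workload: 1 MB/s of writes with a 30-day retention.ms.
var retainedBytes = LocalBytesRetained(1_000_000, TimeSpan.FromDays(30));
var retainedTerabytes = retainedBytes / 1e12;

Console.WriteLine($"Approximate local footprint per replica: {retainedTerabytes:F2} TB");
```

That works out to roughly 2.6 TB for a single modest topic, far more than a typical dev container volume, which is why the retention value has to be chosen from confirmed broker state rather than from what production does.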
Mistake 5: Treating KRaft as Invisible Plumbing and Skipping Integration Tests ⚠️
The last mistake is the most conceptual, but it's the one that causes production incidents rather than dev annoyances. Because the client-facing API surface barely changes — BootstrapServers still works the same way, produce and consume calls look identical — it's easy to conclude that KRaft is purely an internal implementation detail that a .NET service never needs to test against directly.
❌ Wrong thinking: "We already have integration tests against a Kafka container from before the KRaft migration; since the client code didn't change, those tests still validate our production behavior." ✅ Correct thinking: Integration tests written and last validated against a ZooKeeper-mode cluster can pass while missing real behavioral differences that only appear under KRaft, particularly around AdminClient timing and error codes.
The concrete gap is in admin API error codes and timing, which the "Writing Resilient .NET Clients for KRaft-Based Clusters" section covers in depth for the retry patterns — but the mistake at the team-process level is skipping the step of actually running those tests against a KRaft-mode container at all. A team that never updated its Testcontainers image from an old ZooKeeper-based Kafka image to a KRaft-mode one may never observe the faster-but-different controller failover timing, or the specific exception shapes thrown when an admin call races an in-progress controller change, until that exact race happens in production during a rolling upgrade.
🎯 Key Principle: If your .NET service calls the AdminClient for anything beyond trivial reads — topic creation, partition reassignment, config updates — your integration test suite should run against the same KRaft-mode topology (single combined node is fine for most cases) that production actually uses, not a leftover ZK-era container image kept around out of inertia.
Each of these five mistakes shares a common shape: something that used to be true under ZooKeeper — a health check target, a client library assumption, a retention calculation, a belief that internals don't matter — quietly stops being true under KRaft, and nothing forces a team to notice until it fails. The fix in every case is the same discipline: audit infrastructure files explicitly rather than assuming they were updated, verify broker-side state rather than trusting configuration intent, and test against the topology you actually run in production.
Modern Kafka Mental Model: Recap and Where to Go Next
At this point you've walked through the config-level differences between ZooKeeper-era and KRaft-era clusters, spun up a KRaft node from a .NET solution, and hardened a producer/consumer/AdminClient trio against the new failure modes. The mental model shift underlying all of that work is simple to state but easy to forget under deadline pressure: your .NET code has exactly one door into the cluster, and ZooKeeper was never that door for application traffic — it's just gone as an option entirely now.
The One-Door Model
In the ZooKeeper era, a surprising amount of tooling and mental overhead existed because two systems needed to be reasoned about: the Kafka brokers themselves, and the ZooKeeper ensemble tracking controller election, broker registration, and topic metadata. Even though Confluent.Kafka producers and consumers never spoke the ZooKeeper protocol directly, plenty of adjacent work did — health checks, migration scripts, and diagnostic tooling that shelled out to zkCli.sh or passed --zookeeper flags. That second system is what's been removed, not a second connection string your client code used to maintain.
Under KRaft, there is exactly one entry point for everything a .NET service does against the cluster:
Your .NET service
↓
bootstrap.servers (initial connection, protocol-level only)
↓
Kafka protocol requests: produce, fetch, metadata, admin RPCs
↓
Brokers (some of which may also serve as controllers)
Every operation (producing a message, consuming a partition, or calling IAdminClient.DescribeClusterAsync()) travels over this same path. There is no second protocol, no second port, and no second client library to add. If you find yourself reaching for a ZooKeeper client NuGet package or a zookeeper.connect setting in a modern Kafka project, that's a signal you're solving a problem that no longer exists in this architecture: the concept was retired, not merely deprecated, as covered in "From ZooKeeper Habits to KRaft Reality: What Client Code Needs to Know."
Quick-Reference Checklist for a .NET Project
Before treating a Kafka integration as production-ready, it's worth running down a short checklist that consolidates what changes because the cluster runs KRaft. This isn't a replacement for the deeper walkthroughs earlier in the lesson — it's the fast pass you run during a code review or a pre-deployment sanity check.
| ✅ Check | 🔍 What to verify | 🛠️ How to verify it from .NET |
|---|---|---|
| 🆔 cluster.id present | Cluster was formatted with kafka-storage.sh, not left default | (await IAdminClient.DescribeClusterAsync()).ClusterId |
| 🎭 process.roles correct | Broker/controller roles match topology intent | Docker/K8s manifest review, not a client call |
| 📦 Client version verified | Confluent.Kafka/librdkafka negotiates expected API versions | ApiVersionsRequest trace or broker logs |
| ⏱️ Resilience timeouts reviewed | delivery.timeout.ms, metadata.max.age.ms set deliberately | Config review against defaults |
The first item, cluster.id, deserves a quick word here even though "From ZooKeeper Habits to KRaft Reality" owns the deep explanation: if DescribeClusterAsync() ever returns an empty or unexpected cluster ID in a staging environment, that's usually a sign someone pointed a client at an unformatted or misconfigured node rather than a genuine client bug, and it's worth ruling out before you spend time debugging retry logic.
Here's a small C# health-check style snippet that exercises the checklist items a client can observe directly (cluster identity, plus a rough compatibility smoke test) in one shot. It's the kind of probe you might run in a startup diagnostic or a CI smoke test rather than in the hot path of message processing:
using Confluent.Kafka;
using Confluent.Kafka.Admin;
var adminConfig = new AdminClientConfig
{
BootstrapServers = "localhost:9092" // the one door: no zookeeper.connect exists here
};
using var admin = new AdminClientBuilder(adminConfig).Build();
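// 1. Confirms the client can reach a broker and fetch cluster metadata.
// Metadata carries no cluster ID; OriginatingBrokerId is the broker that answered.
var metadata = admin.GetMetadata(TimeSpan.FromSeconds(5));
Console.WriteLine($"Metadata served by broker {metadata.OriginatingBrokerId} ({metadata.OriginatingBrokerName})");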
// 2. DescribeCluster gives an explicit, typed view of controller + cluster identity
var description = await admin.DescribeClusterAsync(new DescribeClusterOptions
{
RequestTimeout = TimeSpan.FromSeconds(5)
});
Console.WriteLine($"Cluster ID: {description.ClusterId}");
Console.WriteLine($"Controller: {description.Controller.Host}:{description.Controller.Port}");
// 3. Broker count and reachability double as a rough version/compatibility smoke test —
// a client built against an incompatible protocol version would fail here, not later
Console.WriteLine($"Reachable brokers: {description.Nodes.Count}");
This code doesn't replace the dedicated version-compatibility inspection shown in "Spinning Up and Targeting a KRaft Cluster from a .NET Project" — it's a lightweight gate you can drop into a startup routine or CI job so a misconfigured cluster fails fast with a clear message instead of surfacing as a mysterious timeout three services downstream.
Resilience Configuration Is the Default Posture, Not a Special Case
It's tempting to treat the retry/backoff and idempotent-producer patterns from "Writing Resilient .NET Clients for KRaft-Based Clusters" as extra hardening you add once a service is already important enough to justify the effort. That framing gets the risk backwards: those settings — enable.idempotence=true, deliberate delivery.timeout.ms and metadata.max.age.ms values, and a catch-and-retry wrapper around AdminClient calls — should be the starting configuration for any .NET service talking to a modern Kafka cluster, not an upgrade path reserved for services that have already had an incident.
The reasoning is architectural, not just cautious: KRaft's controller failover is fast, but "fast" still means a real window during which metadata can be stale or an AdminClient call can hit a not-controller error. A service that only adds retry logic after its first production outage has, by definition, already paid the cost that the default posture was designed to avoid. Applying the resilient configuration from the start costs a handful of config lines; retrofitting it after an incident costs an incident.
// This is the same defensive shape introduced in "Writing Resilient .NET Clients
// for KRaft-Based Clusters" — repeated here to underline that it's a baseline,
// not an advanced option reserved for high-traffic services.
var producerConfig = new ProducerConfig
{
BootstrapServers = "localhost:9092",
EnableIdempotence = true,
Acks = Acks.All,
MaxInFlight = 5,
MessageTimeoutMs = 30000,
// Shorter than the ZK-era defaults many teams copy from old blog posts —
// KRaft's faster controller election means clients can afford to give up
// and retry sooner rather than waiting out a long stale-metadata window.
MetadataMaxAgeMs = 60000
};
⚠️ Common Mistake: copying producer/consumer configuration values from an older ZooKeeper-era reference project without revisiting the timeout values. Those defaults were often tuned around ZooKeeper session timeouts that no longer apply, and leaving them unexamined means a KRaft cluster's faster recovery characteristics never actually benefit your service — the client is still waiting as long as it always did.
Where the Deeper Internals Live
This lesson deliberately stayed at the boundary between your .NET code and the cluster: configuration keys, client behavior, AdminClient calls, and the resilience settings that follow from KRaft's faster metadata propagation. Two pieces were flagged along the way as belonging to their own lessons, and it's worth being explicit about why that split makes sense rather than treating it as an arbitrary boundary.
The controller quorum itself (how the voters listed in controller.quorum.voters reach consensus, what a Raft log entry looks like for a partition reassignment, and how leader election unfolds among voters and observers) is a distributed-systems topic in its own right, and it's covered in depth in the KRaft Architecture lesson. Understanding it deeply isn't required to configure a .NET client correctly, but it becomes valuable once you're debugging quorum sizing decisions or reasoning about failure tolerance during a rolling upgrade. That is why "Common Mistakes .NET Teams Make with Modern Kafka Clusters" calls out an even-numbered or single-node voter set as a red flag without re-deriving the Raft mechanics behind it.
Similarly, the split between local (hot) storage on broker disks and remote (cold) storage in an object store like S3, commonly referred to as tiered storage, changes how you think about retention, disk sizing, and read latency for older data, but it doesn't change how your producer or consumer code is written. That full treatment, including how offload timing interacts with retention policy and how consumers transparently read from remote tiers, lives in the Tiered Storage Concepts lesson. The connection back to this lesson is narrow but important: a dev environment that copies retention settings sized for a tiered-storage production cluster can fill its disk unexpectedly, because tiered storage was never enabled locally and every retained byte stays on the local disk. That gap is already called out in "Common Mistakes .NET Teams Make with Modern Kafka Clusters."
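One way to guard against that dev-versus-production gap is to derive topic retention overrides from whether the environment actually has a remote tier. The sketch below is an assumption-laden illustration: DevTopicConfig and Build are hypothetical names, and capping dev retention at the local budget is one possible policy rather than a recommendation from the Tiered Storage Concepts lesson. The config keys themselves (retention.ms, local.retention.ms, remote.storage.enable) are standard Kafka topic-level settings.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper: builds topic-level config overrides that differ
// depending on whether the target cluster has tiered storage enabled.
public static class DevTopicConfig
{
    public static Dictionary<string, string> Build(
        bool tieredStorageEnabled,
        TimeSpan totalRetention,
        TimeSpan localRetention)
    {
        var configs = new Dictionary<string, string>();

        if (tieredStorageEnabled)
        {
            // Remote tier available: keep the long total retention, but hold
            // only the shorter local window on broker disks.
            configs["remote.storage.enable"] = "true";
            configs["retention.ms"] = ToMs(totalRetention);
            configs["local.retention.ms"] = ToMs(localRetention);
        }
        else
        {
            // No remote tier: everything retained lives on the broker disk,
            // so cap total retention at the local budget (illustrative policy).
            var capped = totalRetention < localRetention ? totalRetention : localRetention;
            configs["retention.ms"] = ToMs(capped);
        }

        return configs;
    }

    private static string ToMs(TimeSpan span) =>
        ((long)span.TotalMilliseconds).ToString(CultureInfo.InvariantCulture);
}
```

The resulting dictionary can be assigned to the Configs property of a Confluent.Kafka TopicSpecification when a topic is created through IAdminClient, so the same topic-creation code yields disk-safe settings in a local environment without a remote tier.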
🎯 Key Principle: this lesson's checklist is a starting heuristic for catching the most common KRaft-related misconfigurations quickly — it is not an exhaustive audit, and cluster-specific topologies (multi-datacenter controller placement, custom authentication layers, managed-service abstractions) can introduce failure modes the four checklist items won't catch.
Practical Next Steps
Three concrete actions turn this recap into something you actually apply rather than just remember conceptually. First, audit one existing .NET service's Kafka client configuration against the four-item checklist above — most teams find at least one leftover ZooKeeper-era assumption, whether it's a stale health check, an unreviewed timeout, or an unpinned client library version. Second, if your integration tests still spin up a ZooKeeper container alongside a Kafka broker, replace that setup with a single combined process.roles=broker,controller node as shown in "Spinning Up and Targeting a KRaft Cluster from a .NET Project" — it's a smaller Docker Compose file and a faster test startup. Third, before your next production deployment touching Kafka, deliberately apply the idempotent-producer-plus-retry-wrapper pattern from "Writing Resilient .NET Clients for KRaft-Based Clusters" even if the service has never had a Kafka-related incident — treat it as the default, not the fix.
💡 Remember: the architectural headline of this lesson is that ZooKeeper's removal simplified the cluster's internals without changing the shape of your application code — the door your .NET service walks through was always bootstrap.servers and the Kafka protocol, and KRaft just closed off every other door that used to exist alongside it.