Build Kafka (13 scenes)
Scene 4.5 · Cluster, controller, and metadata
One controller per cluster; KRaft made the metadata itself a Raft log.
Previously

ISR membership, leader election, and topic config all need a single source of truth. One broker at a time is the controller — and KRaft made the metadata itself a replicated log, so Kafka has no external dependency.

Diagram
A small cluster of brokers with leadership badges showing which one is the active controller. Partition replicas are scattered across brokers; the controller tracks the metadata (which replica leads which partition, who's in ISR) and replicates that metadata as a Raft log to the other controllers in KRaft mode.
Diagram labels: a producer writes to topic "events". Broker 1 is the partition LEADER (LEO=0, HW=0). Brokers 2 and 3 are followers that fetch from it (ISR ?, LEO=0). Controller (KRaft): broker-1. Failover candidates: Broker 2, Broker 3.

METADATA LAYER: KRaft vs ZooKeeper (this is what KIP-500 replaced). Currently: KRaft.

KRaft (KIP-500, modern), INSIDE the Kafka cluster
- Controllers C1, C2, C3 with Raft replication (same process tree)
- 1 system to operate: Kafka only, no separate ensemble
- Sub-second failover: controller hop in ~milliseconds
- Scales to millions of partitions per cluster
- Single protocol: Raft consensus inside Kafka

ZooKeeper (legacy, pre-KIP-500), OUTSIDE the Kafka cluster
- ZK ensemble watched by Kafka: two clusters, two ops teams
- 2 systems to operate: Kafka + a 3–5 node ZK ensemble
- Seconds-level failover: metadata reload from ZK
- ~200k partition ceiling: ZK watcher storms at high counts
- Two protocols: Kafka wire + ZAB (ZK's protocol)
Why this matters: a Kafka cluster is many brokers, but something has to decide which broker leads which partition, who is in the in-sync replica set, and which topics and configs exist. That something is the controller: a role that one broker holds at a time. The top-right badge shows that Broker 1 currently holds it. The producer writes a few records to the partition leader (the data plane); the controller (the control plane) is not on that write path and acts only when metadata changes.

The big panel at the bottom shows, side by side, the two ways Kafka has stored its metadata. KRaft (left card) keeps it inside the Kafka cluster as a Raft log replicated across 3 controllers: one system, one protocol. ZooKeeper (right card) kept it outside the cluster in a 3-to-5-node ZK ensemble that you had to operate separately: two systems, two protocols. KIP-500 (proposed in 2019; KRaft was declared production-ready in Kafka 3.3, and ZooKeeper mode was removed in Kafka 4.0) replaced ZK with KRaft for the reasons listed on the cards.
Implementation
Controller.onBrokerFail
the active controller reacts when a broker stops heartbeating
def onBrokerFail(brokerId):
    # assumes brokerId has already been removed from liveBrokers,
    # otherwise electLeader could pick the failed broker again
    affected = [
        p for p in partitions
        if p.leader == brokerId
    ]
    for p in affected:
        newLeader = electLeader(p)
        record = PartitionChangeRecord(
            partition=p.id,
            leader=newLeader,
            leaderEpoch=p.epoch + 1,
        )
        metadataLog.append(record)
    # brokers fetch the new record and update their metadata cache
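The handler above can be made runnable with a small in-memory model. This is a sketch under stated assumptions, not Kafka's implementation: `Partition`, `PartitionChangeRecord`, `Controller`, `BrokerCache`, and the `NO_LEADER` sentinel are hypothetical names, the metadata log is a plain Python list, and the election rule is reduced to "first ISR member that is still alive". It shows both halves of the flow: the controller appends one record per affected partition with a bumped epoch, and a broker replays the log from its last offset into a local cache.

```python
from dataclasses import dataclass, field

NO_LEADER = -1  # hypothetical sentinel meaning "partition offline"


@dataclass
class Partition:
    id: str
    isr: list[int]  # in-sync replica ids, in preference order
    leader: int
    epoch: int = 0


@dataclass(frozen=True)
class PartitionChangeRecord:
    partition: str
    leader: int
    leader_epoch: int


@dataclass
class Controller:
    partitions: list[Partition]
    live_brokers: set[int]
    metadata_log: list[PartitionChangeRecord] = field(default_factory=list)

    def on_broker_fail(self, broker_id: int) -> None:
        # Mark the broker dead first so the election cannot pick it again.
        self.live_brokers.discard(broker_id)
        for p in self.partitions:
            if p.leader != broker_id:
                continue
            # Simplified election: first ISR member that is still alive.
            new_leader = next(
                (r for r in p.isr if r in self.live_brokers), NO_LEADER
            )
            record = PartitionChangeRecord(p.id, new_leader, p.epoch + 1)
            self.metadata_log.append(record)
            # Keep the controller's own view in step with what it logged.
            p.leader, p.epoch = record.leader, record.leader_epoch


@dataclass
class BrokerCache:
    next_offset: int = 0  # first log offset this broker has not applied yet
    leaders: dict[str, tuple[int, int]] = field(default_factory=dict)

    def catch_up(self, log: list[PartitionChangeRecord]) -> None:
        # Apply every record appended since the last catch-up, in log order.
        for record in log[self.next_offset:]:
            self.leaders[record.partition] = (record.leader, record.leader_epoch)
        self.next_offset = len(log)


controller = Controller(
    partitions=[
        Partition("events-0", isr=[1, 2, 3], leader=1),
        Partition("events-1", isr=[2, 3, 1], leader=2),
    ],
    live_brokers={1, 2, 3},
)
cache = BrokerCache()
controller.on_broker_fail(1)
cache.catch_up(controller.metadata_log)
print(cache.leaders)  # {'events-0': (2, 1)}
```

Only `events-0` was led by broker 1, so only one record is written. `events-1` keeps its leader and epoch, which is why the cache never hears about it.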
Controller.electLeader
pick a new leader from replicas still in the ISR
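def electLeader(p):
    # clean election: first ISR member that is still alive
    for replicaId in p.isr:
        if replicaId in liveBrokers:
            return replicaId
    # no live replica is left in the ISR; only out-of-sync replicas remain
    if uncleanLeaderElectionEnable:  # unclean.leader.election.enable
        candidates = set(p.replicas) & liveBrokers
        if candidates:
            return next(iter(candidates))  # may lose committed records
    return NO_LEADER  # partition goes offline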
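The election rule is easier to reason about as a pure function with its inputs passed in. The following is a minimal sketch, assuming a hypothetical `elect_leader` signature in which `unclean_enabled` stands in for the `unclean.leader.election.enable` setting and `None` stands in for `NO_LEADER`. It is not Kafka's code.

```python
from typing import Optional, Sequence, Set


def elect_leader(
    replicas: Sequence[int],
    isr: Sequence[int],
    live_brokers: Set[int],
    unclean_enabled: bool = False,
) -> Optional[int]:
    """Return the new leader's broker id, or None if the partition goes offline."""
    # Clean election: the first in-sync replica that is still alive.
    for replica_id in isr:
        if replica_id in live_brokers:
            return replica_id
    # No live ISR member. An out-of-sync replica can only take over
    # if unclean election is explicitly allowed.
    if unclean_enabled:
        for replica_id in replicas:
            if replica_id in live_brokers:
                return replica_id
    return None


# Broker 1 (the old leader) is down; broker 2 is still in the ISR.
print(elect_leader([1, 2, 3], isr=[1, 2], live_brokers={2, 3}))  # 2

# The ISR held only broker 1, so a clean election finds nobody.
print(elect_leader([1, 2, 3], isr=[1], live_brokers={2, 3}))  # None

# With unclean election on, an out-of-sync replica takes over.
print(elect_leader([1, 2, 3], isr=[1], live_brokers={2, 3}, unclean_enabled=True))  # 2
```

The third call is the availability-versus-durability trade: broker 2 was not in the ISR, so it may be missing records the old leader had already committed, and those records are lost once it becomes leader.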
MetadataLog.append
KRaft: metadata changes are a Raft log on the controller quorum
def append(record):  # record is a MetadataRecord
    if mode == 'kraft':
        # __cluster_metadata: 3-5 controller voters
        offset = raft.appendToQuorum(record)
        raft.waitForCommit(offset)
    else:  # zookeeper (legacy)
        zk.write(path, record.payload)
        zk.notifyWatchers()
    # in KRaft, every broker tails the log via the fetch protocol;
    # this call stands in for brokers updating their metadata caches
    broadcastToBrokerCaches(record)
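What "wait for commit" means on the KRaft path can be shown with a toy quorum. This is a sketch under assumptions, not the KRaft implementation: `MetadataQuorum` and its methods are hypothetical, the first voter is hard-wired as leader, there are no terms or elections, and `replicate_to` simply marks a voter as having fetched everything so far. The point it illustrates is that a record counts as committed once a majority of voters hold it, and that brokers tailing the log only ever see committed records.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataQuorum:
    voters: list[int]  # controller node ids; voters[0] is the leader here
    log: list[str] = field(default_factory=list)
    acked: dict[int, int] = field(init=False)  # voter id -> records it holds

    def __post_init__(self) -> None:
        self.acked = {v: 0 for v in self.voters}

    def append(self, record: str) -> int:
        # The leader writes locally first; the record is not committed yet.
        self.log.append(record)
        self.acked[self.voters[0]] = len(self.log)
        return len(self.log) - 1

    def replicate_to(self, voter: int) -> None:
        # Stand-in for a follower controller fetching up to the log end.
        self.acked[voter] = len(self.log)

    def committed_count(self) -> int:
        # Records held by at least a majority of voters.
        majority = len(self.voters) // 2 + 1
        return sorted(self.acked.values(), reverse=True)[majority - 1]

    def fetch(self, from_offset: int) -> list[str]:
        # Brokers tail the log but never read past the commit point.
        return self.log[from_offset:self.committed_count()]


quorum = MetadataQuorum(voters=[1, 2, 3])
quorum.append("events-0 leader=2 epoch=1")

print(quorum.fetch(0))  # [] : only the leader has the record
quorum.replicate_to(2)
print(quorum.fetch(0))  # ['events-0 leader=2 epoch=1'] : 2 of 3 voters hold it
```

With three voters the quorum tolerates one controller failure and with five it tolerates two, which matches the 3-5 voter range mentioned in the pseudocode comment.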