Build Raft — consensus you can defend (12 scenes)
Scene 09 · Reads — ReadIndex and lease
Naive leader-reads break linearizability under partition. ReadIndex (commit barrier with no-op-on-election precondition) restores it; lease reads buy back the heartbeat round in exchange for a bounded-clock-skew assumption.
Previously
Scene 8 closed the safety arc — committed entries survive crashes, membership changes are safe, and log truncation preserves Log Matching. The READ path looks like it should already be safe (it's the leader answering, after all) — but the next surprise is that a naive read from 'the leader' silently breaks linearizability under a network partition.
Diagram
A 3-server cluster (S1 = the partitioned old leader, S2 = the new real leader, S3 = follower) with a partition wall between S1 and {S2, S3}. The slider picks which read protocol is in use. NAIVE: S1 answers from local memory — STALE chip lights up. READINDEX: S2 captures its commitIndex as a barrier, runs a heartbeat round to prove it's still leader, waits for its state machine to catch up, then answers — SAFE chip; gated on the no-op-on-election bar that scene 10 explains. LEASE: S2 holds a green time-bar lease; reads inside the bar cost zero RPCs, but a GC-pause toggle lets the bar overshoot real wall-clock and a stale read sneaks through.
Sources
- paper: In Search of an Understandable Consensus Algorithm (§8)
- paper: Consensus: Bridging Theory and Practice (Chapter 6, §6.4)
- paper: Linearizability: A Correctness Condition for Concurrent Objects
- blog: Leader Leases: efficient linearizable reads in CockroachDB
- blog: TiKV: Lease Read
- code: etcd-io/raft (ReadIndex + ReadOnly)
Here's something that catches even experienced engineers off guard. Suppose you're using etcd to look up a config value, and you ask the leader. The leader answers from its own memory. Sounds safe — it's the leader, right?
Now imagine the network has partitioned the leader (call it S1) from a majority of the cluster, and a new leader (S2) has already been elected on the other side. Our partitioned 'leader' S1 doesn't know it's been deposed yet: no message carrying the higher term has reached it (and, where a check-quorum timer is in use, that timer hasn't fired), so from its own point of view it's still in charge. It happily answers our read with a stale value while a newer write has already been committed by the new leader. We just got a **stale-leader read**, and we broke **linearizability** (the property that every read sees the result of every write that completed before it, as if the whole system were a single machine processing one operation at a time). Watch S1 hand the client `x=1` while S2 has already committed `x=2` on {S2, S3}.
Implementation
Leader.read (naive)
the broken baseline — read from local state, no quorum check
    # NAIVE: leader serves reads from local state with no quorum check.
    def on_client_read(key) (leader):
        return state_machine.get(key)

    # BUG: a partitioned old leader still believes it leads.
    # It hasn't seen a higher term yet, so it has no way to
    # know a new leader has already been elected and committed
    # a newer write. Stale read served. Linearizability broken.
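To make the failure concrete, here is a minimal runnable sketch of the partition scenario above. It is a toy model, not Raft: the `Server` class and `run_partition_scenario` helper are hypothetical names introduced for illustration, and the election and replication are simulated by direct assignment rather than by RPCs.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Toy replica: a term, a leadership belief, and a key-value state machine."""
    name: str
    term: int
    is_leader: bool
    state: dict = field(default_factory=dict)

    def naive_read(self, key):
        # Naive rule: "I believe I am leader, so my local state is current."
        if not self.is_leader:
            raise RuntimeError(f"{self.name} is not leader")
        return self.state.get(key)


def run_partition_scenario():
    # Term 1: S1 leads and x=1 is committed on all three servers.
    s1 = Server("S1", term=1, is_leader=True, state={"x": 1})
    s2 = Server("S2", term=1, is_leader=False, state={"x": 1})
    s3 = Server("S3", term=1, is_leader=False, state={"x": 1})

    # Partition: S1 is cut off. S2 wins an election in term 2 with S3's vote.
    s2.term, s2.is_leader = 2, True
    s3.term = 2
    # S1 receives no message carrying term 2, so it never steps down.

    # S2 commits x=2 on the majority {S2, S3}; the write is acknowledged.
    for server in (s2, s3):
        server.state["x"] = 2

    # A client still routed to S1 reads AFTER that write completed.
    return s1.naive_read("x"), s2.naive_read("x")


stale, fresh = run_partition_scenario()
print(f"S1 (deposed) answers x={stale}; S2 (real leader) answers x={fresh}")
```

The read from S1 returns a value that was overwritten by a write that had already completed, which is exactly the ordering linearizability forbids.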
Leader.read (ReadIndex)
§6.4 commit-barrier read, gated on no-op-on-election
    def on_client_read(key) (leader):
        # precondition: leader must have committed an entry of its
        # CURRENT term (the no-op-on-election). Otherwise commitIndex
        # may be stale and ReadIndex would under-report.
        if not has_committed_entry_in_current_term():
            return defer                       # buffered in pendingReadIndexMessages
        read_index = commit_index              # 1. snapshot barrier
        acks = { self }                        # 2. confirm leadership
        for peer in cluster_minus_self:
            send AppendEntries(heartbeat) -> peer
        wait until |acks| >= majority
        wait until last_applied >= read_index  # 3. apply barrier
        return state_machine.get(key)          # 4. serve locally
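The four steps can be exercised end to end with a small synchronous simulation. This is a sketch under stated assumptions, not etcd's implementation: `Entry`, `Leader`, and `make_peer` are hypothetical names, heartbeats are plain function calls instead of RPCs, and a deferred or failed read is reported as a status tuple rather than buffered and retried.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Entry:
    term: int
    key: Optional[str] = None    # key=None marks a no-op entry
    value: Optional[int] = None


def make_peer(peer_term: int, reachable: bool = True) -> Callable[[int], bool]:
    """Hypothetical follower stub: acks a heartbeat only if it is reachable
    and the sender's term is not older than its own."""
    def heartbeat(leader_term: int) -> bool:
        return reachable and leader_term >= peer_term
    return heartbeat


@dataclass
class Leader:
    term: int
    log: List[Entry]                     # log[0] holds Raft index 1
    commit_index: int                    # highest committed Raft index
    peers: List[Callable[[int], bool]]   # one heartbeat stub per follower
    last_applied: int = 0
    state: Dict[str, int] = field(default_factory=dict)

    def has_committed_entry_in_current_term(self) -> bool:
        return any(e.term == self.term for e in self.log[: self.commit_index])

    def apply_up_to(self, index: int) -> None:
        while self.last_applied < index:
            entry = self.log[self.last_applied]
            if entry.key is not None:
                self.state[entry.key] = entry.value
            self.last_applied += 1

    def read(self, key: str) -> Tuple[str, Optional[int]]:
        # Precondition: an entry of the current term is committed.
        if not self.has_committed_entry_in_current_term():
            return ("defer", None)
        read_index = self.commit_index                       # 1. snapshot barrier
        acks = 1 + sum(1 for hb in self.peers if hb(self.term))  # 2. heartbeat round
        if acks <= (len(self.peers) + 1) // 2:
            return ("no_quorum", None)
        self.apply_up_to(read_index)                         # 3. apply barrier
        return ("ok", self.state.get(key))                   # 4. serve locally


log = [Entry(1, "x", 1), Entry(2), Entry(2, "x", 2)]  # write, no-op, write

# S2: real leader in term 2; S1 is unreachable but S3 acks.
s2 = Leader(term=2, log=log, commit_index=3,
            peers=[make_peer(1, reachable=False), make_peer(2)])
# S1: deposed leader in term 1, cut off from both peers.
s1 = Leader(term=1, log=[Entry(1, "x", 1)], commit_index=1,
            peers=[make_peer(2, reachable=False), make_peer(2, reachable=False)])
# A leader just elected in term 3 that has not committed its no-op yet.
new_leader = Leader(term=3, log=log, commit_index=3,
                    peers=[make_peer(3), make_peer(3)])

print(s2.read("x"))          # ('ok', 2)
print(s1.read("x"))          # ('no_quorum', None)
print(new_leader.read("x"))  # ('defer', None)
```

The partitioned S1 can no longer serve the stale `x=1`: its heartbeat round collects only its own ack, so the read never reaches the state machine.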
Leader.read (lease)
§6.4.1 trade an RPC round for a clock-bound assumption
    # Lease refresh: after each heartbeat round acked by a majority.
    # round_start is the time the round's heartbeats were SENT.
    # Followers reset their election timers no earlier than that,
    # so the lease is measured from the send time, not the ack time.
    def on_heartbeat_round_acked(round_start) (leader):
        lease_expires_at = round_start
                         + election_timeout
                         - clock_skew_bound

    def on_client_read(key) (leader):
        if monotonic_now() < lease_expires_at:
            return state_machine.get(key)    # zero RPCs
        else:
            return read_index_serve(key)     # fall back

    # ASSUMPTION: clock drift and process pauses are bounded. A GC
    # pause, fsync stall, or VM steal can freeze the old leader
    # between the lease check and the read, or make its view of
    # elapsed time lag the followers'. A new leader is then elected
    # BEFORE the old leader's lease expires from its own perspective,
    # and a stale read sneaks through. CockroachDB ties leases to
    # Raft leadership specifically to bound this risk.
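The lease bookkeeping can be sketched as a small runnable class with an injectable clock, which makes the expiry logic testable without sleeping. The names `LeaseReader`, `FakeClock`, and `drift_margin` are hypothetical, and the sketch assumes the lease is measured from the moment a heartbeat round was sent, which is the conservative reading of the thesis's §6.4.1 description. It only decides which read path to take; it does not defend against a pause between the check and the read.

```python
import time
from typing import Callable, List


class LeaseReader:
    """Decides whether a read may use the lease or must fall back to ReadIndex."""

    def __init__(self, election_timeout: float, drift_margin: float,
                 now: Callable[[], float] = time.monotonic):
        if drift_margin >= election_timeout:
            raise ValueError("drift_margin must be smaller than election_timeout")
        self.election_timeout = election_timeout
        self.drift_margin = drift_margin        # assumed bound on clock error
        self.now = now
        self.lease_expires_at = float("-inf")   # no lease held yet

    def begin_heartbeat_round(self) -> float:
        # Record the send time; the lease is measured from here.
        return self.now()

    def on_heartbeat_round_acked(self, round_start: float) -> None:
        candidate = round_start + self.election_timeout - self.drift_margin
        # Never shorten a lease because an older round's acks arrived late.
        self.lease_expires_at = max(self.lease_expires_at, candidate)

    def read_path(self) -> str:
        return "lease" if self.now() < self.lease_expires_at else "read_index"


class FakeClock:
    """Manually advanced clock standing in for time.monotonic."""

    def __init__(self) -> None:
        self.t = 0.0

    def __call__(self) -> float:
        return self.t

    def advance(self, dt: float) -> None:
        self.t += dt


clock = FakeClock()
reader = LeaseReader(election_timeout=1.0, drift_margin=0.1, now=clock)
paths: List[str] = []

paths.append(reader.read_path())         # no lease yet
start = reader.begin_heartbeat_round()   # heartbeats sent at t=0.0
clock.advance(0.05)                      # acks arrive 50 ms later
reader.on_heartbeat_round_acked(start)   # lease runs until t=0.9
paths.append(reader.read_path())         # t=0.05, inside the lease
clock.advance(0.9)                       # t=0.95, lease has lapsed
paths.append(reader.read_path())

print(paths)  # ['read_index', 'lease', 'read_index']
```

Measuring from the send time costs a little lease duration on every round, but it keeps the lease from outliving the earliest moment a follower's election timer could fire.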