Reads — ReadIndex and lease — Build Raft — consensus you can defend

Build Raft — consensus you can defend (12 scenes)

Scene 09 · Reads — ReadIndex and lease

Naive leader-reads break linearizability under partition. ReadIndex (commit barrier with no-op-on-election precondition) restores it; lease reads buy back the heartbeat round in exchange for a bounded-clock-skew assumption.

Previously

Scene 8 closed the safety arc — committed entries survive crashes, membership changes are safe, and log truncation preserves Log Matching. The READ path looks like it should already be safe (it's the leader answering, after all) — but the next surprise is that a naive read from 'the leader' silently breaks linearizability under a network partition.

Watch

Diagram

A 3-server cluster (S1 = the partitioned old leader, S2 = the new real leader, S3 = follower) with a partition wall between S1 and {S2, S3}. The slider picks which read protocol is in use. NAIVE: S1 answers from local memory — STALE chip lights up. READINDEX: S2 captures its commitIndex as a barrier, runs a heartbeat round to prove it's still leader, waits for its state machine to catch up, then answers — SAFE chip; gated on the no-op-on-election bar that scene 10 explains. LEASE: S2 holds a green time-bar lease; reads inside the bar cost zero RPCs, but a GC-pause toggle lets the bar overshoot real wall-clock and a stale read sneaks through.

Here's something that catches even experienced engineers off guard. Suppose you're using etcd to look up a config value, and you ask the leader. The leader answers from its own memory. Sounds safe: it's the leader, right? Now imagine a network partition has cut the leader (call it S1) off from a majority of the cluster, and a new leader (S2) has already been elected on the other side. S1 doesn't know it has been deposed. It hasn't heard a higher term from anyone, and if it steps down after an election timeout without majority contact, that timeout hasn't fired yet, so by its own clock it's still in charge. It happily answers our read with a stale value even though the new leader has already committed a newer write. We just got a **stale-leader read**, and we broke **linearizability** (the property that every read sees the result of every write that completed before the read began, as if the whole system were a single machine processing one operation at a time). Watch S1 hand the client `x=1` while S2 has already committed `x=2` on {S2, S3}.
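To make the definition concrete, the sketch below checks a recorded client history for exactly this anomaly. It is not a full linearizability checker: it only flags reads that return a value older than a write that had already completed before the read began, and it assumes a single register whose writes do not overlap and carry distinct values. The `Op` record, its field names, and the timestamps are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str     # "write" or "read"
    value: int    # value written, or value returned by the read
    start: float  # time the client invoked the operation
    end: float    # time the client received the response


def stale_reads(history):
    """Return reads that miss a write which completed before the read started.

    Assumes one register, non-overlapping writes, and distinct written values.
    This is a necessary condition for linearizability, not a complete check.
    """
    writes = sorted((op for op in history if op.kind == "write"),
                    key=lambda op: op.end)
    order = {w.value: i for i, w in enumerate(writes)}
    violations = []
    for op in history:
        if op.kind != "read":
            continue
        # Positions of all writes that finished before this read was invoked.
        finished = [order[w.value] for w in writes if w.end < op.start]
        if finished and order.get(op.value, -1) < max(finished):
            violations.append(op)
    return violations


history = [
    Op("write", 1, start=0.0, end=1.0),  # x=1 committed before the partition
    Op("write", 2, start=2.0, end=3.0),  # x=2 committed by S2 on {S2, S3}
    Op("read", 1, start=4.0, end=4.1),   # partitioned S1 answers x=1
]
print(stale_reads(history))
```

The read of `x=1` is flagged because the write of `x=2` had already returned to its client before the read was issued.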

Implementation

Leader.read (naive)

the broken baseline — read from local state, no quorum check

# NAIVE: leader serves reads from local state with no quorum check.
def on_client_read(key):  # runs on any node that believes it is leader
    return state_machine.get(key)

# BUG: a partitioned old leader still believes it leads.
# It has not heard a higher term, and any step-down timeout
# has not fired yet, so it has no way to know a new leader
# has already been elected and committed a newer write.
# Stale read served. Linearizability broken.
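The partition scenario can be reproduced in a few lines. This is a minimal sketch under stated assumptions: `Node` is a hypothetical in-memory stand-in for a replica, and the election and replication steps are applied by hand rather than through real RPCs.

```python
class Node:
    """Hypothetical replica: a term, a role flag, and a key-value state machine."""

    def __init__(self, name, term, is_leader, state):
        self.name = name
        self.term = term
        self.is_leader = is_leader
        self.state = dict(state)

    def naive_read(self, key):
        # No quorum check: any node that believes it leads answers locally.
        if not self.is_leader:
            raise RuntimeError(f"{self.name} is not leader")
        return self.state.get(key)


# Before the partition all three replicas hold x=1 at term 1 and S1 leads.
s1 = Node("S1", term=1, is_leader=True, state={"x": 1})
s2 = Node("S2", term=1, is_leader=False, state={"x": 1})
s3 = Node("S3", term=1, is_leader=False, state={"x": 1})

# Partition: S1 is cut off. S2 wins an election at term 2 with S3's vote,
# then commits x=2 on the majority {S2, S3}. S1 hears none of this.
s2.term = s3.term = 2
s2.is_leader = True
for node in (s2, s3):
    node.state["x"] = 2

stale = s1.naive_read("x")  # S1 still believes it leads at term 1
fresh = s2.naive_read("x")
print(stale, fresh)
```

Both S1 and S2 consider themselves leader at the same moment, which Raft permits as long as they are in different terms. The naive read path is what turns that harmless overlap into a stale answer.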

Leader.read (ReadIndex)

§6.4 commit-barrier read, gated on no-op-on-election

# Helper names (defer, self_id, send_heartbeat, wait_until) are
# placeholders for the node's own plumbing.
def on_client_read(key):  # runs on the leader
    # Precondition: the leader must have committed an entry of its
    # CURRENT term (the no-op-on-election). Otherwise commit_index
    # may be stale and ReadIndex would under-report.
    if not has_committed_entry_in_current_term():
        return defer  # buffered in pendingReadIndexMessages
    read_index = commit_index                # 1. snapshot barrier
    acks = {self_id}                         # 2. confirm leadership
    for peer in cluster_minus_self:
        send_heartbeat(peer)                 # empty AppendEntries
    # Only acks for heartbeats sent AFTER the read arrived count.
    wait_until(lambda: len(acks) >= majority)
    wait_until(lambda: last_applied >= read_index)  # 3. apply barrier
    return state_machine.get(key)            # 4. serve locally
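The four steps can be exercised end to end in a single-threaded model. This is a sketch under stated assumptions, not a real Raft node: `Replica` and its fields are hypothetical, heartbeats are synchronous method calls, and a read that cannot proceed raises instead of being buffered or retried.

```python
class Replica:
    """Hypothetical single-threaded model of a Raft node, just enough for ReadIndex."""

    def __init__(self, name):
        self.name = name
        self.term = 1
        self.peers = []         # the other replicas
        self.reachable = set()  # names this node can currently talk to
        self.log = []           # (term, key, value); key None marks a no-op
        self.commit_index = 0   # number of committed entries
        self.last_applied = 0
        self.state = {}

    def on_heartbeat(self, leader_term):
        # A follower acks only if the sender's term is not behind its own.
        if leader_term < self.term:
            return False
        self.term = leader_term
        return True

    def apply_committed(self):
        while self.last_applied < self.commit_index:
            _, key, value = self.log[self.last_applied]
            if key is not None:
                self.state[key] = value
            self.last_applied += 1

    def read_index(self, key):
        # Precondition: an entry from the current term is committed.
        committed = self.log[:self.commit_index]
        if not any(term == self.term for term, _, _ in committed):
            raise RuntimeError("deferred: no committed entry in current term")
        barrier = self.commit_index                  # 1. snapshot barrier
        acks = 1                                     # 2. count self, then peers
        for peer in self.peers:
            if peer.name in self.reachable and peer.on_heartbeat(self.term):
                acks += 1
        if acks <= (len(self.peers) + 1) // 2:
            raise RuntimeError("no quorum: leadership not confirmed")
        self.apply_committed()                       # 3. apply barrier
        assert self.last_applied >= barrier
        return self.state.get(key)                   # 4. serve locally


s1, s2, s3 = Replica("S1"), Replica("S2"), Replica("S3")
for node in (s1, s2, s3):
    node.peers = [p for p in (s1, s2, s3) if p is not node]
    node.log = [(1, None, None), (1, "x", 1)]  # term-1 no-op, then x=1
    node.commit_index = 2
    node.apply_committed()

# Partition S1 away; S2 leads at term 2 and commits a no-op and x=2.
s1.reachable = set()
s2.reachable = {"S3"}
s2.term = s3.term = 2
for node in (s2, s3):
    node.log += [(2, None, None), (2, "x", 2)]
    node.commit_index = 4

print(s2.read_index("x"))
try:
    s1.read_index("x")
except RuntimeError as err:
    print(err)
```

S2 gathers an ack from S3, applies up to its barrier, and serves `x=2`. S1 passes the precondition (it did commit in term 1) but cannot collect a majority of acks, so it never answers with `x=1`.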

Leader.read (lease)

§6.4.1 trade an RPC round for a clock-bound assumption

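# Lease refresh: after every heartbeat round acked by a majority.
def on_heartbeat_round_acked(round_sent_at):  # runs on the leader
    global lease_expires_at
    # Anchor the lease at the time the round was SENT, not when the
    # acks arrived: followers reset their election timers on receipt,
    # which is no earlier than the send time. (round_sent_at is a
    # placeholder for however the node records that timestamp.)
    lease_expires_at = (round_sent_at
                        + election_timeout
                        - clock_skew_bound)

def on_client_read(key):  # runs on the leader
    if monotonic_now() < lease_expires_at:
        return state_machine.get(key)   # zero RPCs
    else:
        return read_index_serve(key)    # fall back

# ASSUMPTION: clock drift between replicas is bounded, and the
# leader is not paused between the lease check and the reply.
# Excess drift, or a GC pause, fsync stall, or VM steal in that
# window, breaks it: a new leader is elected BEFORE the old
# leader's lease expires from its own perspective, and a stale
# read sneaks through. CockroachDB ties leases to Raft leadership
# specifically to bound this risk.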
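The lease arithmetic is easier to follow with a clock that can be advanced by hand. In this sketch `FakeClock` and `LeaseLeader` are hypothetical, the timeout and skew values are arbitrary, and `read_index_serve` is a stand-in that only counts how often the slow path runs. The lease is anchored at the time the heartbeat round was sent, which is an assumption chosen here because it is the conservative end of the round trip.

```python
class FakeClock:
    """Hypothetical manually advanced monotonic clock, for a deterministic example."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class LeaseLeader:
    def __init__(self, clock, election_timeout, clock_skew_bound):
        self.clock = clock
        self.election_timeout = election_timeout
        self.clock_skew_bound = clock_skew_bound
        self.lease_expires_at = 0.0
        self.state = {"x": 1}
        self.read_index_calls = 0

    def on_heartbeat_round_acked(self, round_sent_at):
        # Followers reset their election timers no earlier than the send
        # time, so the lease is measured from there, minus the skew bound.
        self.lease_expires_at = (round_sent_at
                                 + self.election_timeout
                                 - self.clock_skew_bound)

    def read_index_serve(self, key):
        # Stand-in for the full ReadIndex path (heartbeat round + apply barrier).
        self.read_index_calls += 1
        return self.state.get(key)

    def on_client_read(self, key):
        if self.clock.now < self.lease_expires_at:
            return self.state.get(key)     # zero RPCs
        return self.read_index_serve(key)  # lease expired: fall back


clock = FakeClock()
leader = LeaseLeader(clock, election_timeout=1.0, clock_skew_bound=0.2)

sent_at = clock.now                       # heartbeat round sent at t=0.0
clock.advance(0.05)                       # majority acks arrive at t=0.05
leader.on_heartbeat_round_acked(sent_at)  # lease runs until t=0.8

clock.advance(0.5)                        # t=0.55, inside the lease
leader.on_client_read("x")
inside = leader.read_index_calls

clock.advance(0.5)                        # t=1.05, lease has expired
leader.on_client_read("x")
after = leader.read_index_calls
print(inside, after)
```

The first read is served locally and the second falls back to the slow path. What the sketch cannot show is the hazard itself: a pause between the `clock.now` comparison and the return would let the reply leave after the lease has lapsed, which is why the skew bound needs real headroom.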
