Snapshots — compact without violating consistency — Build Raft

Build Raft — consensus you can defend (12 scenes)

Scene 08 · Snapshots — compact without violating consistency

Each replica snapshots independently at its own applied index. The pair (lastIncludedIndex, lastIncludedTerm) stands in for the last discarded entry in the AppendEntries consistency check, so Log Matching survives compaction. InstallSnapshot ships the compacted prefix to followers that have fallen too far behind.

Previously

Scene 7 made cluster membership safe to change: every transitional decision still passes through a majority that overlaps both the old and the new rosters. So now the cluster can grow and shrink without breaking Election Safety. The next operational reality is that the log itself keeps growing — and truncating it has to preserve Log Matching just as carefully as reconfiguration preserved Election Safety.

Snapshots — compact the log without breaking Log Matching

Diagram

Three Raft servers in a row. Each cell shows role badge, currentTerm, votedFor, and a horizontal log strip — at most 8 entries are visible, and a '+N more' chip stands in for the rest. commitIndex (▲, the highest log index known to be replicated on a majority) and lastApplied (▽, the highest index fed into the state machine) sit beneath each strip. RPCs are color-coded: AppendEntries blue, RequestVote amber, InstallSnapshot violet — the fourth RPC kind in Raft, used only when a follower has fallen below the leader's truncation point.

log already overflowing

By scene 8, our log has done nothing but grow. Without compaction, every replica would have to hold the whole log on disk forever, which is untenable for a long-running cluster. The solution: periodically take a snapshot of the state machine's current state, then discard the log entries that led up to it. Here are three servers, all caught up at applied index 200, with a log strip that is already overflowing. The next three captions introduce the three new terms this scene needs: snapshot, the (lastIncludedIndex, lastIncludedTerm) pair, and the InstallSnapshot RPC.

Implementation

Replica.takeSnapshot

local operation: serialize state machine, persist metadata, truncate log prefix

def takeSnapshot(self):
    # 1. serialize state machine at lastApplied
    snap = self.stateMachine.serialize()
    snap.lastIncludedIndex = self.lastApplied
    # termAt() stays valid after an earlier compaction has shifted the log's start
    snap.lastIncludedTerm  = self.termAt(self.lastApplied)
    # 2. include latest committed configuration in the snapshot
    snap.config = self.latestCommittedConfig()
    # 3. fsync the snapshot file durably
    persist(snap)
    # keep a handle on it: replicateTo reads self.snap for far-behind followers
    self.snap = snap
    # 4. truncate the log prefix: entries 1..lastIncludedIndex go away
    self.log.discardThrough(snap.lastIncludedIndex)
    # NOTE: no RPC, no quorum, no leader involvement.
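
To make the truncation step concrete, here is a minimal runnable sketch of a log that keeps absolute indices after its prefix is discarded. The names CompactableLog, term_at, discard_through and matches are hypothetical stand-ins for the pseudocode's log helpers, not part of any real library. The point it illustrates is that the boundary index still answers the consistency check after the entry itself is gone.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entry:
    term: int
    command: str


@dataclass
class CompactableLog:
    """Hypothetical log whose prefix can be discarded; indices stay absolute (1-based)."""
    last_included_index: int = 0   # index of the last entry folded into the snapshot
    last_included_term: int = 0    # term of that entry
    entries: list = field(default_factory=list)  # entries after the snapshot

    def last_index(self) -> int:
        return self.last_included_index + len(self.entries)

    def append(self, term: int, command: str) -> int:
        self.entries.append(Entry(term, command))
        return self.last_index()

    def term_at(self, index: int) -> Optional[int]:
        """Term at an absolute index, or None if compacted away or not present."""
        if index == self.last_included_index:
            return self.last_included_term  # synthetic boundary entry
        offset = index - self.last_included_index - 1
        if 0 <= offset < len(self.entries):
            return self.entries[offset].term
        return None

    def discard_through(self, index: int) -> None:
        """Drop entries up to and including index, remembering the last one's term."""
        term = self.term_at(index)
        if term is None:
            raise ValueError(f"cannot compact through index {index}")
        keep_from = index - self.last_included_index
        self.entries = self.entries[keep_from:]
        self.last_included_index, self.last_included_term = index, term

    def matches(self, prev_log_index: int, prev_log_term: int) -> bool:
        """AppendEntries consistency check against the (possibly compacted) log."""
        return self.term_at(prev_log_index) == prev_log_term
```

With ten entries appended and discard_through(8) applied, matches(8, t) still succeeds for the correct term t even though entry 8 no longer exists, while any index below 8 can no longer be checked at all.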

Leader.replicateTo(follower)

AppendEntries when in-range; InstallSnapshot when below truncation

def replicateTo(self, F):
    if self.nextIndex[F] >= self.logStartIndex:
        # in-range: ordinary AppendEntries
        prev = self.nextIndex[F] - 1
        send(AppendEntries(
            term=self.currentTerm, prevLogIndex=prev,
            prevLogTerm=self.termAt(prev),
            entries=self.log[self.nextIndex[F]:],
            leaderCommit=self.commitIndex), to=F)
    else:
        # F has fallen below our truncation point
        send(InstallSnapshot(
            term=self.currentTerm, leaderId=self.id,
            lastIncludedIndex=self.snap.lastIncludedIndex,
            lastIncludedTerm=self.snap.lastIncludedTerm,
            data=self.snap.bytes, done=True), to=F)
        # AppendEntries resume next tick from lastIncludedIndex+1
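
The branch above, together with the "resume next tick" comment, can be sketched as a small runnable model of the leader's bookkeeping. LeaderView, plan and on_snapshot_ack are hypothetical names. Advancing nextIndex when the follower acknowledges the snapshot is an assumption about how the resume happens, since the pseudocode does not show the reply handler.

```python
from dataclasses import dataclass, field


@dataclass
class LeaderView:
    """Hypothetical slice of leader state needed to pick an RPC per follower."""
    log_start_index: int       # first index still present in the leader's log
    last_included_index: int   # snapshot boundary (log_start_index - 1 in this sketch)
    next_index: dict = field(default_factory=dict)
    match_index: dict = field(default_factory=dict)

    def plan(self, follower: str) -> str:
        # In range: the entry at next_index is still in the log, and
        # prevLogIndex may be the snapshot boundary itself.
        if self.next_index[follower] >= self.log_start_index:
            return "AppendEntries"
        # Below the truncation point: the needed entries no longer exist.
        return "InstallSnapshot"

    def on_snapshot_ack(self, follower: str) -> None:
        # Assumption: a successful InstallSnapshot proves the follower holds
        # everything through last_included_index, so replication resumes after it.
        self.match_index[follower] = max(
            self.match_index.get(follower, 0), self.last_included_index
        )
        self.next_index[follower] = self.match_index[follower] + 1
```

For a leader that has compacted through index 200, a follower with nextIndex 201 gets AppendEntries with prevLogIndex 200, while a follower with nextIndex 57 gets InstallSnapshot and moves to nextIndex 201 once it acknowledges.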

Follower.handleInstallSnapshot

adopt if past commitIndex; the metadata is the synthetic prevLog

def handleInstallSnapshot(self, msg):
    if msg.term < self.currentTerm: return Reject
    self.stepDownIfHigherTerm(msg.term)
    if msg.lastIncludedIndex <= self.commitIndex:
        return Ok  # snapshot covers nothing beyond what we already committed; ignore
    # 1. install the snapshot bytes into the state machine
    #    (persist the snapshot durably before replying, as takeSnapshot does)
    self.stateMachine.restore(msg.data)
    # 2. record the synthetic prevLog: any future AppendEntries with
    #    prevLogIndex == lastIncludedIndex matches lastIncludedTerm
    self.log.resetTo(msg.lastIncludedIndex, msg.lastIncludedTerm)
    self.commitIndex = msg.lastIncludedIndex
    self.lastApplied = msg.lastIncludedIndex
    return Ok
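
The follower side can be exercised with a similar sketch. The Follower class and its install_snapshot and accepts methods are hypothetical, and terms, persistence and step-down are left out so that only the stale-snapshot guard and the synthetic prevLog remain.

```python
from dataclasses import dataclass, field


@dataclass
class Follower:
    """Hypothetical follower state, reduced to what InstallSnapshot touches."""
    commit_index: int = 0
    last_applied: int = 0
    last_included_index: int = 0
    last_included_term: int = 0
    state: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # (index, term) pairs after the snapshot

    def install_snapshot(self, last_included_index: int,
                         last_included_term: int, data: dict) -> bool:
        """Return True if the snapshot was adopted, False if ignored as stale."""
        if last_included_index <= self.commit_index:
            return False  # nothing here that we have not already committed
        self.state = dict(data)   # restore the state machine
        self.log = []             # the whole local log is replaced by the snapshot
        self.last_included_index = last_included_index
        self.last_included_term = last_included_term
        self.commit_index = last_included_index
        self.last_applied = last_included_index
        return True

    def accepts(self, prev_log_index: int, prev_log_term: int) -> bool:
        """AppendEntries consistency check, using the snapshot as the boundary entry."""
        if prev_log_index == self.last_included_index:
            return prev_log_term == self.last_included_term
        return (prev_log_index, prev_log_term) in self.log
```

A follower committed through index 57 that installs a snapshot at (200, term 4) will afterwards accept an AppendEntries whose prevLog is (200, 4), reject one claiming (200, 3), and ignore a late duplicate snapshot at index 150.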
