Replication is async — acked writes can vanish

Scene 06 · Replication is async — acked writes can vanish

PSYNC, the replication backlog, and the AP-not-CP gotcha: WAIT doesn't fix it; min-replicas-to-write narrows the loss window (at the cost of unavailability) but, because replication stays asynchronous, does not close it.

Previously

One node, however well-tuned, is a single point of failure. The first answer is to keep a copy on another node — but that copy lags behind, and that lag is the whole story of this scene.

Diagram

Master on the left, two replicas on the right. Writes flow into the master, get acked to the client immediately, then propagate down a replication stream to each replica with a small lag offset shown above each link. A backlog buffer on the master shows what a reconnecting replica can pull (partial resync) before falling back to a full RDB transfer.

Sources

Master accepts writes; the replication backlog ring fills; PSYNC streams keep both replicas' offsets within a tick of the master. This is the steady state — and the trap.
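
The trap can be made concrete with a toy model. The sketch below is not Redis code: ToyMaster, ToyReplica, and deliver are hypothetical names invented for illustration, and the network is modelled as a per-replica list of commands that have been sent but not yet applied. It only shows the ordering problem: the client receives its ack before any replica holds the write.

```python
class ToyReplica:
    def __init__(self):
        self.data = {}
        self.in_flight = []   # commands sent by the master but not yet applied

    def deliver(self, n=None):
        """Apply up to n in-flight commands (all of them if n is None)."""
        count = len(self.in_flight) if n is None else n
        for key, value in self.in_flight[:count]:
            self.data[key] = value
        del self.in_flight[:count]


class ToyMaster:
    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas

    def set(self, key, value):
        self.data[key] = value
        for r in self.replicas:
            r.in_flight.append((key, value))  # queued, not yet applied
        return "OK"                           # acked before any replica applies it


replica = ToyReplica()
master = ToyMaster([replica])

acks = [master.set("a", 1), master.set("b", 2)]
replica.deliver(1)   # only the first write reaches the replica

# The master dies here; the replica is promoted with whatever it has applied.
promoted = replica.data
lost = {k: v for k, v in master.data.items() if k not in promoted}
print(acks, promoted, lost)
```

Both writes were acknowledged, yet the promoted replica only knows about the first one. Anything still in flight when the master fails is gone from the new master's point of view.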

Implementation

Replica.psync

(replication-id, offset) handshake; partial vs full

def psync(self):
    # Offer the replication ID and offset we last saw from the master.
    resp = self.master.PSYNC(self.replicationId, self.offset)
    if resp.kind == CONTINUE:
        # Backlog still covers our gap: stream only the missed bytes.
        self.partial_resync()
    else:  # FULLRESYNC <new_id> <new_offset>
        # Adopt the master's ID and offset, then load a fresh snapshot.
        self.replicationId = resp.new_id
        self.offset = resp.new_offset
        self.full_resync_via_rdb()  # master forks and ships an RDB
    self.stream_replication_link()
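
The other half of the handshake is the master deciding which reply to send. The following is a minimal sketch of that decision under simplifying assumptions: ToyMasterState, backlog_start, and handle_psync are hypothetical names, the backlog is reduced to a byte window ending at the current offset, and a real Redis master tracks more state than this.

```python
from dataclasses import dataclass


@dataclass
class ToyMasterState:
    repl_id: str
    offset: int          # total bytes of replication stream produced so far
    backlog_size: int    # capacity of the backlog ring, in bytes

    def backlog_start(self) -> int:
        """Oldest stream offset still held in the backlog."""
        return max(0, self.offset - self.backlog_size)

    def handle_psync(self, repl_id: str, offset: int) -> str:
        same_history = repl_id == self.repl_id
        in_window = self.backlog_start() <= offset <= self.offset
        if same_history and in_window:
            return "CONTINUE"                      # partial resync
        return f"FULLRESYNC {self.repl_id} {self.offset}"


m = ToyMasterState(repl_id="abc", offset=1000, backlog_size=256)
results = [
    m.handle_psync("abc", 900),   # gap fits in the backlog
    m.handle_psync("abc", 500),   # gap older than the backlog
    m.handle_psync("zzz", 900),   # different replication history
]
print(results)
# ['CONTINUE', 'FULLRESYNC abc 1000', 'FULLRESYNC abc 1000']
```

The practical consequence is that the backlog size sets how long a replica can stay disconnected before a reconnect costs a full RDB transfer instead of a short catch-up.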

Master.handleWrite

append locally, ack the client, then replicate

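def handleWrite(self, cmd, client):
    self.applyToDataset(cmd)
    self.offset += len(cmd)        # replication offset counts bytes
    self.appendToBacklog(cmd)      # ring buffer used for partial resync
    replyOK(client)                # ack now, before any replica confirms
    for r in self.replicas:
        r.send(cmd)                # fire-and-forget: no wait for replica acks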
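
For contrast, here is a rough sketch of the kind of guard that min-replicas-to-write, together with its companion setting min-replicas-max-lag, places in front of the write path. It is an illustration under assumptions, not Redis's implementation: GuardedMaster, NotEnoughReplicas, on_replica_ack, and the injectable clock are all hypothetical names. The master counts replicas whose last acknowledgement is recent enough and refuses the write when too few qualify.

```python
import time


class NotEnoughReplicas(Exception):
    """Hypothetical error type for this sketch."""


class GuardedMaster:
    def __init__(self, min_replicas_to_write, min_replicas_max_lag,
                 clock=time.monotonic):
        self.min_replicas_to_write = min_replicas_to_write
        self.min_replicas_max_lag = min_replicas_max_lag   # seconds
        self.clock = clock
        self.last_ack = {}   # replica name -> time of its last acknowledgement
        self.data = {}

    def on_replica_ack(self, name):
        self.last_ack[name] = self.clock()

    def good_replicas(self):
        now = self.clock()
        return sum(1 for t in self.last_ack.values()
                   if now - t <= self.min_replicas_max_lag)

    def handle_write(self, key, value):
        if self.good_replicas() < self.min_replicas_to_write:
            raise NotEnoughReplicas("refusing write")   # unavailable by choice
        self.data[key] = value
        return "OK"


now = [0.0]   # fake clock so the example is deterministic
m = GuardedMaster(min_replicas_to_write=1, min_replicas_max_lag=10,
                  clock=lambda: now[0])
m.on_replica_ack("r1")
first = m.handle_write("a", 1)    # replica acked recently: accepted
now[0] = 30.0                     # 30 s pass with no further acks
try:
    second = m.handle_write("b", 2)
except NotEnoughReplicas:
    second = "refused"
print(first, second)   # OK refused
```

Note what the guard does not do: it checks replica health before the write and says nothing about whether that particular write ever reaches a replica. That is why it trades availability for a smaller loss window rather than for a guarantee.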
