Build a wide-column store (Cassandra / DynamoDB family) (13 scenes)
Scene 10 · Hold the write until it wakes up
When a replica is briefly unreachable, the coordinator stashes the write and replays it on return.
Previously
Partitions are dramatic; the everyday case is a single replica blipping for a few seconds — for that the cluster has a softer trick.
Diagram
A 5-node consistent-hash ring with replica **E** marked DOWN. The featured key's writes normally fan out to its RF=3 successors, and one of those successors is E. The healthy coordinator sprouts an orange **hinted handoff** envelope: a coordinator-side stash of writes destined for a temporarily unreachable replica, replayed when that replica returns. Inside the envelope are two meters. The **hints disk** meter fills as the outage drags on, because the coordinator pays for the stash in disk pressure. The **hint TTL** age bar tracks the maximum time a hint is retained (3h by default in Cassandra). Past that limit hints are discarded, so when E returns it stays stale for those writes: hinted handoff alone will never deliver them.
Replica E just went DOWN. Watch the simulated outage clock tick — for every missed write, the coordinator stashes a hint envelope. The age meter creeps toward the 3h TTL. Within TTL, the envelope replays when E returns; past it, the envelope dissolves.
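To make the hints disk meter concrete, here is a back-of-the-envelope sketch of how large the stash can get for one down replica. The function name, the steady write rate, and the average hint size are assumptions for illustration; only the 3h window comes from the text above.

```python
def hint_backlog_bytes(writes_per_sec, avg_hint_bytes, outage_sec,
                       window_sec=3 * 60 * 60):
    """Rough upper bound on hint bytes stashed for one down replica."""
    # Hints older than the window are discarded, so the backlog stops
    # growing once the outage outlasts the window.
    retained_sec = min(outage_sec, window_sec)
    return writes_per_sec * avg_hint_bytes * retained_sec


# Hypothetical load: 500 writes/s of ~200-byte hints during a 10-minute blip.
short_blip = hint_backlog_bytes(500, 200, 10 * 60)

# The same load during a 5-hour outage is capped at 3 hours' worth of hints.
long_outage = hint_backlog_bytes(500, 200, 5 * 60 * 60)
```

Under these assumed numbers the short blip costs about 60 MB of coordinator disk, while the long outage plateaus at roughly 1.08 GB because everything older than the window has already been dropped.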
Implementation
Coordinator.put
the write path with the dead-replica hint branch
    def put(key, value):
        replicas = ring.replicas_for(key)
        for r in replicas:
            if r.alive:
                send_async(r, Write(key, value))
            else:
                # stash on local disk for later replay
                hint_store.append(
                    Hint(target=r, write=(key, value),
                         created_at=now()),
                )
        return Ack  # to W replicas (best-effort hints)
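The listing above leans on `ring`, `send_async`, and `hint_store` from the surrounding simulation. A minimal self-contained sketch of the same branch, assuming in-memory replicas and a plain list as the hint store (the `Replica` and `Hint` classes here are hypothetical stand-ins, not the page's actual types), might look like this:

```python
import time
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    alive: bool = True
    data: dict = field(default_factory=dict)


@dataclass
class Hint:
    target: Replica
    write: tuple
    created_at: float


def put(key, value, replicas, hint_store, now=time.time):
    """Write to live replicas; stash a hint for each unreachable one."""
    acked = 0
    for r in replicas:
        if r.alive:
            r.data[key] = value  # stands in for send_async(r, Write(key, value))
            acked += 1
        else:
            # Replica is down: remember the write so it can be replayed later.
            hint_store.append(Hint(target=r, write=(key, value), created_at=now()))
    return acked  # number of replicas that actually took the write


c, d, e = Replica("C"), Replica("D"), Replica("E", alive=False)
hints = []
acked = put("user:42", "v1", [c, d, e], hints)
```

With E down, two replicas hold the value and one hint is waiting for E. Returning the count of live writes is a choice made for this sketch so the caller can compare it against W.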
HintStore.replay_on_return
fires when the down replica gossips back as alive
    def replay_on_return(replica):
        # copy first: hints are removed from the store while iterating
        for hint in list(hint_store.hints_for(replica)):
            age = now() - hint.created_at
            if age <= max_hint_window_in_ms:
                send(replica, hint.write)
                hint_store.remove(hint)
            else:
                # past TTL: dropped unsent, so the replica stays stale for this write
                hint_store.remove(hint)
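To see the two outcomes side by side, the replay step can be sketched with an injected clock and an injected `deliver` callback, so no real network or timer is needed. The millisecond constant, the string-named targets, and the returned counts are assumptions for this sketch.

```python
from dataclasses import dataclass

MAX_HINT_WINDOW_MS = 3 * 60 * 60 * 1000  # 3 h, the default cited earlier


@dataclass
class Hint:
    target: str
    write: tuple
    created_at_ms: int


def replay_on_return(replica, hint_store, now_ms, deliver):
    """Deliver fresh hints for `replica`, drop expired ones; return both counts."""
    replayed = expired = 0
    # Build the list up front because the store is mutated inside the loop.
    for hint in [h for h in hint_store if h.target == replica]:
        if now_ms - hint.created_at_ms <= MAX_HINT_WINDOW_MS:
            deliver(replica, hint.write)
            replayed += 1
        else:
            expired += 1  # too old: removed without being sent
        hint_store.remove(hint)
    return replayed, expired


HOUR_MS = 60 * 60 * 1000
store = [
    Hint("E", ("k1", "v1"), created_at_ms=0),            # 4 h old at replay time
    Hint("E", ("k2", "v2"), created_at_ms=2 * HOUR_MS),  # 2 h old at replay time
    Hint("B", ("k3", "v3"), created_at_ms=2 * HOUR_MS),  # other target, untouched
]
delivered = []
counts = replay_on_return("E", store, now_ms=4 * HOUR_MS,
                          deliver=lambda r, w: delivered.append((r, w)))
```

Only the 2-hour-old hint reaches E; the 4-hour-old one is removed without being sent, and the hint for B stays in the store.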
HintStore.sweep
background TTL sweep — discard hints past max_hint_window
    # runs continuously on the coordinator
    def sweep():
        while True:
            for hint in list(hint_store):
                age = now() - hint.created_at
                if age > max_hint_window_in_ms:
                    hint_store.remove(hint)  # silent loss
            # assuming time.sleep, which takes seconds rather than milliseconds
            sleep(hints_flush_period_ms / 1000)
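A single pass of the sweep is easier to reason about than the endless loop. The sketch below assumes hints are plain `(target, write, created_at_ms)` tuples and returns the number discarded, which is one possible way to make the otherwise silent loss visible to a caller. The name `sweep_once` is hypothetical.

```python
def sweep_once(hint_store, now_ms, max_hint_window_ms=3 * 60 * 60 * 1000):
    """Remove expired hints in place; return how many were discarded."""
    # Each hint is modeled as a (target, write, created_at_ms) tuple.
    fresh = [h for h in hint_store if now_ms - h[2] <= max_hint_window_ms]
    discarded = len(hint_store) - len(fresh)
    hint_store[:] = fresh  # mutate the caller's list rather than rebinding it
    return discarded


HOUR_MS = 60 * 60 * 1000
store = [
    ("E", ("k1", "v1"), 0),            # 4 h old at sweep time: expired
    ("E", ("k2", "v2"), 3 * HOUR_MS),  # 1 h old at sweep time: kept
]
dropped = sweep_once(store, now_ms=4 * HOUR_MS)
```

A background loop could call this once per period and feed the returned count into a metric or log line.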