The payoff: durability is repair winning a race

Build an S3-style distributed object store (12 scenes)

Scene 07 · The payoff: durability is repair winning a race

Eleven nines is a rate equation: scrub finds rot, repair rebuilds lost fragments faster than disks destroy them.

Previously

Now that the index plane tracks every fragment and the data plane holds them, a background process can watch for losses and fix them — which is exactly the answer to the mystery we left open at the start: how do eleven nines survive disks failing every week?

Scene 07

The payoff: durability is repair winning a race

Watch

Diagram

One object's 17 data + 3 parity fragments live on 20 distinct disks inside a fleet of millions. A scrub worker (amber dashed) sweeps fragments at rest, recomputing checksums to catch silent bit rot; a repair worker (green glow) reads the survivors, rebuilds any lost fragment, and writes it to a fresh disk — restoring full redundancy. The REDUNDANCY DEFICIT gauge counts how many of the 20 are currently missing or bad.

Sources

Disks die on the fleet cadence — every few minutes, somewhere, one fails and takes a fragment with it. Watch the two background workers keep the object whole: the scrub worker sweeps fragments at rest checking they haven't silently gone bad, and the repair worker reads the survivors and rebuilds each lost fragment onto a fresh disk. The deficit gauge keeps drifting back to zero.

Implementation

Scrub.sweep

the background sweep that catches silent rot at rest

1# runs continuously on the background-ops fleet
2def sweep():
3    for frag in fleet.fragments_at_rest():
4        stored = read(frag.location)
5        if checksum(stored) != frag.expected_checksum:
6            # silent bit rot: a live disk, a bad fragment
7            mark_lost(frag)        # hand to Repair
8        advance_cursor()

Repair.rebuild

reads k survivors, reconstructs the lost fragment

1def rebuild(obj, lost_frag):
2    survivors = read_any(obj.fragments, count = k)
3    rebuilt = reed_solomon_decode(survivors)
4    fresh = pick_disk(distinct_failure_domain)
5    write(fresh, rebuilt)          # full k+m restored
6    index.point(obj, lost_frag, fresh)
7    # MTTR = wall-clock this rebuild took

Durability.estimate

why repair time, not fragment count, sets the nines

1def annual_loss_probability(failure_rate, mttr):
2    # one fragment down opens a window of length mttr
3    window = failure_rate * mttr
4    # need MORE THAN m losses inside one window to lose obj
5    p_lose = window ** (m + 1)
6    # m (code) sets the exponent; mttr scales the base
7    return p_lose

Not sure what to ask? Tap a question — the staff engineer answers in the chat panel.

PreviousTwo planes: the index says where, storage holds what NextWhy strong consistency took S3 fourteen years