Build an S3-style distributed object store (12 scenes)
Scene 02 · Disks fail weekly — so what does durable mean?
Millions of disks at a few % AFR: one dies every few minutes. Eleven nines — and why copies alone are too costly.
Previously

We cracked the bucket open and asked what 'forever' has to survive; the answer starts at the hardware — and the hardware is millions of disks that are dying continuously.

Scene 02
Disks fail weekly — so what does durable mean?
Diagram
A grid of disk tiles standing in for the whole fleet (green = alive, red = dead). A wall clock ticks and disks flip red on a cadence set by the fleet size and the annual failure rate. One outlined tile is a tracked object that lives on a single disk with no backup; if that disk dies the object is gone. The header readout names the target — 'eleven nines' means about one object lost per 10,000 years per 10 million stored objects, essentially never — and frames it as a question of whether your bytes still EXIST on some disk (durable), which is a separate promise from whether you can REACH them this instant (available).
DISK FLEET2,400,000 disks · AFR 2%🕑 00:00:00a disk dies every 11 minutesELEVEN NINES TARGETlose ~1 object / 10,000 yrsshowing 200 of 2,400,000 disksalivedeadtracked object (single copy)tracked object: intactMillions of disks. A death every few minutes — a lone copy is doomed. So what does 'durable' mean?
Eleven nines = the designed-for target: ~1 object lost per 10,000 years per 10 million stored. Essentially never.
Watch the fleet. The clock ticks and disks flip from green to dead-red on a steady beat. One outlined tile holds a tracked object — its only copy. Keep watching that tile.
Implementation
Fleet.deathCadence
the arithmetic the two sliders drive
1def death_cadence(fleet_size, afr):
2 # afr = per-disk annual failure rate (1%..4%)
3 deaths_per_year = fleet_size * afr
4 seconds_per_year = 365 * 24 * 3600
5 # the gap between two disk deaths, in seconds
6 gap = seconds_per_year / deaths_per_year
7 return gap
8
9# one disk: gap is months — a death is rare
10# millions of disks: gap collapses to minutes
11# the per-disk rate never changed; the fleet did
Durability.target
what 'eleven nines' names — a target, not a guarantee
1# a designed-for probabilistic target, not a measured SLA
2target = 0.99999999999 # eleven nines / year
3expected_lost_fraction = 1 - target # 1e-11
4
5# stored 10_000_000 objects?
6# expect to lose ~1 object every 10_000 years
7objects = 10_000_000
8years_per_loss = 1 / (objects * expected_lost_fraction)
9
10# this is DURABLE: do the bytes still EXIST.
11# separate from AVAILABLE: can you REACH them now.
Not sure what to ask? Tap a question — the staff engineer answers in the chat panel.