Build an S3-style distributed object store (12 scenes)
Scene 5.5 · An m-tolerant code dies if m+1 fragments share a rack
The code's m-fault tolerance is real only if placement spreads fragments across independent failure domains.
Previously

The k+m code tolerates m losses only when those losses are independent — so we have to confront where the fragments actually sit, because real disks fail together.

Scene 5.5
An m-tolerant code dies if m+1 fragments share a rack
Diagram
The 17+3 stripe from the previous scene, now scattered across labelled racks. A failure domain is the unit that fails together — a rack, a power feed, a switch — drawn here as one column whose disks all die at once. A correlated failure is one event that takes out many fragments together. Under SPREAD no rack holds more than m=3 fragments, so killing a rack stays within the code; under CO-LOCATED one rack holds more than m, so killing it exceeds the code and the object is UNRECOVERABLE.
placement across failure domains · 17+3SPREAD — at most m=3 fragments per rackplacement: SPREADrack 13 fragsd0d7d14rack 23 fragsd1d8d15rack 33 fragsd2d9d16rack 43 fragsd3d10p0rack 53 fragsd4d11p1rack 63 fragsd5d12p2rack 72 fragsd6d13DURABILITY11 nines (on paper)FAULT TOLERANCEtolerates 3 lossesPick a rack to kill — every disk in it fails at once.
Here is the exact 17+3 stripe from before, but now we ask the question the durability math quietly skipped: where do the 20 fragments actually sit? Under SPREAD placement no single rack holds more than m=3 of them. Watch one whole rack lose power — every disk in it dies at once — and the object still rebuilds.
Implementation
Placer.assign(fragments, domains, m)
the placement rule — no domain may hold more than m fragments
1def assign(fragments, domains, m):
2 per_domain = {d: 0 for d in domains}
3 for frag in fragments: # k + m of them
4 d = pick_domain(domains, per_domain)
5 if per_domain[d] >= m:
6 continue # full — keep faults independent
7 per_domain[d] += 1
8 frag.domain = d
9 return fragments
Cluster.failDomain(dead)
one power event = every fragment in that domain, at once
1def failDomain(dead):
2 lost = [f for f in fragments if f.domain == dead]
3 for f in lost:
4 f.alive = False # one event, correlated loss
5 return len(lost) # all gone simultaneously
Reader.reconstruct(k)
any k survivors rebuild; fewer than k is unrecoverable
1def reconstruct(k):
2 survivors = [f for f in fragments if f.alive]
3 if len(survivors) >= k:
4 return decode(survivors[:k]) # inverse-solve
5 raise Unrecoverable(survivors=len(survivors),
6 needed=k)
Not sure what to ask? Tap a question — the staff engineer answers in the chat panel.