Arqly — System Design Mastery

#52 Cache Invalidation Across a Fleet

Write-through vs write-behind. Two generals.


A cache only earns its keep if readers trust it. The instant a write lands in the database, every cached copy of the affected key — across hundreds of cache hosts, in several regions, plus the CDN at the edge — is wrong. Cache invalidation across a fleet is the problem of propagating "this key changed" to all of those copies, quickly, reliably, and provably enough that users do not see stale data.

The hard part is not the happy path. You can never be certain an invalidation was delivered (the Two Generals problem), so a system that assumes "I sent the delete, therefore the cache is correct" is wrong by construction. Production systems instead make invalidation idempotent, replayable from a durable log, version-guarded, and continuously measured, and they accept that "correct" means "inconsistent for less than X milliseconds, less than one time in ten billion," not "never stale." This canonical design models the look-aside fleet that Meta (memcache + mcsqueal + leases + Polaris), Netflix (EVCache), and Uber (CacheFront + Flux) actually run.
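To make "idempotent, replayable, version-guarded" concrete, here is a minimal in-memory sketch. It is not how memcache, EVCache, or CacheFront are implemented. The names `Invalidation`, `CacheHost`, and `replay` are hypothetical, and the sketch assumes the database assigns each key a monotonically increasing version that travels with every invalidation and every cache fill.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Invalidation:
    """Hypothetical log record: 'key is now at `version`; older copies are stale'."""
    key: str
    version: int  # assumed to increase monotonically per key


class CacheHost:
    """Toy stand-in for one look-aside cache host (illustrative only)."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[int, str]] = {}   # key -> (version, value)
        self._floor: Dict[str, int] = {}              # key -> lowest version allowed

    def apply(self, inv: Invalidation) -> None:
        # Idempotent: the floor only ever moves up, so applying the same
        # record twice (or out of order) leaves the host in the same state.
        floor = max(self._floor.get(inv.key, 0), inv.version)
        self._floor[inv.key] = floor
        entry = self._data.get(inv.key)
        if entry is not None and entry[0] < floor:
            del self._data[inv.key]

    def fill(self, key: str, version: int, value: str) -> bool:
        # Version guard: a slow reader holding an old database row cannot
        # overwrite state that a newer invalidation has already superseded.
        if version < self._floor.get(key, 0):
            return False
        current = self._data.get(key)
        if current is not None and current[0] >= version:
            return False
        self._data[key] = (version, value)
        return True

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        return entry[1] if entry is not None else None


def replay(log: List[Invalidation], host: CacheHost, from_offset: int = 0) -> int:
    """Re-apply the durable log from an offset; returns the next offset to resume from."""
    for inv in log[from_offset:]:
        host.apply(inv)
    return len(log)


host = CacheHost()
host.fill("user:1", 1, "alice")
log = [Invalidation("user:1", 2)]

replay(log, host)                       # delivery attempt
replay(log, host)                       # redelivery after an uncertain ack: harmless
stale_fill = host.fill("user:1", 1, "alice")    # late fill with the old row
fresh_fill = host.fill("user:1", 2, "alicia")   # fill with the current row
```

Because a sender can never confirm delivery, the safe response to doubt is to send again, and that is only safe when `apply` is idempotent and fills are version-checked. The sketch keeps a version floor per key forever, which a real fleet could not afford; production systems bound that state, for example with short-lived leases.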

Reading: Scaling Memcache at Facebook (NSDI 2013) · TAO: Facebook's Distributed Data Store for the Social Graph (ATC 2013) · Cache Made Consistent / Polaris (Meta, 2022) · Netflix EVCache global replication · Uber CacheFront

write-through

write-behind

pub/sub invalidation

look-aside leases

binlog-tailed invalidation

cross-region staleness

consistency monitoring