#11 WhatsApp / Messenger
Hundreds of millions of long-lived sockets, sub-second 1:1 + group delivery, E2E-encrypted, multi-device, multi-region active-active.
Build the production reference architecture for a 1:1 + group messaging service at hyperscale (think WhatsApp / Messenger / Discord / Signal). The shape: hundreds of millions of long-lived TLS sockets, sub-second 1:1 and group delivery, end-to-end encryption with multi-device fanout, and multi-region active-active deployment with RPO = 0 for metadata on regional failover. Every component below is specified well enough that a staff SRE could file the implementation tickets from it and still defend the choices in an incident retro five years later.
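Group delivery with multi-device fanout means one send becomes one stored envelope per recipient device. Below is a minimal fanout-on-write sketch, assuming hypothetical in-memory registries (`GROUP_MEMBERS`, `USER_DEVICES`) and per-device inboxes in place of the real membership service and message store. The payload is treated as opaque ciphertext because the server cannot decrypt it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    msg_id: str         # client-generated; identical on every copy so devices can dedup
    sender: str
    sender_device: int
    group_id: str
    payload: bytes      # opaque ciphertext; the server never decrypts it


# Hypothetical in-memory stand-ins for the group-membership and device registries.
GROUP_MEMBERS = {"g1": ["alice", "bob", "carol"]}
USER_DEVICES = {"alice": [1, 2], "bob": [1], "carol": [1, 3]}

# Per-device inbox: (user, device) -> pending envelopes.
inboxes: dict[tuple[str, int], list[Envelope]] = defaultdict(list)


def fanout_on_write(env: Envelope) -> int:
    """Copy one group message into every recipient device's inbox.

    Only the originating device is skipped, so the sender's other devices
    still receive a copy. Returns the number of envelopes written.
    """
    written = 0
    for user in GROUP_MEMBERS[env.group_id]:
        for device in USER_DEVICES.get(user, []):
            if (user, device) == (env.sender, env.sender_device):
                continue
            inboxes[(user, device)].append(env)
            written += 1
    return written
```

Write amplification here is roughly members × devices per message, which is why very large groups or channels are often served fanout-on-read instead (one stored copy that readers pull).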
Reading:
Rick Reed: "That's 'Billion' with a B" (Erlang Factory 2014)
engineering.fb.com: "Migrating Messenger storage to optimize performance" (Iris/MyRocks)
discord.com/blog: "How Discord stores trillions of messages" (ScyllaDB migration)
slack.engineering: "Slack's Outage on January 4th 2021"
engineering.fb.com: "Deploying Key Transparency at WhatsApp" (2023)
Signal Protocol whitepaper
Cloudflare: October 2021 Facebook outage post-mortem
Google SRE Workbook, Ch. 5 (alerting on SLOs)
long-lived TLS sockets
fanout-on-write vs fanout-on-read
outbox pattern
idempotency / dedup
Signal Protocol (X3DH + Double Ratchet)
key transparency
ScyllaDB time-bucketed partitioning
presence at scale
push-on-disconnect
multi-region active-active
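With hundreds of millions of long-lived TLS sockets, a gateway restart or regional failover drops many clients at once, and synchronized reconnects can overload whatever gateway they land on next. A common mitigation is exponential backoff with "full jitter". The sketch below is minimal; `reconnect_delay_s` and its default base and cap are illustrative assumptions, not values from any particular client.

```python
import random
from typing import Optional


def reconnect_delay_s(attempt: int, base_s: float = 0.5, cap_s: float = 60.0,
                      rng: Optional[random.Random] = None) -> float:
    """Delay before reconnect attempt `attempt` (0-based), using full jitter.

    The ceiling grows as base_s * 2**attempt up to cap_s; the delay itself is
    drawn uniformly from [0, ceiling] so clients dropped together spread out.
    Defaults are hypothetical; tune them per deployment.
    """
    if rng is None:
        rng = random.Random()
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    ceiling = min(cap_s, base_s * (2 ** min(attempt, 30)))
    return rng.uniform(0.0, ceiling)
```

The randomization does most of the work: without it, every client dropped by the same gateway retries on the same schedule.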
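The outbox pattern closes the gap between "message persisted" and "delivery event published": both rows commit in one local transaction, and a separate relay publishes from the outbox table. Below is a minimal sketch that uses the standard-library `sqlite3` as a stand-in for the real message store; the table layout and the `store_message` / `relay_once` helpers are hypothetical.

```python
import sqlite3
from typing import Callable


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE messages (msg_id TEXT PRIMARY KEY, conv_id TEXT, body BLOB);
        CREATE TABLE outbox (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            msg_id TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )
    return db


def store_message(db: sqlite3.Connection, msg_id: str, conv_id: str, body: bytes) -> None:
    # `with db` commits both inserts together or rolls both back, so a message
    # is never stored without its outbox row (or announced without being stored).
    with db:
        db.execute("INSERT INTO messages VALUES (?, ?, ?)", (msg_id, conv_id, body))
        db.execute("INSERT INTO outbox (msg_id) VALUES (?)", (msg_id,))


def relay_once(db: sqlite3.Connection, publish: Callable[[str], None], batch: int = 100) -> int:
    """Publish up to `batch` unpublished outbox rows in order; returns how many were sent."""
    rows = db.execute(
        "SELECT seq, msg_id FROM outbox WHERE published = 0 ORDER BY seq LIMIT ?", (batch,)
    ).fetchall()
    for seq, msg_id in rows:
        publish(msg_id)  # if this raises, the row stays unpublished and is retried next pass
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

If the relay crashes after `publish` but before marking the row, the event goes out again on the next pass, so delivery is at-least-once and consumers must deduplicate by `msg_id`.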
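Idempotency starts at the client: each message carries a client-generated ID that is reused on every retry, so the server can acknowledge a resent message without storing or fanning it out twice. Below is a minimal sketch of a time-bounded dedup window; `DedupWindow`, its TTL, and the injectable clock are illustrative assumptions, and a production service would keep this state in a replicated store rather than process memory.

```python
import time
from collections import OrderedDict
from typing import Callable


class DedupWindow:
    """Remembers (sender, msg_id) pairs for `ttl_s` seconds after first sight."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._seen: "OrderedDict[tuple[str, str], float]" = OrderedDict()

    def _evict_expired(self, now: float) -> None:
        # Keys are inserted in non-decreasing time order, so expired ones sit at the front.
        while self._seen:
            _, first_seen = next(iter(self._seen.items()))
            if now - first_seen < self.ttl_s:
                break
            self._seen.popitem(last=False)

    def accept(self, sender: str, msg_id: str) -> bool:
        """True the first time a (sender, msg_id) pair is seen; False for retries."""
        now = self.clock()
        self._evict_expired(now)
        key = (sender, msg_id)
        if key in self._seen:
            return False
        self._seen[key] = now
        return True
```

A duplicate should still be acknowledged, since the client is retrying precisely because it never saw the first ack. The TTL also has to exceed the longest interval over which a client may retry the same ID.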
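The symmetric-key half of the Double Ratchet is a KDF chain: every message advances the chain key and derives a one-time message key, so once old keys are deleted, leaking the current chain key does not expose earlier messages. The sketch follows the construction the Double Ratchet specification recommends (HMAC-SHA256 keyed by the chain key, with the example input bytes 0x01 for the message key and 0x02 for the next chain key). It deliberately omits X3DH, the Diffie-Hellman ratchet, and the AEAD encryption step; `SymmetricChain` is a hypothetical name.

```python
import hashlib
import hmac


def kdf_ck(chain_key: bytes) -> tuple[bytes, bytes]:
    """One symmetric-ratchet step: returns (next_chain_key, message_key)."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return next_chain_key, message_key


class SymmetricChain:
    """A sending or receiving chain for one pairwise session (one per peer device)."""

    def __init__(self, chain_key: bytes):
        # In the full protocol this initial key comes out of the root/DH ratchet.
        self._ck = chain_key
        self.n = 0  # message number, carried in the header so the receiver can catch up

    def next_message_key(self) -> tuple[int, bytes]:
        # Replacing _ck discards the previous chain key, which is the forward-secrecy
        # property (real code would also wipe it from memory).
        self._ck, mk = kdf_ck(self._ck)
        n, self.n = self.n, self.n + 1
        return n, mk
```

Sender and receiver start from the same chain key and therefore derive the same sequence of message keys without exchanging anything per message.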
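Key transparency lets a client check that the public key the server hands out for a contact is the one committed in a publicly auditable structure, so the server cannot quietly substitute a key for a single user. The core primitive is a Merkle inclusion proof. The sketch below uses a plain binary Merkle tree with RFC 6962-style leaf/node domain separation, which is a simplification: production deployments use more elaborate, versioned directory structures. All function names are hypothetical, and the tree assumes a power-of-two leaf count.

```python
import hashlib


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hash(user: str, public_key: bytes) -> bytes:
    # 0x00 prefixes a leaf, 0x01 an interior node (RFC 6962-style domain separation).
    # The NUL separator assumes user IDs never contain NUL.
    return _sha256(b"\x00" + user.encode() + b"\x00" + public_key)


def node_hash(left: bytes, right: bytes) -> bytes:
    return _sha256(b"\x01" + left + right)


def build_levels(leaves: list[bytes]) -> list[list[bytes]]:
    """Every level of the tree, leaves first; assumes a power-of-two leaf count."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([node_hash(prev[i], prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels


def inclusion_proof(levels: list[list[bytes]], index: int) -> list[bytes]:
    """Sibling hashes from the leaf up to, but not including, the root."""
    proof = []
    for level in levels[:-1]:
        proof.append(level[index ^ 1])
        index //= 2
    return proof


def verify_inclusion(root: bytes, leaf: bytes, index: int, proof: list[bytes]) -> bool:
    """Client-side check: recompute the root from the leaf and its siblings."""
    acc = leaf
    for sibling in proof:
        acc = node_hash(acc, sibling) if index % 2 == 0 else node_hash(sibling, acc)
        index //= 2
    return acc == root
```

An inclusion proof only shows the key is in the tree the server presented; clients also need assurance that everyone sees the same root, for example through published and independently audited roots.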
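Time-bucketed partitioning keeps each ScyllaDB partition bounded: the partition key is (channel_id, bucket) rather than channel_id alone, so a busy channel's history spreads across many partitions while one bucket is still a single sequential read. Below is a minimal sketch, assuming Snowflake-style IDs whose high bits (above the low 22) carry milliseconds since a custom epoch, and a hypothetical fixed 10-day bucket; real bucket widths would be chosen from observed message rates.

```python
# Assumptions (illustrative, tune per workload): custom epoch and bucket width.
EPOCH_MS = 1420070400000            # 2015-01-01T00:00:00Z, an example custom epoch
BUCKET_MS = 10 * 24 * 60 * 60 * 1000


def first_id_at(unix_ms: int) -> int:
    """Smallest ID that could be minted at wall-clock time unix_ms (low 22 bits zero)."""
    return (unix_ms - EPOCH_MS) << 22


def bucket_for(msg_id: int) -> int:
    # The low 22 bits (worker/sequence) never affect the bucket.
    return (msg_id >> 22) // BUCKET_MS


def partition_key(channel_id: int, msg_id: int) -> tuple[int, int]:
    # In CQL: PRIMARY KEY ((channel_id, bucket), message_id). The composite partition
    # key bounds partition size; message_id clusters rows in time order.
    return (channel_id, bucket_for(msg_id))


def buckets_between(start_id: int, end_id: int) -> list[int]:
    """Buckets a history scan must visit, newest first, as 'load older' paging would."""
    return list(range(bucket_for(end_id), bucket_for(start_id) - 1, -1))
```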
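Presence and push-on-disconnect meet at delivery time: if the recipient device holds a live socket, the message goes straight down it; otherwise the server sends a push notification that wakes the app to reconnect and drain its inbox. Below is a minimal sketch with heartbeat-TTL presence. `Presence`, `route`, and the callback shapes are hypothetical, the TTL is illustrative, and the message is assumed to be durably stored before `route` runs.

```python
import time
from typing import Callable


class Presence:
    """A device counts as online until `ttl_s` after its last heartbeat."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._last_beat: dict[tuple[str, int], float] = {}

    def heartbeat(self, user: str, device: int) -> None:
        self._last_beat[(user, device)] = self.clock()

    def disconnect(self, user: str, device: int) -> None:
        self._last_beat.pop((user, device), None)

    def is_online(self, user: str, device: int) -> bool:
        t = self._last_beat.get((user, device))
        return t is not None and self.clock() - t < self.ttl_s


def route(presence: Presence, user: str, device: int, msg_id: str,
          send_on_socket: Callable[[str, int, str], bool],
          send_push: Callable[[str, int, str], None]) -> str:
    """Deliver over the live socket if possible, else fall back to a push."""
    # A failed socket write is treated like "offline": presence is only a hint.
    if presence.is_online(user, device) and send_on_socket(user, device, msg_id):
        return "socket"
    # Push-on-disconnect: the push only wakes the app; the message itself
    # is already waiting in the device inbox.
    send_push(user, device, msg_id)
    return "push"
```

Falling back on a failed write matters because a socket can die between the last heartbeat and the send.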