#12 Slack / Discord
Channels, presence, history. Push or pull — and how a hot-channel fanout doesn't melt the gateway.
Build a production reference architecture for channel-based real-time messaging at hyperscale (think Slack workspaces or Discord guilds). The shape: tens of millions of long-lived TLS sockets, sub-second channel-broadcast delivery, server-side history that survives restarts, and a presence subsystem that does *not* melt when 50M users come online at once. Every component below is specified well enough that an SRE team would defend the choices in a five-year incident retro. Slack and Discord have both already held those retros in public, and the canonical design bakes in their lessons.
Reading: Discord — How Discord Stores Trillions of Messages (Cassandra → ScyllaDB) · Discord — How Discord Scaled Elixir to 5,000,000 Concurrent Users · Discord — Maxjourney: 1M+ Online in a Single Server (relay tier) · Discord — Using Rust to Scale Elixir for 11M Concurrent Users (SortedSet NIF) · Slack — Flannel: an Application-Level Edge Cache to Make Slack Scale · Slack — Real-time messaging (Channel Server + Gatewayserver) · Slack — Scaling Datastores at Slack with Vitess · Slack — Slack's Outage on January 4th, 2021 (TGW saturation) · Slack — Slack's Incident on 2-22-22 (Consul / Vitess feedback) · Slack — A Terrible, Horrible, No-Good, Very Bad Day (May 2020 HAProxy) · Slack — Migration to a Cellular Architecture (InfoQ 2024) · Slack — Tracing Notifications & How Slack Rebuilt Notifications · Slack — Migrating Millions of Concurrent WebSockets to Envoy · elixir-lang.org — Real-Time Communication at Scale with Elixir at Discord (2020) · Cloudflare — July 17 2020 BGP outage post-mortem · Google SRE Workbook, Ch. 5 (alerting on SLOs)
long-lived WebSocket gateway
channel server (consistent-hash by channel_id)
mega-channel relay tier (Discord Maxjourney)
lazy presence + selective subscription
request coalescing on history reads
ScyllaDB time-bucketed partitioning
Vitess sharded MySQL by channel_id
fanout-on-write with passive-session filter
outbox to Kafka for offline push
cell-based DR (Slack 2024)
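To make one of the tags above concrete, here is a minimal sketch of the "consistent-hash by channel_id" routing used to pick a channel server. It is illustrative only, not Slack's or Discord's implementation: the class name ChannelRing, the server names (cs-1, cs-2, cs-3), the virtual-node count, and the choice of SHA-256 as the hash are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash64(key: str) -> int:
    """Stable 64-bit hash; Python's built-in hash() is salted per process."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ChannelRing:
    """Consistent-hash ring mapping channel_id -> channel server (hypothetical)."""

    def __init__(self, servers, vnodes=64):
        self._vnodes = vnodes   # virtual nodes per server (assumed value)
        self._points = []       # sorted hash positions on the ring
        self._owner = {}        # hash position -> server name
        for server in servers:
            self.add(server)

    def add(self, server):
        # Each server owns many points so load spreads evenly around the ring.
        for i in range(self._vnodes):
            point = _hash64(f"{server}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = server

    def remove(self, server):
        # Drop only this server's points; every other point stays in place.
        self._points = [p for p in self._points if self._owner[p] != server]
        self._owner = {p: s for p, s in self._owner.items() if s != server}

    def lookup(self, channel_id):
        # A channel belongs to the first point clockwise from its hash.
        if not self._points:
            raise LookupError("ring is empty")
        i = bisect.bisect_right(self._points, _hash64(str(channel_id)))
        return self._owner[self._points[i % len(self._points)]]


ring = ChannelRing(["cs-1", "cs-2", "cs-3"])
before = {c: ring.lookup(c) for c in range(2000)}
ring.remove("cs-2")
after = {c: ring.lookup(c) for c in range(2000)}
moved = [c for c in before if before[c] != after[c]]
# Only channels that lived on the removed server are reassigned.
assert all(before[c] == "cs-2" for c in moved)
```

The property worth noticing is the last assertion: when a channel server leaves the ring, only the channels it owned move, so the rest of the fleet keeps its in-memory channel state and subscriber lists.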