Arqly — System Design Mastery

#14 Live Comments / Score Updates

Pub/sub at scale. WebSocket vs SSE vs long polling. Approximate by design — mega-rooms drop comments on purpose.


MrBeast is live and 5 million phones are open at once. A goal is scored in the World Cup final and 30 million phones pulse-vibrate. Somebody types "BRO" and a hundred thousand people see it in the same second, while, by design, the other four million nine hundred thousand don't. This is the dominant shape of "live comments + score updates": a write-explosive, read-explosive workload where O(n) publishers create O(n²) potential delivery work, and where the right answer to "deliver every comment to every viewer" is no. Sampling is the design, not a degradation.
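To make the O(n²) claim concrete, here is a back-of-the-envelope sketch. The 5 million viewers come from the text; the per-viewer publish rate and the per-viewer display budget are illustrative assumptions, not figures from the source.

```python
def unsampled_deliveries_per_sec(viewers: int, comments_per_viewer_per_sec: float) -> float:
    """Deliveries per second if every comment is sent to every viewer."""
    published = viewers * comments_per_viewer_per_sec  # O(n) publishers
    return published * viewers                         # each comment fans out to all n


def sampled_deliveries_per_sec(viewers: int, shown_per_viewer_per_sec: float) -> float:
    """Deliveries per second when each viewer receives at most a fixed budget."""
    return viewers * shown_per_viewer_per_sec          # linear in n


VIEWERS = 5_000_000    # from the text
PUBLISH_RATE = 0.01    # assumed: each viewer comments once every 100 seconds
DISPLAY_BUDGET = 5.0   # assumed: comments per second a viewer can actually read

full = unsampled_deliveries_per_sec(VIEWERS, PUBLISH_RATE)
capped = sampled_deliveries_per_sec(VIEWERS, DISPLAY_BUDGET)
print(f"unsampled: {full:.1e}/s  sampled: {capped:.1e}/s  ratio: {full / capped:.0f}x")
```

Under these assumptions the unsampled fanout is on the order of 10¹¹ deliveries per second, while a per-viewer budget keeps the work linear in the number of viewers. The exact numbers matter less than the shape: one term grows with n², the other with n.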

This canonical is about the actual production fabric: where the long-lived socket terminates, where the sampler lives, how a per-room fanout actor survives a hot room of 5M concurrent viewers without a single shard saturating, how the publish leg keeps a moderation outage from ever breaking the user's send button, and how cross-region active-active maintains a single writer per room so the replay buffer doesn't fork on failover. It is also explicitly about the things that DON'T work at this scale: topic-per-room Kafka above 50K rooms, naïve EventSource 3-second reconnect, exact XTRIM MAXLEN, "dual-write for safety" during failover, and the SSE-over-HTTP/2 HOL-blocking trap.
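One of the failure modes named above, the fixed 3-second reconnect, can be contrasted with the decorrelated-jitter backoff described in the AWS Builders' Library entry in the reading list. The sketch below is a minimal illustration of that formula; the function name `reconnect_delays` and the `base`/`cap` values are hypothetical choices, not the source's implementation.

```python
import random
from typing import Iterator


def reconnect_delays(base: float, cap: float, rng: random.Random) -> Iterator[float]:
    """Yield reconnect delays using decorrelated jitter.

    Each delay is drawn uniformly from [base, 3 * previous_delay] and then
    clamped to `cap`, so clients that disconnected at the same instant
    spread out instead of reconnecting in lockstep.
    """
    delay = base
    while True:
        delay = min(cap, rng.uniform(base, delay * 3))
        yield delay


# Hypothetical values: 1 s floor, 30 s ceiling, seeded RNG for repeatability.
rng = random.Random(42)
delays = reconnect_delays(base=1.0, cap=30.0, rng=rng)
first_five = [round(next(delays), 2) for _ in range(5)]
print(first_five)
```

With a fixed 3-second timer, every client dropped by the same edge failure returns in the same second. With jitter, each client's schedule depends on its own random draws, which flattens the reconnect spike.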

Two design pressures distinguish this problem from every other real-time problem in the catalog. (1) It is bidirectional: every viewer is both a publisher AND a subscriber on the same socket, unlike the asymmetric "broadcast-only" shape of sports scores. (2) It is approximate by design: in a mega-room you SHIP a sampler that drops 99.99% of comments under a deliberate, ranked, fair-by-window policy, and the SLO for "did the user see this specific comment" is intentionally not 100%. Score events, by contrast, ride a privileged lane that bypasses both the sampler and the moderation pipeline: they're low-volume, authoritative, and never sacrificed.
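The two lanes can be sketched as follows. This is a toy model under stated assumptions: the `Room` class, the `rank` value attached to each comment, and the per-window `budget` are all hypothetical, and the `delivered` list stands in for the real socket fanout. It shows only the shape of the policy: comments compete for a fixed number of slots per window, and score events skip the competition.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Comment:
    rank: float                      # hypothetical ranking score, higher is better
    seq: int                         # arrival order, also breaks rank ties
    text: str = field(compare=False)


class Room:
    def __init__(self, budget: int):
        self.budget = budget         # max comments delivered per window
        self.delivered = []          # stand-in for the fanout to viewers
        self._window = []            # min-heap holding the current top `budget`

    def publish_score(self, payload: str) -> None:
        """Privileged lane: delivered immediately, never sampled."""
        self.delivered.append(("score", payload))

    def publish_comment(self, seq: int, rank: float, text: str) -> None:
        """Sampled lane: keep only the `budget` highest-ranked comments."""
        comment = Comment(rank, seq, text)
        if len(self._window) < self.budget:
            heapq.heappush(self._window, comment)
        else:
            # Push the newcomer, then drop whichever comment ranks lowest.
            heapq.heappushpop(self._window, comment)

    def close_window(self) -> int:
        """Flush the survivors in arrival order and start a fresh window."""
        kept = sorted(self._window, key=lambda c: c.seq)
        self._window = []
        self.delivered.extend(("comment", c.text) for c in kept)
        return len(kept)


room = Room(budget=3)
for i in range(1000):
    room.publish_comment(seq=i, rank=float(i % 10), text=f"comment-{i}")
room.publish_score("1-0")            # arrives mid-window, goes out at once
kept = room.close_window()
print(f"kept {kept} of 1000 comments; first delivered event: {room.delivered[0]}")
```

Resetting the heap at every window boundary is what makes the policy fair by window in this sketch: a burst of highly ranked comments in one window cannot crowd out the next.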

Reading: Slack — Flannel: an application-level edge cache · Discord — Scaling Elixir to 5M concurrent + Manifold + Maxjourney · Cloudflare — Durable Objects WebSocket Hibernation · WhatsApp — 2M sockets/box on Erlang BEAM (FreeBSD tuning) · LINE LIVE — sub-room sharding for celebrity streams · Twitch — chat architecture (Room + Clue, Go rewrite) · Fastly Fanout / Pushpin — GRIP proxy-and-hold · Redis Streams — XADD MAXLEN ~ N approximate trim · AWS Builders' Library — avoid retry storms / decorrelated jitter · SRE Workbook Ch.22 — Addressing cascading failures

pub/sub fanout

websocket vs sse vs long-poll

approximate-by-design sampling

edge-terminated long-lived sockets

moderation pipeline

thundering-herd reconnect

single-writer-per-room fencing

hash-routed shared topics

manifold relay tree

last-event-id replay