#32 Online Indicator
Green dot for contacts. Mind the N² watch problem. Approximate by design — never quote presence more precisely than reality.
Build the Online Indicator — the green dot next to every contact's name in your chat app, plus the "last seen 12 min ago" line that follows it. Used by Slack, Discord, WhatsApp, Messenger, Teams, Instagram, LinkedIn — every product with a contact list. Sounds simple. It is not.
Three things make this problem hard at production scale:
- N² watch problem. Every user is both a presence publisher and a presence subscriber. At 1B DAU each with ~500 contacts, the naive eager-broadcast design produces 500B subscription edges and ~250M frames/s of fanout. Physically impossible.
- Ephemeral state at billion-user RAM scale. ~500M concurrent online users × ~100 B/record = ~50 GB per region — hot, in RAM, all the time, fully replicated.
- Approximate by design. The product line "last seen 12 min ago" is a privacy + capacity invariant, not a UX choice. Quoting last_seen precisely (±1 s) leaks PII and forces hot-path persistence; quantizing to ±60 s solves both.
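The figures in the list above can be checked with a few lines of arithmetic. This is a rough sketch using only the numbers quoted in the text. The last two values are not stated in the source: they are what the ~250M frames/s figure would imply if every presence transition were pushed to all 500 contacts, and should be read as an assumption.

```python
# Back-of-envelope check of the scale figures quoted in the list above.
DAU = 1_000_000_000                 # daily active users
CONTACTS_PER_USER = 500             # average contact-list size
CONCURRENT_ONLINE = 500_000_000     # users online at the same time
BYTES_PER_RECORD = 100              # size of one hot presence record
FANOUT_FRAMES_PER_S = 250_000_000   # eager-broadcast fanout quoted in the text

# Every user watches every contact: one subscription edge per (user, contact).
subscription_edges = DAU * CONTACTS_PER_USER

# Hot presence state that has to live in RAM.
hot_state_gb = CONCURRENT_ONLINE * BYTES_PER_RECORD / 1e9

# ASSUMPTION (not in the source): each transition is broadcast to all contacts.
# Under that assumption the quoted fanout implies this many transitions/s...
transitions_per_s = FANOUT_FRAMES_PER_S / CONTACTS_PER_USER
# ...which is one online/offline flip per user roughly every this many seconds.
seconds_between_flips_per_user = DAU / transitions_per_s

print(f"subscription edges : {subscription_edges:,}")
print(f"hot state in RAM   : {hot_state_gb:.0f} GB")
print(f"implied transitions: {transitions_per_s:,.0f}/s")
print(f"implied flip period: {seconds_between_flips_per_user:,.0f} s per user")
```

Under that assumption, one state flip per user every ~33 minutes is already enough to reach the quoted fanout, which is why eager broadcast does not survive at this scale.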
The architecture is the canonical Discord/Slack/WhatsApp shape: a long-lived WebSocket to a stateful gateway, a sharded actor tier holding hot presence in RAM, a separate fanout relay tier as the N² solver, an ephemeral cache for cross-shard reads, a durable spine for the audit trail, and a quantized last-seen store for the offline case.
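One way to picture the quantized last-seen store is as a write path that discards sub-minute precision before anything is persisted. The sketch below assumes flooring to 60-second buckets, which keeps the error inside the ±60 s bound; the function names and the rendering thresholds are hypothetical and not taken from any of the products named above.

```python
BUCKET_S = 60  # quantization step; matches the ±60 s bound in the text


def quantize_last_seen(ts: float, bucket_s: int = BUCKET_S) -> int:
    """Floor a Unix timestamp to the start of its bucket.

    Two disconnects inside the same bucket become indistinguishable,
    so the stored value never reveals the exact second.
    """
    return int(ts // bucket_s) * bucket_s


def render_last_seen(last_seen_bucket: int, now: float) -> str:
    """Turn a stored bucket into the coarse line shown under a contact."""
    minutes = max(0, int(now - last_seen_bucket) // 60)
    if minutes < 1:
        return "last seen just now"
    if minutes < 60:
        return f"last seen {minutes} min ago"
    hours = minutes // 60
    if hours < 24:
        return f"last seen {hours} h ago"
    return f"last seen {hours // 24} d ago"


# Two disconnect times 54 s apart land in the same bucket.
a = quantize_last_seen(1_700_000_045)
b = quantize_last_seen(1_700_000_099)
print(a == b)

# Twelve and a half minutes later the client renders a coarse label.
print(render_last_seen(a, now=a + 12 * 60 + 30))
```

Because the bucket is computed before the write, repeated disconnects within the same minute produce an identical value, so the store can skip redundant writes as well as hide the exact second.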
Reading: How Discord Scaled Elixir to 5,000,000 Concurrent Users (discord.com/blog, 2017) · How Discord Handles Push Request Bursts with Elixir's GenStage (discord.com/blog) · How Discord Stores Trillions of Messages (discord.com/blog, 2023) · Slack's Disasterpiece Theater — Approachable Chaos Engineering (slack.engineering) · Slack's Outage on January 4th 2021 (slack.engineering) · Slack's Incident on 2-22-22 (slack.engineering) · Rick Reed — WhatsApp Scaling, Erlang Factory 2014 (erlang-factory.com) · WhatsApp — Giving You More Control Over Your Privacy (blog.whatsapp.com, 2012/2018) · Facebook TAO: A Distributed Data Store for the Social Graph (USENIX ATC 2013) · Google SRE Workbook — Alerting on SLOs (burn-rate alerts) · AWS Builders' Library — Timeouts, Retries and Backoff with Jitter · Cloudflare 2020-07-17 BGP-withdrawal post-mortem (blog.cloudflare.com) · Datadog 2023-03-08 multi-region connectivity outage retro (datadoghq.com)
viewport-bounded subscription (the N² solver)
lazy fanout vs eager broadcast
heartbeat TTL economics (set EX > heartbeat × 2 + jitter)
outbox: ephemeral cache + durable transition trail
approximate by design (±60s last_seen buckets)
privacy as first-class (everyone / contacts / nobody / invisible)
fail closed on 'unknown' — never render fake online
broadcast amplification storm on region heal
WS reconnect storm tolerance
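Two of the items above, the heartbeat TTL rule (EX > heartbeat × 2 + jitter) and failing closed on "unknown", are concrete enough to sketch. The example below is a minimal in-memory stand-in for the ephemeral cache, not a real client. The 30 s heartbeat, 5 s jitter, 1 s margin, and every class and function name are assumptions made for illustration.

```python
from enum import Enum


class Presence(Enum):
    ONLINE = "online"
    OFFLINE = "offline"
    UNKNOWN = "unknown"   # cache unreachable or answer not trustworthy


def presence_ttl_s(heartbeat_s: int, jitter_s: int, margin_s: int = 1) -> int:
    """TTL strictly greater than heartbeat * 2 + jitter.

    One missed heartbeat plus worst-case jitter must not expire the key;
    only a second consecutive miss should.
    """
    return 2 * heartbeat_s + jitter_s + margin_s


class EphemeralPresenceCache:
    """Hypothetical in-memory stand-in for a TTL-keyed presence cache."""

    def __init__(self, ttl_s: int) -> None:
        self._ttl_s = ttl_s
        self._expires_at: dict[str, float] = {}

    def heartbeat(self, user_id: str, now: float) -> None:
        # Each heartbeat pushes the expiry forward by one full TTL.
        self._expires_at[user_id] = now + self._ttl_s

    def read(self, user_id: str, now: float, reachable: bool = True) -> Presence:
        if not reachable:
            return Presence.UNKNOWN
        expires_at = self._expires_at.get(user_id)
        if expires_at is None or expires_at <= now:
            return Presence.OFFLINE
        return Presence.ONLINE


def show_green_dot(presence: Presence) -> bool:
    """Fail closed: only a positive ONLINE answer lights the dot."""
    return presence is Presence.ONLINE


ttl = presence_ttl_s(heartbeat_s=30, jitter_s=5)   # assumed heartbeat and jitter
cache = EphemeralPresenceCache(ttl)
cache.heartbeat("alice", now=0)

print(ttl)
print(cache.read("alice", now=65))                  # one missed beat plus jitter
print(cache.read("alice", now=ttl))                 # TTL elapsed
print(show_green_dot(cache.read("alice", now=10, reachable=False)))
```

Treating UNKNOWN as "no dot" means a cache outage shows contacts as not online rather than freezing stale green dots on screen.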