Arqly — System Design Mastery

#47 Distributed Cron — Mass Scheduled Email

Single trigger, 50M recipient idempotency, catch-up.


A campaign owner uploads a 50-million-recipient list and schedules a single send for 9:00 AM tomorrow. At 9:00 AM, one scheduler tick fires, and that single event must become 50 million emails leaving SES within 60 minutes. The strict invariant is that no recipient receives the same campaign twice, ever, regardless of scheduler crashes, region failovers, Kafka rebalances, fan-out worker deaths, SES throttles, or operator-initiated catch-up replays.
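To make the 60-minute target concrete, the arithmetic can be sketched as below. The recipient count and window come from the scenario above; the 10,000 sends-per-second figure in the usage lines is a purely hypothetical rate chosen for illustration, not an actual SES quota (real quotas are account-specific).

```python
# Back-of-the-envelope throughput for the scenario above.
RECIPIENTS = 50_000_000      # from the scenario
WINDOW_SECONDS = 60 * 60     # the 60-minute delivery window

# Sustained send rate needed to finish inside the window.
required_rate = RECIPIENTS / WINDOW_SECONDS  # about 13,889 sends per second


def minutes_to_drain(recipients: int, sends_per_second: float) -> float:
    """Minutes needed to send to `recipients` at a steady rate."""
    return recipients / sends_per_second / 60


if __name__ == "__main__":
    print(f"required rate: {required_rate:,.0f} sends/s")
    # Hypothetical lower rate, to show how quickly the window is missed.
    print(f"at 10,000 sends/s: {minutes_to_drain(RECIPIENTS, 10_000):.1f} min")
```

Any sustained rate below roughly 13,900 sends per second misses the window, which is why throttling and pacing matter to the rest of the design.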

This sounds like a simple distributed-systems problem and turns out to be one of the hardest. Three things define it: (1) at-least-once delivery everywhere except the very tail (the scheduler, the fan-out controller, the Kafka pipeline, every retry, and every catch-up replay are at-least-once, because making each of them exactly-once is impractical); (2) a durable per-recipient idempotency tail that collapses at-least-once into at-most-once-per-recipient at the SES send boundary; (3) a rate-paced catch-up controller that knows the difference between "fire the missed run now" and "the email is too stale to send". A 9 AM "good morning" email arriving at 1 PM is reputation damage, not delivery.
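Points (2) and (3) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names DedupeStore, claim, and send_once are hypothetical, the in-memory set stands in for a durable store with an atomic insert-if-absent operation, and the one-hour staleness limit in the usage note is an arbitrary example value. The claim is taken before the send, so a crash between the two loses that one email rather than duplicating it, which matches the at-most-once-per-recipient invariant.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable


@dataclass
class DedupeStore:
    """In-memory stand-in for a durable dedupe store (hypothetical).

    A real deployment would need an atomic conditional write in a durable
    store; this check-then-add is only safe within a single thread.
    """
    _claimed: set = field(default_factory=set)

    def claim(self, campaign_id: str, recipient_id: str) -> bool:
        """Return True only for the first claim of (campaign, recipient)."""
        key = (campaign_id, recipient_id)
        if key in self._claimed:
            return False
        self._claimed.add(key)
        return True


def send_once(
    store: DedupeStore,
    campaign_id: str,
    recipient_id: str,
    scheduled_at: datetime,
    now: datetime,
    max_staleness: timedelta,
    send: Callable[[str], None],
) -> str:
    """Send to one recipient at most once, and never after it is too stale."""
    # Catch-up guard: a replay that arrives too late is dropped, not sent.
    if now - scheduled_at > max_staleness:
        return "skipped_stale"
    # Idempotency tail: upstream redeliveries collapse to one claim.
    if not store.claim(campaign_id, recipient_id):
        return "duplicate"
    send(recipient_id)
    return "sent"
```

Called twice for the same recipient within the window, send_once returns "sent" and then "duplicate". Called for any recipient at 1 PM against a 9 AM schedule with a one-hour limit, it returns "skipped_stale" without consuming a claim.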

The architecture draws on published designs: Google's Borgcron paper (Davidovič & Guliani, ACM Queue 2015) for the scheduler tier, LinkedIn ATC and Pinterest NEP for the fan-out tier, Stripe's idempotency-key contract (as described by Brandur) for the dedupe tier, and DoorDash's Cadence-as-fallback pattern for the catch-up reconciler. Every load-bearing decision below has a published precedent.

Reading: Davidovič & Guliani — Reliable Cron across the Planet (ACM Queue 2015, Borgcron paper) · LinkedIn — Air Traffic Controller: Member-First Notifications (2016) · LinkedIn — Hermes mass email retros (eng blog) · Pinterest — NEP Notification System and Relevance (Medium 2017) · Uber — Cherami: Uber's durable distributed task queue · Uber — Announcing Cadence (Temporal predecessor) · DoorDash — Cadence as a Fallback for Event-Driven Processing · Discord — How we store trillions of messages (Cassandra → ScyllaDB) · Stripe — Designing robust idempotency keys (Brandur) · Brandur — Implementing Stripe-like Idempotency Keys in Postgres · AWS Builders' Library — Idempotency at Scale (re:Invent ARC403 2021) · AWS — Handling SES throttling (Maximum sending rate exceeded) · AWS — SES sending quotas & dedicated IP warmup · Mailgun — Bulk email sending with queue management · Slack — Tracing notifications (Go → Kafka → Elasticsearch) · Shopify — High availability background jobs · Vallery Lancey — Kubernetes CronJob Failed For 24 Days (case study against naive cron) · Quartz Scheduler — JDBC JobStore clustering docs (the misfire-threshold story) · Cloudflare — Cron Triggers internals (Nomad-distributed schedulers) · Apache Kafka — KIP-429 cooperative incremental rebalancing · Beyer et al. — Site Reliability Engineering ch. 21–23 (overload, cascading failures, critical state) · Campbell & Majors — Database Reliability Engineering ch. 9 (fencing tokens)

Borgcron-style at-least-once scheduler

per-recipient idempotency at fan-out tail

outbox (durable trigger log + publish)

saga (campaign create + bulk recipient upload)

active-region fencing token

two-tier dedupe (hot Redis + durable Cassandra)

rate-paced catch-up reconciler

per-tenant Kafka bulkheading

SES warm-pool reputation isolation

TCPA opt-out linearizability