A campaign owner uploads a 50-million-recipient list and schedules a single send for 9:00 AM tomorrow. At 9:00 AM, one scheduler tick fires, and that single event must become 50 million emails leaving SES within 60 minutes (an average of roughly 13,900 sends per second), under the strict invariant that no recipient ever receives the same campaign twice, regardless of scheduler crashes, region failovers, Kafka rebalances, fan-out worker deaths, SES throttles, or operator-initiated catch-up replays.
The problem sounds simple, yet it is one of the harder ones in distributed systems. Three things define it: (1) everything except the very tail is at-least-once: the scheduler, the fan-out controller, the Kafka pipeline, every retry, and every catch-up replay, because making each of them exactly-once is intractable; (2) a durable per-recipient idempotency tail collapses at-least-once into at-most-once-per-recipient at the SES send boundary; (3) a rate-paced catch-up controller knows the difference between "fire the missed run now" and "the email is too stale to send", since a 9 AM "good morning" email arriving at 1 PM is reputation damage, not delivery.
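To make point (2) concrete, here is a minimal sketch of a per-recipient idempotency tail. It is an illustration under stated assumptions, not the production design: the `SendLedger` class, the `deliver` helper, and the `sends` table are hypothetical names, and an in-memory SQLite database stands in for whatever durable store the real system uses. The idea is that a worker must win a unique-key insert on `(campaign_id, recipient_id)` before it is allowed to call the send function, so redelivered messages and replays become no-ops.

```python
import sqlite3
from typing import Callable


class SendLedger:
    """Hypothetical dedupe ledger keyed by (campaign_id, recipient_id)."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS sends ("
            " campaign_id TEXT NOT NULL,"
            " recipient_id TEXT NOT NULL,"
            " PRIMARY KEY (campaign_id, recipient_id))"
        )

    def claim(self, campaign_id: str, recipient_id: str) -> bool:
        """Return True only for the first caller to claim this pair."""
        with self.conn:  # commits on success, rolls back on exception
            cur = self.conn.execute(
                "INSERT OR IGNORE INTO sends (campaign_id, recipient_id)"
                " VALUES (?, ?)",
                (campaign_id, recipient_id),
            )
        # rowcount is 1 when the row was inserted, 0 when the key already existed.
        return cur.rowcount == 1


def deliver(
    ledger: SendLedger,
    campaign_id: str,
    recipient_id: str,
    send: Callable[[str], None],
) -> bool:
    """Claim first, then send. Returns True if this call performed the send."""
    if not ledger.claim(campaign_id, recipient_id):
        return False  # duplicate delivery attempt: drop it
    send(recipient_id)  # stand-in for the real SES call
    return True


if __name__ == "__main__":
    ledger = SendLedger(sqlite3.connect(":memory:"))
    sent: list[str] = []
    # The upstream pipeline is at-least-once, so duplicates are expected.
    for rid in ["alice", "bob", "alice", "alice", "bob"]:
        deliver(ledger, "campaign-42", rid, sent.append)
    print(sent)  # ['alice', 'bob']
```

Claiming before sending is a deliberate choice in this sketch: if a worker dies between the claim and the send, that recipient receives nothing rather than two copies, which matches the at-most-once-per-recipient guarantee described above.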
The architecture comes straight from Google's Borgcron paper ("Reliable Cron across the Planet", Davidovič and Guliani, ACM Queue 2015) for the scheduler tier, LinkedIn ATC and Pinterest NEP for the fan-out tier, Stripe / Brandur's idempotency-key contract for the dedupe tier, and DoorDash's Cadence-as-fallback pattern for the catch-up reconciler. Every load-bearing decision below has a published precedent; this is not speculation.
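Returning to the catch-up controller, the distinction drawn earlier between "fire the missed run now" and "the email is too stale to send" could be expressed as a small planning function. This is a sketch under assumptions, not a description of any of the cited systems: `plan_catchup`, `CatchupPlan`, `freshness_window`, and `max_rate_per_sec` are hypothetical names, and the policy of pacing at the cap when the backlog cannot finish in time is one choice among several.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class Action(Enum):
    FIRE_NOW = "fire_now"
    SKIP_STALE = "skip_stale"


@dataclass(frozen=True)
class CatchupPlan:
    action: Action
    rate_per_sec: float  # 0.0 when the run is skipped


def plan_catchup(
    scheduled_at: datetime,
    now: datetime,
    remaining: int,
    freshness_window: timedelta,
    max_rate_per_sec: float,
) -> CatchupPlan:
    """Decide whether a missed run is still worth sending, and how fast."""
    deadline = scheduled_at + freshness_window
    seconds_left = (deadline - now).total_seconds()
    if seconds_left <= 0:
        # Past the freshness deadline: sending now would hurt reputation.
        return CatchupPlan(Action.SKIP_STALE, 0.0)
    # Rate needed to drain the backlog before the deadline, capped so the
    # catch-up does not exceed the provider quota (assumed policy).
    needed = remaining / seconds_left
    return CatchupPlan(Action.FIRE_NOW, min(needed, max_rate_per_sec))


if __name__ == "__main__":
    nine_am = datetime(2025, 1, 1, 9, 0)  # naive datetimes, for brevity
    window = timedelta(minutes=60)

    # Scheduler recovered at 9:20 with 30M recipients still unsent.
    print(plan_catchup(nine_am, nine_am + timedelta(minutes=20),
                       30_000_000, window, 20_000.0))

    # Scheduler recovered at 1 PM: the "good morning" email is stale.
    print(plan_catchup(nine_am, nine_am + timedelta(hours=4),
                       30_000_000, window, 20_000.0))
```

In the first call the plan is to fire immediately at 12,500 sends per second (30 million recipients over the 2,400 seconds left); in the second the run is skipped. A caller would presumably re-plan periodically, so recipients still unsent when the deadline passes are dropped rather than delivered late.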