Build a workflow engine (Temporal / Airflow / Cadence style) (13 scenes)
Scene 06 · Retries and exponential backoff
The engine retries a failed activity on its own, widening the gap between attempts so a sick downstream can recover instead of being pinned down by a retry storm.
Previously

Retries let a transient 503 self-heal — but retries also mean an activity can RUN more than once. Picture the worst crash yet: the worker charges the card, then dies before recording the result. The engine, seeing no result, retries — and charges again. History and replay can't help here, because the effect already happened outside the recorded boundary. What stops THAT double-charge?

Diagram
The timeline is the ChargeCard activity re-attempting against a flaky downstream. A 'retry policy' is the rule that schedules those re-attempts — initial interval, backoff coefficient, max interval, max attempts. 'Exponential backoff' means each gap grows (1s, 2s, 4s, 8s), so retries don't hammer a sick service; 'jitter' adds a small random offset so many workflows don't retry in lockstep. The offered-load meter on the downstream shows why: tight retries keep it overwhelmed, stretched retries let it recover. A non-retryable error (CardDeclined) fails fast instead of retrying forever. Red = the activity running for real; blue = the workflow/engine side scheduling the durable timer+task for each retry.
[Diagram: "Retry the ChargeCard activity, not the whole workflow." Retry policy chip: activity ChargeCard, initial 1s, coeff ×2.0, max gap 8s, max attempts 5. Timeline of ChargeCard attempts: attempt 1 at t=0 fails with 503; attempt 2 (after a 1s gap) and attempt 3 (after a 2s gap) fail with 503; attempt 4 (after a 4s gap) succeeds. Each retry is scheduled as a timer+task. Offered-load meter for the sick Payment API: falling, recovering. Labels: "ChargeCard self-healed @ attempt 4 · service recovered"; "the cron retried the WHOLE job @ fixed 5 min". Caption: each gap doubles, so the Payment API gets room to drain and recover. Annotation on the chip: this rule (initial / coeff / cap / max attempts) is the retry policy.]
ORDER #1001 reaches step 1, ChargeCard $42, and the Payment API returns 503: briefly overloaded, not refusing the card. A naive system would fail the whole order. Instead the engine retries just the ACTIVITY, on a growing schedule: attempt 2 after 1s, attempt 3 after 2s, attempt 4 after 4s. That rule (the initial gap, how fast the gap grows, the cap, and how many attempts are allowed) is the activity's **retry policy**; read it off the chip at the top left. The gaps don't stay flat: each is bigger than the last, so the retries stop pounding a service that's already struggling. Growing the wait between attempts (1s, 2s, 4s, 8s) is **exponential backoff**. A small random offset, called *jitter*, is added so thousands of workflows don't all re-fire on the same tick. Watch the blip self-heal on a later attempt, and notice that each retry is scheduled as a durable timer+task, so even a crash mid-wait can't lose it.
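To make the schedule concrete, here is a minimal Python sketch of the policy shown on the chip (initial 1s, coefficient 2.0, max gap 8s, max attempts 5). The class and field names are illustrative assumptions, not the engine's actual API. Jitter is left out, and each attempt is treated as taking no time, so the numbers show only the waits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    """The four knobs named in the scene; field names are illustrative."""
    initial_interval: float = 1.0     # seconds to wait before the first retry
    backoff_coefficient: float = 2.0  # each gap is the previous one times this
    maximum_interval: float = 8.0     # cap on any single gap
    maximum_attempts: int = 5         # total attempts, including the first

    def gaps(self) -> list[float]:
        """Wait before each retry (no jitter): one gap per retry."""
        return [
            min(self.initial_interval * self.backoff_coefficient ** i,
                self.maximum_interval)
            for i in range(self.maximum_attempts - 1)
        ]

    def attempt_times(self) -> list[float]:
        """Start time of every attempt, measured from t=0."""
        times = [0.0]
        for gap in self.gaps():
            times.append(times[-1] + gap)
        return times


policy = RetryPolicy()
print(policy.gaps())           # [1.0, 2.0, 4.0, 8.0]
print(policy.attempt_times())  # [0.0, 1.0, 3.0, 7.0, 15.0]
```

Under these assumptions the successful attempt 4 in the scene starts at t=7s, and a fifth attempt, had it been needed, would start at t=15s.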
Implementation
Engine.executeActivity
retry the activity on a schedule until it succeeds
def executeActivity(activity, policy):
    for attempt in range(1, policy.maximumAttempts + 1):
        result = try_run(activity)
        if result.ok:
            return result                     # blip self-healed
        if result.errorType in policy.nonRetryableErrorTypes:
            raise result.error                # fail fast, no schedule
        if attempt == policy.maximumAttempts:
            raise result.error                # attempts exhausted, give up
        delay = policy.nextDelay(attempt)
        scheduleRetry(activity, after=delay)  # next attempt runs when the timer fires
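The loop above can be tried out in plain Python. This is a sketch under stated assumptions, not the engine's implementation: the exception classes, `execute_activity`, and `make_flaky_charge` are hypothetical names, and an in-process `sleep` stands in for the durable timer, so nothing here survives a crash. The `sleep` parameter is injectable so the example runs instantly and records the waits.

```python
import time


class ServiceUnavailable(Exception):
    """Stand-in for a transient 503 from the Payment API."""


class CardDeclined(Exception):
    """Stand-in for a permanent, non-retryable failure."""


def execute_activity(activity, *, max_attempts=5, initial=1.0, coeff=2.0,
                     max_gap=8.0, non_retryable=(CardDeclined,),
                     sleep=time.sleep):
    """Run `activity` until it succeeds, fails permanently, or runs out of attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return activity()
        except non_retryable:
            raise  # fail fast: retrying cannot change the outcome
        except Exception:
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the last error
            sleep(min(initial * coeff ** (attempt - 1), max_gap))


def make_flaky_charge(failures):
    """Build a ChargeCard stand-in that returns 503 `failures` times, then succeeds."""
    calls = {"n": 0}

    def charge_card():
        calls["n"] += 1
        if calls["n"] <= failures:
            raise ServiceUnavailable("503")
        return "charged $42"

    return charge_card, calls


waits = []
charge, calls = make_flaky_charge(failures=3)
print(execute_activity(charge, sleep=waits.append))  # charged $42
print(calls["n"], waits)                             # 4 [1.0, 2.0, 4.0]
```

Listing `CardDeclined` before the generic handler is what makes it fail fast: it is re-raised on the first attempt without any wait being scheduled.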
RetryPolicy.nextDelay
the exponentially growing gap, with a hard cap
def nextDelay(attempt):
    raw = initialInterval * backoffCoefficient ** (attempt - 1)
    gap = min(raw, maximumInterval)   # cap a single wait
    return gap + jitter()             # desync the herd
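The source does not say how `jitter()` is computed, so the sketch below assumes one simple shape: a uniform random offset of up to 10% of the capped gap. The `jitter_fraction` and `rng` parameters are hypothetical additions for illustration. As in the pseudocode, jitter is added after the cap, so a jittered wait can slightly exceed the maximum interval.

```python
import random


def next_delay(attempt, *, initial=1.0, coeff=2.0, max_gap=8.0,
               jitter_fraction=0.1, rng=random):
    """Gap to wait after failed attempt number `attempt` (1-based)."""
    raw = initial * coeff ** (attempt - 1)
    gap = min(raw, max_gap)  # cap a single wait
    # Assumed jitter shape: uniform in [0, jitter_fraction * gap].
    return gap + rng.uniform(0.0, jitter_fraction * gap)


rng = random.Random(42)  # seeded only so repeated runs print the same values
for attempt in range(1, 7):
    # Base gaps are 1, 2, 4, 8, 8, 8 seconds; each gets a small random offset.
    print(attempt, round(next_delay(attempt, rng=rng), 3))

# Ten workflows that all failed attempt 3 on the same tick:
# their wake-up times are spread over [4.0, 4.4] instead of landing together.
herd = sorted(next_delay(3, rng=rng) for _ in range(10))
print([round(d, 3) for d in herd])
```

Other jitter shapes exist, such as drawing the whole wait uniformly between zero and the capped gap. Which one a given engine uses is a design choice this scene leaves open.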
Engine.scheduleRetry
each retry is a durable timer+task, so a crash can't lose it
def scheduleRetry(activity, after):
    fireAt = now() + after
    history.append(TimerStarted(fireAt))   # durable
    # ...engine may crash here; on recovery the timer
    # is replayed from history and still fires...
    on fireAt:
        enqueue(activity, taskQueue)       # worker re-runs it
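A toy version of that crash-and-recover path might look like the sketch below. It is a set of assumptions rather than the engine's real design: `Engine`, `fire_due_timers`, and the `TimerFired` event are hypothetical, a list of JSON strings stands in for a durable append-only log, and time is passed in explicitly instead of being read from a clock.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Engine:
    """Toy engine: `history` stands in for a durable, append-only log."""
    history: list = field(default_factory=list)     # JSON strings, as if on disk
    task_queue: list = field(default_factory=list)  # in-memory, lost on a crash

    def schedule_retry(self, activity: str, fire_at: float) -> None:
        # Record the timer BEFORE waiting, so the intent survives a crash.
        self.history.append(json.dumps(
            {"type": "TimerStarted", "activity": activity, "fireAt": fire_at}))

    def fire_due_timers(self, now: float) -> None:
        """Enqueue every recorded timer that is due and has not fired yet."""
        events = [json.loads(e) for e in self.history]
        fired = {(e["activity"], e["fireAt"])
                 for e in events if e["type"] == "TimerFired"}
        for e in events:
            key = (e["activity"], e["fireAt"])
            if e["type"] == "TimerStarted" and key not in fired and e["fireAt"] <= now:
                # Hypothetical TimerFired event: marks the timer as handled.
                self.history.append(json.dumps(
                    {"type": "TimerFired", "activity": e["activity"],
                     "fireAt": e["fireAt"]}))
                fired.add(key)
                self.task_queue.append(e["activity"])  # a worker re-runs it


before = Engine()
before.schedule_retry("ChargeCard", fire_at=104.0)
durable_log = list(before.history)   # what survives the crash
del before                           # the engine process dies mid-wait

after = Engine(history=durable_log)  # a fresh engine reloads the log
after.fire_due_timers(now=103.0)
print(after.task_queue)              # []
after.fire_due_timers(now=104.0)
print(after.task_queue)              # ['ChargeCard']
after.fire_due_timers(now=200.0)
print(after.task_queue)              # ['ChargeCard']
```

The key point is the ordering: the timer is appended to history first, so a restarted engine can rebuild it from the log alone. Recording that a timer fired is what keeps a later scan from enqueuing the same retry twice.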