Retries and exponential backoff

Build a workflow engine (Temporal / Airflow / Cadence style) (13 scenes)

Scene 06 · Retries and exponential backoff

The engine retries a failed activity on its own, widening the gap between attempts so a sick downstream can recover instead of being pinned down by a retry storm.

Previously

Retries let a transient 503 self-heal — but retries also mean an activity can RUN more than once. Picture the worst crash yet: the worker charges the card, then dies before recording the result. The engine, seeing no result, retries — and charges again. History and replay can't help here, because the effect already happened outside the recorded boundary. What stops THAT double-charge?

Diagram

The timeline is the ChargeCard activity re-attempting against a flaky downstream. A 'retry policy' is the rule that schedules those re-attempts — initial interval, backoff coefficient, max interval, max attempts. 'Exponential backoff' means each gap grows (1s, 2s, 4s, 8s), so retries don't hammer a sick service; 'jitter' adds a small random offset so many workflows don't retry in lockstep. The offered-load meter on the downstream shows why: tight retries keep it overwhelmed, stretched retries let it recover. A non-retryable error (CardDeclined) fails fast instead of retrying forever. Red = the activity running for real; blue = the workflow/engine side scheduling the durable timer+task for each retry.
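The offered-load argument can be made concrete with a rough count of how many attempts land on the downstream in a fixed window. This is a minimal sketch, assuming each failed attempt returns instantly and ignoring jitter; `attempt_times` and its parameters are hypothetical names, not part of any engine API.

```python
def attempt_times(initial, coefficient, max_interval, window):
    """Return the start time (seconds) of every attempt within `window`, inclusive."""
    times, t, gap = [], 0.0, initial
    while t <= window:
        times.append(t)
        t += gap                                   # wait before the next attempt
        gap = min(gap * coefficient, max_interval)  # grow the gap, up to the cap
    return times


# Tight retries: a flat 1s gap (coefficient 1.0 means no growth).
fixed = attempt_times(initial=1.0, coefficient=1.0, max_interval=1.0, window=15.0)

# Stretched retries: the gap doubles each time (1s, 2s, 4s, 8s).
backoff = attempt_times(initial=1.0, coefficient=2.0, max_interval=60.0, window=15.0)

print(len(fixed), len(backoff))  # 16 5
print(backoff)                   # [0.0, 1.0, 3.0, 7.0, 15.0]
```

Under these assumptions a single workflow sends 16 requests in 15 seconds with a flat gap and 5 with doubling gaps; multiplied across many workflows, that difference is what the meter visualizes.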

Sources

Temporal documentation: Retry Policies

ORDER #1001 reaches step 1, ChargeCard $42, and the Payment API returns 503: it is briefly overloaded, not refusing the card. A naive system would fail the whole order. Instead the engine re-attempts just the ACTIVITY, on a growing schedule: attempt 2 after 1s, attempt 3 after 2s, attempt 4 after 4s. That rule is the activity's **retry policy**: the initial gap, how fast the gap grows, the cap, and the maximum number of attempts (read it off the chip top-left). The gaps don't stay flat; each is bigger than the last, so the retries stop pounding a service that's already struggling. Growing the wait between attempts (1s, 2s, 4s, 8s) is **exponential backoff**. A small random offset, called *jitter*, is added so thousands of workflows don't all re-fire on the same tick. Watch the blip self-heal on a later attempt, and notice that each retry is scheduled as a durable timer+task, so even a crash mid-wait can't lose it.
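The four knobs of a retry policy can be written down as a small value object to see which gaps they produce. This is a sketch only: the field names mirror the pseudocode's `initialInterval`, `backoffCoefficient`, `maximumInterval` and `maximumAttempts` in Python naming, the default values are illustrative assumptions, and jitter is left out so the numbers are deterministic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    initial_interval: float = 1.0     # wait after the first failed attempt (seconds)
    backoff_coefficient: float = 2.0  # how fast the gap grows
    maximum_interval: float = 100.0   # cap on any single wait (seconds)
    maximum_attempts: int = 5         # total attempts, including the first

    def gaps(self):
        """Waits before attempts 2..maximum_attempts, without jitter."""
        return [
            min(self.initial_interval * self.backoff_coefficient ** (n - 1),
                self.maximum_interval)
            for n in range(1, self.maximum_attempts)
        ]


print(RetryPolicy().gaps())                      # [1.0, 2.0, 4.0, 8.0]
print(RetryPolicy(maximum_interval=5.0).gaps())  # [1.0, 2.0, 4.0, 5.0]
```

Five attempts produce four waits, and lowering the cap flattens the tail of the schedule without touching the early gaps.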

Implementation

Engine.executeActivity

retry the activity on a schedule until it succeeds, fails fast, or runs out of attempts

def executeActivity(activity, policy):
    for attempt in range(1, policy.maximumAttempts + 1):
        result = try_run(activity)
        if result.ok:
            return result            # blip self-healed
        if result.errorType in policy.nonRetryableErrorTypes:
            raise result.error       # fail fast, no schedule
        if attempt == policy.maximumAttempts:
            raise result.error       # out of attempts, give up
        delay = policy.nextDelay(attempt)
        # the next iteration runs once this durable timer fires
        scheduleRetry(activity, after=delay)
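The loop above can be exercised in a single process to check its three exits: success, fail-fast, and exhaustion. The sketch below is an assumption-heavy stand-in, not the engine's implementation: exceptions replace the `result` object, an injectable `sleep` replaces the durable timer, and `execute_activity`, `flaky_charge`, and `ServiceUnavailable` are hypothetical names (`CardDeclined` is the non-retryable error from the scene).

```python
class ServiceUnavailable(Exception):
    """Transient failure, e.g. a 503 from the Payment API."""


class CardDeclined(Exception):
    """Business failure that retrying cannot fix."""


def execute_activity(activity, *, initial=1.0, coefficient=2.0, cap=60.0,
                     max_attempts=5, non_retryable=(CardDeclined,),
                     sleep=lambda seconds: None):
    """Run `activity`, retrying with exponential backoff. Returns (result, waits)."""
    waits = []
    for attempt in range(1, max_attempts + 1):
        try:
            return activity(), waits
        except non_retryable:
            raise                      # fail fast, nothing is scheduled
        except Exception:
            if attempt == max_attempts:
                raise                  # out of attempts
            delay = min(initial * coefficient ** (attempt - 1), cap)
            waits.append(delay)
            sleep(delay)               # stand-in for the durable timer


def flaky_charge(failures):
    """Build an activity that raises a 503 `failures` times, then succeeds."""
    calls = {"n": 0}

    def charge():
        calls["n"] += 1
        if calls["n"] <= failures:
            raise ServiceUnavailable("503")
        return "charged $42"

    return charge


result, waits = execute_activity(flaky_charge(2))
print(result, waits)  # charged $42 [1.0, 2.0]
```

Passing a no-op `sleep` keeps the example instant; swapping in `time.sleep` would make it wait for real, though neither survives a process crash the way the engine's timer does.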

RetryPolicy.nextDelay

the exponentially growing gap, with a hard cap

def nextDelay(self, attempt):
    # attempt is 1-based: the wait after attempt 1 is initialInterval
    raw = self.initialInterval * self.backoffCoefficient ** (attempt - 1)
    gap = min(raw, self.maximumInterval)   # cap a single wait
    return gap + jitter()                  # desync the herd
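To see what jitter buys, imagine 1000 workflows that all failed their third attempt at the same instant. The sketch below assumes jitter is a uniform random offset of up to 20% of the capped gap; that fraction, the function name `next_delay`, and the seeded generator are illustrative choices, since the scene only says the offset is small and random.

```python
import random


def next_delay(attempt, rng, initial=1.0, coefficient=2.0, cap=60.0,
               jitter_fraction=0.2):
    """Capped exponential gap plus a uniform random offset (assumed jitter shape)."""
    raw = initial * coefficient ** (attempt - 1)
    gap = min(raw, cap)                                 # cap a single wait
    return gap + rng.uniform(0, jitter_fraction * gap)  # desync the herd


rng = random.Random(42)  # seeded only to make the demo repeatable
delays = [next_delay(3, rng) for _ in range(1000)]

print(min(delays) >= 4.0, max(delays) <= 4.8)  # True True
print(len(set(delays)) > 1)                    # True
```

Without the offset, every one of those workflows would retry exactly 4 seconds later; with it, the retries spread across a window of about 0.8 seconds. Note that because the offset is added after the cap, a wait can exceed the maximum interval by up to the jitter amount.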

Engine.scheduleRetry

each retry is a durable timer+task, so a crash can't lose it

def scheduleRetry(activity, after):
    fireAt = now() + after
    history.append(TimerStarted(fireAt))   # durable
    # ...engine may crash here; on recovery the timer
    # is replayed from history and still fires...
    on fireAt:                              # pseudocode: when the timer fires
        enqueue(activity, taskQueue)        # worker re-runs it
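The crash-safety claim can be sketched with an in-memory model in which the history list plays the role of durable storage and everything else is lost on a crash. All names here (`Engine`, `recover`, `tick`) are hypothetical, time is passed in explicitly instead of read from a clock, and the sketch omits details a real engine needs, such as recording that a timer already fired so recovery does not fire it twice.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class TimerStarted:
    activity: str
    fire_at: float


class Engine:
    def __init__(self, history):
        self.history = history  # stand-in for durable storage; survives a crash
        self.timers = []        # in-memory heap of (fire_at, activity); lost on crash
        self.task_queue = []    # tasks waiting for a worker

    def schedule_retry(self, activity, now, after):
        event = TimerStarted(activity, now + after)
        self.history.append(event)  # persist first, then arm the timer
        heapq.heappush(self.timers, (event.fire_at, event.activity))

    @classmethod
    def recover(cls, history):
        """Rebuild the in-memory timers by replaying the durable history."""
        engine = cls(history)
        for event in history:
            if isinstance(event, TimerStarted):
                heapq.heappush(engine.timers, (event.fire_at, event.activity))
        return engine

    def tick(self, now):
        """Enqueue every activity whose timer is due at time `now`."""
        while self.timers and self.timers[0][0] <= now:
            _, activity = heapq.heappop(self.timers)
            self.task_queue.append(activity)


history = []
engine = Engine(history)
engine.schedule_retry("ChargeCard", now=0.0, after=4.0)

del engine                         # simulate a crash mid-wait
engine = Engine.recover(history)   # a fresh process replays history

engine.tick(now=3.0)
print(engine.task_queue)  # []
engine.tick(now=4.0)
print(engine.task_queue)  # ['ChargeCard']
```

The ordering inside `schedule_retry` is the point: the event is appended to history before the in-memory timer is armed, so there is no moment at which a crash can leave a retry scheduled only in memory.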
