Retries: idempotency and a token budget

Build a gRPC-style RPC framework (14 scenes)

Scene 07 · Retries: idempotency and a token budget

Only an idempotent method is safe to auto-retry, and even then a token-bucket budget must cap retries — or a brownout turns into a self-sustaining retry storm.

Previously

We learned to abort doomed work with deadlines and cancellation; now the mirror problem — work that FAILED and might be worth re-sending. But scene 1 warned we often can't tell whether the first attempt already ran, so a retry is a loaded gun.

Diagram

On the left, a client calls a single backend (GreeterService) that is in a brownout: slow and failing. The arrows are the original call plus one per retry. The big meter at the top shows the OFFERED LOAD on the backend, where 1.0× means exactly its capacity. Past roughly 4× the load becomes a storm and the backend flatlines; a METASTABLE badge latches when the backend stays down after the original trigger clears.

Idempotency: a method is idempotent when re-running it has no extra effect. For an idempotent method (greet) the retry arrows are green (safe); for a non-idempotent method (charge) they turn red, because every retry risks doing the work twice.

Retry budget (token bucket): the lower-left bucket holds tokens. Each failure drains one token and each success refills tokenRatio tokens. Once the bucket drops below half, retries PAUSE and the offered-load meter caps near 1.0×.

The backend is browning out: it is slow and dropping calls. With no budget and 3 retries per failed call, every client re-sends at once. Watch the OFFERED-LOAD meter on the backend climb. Partway through, the original slowness clears (the trigger turns off), but the load DOESN'T drop, because the retries are now feeding themselves. That self-sustaining overload, where the service stays down after its own cause is gone, is a *retry storm*: blind retries pile on exactly when a service can least afford it. This scene is about the two things that make retrying safe: idempotent methods and a retry budget.
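The arithmetic behind the meter can be sketched roughly as follows. This assumes each attempt fails independently with probability `p_fail`, which is a simplification (failures in a real brownout are correlated), and the function name `offered_load_multiplier` is illustrative, not part of the framework.

```python
def offered_load_multiplier(p_fail: float, max_attempts: int) -> float:
    """Expected number of attempts sent per logical call.

    Attempt k (0-based) is only sent if the previous k attempts all failed,
    which happens with probability p_fail ** k under the independence
    assumption stated above.
    """
    return sum(p_fail ** k for k in range(max_attempts))


# "3 retries per failed call" means up to 4 attempts in total.
healthy = offered_load_multiplier(0.0, 4)   # nothing fails: one attempt per call
brownout = offered_load_multiplier(0.5, 4)  # half of all attempts fail
outage = offered_load_multiplier(1.0, 4)    # everything fails: every retry is spent

print(healthy, brownout, outage)  # 1.0 1.875 4.0
```

Under these assumptions the worst case is `max_attempts` times the normal traffic, which is where a ceiling of about 4× comes from when each call is allowed 3 retries.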

Implementation

Client.callWithRetry

the retry loop wrapping every outbound call

from time import sleep

class RpcError(Exception):
    """A call failed and will not be retried (illustrative name)."""

def callWithRetry(method, req):
    for attempt in range(maxAttempts):     # slider: retries + 1
        status = send(method, req)
        if status == OK:
            budget.onSuccess()             # refill by tokenRatio
            return status
        if status not in retryableStatusCodes:
            raise RpcError(status)         # e.g. anything other than UNAVAILABLE
        budget.onFailure()                 # drain one token
        if attempt + 1 == maxAttempts or not budget.allow():
            raise RpcError(status)         # out of attempts, or bucket below half
        sleep(backoffWithJitter(attempt))  # wait before the next attempt
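The loop above calls `backoffWithJitter` without defining it. One common shape is exponential backoff with "full jitter": pick a uniformly random delay between zero and an exponentially growing ceiling. The sketch below is an assumption about what such a helper could look like; the parameter names and default values (`initial_backoff`, `multiplier`, `max_backoff`) are hypothetical, not taken from this series.

```python
import random
from typing import Optional


def backoff_with_jitter(attempt: int,
                        initial_backoff: float = 0.1,
                        multiplier: float = 2.0,
                        max_backoff: float = 5.0,
                        rng: Optional[random.Random] = None) -> float:
    """Return a delay in seconds for the given 0-based attempt number.

    The ceiling grows exponentially with the attempt number and is capped at
    max_backoff. The actual delay is drawn uniformly from [0, ceiling], so
    clients that failed at the same instant do not all retry at the same instant.
    """
    source = rng if rng is not None else random
    ceiling = min(max_backoff, initial_backoff * multiplier ** attempt)
    return source.uniform(0.0, ceiling)


# Example: delays for the first four attempts, seeded for repeatability.
seeded = random.Random(7)
delays = [backoff_with_jitter(n, rng=seeded) for n in range(4)]
```

The jitter matters as much as the exponent here: without it, every client that failed together would come back together, recreating the spike the backoff was meant to spread out.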

RetryBudget.allow

token bucket capping retries as a fraction of traffic

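tokens = maxTokens              # start with a full bucket

def onFailure():
    global tokens               # rebind the module-level counter
    tokens = max(0, tokens - 1)                     # each failure drains one token

def onSuccess():
    global tokens
    tokens = min(maxTokens, tokens + tokenRatio)    # each success refills tokenRatio

def allow():
    if not enabled:
        return True             # no budget: never pause
    return tokens >= maxTokens / 2   # pause once the bucket drops below half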
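A minimal runnable sketch of the same bucket as a class, so its behaviour can be checked in isolation. The class shape, the snake_case names, and the values `max_tokens=10` and `token_ratio=0.5` are assumptions chosen for the demonstration, not settings from the framework.

```python
class RetryBudget:
    """Token bucket that pauses retries once it drops below half full."""

    def __init__(self, max_tokens: float, token_ratio: float, enabled: bool = True):
        self.max_tokens = max_tokens
        self.token_ratio = token_ratio
        self.enabled = enabled
        self.tokens = max_tokens          # start with a full bucket

    def on_failure(self) -> None:
        # Each failure drains one whole token, never going below zero.
        self.tokens = max(0.0, self.tokens - 1.0)

    def on_success(self) -> None:
        # Each success refills only a fraction of a token, capped at the maximum.
        self.tokens = min(self.max_tokens, self.tokens + self.token_ratio)

    def allow(self) -> bool:
        if not self.enabled:
            return True                   # budget switched off: never pause
        return self.tokens >= self.max_tokens / 2


budget = RetryBudget(max_tokens=10.0, token_ratio=0.5)

for _ in range(5):
    budget.on_failure()
allowed_after_5 = budget.allow()   # 5.0 tokens left: exactly half, still allowed

budget.on_failure()
allowed_after_6 = budget.allow()   # 4.0 tokens left: below half, retries pause

budget.on_success()
budget.on_success()
allowed_after_recovery = budget.allow()   # back to 5.0 tokens: retries resume

print(allowed_after_5, allowed_after_6, allowed_after_recovery)  # True False True
```

Because a failure costs a full token and a success returns only `token_ratio` of one, it takes several successes to pay for each failure. That asymmetry is what ties the retry rate to a fraction of healthy traffic.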

Method.execute

why a replay is safe only for an idempotent method

# greet is idempotent: re-running returns the same value
def greet(name):
    return 'hello ' + name

# charge is NOT: each call moves money
def charge(name, amount):
    account[name].balance -= amount   # replay double-bills
    return receipt()

# the leak: a retry can't tell if the first attempt ran
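The difference can be made concrete with a small simulation of the failure mode described above: the first attempt runs on the server, its reply is lost, and the client retries. The helper `call_with_lost_reply` and the in-memory `balances` dictionary are hypothetical stand-ins for the network and the account store.

```python
balances = {"ada": 100}


def greet(name: str) -> str:
    # Idempotent: no state changes, same input gives the same output.
    return "hello " + name


def charge(name: str, amount: int) -> int:
    # Not idempotent: every execution moves money.
    balances[name] -= amount
    return balances[name]


def call_with_lost_reply(method, *args):
    """Run the method twice, as happens when a reply is lost and the client retries."""
    method(*args)          # first attempt: executed, but the client never hears back
    return method(*args)   # retry: the server executes the request again


greeting = call_with_lost_reply(greet, "ada")
balance = call_with_lost_reply(charge, "ada", 30)

print(greeting)  # hello ada
print(balance)   # 40
```

The replayed `greet` is indistinguishable from a single call, while the replayed `charge` leaves the account at 40 instead of 70: one intended charge of 30 was applied twice.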
