Build a Service Mesh (Envoy / Istio style) (13 scenes)
Scene 08 · Rate limiting — the token bucket
Per-client token bucket: each request takes a token, an empty bucket returns 429. Local is cheap and drifts; global stays exact via a coordinator.
Previously

Ejecting a bad replica handles broken backends; the next failure shape is a HEALTHY backend being asked to do more than it can. The proxy needs a way to say 'no, slow down' — that policy is rate limiting.

Diagram
**Rate limiting** caps the rate at which a proxy accepts requests — over the cap the proxy returns a `429` (the badge on bounced arrows). The mechanism is the **token bucket**: a literal bucket of N circular tokens with a faucet dripping +1 every (1/R) seconds; each request grabs one token to pass green, or bounces 429 when the bucket is empty. Below, the *scope* toggle swaps the topology — **local rate limit** gives every sidecar its own bucket (cheap, but the effective fleet cap drifts to `cap × sidecars`); **global rate limit** routes every sidecar through one shared bucket on an external rate-limit service (accurate, but adds an RPC hop).
[Diagram: SCOPE · LOCAL (per-sidecar bucket). Sidecar `checkout-sc-1`; faucet +1 / 0.50s (R = 2 rps); bucket `checkout-sc…` at 3 / 5 tokens. Configured per-bucket: 1,000 rps × 1 sidecars (independent). Effective fleet RPS: 1,000 rps. One proxy, one bucket: configured cap and fleet cap are the same number. Legend: ↓ token = one allowed request; ← faucet drips at the refill rate R; empty bucket → 429 bounce.]
One sidecar, one bucket. The faucet drips +1 token every 0.5s (R = 2 rps). Green arrows are requests that grabbed a token and passed; the 429 badge is what bounces back when the bucket is empty. That policy — capping the accept rate — is *rate limiting*. The mechanism on screen — N tokens + refill — is the *token bucket*.
Implementation
TokenBucket.on_request
the per-bucket filter: refill, then take one or 429
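    import time
    from types import SimpleNamespace

    CAPACITY = 5   # bucket size N (5 in the diagram)
    RATE = 2.0     # refill rate R in tokens per second (2 rps in the diagram)
    ALLOW, DENY_429 = "ALLOW", "DENY_429"  # placeholder verdicts

    bucket = SimpleNamespace(tokens=CAPACITY, last_refill=time.monotonic())

    def on_request():
        now = time.monotonic()  # read the clock once so no elapsed time is lost
        elapsed = now - bucket.last_refill
        # Lazy refill: credit elapsed * RATE tokens, never above CAPACITY.
        bucket.tokens = min(CAPACITY, bucket.tokens + elapsed * RATE)
        bucket.last_refill = now
        if bucket.tokens >= 1:
            bucket.tokens -= 1
            return ALLOW
        return DENY_429  # less than one whole token left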
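To check the refill arithmetic without waiting on a real clock, the same refill-then-take logic can be wrapped in a class that accepts an injectable clock. This is a minimal sketch, not the tutorial's own implementation: the `TokenBucket` class shape, its `clock` parameter, and `FakeClock` are assumptions made for illustration, while capacity 5 and R = 2 rps are taken from the diagram.

```python
class FakeClock:
    """Hypothetical manual clock so the example is deterministic."""
    def __init__(self):
        self.t = 0.0

    def __call__(self):
        return self.t

    def advance(self, seconds):
        self.t += seconds


class TokenBucket:
    """Refill-then-take bucket with an injectable clock (assumed shape)."""
    def __init__(self, capacity, rate, clock):
        self.capacity = capacity
        self.rate = rate                # tokens per second
        self.clock = clock
        self.tokens = float(capacity)   # start full
        self.last_refill = clock()

    def on_request(self):
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return "ALLOW"
        return "DENY_429"


clock = FakeClock()
bucket = TokenBucket(capacity=5, rate=2.0, clock=clock)

# A burst of 7 requests at t=0: the 5 stored tokens pass, the rest bounce.
burst = [bucket.on_request() for _ in range(7)]

# 0.5 s later the faucet has dripped exactly one token (0.5 s * 2 rps).
clock.advance(0.5)
after_wait = [bucket.on_request() for _ in range(2)]

# A long idle period refills the bucket to capacity, never beyond it.
clock.advance(60.0)
after_idle = [bucket.on_request() for _ in range(6)]

print(burst.count("ALLOW"), after_wait, after_idle.count("ALLOW"))
```

The burst admits 5 of 7 requests, the half-second wait buys exactly one more, and the minute of idle time still yields only 5: capacity bounds the burst, the rate bounds the sustained throughput.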
Local scope — each proxy keeps its own bucket
fleet cap = configured cap × sidecar count (drift)
    # configured: 1000 rps per proxy
    #  1 sidecar  -> effective fleet =  1000 rps
    #  3 sidecars -> effective fleet =  3000 rps
    # 15 sidecars -> effective fleet = 15000 rps

    def on_request():               # runs in every sidecar
        return bucket.on_request()  # this sidecar's own bucket, no RPC hop
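The drift figures above can be reproduced with a small simulation. The sketch below is a simplified model rather than a proxy implementation: it assumes each sidecar starts a one-second window with its full configured allowance, that no refill happens inside the window, and that a load balancer spreads requests round-robin. The function name `fleet_allowed_local` is hypothetical.

```python
CONFIGURED_CAP = 1000  # requests each sidecar admits in one 1-second window


def fleet_allowed_local(sidecars, offered):
    """Count requests admitted fleet-wide when every sidecar limits alone."""
    tokens = [CONFIGURED_CAP] * sidecars   # one independent bucket per sidecar
    allowed = 0
    for k in range(offered):
        i = k % sidecars                   # round-robin load balancing (assumed)
        if tokens[i] >= 1:
            tokens[i] -= 1
            allowed += 1
    return allowed


for n in (1, 3, 15):
    print(n, "sidecars ->", fleet_allowed_local(n, offered=20_000), "allowed")
```

With 20,000 requests offered, the fleet admits 1,000, 3,000, and 15,000 requests for 1, 3, and 15 sidecars: the configured cap never changed, only the replica count did.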
Global scope — one shared bucket on a side service
every request makes one gRPC hop to the RL service
    def on_request(req):  # runs in every sidecar
        # grpc.call is shorthand for "invoke the rate-limit RPC", not a real
        # function in the grpc package; a real client calls a generated stub.
        resp = grpc.call(
            RATE_LIMIT_SERVICE,  # e.g. Lyft ratelimit / Redis
            descriptors=[
                ("client_id", req.client_id),
            ],
        )
        if resp.code == OVER_LIMIT:
            return DENY_429
        return ALLOW

    # fleet cap == configured cap, regardless of sidecar count
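To see why the shared bucket keeps the fleet cap fixed, the external service can be replaced by an in-process stand-in. Everything named here is hypothetical: `SharedLimiter`, its `should_rate_limit` method, and the `Sidecar` class are illustrative only and are not the API of Lyft ratelimit or any real client library. The model covers a single window with no refill, which is enough to show that the count is shared across callers.

```python
from collections import defaultdict

OK, OVER_LIMIT = "OK", "OVER_LIMIT"


class SharedLimiter:
    """Hypothetical in-process stand-in for an external rate-limit service."""
    def __init__(self, cap):
        self.cap = cap
        self.used = defaultdict(int)   # descriptor tuple -> requests admitted

    def should_rate_limit(self, descriptors):
        key = tuple(descriptors)
        if self.used[key] >= self.cap:
            return OVER_LIMIT
        self.used[key] += 1
        return OK


class Sidecar:
    def __init__(self, limiter):
        self.limiter = limiter   # in a real mesh this would be an RPC client

    def on_request(self, client_id):
        code = self.limiter.should_rate_limit([("client_id", client_id)])
        return "DENY_429" if code == OVER_LIMIT else "ALLOW"


def fleet_allowed_global(sidecars, offered, cap=1000):
    """Count requests admitted fleet-wide when all sidecars share one bucket."""
    limiter = SharedLimiter(cap)
    fleet = [Sidecar(limiter) for _ in range(sidecars)]
    results = [fleet[k % sidecars].on_request("tenant-a") for k in range(offered)]
    return results.count("ALLOW")


for n in (1, 3, 15):
    print(n, "sidecars ->", fleet_allowed_global(n, offered=20_000), "allowed")
```

All three fleet sizes admit exactly 1,000 requests for `tenant-a`, because the counter lives in one place and is keyed by the descriptor rather than by the sidecar that received the request.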
Choosing scope
the one-line rule of thumb
    # Local: cheap, no extra hop;
    #   drifts under autoscaling (cap * sidecars).
    #   Good for coarse 'be polite' spike absorbers.

    # Global: exact aggregate; adds 1 RPC per request and a
    #   dependency whose outage matters at every hop.
    #   Required for per-tenant SLA / quota enforcement.
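The outage dependency noted for global scope implies a decision the rule of thumb leaves open: what a sidecar does when the rate-limit service does not answer. The sketch below is one possible shape, not something the scene prescribes. It assumes a hypothetical `check` callable that performs the remote lookup and raises `TimeoutError` when the service is unreachable, and a hypothetical `fail_open` flag that selects the policy.

```python
def on_request(check, client_id, fail_open=True):
    """Return ALLOW or DENY_429, tolerating an unreachable limiter.

    `check` is a hypothetical callable returning "OK" or "OVER_LIMIT",
    and raising TimeoutError if the rate-limit service cannot be reached.
    """
    try:
        code = check(client_id)
    except TimeoutError:
        # Fail open: keep serving, but the limit is not enforced meanwhile.
        # Fail closed: keep the limit, but the outage turns into 429s.
        return "ALLOW" if fail_open else "DENY_429"
    return "DENY_429" if code == "OVER_LIMIT" else "ALLOW"


def healthy(client_id):
    # Stand-in for a working service that has throttled one client.
    return "OVER_LIMIT" if client_id == "noisy" else "OK"


def down(client_id):
    # Stand-in for a service outage.
    raise TimeoutError("rate-limit service unreachable")


print(on_request(healthy, "quiet"))
print(on_request(healthy, "noisy"))
print(on_request(down, "quiet", fail_open=True))
print(on_request(down, "quiet", fail_open=False))
```

Failing open suits the coarse spike-absorber case, where availability matters more than exactness; failing closed is the safer default when the limit enforces a quota that must not be exceeded.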