Build a Service Mesh (Envoy / Istio style) (13 scenes)
Scene 08 · Rate limiting — the token bucket
Per-client token bucket: each request takes a token, an empty bucket returns 429. Local is cheap and drifts; global stays exact via a coordinator.
Previously
Ejecting a bad replica handles broken backends; the next failure shape is a HEALTHY backend being asked to do more than it can. The proxy needs a way to say 'no, slow down' — that policy is rate limiting.
Scene 09
Rate limiting — the token bucket
Diagram
**Rate limiting** caps the rate at which a proxy accepts requests; over the cap, the proxy returns HTTP `429` Too Many Requests (the badge on bounced arrows). The mechanism is the **token bucket**: a bucket holding at most N tokens, with a faucet that adds one token every 1/R seconds. Each request takes one token and passes (green), or bounces with a 429 when the bucket is empty. Below, the *scope* toggle swaps the topology. A **local rate limit** gives every sidecar its own bucket: cheap, but the effective fleet-wide cap becomes `cap × sidecars`, so it drifts as the sidecar count changes. A **global rate limit** routes every sidecar through one shared bucket on an external rate-limit service: accurate, but it adds an RPC hop per request.
Diagram labels: a token is one allowed request; the faucet drips at the refill rate R; an empty bucket bounces the request with a 429.
One sidecar, one bucket. The faucet drips +1 token every 0.5s (R = 2 rps). Green arrows are requests that grabbed a token and passed; the 429 badge is what bounces back when the bucket is empty. That policy — capping the accept rate — is *rate limiting*. The mechanism on screen — N tokens + refill — is the *token bucket*.
Implementation
TokenBucket.on_request
the per-bucket filter: refill, then take one or 429
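```python
import time

CAPACITY = 10  # N, maximum tokens in the bucket (illustrative value)
RATE = 2.0     # R, tokens refilled per second (illustrative value)
ALLOW, DENY_429 = "ALLOW", "DENY_429"

class TokenBucket:
    def __init__(self):
        self.tokens = CAPACITY               # start full
        self.last_refill = time.monotonic()

    def on_request(self):
        now = time.monotonic()               # read the clock once per request
        elapsed = now - self.last_refill
        self.tokens = min(CAPACITY, self.tokens + elapsed * RATE)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return ALLOW
        return DENY_429                      # bucket empty

bucket = TokenBucket()
```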
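To check the refill arithmetic without waiting on a real clock, the same refill-then-take logic can be driven by a fake clock. This is a minimal sketch, not the scene's actual implementation: the `FakeClock` helper, the `clock` parameter, the capacity of 3, and the bare `200`/`429` return codes are assumptions made for illustration. Only R = 2 rps comes from the diagram.

```python
class FakeClock:
    """Hypothetical test helper: a clock that only moves when told to."""

    def __init__(self):
        self.t = 0.0

    def __call__(self):
        return self.t

    def advance(self, seconds):
        self.t += seconds


class TokenBucket:
    """Token bucket with an injectable clock (an assumption, for testing)."""

    def __init__(self, capacity, rate, clock):
        self.capacity = capacity        # N: maximum tokens the bucket holds
        self.rate = rate                # R: tokens added per second
        self.clock = clock              # zero-argument callable, in seconds
        self.tokens = float(capacity)   # start full
        self.last_refill = clock()

    def on_request(self):
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return 200
        return 429


def demo():
    clock = FakeClock()
    bucket = TokenBucket(capacity=3, rate=2.0, clock=clock)
    burst = [bucket.on_request() for _ in range(4)]       # 4 requests, 3 tokens
    clock.advance(0.5)                                    # one drip at R = 2 rps
    after_drip = [bucket.on_request() for _ in range(2)]  # 2 requests, 1 token
    return burst, after_drip
```

Calling `demo()` returns `([200, 200, 200, 429], [200, 429])`. The full bucket absorbs a burst of three, the fourth request bounces, and half a second later exactly one more token is available.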
Local scope — each proxy keeps its own bucket
fleet cap = configured cap × sidecar count (drift)
```python
# configured: 1000 rps per proxy
#  1 sidecar  -> effective fleet =  1000 rps
#  3 sidecars -> effective fleet =  3000 rps
# 15 sidecars -> effective fleet = 15000 rps

def on_request():               # runs in every sidecar
    return bucket.on_request()  # no RPC hop
```
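The drift figures above can be reproduced with a small simulation. This is a rough sketch under simplifying assumptions: it models a single one-second budget per sidecar (refill is ignored), and it assumes the load balancer spreads requests round-robin. The function name `fleet_admitted` and its parameters are hypothetical.

```python
def fleet_admitted(sidecars, per_proxy_cap, offered):
    """Count requests admitted fleet-wide when every sidecar owns a bucket.

    Each sidecar starts with `per_proxy_cap` tokens and no refill happens,
    so the result approximates one second of traffic at the configured cap.
    """
    buckets = [per_proxy_cap] * sidecars   # one independent bucket per sidecar
    admitted = 0
    for i in range(offered):
        s = i % sidecars                   # round-robin balancing (assumption)
        if buckets[s] >= 1:
            buckets[s] -= 1                # this sidecar spends its own token
            admitted += 1
    return admitted


def drift_table(per_proxy_cap=1000, offered=20000):
    """Map sidecar count to the number of admitted requests."""
    return {n: fleet_admitted(n, per_proxy_cap, offered) for n in (1, 3, 15)}
```

With 20000 offered requests and a 1000-token cap, `drift_table()` returns `{1: 1000, 3: 3000, 15: 15000}`. No sidecar can see what the others have admitted, so the fleet-wide total scales with the sidecar count.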
Global scope — one shared bucket on a side service
every request makes one gRPC hop to the RL service
```python
# Pseudocode: grpc.call stands in for a generated gRPC client stub.
def on_request():  # runs in every sidecar
    resp = grpc.call(
        RATE_LIMIT_SERVICE,  # e.g. Lyft ratelimit / Redis
        descriptors = [
            ('client_id', req.client_id),
        ],
    )
    if resp.code == OVER_LIMIT:
        return DENY_429
    return ALLOW

# fleet cap == configured cap, regardless of sidecar count
```
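To see why the shared bucket keeps the aggregate exact, the external service can be replaced by an in-process stand-in. This is only a sketch: `RateLimitService`, `Sidecar`, and `admitted_for` are hypothetical names, the method call replaces what would be a network RPC, and refill is left out so the numbers describe a single window.

```python
class RateLimitService:
    """In-process stand-in for the external rate-limit service (hypothetical)."""

    def __init__(self, limit_per_client):
        self.limit = limit_per_client
        self.remaining = {}  # client_id -> tokens left in the current window

    def should_rate_limit(self, client_id):
        left = self.remaining.setdefault(client_id, self.limit)
        if left < 1:
            return "OVER_LIMIT"
        self.remaining[client_id] = left - 1
        return "OK"


class Sidecar:
    def __init__(self, service):
        self.service = service  # every sidecar shares the same service

    def on_request(self, client_id):
        # In a real mesh this call would be one RPC hop per request.
        if self.service.should_rate_limit(client_id) == "OVER_LIMIT":
            return 429
        return 200


def admitted_for(sidecars, limit, offered, client_id="tenant-a"):
    """Count admitted requests for one client across a fleet of sidecars."""
    service = RateLimitService(limit)
    fleet = [Sidecar(service) for _ in range(sidecars)]
    codes = [fleet[i % sidecars].on_request(client_id) for i in range(offered)]
    return codes.count(200)
```

Here `admitted_for(1, 1000, 20000)` and `admitted_for(15, 1000, 20000)` both return `1000`. Because the counter lives in one place and is keyed by `client_id`, adding sidecars does not raise the cap.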
Choosing scope
the one-line rule of thumb
```python
# Local: cheap, no extra hop;
#   drifts under autoscaling (cap * sidecars).
#   Good for coarse 'be polite' spike absorbers.

# Global: exact aggregate; adds 1 RPC per request and a
#   dependency whose outage matters at every hop.
#   Required for per-tenant SLA / quota enforcement.
```
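Because the global service is a dependency on every request, the sidecar also needs a policy for when that service is unreachable. The sketch below shows the two usual choices, fail open or fail closed. It is an illustration under stated assumptions, not a description of any particular proxy: `RateLimitUnavailable`, `decide`, and the `fail_open` flag are hypothetical, and returning `503` when failing closed is a choice made here because the client is not actually over its limit.

```python
class RateLimitUnavailable(Exception):
    """Raised by the (hypothetical) client when the service cannot be reached."""


def decide(check_global, fail_open):
    """Return an HTTP status for one request.

    check_global: zero-argument callable that returns True when the client
        is over its limit and raises RateLimitUnavailable on an outage.
    fail_open: if True, admit traffic during an outage; if False, reject it.
    """
    try:
        over_limit = check_global()
    except RateLimitUnavailable:
        # Outage: availability (fail open) versus strict enforcement (fail closed).
        return 200 if fail_open else 503
    return 429 if over_limit else 200


def service_down():
    """Simulates an unreachable rate-limit service."""
    raise RateLimitUnavailable()
```

For example, `decide(service_down, fail_open=True)` returns `200` and `decide(service_down, fail_open=False)` returns `503`. Failing open keeps the backend reachable but suspends the quota for the duration of the outage. Failing closed preserves the quota at the cost of rejecting all traffic.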