Rate limiting — the token bucket — Build a Service Mesh (Envoy / Istio style)

Build a Service Mesh (Envoy / Istio style) (13 scenes)

Scene 08 · Rate limiting — the token bucket

Per-client token bucket: each request takes a token, an empty bucket returns 429. Local is cheap and drifts; global stays exact via a coordinator.

Previously

Ejecting a bad replica handles broken backends; the next failure shape is a HEALTHY backend being asked to do more than it can. The proxy needs a way to say 'no, slow down' — that policy is rate limiting.



**Rate limiting** caps the rate at which a proxy accepts requests; anything over the cap gets a `429 Too Many Requests` response (the badge on bounced arrows). The mechanism is the **token bucket**: a bucket that holds at most N tokens, with a faucet that drips in one token every 1/R seconds. Each request takes one token and passes (green), or bounces with a 429 when the bucket is empty. N bounds the burst size and R sets the sustained rate. Below, the *scope* toggle swaps the topology. A **local rate limit** gives every sidecar its own bucket: cheap, but the effective fleet-wide cap becomes `cap × sidecars`, so it drifts as the fleet scales. A **global rate limit** routes every sidecar through one shared bucket on an external rate-limit service: accurate, but every request pays an extra RPC hop.

Diagram callouts: each token is one allowed request; the faucet drips at the refill rate R; an empty bucket bounces the request with a 429.

One sidecar, one bucket. The faucet drips +1 token every 0.5s (R = 2 rps). Green arrows are requests that grabbed a token and passed; the 429 badge is what bounces back when the bucket is empty. That policy — capping the accept rate — is *rate limiting*. The mechanism on screen — N tokens + refill — is the *token bucket*.

Implementation

TokenBucket.on_request

the per-bucket filter: refill, then take one or 429

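import time

CAPACITY = 10  # max tokens N (burst size); illustrative value
RATE = 2.0     # refill rate R, in tokens per second
ALLOW, DENY_429 = "ALLOW", "DENY_429"

class TokenBucket:
    def __init__(self):
        self.tokens = CAPACITY
        self.last_refill = time.monotonic()

    def on_request(self):
        now = time.monotonic()  # read the clock once so no elapsed time is lost
        elapsed = now - self.last_refill
        # refill for the time that has passed, never above CAPACITY
        self.tokens = min(CAPACITY, self.tokens + elapsed * RATE)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return ALLOW
        return DENY_429  # bucket empty

bucket = TokenBucket()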
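
To see the refill arithmetic without waiting on a real clock, the sketch below repeats the same refill-then-take logic with a clock that is advanced by hand. `FakeClock`, `SimBucket`, and the capacity of 3 are illustrative assumptions rather than part of the scene's code; the rate of 2 tokens per second matches the diagram.

```python
class FakeClock:
    """Manually advanced clock so the run is deterministic (illustrative helper)."""

    def __init__(self):
        self.t = 0.0

    def __call__(self):
        return self.t

    def advance(self, seconds):
        self.t += seconds


class SimBucket:
    """Refill-then-take token bucket with the clock passed in (hypothetical name)."""

    def __init__(self, capacity, rate, clock):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = float(capacity)   # start full
        self.last_refill = clock()

    def on_request(self):
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True    # allowed
        return False       # would be a 429


clock = FakeClock()
bucket = SimBucket(capacity=3, rate=2.0, clock=clock)

burst = [bucket.on_request() for _ in range(5)]        # 5 requests at t=0
clock.advance(0.5)                                     # 0.5 s * 2 rps = 1 new token
after_half_second = [bucket.on_request() for _ in range(2)]
clock.advance(10.0)                                    # long idle: refill is capped at 3
after_idle = [bucket.on_request() for _ in range(4)]

print(burst)              # [True, True, True, False, False]
print(after_half_second)  # [True, False]
print(after_idle)         # [True, True, True, False]
```

The first burst drains the three starting tokens, half a second buys exactly one more request, and a long idle period never banks more than the capacity.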

Local scope — each proxy keeps its own bucket

fleet cap = configured cap × sidecar count (drift)

# configured: 1000 rps per proxy
#   1 sidecar  -> effective fleet = 1000 rps
#   3 sidecars -> effective fleet = 3000 rps
#  15 sidecars -> effective fleet = 15000 rps

def on_request():  # runs in every sidecar
    return bucket.on_request()  # no RPC hop
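
The drift figures above can be reproduced with a small simulation. This is a sketch under stated assumptions: time moves in one-second ticks, tokens are integers, each bucket's capacity equals one second of refill, and every sidecar is offered more traffic than its cap. `fleet_admit_rate` and its parameters are hypothetical names introduced for the example.

```python
def fleet_admit_rate(sidecars, per_proxy_rps, offered_rps_per_proxy, seconds=10):
    """Average fleet-wide admitted rps when every sidecar owns its own bucket.

    Buckets start empty so the result reflects the steady-state rate,
    not the initial burst.
    """
    capacity = per_proxy_rps          # assumption: burst size = one second of refill
    tokens = [0] * sidecars
    admitted = 0
    for _ in range(seconds):
        for i in range(sidecars):
            # refill this sidecar's bucket, capped at its capacity
            tokens[i] = min(capacity, tokens[i] + per_proxy_rps)
            # requests that find a token this second
            take = min(tokens[i], offered_rps_per_proxy)
            tokens[i] -= take
            admitted += take
    return admitted // seconds


rates = {
    n: fleet_admit_rate(n, per_proxy_rps=1000, offered_rps_per_proxy=5000)
    for n in (1, 3, 15)
}
print(rates)  # {1: 1000, 3: 3000, 15: 15000}
```

No sidecar ever exceeds its own 1000 rps, yet the backend behind the fleet sees the sum, which changes every time the autoscaler adds or removes a sidecar.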

Global scope — one shared bucket on a side service

every request makes one gRPC hop to the RL service

def on_request(req):  # runs in every sidecar
    # pseudocode: grpc.call stands in for a generated client stub
    resp = grpc.call(
        RATE_LIMIT_SERVICE,  # e.g. Lyft's ratelimit service, backed by Redis
        descriptors=[
            ('client_id', req.client_id),
        ],
    )
    if resp.code == OVER_LIMIT:
        return DENY_429
    return ALLOW

# fleet cap == configured cap, regardless of sidecar count
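
The shared-bucket idea can be tried without a network. The sketch below swaps the gRPC hop for an in-process object, so it only shows the accounting, not the latency or failure behaviour of a real rate-limit service. `InMemoryRateLimitService`, `Sidecar`, and `check` are hypothetical names, and the clock is frozen so no refill happens during the run.

```python
OK, OVER_LIMIT = "OK", "OVER_LIMIT"


class InMemoryRateLimitService:
    """Stand-in for the external service: one token bucket per descriptor key."""

    def __init__(self, capacity, rate, clock):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.buckets = {}   # descriptor tuple -> (tokens, last_refill)

    def check(self, descriptors):
        key = tuple(descriptors)
        now = self.clock()
        # a key seen for the first time starts with a full bucket
        tokens, last = self.buckets.get(key, (float(self.capacity), now))
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[key] = (tokens - 1, now)
            return OK
        self.buckets[key] = (tokens, now)
        return OVER_LIMIT


class Sidecar:
    def __init__(self, service):
        self.service = service   # in a real mesh this would be an RPC client

    def on_request(self, client_id):
        code = self.service.check([("client_id", client_id)])
        return 429 if code == OVER_LIMIT else 200


service = InMemoryRateLimitService(capacity=5, rate=1.0, clock=lambda: 0.0)
sidecars = [Sidecar(service) for _ in range(3)]

# 12 requests from client "a", spread round-robin across the 3 sidecars
results = [sidecars[i % 3].on_request("a") for i in range(12)]
allowed_a = results.count(200)

# client "b" has a separate descriptor, so it gets its own bucket
allowed_b = [sidecars[0].on_request("b") for _ in range(2)].count(200)

print(allowed_a, allowed_b)  # 5 2
```

Client "a" gets five requests through in total no matter which sidecar handled them, which is the property the local scope cannot give.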

Choosing scope

the one-line rule of thumb

# Local:  cheap, no extra hop;
#         drifts under autoscaling (cap * sidecars).
#         Good for coarse 'be polite' spike absorbers.

# Global: exact aggregate; adds 1 RPC per request and a
#         dependency whose outage matters at every hop.
#         Required for per-tenant SLA / quota enforcement.
