Build a Service Mesh (Envoy / Istio style) (13 scenes)
Scene 07 · Circuit breaker — the state machine
Closed → open → half-open. Fast-fail to a known-broken dependency and periodically probe for recovery, so callers stop wasting resources on guaranteed failures.
Previously
A retry budget caps how much extra load the fleet generates, but each individual caller is still spending its own time and connections waiting for a backend it could already tell is broken. What's missing is a switch that says 'stop calling this dependency for a while' — and that switch is the circuit breaker.
Diagram
A **circuit breaker** is a three-state machine that fast-fails requests to a known-broken dependency: **closed** (let requests through), **open** (reject them in milliseconds), **half-open** (let exactly one probe through to test recovery). The left panel shows the three labeled circles and the transitions between them. The right panel shows what each state means for live traffic: requests either reach the backend, bounce off the breaker, or, in half-open, a single probe leaks through. Note: Envoy approximates this pattern per replica via outlier detection (next scene); Envoy's own `circuit_breakers` config is a set of connection-pool ceilings, a separate concept.
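The three circles and their arrows form a small transition table. The following is a minimal sketch of that table. The `State` and `Event` names are illustrative, not from any library. Any (state, event) pair that is not listed leaves the state unchanged.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # requests pass through to the backend
    OPEN = "open"            # requests are rejected without a backend call
    HALF_OPEN = "half-open"  # exactly one probe is allowed through


class Event(Enum):
    ERROR_RATE_EXCEEDED = "error_rate_exceeded"
    COOLDOWN_ELAPSED = "cooldown_elapsed"
    PROBE_SUCCEEDED = "probe_succeeded"
    PROBE_FAILED = "probe_failed"


# Every legal transition in the diagram; anything else is a no-op.
TRANSITIONS = {
    (State.CLOSED, Event.ERROR_RATE_EXCEEDED): State.OPEN,
    (State.OPEN, Event.COOLDOWN_ELAPSED): State.HALF_OPEN,
    (State.HALF_OPEN, Event.PROBE_SUCCEEDED): State.CLOSED,
    (State.HALF_OPEN, Event.PROBE_FAILED): State.OPEN,
}


def next_state(state: State, event: Event) -> State:
    return TRANSITIONS.get((state, event), state)


s = State.CLOSED
for e in [Event.ERROR_RATE_EXCEEDED, Event.COOLDOWN_ELAPSED, Event.PROBE_FAILED,
          Event.COOLDOWN_ELAPSED, Event.PROBE_SUCCEEDED]:
    s = next_state(s, e)
    print(s.value)  # open, half-open, open, half-open, closed
```

A failed probe sends the breaker back to OPEN for another cooldown. Only a successful probe closes it.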
Watch the breaker trip. Errors climb past the threshold; CLOSED hands off to OPEN; new requests fast-fail in milliseconds. After a cooldown, HALF-OPEN admits exactly one probe — and the result decides whether the breaker closes or stays open.
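"Errors climb past the threshold" presumes a windowed error rate. The following is a minimal sketch of a time-based sliding window. `SlidingErrorWindow` is a hypothetical helper, and callers pass `now` explicitly so the example is deterministic. Old samples are evicted, so the rate reflects only the last `window_s` seconds.

```python
from collections import deque


class SlidingErrorWindow:
    """Error rate over the last `window_s` seconds (hypothetical helper)."""

    def __init__(self, window_s: float):
        self.window_s = window_s
        self.samples = deque()  # (timestamp, ok) pairs, oldest first

    def record(self, now: float, ok: bool) -> None:
        self.samples.append((now, ok))
        self._evict(now)

    def error_rate(self, now: float) -> float:
        self._evict(now)
        if not self.samples:
            return 0.0
        errors = sum(1 for _, ok in self.samples if not ok)
        return errors / len(self.samples)

    def _evict(self, now: float) -> None:
        # Drop samples older than the window so stale outcomes stop counting.
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()


w = SlidingErrorWindow(window_s=60)
for t in range(10):
    w.record(now=t, ok=(t < 6))  # t=0..5 succeed, t=6..9 fail
print(w.error_rate(now=10))  # 0.4
print(w.error_rate(now=65))  # 0.8  (successes at t=0..4 have aged out)
print(w.error_rate(now=75))  # 0.0  (window is empty)
```

The rate can rise even with no new failures, because older successes slide out of the window faster than the recent errors do.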
Implementation
Breaker.on_request
the three-state machine: closed, open, half-open
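```python
state = CLOSED
error_rate_window = sliding(seconds=60)
open_started_at = 0
probe_in_flight = False

def on_request(req):
    global state, open_started_at, probe_in_flight
    if state == CLOSED:
        outcome = forward(req, timeout=T)
        error_rate_window.record(outcome)
        if error_rate_window.error_rate() > THRESHOLD:
            state = OPEN
            open_started_at = now()
        return outcome
    if state == OPEN:
        if now() - open_started_at > COOLDOWN:
            state = HALF_OPEN               # cooldown over: admit one probe
        else:
            return fast_fail_503()          # ~1ms, no backend call
    if state == HALF_OPEN:
        if probe_in_flight:
            return fast_fail_503()          # a probe is already out: reject the rest
        probe_in_flight = True
        outcome = forward(req, timeout=T)   # single PROBE
        probe_in_flight = False
        if outcome.ok:
            state = CLOSED
            error_rate_window.reset()       # drop the errors that tripped the breaker
        else:
            state = OPEN
            open_started_at = now()         # failed probe restarts the cooldown
        return outcome
```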
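The following is a runnable version of the state machine above. It is a minimal sketch that assumes a single-threaded caller, so concurrent callers would need a lock around the state. The class names (`CircuitBreaker`, `CircuitOpenError`), the injectable `clock`, and the convention that `backend(req)` returns `True`/`False` are illustrative choices, not any library's API. Raising `CircuitOpenError` stands in for `fast_fail_503()`.

```python
import time
from collections import deque
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class CircuitOpenError(Exception):
    """Raised instead of calling the backend: the fast-fail path."""


class CircuitBreaker:
    def __init__(self, threshold=0.5, window_s=60.0, cooldown_s=30.0, clock=time.monotonic):
        self.threshold = threshold    # trip when the error rate exceeds this
        self.window_s = window_s      # sliding window length, seconds
        self.cooldown_s = cooldown_s  # how long OPEN lasts before a probe
        self.clock = clock            # injectable so tests can fake time
        self.state = State.CLOSED
        self.samples = deque()        # (timestamp, ok) pairs, oldest first
        self.open_started_at = 0.0

    def _error_rate(self, now):
        # Evict samples that have slid out of the window.
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()
        if not self.samples:
            return 0.0
        return sum(1 for _, ok in self.samples if not ok) / len(self.samples)

    def call(self, backend, req):
        """backend(req) returns True on success, False on failure."""
        now = self.clock()
        if self.state is State.OPEN:
            if now - self.open_started_at <= self.cooldown_s:
                raise CircuitOpenError("breaker open: backend not called")
            self.state = State.HALF_OPEN      # cooldown over: this request is the probe
        if self.state is State.HALF_OPEN:
            ok = backend(req)
            if ok:
                self.state = State.CLOSED
                self.samples.clear()          # forget the errors that tripped us
            else:
                self.state = State.OPEN
                self.open_started_at = now    # failed probe restarts the cooldown
            return ok
        ok = backend(req)                     # CLOSED: forward and record
        self.samples.append((now, ok))
        if self._error_rate(now) > self.threshold:
            self.state = State.OPEN
            self.open_started_at = now
        return ok


t = [0.0]
breaker = CircuitBreaker(threshold=0.5, window_s=60, cooldown_s=30, clock=lambda: t[0])
breaker.call(lambda req: False, "a")      # 1 of 1 failed: rate 1.0 > 0.5
print(breaker.state)                      # State.OPEN
try:
    breaker.call(lambda req: True, "b")   # still cooling down
except CircuitOpenError:
    print("fast-fail")
t[0] = 31.0
breaker.call(lambda req: True, "c")       # probe succeeds
print(breaker.state)                      # State.CLOSED
```

Injecting the clock keeps the cooldown testable without sleeping. Like the pseudocode, this sketch trips on error rate alone, so a single failure in an otherwise empty window opens the breaker.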
Why OPEN protects the caller
the asymmetric payoff: 1ms reject vs full-timeout wait
```
# Without breaker: every caller waits the full timeout.
# N callers * T seconds = N*T thread-seconds blocked.
# Caller's thread pool fills with stuck requests.

# With breaker OPEN: every caller returns in ~1ms.
# Caller frees the resource and degrades gracefully.
# Traffic to OTHER (healthy) dependencies keeps flowing.
```
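Little's law (average in-flight requests = arrival rate × time each request spends in the system) turns the "thread pool fills" claim into a number. The following is a minimal sketch with hypothetical figures: 100 requests/s to the dead dependency and a 2 s timeout `T`.

```python
def occupied_threads(arrivals_per_s: float, latency_ms: float) -> float:
    """Little's law: average in-flight requests = arrival rate x time in system."""
    return arrivals_per_s * latency_ms / 1000.0


RATE = 100          # hypothetical: requests/s to the broken dependency
TIMEOUT_MS = 2000   # hypothetical per-request timeout T
FAST_FAIL_MS = 1    # ~1ms reject while the breaker is OPEN

print(occupied_threads(RATE, TIMEOUT_MS))    # 200.0 threads stuck waiting
print(occupied_threads(RATE, FAST_FAIL_MS))  # 0.1 threads on average
```

At those rates, a dependency that never answers pins about 200 of the caller's threads for as long as it stays down. The same traffic against an OPEN breaker occupies a tenth of one thread on average.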
Envoy.cluster.circuit_breakers (footnote)
connection-pool ceilings — NOT the state machine above
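```yaml
# Envoy's `circuit_breakers` config is connection-pool
# ceilings per cluster, not closed/open/half-open.
# Per-replica state-machine semantics live in
# outlier_detection (next scene).
clusters:
- name: backend              # illustrative cluster name
  circuit_breakers:
    thresholds:              # one entry per routing priority
    - priority: DEFAULT
      max_connections: 1024  # values shown equal Envoy's defaults
      max_pending_requests: 1024
      max_requests: 1024
      max_retries: 3
```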
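To make the contrast concrete, here is a minimal Python model of a `max_requests`-style ceiling. It is a hypothetical simplification, not Envoy's code or API: Envoy tracks several such limits per cluster and per priority. The key difference from the state machine is that a ceiling is just a counter. It has no states, no cooldown, and no probe, and capacity frees up the moment in-flight requests finish.

```python
class RequestCeiling:
    """Hypothetical model of a max_requests-style ceiling."""

    def __init__(self, max_requests: int):
        self.max_requests = max_requests
        self.in_flight = 0

    def try_acquire(self) -> bool:
        if self.in_flight >= self.max_requests:
            return False  # overflow: reject immediately, backend untouched
        self.in_flight += 1
        return True

    def release(self) -> None:
        self.in_flight -= 1  # a finished request frees a slot right away


ceiling = RequestCeiling(max_requests=2)
print([ceiling.try_acquire() for _ in range(3)])  # [True, True, False]
ceiling.release()
print(ceiling.try_acquire())                      # True
```

A ceiling protects against overload (too many concurrent requests), while the state machine protects against a dependency that is failing. That is why the two mechanisms coexist rather than replace each other.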