Build a Service Mesh (Envoy / Istio style) (13 scenes)
Scene 7.5 · Outlier detection — eject one bad replica
Don't trip the whole cluster — pull just the misbehaving replica from the pool. Passive (real 5xx) catches what active /healthz probes miss.
Previously
A circuit breaker tripping the whole cluster is too coarse when only replica R3 is broken. The proxy needs a finer move: pull just R3 out of the pool.
Scene 08
Outlier detection — eject one bad replica
Diagram
A load balancer fans round-robin to a cluster of 4 replicas; only in-pool replicas receive traffic. The Pool Manager on the left runs **outlier detection** — it ejects a replica from the load-balancing pool when its real-traffic results cross a threshold (consecutive 5xx defaults to 5). Two signals feed it: red dots ABOVE each replica accumulate as that replica returns 5xx on real traffic (passive **health check**); a periodic probe arrow BELOW pings each replica's `/healthz` endpoint and lights green/red (active **health check**). When R3 trips its threshold it greys out, the LB stops routing to it, and an ejection timer ticks down before re-admission.
passive outlier detection — real traffic 5xx accumulation
active health check — scheduled /healthz probe
ejected for 10s, then re-admitted
A cluster of 4 replicas. R3 is the bad apple: it returns 200 on /healthz but 500 on real /checkout traffic. Watch the red dots above R3 climb to 5, the consecutive-5xx threshold; R3 then greys out and the LB stops routing to it. The active probe arrow keeps pinging /healthz on every replica, and R3's probe never turns red.
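The pool behaviour in the diagram (only in-pool replicas receive traffic, and an ejected replica returns once its timer expires) can be sketched roughly as follows. The `Pool` class, its method names, and the explicit `now` argument are assumptions made for illustration; they are not taken from Envoy or Istio.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    """Hypothetical pool: round-robin over replicas that are not ejected."""
    replicas: list
    ejected_until: dict = field(default_factory=dict)  # name -> re-admission time
    cursor: int = 0

    def eject(self, name, now, duration):
        # Record when the ejection timer for this replica runs out.
        self.ejected_until[name] = now + duration

    def in_pool(self, now):
        # Re-admit any replica whose ejection timer has expired.
        for name, until in list(self.ejected_until.items()):
            if now >= until:
                del self.ejected_until[name]
        return [r for r in self.replicas if r not in self.ejected_until]

    def pick(self, now):
        candidates = self.in_pool(now)
        if not candidates:
            raise RuntimeError("no replicas left in the pool")
        # Simple cursor-based round-robin over whatever is currently in the pool.
        choice = candidates[self.cursor % len(candidates)]
        self.cursor += 1
        return choice


pool = Pool(["R1", "R2", "R3", "R4"])
pool.eject("R3", now=0, duration=10)          # 10s, as in the diagram legend
during = [pool.pick(now=1) for _ in range(6)]  # R3 is out of the pool
after = [pool.pick(now=10) for _ in range(4)]  # timer expired, R3 is back
```

Passing `now` explicitly keeps the sketch deterministic; a real proxy would read a monotonic clock instead.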
Implementation
PoolManager.on_real_response
passive outlier detection — trips on consecutive 5xx of real traffic
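    def on_real_response(replica, response):
        # Count consecutive 5xx responses; any non-5xx response resets the streak.
        if response.status >= 500:
            replica.consecutive_5xx += 1
        else:
            replica.consecutive_5xx = 0
        if replica.consecutive_5xx >= consecutive_5xx_threshold:
            # Increment first, so the first ejection lasts one full base_ejection_time
            # and each repeat ejection lasts longer.
            replica.ejection_count += 1
            eject(replica, duration=base_ejection_time * replica.ejection_count)
            # Start from a clean streak after re-admission.
            replica.consecutive_5xx = 0
        # max_ejection_percent guards against ejecting everyone.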
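The panel above mentions `max_ejection_percent` only in a comment. The following is a minimal self-contained sketch of how such a guard could sit in front of the ejection, together with the growing ejection duration. The `Replica` dataclass, the `may_eject` helper, and the 50% cap are assumptions for illustration, not the behaviour of any particular proxy.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    consecutive_5xx: int = 0
    ejection_count: int = 0
    ejected: bool = False


CONSECUTIVE_5XX_THRESHOLD = 5   # the threshold used in this scene
BASE_EJECTION_TIME = 10         # seconds, matching the 10s in the diagram
MAX_EJECTION_PERCENT = 50       # assumed value for this sketch


def may_eject(replicas):
    """Allow an ejection only if it keeps the ejected share within the cap."""
    already = sum(r.ejected for r in replicas)
    return (already + 1) * 100 <= MAX_EJECTION_PERCENT * len(replicas)


def on_real_response(replicas, replica, status):
    """Return the ejection duration in seconds, or None if nothing was ejected."""
    if status >= 500:
        replica.consecutive_5xx += 1
    else:
        replica.consecutive_5xx = 0
    if replica.consecutive_5xx < CONSECUTIVE_5XX_THRESHOLD:
        return None
    if not may_eject(replicas):
        return None  # the cap would be exceeded; keep the replica in the pool
    replica.consecutive_5xx = 0
    replica.ejection_count += 1
    replica.ejected = True
    return BASE_EJECTION_TIME * replica.ejection_count


cluster = [Replica(f"R{i}") for i in range(1, 5)]
r3 = cluster[2]
results = [on_real_response(cluster, r3, 500) for _ in range(5)]
```

With these assumed values, the fifth consecutive 500 ejects R3 for one base interval; a second ejection after re-admission would last twice as long.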
PoolManager.probe_loop
active health check — periodic /healthz probe on every replica
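    import time

    def probe_loop():
        while True:
            for replica in cluster.replicas:
                try:
                    resp = http.get(replica.address + '/healthz', timeout=2)  # seconds
                    healthy = resp.status == 200
                except Exception:
                    healthy = False  # a timeout or connection error is a failed probe
                if not healthy:
                    eject(replica)
                else:
                    # Assumed here: this only reverses probe-driven ejections, so a
                    # passing probe does not cut short a passive ejection timer.
                    readmit_if_ejected(replica)
            time.sleep(interval)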
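A single pass of the probe loop can be sketched without a network by injecting the probe as a function. The names `probe_once` and `fake_probe` and the set of probe-driven ejections are hypothetical; the sketch assumes a probe either returns an HTTP status or raises an `OSError` subclass such as `TimeoutError`.

```python
def probe_once(replicas, probe, ejected):
    """Run one probe pass and update the set of probe-ejected replica names.

    `probe(name)` is assumed to return an HTTP status code, or to raise an
    OSError (for example TimeoutError or ConnectionError) on failure.
    """
    for name in replicas:
        try:
            healthy = probe(name) == 200
        except OSError:
            healthy = False  # timeouts and connection errors count as failures
        if healthy:
            ejected.discard(name)  # re-admit if the probe had ejected it
        else:
            ejected.add(name)
    return ejected


def fake_probe(name):
    # Stand-in for an HTTP GET on /healthz: R2 times out, R4 answers 503.
    if name == "R2":
        raise TimeoutError("probe timed out")
    return 503 if name == "R4" else 200


names = ["R1", "R2", "R3", "R4"]
probe_ejected = probe_once(names, fake_probe, set())
```

Note that R3 stays in the pool here: its `/healthz` answers 200, which is exactly the case the active probe cannot see.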
Why both signals coexist
the failures each one misses on its own
    # Active alone misses:
    #   /healthz returns 200, but /checkout returns 500.
    #   (probe path is fine; business path is broken.)
    # Passive alone misses:
    #   a replica with no real traffic yet has zero
    #   5xx samples, but its probe will show it failing.
    # Use both. They catch disjoint failures and share
    # one eject() path into the load-balancing pool.
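The two blind spots can be made concrete with a small sketch that evaluates both signals over the same replicas. The observations below are invented for illustration (in particular, R4 as a probe-failing replica with no traffic is hypothetical), and `passive_verdict` and `active_verdict` are assumed helper names.

```python
def passive_verdict(statuses, threshold=5):
    """True if the real-traffic statuses contain `threshold` consecutive 5xx."""
    run = 0
    for status in statuses:
        run = run + 1 if status >= 500 else 0
        if run >= threshold:
            return True
    return False


def active_verdict(probe_status):
    """True if the /healthz probe failed (anything other than 200)."""
    return probe_status != 200


# Hypothetical observations: (real-traffic statuses, last /healthz status)
observations = {
    "R1": ([200] * 8, 200),   # healthy on both signals
    "R3": ([500] * 5, 200),   # probe passes, business path broken
    "R4": ([], 503),          # no real traffic yet, probe failing
}

verdicts = {
    name: {"passive": passive_verdict(traffic), "active": active_verdict(probe)}
    for name, (traffic, probe) in observations.items()
}

# Either signal is enough to send a replica down the shared eject() path.
to_eject = sorted(n for n, v in verdicts.items() if v["passive"] or v["active"])
```

In this made-up data each bad replica is caught by exactly one signal, which is the argument for running both.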