Build a Service Mesh (Envoy / Istio style) (13 scenes)
Scene 7.5 · Outlier detection — eject one bad replica
Don't trip the whole cluster — pull just the misbehaving replica from the pool. Passive (real 5xx) catches what active /healthz probes miss.
Previously

A circuit breaker tripping the whole cluster is too coarse when only replica R3 is broken. The proxy needs a finer move: pull just R3 out of the pool.

Scene 08
Outlier detection — eject one bad replica
Diagram
A load balancer fans round-robin to a cluster of 4 replicas; only in-pool replicas receive traffic. The Pool Manager on the left runs **outlier detection** — it ejects a replica from the load-balancing pool when its real-traffic results cross a threshold (consecutive 5xx defaults to 5). Two signals feed it: red dots ABOVE each replica accumulate as that replica returns 5xx on real traffic (passive **health check**); a periodic probe arrow BELOW pings each replica's `/healthz` endpoint and lights green/red (active **health check**). When R3 trips its threshold it greys out, the LB stops routing to it, and an ejection timer ticks down before re-admission.
SERVICE CLUSTER · 4 REPLICAS
Load Balancer: round-robin · in-pool only
Pool Manager: outlier detection · passive (real-traffic 5xx) · active (/healthz probes) · mode: both
R1: 0/5 5xx · traffic: ok · /healthz: pass
R2: 0/5 5xx · traffic: ok · /healthz: pass
R3: 5/5 5xx · traffic: 5xx · /healthz: pass · EJECTED (eject 7.0s)
R4: 0/5 5xx · traffic: ok · /healthz: pass
both: passive catches R3's real-traffic 5xx; active would catch a replica with no traffic.
passive outlier detection — real traffic 5xx accumulation
active health check — scheduled /healthz probe
ejected for 10s, then re-admitted
A cluster of 4 replicas. R3 is the bad apple: it returns 200 on /healthz but 500s on real /checkout traffic. Watch the red dots above R3 climb to 5 — that's the consecutive-5xx threshold — then R3 grays out and the LB stops routing to it. The active probe arrow keeps pinging /healthz on every replica; R3's probe never went red.
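The "in-pool only" round-robin from the diagram can be sketched roughly as follows. This is a minimal illustration, not the proxy's actual balancer: the `Replica` and `RoundRobinBalancer` names and the `ejected` flag are assumptions made for this example.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Replica:
    name: str
    ejected: bool = False  # hypothetical flag set by the pool manager


class RoundRobinBalancer:
    """Hypothetical round-robin balancer that skips ejected replicas."""

    def __init__(self, replicas):
        self.replicas = replicas
        self._ring = cycle(replicas)

    def pick(self):
        # Look at each replica at most once per call, skipping ejected ones.
        for _ in range(len(self.replicas)):
            candidate = next(self._ring)
            if not candidate.ejected:
                return candidate
        raise RuntimeError("no replicas left in the pool")


replicas = [Replica(f"R{i}") for i in range(1, 5)]
lb = RoundRobinBalancer(replicas)
replicas[2].ejected = True  # R3 has been ejected

picks = [lb.pick().name for _ in range(6)]
print(picks)  # ['R1', 'R2', 'R4', 'R1', 'R2', 'R4']
```

R3 stays in the replica list the whole time; it is only skipped at pick time, so re-admission is just a matter of clearing the flag.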
Implementation
PoolManager.on_real_response
passive outlier detection — trips on consecutive 5xx of real traffic
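def on_real_response(replica, response):
    # Passive signal: count consecutive 5xx responses on real traffic.
    if response.status >= 500:
        replica.consecutive_5xx += 1
    else:
        replica.consecutive_5xx = 0  # any non-5xx breaks the streak
    if replica.consecutive_5xx >= consecutive_5xx_threshold:
        # Count this ejection first, so the first ejection lasts
        # base_ejection_time rather than zero.
        replica.ejection_count += 1
        # `for` is a reserved word in Python, so the keyword is `duration`.
        eject(replica,
              duration=base_ejection_time * replica.ejection_count)
        replica.consecutive_5xx = 0  # start a fresh streak after re-admission
        # max_ejection_percent guards against ejecting everyone.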
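The two knobs in this handler, the growing ejection time and the `max_ejection_percent` guard, can be worked through with small helpers. This is a sketch under assumptions: the helper names are hypothetical, the 10-second base is taken from the "ejected for 10s" legend, and the cap uses simple floor arithmetic, which may differ from what a real proxy does at the edges.

```python
def ejection_duration(base_ejection_time, ejection_count):
    """Length of an ejection; ejection_count is 1 for the first one."""
    return base_ejection_time * ejection_count


def may_eject(total_replicas, currently_ejected, max_ejection_percent):
    """True if ejecting one more replica stays within the cap.

    Assumption: the cap is floor(total * percent / 100) replicas.
    """
    allowed = total_replicas * max_ejection_percent // 100
    return currently_ejected + 1 <= allowed


# A replica that keeps failing is ejected for longer each time.
durations = [ejection_duration(10, n) for n in (1, 2, 3)]
print(durations)  # [10, 20, 30]

# With 4 replicas and a 50% cap, at most 2 may be out at once.
print(may_eject(4, 0, 50))  # True
print(may_eject(4, 2, 50))  # False
```

The guard matters most during a correlated failure: if every replica starts returning 5xx, ejecting all of them would leave the load balancer with nothing to route to.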
PoolManager.probe_loop
active health check — periodic /healthz probe on every replica
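import time

def probe_loop():
    while True:
        for replica in cluster.replicas:
            try:
                resp = http.get(replica.address + '/healthz',
                                timeout=2)  # seconds
                healthy = resp.status == 200
            except OSError:  # assumes the client raises OSError subclasses
                healthy = False  # timeout or refused connection: failed probe
            if not healthy:
                eject(replica)
            else:
                # Assumption: this re-admits only probe-driven ejections;
                # a passive ejection waits out its own timer.
                readmit_if_ejected(replica)
        time.sleep(interval)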
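One subtlety: R3 passes `/healthz`, so a probe that blindly re-admits would undo its passive ejection. A possible way to avoid that, sketched here as an assumption rather than as the scene's actual design, is to keep each signal's verdict separately and route to a replica only when both agree. The `ReplicaState` fields and `in_pool` helper are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ReplicaState:
    name: str
    passive_ejected_until: float = 0.0  # time at which the passive ejection ends
    probe_healthy: bool = True          # result of the latest /healthz probe


def in_pool(replica, now):
    """A replica is routable only if neither signal currently has it out."""
    return replica.probe_healthy and now >= replica.passive_ejected_until


# R3: probe passes, but it was passively ejected until t=10.
r3 = ReplicaState("R3", passive_ejected_until=10.0)
print(in_pool(r3, now=3.0))   # False: the ejection timer is still running
print(in_pool(r3, now=10.0))  # True: timer expired and the probe still passes

# A replica whose probe fails is out regardless of any timer.
sick = ReplicaState("R2", probe_healthy=False)
print(in_pool(sick, now=100.0))  # False
```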
Why both signals coexist
the failures each one misses on its own
# Active alone misses:
#   /healthz returns 200, but /checkout returns 500.
#   (probe path is fine; business path is broken.)
# Passive alone misses:
#   a replica with no real traffic yet has zero
#   5xx samples, but its probe will show it failing.
# Use both. They catch disjoint failures and share
# one eject() path into the load-balancing pool.
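The two blind spots can be reproduced in a few lines. This is a toy simulation under assumptions, not the scene's implementation: the statuses are hard-coded instead of coming from real requests, `ejected_by` is a hypothetical field added to show which signal fired, and the idle replica "R5" is invented for the example (the scene's cluster has four replicas).

```python
from dataclasses import dataclass

THRESHOLD = 5  # consecutive-5xx threshold used in the scene


@dataclass
class Replica:
    name: str
    healthz_status: int = 200  # what the active probe would see
    traffic_status: int = 200  # what real requests would see
    consecutive_5xx: int = 0
    ejected_by: str = ""       # "", "passive", or "active"


def eject(replica, reason):
    # Shared eject path: remember only the first signal that fired.
    if not replica.ejected_by:
        replica.ejected_by = reason


def on_real_response(replica, status):
    # Passive signal: a streak of 5xx on real traffic.
    replica.consecutive_5xx = replica.consecutive_5xx + 1 if status >= 500 else 0
    if replica.consecutive_5xx >= THRESHOLD:
        eject(replica, "passive")


def probe(replica):
    # Active signal: a single failed /healthz probe.
    if replica.healthz_status != 200:
        eject(replica, "active")


r3 = Replica("R3", traffic_status=500)    # probe passes, real traffic fails
idle = Replica("R5", healthz_status=503)  # probe fails, no traffic yet

for replica in (r3, idle):
    probe(replica)
for _ in range(THRESHOLD):
    on_real_response(r3, r3.traffic_status)  # only R3 receives traffic

print(r3.ejected_by, idle.ejected_by)  # passive active
```

Turn off either handler and one of the two replicas stays in the pool, which is the argument for running both.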