Outlier detection — eject one bad replica — Build a Service Mesh (Envoy / Istio style)


Scene 7.5 · Outlier detection — eject one bad replica

Don't trip the whole cluster; pull just the misbehaving replica from the pool. Passive detection (5xx on real traffic) catches what active /healthz probes miss.

Previously

A circuit breaker tripping the whole cluster is too coarse when only replica R3 is broken. The proxy needs a finer move: pull just R3 out of the pool.


Watch

Diagram

A load balancer fans round-robin to a cluster of 4 replicas; only in-pool replicas receive traffic. The Pool Manager on the left runs **outlier detection** — it ejects a replica from the load-balancing pool when its real-traffic results cross a threshold (consecutive 5xx defaults to 5). Two signals feed it: red dots ABOVE each replica accumulate as that replica returns 5xx on real traffic (passive **health check**); a periodic probe arrow BELOW pings each replica's `/healthz` endpoint and lights green/red (active **health check**). When R3 trips its threshold it greys out, the LB stops routing to it, and an ejection timer ticks down before re-admission.

passive outlier detection — real traffic 5xx accumulation

active health check — scheduled /healthz probe

ejected for 10s, then re-admitted

A cluster of 4 replicas. R3 is the bad apple: it returns 200 on /healthz but 500 on real /checkout traffic. Watch the red dots above R3 climb to 5, the consecutive-5xx threshold; then R3 greys out and the LB stops routing to it. The active probe arrow keeps pinging /healthz on every replica, and R3's probe never goes red.
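The scene can be replayed as a small simulation. This is a minimal sketch under stated assumptions, not the mesh's actual implementation: the `Replica` class, its `broken` flag, and `simulate` are hypothetical names, and the ejection timer is left out, so an ejected replica stays out for the whole run.

```python
from dataclasses import dataclass
from itertools import cycle

CONSECUTIVE_5XX_THRESHOLD = 5  # the threshold shown in the diagram


@dataclass
class Replica:
    name: str
    broken: bool = False      # hypothetical flag: real traffic returns 500
    consecutive_5xx: int = 0  # passive counter, reset by any non-5xx response
    ejected: bool = False
    served: int = 0           # real requests this replica received

    def handle(self) -> int:
        self.served += 1
        return 500 if self.broken else 200


def simulate(requests: int) -> list[Replica]:
    """Round-robin `requests` real requests over 4 replicas; R3 is broken."""
    pool = [Replica("R1"), Replica("R2"), Replica("R3", broken=True), Replica("R4")]
    round_robin = cycle(pool)
    sent = 0
    while sent < requests:
        replica = next(round_robin)
        if replica.ejected:
            continue  # the LB skips replicas that are out of the pool
        status = replica.handle()
        sent += 1
        if status >= 500:
            replica.consecutive_5xx += 1
        else:
            replica.consecutive_5xx = 0
        if replica.consecutive_5xx >= CONSECUTIVE_5XX_THRESHOLD:
            replica.ejected = True
    return pool


for replica in simulate(40):
    print(replica.name, replica.served, "ejected" if replica.ejected else "in pool")
```

With 40 requests, R3 is ejected on its fifth request and the other three replicas absorb the remaining traffic (12, 12, and 11 requests).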

Implementation

PoolManager.on_real_response

passive outlier detection — trips on consecutive 5xx of real traffic

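def on_real_response(replica, response):
    # Count consecutive 5xx on real traffic; any other status resets the streak.
    if response.status >= 500:
        replica.consecutive_5xx += 1
    else:
        replica.consecutive_5xx = 0
    if replica.consecutive_5xx >= consecutive_5xx_threshold:
        # Increment first so the first ejection lasts base_ejection_time, not 0.
        replica.ejection_count += 1
        replica.consecutive_5xx = 0  # a re-admitted replica starts a fresh streak
        eject(replica,
              duration=base_ejection_time * replica.ejection_count)
    # max_ejection_percent guards against ejecting everyone.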
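Two pieces of arithmetic sit behind that handler: how long an ejection lasts, and whether the `max_ejection_percent` guard permits it at all. The sketch below is illustrative only. The helper names `ejection_duration` and `may_eject` are hypothetical, the ejection count is assumed to be 1-based (incremented before the duration is computed), and real proxies may round the percentage differently or always allow at least one ejection.

```python
def ejection_duration(base_ejection_time: float, ejection_count: int) -> float:
    """Seconds a replica stays out on its ejection_count-th ejection (1-based)."""
    return base_ejection_time * ejection_count


def may_eject(pool_size: int, already_ejected: int, max_ejection_percent: int) -> bool:
    """Allow one more ejection only if the ejected share stays within the cap."""
    return (already_ejected + 1) * 100 <= max_ejection_percent * pool_size


# A repeat offender stays out longer each time (base of 10 seconds).
print([ejection_duration(10, n) for n in (1, 2, 3)])

# With 4 replicas and a 50% cap, the third ejection is refused.
print([may_eject(4, ejected, 50) for ejected in (0, 1, 2)])
```

The linear back-off keeps a flapping replica out of the pool for longer on each repeat, while the percentage cap keeps a cluster-wide failure from emptying the pool entirely.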

PoolManager.probe_loop

active health check — periodic /healthz probe on every replica

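import time

def probe_loop(cluster, interval):
    # Active health check: probe every replica's /healthz once per interval.
    # `http` stands in for whatever HTTP client the proxy uses.
    while True:
        for replica in cluster.replicas:
            try:
                resp = http.get(replica.address + '/healthz', timeout=2)  # seconds
                healthy = resp.status == 200
            except Exception:
                healthy = False  # a timeout or refused connection is a failed probe
            if healthy:
                # Assumed to undo probe ejections only; passive ejections wait out their timer.
                readmit_if_ejected(replica)
            else:
                eject(replica)
        time.sleep(interval)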
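A single pass of the active check can be exercised without a network by injecting the HTTP call. This is a sketch under assumptions: `probe_pass`, the `fetch_status` callable, and `fake_fetch` are hypothetical names, and a probe that times out or cannot connect is treated as a failure.

```python
from typing import Callable, Iterable


def probe_pass(
    addresses: Iterable[str],
    fetch_status: Callable[[str], int],
    probe_ejected: set[str],
) -> set[str]:
    """Run one active health-check pass and return the updated ejected set."""
    result = set(probe_ejected)
    for address in addresses:
        try:
            healthy = fetch_status(address + "/healthz") == 200
        except (TimeoutError, ConnectionError):
            healthy = False  # no answer is treated as a failed probe
        if healthy:
            result.discard(address)  # re-admit a replica that recovered
        else:
            result.add(address)
    return result


def fake_fetch(url: str) -> int:
    # Stand-in for a real HTTP client: r2 times out, r4 answers 503.
    if url.startswith("r2"):
        raise TimeoutError
    return 503 if url.startswith("r4") else 200


ejected = probe_pass(["r1", "r2", "r3", "r4"], fake_fetch, {"r1"})
print(sorted(ejected))  # ['r2', 'r4']
```

Here r1 had been ejected by an earlier pass and is re-admitted once its probe returns 200, while r2 and r4 are pulled from the pool.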

Why both signals coexist

the failures each one misses on its own

# Active alone misses:
#   /healthz returns 200, but /checkout returns 500.
#   (probe path is fine; business path is broken.)
# Passive alone misses:
#   a replica with no real traffic yet has zero
#   5xx samples, but its probe will show it failing.
# Use both. They catch different failures and share
# one eject() path into the load-balancing pool.
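The two blind spots can be made concrete by asking which signal flags each replica. The sketch below is an illustration rather than the mesh's own logic: `ReplicaHealth`, `caught_by`, and `in_pool` are hypothetical names, and a replica is assumed to stay in the pool only while neither signal objects.

```python
from dataclasses import dataclass

THRESHOLD = 5  # consecutive 5xx before passive detection ejects


@dataclass
class ReplicaHealth:
    name: str
    probe_ok: bool        # latest active /healthz result
    consecutive_5xx: int  # passive counter built from real traffic


def caught_by(replica: ReplicaHealth) -> list[str]:
    """Name the signals that would eject this replica."""
    signals = []
    if not replica.probe_ok:
        signals.append("active")
    if replica.consecutive_5xx >= THRESHOLD:
        signals.append("passive")
    return signals


def in_pool(replica: ReplicaHealth) -> bool:
    """A replica receives traffic only if neither signal flags it."""
    return not caught_by(replica)


r3 = ReplicaHealth("R3", probe_ok=True, consecutive_5xx=5)     # /healthz fine, /checkout 500s
idle = ReplicaHealth("idle", probe_ok=False, consecutive_5xx=0)  # no traffic yet, probe failing
ok = ReplicaHealth("R1", probe_ok=True, consecutive_5xx=0)

for replica in (r3, idle, ok):
    print(replica.name, caught_by(replica), in_pool(replica))
```

R3 is caught only by the passive counter, the idle replica only by the probe, and the healthy replica by neither.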
