Build a Service Mesh (Envoy / Istio style) (13 scenes)
Scene 01 · Fifty services, fifty broken retry policies
Every team picks its own retry, timeout, breaker, and mTLS library. One slow dependency turns into a fleet-wide outage.
Diagram
Twelve service boxes on a grid, with arrows showing who calls whom. Each box's badge is the retry/timeout library that team picked — and they're all different. One box in the middle (S7) is slow; the meter inside each box is how full that service's thread pool is.
FLEET CALL GRAPH: 12 services · 12 different retry/timeout libraries · S7 just got slow
Distinct retry policies: 12
S1 · checkout · okhttp retry=3 · healthy
S2 · cart · axios retry=5 exp · healthy
S3 · inventory · grpc-go default · healthy
S4 · pricing · fetch no retry · healthy
S5 · search · requests retry=∞ · healthy
S6 · ranking · net/http no retry · healthy
S7 · profile · rest-template default · SLOW
S8 · session · feign retry=5 · healthy
S9 · recs · http.client retry=3 · healthy
S10 · notify · ktor retry=2 · healthy
S11 · billing · node-fetch retry=4 · healthy
S12 · ledger · uplink no retry · healthy
Stage 0 · S7 is slow; everyone else looks fine
Look at the badges. Twelve services, twelve different retry libraries — okhttp, axios, requests with unbounded retries, fetch with none. S7 in the middle just got slow (its box is red, tagged SLOW). Its three callers — S2, S5, S9 — feed straight into it. Nothing has cascaded yet.
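To make the risk concrete, here is a minimal back-of-the-envelope sketch. The retry counts for S2 and S9 come from the badges above; everything else (the 100 req/s per caller, the 20 ms and 2 s latencies, the 200-thread pool, and the cap of 10 standing in for S5's unbounded retries) is an assumed number chosen for illustration, not a measurement. It uses Little's law: average in-flight requests = arrival rate × latency.

```python
def worst_case_calls(retries: int) -> int:
    """One original attempt plus every retry, if each attempt fails."""
    return 1 + retries


def threads_in_use(requests_per_sec: float, latency_sec: float) -> float:
    """Little's law: average in-flight requests = arrival rate * latency."""
    return requests_per_sec * latency_sec


# Retry counts from the diagram badges. S5 retries forever, so it is capped
# at an assumed 10 here purely so the sum stays finite.
caller_retries = {"S2": 5, "S5": 10, "S9": 3}

# Assumption: each caller normally sends 100 req/s to S7.
BASE_RPS = 100.0
normal_load = BASE_RPS * len(caller_retries)
worst_load = sum(BASE_RPS * worst_case_calls(r) for r in caller_retries.values())

# Assumption: a caller has a 200-thread pool and S7 latency goes from 20 ms to 2 s.
POOL_SIZE = 200
healthy = threads_in_use(BASE_RPS, 0.020)
slow = threads_in_use(BASE_RPS, 2.0)

print(f"load on S7: normal {normal_load:.0f} req/s, worst case {worst_load:.0f} req/s")
print(f"caller threads busy: healthy {healthy:.0f}, slow {slow:.0f} of {POOL_SIZE}")
```

Under these assumptions the retries alone multiply the load on S7 by seven, and a caller's pool is fully occupied by calls to S7 before any retry is counted. That is the mechanism by which one slow box can drag its callers down.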
Implementation
Team checkout (Java) — retry in the app
OkHttp, exponential backoff, three tries — somebody's flavor
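import java.io.IOException;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

Response callProfile(OkHttpClient okhttp, Request req) throws InterruptedException {
    for (int attempt = 1; attempt <= 3; attempt++) {
        try {
            return okhttp.newCall(req).execute();
        } catch (IOException e) {
            // exponential backoff: 200 ms, 400 ms, 800 ms
            Thread.sleep((1L << attempt) * 100);
        }
    }
    throw new UpstreamFailed(); // app-defined unchecked exception
}
// no shared budget · no breaker · no deadline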
Team search (Python) — retry in the app
requests, bare except, infinite retries — different shop
import requests

def call_profile(req):
    while True:
        try:
            # timeout=None means wait indefinitely for a response
            return requests.get(req.url, timeout=None)
        except Exception:
            continue  # try again forever
# no backoff · no cap · no jitter · no timeout
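For contrast, here is a minimal sketch of what the missing pieces look like when they are present: a cap on attempts, exponential backoff with full jitter, and an overall deadline. The function names, defaults, and the `sleep`/`clock` injection points are all assumptions made for this illustration, not code from any team above. A real `requests` call would also pass a finite `timeout`.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 2.0) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(fn, max_attempts: int = 3, deadline_sec: float = 5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fn() until it succeeds, attempts run out, or the deadline passes."""
    give_up_at = clock() + deadline_sec
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as err:  # a real client would catch narrower errors
            last_error = err
        delay = backoff_delay(attempt)
        # Stop if this was the last attempt or the sleep would overrun the deadline.
        if attempt == max_attempts - 1 or clock() + delay > give_up_at:
            break
        sleep(delay)
    raise RuntimeError(f"gave up after {attempt + 1} attempt(s)") from last_error


# Demo: a hypothetical upstream that fails twice, then answers.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("profile is slow")
    return "ok"

result = call_with_retry(flaky, sleep=lambda s: None)  # skip real sleeping in the demo
print(result, calls["n"])
```

Even this small version is around twenty lines that every team would have to write, test, and keep consistent in its own language, which is the problem this scene is setting up.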
Team recs (Node) — retry in the app
axios, fixed retry=5, no timeout — common footgun
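import axios from 'axios'
import axiosRetry from 'axios-retry'

// install the retry interceptor; per-request options go under 'axios-retry'
axiosRetry(axios)

async function callProfile(req) {
  return axios.request({
    url: req.url,
    // timeout: undefined  // forgot to set one (axios defaults to 0, i.e. no timeout)
    'axios-retry': { retries: 5 },
  })
}
// can fire up to 5 extra calls at a peer that's already slow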
After the mesh — what the app keeps
the app becomes trivial · policy lives somewhere else
async function callProfile(req) {
  // localhost · the same in every language
  return fetch('http://localhost/profile' + req.path)
}
// retry / timeout / breaker / mTLS / tracing:
// owned by the thing sitting next to the app
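One way to picture "policy lives somewhere else" is a single policy table that the component next to the app consults for every upstream, whatever language the app is written in. The sketch below is hypothetical: the `Policy` fields, the `POLICIES` table, and the `Breaker` class are invented for illustration and are not Envoy or Istio configuration. The breaker is deliberately simplified to a consecutive-failure counter, with no half-open state or reset timer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    max_retries: int = 2
    per_try_timeout_sec: float = 0.5
    breaker_threshold: int = 5  # consecutive failures before the breaker opens


# Hypothetical policy table: one place, shared by every caller.
POLICIES = {"profile": Policy(max_retries=1, per_try_timeout_sec=0.25)}
DEFAULT_POLICY = Policy()


def policy_for(upstream: str) -> Policy:
    """Look up the policy for an upstream, falling back to the fleet default."""
    return POLICIES.get(upstream, DEFAULT_POLICY)


class Breaker:
    """Opens after `threshold` consecutive failures; a success resets the count."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold

    def record(self, ok: bool) -> None:
        self.failures = 0 if ok else self.failures + 1


breaker = Breaker(policy_for("profile").breaker_threshold)
for _ in range(5):
    breaker.record(ok=False)
print(policy_for("profile"), breaker.open)
```

The design point is that changing a retry count or a timeout becomes an edit to one table rather than a code change in every service.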