Build a distributed logging stack (ELK / Loki)
12 scenes · ~84 min · build the primitive
Build your own distributed logging stack (ELK / Loki)
Ship lines off N hosts, choose what to index, age data through tiers, retain or delete on schedule, and survive a chatty service — built one decision at a time.
- 01 · grep + ssh stops working (~7 min). Once your fleet has more than a handful of hosts, the only way to answer 'which host saw this error' is to ship lines centrally; local files plus ssh is an O(N) dead end.
- 02 · An agent on every host (~7 min). A shipping agent tails each log file from a saved offset, batches lines, and POSTs them at-least-once; duplicates on retry are normal, not a bug.
- 03 · Block, drop, or spill (~7 min). When the backend stalls, the agent must block, drop, or spill to disk, and `when_full=block` plus a synchronous logger is how a logging outage takes the application down with it.
- 04 · String vs map: fields and labels (~7 min). A log line is either a string parsed at read-time or a typed map parsed at write-time, and the two systems we'll meet attach different names to the same idea: fields in ELK, labels in Loki.
- 05 · Inverted index vs labels-only index (~7 min). ELK tokenises every value into a per-term posting list; Loki hashes the label-set into a stream id and appends the line verbatim. Heavy index + small body vs tiny index + verbatim body.
- 05a · Same query, two execution plans (~7 min). ELK answers via posting-list intersection in milliseconds; Loki resolves labels to chunks, fetches them from S3, and greps in-process in seconds to minutes. Opposite ends of the same trade-off curve.
- 06 · Cardinality is the killer (~7 min). Every unique label-set is a Loki stream; every dynamic key is an ELK mapping field. Putting request_id in either kills the index in minutes: low-card → labels, high-card → body.
- 07 · Hot, warm, cold, frozen (~7 min). Two orders of magnitude in cost between NVMe and Deep Archive force tiering. ELK has four ILM phases; Loki collapses to S3 from day one.
- 08 · Retention vs deletion: the index has to forget (~7 min). Retention is when the system stops promising you can read; deletion is when bytes are physically gone, and the gap is where compliance bugs live.
- 09 · Sampling: head, tail, and the only error (~7 min). Head sampling decides at emit (cheap, blind); tail sampling decides at the collector (can keep all errors, costs buffer). A uniform 1% sample drops the only error you needed.
- 10 · Distributors, ingesters, queriers (briefly) (~7 min). Both ELK and Loki are sharded write-paths plus sharded read-paths plus a backing store; they differ in the partition key: hash(doc_id) vs hash(label-set).
- 11 · Design your logging stack (~7 min). Capstone: agent + buffer policy + structuring + index strategy + tiers + retention + sampling; the verifier traces every choice back to the scene that earned it.
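
As a taste of scenes 05 and 06, here is a minimal sketch of why cardinality matters when the index key is a hash of the label-set. The `stream_id` and `count_streams` helpers, the SHA-256 fingerprint, and the synthetic workload are all assumptions made for illustration; this is not Loki's actual fingerprinting code, only the same idea in miniature.

```python
import hashlib


def stream_id(labels: dict) -> str:
    """Hash a label-set into a stable id (hypothetical scheme, not Loki's real one)."""
    # Sort the labels so the id does not depend on key order.
    canonical = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def count_streams(label_sets) -> int:
    """Count the distinct streams a batch of label-sets would create."""
    return len({stream_id(labels) for labels in label_sets})


# Synthetic workload (assumed): 1000 requests spread over 2 services and 5 hosts.
requests = [
    {"service": f"svc-{i % 2}", "host": f"host-{i % 5}", "request_id": f"req-{i}"}
    for i in range(1000)
]

# Low-cardinality labels only: request_id stays in the log body.
low_card = [{"service": r["service"], "host": r["host"]} for r in requests]

# High-cardinality mistake: request_id promoted to a label.
high_card = requests

print(count_streams(low_card))   # 10 streams
print(count_streams(high_card))  # 1000 streams, one per request
```

With service and host as the only labels, 1000 lines land in 10 streams; promoting request_id to a label gives every line its own stream, so the index grows with traffic rather than with fleet size.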