Build your own distributed logging stack (ELK / Loki)

Ship lines off N hosts, choose what to index, age data through tiers, retain or delete on schedule, and survive a chatty service — built one decision at a time.

01
grep + ssh stops working
Once your fleet has more than a handful of hosts, the only way to answer 'which host saw this error' is to ship lines centrally — local files plus ssh is an O(N) dead end.
~7 min
02
An agent on every host
A shipping agent tails each log file from a saved offset, batches lines, and POSTs them at-least-once — duplicates on retry are normal, not a bug.
~7 min
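A minimal sketch of the offset-then-ack loop described above, assuming a JSON offset file and a hypothetical `send` callable that returns True once the backend acknowledges a batch. Neither name is the API of a real agent. The offset is committed only after the ack, so a lost ack causes a re-send rather than a lost line.

```python
import json
import os
import tempfile


def ship_once(log_path, offset_path, send, batch_size=100):
    """Read new complete lines from the saved offset, send them, then commit."""
    try:
        with open(offset_path) as f:
            offset = json.load(f)["offset"]
    except FileNotFoundError:
        offset = 0  # first run: start at the beginning of the file

    batch, pos = [], offset
    with open(log_path, "rb") as f:
        f.seek(offset)
        while len(batch) < batch_size:
            line = f.readline()
            if not line.endswith(b"\n"):
                break  # EOF or a half-written line: pick it up next time
            batch.append(line.decode("utf-8", errors="replace").rstrip("\n"))
            pos += len(line)

    if not batch or not send(batch):
        return 0  # nothing committed, so the same lines are re-read later

    # Commit after the ack. Write-then-rename keeps the offset file intact
    # if the agent crashes mid-write.
    tmp = offset_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"offset": pos}, f)
    os.replace(tmp, offset_path)
    return len(batch)


# Demo: the backend stores the first batch but its ack is lost.
received, attempts = [], {"n": 0}


def flaky_send(batch):
    attempts["n"] += 1
    received.extend(batch)
    return attempts["n"] > 1


with tempfile.TemporaryDirectory() as d:
    log, off = os.path.join(d, "app.log"), os.path.join(d, "app.offset")
    with open(log, "wb") as f:
        f.write(b"a\nb\n")
    first = ship_once(log, off, flaky_send)   # ack lost, offset not saved
    second = ship_once(log, off, flaky_send)  # retry succeeds
    third = ship_once(log, off, flaky_send)   # nothing new to ship

print(first, second, third, received)
```

The backend ends up with `a` and `b` twice. That duplication is the price of at-least-once delivery, and it is why downstream consumers need to tolerate or deduplicate repeats.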
03
Block, drop, or spill
When the backend stalls, the agent must block, drop, or spill to disk — and `when_full=block` plus a synchronous logger is how a logging outage takes the application down with it.
~7 min
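The three policies can be sketched with a bounded in-memory queue. This is an illustration under assumptions: the `Buffer` class is hypothetical, the `when_full` name is borrowed from the summary above, and the `spilled` list stands in for an on-disk file.

```python
import queue


class Buffer:
    def __init__(self, capacity, when_full):
        self.q = queue.Queue(maxsize=capacity)
        self.when_full = when_full  # "block", "drop", or "spill"
        self.dropped = 0
        self.spilled = []  # stand-in for a disk-backed overflow file

    def put(self, line, timeout=None):
        try:
            self.q.put_nowait(line)
            return "queued"
        except queue.Full:
            pass
        if self.when_full == "drop":
            self.dropped += 1
            return "dropped"
        if self.when_full == "spill":
            self.spilled.append(line)
            return "spilled"
        # block: the caller waits here until the backend drains the queue.
        # With timeout=None this waits forever, which is how a synchronous
        # logger freezes the application thread that called it.
        self.q.put(line, timeout=timeout)
        return "queued"


def fill(policy):
    """Push 4 lines into a 2-slot buffer whose backend never drains."""
    buf = Buffer(capacity=2, when_full=policy)
    out = []
    for i in range(4):
        try:
            out.append(buf.put(f"line {i}", timeout=0.05))
        except queue.Full:
            out.append("blocked")  # the timeout is only here to end the demo
    return out


results = {p: fill(p) for p in ("drop", "spill", "block")}
print(results)
```

In the `block` case the demo only escapes because of the short timeout. A real blocking put with no timeout would hold the calling thread for as long as the backend stays down.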
04
String vs map — fields and labels
A log line is either a string parsed at read-time or a typed map parsed at write-time, and the two systems we'll meet attach different names to the same idea — fields in ELK, labels in Loki.
~7 min
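To make the string-versus-map distinction concrete, here is a small sketch. The log line, its `key=value` layout, and the function name are invented for illustration. The point is that a line parsed at read time yields untyped strings, while a line structured at write time carries its types with it.

```python
import json

# The same event, emitted two ways (example data, not from a real system).
RAW = "2024-05-01T12:00:00Z ERROR payment failed user=42 latency_ms=830"
STRUCTURED = json.dumps({
    "ts": "2024-05-01T12:00:00Z",
    "level": "ERROR",
    "msg": "payment failed",
    "user": 42,
    "latency_ms": 830,
})


def parse_at_read_time(line):
    """Recover fields from a free-form string; every value stays a string."""
    ts, level, rest = line.split(" ", 2)
    tokens = rest.split()
    fields = dict(t.split("=", 1) for t in tokens if "=" in t)
    msg = " ".join(t for t in tokens if "=" not in t)
    return {"ts": ts, "level": level, "msg": msg, **fields}


read_time = parse_at_read_time(RAW)
write_time = json.loads(STRUCTURED)

print(read_time["latency_ms"], type(read_time["latency_ms"]).__name__)
print(write_time["latency_ms"], type(write_time["latency_ms"]).__name__)
```

The read-time parser has to guess the layout and returns `"830"` as text. The structured line returns `830` as a number, so a range filter such as `latency_ms > 500` works without a cast.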
05
Inverted index vs labels-only index
ELK tokenises every value into a per-term posting list; Loki hashes the label-set into a stream id and appends the line verbatim — heavy index + small body vs tiny index + verbatim body.
~7 min
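The two index shapes can be sketched in a few lines. This is a toy model, not either system's real storage format: tokenisation is a whitespace split, and the stream id is a truncated SHA-256 of the sorted label pairs, both chosen only for illustration.

```python
import hashlib
from collections import defaultdict

LOGS = [
    ({"app": "api", "env": "prod"}, "timeout calling payments"),
    ({"app": "api", "env": "prod"}, "request ok"),
    ({"app": "web", "env": "prod"}, "timeout rendering page"),
]

# ELK-style: every token of every line gets a posting list of document ids.
inverted = defaultdict(set)
for doc_id, (_labels, line) in enumerate(LOGS):
    for token in line.lower().split():
        inverted[token].add(doc_id)


# Loki-style: only the label-set is indexed. It hashes to a stream id, and
# the line is appended to that stream's chunk without being tokenised.
def stream_id(labels):
    canonical = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()[:8]


streams = defaultdict(list)
for labels, line in LOGS:
    streams[stream_id(labels)].append(line)

print(len(inverted), "index terms vs", len(streams), "index entries")
```

Three short lines already produce seven index terms on the inverted side and only two stream entries on the labels side. That ratio is the "heavy index, small body" versus "tiny index, verbatim body" trade-off in miniature.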
05a
Same query, two execution plans
ELK answers via posting-list intersection — milliseconds; Loki resolves labels to chunks, fetches them from S3, and greps in-process — seconds to minutes. Opposite ends of the same trade-off curve.
~7 min
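A toy version of the two execution plans, for the query "lines from app=api that mention timeout". The data and function names are hypothetical, and the in-memory `chunks` dictionary stands in for chunk objects fetched from object storage.

```python
DOCS = [
    {"app": "api", "msg": "timeout calling payments"},
    {"app": "api", "msg": "request ok"},
    {"app": "web", "msg": "timeout rendering page"},
]

# ELK-style plan: look up two posting lists and intersect them.
postings = {}
for i, d in enumerate(DOCS):
    postings.setdefault(("app", d["app"]), set()).add(i)
    for tok in d["msg"].split():
        postings.setdefault(("msg", tok), set()).add(i)


def elk_query(app, term):
    hits = postings.get(("app", app), set()) & postings.get(("msg", term), set())
    return [DOCS[i]["msg"] for i in sorted(hits)]


# Loki-style plan: resolve labels to chunks, then scan every line in them.
chunks = {}
for d in DOCS:
    chunks.setdefault(d["app"], []).append(d["msg"])


def loki_query(app, substring):
    scanned = chunks.get(app, [])  # stand-in for downloading the chunks
    matches = [line for line in scanned if substring in line]
    return matches, len(scanned)


elk_hits = elk_query("api", "timeout")
loki_hits, lines_scanned = loki_query("api", "timeout")
print(elk_hits, loki_hits, lines_scanned)
```

Both plans return the same line. The difference is the work done: the first touches two sets, the second reads every line in every matching chunk, so its cost grows with the volume of data behind the label selector.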
06
Cardinality is the killer
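Every unique label-set is a Loki stream; every dynamic key is an ELK mapping field. Using request_id as a Loki label, or as a key name in ELK, grows the index with every request and can overwhelm it within minutes. Rule of thumb: low-cardinality values go in labels, high-cardinality values go in the body.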
~7 min
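A quick way to see the effect is to count distinct label-sets over the same events with and without the high-cardinality key. The event data below is synthetic and the helper name is made up for this sketch.

```python
def count_streams(events, label_keys):
    """Number of distinct label-sets, i.e. streams the index must track."""
    return len({tuple((k, e[k]) for k in label_keys) for e in events})


# Synthetic traffic: 3 apps x 2 environments, one request_id per event.
combos = [(a, e) for a in ("api", "web", "worker") for e in ("prod", "staging")]
events = [
    {"app": combos[i % 6][0], "env": combos[i % 6][1], "request_id": f"req-{i}"}
    for i in range(10_000)
]

low = count_streams(events, ["app", "env"])
high = count_streams(events, ["app", "env", "request_id"])
print(low, high)  # 6 10000
```

With `app` and `env` as labels the stream count stays at 6 no matter how much traffic arrives. Adding `request_id` makes it equal to the number of requests, so the index grows as fast as the data.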
07
Hot, warm, cold, frozen
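Storage cost per GB differs by roughly two orders of magnitude between NVMe and S3 Glacier Deep Archive, which forces tiering. Elasticsearch ILM moves an index through hot, warm, cold, and frozen phases before a final delete phase; Loki writes to object storage such as S3 from day one.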
~7 min
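Age-based tiering reduces to a lookup from data age to phase. The thresholds below are invented for illustration and are not recommended values; real ILM policies set a minimum age per phase in the policy itself.

```python
from datetime import timedelta

# Hypothetical age limits per phase, youngest first.
PHASES = [
    (timedelta(days=7), "hot"),
    (timedelta(days=30), "warm"),
    (timedelta(days=90), "cold"),
    (timedelta(days=365), "frozen"),
]


def phase_for(age):
    """Return the phase an index of the given age belongs to."""
    for limit, name in PHASES:
        if age < limit:
            return name
    return "delete"  # older than every tier: eligible for removal


for days in (1, 10, 45, 200, 400):
    print(days, phase_for(timedelta(days=days)))
```

The useful design question is where each threshold sits: every day a dataset stays in a faster tier is paid for at that tier's price, so the limits should follow how often data of that age is actually queried.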
08
Retention vs deletion — the index has to forget
Retention is when the system stops promising you can read; deletion is when bytes are physically gone — and the gap is where compliance bugs live.
~7 min
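The gap between the two can be shown with a toy store in which reads honour a retention window but bytes remain until a separate purge runs. The `Store` class and its day-number keys are hypothetical.

```python
class Store:
    def __init__(self, retention_days):
        self.retention_days = retention_days
        self.chunks = {}  # day number -> stored bytes

    def write(self, day, data):
        self.chunks[day] = data

    def read(self, day, today):
        if today - day > self.retention_days:
            return None  # retention: the system no longer serves this data
        return self.chunks.get(day)

    def purge(self, today):
        """Deletion: physically remove everything past the retention window."""
        expired = [d for d in self.chunks if today - d > self.retention_days]
        for d in expired:
            del self.chunks[d]
        return expired


store = Store(retention_days=30)
store.write(0, b"user email in a log line")

unreadable = store.read(0, today=45) is None  # past retention
still_on_disk = 0 in store.chunks             # but the bytes are still there
purged = store.purge(today=45)
gone = 0 not in store.chunks

print(unreadable, still_on_disk, purged, gone)
```

Between day 30 and the purge, the data is unreadable through the API yet still physically present. A compliance requirement to erase data is met only by the purge step, not by the retention check.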
09
Sampling — head, tail, and the only error
Head sampling decides at emit (cheap, blind); tail sampling decides at the collector (can keep all errors, costs buffer). A uniform 1% sample will almost certainly drop the only error you needed.
~7 min
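A small sketch of the difference, using synthetic requests where exactly one fails. To keep the result reproducible, "1%" is modelled as keeping every 100th request id rather than a random draw; that substitution and all names here are assumptions of the example.

```python
def make_requests(n=1000, failing=537):
    return [
        {"id": i, "lines": ["start", "ERROR boom" if i == failing else "ok", "end"]}
        for i in range(n)
    ]


def head_sample(requests, one_in=100):
    # Decided at emit time, before the outcome of the request is known.
    return [r for r in requests if r["id"] % one_in == 0]


def tail_sample(requests, one_in=100):
    kept = []
    for r in requests:
        # The collector has buffered every line of the request, so it can
        # inspect the outcome before deciding.
        has_error = any(line.startswith("ERROR") for line in r["lines"])
        if has_error or r["id"] % one_in == 0:
            kept.append(r)
    return kept


requests = make_requests()
head_ids = {r["id"] for r in head_sample(requests)}
tail_ids = {r["id"] for r in tail_sample(requests)}
print(len(head_ids), 537 in head_ids)  # 10 False
print(len(tail_ids), 537 in tail_ids)  # 11 True
```

Tail sampling keeps one extra request, the failing one, at the cost of holding every request's lines in memory until the decision is made.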
10
Distributors, ingesters, queriers (briefly)
Both ELK and Loki are sharded write-paths plus sharded read-paths plus a backing store; they differ in the partition key — hash(doc_id) vs hash(label-set).
~7 min
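The partition-key difference can be sketched as two routing functions over the same hash helper. This is a simplification under assumptions: SHA-256 with a modulo stands in for each system's actual hash function and ring logic, and the function names are hypothetical.

```python
import hashlib


def bucket(key, n):
    """Map a string key to one of n partitions."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n


def elk_shard(doc_id, num_shards):
    # Document-oriented routing: each document lands on a shard by its id,
    # so lines from one service spread across all shards.
    return bucket(doc_id, num_shards)


def loki_ingester(labels, num_ingesters):
    # Stream-oriented routing: every line with the same label-set lands on
    # the same ingester, so a stream's chunk is built in one place.
    canonical = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return bucket(canonical, num_ingesters)


a = loki_ingester({"app": "api", "env": "prod"}, 4)
b = loki_ingester({"env": "prod", "app": "api"}, 4)
shards = [elk_shard(f"doc-{i}", 5) for i in range(3)]
print(a == b, shards)
```

Sorting the label pairs before hashing makes the key independent of label order, which is why both calls route to the same ingester. It also shows why one very chatty label-set can overload a single ingester, whereas id-based routing spreads that load.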
11
Design your logging stack
Capstone: agent + buffer policy + structuring + index strategy + tiers + retention + sampling — the verifier traces every choice back to the scene that earned it.
~7 min