Build a distributed logging stack (ELK / Loki)
12 scenes · ~84 min · build the primitive

Ship lines off N hosts, choose what to index, age data through tiers, retain or delete on schedule, and survive a chatty service — built one decision at a time.

  1. 01
    grep + ssh stops working
    Once your fleet has more than a handful of hosts, answering 'which host saw this error' means shipping lines to a central store; local files plus ssh cost one login per host, an O(N) dead end.
    ~7 min
  2. 02
    An agent on every host
    A shipping agent tails each log file from a saved offset, batches lines, and POSTs them at-least-once — duplicates on retry are normal, not a bug.
    ~7 min
  3. 03
    Block, drop, or spill
    When the backend stalls, the agent must block, drop, or spill to disk — and `when_full=block` plus a synchronous logger is how a logging outage takes the application down with it.
    ~7 min
  4. 04
    String vs map — fields and labels
    A log line is either a string parsed at read-time or a typed map parsed at write-time, and the two systems we'll meet attach different names to the same idea — fields in ELK, labels in Loki.
    ~7 min
  5. 05
    Inverted index vs labels-only index
    ELK tokenises every value into a per-term posting list; Loki hashes the label-set into a stream id and appends the line verbatim — heavy index + small body vs tiny index + verbatim body.
    ~7 min
  6. 05a
    Same query, two execution plans
    ELK answers via posting-list intersection — milliseconds; Loki resolves labels to chunks, fetches them from S3, and greps in-process — seconds to minutes. Opposite ends of the same trade-off curve.
    ~7 min
  7. 06
    Cardinality is the killer
    Every unique label-set is a Loki stream; every dynamic key is an ELK mapping field. Putting request_id in either blows up the index within minutes. Rule of thumb: low-cardinality values go in labels, high-cardinality values stay in the body.
    ~7 min
  8. 07
    Hot, warm, cold, frozen
    Two orders of magnitude in cost between NVMe and Deep Archive force tiering. ELK's ILM ages data through hot, warm, cold, and frozen phases before a final delete phase; Loki collapses the tiers into S3 from day one.
    ~7 min
  9. 08
    Retention vs deletion — the index has to forget
    Retention expiry is when the system stops promising you can read the data; deletion is when the bytes are physically gone. The gap between the two is where compliance bugs live.
    ~7 min
  10. 09
    Sampling — head, tail, and the only error
    Head sampling decides at emit (cheap, blind); tail sampling decides at the collector (can keep all errors, costs buffer). A uniform 1% sample will almost certainly drop the one error you needed.
    ~7 min
  11. 10
    Distributors, ingesters, queriers (briefly)
    Both ELK and Loki are sharded write-paths plus sharded read-paths plus a backing store; they differ in the partition key — hash(doc_id) vs hash(label-set).
    ~7 min
  12. 11
    Design your logging stack
    Capstone: agent + buffer policy + structuring + index strategy + tiers + retention + sampling — the verifier traces every choice back to the scene that earned it.
    ~7 min
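
As a taste of what scenes 05 and 06 build, here is a minimal sketch of the "label-set hashes to a stream id" idea and of what promoting a high-cardinality value to a label does to the stream count. Everything in it is an illustrative assumption: `stream_id` and `count_streams` are hypothetical helpers, SHA-256 stands in for whatever fingerprint a real system uses, and the traffic is synthetic.

```python
import hashlib


def stream_id(labels: dict[str, str]) -> str:
    """Hash a label-set into a stable id, independent of key order.

    Assumption: SHA-256 over a sorted, canonical rendering. Real systems
    use their own fingerprint function; only the idea is shown here.
    """
    canonical = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def count_streams(events: list[dict[str, str]], label_keys: list[str]) -> int:
    """Count distinct streams when only `label_keys` are promoted to labels."""
    return len({stream_id({k: e[k] for k in label_keys}) for e in events})


# Synthetic traffic: 3 services x 2 levels, one unique request_id per line.
events = [
    {
        "service": f"svc-{i % 3}",
        "level": ("info", "error")[i % 2],
        "request_id": f"req-{i}",
    }
    for i in range(10_000)
]

low = count_streams(events, ["service", "level"])
high = count_streams(events, ["service", "level", "request_id"])

print(f"labels = service, level             -> {low} streams")
print(f"labels = service, level, request_id -> {high} streams")
```

With low-cardinality labels the stream count is bounded by the product of the label values; once `request_id` becomes a label, every line opens its own stream, which is the failure mode scene 06 is about.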