Build a Prometheus-style time-series database
12 scenes · ~84 min · build the primitive

Build your own Prometheus-style time-series database

The simplest database that can absorb a 1M-points-per-second firehose and still answer `sum(rate(http_requests_total{status="500"}[5m]))` in milliseconds — built bit by bit, literally.

  1. 01
    What a metric point actually is
    A point is (metric, label_set, ts, value). The first two parts are the series identity — change one label and you've named a different stream.
    ~7 min
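To make "series identity" concrete, here is a minimal sketch. The helper name `series_key` and the example labels are illustrative assumptions, not part of the course material. It shows that label order does not matter, while a single changed label value names a different series.

```python
def series_key(metric: str, labels: dict) -> tuple:
    """Canonical series identity: metric name plus labels sorted by label name."""
    return (metric, tuple(sorted(labels.items())))


# Same labels in a different order: the same series.
a = series_key("http_requests_total", {"status": "500", "method": "GET"})
b = series_key("http_requests_total", {"method": "GET", "status": "500"})

# One label value changed: a different series.
c = series_key("http_requests_total", {"method": "GET", "status": "200"})

print(a == b)  # True
print(a == c)  # False
```

The timestamp and value never appear in the key; they are the payload appended to whichever series the key selects.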
  2. 02
    Why one row per point is wrong
    Storing each point as a SQL row spends 60+ bytes of metadata to carry an 8-byte float — the labels JSON repeats on every row of the firehose.
    ~7 min
  3. 03
    Delta-of-delta crushes timestamps
    Scrapers run on a fixed cadence, so the delta between consecutive deltas (the second difference of the timestamps) is almost always zero, which lets roughly 96% of timestamps be encoded in a single bit.
    ~7 min
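The arithmetic behind delta-of-delta can be sketched as follows. This only computes the values that would be encoded; the variable-length bit packing is left out, and the function names and sample timestamps are assumptions made for illustration.

```python
def delta_of_delta(timestamps):
    """Return (first_ts, first_delta, dods) for a list of at least two timestamps."""
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    first = timestamps[0]
    first_delta = timestamps[1] - timestamps[0]
    dods = []
    prev_delta = first_delta
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        delta = cur - prev
        dods.append(delta - prev_delta)  # zero whenever the cadence is steady
        prev_delta = delta
    return first, first_delta, dods


def restore(first, first_delta, dods):
    """Inverse of delta_of_delta: rebuild the original timestamps."""
    out = [first, first + first_delta]
    delta = first_delta
    for dod in dods:
        delta += dod
        out.append(out[-1] + delta)
    return out


# A 15-unit scrape interval with one late and one early scrape.
ts = [1000, 1015, 1030, 1045, 1061, 1075]
encoded = delta_of_delta(ts)
print(encoded)            # (1000, 15, [0, 0, 1, -2])
print(restore(*encoded))  # [1000, 1015, 1030, 1045, 1061, 1075]
```

Every zero in the last list is a timestamp that an encoder could write as a single bit; only the jittered scrapes need more.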
  4. 04
    XOR crushes adjacent floats
    Adjacent float64s share most of their IEEE-754 bits. XOR them and the leftover bits are tiny — an unchanged value costs 1 bit.
    ~7 min
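A small sketch of why XOR works on adjacent floats, using only the standard `struct` module. The helper names are assumptions; the sketch measures how many bits of the XOR are meaningful rather than emitting an actual bit stream.

```python
import struct


def float_bits(x: float) -> int:
    """Reinterpret a float64 as its 64-bit IEEE-754 pattern."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_window(prev: float, cur: float):
    """Return (leading_zeros, trailing_zeros, meaningful_bits) of prev XOR cur."""
    x = float_bits(prev) ^ float_bits(cur)
    if x == 0:
        # Identical values: nothing to store beyond a single "same" flag bit.
        return (64, 0, 0)
    leading = 64 - x.bit_length()
    trailing = (x & -x).bit_length() - 1
    return (leading, trailing, 64 - leading - trailing)


print(xor_window(12.0, 12.0))  # (64, 0, 0)
print(xor_window(12.0, 24.0))  # (11, 52, 1)
```

Going from 12.0 to 24.0 only changes one exponent bit, so a single meaningful bit (plus a small header describing where it sits) is enough to reconstruct the new value.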
  5. 05
    A chunk: 120 points, packed
    Bundle ~120 consecutive points into a single bit-packed blob. Gorilla: 16 B/point → 1.37 B/point — about 12× compression.
    ~7 min
  6. 06
    Head chunks, WAL, and flushing
    The active chunk lives in RAM (the head); a write-ahead log on disk catches every sample, so a crash mid-chunk loses nothing.
    ~7 min
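The head-plus-WAL idea can be sketched in a few lines. This is a toy under stated assumptions: the `Head` class, the JSON-lines record format, and the fsync-per-sample policy are all illustrative choices, not how any particular database lays out its log.

```python
import json
import os
import tempfile


class Head:
    """In-memory samples backed by an append-only log that is replayed on startup."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.samples = {}  # series name -> list of (ts, value)
        self._replay()
        self.wal = open(wal_path, "a", encoding="utf-8")

    def _replay(self):
        # Rebuild the in-memory state from whatever the log already holds.
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)
                self.samples.setdefault(rec["series"], []).append((rec["ts"], rec["value"]))

    def append(self, series, ts, value):
        # Log first, then update memory: the sample is on disk before it is visible.
        self.wal.write(json.dumps({"series": series, "ts": ts, "value": value}) + "\n")
        self.wal.flush()
        os.fsync(self.wal.fileno())
        self.samples.setdefault(series, []).append((ts, value))

    def close(self):
        self.wal.close()


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "wal.log")
    head = Head(path)
    head.append("cpu", 1, 0.5)
    head.close()            # stand-in for the process dying
    recovered = Head(path)  # a fresh process replays the log
    print(recovered.samples)
    recovered.close()
```

A production log would typically batch writes, use a compact binary format, and checksum records; the ordering (log before memory) is the part this sketch is meant to show.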
  7. 07
    How a query becomes points
    A read is four stages — parse, resolve label-selectors to series IDs, decompress the matching chunks, then aggregate. Stage 3 dominates.
    ~7 min
  8. 08
    Inverted index — labels to series
    For each (label, value) pair the database stores a sorted list of series IDs (a postings list). A multi-label query is the intersection.
    ~7 min
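The postings-list intersection described above can be sketched as a two-pointer merge over sorted ID lists. The `postings` contents and the `select` helper are made-up examples for illustration.

```python
def intersect(a, b):
    """Intersect two sorted lists of series IDs with a linear two-pointer merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def select(postings, matchers):
    """Resolve equality matchers like ("status", "500") to matching series IDs."""
    if not matchers:
        return []
    # Start from the shortest list so intermediate results stay small.
    lists = sorted((postings.get(m, []) for m in matchers), key=len)
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return result


postings = {
    ("job", "api"): [1, 2, 5, 8],
    ("status", "500"): [2, 3, 8, 9],
    ("method", "GET"): [2, 8, 9],
}

print(select(postings, [("job", "api"), ("status", "500")]))  # [2, 8]
```

Because each list is sorted, the intersection never needs more than one pass over each input.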
  9. 09
    Cardinality is the killer
    Each unique label-set is one series with its own head chunk in RAM. Add an unbounded label like user_id and you OOM in minutes.
    ~7 min
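A back-of-the-envelope sketch of why one unbounded label is fatal. The label cardinalities and the 4 KiB per-series head cost are placeholder assumptions chosen for round numbers, not measurements, and the product is an upper bound because not every label combination necessarily occurs.

```python
from math import prod


def max_series(label_cardinalities: dict) -> int:
    """Upper bound on series count: the product of distinct values per label."""
    return prod(label_cardinalities.values())


def head_ram_bytes(n_series: int, bytes_per_series: int = 4096) -> int:
    """Projected head memory, assuming a flat (hypothetical) cost per series."""
    return n_series * bytes_per_series


base = {"method": 5, "status": 6, "instance": 50}
with_user_id = {**base, "user_id": 100_000}

print(max_series(base))                                # 1500
print(max_series(with_user_id))                        # 150000000
print(head_ram_bytes(max_series(base)))                # 6144000
print(head_ram_bytes(max_series(with_user_id)))        # 614400000000
```

Under these assumptions the bounded label set fits in a few megabytes, while adding `user_id` multiplies the projection by the number of users.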
  10. 10
    Downsampling — a retention pyramid
    Aggregate old chunks into 5-minute, then 1-hour buckets, dropping originals as you go. Recording rules materialize query results as new series, so they cost storage.
    ~7 min
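One way to picture a downsampling pass is the sketch below. Keeping min, max, sum, and count per bucket is an assumption made here so that averages can still be derived later; the function name and sample points are also illustrative.

```python
from collections import defaultdict


def downsample(points, bucket_seconds):
    """Group (ts, value) points into fixed-width buckets and keep summary stats."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp down to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return [
        (start, {"min": min(vs), "max": max(vs), "sum": sum(vs), "count": len(vs)})
        for start, vs in sorted(buckets.items())
    ]


raw = [(0, 1.0), (60, 3.0), (299, 2.0), (300, 5.0)]
for start, stats in downsample(raw, 300):
    print(start, stats)
# 0 {'min': 1.0, 'max': 3.0, 'sum': 6.0, 'count': 3}
# 300 {'min': 5.0, 'max': 5.0, 'sum': 5.0, 'count': 1}
```

The 1-hour tier can be built from the 5-minute tier by merging summaries (min of mins, max of maxes, sums and counts added), which is why the raw points can be dropped once a tier exists.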
  11. 10a
    Single-node by design; HA is somebody else's problem
    The TSDB itself isn't replicated. HA means running two parallel scrapers; durability means shipping every sample to a remote-write receiver that deduplicates.
    ~7 min
  12. 11
    Design canvas: pick a workload, ship a config
    Capstone: alerting, tracing, or business KPIs — the verifier turns scrape interval, label set, retention, and rules into projected RAM, disk, and a fits/refuses verdict.
    ~7 min