Build a Prometheus-style time-series database
12 scenes · ~84 min · build the primitive
The simplest database that can absorb a 1M-points-per-second firehose and still answer `sum(rate(http_requests_total{status="500"}[5m]))` in milliseconds — built bit by bit, literally.
- 01 · What a metric point actually is · ~7 min
  A point is (metric, label_set, ts, value). The first two parts are the series identity: change one label and you've named a different stream.
- 02 · Why one row per point is wrong · ~7 min
  Storing each point as a SQL row spends 60+ bytes of metadata to carry an 8-byte float, because the labels JSON repeats on every row of the firehose.
- 03 · Delta-of-delta crushes timestamps · ~7 min
  Scrapers run on a fixed cadence, so the second derivative of timestamps is almost always zero, which lets ~96% of timestamps be encoded in a single bit.
- 04 · XOR crushes adjacent floats · ~7 min
  Adjacent float64s share most of their IEEE-754 bits. XOR them and only a few meaningful bits remain; an unchanged value costs 1 bit.
- 05 · A chunk: 120 points, packed · ~7 min
  Bundle ~120 consecutive points into a single bit-packed blob. Gorilla: 16 B/point → 1.37 B/point, about 12× compression.
- 06 · Head chunks, WAL, and flushing · ~7 min
  The active chunk lives in RAM (the head); a write-ahead log on disk catches every sample so a crash mid-chunk loses nothing.
- 07 · How a query becomes points · ~7 min
  A read is four stages: parse, resolve label selectors to series IDs, decompress the matching chunks, then aggregate. Stage 3 dominates.
- 08 · Inverted index: labels to series · ~7 min
  For each (label, value) pair the database stores a sorted list of series IDs (a postings list). A multi-label query is the intersection of those lists.
- 09 · Cardinality is the killer · ~7 min
  Each unique label set is one series with its own head chunk in RAM. Add an unbounded label like `user_id` and you OOM in minutes.
- 10 · Downsampling: a retention pyramid · ~7 min
  Aggregate old chunks into 5-minute, then 1-hour buckets, dropping originals as you go. Recording rules materialize their results as new series, so they cost storage.
- 10a · Single-node by design; HA is somebody else's problem · ~7 min
  The TSDB itself isn't replicated. HA means two parallel scrapers; durability means shipping every sample to a remote-write receiver that deduplicates.
- 11 · Design canvas: pick a workload, ship a config · ~7 min
  Capstone: pick alerting, tracing, or business KPIs; the verifier turns scrape interval, label set, retention, and rules into projected RAM, disk, and a fits/refuses verdict.
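
As a taste of scenes 03 and 04, here is a minimal Python sketch of the two residuals those scenes compress: the delta-of-delta of timestamps and the XOR of adjacent float64 bit patterns. It only computes the residuals; the variable-length bit packing that turns them into a chunk is not shown. The helper names and the sample data (a 15-second scrape with one sample arriving a second late) are assumptions made for illustration, not part of the course material.

```python
import struct


def delta_of_delta(timestamps):
    """Return the second differences of a timestamp sequence."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]


def float_bits(x):
    """Reinterpret a float64 as a 64-bit unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_adjacent(values):
    """XOR each float64 bit pattern with its predecessor's."""
    bits = [float_bits(v) for v in values]
    return [b ^ a for a, b in zip(bits, bits[1:])]


def meaningful_bits(x):
    """Count the bits left between the leading and trailing zero runs."""
    if x == 0:
        return 0  # identical values: nothing to store beyond a flag bit
    trailing_zeros = (x & -x).bit_length() - 1
    return x.bit_length() - trailing_zeros


# Hypothetical data: 15 s scrape interval, one sample one second late.
timestamps = [0, 15, 30, 45, 60, 76, 90, 105, 120]
values = [12.0, 12.0, 12.5, 13.0, 13.0]

dod = delta_of_delta(timestamps)
xors = xor_adjacent(values)
widths = [meaningful_bits(x) for x in xors]

print("delta-of-delta:", dod)
print("meaningful XOR bits:", widths)
```

On a steady scrape most delta-of-delta entries are zero, and repeated values XOR to zero, which is where the single-bit encodings mentioned above come from.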