Inverted index — labels to series — Build a Prometheus-style time-series database

Build a Prometheus-style time-series database (12 scenes)

Scene 08 · Inverted index — labels to series

For each (label, value) pair the database stores a sorted list of series IDs (a postings list). A multi-label query is the intersection.

Previously

Stage 2 is a search engine in disguise — postings lists and intersections, just like Lucene.

Scene 08

Inverted index: labels to series

Watch

Diagram

Left: a column of (label = value) keys, each with its postings list — a sorted row of series-ID tiles, with a size badge. Right: an intersection lane that takes the selected lists and animates the merge-walk; the smallest list is highlighted as the driver of cost. Bottom: a result chip with the surviving series IDs.

How does the database go from `{method=POST, status=500}` to series IDs without scanning every series? It keeps a tiny sorted list of series IDs for every (label, value) pair — a postings list. The query becomes an intersection of two such lists. Watch status=500 (the smaller list) drive the walk; each of its 2 IDs is probed against POST.

Implementation

Index.addSeries

every label of every series fans out into one postings list

1def addSeries(seriesId, labels):
2    for (k, v) in labels:
3        # postings: (label, value) -> sorted [seriesId, ...]
4        postings[(k, v)].sortedInsert(seriesId)
5    # special list for the no-selector query {}
6    postings[allPostingsKey].sortedInsert(seriesId)

Index.resolveSeries

selectors -> postings lists, smallest first

1def resolveSeries(selectors):
2    if len(selectors) == 0:
3        return postings[allPostingsKey]
4    lists = [postings[(k, v)] for (k, v) in selectors]
5    # planner's only heuristic: drive the walk by the
6    # smallest list — the corpus size never appears.
7    lists.sort(key = len)
8    if len(lists) == 1:
9        return lists[0]
10    return intersect(lists)

Index.intersect

walk the smallest list, probe the others

1def intersect(lists):
2    driver = lists[0]      # smallest after sort
3    others = lists[1:]
4    out = []
5    for candidate in driver:
6        # gallop / binary search — never a linear scan
7        if all(other.contains(candidate)
8               for other in others):
9            out.append(candidate)
10    return out             # cost = O(|driver|)

PreviousHow a query becomes points NextCardinality is the killer