Search indexes bolted on the side

Build a graph database (Neo4j / Dgraph-style) (16 scenes)

Scene 12 · Search indexes bolted on the side

Index-free adjacency only serves EXPAND, so finding start nodes by value — full-text, range, geo, vector — is served by classic secondary indexes (Lucene, B-tree) maintained as separate structures riding shotgun, with the usual write-amplification and staleness costs.

Previously

Sharding placed the edges; it never solved finding the anchor by value. That job goes to classic indexes — full-text, B-tree, vector — bolted on the side, feeding start nodes into the native pointer-chase, with their own staleness costs. So now we have the complete picture for LOCAL work: SEEK an anchor, then EXPAND cheaply. But every query we've built lights up only a few nodes. What about the questions that need the WHOLE graph?



Diagram

The native pointer-chase store is in the CENTER — that's index-free adjacency, which only serves EXPAND (following edges you already hold). Bolted on the SIDE are separate structures that find a start node by VALUE: a full-text index (Lucene) for free-text search, a B-tree for exact/range lookups, and a vector index for nearest-by-embedding. Each one is a secondary index: it answers 'which node matches this value?' and hands ONE start-node id into the core's SEEK; the graph engine then expands natively. A write into the core has to be re-applied to every side box too — boxes that don't update on the write path lag behind and go STALE.

Sources

The native pointer-chase store sits in the center — that's the index-free adjacency you built: follow edges you already hold, O(1) per hop. But it can ONLY expand from a node you already have. To start 'from the product whose description says cordless drill', or 'from users aged 30-40', or 'from the photo most similar to this embedding', the engine asks a SEPARATE box on the side. Watch each search box light up and hand exactly one start-node id into the core's SEEK — then the native walk takes over.
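
To make "the photo most similar to this embedding" concrete, here is a minimal sketch of what a vector side index answers. It is a brute-force scan over a hypothetical `EMBEDDINGS` table, not how a production vector index works (those typically use approximate structures such as HNSW); the names `EMBEDDINGS`, `cosine`, and `nearest` are assumptions made up for this illustration.

```python
import math

# Hypothetical side structure: node id -> embedding vector.
# It lives outside the graph store and only knows ids and vectors.
EMBEDDINGS = {
    10: [1.0, 0.0],
    11: [0.0, 1.0],
    12: [0.7, 0.7],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(query, k=1):
    """Return the k node ids whose embeddings are most similar to query."""
    ranked = sorted(
        EMBEDDINGS,
        key=lambda node_id: cosine(query, EMBEDDINGS[node_id]),
        reverse=True,
    )
    return ranked[:k]
```

The ids that `nearest` returns are all the core needs: each one becomes a start node for the native walk.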

Implementation

Query.run

value-search SEEKs an anchor (side index), then EXPANDs natively

def run(query):
    # SEEK: find the start node BY VALUE (not possible natively)
    index = pick_index(query.predicate)    # lucene | btree | vector
    start_id = index.lookup(query.value)   # value -> node id
    # EXPAND: index-free adjacency, O(1) per hop
    node = store.load(start_id)
    return traverse(node, query.pattern)   # follow pointers
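
The SEEK-then-EXPAND split above can be tried end to end with a toy store. This is a sketch under stated assumptions: `NODES`, `NAME_INDEX`, `seek`, and `expand` are hypothetical names, a plain dict stands in for the side index, and a breadth-first walk stands in for whatever pattern the real `traverse` would follow.

```python
from collections import deque

# Hypothetical core store: node id -> properties plus outgoing neighbor ids.
NODES = {
    1: {"name": "cordless drill", "out": [2, 3]},
    2: {"name": "battery pack", "out": [4]},
    3: {"name": "drill bits", "out": []},
    4: {"name": "charger", "out": []},
}

# Side index, built and held separately from the store: name -> node id.
NAME_INDEX = {props["name"]: node_id for node_id, props in NODES.items()}

def seek(name):
    """SEEK: resolve a value to a start-node id via the side index."""
    return NAME_INDEX.get(name)

def expand(start_id, max_hops):
    """EXPAND: breadth-first walk over stored neighbor ids, no index involved."""
    seen = {start_id}
    frontier = deque([(start_id, 0)])
    order = []
    while frontier:
        node_id, depth = frontier.popleft()
        order.append(node_id)
        if depth == max_hops:
            continue
        for neighbor in NODES[node_id]["out"]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return order

def run(name, max_hops):
    start_id = seek(name)
    if start_id is None:
        return []          # nothing to anchor on, so nothing to expand
    return expand(start_id, max_hops)
```

Note that `expand` never touches `NAME_INDEX`: once the anchor is known, the walk only follows ids the store already holds.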

Store.write

a write must fan out to every secondary index (amp / staleness)

def write(node, change):
    store.apply(change)            # the core mutation
    for index in secondary_indexes:
        if index.sync:
            index.reindex(node)    # on the write path -> write-amp
        else:
            enqueue_async(index, node)  # lags -> stale window
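
The two branches of the fan-out can be observed directly in a small simulation. This is an illustrative sketch, not the engine's real write path: `SideIndex`, `Store`, `pending`, and `drain` are hypothetical names, and an in-memory queue stands in for whatever async mechanism a real system would use.

```python
from collections import deque

class SideIndex:
    """Hypothetical secondary index: value -> node id."""
    def __init__(self, name, sync):
        self.name = name
        self.sync = sync      # True: updated on the write path
        self.map = {}

    def reindex(self, node_id, old_value, new_value):
        # Drop the pre-write entry, then record the new one.
        if old_value is not None and self.map.get(old_value) == node_id:
            del self.map[old_value]
        self.map[new_value] = node_id

class Store:
    def __init__(self, indexes):
        self.values = {}          # the core: node id -> value
        self.indexes = indexes
        self.pending = deque()    # async work not yet applied
        self.index_writes = 0     # extra writes caused by indexing

    def write(self, node_id, new_value):
        old_value = self.values.get(node_id)
        self.values[node_id] = new_value          # the core mutation
        for index in self.indexes:
            if index.sync:
                index.reindex(node_id, old_value, new_value)
                self.index_writes += 1            # write amplification
            else:
                self.pending.append((index, node_id, old_value, new_value))

    def drain(self):
        """Apply queued index updates; until this runs, async indexes are stale."""
        while self.pending:
            index, node_id, old_value, new_value = self.pending.popleft()
            index.reindex(node_id, old_value, new_value)
            self.index_writes += 1
```

With one sync and one async index, every core write costs two index writes in total; the only difference is whether the caller waits for them or tolerates a stale window until `drain` runs.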

Planner.pickIndex

each value predicate routes to its OWN side structure — none of it is the native walk

def pick_index(predicate):
    # the core can only EXPAND, so route by value-kind
    if predicate.kind == TEXT:
        return lucene   # words -> node ids (free-text)
    if predicate.kind == RANGE:
        return btree    # value/range -> node ids
    if predicate.kind == VECTOR:
        return vector   # nearest embedding -> node ids
    raise NoIndex       # else: full label scan
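
For the RANGE branch, a query like "users aged 30-40" needs an ordered structure so that both ends of the range can be found without scanning every node. The sketch below uses a sorted list with Python's `bisect` module as a stand-in for a B-tree; `SortedRangeIndex` and its methods are hypothetical names, and a real B-tree would page its entries to disk rather than hold one in-memory list.

```python
import bisect

class SortedRangeIndex:
    """Stand-in for a B-tree: a sorted list of (value, node_id) pairs."""
    def __init__(self):
        self.entries = []

    def add(self, value, node_id):
        # Keep entries ordered by value (then node id) on every insert.
        bisect.insort(self.entries, (value, node_id))

    def range(self, low, high):
        """Return node ids whose value lies in [low, high], in value order."""
        lo = bisect.bisect_left(self.entries, (low, float("-inf")))
        hi = bisect.bisect_right(self.entries, (high, float("inf")))
        return [node_id for _, node_id in self.entries[lo:hi]]

ages = SortedRangeIndex()
for age, node_id in [(41, 5), (25, 1), (35, 3), (30, 2), (40, 4)]:
    ages.add(age, node_id)

matches = ages.range(30, 40)   # ids of users aged 30 to 40 inclusive
```

Unlike an exact lookup, a range predicate can yield several candidate start nodes, which matches the "-> node ids" comments in `pick_index`.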

Index.lookup

a side index returns whatever it currently maps — fresh only if it re-applied the last write

def lookup(self, value):
    # this structure is maintained SEPARATELY from the core
    node_id = self.map.get(value)   # value -> node id
    # if a write hasn't been re-applied here yet,
    # self.map still holds the pre-write entry
    return node_id   # may point at a since-changed node
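
A stale hit is easy to reproduce, and one possible way to catch it is to re-read the node from the core before trusting the index. The scene does not prescribe this check; it is an assumption added here for illustration, and `core`, `email_index`, and `lookup_checked` are hypothetical names.

```python
# Hypothetical core store and side index, maintained separately.
core = {7: {"email": "old@example.com"}}
email_index = {"old@example.com": 7}      # value -> node id

def lookup(value):
    """Plain side-index lookup: returns whatever the index currently maps."""
    return email_index.get(value)

def lookup_checked(value):
    """Lookup that re-validates the hit against the core before returning it."""
    node_id = lookup(value)
    if node_id is None:
        return None
    # The index may lag: confirm the node still holds the searched value.
    if core[node_id]["email"] != value:
        return None
    return node_id

# The core changes, but the index has not re-applied the write yet.
core[7]["email"] = "new@example.com"
```

After that last write, `lookup("old@example.com")` still returns node 7, and `lookup("new@example.com")` finds nothing. The re-validation filters the first kind of error at the cost of an extra core read, but it cannot recover the second: a value the index has not seen yet stays invisible until the index catches up.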
