All scenes
Build a graph database (Neo4j / Dgraph-style)
16 scenes · ~112 min · build the primitive
Build your own graph database (Neo4j / Dgraph-style)
When the workload is 'friends of friends', a relational join melts. Build a store where edges are first-class — index-free adjacency, traversals that follow pointers instead of joining tables, and a query language (Cypher / GraphQL+) that thinks in patterns. Feel why graph storage shines for traversal-heavy work and stumbles on full-graph aggregates.
- 01Friends of friends melts a joinThe canonical graph workload — 'friends of friends of friends' — is one self-join per hop, and the intermediate row count explodes combinatorially until a relational engine can't finish; the same query as a pointer walk visits only the people you actually reach.~7 min
- 02Nodes, relationships, propertiesThe property-graph model is four things — nodes, typed relationships, labels, and properties on both — and putting key/values directly ON the edge is the move that relational join-tables and RDF triples can't make cleanly.~7 min
- 03Index-free adjacency: follow pointersEach node stores a direct pointer to its relationships, so following an edge is a pointer dereference — O(1) per hop, independent of total graph size — not a B-tree seek that gets slower as the database grows.~7 min
- 04Store records and the relationship chainFixed-size store records (node ~15 B, relationship ~34 B) mean record-id × size = byte offset with no index, and each relationship sits in a doubly-linked list for BOTH its endpoints — so a node's edges are a chain you can walk, insert into, and delete from in place.~7 min
- 05You still need an index to startIndex-free adjacency only covers EXPAND (traversing from a node you hold); to SEEK the anchor — 'start from Alice' — you still need a regular index, and without one finding the start node is a full label scan over every :User.~7 min
- 06Linked list vs CSR adjacencyDoubly-linked relationship lists are cache-hostile but mutate in place; Compressed Sparse Row packs a node's neighbors into one contiguous slice that streams through cache but rebuilds the whole array to add a single edge — the OLTP-graph vs OLAP-graph storage fork.~7 min
- 07Cypher: describe a shapeA Cypher MATCH (a:User {name:'Alice'})-[:FRIEND]->(b)-[:FRIEND]->(c) is a declarative description of a pattern, and for expansion it maps directly onto pointer-chasing — no join planner needed.~7 min
- 08Anchor selection and cardinalityThe planner's real job is to anchor on the most selective node and estimate cardinality — start from the few, not the many — so flipping a WHERE filter can flip which end is cheapest and reverse the whole traversal direction.~7 min
- 09The supernode breaks the promiseA supernode is a vertex with a huge degree, and its relationship chain is so long that any traversal THROUGH it must scan millions of edges — so the O(1)-per-hop promise dies on that one node, and the fix is to expand from the low-degree side.~7 min
- 09aDon't block The RockConcurrent edge-adds to a supernode historically serialized on a whole-node lock; relationship-chain locks plus dense-node grouping let writers touch different parts of the chain at once, turning a serialized hotspot into parallel throughput.~7 min
- 10ACID on a single primaryA single-primary graph store gives full ACID — a write-ahead logical log for durability, write locks held to commit, and deadlock detection via a wait-for graph — at the cost of write throughput bounded by one machine.~7 min
- 11The cut turns hops into RPCsSplit a graph across two machines and any cut severs edges, so a traversal that crosses the cut turns each pointer dereference into a network round trip — the boundary is exactly where O(1) becomes O(network).~7 min
- 11aEdge-cut, vertex-cut, predicate shardingEdge-cut assigns each vertex to a machine and chokes on power-law hubs; vertex-cut splits the hub across machines to balance edges; predicate-sharding keeps all edges of one TYPE together so a common expand stays on one machine.~7 min
- 12Search indexes bolted on the sideIndex-free adjacency only serves EXPAND, so finding start nodes by value — full-text, range, geo, vector — is served by classic secondary indexes (Lucene, B-tree) maintained as separate structures riding shotgun, with the usual write-amplification and staleness costs.~7 min
- 13Think like a vertex: Pregel / BSPWhole-graph algorithms have no locality to exploit, so the model flips to Pregel/BSP — every vertex runs a small function, sends messages to neighbors, then all vertices wait at a synchronization barrier before the next superstep — and a supernode at the barrier stalls everyone.~7 min
- 14Design your graph databaseEvery graph-DB deployment is a deliberate set of choices — storage layout, index strategy, supernode handling, single-node ACID vs distributed, and traversal-heavy vs aggregate-heavy workload — and the right configuration for a fraud-ring traversal is wrong for a PageRank pipeline even though the primitives are identical.~7 min