Build a graph database (Neo4j / Dgraph-style)
16 scenes · ~112 min · build the primitive

Build your own graph database (Neo4j / Dgraph-style)

When the workload is 'friends of friends', a relational join melts. Build a store where edges are first-class — index-free adjacency, traversals that follow pointers instead of joining tables, and a query language (Cypher / GraphQL+) that thinks in patterns. Feel why graph storage shines for traversal-heavy work and stumbles on full-graph aggregates.

  1. 01
  2. 02
  3. 03
  4. 04
  5. 05
  6. 06
  7. 07
  8. 08
  9. 09
  10. 09a
  11. 10
  12. 11
  13. 11a
  14. 12
  15. 13
  16. 14
  1. 01
    Friends of friends melts a join
    The canonical graph workload — 'friends of friends of friends' — is one self-join per hop, and the intermediate row count explodes combinatorially until a relational engine can't finish; the same query as a pointer walk visits only the people you actually reach.
    ~7 min
  2. 02
    Nodes, relationships, properties
    The property-graph model is four things — nodes, typed relationships, labels, and properties on both — and putting key/values directly ON the edge is the move that relational join-tables and RDF triples can't make cleanly.
    ~7 min
  3. 03
    Index-free adjacency: follow pointers
    Each node stores a direct pointer to its relationships, so following an edge is a pointer dereference — O(1) per hop, independent of total graph size — not a B-tree seek that gets slower as the database grows.
    ~7 min
  4. 04
    Store records and the relationship chain
    Fixed-size store records (node ~15 B, relationship ~34 B) mean record-id × size = byte offset with no index, and each relationship sits in a doubly-linked list for BOTH its endpoints — so a node's edges are a chain you can walk, insert into, and delete from in place.
    ~7 min
  5. 05
    You still need an index to start
    Index-free adjacency only covers EXPAND (traversing from a node you hold); to SEEK the anchor — 'start from Alice' — you still need a regular index, and without one finding the start node is a full label scan over every :User.
    ~7 min
  6. 06
    Linked list vs CSR adjacency
    Doubly-linked relationship lists are cache-hostile but mutate in place; Compressed Sparse Row packs a node's neighbors into one contiguous slice that streams through cache but rebuilds the whole array to add a single edge — the OLTP-graph vs OLAP-graph storage fork.
    ~7 min
  7. 07
    Cypher: describe a shape
    A Cypher MATCH (a:User {name:'Alice'})-[:FRIEND]->(b)-[:FRIEND]->(c) is a declarative description of a pattern, and for expansion it maps directly onto pointer-chasing — no join planner needed.
    ~7 min
  8. 08
    Anchor selection and cardinality
    The planner's real job is to anchor on the most selective node and estimate cardinality — start from the few, not the many — so flipping a WHERE filter can flip which end is cheapest and reverse the whole traversal direction.
    ~7 min
  9. 09
    The supernode breaks the promise
    A supernode is a vertex with a huge degree, and its relationship chain is so long that any traversal THROUGH it must scan millions of edges — so the O(1)-per-hop promise dies on that one node, and the fix is to expand from the low-degree side.
    ~7 min
  10. 09a
    Don't block The Rock
    Concurrent edge-adds to a supernode historically serialized on a whole-node lock; relationship-chain locks plus dense-node grouping let writers touch different parts of the chain at once, turning a serialized hotspot into parallel throughput.
    ~7 min
  11. 10
    ACID on a single primary
    A single-primary graph store gives full ACID — a write-ahead logical log for durability, write locks held to commit, and deadlock detection via a wait-for graph — at the cost of write throughput bounded by one machine.
    ~7 min
  12. 11
    The cut turns hops into RPCs
    Split a graph across two machines and any cut severs edges, so a traversal that crosses the cut turns each pointer dereference into a network round trip — the boundary is exactly where O(1) becomes O(network).
    ~7 min
  13. 11a
    Edge-cut, vertex-cut, predicate sharding
    Edge-cut assigns each vertex to a machine and chokes on power-law hubs; vertex-cut splits the hub across machines to balance edges; predicate-sharding keeps all edges of one TYPE together so a common expand stays on one machine.
    ~7 min
  14. 12
    Search indexes bolted on the side
    Index-free adjacency only serves EXPAND, so finding start nodes by value — full-text, range, geo, vector — is served by classic secondary indexes (Lucene, B-tree) maintained as separate structures riding shotgun, with the usual write-amplification and staleness costs.
    ~7 min
  15. 13
    Think like a vertex: Pregel / BSP
    Whole-graph algorithms have no locality to exploit, so the model flips to Pregel/BSP — every vertex runs a small function, sends messages to neighbors, then all vertices wait at a synchronization barrier before the next superstep — and a supernode at the barrier stalls everyone.
    ~7 min
  16. 14
    Design your graph database
    Every graph-DB deployment is a deliberate set of choices — storage layout, index strategy, supernode handling, single-node ACID vs distributed, and traversal-heavy vs aggregate-heavy workload — and the right configuration for a fraud-ring traversal is wrong for a PageRank pipeline even though the primitives are identical.
    ~7 min