Build your own graph database (Neo4j / Dgraph-style)

When the workload is 'friends of friends', a relational join melts. Build a store where edges are first-class — index-free adjacency, traversals that follow pointers instead of joining tables, and a query language (Cypher / GraphQL+) that thinks in patterns. Feel why graph storage shines for traversal-heavy work and stumbles on full-graph aggregates.

01
Friends of friends melts a join
The canonical graph workload — 'friends of friends of friends' — is one self-join per hop, and the intermediate row count explodes combinatorially until a relational engine can't finish; the same query as a pointer walk visits only the people you actually reach.
~7 min
02
Nodes, relationships, properties
The property-graph model is four things — nodes, typed relationships, labels, and properties on both — and putting key/values directly ON the edge is the move that relational join-tables and RDF triples can't make cleanly.
~7 min
03
Index-free adjacency: follow pointers
Each node stores a direct pointer to its relationships, so following an edge is a pointer dereference — O(1) per hop, independent of total graph size — not a B-tree seek that gets slower as the database grows.
~7 min
04
Store records and the relationship chain
Fixed-size store records (node ~15 B, relationship ~34 B) mean record-id × size = byte offset with no index, and each relationship sits in a doubly-linked list for BOTH its endpoints — so a node's edges are a chain you can walk, insert into, and delete from in place.
~7 min
05
You still need an index to start
Index-free adjacency only covers EXPAND (traversing from a node you hold); to SEEK the anchor — 'start from Alice' — you still need a regular index, and without one finding the start node is a full label scan over every :User.
~7 min
06
Linked list vs CSR adjacency
Doubly-linked relationship lists are cache-hostile but mutate in place; Compressed Sparse Row packs a node's neighbors into one contiguous slice that streams through cache but rebuilds the whole array to add a single edge — the OLTP-graph vs OLAP-graph storage fork.
~7 min
07
Cypher: describe a shape
A Cypher MATCH (a:User {name:'Alice'})-[:FRIEND]->(b)-[:FRIEND]->(c) is a declarative description of a pattern, and for expansion it maps directly onto pointer-chasing — no join planner needed.
~7 min
08
Anchor selection and cardinality
The planner's real job is to anchor on the most selective node and estimate cardinality — start from the few, not the many — so flipping a WHERE filter can flip which end is cheapest and reverse the whole traversal direction.
~7 min
09
The supernode breaks the promise
A supernode is a vertex with a huge degree, and its relationship chain is so long that any traversal THROUGH it must scan millions of edges — so the O(1)-per-hop promise dies on that one node, and the fix is to expand from the low-degree side.
~7 min
09a
Don't block The Rock
Concurrent edge-adds to a supernode historically serialized on a whole-node lock; relationship-chain locks plus dense-node grouping let writers touch different parts of the chain at once, turning a serialized hotspot into parallel throughput.
~7 min
10
ACID on a single primary
A single-primary graph store gives full ACID — a write-ahead logical log for durability, write locks held to commit, and deadlock detection via a wait-for graph — at the cost of write throughput bounded by one machine.
~7 min
11
The cut turns hops into RPCs
Split a graph across two machines and any cut severs edges, so a traversal that crosses the cut turns each pointer dereference into a network round trip — the boundary is exactly where O(1) becomes O(network).
~7 min
11a
Edge-cut, vertex-cut, predicate sharding
Edge-cut assigns each vertex to a machine and chokes on power-law hubs; vertex-cut splits the hub across machines to balance edges; predicate-sharding keeps all edges of one TYPE together so a common expand stays on one machine.
~7 min
12
Search indexes bolted on the side
Index-free adjacency only serves EXPAND, so finding start nodes by value — full-text, range, geo, vector — is served by classic secondary indexes (Lucene, B-tree) maintained as separate structures riding shotgun, with the usual write-amplification and staleness costs.
~7 min
13
Think like a vertex: Pregel / BSP
Whole-graph algorithms have no locality to exploit, so the model flips to Pregel/BSP — every vertex runs a small function, sends messages to neighbors, then all vertices wait at a synchronization barrier before the next superstep — and a supernode at the barrier stalls everyone.
~7 min
14
Design your graph database
Every graph-DB deployment is a deliberate set of choices — storage layout, index strategy, supernode handling, single-node ACID vs distributed, and traversal-heavy vs aggregate-heavy workload — and the right configuration for a fraud-ring traversal is wrong for a PageRank pipeline even though the primitives are identical.
~7 min