Build a B-tree storage engine (SQLite-style)
11 scenes · ~77 min · build the primitive

Build your own B-tree storage engine (SQLite-style)

What actually happens when you run INSERT INTO users(...). One file of fixed-size pages, organized as B-trees, with a write-ahead log that turns commits into appends. Build it from a SQL writer's perspective and feel why every knob exists.

  1. 01
  2. 02
  3. 03
  4. 04
  5. 05
  6. 05a
  7. 06
  8. 07
  9. 08
  10. 09
  11. 10
  1. 01
    Find the row — why a file scan won't do
    A SQL engine has to answer point, range, and indexed lookups against ONE file — the only family of structures that can do all three is an ordered tree.
    ~7 min
  2. 02
    The file is a strip of pages
    The DB file is a strip of identical fixed-size pages — every read/write is one whole page because the disk and OS work in pages.
    ~7 min
  3. 03
    Pages linked into a tree
    Pages are linked into a tree where leaves hold rows and interior pages hold keys + pointers — branching factor is hundreds, so 3 levels = ~64M rows.
    ~7 min
  4. 04
    Searching — a cursor descends the tree
    A point query reads exactly ONE page per tree level — three I/Os to find one row in a tree of millions, even on a miss.
    ~7 min
  5. 05
    Insert and split — the tree grows
    An INSERT writes one cell when there's room; when full, the leaf splits and a separator promotes to the parent — cascades grow the tree.
    ~7 min
  6. 05a
    Delete and VACUUM — the file doesn't shrink
    DELETE doesn't shrink the file — pages get freeblocks but never merge; only VACUUM rebuilds the file (at 2x disk cost).
    ~7 min
  7. 06
    Indexes — a second B-tree
    Each CREATE INDEX builds a second B-tree; reads do TWO descents, but every write must update every index.
    ~7 min
  8. 07
    The page cache — RAM is 1000x faster than disk
    The pager caches pages in RAM with LRU; you're fast iff the working set fits, slow the moment it doesn't.
    ~7 min
  9. 08
    WAL — durability without rewriting the file every commit
    Commits append page-images to the WAL and fsync; the main DB file is left alone until checkpoint — fast, and readers don't block.
    ~7 min
  10. 09
    Checkpoint — and the 20 GB WAL
    Checkpoints copy frames back to the DB file and rewind the WAL — but a long-lived reader can pin the WAL open forever, blowing it up.
    ~7 min
  11. 10
    Design canvas — size a SQLite deployment
    Workloads pick different knob combinations; each knob has a 'too low' and 'too high' failure mode you can name from earlier scenes.
    ~7 min