#06 Web Crawler
Politely traverse the web at scale. Don't crawl yourself in circles.
Build a production-grade web crawler that traverses the open web at hundreds of millions of pages per day, respects every host's politeness budget and `robots.txt`, never crawls itself in circles, and feeds a downstream search index without becoming the bug that pages someone else's on-call.
The hard problem is not throughput. The hard problems are politeness (one bad config rolls out, you DDoS Wikipedia, and your IP range is blocked within an hour), de-duplication (without a seen-set you crawl forever), traps (calendar widgets and faceted search will eat 30% of your fetch budget if you let them), and making the multi-store page commit atomic enough that a parser crash doesn't leave the system with a URL marked seen but never enqueued.
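To make that last point concrete, here is a minimal sketch of an outbox-style atomic page commit. SQLite stands in for the page store, and the table names, schema, and `publish` hook are illustrative assumptions rather than any particular crawler's design.

```python
# Sketch of the outbox pattern: the page record and its discovered outlinks are
# written in ONE local transaction; a separate relay publishes outlinks to the
# frontier queue. Schema and names are hypothetical.
import json
import sqlite3

db = sqlite3.connect("crawl.db")
db.executescript("""
CREATE TABLE IF NOT EXISTS pages  (url TEXT PRIMARY KEY, fetched_at INTEGER, warc_offset INTEGER);
CREATE TABLE IF NOT EXISTS outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT, published INTEGER DEFAULT 0);
""")

def commit_page(url: str, fetched_at: int, warc_offset: int, outlinks: list[str]) -> None:
    """Record the fetched page and its outlinks atomically.

    If the parser crashes before this commit, neither write lands, so the URL is
    never marked seen without its outlinks being queued for publication."""
    with db:  # sqlite3 connection as context manager = one transaction
        db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?)", (url, fetched_at, warc_offset))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (json.dumps({"from": url, "outlinks": outlinks}),))

def relay_outbox(publish) -> None:
    """Separate poller: push unpublished events to the frontier queue, then mark them.
    `publish` is a caller-supplied function (e.g., a queue-producer wrapper)."""
    rows = db.execute("SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id").fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))  # at-least-once: may re-publish after a crash,
        with db:                      # so the frontier consumer must be idempotent
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
```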
This canonical solution is sized for the **Mid tier**: 10 B URLs/month across 100 M hosts (~3.86 K sustained / 7.7 K peak fetch QPS, ~250 TB/month of compressed WARC ingest). The numbers are derived in the **Capacity** section, not assumed.
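A quick back-of-the-envelope check of those figures. The 2x peak factor and the ~25 KB average compressed WARC record size are assumptions inferred from the quoted numbers, not values stated in the source.

```python
# Sanity-check the Mid-tier capacity numbers quoted above.
urls_per_month = 10e9
seconds_per_month = 30 * 24 * 3600                    # ~2.59 M s

sustained_qps = urls_per_month / seconds_per_month    # ~3.86 K fetches/s
peak_qps = 2 * sustained_qps                          # ~7.7 K with an assumed 2x diurnal peak

avg_compressed_record_bytes = 25_000                  # assumed avg compressed WARC record
monthly_warc_tb = urls_per_month * avg_compressed_record_bytes / 1e12   # ~250 TB/month

print(f"{sustained_qps:,.0f} QPS sustained, {peak_qps:,.0f} QPS peak, {monthly_warc_tb:,.0f} TB/month")
```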
Reading: Mercator: A Scalable, Extensible Web Crawler — Heydon & Najork (Compaq SRC, 1999) · Detecting Near-Duplicates for Web Crawling — Manku, Jain, Das Sarma (Google, WWW 2007) · Inside Googlebot — Google Search Central Blog (Mar 2026) · Heritrix Politeness Parameters — Internet Archive Wiki · Common Crawl monthly statistics & WARC file format · Cloudflare AI Audit — blocking AI crawlers with one click · Cloudflare on Perplexity stealth crawlers (Aug 2025) · RFC 9309 — Robots Exclusion Protocol · Google open-source robots.txt parser (google/robotstxt) · Apache StormCrawler vs Nutch (DZone)
Mercator back-queue scheduling
per-host / per-IP / per-AS politeness
Bloom-filter + RocksDB seen-set
robots.txt RFC 9309 parsing
WARC archival + revisit dedup
SimHash near-duplicate detection
outbox pattern for atomic page commit
headless Chromium rendering pool
crawl-trap detection (depth + simhash)
verifiable crawler identity (UA + reverse-DNS + IP allowlist)
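A toy sketch of the Mercator-style back-queue scheduling and per-host politeness named at the top of the list above. Real crawlers key queues by host, IP, and AS and persist them; everything here is in-memory, and the fixed delay is an assumed value.

```python
# Mercator-style scheduling sketch: one FIFO back-queue per host, plus a heap of
# (earliest_fetch_time, host) so a host is never fetched before its politeness
# window elapses. Illustrative only; not a production frontier.
import heapq
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

POLITENESS_DELAY_S = 10.0  # assumed fixed per-host crawl delay

back_queues: dict[str, deque[str]] = defaultdict(deque)  # one FIFO per host
ready_heap: list[tuple[float, str]] = []                  # (earliest_fetch_time, host)

def enqueue(url: str) -> None:
    host = urlparse(url).netloc
    if not back_queues[host]:
        # Host had no pending URLs: make it schedulable immediately.
        heapq.heappush(ready_heap, (time.monotonic(), host))
    back_queues[host].append(url)

def next_url() -> str | None:
    """Return the next URL whose host's politeness window has elapsed, else None."""
    if not ready_heap or ready_heap[0][0] > time.monotonic():
        return None
    _, host = heapq.heappop(ready_heap)
    url = back_queues[host].popleft()
    if back_queues[host]:
        # Re-schedule the host only after its politeness delay has passed.
        heapq.heappush(ready_heap, (time.monotonic() + POLITENESS_DELAY_S, host))
    return url
```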