#06 Web Crawler
Politely traverse the web at scale. Don't crawl yourself in circles.
Build a production-grade web crawler that traverses the open web at hundreds of millions of pages per day, respects every host's politeness budget and `robots.txt`, never crawls itself in circles, and feeds a downstream search index without becoming the bug that pages someone else's on-call.
The hard problem is not throughput. The hard problems are politeness (one bad config rolls out, you DDoS Wikipedia, and your IP range is blocked within an hour), de-duplication (without a seen-set you crawl forever), traps (calendar widgets and faceted search will eat 30% of your fetch budget if you let them), and making the multi-store page commit atomic enough that a parser crash doesn't leave the system with a URL marked seen but never enqueued.
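To make that last point concrete, here is a minimal sketch of an outbox-style atomic page commit. SQLite stands in for the page store, and the table names, schema, and `publish` hook are illustrative assumptions rather than any particular crawler's design.

```python
# Sketch of the outbox pattern: the page record and its discovered outlinks are
# written in ONE local transaction; a separate relay publishes outlinks to the
# frontier queue. Schema and names are hypothetical.
import json
import sqlite3

db = sqlite3.connect("crawl.db")
db.executescript("""
CREATE TABLE IF NOT EXISTS pages  (url TEXT PRIMARY KEY, fetched_at INTEGER, warc_offset INTEGER);
CREATE TABLE IF NOT EXISTS outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT, published INTEGER DEFAULT 0);
""")

def commit_page(url: str, fetched_at: int, warc_offset: int, outlinks: list[str]) -> None:
    """Record the fetched page and its outlinks atomically.

    If the parser crashes before this commit, neither write lands, so the URL is
    never marked seen without its outlinks being queued for publication."""
    with db:  # sqlite3 connection as context manager = one transaction
        db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?)", (url, fetched_at, warc_offset))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (json.dumps({"from": url, "outlinks": outlinks}),))

def relay_outbox(publish) -> None:
    """Separate poller: push unpublished events to the frontier queue, then mark them.
    `publish` is a caller-supplied function (e.g., a queue-producer wrapper)."""
    rows = db.execute("SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id").fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))  # at-least-once: may re-publish after a crash,
        with db:                      # so the frontier consumer must be idempotent
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
```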
This canonical solution is sized for the **Mid tier**: 10 B URLs/month across 100 M hosts (~3.86 K sustained / 7.7 K peak fetch QPS, ~250 TB/month of compressed WARC ingest). The numbers are derived in the **Capacity** section, not assumed.
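A quick back-of-the-envelope check of those figures. The 2x peak factor and the ~25 KB average compressed WARC record size are assumptions inferred from the quoted numbers, not values stated in the source.

```python
# Sanity-check the Mid-tier capacity numbers quoted above.
urls_per_month = 10e9
seconds_per_month = 30 * 24 * 3600                    # ~2.59 M s

sustained_qps = urls_per_month / seconds_per_month    # ~3.86 K fetches/s
peak_qps = 2 * sustained_qps                          # ~7.7 K with an assumed 2x diurnal peak

avg_compressed_record_bytes = 25_000                  # assumed avg compressed WARC record
monthly_warc_tb = urls_per_month * avg_compressed_record_bytes / 1e12   # ~250 TB/month

print(f"{sustained_qps:,.0f} QPS sustained, {peak_qps:,.0f} QPS peak, {monthly_warc_tb:,.0f} TB/month")
```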
Reading: Mercator: A Scalable, Extensible Web Crawler — Heydon & Najork (Compaq SRC, 1999) · Detecting Near-Duplicates for Web Crawling — Manku, Jain, Das Sarma (Google, WWW 2007) · Inside Googlebot — Google Search Central Blog (Mar 2026) · Heritrix Politeness Parameters — Internet Archive Wiki · Common Crawl monthly statistics & WARC file format · Cloudflare AI Audit — blocking AI crawlers with one click · Cloudflare on Perplexity stealth crawlers (Aug 2025) · RFC 9309 — Robots Exclusion Protocol · Google open-source robots.txt parser (google/robotstxt) · Apache StormCrawler vs Nutch (DZone)
Mercator back-queue scheduling
per-host / per-IP / per-AS politeness
Bloom-filter + RocksDB seen-set
robots.txt RFC 9309 parsing
WARC archival + revisit dedup
SimHash near-duplicate detection
outbox pattern for atomic page commit
headless Chromium rendering pool
crawl-trap detection (depth + simhash)
verifiable crawler identity (UA + reverse-DNS + IP allowlist)
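A toy sketch of the Mercator-style back-queue scheduling and per-host politeness named at the top of the list above. Real crawlers key queues by host, IP, and AS and persist them; everything here is in-memory, and the fixed delay is an assumed value.

```python
# Mercator-style scheduling sketch: one FIFO back-queue per host, plus a heap of
# (earliest_fetch_time, host) so a host is never fetched before its politeness
# window elapses. Illustrative only; not a production frontier.
import heapq
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

POLITENESS_DELAY_S = 10.0  # assumed fixed per-host crawl delay

back_queues: dict[str, deque[str]] = defaultdict(deque)  # one FIFO per host
ready_heap: list[tuple[float, str]] = []                  # (earliest_fetch_time, host)

def enqueue(url: str) -> None:
    host = urlparse(url).netloc
    if not back_queues[host]:
        # Host had no pending URLs: make it schedulable immediately.
        heapq.heappush(ready_heap, (time.monotonic(), host))
    back_queues[host].append(url)

def next_url() -> str | None:
    """Return the next URL whose host's politeness window has elapsed, else None."""
    if not ready_heap or ready_heap[0][0] > time.monotonic():
        return None
    _, host = heapq.heappop(ready_heap)
    url = back_queues[host].popleft()
    if back_queues[host]:
        # Re-schedule the host only after its politeness delay has passed.
        heapq.heappush(ready_heap, (time.monotonic() + POLITENESS_DELAY_S, host))
    return url
```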