Build a production-grade web crawler that traverses the open web at hundreds of millions of pages per day, respects every host's politeness budget and robots.txt, never crawls itself in circles, and feeds a downstream search index without becoming the bug that pages someone else's on-call.
The hard problems are not throughput. The hard problems are: politeness (one bad config rollout DDoSes Wikipedia and gets your IP range blocked within an hour), de-duplication (without a seen-set you crawl forever), traps (calendar widgets and faceted search will eat 30% of your fetch budget if you let them), and making the multi-store page commit atomic enough that a parser crash doesn't leave the system with a URL marked seen but never enqueued.
This canonical design is sized for the Mid tier: 10 B URLs/month over 100 M hosts (~3.86 K sustained / 7.7 K peak fetch QPS, ~250 TB/month compressed WARC ingest). Numbers are derived in the Capacity section, not assumed.
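As a quick sanity check on the headline figures, the arithmetic can be sketched as follows. This is a back-of-envelope sketch, not the Capacity section's derivation: the 30-day month, the 2x peak-to-sustained factor (inferred from 7.7 K / 3.86 K), and decimal terabytes are assumptions, and the variable names are illustrative.

```python
# Back-of-envelope check of the Mid-tier headline numbers.
# Assumptions (not stated in the text): 30-day month, 2x peak factor,
# decimal terabytes (1 TB = 10**12 bytes).

SECONDS_PER_DAY = 24 * 3600
DAYS_PER_MONTH = 30                       # assumption
PEAK_FACTOR = 2                           # assumption, inferred from 7.7K / 3.86K

urls_per_month = 10_000_000_000           # 10 B URLs/month
hosts = 100_000_000                       # 100 M hosts
warc_bytes_per_month = 250 * 10**12       # ~250 TB/month compressed

sustained_qps = urls_per_month / (DAYS_PER_MONTH * SECONDS_PER_DAY)
peak_qps = PEAK_FACTOR * sustained_qps
pages_per_day = urls_per_month / DAYS_PER_MONTH
avg_compressed_bytes = warc_bytes_per_month / urls_per_month
urls_per_host = urls_per_month / hosts

print(f"sustained fetch QPS : {sustained_qps:,.0f}")
print(f"peak fetch QPS      : {peak_qps:,.0f}")
print(f"pages per day       : {pages_per_day:,.0f}")
print(f"avg compressed page : {avg_compressed_bytes / 1000:.0f} KB")
print(f"avg URLs per host   : {urls_per_host:.0f} per month")
```

Under these assumptions the sustained rate comes out near 3.86 K QPS and the implied average compressed page is about 25 KB. The per-host average of 100 URLs/month hides a heavily skewed distribution, which is why politeness budgets are enforced per host rather than globally.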