Scraping a few hundred pages is a script. Scraping millions is a system. Once your target count crosses from "runs on my laptop overnight" to "needs to finish this week without melting," the hard part stops being the parsing and becomes everything around it: how you queue work, fan it out across workers, avoid getting blocked, retry the failures, and store and check the result. This is the flagship walkthrough of large scale web scraping as an architecture, not a snippet.

It is scoped to public data collected at volume: product listings, prices, search results, public profiles, and the like. The shape of the work is the same whatever the source, so the focus is the pipeline and the tradeoffs at each stage, with the parts you should not build yourself called out as you go.

What large scale web scraping actually means

Large scale web scraping is the practice of extracting data from millions of pages, across one enormous site or thousands of smaller ones at once. The jump from regular scraping is not just a bigger number, and one worked figure makes it concrete. Imagine a category with 20,000 listing pages of 20 items each, so 400,000 item pages to fetch. At a realistic 2.5 seconds per page, a strictly sequential run is 1,000,000 seconds, or about 11.5 days of waiting on page loads. The numbers are illustrative, but they are the right order of magnitude. That figure is the whole reason this article exists: at scale, time is the constraint, and concurrency is how you buy it back. Drive 200 pages in parallel and, with no blocks or retries, those 11.5 days collapse to about 83 minutes of wall-clock time.
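
The arithmetic above can be checked in a few lines. This is a back-of-envelope sketch only: it assumes perfect parallelism with no blocks, retries, or rate limits, and the function name is illustrative rather than part of any library.

```python
def wall_clock_seconds(pages: int, seconds_per_page: float, concurrency: int = 1) -> float:
    """Ideal wall-clock time, assuming perfect parallelism and no failures."""
    return pages * seconds_per_page / concurrency


pages = 20_000 * 20  # 20,000 listing pages x 20 items each = 400,000 pages
sequential = wall_clock_seconds(pages, 2.5)
parallel = wall_clock_seconds(pages, 2.5, concurrency=200)

print(f"sequential: {sequential / 86_400:.2f} days")    # 11.57 days
print(f"200 in parallel: {parallel / 60:.0f} minutes")  # 83 minutes
```

Real runs land well above the ideal figure once retries and throttling are counted, but the ratio is what matters.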

The architecture at a glance

A scraper that survives millions of pages is a small distributed system with a handful of named parts. Each one solves a problem that only appears at volume.

  • A queue holds the URLs still to fetch and decouples discovery from work, so producers and consumers run at their own pace.
  • Async or distributed workers pull from the queue and do the fetching concurrently. This is where the wall-clock savings come from.
  • A proxy and anti-bot layer rotates IPs and presents traffic that targets read as a real browser, so a single address never trips a rate limit.
  • Rendering, only when needed, runs a headless browser for JavaScript-heavy pages and skips it for static ones, because rendering is the most expensive thing you can do.
  • Retries with backoff catch the transient failures that are guaranteed at this volume.
  • Deduplication stops you from fetching or storing the same URL twice.
  • Storage takes the parsed rows and puts them somewhere queryable.
  • Monitoring and data-quality checks tell you the run is healthy and the output is trustworthy.

The sections below take these in order. The thread running through all of them is a tradeoff between control and operational burden: build each layer yourself, or hand the hardest ones (proxies, anti-bot, rendering, retries) to a managed layer and spend your time on the data.

Queue first: decouple discovery from fetching

The single most important structural decision is to put a queue between "what to scrape" and "doing the scrape." A producer enumerates URLs (from a sitemap, a search-result crawl, or a database of IDs) and pushes them onto the queue; a pool of workers pulls from it. Neither side has to know how fast the other is going, and you can add workers without touching the producer.

In Python this is commonly Celery or RQ over Redis; in Node a Bull or BullMQ queue over Redis; at larger scale a real broker like RabbitMQ or Kafka. A minimal worker sketch makes the pattern concrete.

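python
import asyncio
import aiohttp

CONCURRENCY = 50  # number of in-flight requests; tune by watching error rates

def parse_and_store(url, html):
    """Placeholder: parse the fields you need and write them to storage."""

async def worker(queue, session):
    while True:
        url = await queue.get()
        try:
            async with session.get(url) as resp:
                resp.raise_for_status()  # treat 4xx/5xx as failures, not as pages
                html = await resp.text()
            parse_and_store(url, html)
        except Exception as err:
            # Keep the worker alive; a real pipeline would hand this to retry logic.
            print(f'failed {url}: {err}')
        finally:
            queue.task_done()

async def run(urls):
    queue = asyncio.Queue()  # created inside the running event loop
    for u in urls:
        queue.put_nowait(u)
    timeout = aiohttp.ClientTimeout(total=30)  # never let one request hang a worker
    async with aiohttp.ClientSession(timeout=timeout) as session:
        workers = [asyncio.create_task(worker(queue, session)) for _ in range(CONCURRENCY)]
        await queue.join()  # returns once every queued URL has been processed
        for w in workers:
            w.cancel()
        await asyncio.gather(*workers, return_exceptions=True)  # let cancellations finish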

That is the whole idea in one file: a bounded pool of concurrent workers draining a shared queue. The knob that matters is CONCURRENCY. Too low wastes the parallelism that makes scale possible; too high overwhelms both the target and your own egress. You find the right value by raising it gradually and watching error rates and latency: when they start to climb, you have gone too far, which is exactly why monitoring is a first-class part of the system.

Async vs. distributed

Async concurrency (one machine, many in-flight requests) and distributed workers (many machines) solve different ceilings. Async gets you off the one-request-at-a-time floor cheaply. Distributed workers get you past the limits of a single box: CPU for rendering, memory for parsing, and outbound bandwidth. Most large jobs use both: async within each worker, many workers across machines.

Proxy rotation and anti-bot: the part that breaks first

At low volume you barely notice anti-bot defenses; at scale they break the run first. Send a few hundred thousand requests from one IP and you get rate-limited, then challenged, then blocked. The fix is rotation: spread requests across many addresses so no single one looks abusive.

The kind of proxy matters. Datacenter IPs are cheap and fast but easy to fingerprint and block in bulk. Residential proxies route through real consumer connections and read as ordinary users, which is what hard commercial targets expect. For most large jobs the right default is a pool of rotating residential proxies, where each request or short session leaves from a fresh real-user IP. If you assemble this yourself, getting the rotation logic right (sticky sessions where a site needs them, fresh IPs where it does not) is most of the work; see how to use rotating proxies.

Rotation is necessary but not sufficient. Modern defenses also read TLS fingerprints, header order, and browser behavior. A managed layer like the Crawlbase Smart AI Proxy folds rotation and fingerprint handling into a single endpoint: you point your existing HTTP client at one proxy URL and it manages the pool, headers, and retries on blocks behind it. For the full defensive playbook, see how to scrape websites without getting blocked.
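
For readers assembling rotation themselves, the sticky-versus-fresh logic described above might look roughly like this. It is a minimal sketch under stated assumptions: the ProxyRotator class, the proxy URLs, and the session keys are all hypothetical, and it covers IP selection only, not fingerprints or headers.

```python
import hashlib
import itertools


class ProxyRotator:
    """Round-robin rotation, with optional sticky sessions keyed by a string."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("need at least one proxy")
        self._proxies = list(proxies)
        self._cycle = itertools.cycle(self._proxies)

    def next_proxy(self, session_key=None):
        # No session needed: hand out the next address in the pool.
        if session_key is None:
            return next(self._cycle)
        # Sticky session: hash the key so the same key always maps to the
        # same proxy (as long as the pool itself does not change).
        digest = hashlib.sha256(session_key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self._proxies)
        return self._proxies[index]


# Hypothetical proxy endpoints, for illustration only.
POOL = [
    "http://proxy-a.example:8000",
    "http://proxy-b.example:8000",
    "http://proxy-c.example:8000",
]
rotator = ProxyRotator(POOL)
```

With aiohttp, the chosen address can be passed per request, for example session.get(url, proxy=rotator.next_proxy()), while a login or cart flow would pass a session key so every step leaves from the same IP.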

Render only when you have to

Rendering a page in a headless browser is the most expensive operation in the pipeline: it costs CPU, memory, and seconds per page, and at a million pages those seconds dominate everything. So render only when the data genuinely requires it.

Many sites still ship their data in the initial HTML, or expose it through a JSON endpoint the page calls. For those, a plain HTTP fetch plus a parser is an order of magnitude cheaper than a browser. Reserve rendering for pages that build content client-side, where a raw fetch returns an empty shell. The discipline is simple: try the cheap path first, confirm the fields are present, and escalate to rendering only for the pages that need it. Mixing both in one run (static fetch for catalog pages, render for the handful of JS-heavy detail pages) is normal and is where the savings live.
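
The cheap-path-first discipline can be sketched as a small escalation function. Everything here is an assumption for illustration: the marker strings stand in for whatever proves your fields are present, and fetch_static and fetch_rendered are callables you would supply (a plain HTTP fetch and a headless-browser fetch respectively).

```python
# Hypothetical markers that indicate the data is in the HTML.
REQUIRED_MARKERS = ('class="price"', 'class="title"')


def has_required_fields(html, markers=REQUIRED_MARKERS):
    """Cheap presence check; a real check would run the actual selectors."""
    return all(marker in html for marker in markers)


def fetch_with_escalation(url, fetch_static, fetch_rendered, markers=REQUIRED_MARKERS):
    """Try the static fetch first and render only if the fields are missing.

    Returns (html, path) where path is "static" or "rendered", so the run
    can count how often the expensive path was actually needed.
    """
    html = fetch_static(url)
    if has_required_fields(html, markers):
        return html, "static"
    return fetch_rendered(url), "rendered"
```

Tracking the returned path per URL pattern lets you learn which page types always need rendering and skip the wasted static attempt for those.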

The managed scale layer: Crawling API and the async Crawler

Proxies, anti-bot, and rendering are the three layers hardest to build and keep healthy, and they are exactly what Crawlbase manages for you. The Crawling API is a single call that fetches a URL behind rotating residential IPs, handles the anti-bot challenge, optionally renders the page with a real browser, and returns finished HTML. You decide per request whether to render by adding a JavaScript token; static pages stay cheap and JS-heavy pages get a browser.

python
from crawlbase import CrawlingAPI

api = CrawlingAPI({ 'token': 'YOUR_CRAWLBASE_TOKEN' })

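# ajax_wait and page_wait control browser rendering, so this example
# assumes the token above is a JavaScript token.
options = {
    'ajax_wait': 'true',
    'page_wait': 3000,
    'country': 'US',
}

url = 'https://www.example.com/products?page=42'
resp = api.get(url, options)
if resp['status_code'] == 200:
    parse_and_store(url, resp['body'])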

Synchronous calls fit inside the worker pool above: each worker calls api.get and the API absorbs the proxy, anti-bot, and rendering concerns. One caveat: api.get is a blocking call, so in an asyncio worker run it in a thread (for example with asyncio.to_thread) or use a thread pool, otherwise one slow fetch stalls every coroutine on the event loop. For truly large jobs there is a better pattern. The asynchronous Crawler inverts the flow: instead of holding a connection open while each page is fetched, you push URLs to it and it crawls them on its own schedule, then POSTs each finished page back to a webhook endpoint you control. You add two parameters to the Crawling API call, &callback=true&crawler=YourCrawlerName, and the Crawler takes over the queueing, scheduling, and retries.

python
from crawlbase import CrawlingAPI

api = CrawlingAPI({ 'token': 'YOUR_CRAWLBASE_TOKEN' })

# Push as many URLs as you like; the Crawler queues and crawls them async,
# then POSTs each finished page to the webhook on your registered crawler.
for url in urls_to_crawl:
    api.get(url, {
        'callback': 'true',
        'crawler': 'my-products-crawler',
    })

The async model is the right call for millions of pages because it removes the part of the system you would otherwise babysit. You are not keeping connections open, running a render fleet, or managing a retry queue; you push and you receive. The Crawler even monitors your webhook: if your endpoint goes down it pauses, notifies you, retries the failed delivery, and resumes automatically when your server is back. That is queueing, scheduling, retries, and delivery reliability handled as a managed layer, which is most of what the rest of this article tells you to build.

Retries and backoff: failure is the steady state

At a million requests, a 1% transient failure rate is 10,000 failed pages. Failure is not an edge case at this volume; it is the steady state, and your pipeline has to treat a failed fetch as routine rather than fatal. The pattern is retry with exponential backoff and a cap: wait a little, then more, then more, and after a few attempts move the URL to a dead-letter queue instead of blocking the run.

The nuance is reading why a request failed, because not every failure deserves a retry. A timeout or a 503 is worth retrying; a hard 404 is not. With proxied traffic you also get proxy-specific status signals that tell you whether to back off, rotate, or escalate the IP tier; treating those as signal rather than noise keeps a long run healthy. See how to solve proxy status error codes for the full mapping. A managed layer retries blocks internally, but you still own retries for your own logic and storage.
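
The retry policy described above might be sketched as follows. The names are illustrative assumptions: FetchError is a hypothetical exception carrying an HTTP status, fetch is any callable you supply, and the set of retryable statuses is a common starting point rather than a definitive list. The delay uses exponential backoff with jitter so that many workers do not retry in lockstep.

```python
import random
import time

# Assumed starting point; adjust to the signals your targets and proxies emit.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


class FetchError(Exception):
    """Hypothetical error raised by the fetch callable for a bad status."""

    def __init__(self, status):
        super().__init__(f"status {status}")
        self.status = status


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Random delay between 0 and min(cap, base * 2**attempt) seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def fetch_with_retries(url, fetch, dead_letter, max_attempts=4, sleep=time.sleep):
    """Return the fetch result, or None after recording the URL in dead_letter."""
    reason = "unknown"
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except FetchError as err:
            reason = f"status {err.status}"
            if err.status not in RETRYABLE_STATUSES:
                break  # e.g. a hard 404: retrying will not help
        except TimeoutError:
            reason = "timeout"
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt))
    # Out of attempts (or not retryable): park it instead of blocking the run.
    dead_letter.append((url, reason))
    return None
```

Keeping the reason alongside each dead-lettered URL makes it easy to replay only the timeouts later while leaving the 404s alone.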

Deduplication: do not crawl the same page twice

Discovery at scale produces duplicates constantly: the same product reachable from three category paths, tracking parameters that make one page look like ten, pagination that loops. Without dedup you waste budget refetching pages and corrupt your dataset with repeated rows.

Two layers handle it. First, normalize URLs before they enter the queue: strip tracking parameters, lowercase the host, sort query keys, resolve relative links to a canonical form. Second, keep a seen-set (a Redis set, or a Bloom filter for very large runs) and skip any URL already in it. A Bloom filter trades a tiny false-positive rate for a massive memory saving, the right trade when seen-sets reach the hundreds of millions. Dedup the output too: key rows on a stable identifier so a page fetched twice does not become two records.
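
The two layers can be sketched with the standard library alone. This is a minimal illustration, not a complete canonicalizer: the list of tracking parameters is an assumption you would extend per target, URLs are assumed to carry no embedded credentials, and the in-memory set stands in for a Redis set or Bloom filter.

```python
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit

# Assumed list of tracking parameters; extend it for your targets.
TRACKING_PARAMS = {
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
    "gclid", "fbclid",
}


def normalize_url(url, base=None):
    """Resolve, lowercase scheme and host, drop tracking params, sort the rest."""
    if base:
        url = urljoin(base, url)  # resolve relative links against the page URL
    parts = urlsplit(url)
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    )
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        urlencode(query),
        "",  # drop the fragment; it never reaches the server
    ))


seen = set()  # swap for a Redis set or a Bloom filter on very large runs


def should_enqueue(url, base=None):
    """True the first time a normalized URL is seen, False afterwards."""
    key = normalize_url(url, base)
    if key in seen:
        return False
    seen.add(key)
    return True
```

Some sites treat query-parameter order or path case as significant, so it is worth verifying the normalization rules against a sample of real URLs before trusting them.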

Storage: match the store to the access pattern

Where the data lands depends on what you do with it next. Flat files (CSV, JSONL) or object storage suit append-heavy archival and cheap bulk processing. A relational database fits when you need to query, join, and update rows. A document store fits semi-structured records whose shape varies by source. The mistake is forcing everything into one of these because it was first to hand.

Two scale-specific habits matter. Write in batches, not one row per request, so storage is not your bottleneck; the worker should buffer and flush. And separate raw from parsed: keep the original HTML (or a reference to it) so you can re-parse without re-crawling when selectors change or you find a new field. Crawlbase can deliver pages straight to Cloud Storage or your webhook, removing the ingestion plumbing from your side entirely.
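
The buffer-and-flush habit can be sketched as a small writer that appends JSONL in batches. The BatchWriter name, the batch size, and the file path are illustrative assumptions; the same shape applies to bulk inserts into a database.

```python
import json


class BatchWriter:
    """Buffer rows in memory and append them to a JSONL file in batches."""

    def __init__(self, path, batch_size=500):
        self.path = path
        self.batch_size = batch_size
        self._buffer = []

    def add(self, row):
        self._buffer.append(row)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self._buffer:
            return
        # One open/write per batch instead of one per row.
        with open(self.path, "a", encoding="utf-8") as f:
            f.writelines(
                json.dumps(row, ensure_ascii=False) + "\n" for row in self._buffer
            )
        self._buffer.clear()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.flush()  # do not lose the final partial batch
        return False
```

Used as a context manager (with BatchWriter("rows.jsonl") as writer: writer.add(row)), the last partial batch is flushed on exit. The tradeoff is that rows still in the buffer are lost if the process is killed, so the batch size bounds how much work a crash can cost.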

Monitoring and data quality

A large run is opaque without instrumentation. You want live counters for pages fetched, success rate, error rate by type, queue depth, and throughput, so you can see a block storm or a stalled queue while it happens rather than in tomorrow's empty dataset. Queue depth climbing while throughput falls means workers are stuck; a spike in challenges means it is time to back off or rotate harder.
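
A minimal in-process version of those counters might look like this. It is a sketch under assumptions: the RunStats class, the outcome labels, and the looks_stalled heuristic are illustrative, and a production run would export the same numbers to a metrics system rather than hold them in memory.

```python
import time
from collections import Counter


class RunStats:
    """Live counters for one scraping run."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.started = clock()
        self.outcomes = Counter()

    def record(self, outcome):
        """outcome is "ok" or an error label such as "timeout" or "challenge"."""
        self.outcomes[outcome] += 1

    @property
    def total(self):
        return sum(self.outcomes.values())

    def success_rate(self):
        return self.outcomes["ok"] / self.total if self.total else 0.0

    def throughput(self):
        elapsed = self._clock() - self.started
        return self.total / elapsed if elapsed > 0 else 0.0

    def snapshot(self, queue_depth):
        return {
            "fetched": self.total,
            "success_rate": self.success_rate(),
            "errors_by_type": {k: v for k, v in self.outcomes.items() if k != "ok"},
            "queue_depth": queue_depth,
            "pages_per_second": self.throughput(),
        }


def looks_stalled(previous, current):
    """Queue depth climbing while throughput falls suggests stuck workers."""
    return (
        current["queue_depth"] > previous["queue_depth"]
        and current["pages_per_second"] < previous["pages_per_second"]
    )
```

Taking a snapshot every few seconds and comparing it with the previous one is enough to raise an alert while the run is still going.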

Data quality is the half of monitoring teams skip, and the half that determines whether the data is usable. A run can report 100% HTTP success and still produce garbage if the layout changed and your selectors now match nothing. Add cheap, continuous checks: assert that required fields are non-empty, that prices parse as numbers in a sane range, that row count per page is roughly what you expect. When a check fails across many pages at once, the markup drifted and your parser needs attention. Better to catch that on page 5,000 than after you have stored five million empty rows.
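
The checks above can be as simple as a few assertions per row plus a failure-rate threshold per batch. In this sketch the field names (title, price), the price range, and the 20% threshold are assumptions to adapt to your own schema.

```python
def check_row(row, required=("title", "price"), price_range=(0.01, 100_000.0)):
    """Return a list of problems for one parsed row; empty means it passed."""
    problems = []
    for field in required:
        value = row.get(field)
        if value is None or not str(value).strip():
            problems.append(f"missing {field}")
    raw = row.get("price")
    price_text = "" if raw is None else str(raw).replace(",", "").strip()
    if price_text:
        try:
            price = float(price_text)
        except ValueError:
            problems.append("price not numeric")
        else:
            low, high = price_range
            if not low <= price <= high:
                problems.append("price out of range")
    return problems


def failure_rate(rows):
    """Share of rows with at least one problem; an empty batch counts as failed."""
    rows = list(rows)
    if not rows:
        return 1.0
    return sum(1 for row in rows if check_row(row)) / len(rows)


def markup_probably_drifted(rows, threshold=0.2):
    """Many rows failing at once points at the parser, not at the data."""
    return failure_rate(rows) > threshold
```

Running markup_probably_drifted on each flushed batch turns a silent layout change into an alert within a few hundred pages.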

Where to draw the build-vs-buy line

Everything above is buildable, so the honest question is which parts are worth your engineering time. The data model, parsing logic, quality checks, and storage schema are specific to your project, and only you can build them well. The proxy pool, anti-bot handling, headless render fleet, and async retry-and-delivery queue are generic infrastructure that is expensive to build and a grind to keep healthy as targets evolve. That is the line a managed scale layer sits on: use the Crawling API or async Crawler for fetching, rendering, and anti-bot; the Smart AI Proxy to keep your own client and swap in a managed rotating endpoint; or the Crawling API's data scrapers for parsed JSON from supported sites so you skip selectors entirely. Spend your time on the data; rent the parts that are the same for everyone.

Recap

Key takeaways

  • Scale is concurrency, not a bigger loop. A sequential million-page run takes weeks (about 29 days at 2.5 seconds per page); a queue feeding async or distributed workers collapses that to hours.
  • Proxies and anti-bot break first. Rotate through residential IPs and present real-browser traffic, or let a managed layer handle rotation and fingerprints for you.
  • Render only when you must. Headless rendering is the most expensive step; try a static fetch first and escalate only for client-side pages.
  • Failure is the steady state. Retry transient errors with backoff and a dead-letter queue, deduplicate URLs and rows, and read proxy status codes as signal.
  • Async beats synchronous at the top end. Push URLs to the Crawler and receive results on a webhook, so queueing, scheduling, retries, and delivery are handled for you.
  • Monitor success and data quality. 100% HTTP success with empty fields is still a failed run; assert on the data, not just the status code.

Frequently Asked Questions (FAQs)

What counts as large scale web scraping?

Roughly, any job large enough that a single sequential script is no longer viable, which in practice means hundreds of thousands to millions of pages, across one big site or many smaller ones. The defining trait is not the count but that you now need concurrency, proxy rotation, retries, and monitoring to finish in reasonable time without getting blocked. Below that threshold a simple loop is fine; above it you are running a small distributed system.

How do I scrape millions of pages without getting blocked?

Spread requests across many IPs so no single address looks abusive, prefer rotating residential proxies for hard targets, present traffic that reads as a real browser, pace your requests, and back off when challenges appear. Building all that yourself is significant work, so most teams route through a managed layer like the Crawling API or Smart AI Proxy that handles rotation, fingerprints, and challenge-solving behind one endpoint.

Should I use synchronous or asynchronous scraping at scale?

Asynchronous, for anything truly large. Synchronous fetching holds a connection open per page and ties up a worker until each request finishes. The async Crawler model lets you push URLs and receive finished pages on a webhook callback, so queueing, scheduling, and retries happen server-side and your application is not blocked waiting. You push and you receive, which is far easier to scale and operate.

Do I always need a headless browser for large scale scraping?

No, and avoid it where you can. Rendering is the most expensive step per page, so reserve it for sites that build content client-side and return an empty shell to a plain fetch. Many sites ship usable data in the initial HTML or expose a JSON endpoint, both far cheaper to fetch. Mixing a cheap static path with rendering only for the pages that need it is the cost-effective default.

How do I handle failures and duplicates across millions of requests?

Treat both as routine. Retry transient failures (timeouts, 503s) with exponential backoff and a cap, then send the stubborn ones to a dead-letter queue instead of blocking the run; do not retry hard 404s. For duplicates, normalize URLs before queueing, keep a seen-set or Bloom filter to skip URLs already fetched, and key stored rows on a stable identifier so a page fetched twice does not become two records.

Where should I store data from a large scrape?

Match the store to how you will use the data: object storage or JSONL for cheap archival, a relational database when you need to query and join, a document store for variable-shape records. Write in batches rather than one row per request so storage is not the bottleneck, and keep the raw HTML alongside the parsed output so you can re-parse without re-crawling when selectors change or you add a field.
