Best Practices for Scaling Web Scraping

A scraper that pulls a hundred pages is a script. A scraper that pulls a few million is a system, and the two fail in completely different ways. Code that ran clean on a small job starts timing out, getting blocked, or silently dropping rows the moment you point it at real volume. The fix is rarely a faster machine; it is a set of habits around concurrency, rotation, retries, and observability that keep throughput high without getting you banned.

This is a field guide to the best practices for scaling web scraping projects: how to control request rate, rotate proxies sanely, survive anti-bot defenses, retry without amplifying failures, and actually see what your pipeline is doing. Where a managed layer earns its place, this guide points at Crawlbase, which folds proxy rotation, rendering, and retries into a single call so you do not have to build and babysit that infrastructure yourself.

Why scaling breaks naive scrapers

On a small scale, scraping is a loop: request a page, parse it, move on. That model has no slack in it. Every request blocks on network latency, one transient error halts the run, and the target site sees a steady, robotic cadence from a single IP. None of that matters at a hundred pages. All of it matters at a hundred thousand.

Scaling is not "do the same thing, more." It is doing it smarter, which means designing around the failure modes that only show up under load:

Rate limits and blocks. A single IP firing fast gets throttled, challenged, or banned.
Concurrency contention. Too few workers and you crawl for days; too many and you overload the target or your own machine.
Transient failures. Timeouts, 5xx responses, and dropped connections are constant at scale, so a run with no retry logic never finishes.
Memory and storage pressure. Holding everything in RAM before a single write does not survive millions of rows.
Observability gaps. When you cannot see success rates per domain, a slow degradation looks identical to "still running."

The practices below address each of these. They are roughly ordered from the request layer outward, but they compound: rotation without rate control still gets you blocked, and retries without observability just hide the rot.

Control concurrency and request rate

The first lever is how many requests you run at once and how fast you fire them. These are two different knobs and people conflate them. Concurrency is how many requests are in flight simultaneously; rate is how many you start per second. You want high concurrency to hide network latency, but a rate ceiling so you do not hammer a single host into blocking you.

Use an async model rather than a thread-per-request loop. Asynchronous I/O lets one process keep hundreds of requests in flight while it waits on the network, which is where a synchronous scraper wastes almost all of its time. Cap the in-flight count with a semaphore, and pace new requests so a single domain never sees a flood.

python

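import asyncio, aiohttp, random

MAX_CONCURRENCY = 20      # in-flight requests at once
PER_REQUEST_DELAY = 0.25  # upper bound on per-request jitter, in seconds

async def fetch(session, url, sem):
    # The semaphore caps how many requests are in flight at once.
    async with sem:
        # Random pause so requests do not start on a perfectly even cadence.
        await asyncio.sleep(random.random() * PER_REQUEST_DELAY)
        async with session.get(url) as resp:
            return url, resp.status, await resp.text()

async def crawl(urls):
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    # A tighter total timeout than aiohttp's default frees a slot sooner when a connection hangs.
    timeout = aiohttp.ClientTimeout(total=30)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        tasks = [fetch(session, u, sem) for u in urls]
        # return_exceptions=True: one failed URL does not cancel the rest.
        return await asyncio.gather(*tasks, return_exceptions=True)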

Tune MAX_CONCURRENCY per target, not globally. A robust public API can take hundreds of concurrent requests; a fragile small site will fall over at ten. The jitter on each request matters more than people expect: a perfectly even cadence is itself a bot fingerprint, so a little randomness makes your traffic look more human and spreads load off any single second.
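The semaphore above bounds concurrency but not rate, so a per-host ceiling needs its own mechanism. Here is a minimal sketch of one way to do it, assuming a fixed minimum interval between request starts per host. The `HostRateLimiter` name and its `min_interval` parameter are illustrative, not part of any library.

```python
import asyncio
import time
from urllib.parse import urlsplit

class HostRateLimiter:
    """Allow at most one request start per `min_interval` seconds per host."""

    def __init__(self, min_interval=0.5):
        self.min_interval = min_interval  # assumed value; tune per target
        self._next_slot = {}  # host -> earliest monotonic time for the next start

    async def wait(self, url):
        host = urlsplit(url).netloc
        now = time.monotonic()
        slot = max(now, self._next_slot.get(host, now))
        # Reserve the following slot before sleeping so concurrent callers queue up.
        self._next_slot[host] = slot + self.min_interval
        await asyncio.sleep(slot - now)
        return slot - now  # seconds this caller was held back
```

Calling `await limiter.wait(url)` just before `session.get(url)` keeps overall concurrency high while each individual host sees a paced stream. Different hosts never wait on each other.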

Rotate proxies with a residential and datacenter mix

Concurrency gets you speed; proxy rotation gets you past the rate limits that speed would otherwise trip. Sending every request from one IP is the single fastest way to get blocked at scale. Spreading requests across a pool of addresses means no single IP shows a suspicious pattern.

The two proxy types trade off cost against trust. Datacenter proxies are cheap and fast but easy to detect, since their IP ranges are known to belong to hosting providers. Residential proxies route through real consumer connections, so targets read them as ordinary visitors, but they cost more. The pragmatic move is a mix: lean on datacenter IPs for soft targets and reserve residential for the sites that fight back.

Whichever you use, rotation has its own rules. Rotate often enough that no IP accumulates a blockable history, but keep a session pinned to one IP when a site ties a flow to an address (a logged-in path or a multi-step form). Monitor proxy health and evict addresses that start returning errors. For the mechanics, the guides on how to use rotating proxies and on rotating residential proxies cover the setup in depth.
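Those three rules (rotate, pin when a flow needs it, evict on errors) fit in a small class. This is a sketch under simple assumptions: random selection, a consecutive-failure threshold, and proxy URLs supplied by you. `ProxyPool` and its method names are hypothetical.

```python
import random

class ProxyPool:
    """Rotating pool with sticky sessions and eviction of failing proxies."""

    def __init__(self, proxies, max_failures=3):
        self.failures = {p: 0 for p in proxies}  # healthy proxy -> consecutive failures
        self.max_failures = max_failures
        self.sticky = {}  # session key -> pinned proxy

    def pick(self, session_key=None):
        if not self.failures:
            raise RuntimeError("no healthy proxies left")
        # A pinned session keeps its proxy for as long as that proxy is healthy.
        if session_key is not None and self.sticky.get(session_key) in self.failures:
            return self.sticky[session_key]
        proxy = random.choice(list(self.failures))
        if session_key is not None:
            self.sticky[session_key] = proxy
        return proxy

    def report(self, proxy, ok):
        if proxy not in self.failures:
            return  # already evicted
        self.failures[proxy] = 0 if ok else self.failures[proxy] + 1
        if self.failures[proxy] >= self.max_failures:
            del self.failures[proxy]  # evict; pinned sessions re-pin on next pick
```

Call `pick()` with no key for ordinary rotation and `pick("checkout-123")` for a flow that must stay on one address, then feed every outcome back through `report()` so bad addresses leave the pool.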

Building and maintaining a healthy pool is real work, so this is a natural place to offload. Crawlbase Smart AI Proxy exposes a single endpoint that rotates through a large residential and datacenter pool behind the scenes, retries failed IPs, and handles geotargeting, so you point your existing HTTP client at one proxy URL instead of managing addresses yourself.

Rotation is not a cure-all

Rotating IPs defeats per-IP rate limits, but it does nothing for browser fingerprinting, TLS signatures, or JavaScript challenges. A site that profiles the request itself will still flag you even from a fresh residential IP. Treat rotation as one layer, paired with realistic headers, paced requests, and (where needed) real rendering.

Build for anti-bot resilience

Modern targets do far more than count requests per IP. They inspect headers, TLS handshakes, and browser fingerprints, and they serve CAPTCHAs or JavaScript challenges to traffic that looks automated. Scaling past those defenses means looking like a real browser, not just coming from a real IP.

The basics first: send a complete, consistent set of headers (a real User-Agent, Accept-Language, the works), keep cookies across a session, and never send a header combination no real browser would. Beyond that, the heavy challenges (CAPTCHAs, behavioral fingerprinting, Cloudflare-style interstitials) are an arms race you usually do not want to fight by hand at scale.
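As a small illustration of the basics, the sketch below builds a standard-library session that sends one fixed header set and keeps cookies between requests. The header values are illustrative assumptions, so copy the exact set from the browser you are imitating. This covers headers and cookies only and does nothing about TLS or browser fingerprints.

```python
import http.cookiejar
import urllib.request

# Illustrative values, not a guaranteed-current browser profile.
BROWSER_HEADERS = [
    ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"),
    ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
    ("Accept-Language", "en-US,en;q=0.9"),
]

def make_session():
    """Return an opener that sends one fixed header set and keeps cookies."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Replaces the default "Python-urllib/x.y" User-Agent on every request.
    opener.addheaders = list(BROWSER_HEADERS)
    return opener, jar
```

Reusing the same opener for a whole session means every request carries the same identity and any cookies the site set earlier.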

This is where a managed scraping layer pays off. The Crawlbase Crawling API handles the anti-bot stack for you: it rotates IPs, presents realistic browser fingerprints, solves the challenges that can be solved, and retries the ones that cannot, then returns clean HTML. For the broader playbook, the guide on how to scrape websites without getting blocked goes through the tactics in detail.

Render headless only when you have to

Headless browsers (Puppeteer, Playwright, Selenium) render JavaScript-heavy pages that a plain HTTP fetch cannot, but they are expensive: each instance is a full browser eating CPU and memory, which caps how many you can run in parallel and slows every request. At scale that cost is brutal, so the rule is simple: do not render unless you must.

Before reaching for a headless fleet, check whether the data is already available without rendering. Open the network tab and look for an internal JSON API the page calls; hitting that endpoint directly is faster and far more stable than parsing rendered HTML. Many "JavaScript sites" are really thin front-ends over an API you can query directly.

When you genuinely need rendering, do it selectively rather than running browsers for the whole crawl. The Crawling API lets you request rendering per call with a JavaScript token, so you pay the browser cost only on the pages that need it and take the cheap static path everywhere else. That keeps the expensive layer scoped to the minority of pages that actually require it.
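One way to keep rendering selective in your own code is to try the cheap path first and fall back only when the data is missing. The sketch below assumes you supply three callables: `fetch_static`, `fetch_rendered`, and a `has_data` check. All three, and the `RenderRouter` class itself, are hypothetical stand-ins for whatever fetchers you actually use.

```python
from urllib.parse import urlsplit

class RenderRouter:
    """Use the static fetch by default; remember hosts that need rendering."""

    def __init__(self, fetch_static, fetch_rendered, has_data):
        self.fetch_static = fetch_static      # cheap plain-HTTP fetch
        self.fetch_rendered = fetch_rendered  # expensive headless or rendering fetch
        self.has_data = has_data              # returns True if the HTML holds the fields
        self.needs_render = set()             # hosts where the static path came back empty

    def fetch(self, url):
        host = urlsplit(url).netloc
        if host not in self.needs_render:
            html = self.fetch_static(url)
            if self.has_data(html):
                return html, "static"
            # Static HTML lacked the data: skip the wasted fetch for this host next time.
            self.needs_render.add(host)
        return self.fetch_rendered(url), "rendered"
```

Remembering the decision per host is a simplification. A site that mixes static and client-rendered pages would need a finer key, such as a URL path prefix.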

Retry with exponential backoff and a budget

At scale, transient failures are not edge cases; they are the steady state. Timeouts, 429s, 503s, and dropped connections happen constantly, so a scraper without retry logic never finishes a large run. But naive retries are worse than none: hammering a struggling host the instant it errors just deepens the problem and looks exactly like an attack.

The right pattern is exponential backoff with jitter and a cap on total attempts. Wait longer after each failure, add randomness so a wave of failures does not retry in lockstep, and give up after a bounded number of tries so one dead URL cannot block the pipeline forever. Retry only what is worth retrying: a 503 or a timeout, yes; a 404 or a 403, no, since repeating the identical request will not change the answer.

python

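import time, random

RETRYABLE = {429, 500, 502, 503, 504}

def fetch_with_backoff(get, url, max_attempts=5, base=1.0, cap=30.0):
    for attempt in range(max_attempts):
        try:
            resp = get(url)
        except OSError:
            # Timeouts and dropped connections are transient, so retry them too.
            # (requests' exceptions derive from OSError.)
            resp = None
        if resp is not None:
            if resp.status_code < 400:
                return resp
            if resp.status_code not in RETRYABLE:
                raise RuntimeError(f"non-retryable {resp.status_code}")
        if attempt < max_attempts - 1:  # no point sleeping after the final attempt
            # Exponential delay, capped, plus up to one second of jitter.
            time.sleep(min(cap, base * 2 ** attempt) + random.random())
    raise RuntimeError(f"gave up on {url}")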

Pair retries with status-code literacy. A run that starts returning challenges or proxy errors is telling you the current rate or IP tier is no longer enough; back off and rotate rather than retrying blindly. Reading proxy status error codes as signal lets you adapt instead of just hammering.
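To make that concrete, here is a sketch that maps a status code to a coarse action instead of a blind retry. The action names, the 5-second default, and the policy itself are assumptions for illustration. It also assumes `headers` is a plain dict keyed exactly as `Retry-After`.

```python
def next_action(status, headers=None):
    """Map a response status to (action, suggested_delay_seconds)."""
    headers = headers or {}
    if status < 400:
        return ("ok", 0.0)
    if status == 429:
        # Retry-After may be seconds or an HTTP date; only the seconds form is handled here.
        raw = headers.get("Retry-After", "")
        delay = float(raw) if raw.isdigit() else 5.0
        return ("slow_down", delay)
    if status == 403:
        # Likely a block on this IP or fingerprint: change the route, do not repeat the request.
        return ("rotate_proxy", 0.0)
    if status in (500, 502, 503, 504):
        return ("retry_with_backoff", 0.0)
    return ("drop", 0.0)  # 404 and other client errors will not improve on retry
```

A worker can branch on the returned action: lower the dispatch rate on `slow_down`, pick a new proxy on `rotate_proxy`, and hand only `retry_with_backoff` cases to the backoff loop.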

Queue work and process it async

A single loop that fetches, parses, and writes in sequence cannot scale, because every stage blocks the next and one slow step stalls the whole thing. The architecture that does scale decouples those stages with a queue: producers push URLs onto it, a pool of workers pulls and processes them, and the queue absorbs bursts and spreads load.

This buys you several things at once. Workers scale horizontally, since you add machines that all pull from the same queue. Failed jobs go back on the queue for a later retry without blocking anything else. And the queue is your natural rate-control point, where you throttle how fast jobs are dispatched per domain. Redis, RabbitMQ, or a cloud queue all work; the pattern matters more than the tool.
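The pattern can be sketched in a single process with `asyncio.Queue` before you move to Redis or RabbitMQ. In this minimal version, `handle` is a hypothetical coroutine that fetches and parses one URL, and failed jobs go back on the queue until they run out of attempts.

```python
import asyncio

async def worker(queue, handle, results, max_attempts=3):
    while True:
        url, attempt = await queue.get()
        try:
            results[url] = await handle(url)
        except Exception:
            if attempt + 1 < max_attempts:
                queue.put_nowait((url, attempt + 1))  # requeue for a later retry
            else:
                results[url] = None  # out of attempts; record the failure
        finally:
            queue.task_done()

async def run_pipeline(urls, handle, n_workers=4):
    queue, results = asyncio.Queue(), {}
    for u in urls:
        queue.put_nowait((u, 0))
    workers = [asyncio.create_task(worker(queue, handle, results))
               for _ in range(n_workers)]
    await queue.join()  # returns once every queued item has been marked done
    for w in workers:
        w.cancel()      # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

Because a retried job is re-queued before the failed one is marked done, `queue.join()` cannot return while retries are still pending. Swapping the in-memory queue for an external broker keeps the same producer and worker shape across machines.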

Crawlbase offers this as a managed service. The async Crawler is a push-based queue: you submit URLs through the Crawling API, each gets a request ID for tracking, the system crawls them concurrently and retries failures automatically, then POSTs finished results to a webhook on your server. You get the queue, concurrency, and retry machinery without standing up the infrastructure, which is exactly the layer most teams burn weeks building.

Cache aggressively to avoid redundant work

The cheapest request is the one you never send. At scale, a surprising fraction of a crawl is redundant: pages you already fetched, content that has not changed, lookups you repeat across runs. Caching cuts request volume, which cuts cost, load on the target, and your block risk all at once.

Cache at more than one level. Skip URLs you have already crawled within a freshness window instead of re-fetching them. Respect HTTP cache headers (ETag and Last-Modified) so a conditional request returns a cheap 304 when nothing changed. And memoize expensive derived work, like parsed or normalized records, so a re-run does not redo it. A crawl that re-fetches unchanged pages every cycle is wasting most of its budget on data it already has.
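The first two levels, a freshness window and conditional requests, can share one small record per URL. The sketch below is an in-memory illustration. `FetchCache` and its one-hour default are assumptions, and a real crawl would persist the entries between runs.

```python
import time

class FetchCache:
    """Remember when each URL was fetched and which validators it returned."""

    def __init__(self, freshness_seconds=3600):
        self.freshness = freshness_seconds
        self.entries = {}  # url -> {"at": float, "etag": str or None, "last_modified": str or None}

    def is_fresh(self, url, now=None):
        """True if the URL was fetched inside the freshness window (skip it)."""
        now = time.time() if now is None else now
        entry = self.entries.get(url)
        return entry is not None and now - entry["at"] < self.freshness

    def conditional_headers(self, url):
        """Headers that let the server answer 304 Not Modified if nothing changed."""
        entry = self.entries.get(url, {})
        headers = {}
        if entry.get("etag"):
            headers["If-None-Match"] = entry["etag"]
        if entry.get("last_modified"):
            headers["If-Modified-Since"] = entry["last_modified"]
        return headers

    def record(self, url, etag=None, last_modified=None, now=None):
        self.entries[url] = {"at": time.time() if now is None else now,
                             "etag": etag, "last_modified": last_modified}
```

The flow is: skip the URL if `is_fresh` is true, otherwise send the request with `conditional_headers`, treat a 304 as "keep what you have", and call `record` with the `ETag` and `Last-Modified` values from any full response.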

Monitor everything and validate the data

At scale you cannot eyeball a run, so you have to instrument it. The metrics that matter are success and failure rates per domain, request latency, block and CAPTCHA rates, queue depth, and throughput over time. The point is to catch a slow degradation early: a creeping rise in 403s means a target is starting to block you, and you want to know that within minutes, not after a run finishes with half the rows missing.
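A minimal sketch of per-domain counters shows the idea. It assumes 403 and 429 are the block signals worth alerting on, and the 20% threshold is an arbitrary illustration. In production these counters would usually feed a metrics system rather than live in process memory.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit

class DomainStats:
    """Count outcomes per domain and flag domains whose block rate is climbing."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def record(self, url, status):
        domain = urlsplit(url).netloc
        self.counts[domain]["total"] += 1
        if status in (403, 429):
            self.counts[domain]["blocked"] += 1
        elif status >= 400:
            self.counts[domain]["failed"] += 1
        else:
            self.counts[domain]["ok"] += 1

    def block_rate(self, domain):
        c = self.counts[domain]
        return c["blocked"] / c["total"] if c["total"] else 0.0

    def alerts(self, threshold=0.2, min_requests=10):
        """Domains with enough traffic to judge and a block rate above the threshold."""
        return [d for d, c in self.counts.items()
                if c["total"] >= min_requests and self.block_rate(d) > threshold]
```

Checking `alerts()` every few hundred requests is enough to catch a target that has started blocking you while the run is still in progress.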

Validation is the other half of "did it actually work." A request that returns 200 with an empty body or a CAPTCHA page is a silent failure, and at scale those poison your dataset quietly. So validate as you go: check that required fields are present and well-typed, sanity-check value ranges, deduplicate, and stream clean rows straight to storage rather than holding everything in memory until the end. Catching bad data while you crawl is far cheaper than discovering it in a downstream report.
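Those checks can be sketched in a few functions. The schema (`url`, `title`, `price`), the price range, and the "captcha" substring test are all hypothetical and deliberately crude. Treat this as a shape to adapt, not a complete validator.

```python
import json

REQUIRED = {"url": str, "title": str, "price": float}  # hypothetical schema

def looks_like_silent_failure(status, body):
    """A 200 with an empty body or a challenge page is not a success."""
    text = body.strip().lower()
    return status == 200 and (not text or "captcha" in text)

def validate(record):
    """Return a list of problems; an empty list means the record is clean."""
    problems = [f"missing or mistyped {key}" for key, typ in REQUIRED.items()
                if not isinstance(record.get(key), typ)]
    price = record.get("price")
    if isinstance(price, float) and not 0 < price < 1_000_000:
        problems.append("price out of range")
    return problems

def write_clean(records, out, seen=None):
    """Stream valid, unseen records to `out` as JSON Lines; return rows written."""
    seen = set() if seen is None else seen
    written = 0
    for rec in records:
        if validate(rec) or rec["url"] in seen:
            continue  # drop invalid rows and duplicates by URL
        seen.add(rec["url"])
        out.write(json.dumps(rec) + "\n")
        written += 1
    return written
```

Passing a generator as `records` and an open file as `out` means each row is written as soon as it is parsed, so memory use stays flat no matter how long the crawl runs.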

If you run on Crawlbase, a lot of this comes for free. The dashboard surfaces success and failure counts, a live monitor shows real-time activity and queue size, and a retry monitor breaks down what is being retried, so the observability layer is built in rather than something you assemble from scratch. For structured output, the Crawling API returns parsed JSON for supported sites, which removes a class of brittle selector code and the validation headaches that come with it.

Respect robots.txt and terms of service

Scaling responsibly is not only an ethics question; it is an operational one. Aggressive scraping that ignores a site's stated limits gets you blocked faster and can expose you to legal risk, so restraint is part of staying online.

Stick to a few ground rules. Scrape only public data, the content anyone can see without an account, and never anything behind a login or anything that identifies a person. Read the target's robots.txt and its stated rate expectations, and keep your volume low enough that you are not straining anyone's servers. If you plan to reuse data commercially, get permission or an official data agreement rather than assuming silence is consent. A scraper that is a good citizen is also a scraper that stays unblocked longer.
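Python's standard library can do the robots.txt check for you. The sketch below assumes you have already downloaded the file's text and uses a placeholder agent name, `example-bot`. The one-second fallback delay is an assumption, not a standard.

```python
from urllib.robotparser import RobotFileParser

def load_rules(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def allowed(parser, url, agent="example-bot"):
    """True if the rules permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)

def delay_for(parser, agent="example-bot", default=1.0):
    """Seconds to wait between requests, falling back to `default`."""
    # crawl_delay returns None when the file sets no Crawl-delay for this agent.
    delay = parser.crawl_delay(agent)
    return float(delay) if delay is not None else default
```

Checking `allowed()` before a URL enters the queue, and using `delay_for()` as a floor on the per-host interval, turns the site's stated limits into defaults the crawler enforces on itself.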

Recap

Key takeaways

Separate concurrency from rate. Run many requests in flight with async I/O, but cap how fast you hit any single host and add jitter so your traffic is not robotic.
Rotate a residential and datacenter mix. Spread requests across a healthy pool, lean residential on hard targets, and remember rotation does not defeat fingerprinting on its own.
Render headless only when you must. Check for an internal JSON API first; reserve the expensive browser path for pages that truly need it.
Retry with capped exponential backoff. Back off with jitter, bound total attempts, and only retry transient codes, never a 404 or 403.
Queue, cache, and observe. Decouple stages with a queue, cache to skip redundant fetches, and instrument success rates and data validation so silent failures surface fast.
Let a managed layer carry the undifferentiated work. Crawlbase folds rotation, rendering, retries, queueing, and monitoring into a single API so you scale logic, not infrastructure.

Frequently Asked Questions (FAQs)

What are the best practices for scaling web scraping projects?

The core practices are: run requests concurrently with async I/O while capping the per-host rate, rotate across a healthy proxy pool that mixes residential and datacenter IPs, retry transient failures with capped exponential backoff, render with a headless browser only when a page truly needs it, decouple stages with a queue, cache to skip redundant fetches, and instrument success rates plus data validation so silent failures surface fast. A managed layer like the Crawlbase Crawling API gives you several of these out of the box.

How many concurrent requests should I run when scaling a scraper?

There is no single number, because it depends on the target. Tune concurrency per site: a robust public API can absorb hundreds of simultaneous requests, while a small or fragile site will fall over at ten. Start conservative, watch the error and block rates, and raise the in-flight cap only while success stays high. Pace new requests with a little jitter so even high concurrency does not read as robotic to the target.

Residential or datacenter proxies for large-scale scraping?

Use both, matched to the target. Datacenter proxies are cheap and fast but easy to detect, so they suit soft targets that do not fight back hard. Residential proxies route through real consumer connections and read as ordinary visitors, which makes them the choice for sites with aggressive anti-bot defenses, at a higher cost. A mixed pool keeps spend down while still getting you past the tough targets. Crawlbase Smart AI Proxy manages this mix for you behind one endpoint.

How do I avoid getting blocked when scaling up?

Blocks come from looking automated, not just from volume. Keep per-IP rate low and rotate across many addresses, send complete and consistent browser headers, persist cookies within a session, and add jitter so your cadence is not perfectly even. For sites with CAPTCHAs or fingerprinting, a managed scraping API that presents realistic fingerprints and solves challenges is far more reliable than rolling your own evasion. Watch your status codes and back off the moment challenges start appearing.

When should I use a headless browser versus a plain HTTP request?

Reach for a headless browser only when the data is rendered client-side and is not reachable any other way. First check the network tab for an internal JSON API the page calls; hitting that endpoint directly is faster and far more stable. Headless browsers are CPU and memory hungry, which limits parallelism and slows every request, so at scale you want them scoped to the minority of pages that genuinely require rendering, not running for the whole crawl.

How does Crawlbase help me scale a web scraping project?

Crawlbase removes the infrastructure most of these practices require. The Crawling API rotates IPs, presents realistic fingerprints, optionally renders JavaScript, and retries failures in a single call. Smart AI Proxy gives you a managed rotating pool behind one endpoint. The async Crawler provides a push-based queue with concurrency, automatic retries, and webhook delivery, plus dashboards for success rates, live activity, and retries. Together they let you focus on scraping logic instead of building and maintaining the scaling layer yourself.

Ian Kalvin

Technical Support Engineer · Crawlbase

Technical support engineer at Crawlbase, writing from the front line of what actually breaks in production scraping and proxy setups.

Neil Zamora

Senior Architect · Crawlbase

Senior architect at Crawlbase, focused on the systems behind large-scale crawling: proxy rotation, anti-bot resilience, and the APIs that hide that complexity.
