Pulling a few hundred pages off a website is easy. Pulling a few million is a different problem, because at that volume the obstacle between you and the data is no longer parsing but staying unblocked. Sites that happily serve a human reader will throttle, challenge, or ban a client that requests thousands of pages an hour from one address in a rigid pattern. The data is often public, but the access pattern is what gets you flagged.

This guide explains why large scrapes get banned and what actually keeps them running: rotating IP addresses, pacing requests, sending realistic headers, rendering pages that need a browser, retrying only the requests that failed, and paying for successful responses rather than wasted ones. It closes with how a managed scraping API folds all of that into a single request, and a short note on collecting data responsibly.

Why websites block large scrapes

To a website, one human visitor is a single client: one IP address, browsing at human speed, clicking a handful of pages, with a normal browser's fingerprint. A naive scraper looks nothing like that. It hammers the same endpoint from one IP, far faster than any person could read, in a predictable loop, often with a default user agent that announces it is a script. Each of those signals is easy to detect, and together they are unmistakable.

Sites also publish their preferences in a robots.txt file, which lays out what automated clients should and should not touch and how aggressively. On top of that they deploy active defenses: rate limits per address, CAPTCHA challenges, login walls, browser fingerprinting, and honeypot links that are hidden from human eyes but visible in the HTML, so any client that follows one outs itself as a bot. None of these defenses are aimed at people. They are aimed at exactly the behavior an unconfigured scraper produces. Avoiding bans, then, is mostly about not looking like the thing those systems are built to catch. The sections below walk through the techniques that get you there.
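As a rough illustration of reading those published preferences, the sketch below uses Python's standard urllib.robotparser module to check whether a path is allowed and whether a crawl delay is requested. The robots.txt content, the example.com URLs, and the "example-crawler" agent name are all made up for the example; a real crawler would fetch the file from the target site first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would download it from
# https://<site>/robots.txt before starting the job.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)
AGENT = "example-crawler"  # hypothetical user agent name

# True when the rules permit this agent to fetch the URL.
allowed = parser.can_fetch(AGENT, "https://example.com/products/1")
blocked = parser.can_fetch(AGENT, "https://example.com/private/report")

# Seconds the site asks crawlers to wait between requests, or None if unset.
delay = parser.crawl_delay(AGENT)
```

Checking these rules once per host, before queueing any URLs, keeps disallowed paths out of the job entirely instead of discovering them through blocks.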

Four habits keep you unblocked. Rotate IPs, pace requests, render where needed, and retry only what failed; a managed API folds all four into one call.

Rotate IP addresses

The single loudest signal you send is volume from one address. A hundred requests a minute from a single IP is the easiest possible thing to rate-limit, and once that address is flagged, every request from it fails regardless of how careful the rest of your setup is. The fix is to spread requests across many addresses so no single one carries a suspicious load.

This is what a proxy does. A proxy is a gateway that sits between your scraper and the target site, so the site sees the proxy's address instead of yours. A rotating proxy goes further and changes that address from request to request, so a job that issues a million requests is distributed across a large pool rather than concentrated on one identity. Proxies also come in flavors that matter for blocking: datacenter addresses are fast and cheap but easier to recognize as non-residential, while residential and mobile addresses come from real consumer connections and blend in far better on sites with aggressive defenses. For a deeper look at rotation strategy, see our guide on how to use rotating proxies.
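A minimal sketch of the rotation idea, assuming you already have a list of proxy endpoints from a provider. The addresses below are placeholders, and the round-robin order is just one simple policy; commercial rotating proxies usually do the switching for you behind a single gateway address.

```python
import itertools
from typing import Dict, Iterator, List

# Hypothetical proxy endpoints; substitute the addresses your provider gives you.
PROXY_POOL: List[str] = [
    "http://proxy-a.example.net:8080",
    "http://proxy-b.example.net:8080",
    "http://proxy-c.example.net:8080",
]

def proxy_rotation(pool: List[str]) -> Iterator[Dict[str, str]]:
    """Yield one proxies mapping per request, cycling through the pool in order."""
    for address in itertools.cycle(pool):
        # Same proxy for both schemes, so a single request uses one identity.
        yield {"http": address, "https": address}

rotation = proxy_rotation(PROXY_POOL)

# Each outgoing request takes the next mapping from the rotation.
first_four = [next(rotation)["http"] for _ in range(4)]
```

The mapping has the shape that the `requests` library accepts through its `proxies=` argument, so each call can pass `next(rotation)` to send that request through the next address in the pool.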

Pace requests and respect rate limits

Even across many IPs, speed alone gives you away. No human loads thirty pages a second, so a scraper that does is trivially distinguishable from real traffic. Pacing your requests, adding deliberate delays, and randomizing the gaps between them makes the traffic look organic instead of mechanical.
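One way to sketch randomized pacing is a generator that hands out URLs with a jittered pause between them. The two-second base and 50% jitter are arbitrary illustrative values, not recommendations for any particular site.

```python
import random
import time
from typing import Callable, Iterable, Iterator, Optional

def jittered_delay(base_seconds: float, jitter: float, rng: random.Random) -> float:
    """Return a delay drawn uniformly from base*(1-jitter) to base*(1+jitter)."""
    low = base_seconds * (1.0 - jitter)
    high = base_seconds * (1.0 + jitter)
    return rng.uniform(low, high)

def paced(
    urls: Iterable[str],
    base_seconds: float = 2.0,
    jitter: float = 0.5,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield URLs one at a time, sleeping a randomized interval between them."""
    rng = rng or random.Random()
    for index, url in enumerate(urls):
        if index > 0:  # no need to wait before the very first request
            sleep(jittered_delay(base_seconds, jitter, rng))
        yield url
```

A fetch loop can then iterate with `for url in paced(url_list): ...`, and the `sleep` parameter can be swapped for a stub when you want to exercise the logic without actually waiting.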

The goal is a request rate the target site can absorb without strain. That is both the courteous thing to do and the effective one: a measured crawl is far less likely to trip a rate limit or get an address banned than a flat-out sprint. Many sites also signal their limits directly through response headers and status codes, and a well-behaved scraper reads those signals and backs off when asked. Treat the rate limit as a constraint to design around, not an obstacle to outrun, and most of the throttling problem disappears.
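Backing off when asked might look like the sketch below. It assumes the site signals throttling with HTTP 429 or 503 and, optionally, a Retry-After header given in seconds; the header can also carry an HTTP date, which this simplified version does not parse and treats as absent. The base and cap values are illustrative.

```python
from typing import Mapping, Optional

def backoff_seconds(
    status: int,
    headers: Mapping[str, str],
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,
) -> Optional[float]:
    """Return how long to wait before retrying, or None if no back-off is needed."""
    if status not in (429, 503):
        return None
    # Header names are case-insensitive, so normalize before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    retry_after = lowered.get("retry-after", "").strip()
    if retry_after.isdigit():
        # The server said how long to wait; honor it, up to our own cap.
        return min(float(retry_after), cap)
    # No usable hint: fall back to exponential back-off by attempt number.
    return min(base * (2 ** attempt), cap)
```

For example, `backoff_seconds(429, {"Retry-After": "30"}, attempt=0)` honors the server's 30 seconds, while a 429 with no header on the fourth attempt (`attempt=3`) falls back to 8 seconds.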

Send realistic headers

Every browser sends a set of HTTP headers with each request: a user agent identifying the browser and operating system, accepted languages, encodings, and more. A default scraping library sends a sparse, obviously automated set of headers, sometimes with a user agent that literally names the HTTP client. Sites read those headers, and a request that does not look like it came from a real browser is an easy flag.

Matching the headers a genuine browser sends, and varying the user agent across a pool of real ones rather than reusing a single string for every request, makes each request blend in. Headers should also be internally consistent: an Accept-Language and a user agent that contradict each other are their own tell. The aim is for each request to be indistinguishable from one a person's browser would produce, so there is nothing in the request itself to single it out.
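A simple way to keep headers internally consistent is to rotate whole profiles rather than mixing individual values. In the sketch below the two profiles are illustrative only; in practice the strings should be copied from current releases of real browsers, and a real profile would carry more headers than these four.

```python
import random
from typing import Dict, List

# Illustrative profiles. Each one is a set of headers that belong together,
# so the user agent and language never contradict each other.
HEADER_PROFILES: List[Dict[str, str]] = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    },
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:125.0) "
                      "Gecko/20100101 Firefox/125.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-GB,en;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
    },
]

def pick_headers(rng: random.Random) -> Dict[str, str]:
    """Choose one complete profile and return a copy the caller may modify."""
    return dict(rng.choice(HEADER_PROFILES))

headers = pick_headers(random.Random())
```

Picking the profile as a unit is the design choice that matters here: choosing each header independently is what produces the contradictory combinations a site can flag.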

Render pages that need JavaScript

A growing share of the web does not ship its content in the initial HTML. Single-page apps and dynamic sites load a skeleton, then fetch and render the real data with JavaScript in the browser. A plain HTTP request to one of those pages returns almost nothing useful, because the content you want never existed in the raw response.

Scraping those sites means running a real browser engine that executes the page's JavaScript and waits for the content to appear before extracting it. Headless browsers handle this, at the cost of being heavier and slower than simple requests, which matters when you are running millions of them. Knowing which pages genuinely need rendering, and which return everything in the first response, is what keeps a large job efficient rather than burning browser time on pages that never needed it. Our walkthrough on crawling JavaScript websites covers when rendering is worth the overhead.
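One cheap way to decide whether a page needs rendering is to fetch it once with a plain request and measure how much visible text the raw HTML contains. The heuristic below is a sketch under that assumption, using only Python's standard html.parser; the 200-character threshold is an arbitrary starting point, and checking for a specific element you expect to find is often more reliable.

```python
from html.parser import HTMLParser
from typing import List

class _VisibleText(HTMLParser):
    """Collect text that is not inside script, style, or noscript elements."""

    SKIPPED = ("script", "style", "noscript")

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def visible_text_length(html: str) -> int:
    """Count characters of text a reader would see without running JavaScript."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return sum(len(chunk) for chunk in parser.chunks)

def probably_needs_rendering(html: str, min_chars: int = 200) -> bool:
    """Guess that a page is a JavaScript shell if its raw HTML has little text."""
    return visible_text_length(html) < min_chars
```

Running this on a sample of URLs per site lets you route only the pages that look like empty shells to a headless browser and keep everything else on plain requests.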

Retry only what failed

At scale, some fraction of requests will always fail: a transient timeout, a temporary block, a slow upstream. The wrong response is to restart the whole job, which throws away everything that already succeeded and sends the target the same requests a second time. The right one is to track each request's outcome and retry only the ones that failed, ideally with a short backoff so a struggling endpoint gets a moment to recover.

This keeps a large scrape both efficient and gentle. Successful pages are banked and never re-fetched, failures are isolated and retried on their own, and the total volume you send the site stays close to the minimum the job actually requires. A job built this way degrades gracefully under partial failure instead of thrashing, which is exactly what you want when a run spans hours and millions of URLs.
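The pattern can be sketched as a loop that banks successes and carries only the failures into the next round. The `fetch` callable is a stand-in for whatever function actually downloads a page, and the round count and delays are illustrative.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple

def crawl_with_retries(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    max_rounds: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Dict[str, str], List[str]]:
    """Fetch every URL, retrying only the ones that raised in the previous round.

    Returns the banked results and the URLs that still failed after max_rounds.
    """
    results: Dict[str, str] = {}
    pending: List[str] = list(urls)
    for round_number in range(max_rounds):
        if not pending:
            break
        if round_number > 0:
            # Exponential back-off between rounds: 1x, 2x, 4x the base delay.
            sleep(base_delay * 2 ** (round_number - 1))
        failed: List[str] = []
        for url in pending:
            try:
                results[url] = fetch(url)  # success is banked, never re-fetched
            except Exception:
                failed.append(url)  # only this URL goes into the next round
        pending = failed
    return results, pending
```

On a long-running job the same idea usually lives in a persistent queue or database table rather than in memory, so a crash does not lose the record of what already succeeded.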

Crawlbase Crawling API

Stacking rotation, pacing, header management, rendering, and retries by hand is a lot of moving parts to maintain across a large job. The Crawlbase Crawling API bundles them into a single request: it rotates IPs from a large residential and datacenter pool, handles CAPTCHAs and blocks, and renders JavaScript when a page needs it, returning clean HTML. You get 1,000 free requests to start, and you only pay for successful ones.

Pay only for successful requests

There is an economic side to large-scale scraping that is easy to overlook until the bill arrives. If you run your own proxy fleet and browser farm, you pay for every request you send, including the ones that get blocked, time out, or come back empty. On a million-request job with a non-trivial failure rate, that waste is real money spent on data you never received.

A pricing model that charges only for successful responses flips that incentive. The cost of failed requests sits with the provider, which aligns their interest with yours: they are motivated to keep your success rate high because that is what they bill for. It also makes a large job easier to budget, since you pay for results rather than attempts. When you compare scraping approaches at volume, the difference between paying per request and paying per success is one of the larger cost factors.
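To make the difference concrete, here is a worked example with made-up numbers: a job that needs 1,000,000 successful pages, an 80% success rate (so 1,250,000 attempts), and a hypothetical price of $0.001 per request. Billed per attempt, the job costs $1,250, of which $250 pays for failures, and each usable page effectively costs $0.00125 instead of $0.001.

```python
from decimal import Decimal

def per_request_bill(attempts: int, price: Decimal) -> Decimal:
    """Total cost when every attempt is billed, successful or not."""
    return attempts * price

def wasted_spend(attempts: int, successes: int, price: Decimal) -> Decimal:
    """Portion of a per-request bill that paid for failed attempts."""
    return (attempts - successes) * price

def effective_price_per_success(attempts: int, successes: int, price: Decimal) -> Decimal:
    """What each usable response really costs under per-request billing."""
    return attempts * price / successes

# Hypothetical figures for illustration only.
ATTEMPTS = 1_250_000
SUCCESSES = 1_000_000
PRICE = Decimal("0.001")

bill = per_request_bill(ATTEMPTS, PRICE)
waste = wasted_spend(ATTEMPTS, SUCCESSES, PRICE)
effective = effective_price_per_success(ATTEMPTS, SUCCESSES, PRICE)
```

Under per-success billing at the same hypothetical unit price, the bill would be SUCCESSES times PRICE, or $1,000, so the comparison comes down to the failure rate you expect and the unit prices actually on offer.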

How a managed scraping API handles it

Each technique above is straightforward on its own. The difficulty is running all of them together, reliably, across millions of requests, and keeping them working as target sites change their defenses. That is the gap a managed scraping API fills. Instead of assembling and maintaining a proxy pool, a header rotation layer, a headless browser farm, a retry queue, and a CAPTCHA solver yourself, you send a URL to a single endpoint and get clean data back.

Under the hood, the API rotates IP addresses across a large pool, paces and shapes requests to look human, sends realistic headers, renders JavaScript-heavy pages with a real browser engine when needed, solves or sidesteps CAPTCHAs, and retries transient failures, all before it returns a response. For jobs too large to run synchronously, an asynchronous mode lets you submit URLs in bulk and receive results via callback as they complete, so you are not holding open millions of connections. The result is that the anti-ban work becomes someone else's problem, and you spend your time on the data instead of the plumbing. For the broader picture of running jobs at this size, see our guide to large-scale web scraping and the best practices for scaling scraping projects.
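The shape of such a call can be sketched as a URL builder. The endpoint and the parameter names below (`token`, `url`, `render`, `async`, `callback`) are hypothetical and chosen only for illustration; they are not taken from any provider's documentation, so check the real API reference before relying on them.

```python
from typing import Optional
from urllib.parse import urlencode

# Hypothetical endpoint, used only to show the shape of the request.
API_ENDPOINT = "https://api.scraper.example/v1/"

def build_request_url(
    token: str,
    target_url: str,
    render_js: bool = False,
    callback_url: Optional[str] = None,
) -> str:
    """Build the single request that stands in for the whole anti-ban stack."""
    params = {"token": token, "url": target_url}
    if render_js:
        params["render"] = "true"  # hypothetical flag: ask for browser rendering
    if callback_url:
        # Hypothetical async mode: submit now, receive the result at the callback.
        params["async"] = "true"
        params["callback"] = callback_url
    # urlencode percent-encodes the target URL so its own query string survives.
    return API_ENDPOINT + "?" + urlencode(params)

sync_call = build_request_url("YOUR_TOKEN", "https://example.com/p?id=1")
async_call = build_request_url(
    "YOUR_TOKEN",
    "https://example.com/p?id=1",
    render_js=True,
    callback_url="https://hooks.example.org/results",
)
```

The point of the sketch is the division of labor: the caller supplies a target URL and a few options, and everything described in the earlier sections happens on the provider's side before a response or callback arrives.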

Scraping responsibly

Avoiding bans is a technical problem, but it sits inside an ethical one. Scrape public data only, and check a site's terms of service and its robots.txt before you start a large job. Keep your request rate reasonable so you are not degrading the service for the people it is actually built for, since a crawl heavy enough to strain a site is both rude and counterproductive. When the data you collect includes anything personal, treat regulations like GDPR and CCPA as hard requirements, not afterthoughts: collect only what you need, aggregate where you can, and do not build profiles of individuals. Responsible scraping and unblockable scraping pull in the same direction, because the behavior that keeps you compliant is usually the same behavior that keeps you from looking like an abusive bot.

Key takeaways

  • Access pattern, not data, gets you banned. Sites block clients that request too much, too fast, from one address in a rigid pattern, even when the data itself is public.
  • Rotation and pacing do most of the work. Spreading requests across many IPs, especially residential ones, and adding randomized delays makes traffic look human instead of mechanical.
  • Look like a real browser. Send realistic, varied headers and render JavaScript when a page needs it, so each request is indistinguishable from genuine traffic.
  • Retry only failures and pay per success. Bank successful pages, isolate and retry the rest with backoff, and prefer a model that charges for results rather than every attempt.
  • A managed API consolidates the techniques. One endpoint folds rotation, headers, rendering, CAPTCHA handling, and retries into a single request, with an async mode for very large jobs.

Frequently Asked Questions (FAQs)

Why do websites ban scrapers if the data is public?

The block is rarely about the data and almost always about the access pattern. A scraper that requests thousands of pages an hour from one IP address, at machine speed, in a predictable loop, looks nothing like a human visitor, and that behavior is what anti-bot systems are built to catch. Public data viewed at a human-like rate from varied addresses draws far less attention than the same data pulled at industrial volume from a single identity.

What is the single most important technique to avoid getting blocked?

Rotating IP addresses, because volume from one address is the loudest and easiest-to-detect signal a scraper sends. Spreading requests across a large pool, especially residential or mobile addresses on aggressive sites, prevents any single identity from carrying a suspicious load. That said, rotation works best combined with pacing and realistic headers, since speed and obviously automated fingerprints will still get you flagged even across many IPs.

How fast can I scrape without getting banned?

There is no universal number, because each site sets its own limits, but the principle is to stay at a rate the target can absorb comfortably and to randomize the gaps between requests so the traffic looks organic. Many sites communicate their limits through response headers and status codes, so read those signals and back off when asked. A measured crawl that respects rate limits is far less likely to get throttled than one that sprints.

Do I always need a headless browser to scrape at scale?

No, and you should avoid one where you can, because rendering is heavier and slower than a plain request, which matters across millions of pages. You only need a browser engine for sites that load their content with JavaScript after the initial HTML arrives. Pages that return everything in the first response can be scraped with simple requests, so the efficient approach is to render only the pages that genuinely require it.

What does "pay only for successful requests" mean?

It is a pricing model where you are charged for responses that actually return the data you asked for, not for requests that get blocked, time out, or come back empty. On a large job with a real failure rate, that difference is significant, since you are not paying for data you never received. It also aligns the provider's incentive with yours, because they only earn when your requests succeed.

How does a scraping API help compared to building my own scraper?

A managed API runs rotation, pacing, header management, JavaScript rendering, CAPTCHA handling, and retries behind a single endpoint, so you send a URL and get clean data back instead of building and maintaining each of those layers yourself. It also adapts as target sites change their defenses, which is ongoing work on a homegrown setup. For very large jobs, an asynchronous mode lets you submit URLs in bulk and collect results via callback rather than holding millions of connections open.
