A scraper that sails through the first page often falls apart at volume. The script runs clean against ten URLs in testing, ships, and then somewhere past the ten-thousandth request the success rate quietly drops: empty bodies, CAPTCHA redirects, half-filled datasets, a worker that crashes after running for hours. Nothing in the code changed. What changed is that the site started treating your traffic as a pattern instead of a visitor.
This guide walks through the failure modes that only show up at scale (IP reputation and bans, CAPTCHA walls, fingerprint and TLS detection, session expiry, selector drift, JavaScript-rendered content, resource leaks, and missing retry logic) and pairs each one with a concrete fix. By the end you will know why the same code behaves differently at ten requests and ten thousand, and which of these problems are worth solving yourself versus handing to a managed layer.
Why scrapers break at scale
At low volume, a website has little reason to care about you. Your handful of requests blends into ordinary background traffic, so even a sloppy scraper (bare headers, one IP, no pacing) sails through and gives you false confidence. The trouble is that anti-bot systems do not score individual requests in isolation. They profile behavior across a session and across an IP over time, and that profile only sharpens as your request count climbs.
Past a few thousand requests, several things shift at once. Your traffic pattern becomes statistically distinct from human browsing, per-IP thresholds trip, IP reputation degrades as the address logs more automated activity, and small inconsistencies a single request would never reveal accumulate into a confident "this is a bot" verdict. The defenses did not turn on at request 10,000; you simply crossed the volume where they had enough signal to act. The fixes below all push one way: make each request read like a real browser, and make the pipeline survive the failures that are now guaranteed.
The failure modes, and how to fix each
1. IP rate limiting and bans
The first wall is volume from a single address. Sites count requests per IP and act when one source looks too busy: rate limits cap requests in a window and start returning 429s, and once an address crosses into "abusive" territory it gets blacklisted outright. Reputation compounds the problem. Bot-mitigation systems track the ASN behind an IP, whether it comes from a datacenter, residential, or mobile pool, and the historical behavior of that range, so a flagged pool drags down every address in it.
Solution. Spread requests across many addresses so no single IP shows a bannable signature. A rotating proxy pool that mixes residential and datacenter IPs distributes load, sidesteps per-IP rate limits, and routes through different regions to reach geo-gated content. Rotation alone is not a cure, though: if you rotate faster while keeping the same robotic timing and headers, you just burn through addresses. Pair rotation with the pacing in the next section. See how to use rotating proxies for the setup.
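As a rough illustration of rotation, the sketch below keeps a small pool of proxy URLs, picks one at random per request, and retires any address that fails too often. The `ProxyPool` class, the `max_failures` threshold, and the example proxy hostnames are all assumptions made for this example, not part of any particular provider's API.

```python
import random


class ProxyPool:
    """Hypothetical helper: pick proxies at random and retire failing ones."""

    def __init__(self, proxy_urls, max_failures=3):
        # Failure count per proxy URL; every address starts healthy.
        self._failures = {url: 0 for url in proxy_urls}
        self._max_failures = max_failures

    def pick(self):
        # Only consider addresses that have not hit the failure threshold.
        healthy = [u for u, f in self._failures.items() if f < self._max_failures]
        if not healthy:
            raise RuntimeError("no healthy proxies left in the pool")
        return random.choice(healthy)

    def report_failure(self, url):
        # Call this when a request through `url` is blocked or times out.
        self._failures[url] += 1

    @staticmethod
    def as_requests_proxies(url):
        # Shape expected by the `proxies=` argument of requests.get().
        return {"http": url, "https": url}
```

With the `requests` library, a call might then look like `requests.get(target, proxies=ProxyPool.as_requests_proxies(pool.pick()), timeout=30)`. Random choice is used here instead of strict round-robin so the order of addresses does not itself become a pattern.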
2. CAPTCHA walls
When a site suspects automation, it stops blocking and starts challenging: reCAPTCHA, hCaptcha, FunCaptcha, or a click-and-drag puzzle. At scale these appear not just at login but mid-crawl on ordinary content pages, and a scraper that hits one simply stalls, or worse, follows the redirect and starts collecting challenge pages as if they were data.
Solution. The durable fix is to avoid triggering the challenge in the first place by looking like a real browser: realistic headers, persisted cookies, paced requests, and a trustworthy IP. Solving CAPTCHAs after the fact is a losing race; preventing them is the win. When one does appear, detect it explicitly, treat a challenge page as a failure rather than a success, and route around it instead of parsing it. How to bypass CAPTCHAs in web scraping covers the mechanics.
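Explicit detection can be as simple as refusing to hand a challenge page to the parser. The sketch below raises an exception when the final URL or the body looks like a challenge. The marker strings and the `ChallengeDetected` and `check_for_challenge` names are illustrative assumptions; real targets will need their own markers.

```python
# Illustrative substrings only; tune these per target site.
CHALLENGE_MARKERS = ("g-recaptcha", "h-captcha", "funcaptcha")


class ChallengeDetected(Exception):
    """Raised when a response looks like a CAPTCHA or challenge page."""


def check_for_challenge(final_url, body, markers=CHALLENGE_MARKERS):
    """Return the body unchanged, or raise if it looks like a challenge."""
    url = final_url.lower()
    if "captcha" in url or "challenge" in url:
        raise ChallengeDetected(f"redirected to challenge URL: {final_url}")
    haystack = body.lower()
    for marker in markers:
        if marker in haystack:
            raise ChallengeDetected(f"challenge marker {marker!r} found in body")
    return body
```

Raising instead of returning a flag means a challenge page cannot be stored as data by accident; the caller has to catch the exception and decide whether to retry through another route.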
3. Fingerprint and TLS detection
Modern detection goes well past counting requests. Anti-bot systems profile the request itself: the order and completeness of your headers, the TLS handshake your client produces (its JA3 signature), client hints, and whether all of these agree with the user agent you claim. A scraper that sends a Chrome user agent over a Python HTTP client's TLS fingerprint is contradicting itself, and that mismatch is trivial to flag. Behavioral signals pile on, since a session that never moves a mouse, never loads a secondary asset, and fires requests on a metronome reads as synthetic.
Solution. Coming from a clean IP is not enough; the request has to read as a real browser end to end. Send a complete, consistent header set, persist cookies across the session, and never assemble a combination of headers and TLS that no real browser produces. Keeping a fingerprint consistent across every attribute is genuinely hard, which is precisely the gap detectors exploit, so this is one of the strongest cases for offloading to a layer that maintains real browser fingerprints for you. Browser fingerprinting explains what you are up against.
4. Session and cookie expiry
Long runs introduce a failure that short tests never reach: sessions go stale. Authenticated cookies expire, CSRF tokens rotate, and session-bound state tied to a single IP breaks the moment you rotate to a new address mid-flow. A scraper that authenticated at the start of a million-page job and assumed the session would last is collecting redirects to a login page by hour two.
Solution. Manage sessions deliberately. Log in once, persist the cookies, and reuse that session rather than re-authenticating on every request. Also detect expiry: watch for the login redirect or the dropped token, and refresh credentials before the next batch. When a flow ties a session to one IP, pin that session to a single sticky address instead of rotating inside it, so the site sees a consistent visitor for the life of the session.
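One way to sketch expiry detection is to wrap each fetch in a check for the login redirect and log in again when it appears. Everything below is hypothetical scaffolding: `fetch` and `login` are callables you would supply, and the `/login` path and the 401 check are assumptions about how a given site signals a dead session.

```python
from urllib.parse import urlparse


def session_expired(status_code, final_url, login_path="/login"):
    """Guess whether a response means the session is no longer valid."""
    if status_code == 401:
        return True
    # A redirect that lands on the login page is the usual silent signal.
    return urlparse(final_url).path.rstrip("/") == login_path


def fetch_authenticated(url, fetch, login, max_logins=1):
    """fetch(url) -> (status_code, final_url, body); login() refreshes cookies."""
    for attempt in range(max_logins + 1):
        status, final_url, body = fetch(url)
        if not session_expired(status, final_url):
            return body
        if attempt < max_logins:
            login()  # refresh credentials, then try the same URL again
    raise RuntimeError(f"session still expired after re-login: {url}")
```

Capping the number of re-logins matters: if the login itself is failing, an uncapped loop would hammer the authentication endpoint.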
5. Selector drift from markup changes
Even a flawless scraper breaks the moment the target redesigns. Sites rename classes, restructure the DOM, and reshuffle endpoints to improve their own product, and every such change can silently snap a selector your parser depended on. At scale this is not an "if" but a "when," and across many sites it happens constantly: scripts that worked yesterday return empty fields today, with no error to announce it.
Solution. Parse defensively. Prefer stable, semantic selectors and durable attributes over brittle, deep CSS paths that any redesign will shift. Validate every extraction: assert that required fields are present and well-typed, so a missing field raises an alert instead of writing a null into your dataset. Keep parsers modular so one site's change touches one parser, not the whole pipeline.
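Field validation does not need a framework. A minimal sketch, assuming a record is a plain dict and that the `title` and `price` fields below stand in for whatever your parser actually extracts:

```python
# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"title": str, "price": float}


class ExtractionError(ValueError):
    """Raised when a parsed record is missing fields or has the wrong types."""


def validate_record(record, schema=REQUIRED_FIELDS):
    """Return the record if it satisfies the schema, otherwise raise."""
    problems = []
    for field, expected_type in schema.items():
        value = record.get(field)
        if value is None or value == "":
            problems.append(f"{field}: missing")
        elif not isinstance(value, expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    if problems:
        raise ExtractionError("; ".join(problems))
    return record
```

Collecting every problem before raising makes the error message more useful after a redesign, since several selectors tend to break at once.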
6. JavaScript-rendered content
Many sites ship a near-empty HTML shell and paint the real content with JavaScript after load, often from a follow-up API call. A plain HTTP fetch grabs the shell and your parser finds nothing, because the data was never in the source you downloaded. This produces the most confusing failure at scale: a clean 200 OK on a page that is functionally empty, so your scraper reports success while your dataset fills with blanks.
Solution. Two paths work. First, open the browser network tab and look for the internal JSON API the page calls; hitting that endpoint directly is faster and far more stable than rendering, and many "JavaScript sites" are thin front-ends over an API you can query. When the data is only reachable after rendering, drive a headless browser or use an API that renders for you and returns the finished HTML. Either way, validate the body before you parse it, since a 200 with 700 bytes and a "Just a moment" title is a silent block, not a result. See how to crawl JavaScript websites.
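The body check described above might be sketched like this. The 2,000-byte floor and the list of suspicious titles are assumptions picked for illustration; sensible values depend on what a normal page from the target looks like.

```python
import re

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def looks_like_silent_block(body, min_bytes=2000,
                            bad_titles=("just a moment", "access denied")):
    """Heuristic: True when a 200 response is probably not real content."""
    # Very small bodies are usually shells or interstitials, not pages.
    if len(body.encode("utf-8")) < min_bytes:
        return True
    match = TITLE_RE.search(body)
    title = match.group(1).strip().lower() if match else ""
    return any(bad in title for bad in bad_titles)
```

A response that trips this check should be counted as a failure and retried or rendered, not passed on to the parser.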
Rotation, realistic fingerprints, and JavaScript rendering are exactly the layers that get expensive to maintain at volume, and they are what the Crawling API absorbs. You send a URL; it rotates IPs, presents a consistent browser fingerprint, optionally renders the page, clears the challenges it can, retries the rest, and returns clean HTML. One call replaces the proxy pool, the CAPTCHA handling, and the headless fleet you would otherwise build and babysit, so the curve at scale stays flat instead of falling off a cliff.
7. Memory and connection leaks
Some scrapers are never blocked at all; they collapse under their own weight. A loop that opens a new connection per request without pooling or closing exhausts file descriptors and sockets. Accumulating every response in memory before writing balloons the process until it is killed. Concurrency set too high overwhelms your own machine before it ever overwhelms the target. None of this shows up in a ten-URL test, because the leak needs hours and thousands of iterations to become fatal.
Solution. Treat resources as finite. Reuse a pooled HTTP session instead of opening a fresh connection each time, and make sure responses are consumed and closed so sockets return to the pool. Stream results to storage as you go rather than holding the full dataset in memory. Cap concurrency per host and overall to a level your machine and the target can sustain. These are ordinary engineering habits, but at scale they are the difference between a process that runs for days and one that dies overnight.
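The streaming and concurrency-cap habits can be combined in one small loop. The sketch below is an assumption-laden outline, not a full crawler: `fetch_one` is a hypothetical callable that takes a URL and returns a JSON-serializable dict (in practice it would reuse one pooled HTTP session), and results are written as JSON Lines.

```python
import itertools
import json
from concurrent.futures import ThreadPoolExecutor


def crawl_to_jsonl(urls, fetch_one, out_path, max_workers=8, batch_size=100):
    """Fetch URLs with bounded concurrency, writing each result as it arrives."""
    written = 0
    url_iter = iter(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool, \
            open(out_path, "w", encoding="utf-8") as out:
        while True:
            # Pull a bounded batch so pending work never grows with input size.
            batch = list(itertools.islice(url_iter, batch_size))
            if not batch:
                break
            for record in pool.map(fetch_one, batch):
                out.write(json.dumps(record) + "\n")  # stream, do not accumulate
                written += 1
    return written
```

Memory use here is bounded by `batch_size` rather than by the length of the URL list, and `max_workers` caps how many requests are in flight at once.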
8. No retry or backoff logic
At volume, transient failures (timeouts, dropped connections, the occasional 429 or 503) are not edge cases; they are constant. A scraper with no retry logic throws those rows away. A scraper that retries immediately and aggressively is worse, because a tight retry loop amplifies traffic at the exact moment the site is already pushing back, which accelerates the block. This "retry storm" is one of the most common ways a scraper takes itself down.
Solution. Retry, but back off exponentially and add jitter so your retries do not arrive in a synchronized wave. Cap the number of attempts, respect any Retry-After header, and stop retrying status codes that will never succeed. A small wrapper is enough:
    import random, time, requests

    def fetch(url, attempts=5, base=1.0, cap=30.0):
        for n in range(attempts):
            try:
                r = requests.get(url, timeout=30)
            except requests.RequestException:
                r = None  # timeout or dropped connection: retry
            if r is not None:
                if r.status_code < 400:
                    return r
                if r.status_code in (400, 404):
                    break  # never going to succeed; do not retry
            delay = min(cap, base * 2 ** n) + random.uniform(0, base)
            time.sleep(delay)  # exponential backoff with jitter
        return None
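The wrapper above does not read Retry-After. One way to honor it is a small parser whose result, when present, replaces the computed backoff delay. The header can carry either a number of seconds or an HTTP date; the `retry_after_seconds` name, the `now` parameter, and the 120-second cap are assumptions for this sketch.

```python
import time
from datetime import timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None, cap=120.0):
    """Convert a Retry-After header value to a delay in seconds, or None."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        return min(cap, float(value))  # delta-seconds form
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None  # unparseable: fall back to normal backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = time.time() if now is None else now
    return min(cap, max(0.0, when.timestamp() - now))
```

Inside a retry loop this could be used as `delay = retry_after_seconds(r.headers.get("Retry-After")) or computed_backoff`, with the cap guarding against a server that asks for an unreasonably long wait.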
The same idea covers throttling: pace your normal requests with a small jittered delay between them, so even your successful traffic does not arrive on a perfectly even beat that a detector can lock onto.
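Jittered pacing can be sketched as a generator that sleeps a slightly different amount before each URL after the first. The `paced` name and the one-second base with half a second of jitter are placeholders chosen for this example.

```python
import random
import time


def jittered_delay(base=1.0, jitter=0.5):
    """Return a delay between base and base + jitter seconds."""
    return base + random.uniform(0, jitter)


def paced(urls, base=1.0, jitter=0.5, sleep=time.sleep):
    """Yield URLs one at a time, pausing an uneven interval between them."""
    for index, url in enumerate(urls):
        if index:  # no need to wait before the very first request
            sleep(jittered_delay(base, jitter))
        yield url
```

Usage would be a plain loop such as `for url in paced(url_list): ...`. The `sleep` argument is injectable so the pacing can be exercised in tests without actually waiting.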
Offloading rotation and rendering
Look back at the eight fixes and a pattern emerges: most of the hardest ones are not about your data at all. Rotation, fingerprint consistency, CAPTCHA avoidance, and rendering are undifferentiated infrastructure, an arms race you maintain against every vendor on every target, separate from the extraction logic that actually creates value for you. Building all of it yourself is possible, but it is a standing tax on engineering time that grows with every site you add.
This is the natural point to offload. A managed crawling layer carries rotation, realistic fingerprints, optional JavaScript rendering, challenge handling, and intelligent retry behind a single request, and returns clean HTML. You keep the parsing and the business logic, which are genuinely yours, and let the layer absorb the parts that exist only to get the request through. For the wider catalog of issues and trade-offs, our guide on web scraping challenges and solutions goes broader.
Monitoring and alerting
The failure mode that hurts most at scale is the one nobody sees. A scraper degrades into 200 responses with empty bodies and half-filled datasets, and the gap surfaces only when a downstream report looks wrong, days later. The fix is to make silence loud. Instrument the scraper as a living system: track success and failure rates per domain, block and CAPTCHA rates, body sizes, and throughput, so a creeping rise in 403s or a sudden drop in average response size raises an alert within minutes rather than after a broken run finishes. Validate as you go, and alert when a required field goes missing across a batch, because a structure change should page you, not quietly poison the data. The real cost of scraping is rarely the first build; it is keeping it honest over time.
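A minimal version of that instrumentation keeps a rolling window of outcomes per domain and reports when the failure rate or the average body size crosses a threshold. The `DomainMonitor` class and its default thresholds are assumptions for illustration; a production setup would more likely export these numbers to an existing metrics system.

```python
from collections import defaultdict, deque


class DomainMonitor:
    """Hypothetical rolling health check for one scraper, keyed by domain."""

    def __init__(self, window=200, max_failure_rate=0.2, min_mean_bytes=2000):
        self.max_failure_rate = max_failure_rate
        self.min_mean_bytes = min_mean_bytes
        # Each domain keeps only its most recent `window` outcomes.
        self._events = defaultdict(lambda: deque(maxlen=window))

    def record(self, domain, ok, body_bytes):
        self._events[domain].append((ok, body_bytes))

    def alerts(self, domain):
        events = self._events.get(domain)
        if not events:
            return []
        failure_rate = sum(1 for ok, _ in events if not ok) / len(events)
        mean_bytes = sum(size for _, size in events) / len(events)
        found = []
        if failure_rate > self.max_failure_rate:
            found.append(f"{domain}: failure rate {failure_rate:.0%}")
        if mean_bytes < self.min_mean_bytes:
            found.append(f"{domain}: mean body size {mean_bytes:.0f} bytes")
        return found
```

Checking `alerts()` after every batch, instead of at the end of the run, is what turns a silent degradation into something that surfaces within minutes.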
Scraping responsibly
Staying unblocked is partly restraint. Stick to public data, the content anyone can see without an account, and stay away from anything behind a login or anything that identifies a person. Read the target's robots.txt and its stated rate expectations, and keep volume low enough that you are not straining its servers, since scraping too fast can genuinely degrade or crash a site. Privacy laws such as GDPR and CCPA govern what you may collect about people, and a site's Terms of Service may forbid scraping outright, so check both before a large run. A scraper that behaves like a good citizen is also one that stays unblocked far longer.
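Checking robots.txt can be automated with Python's standard library. The sketch below parses rules from a string so it runs offline; in a real crawler you would download the target's robots.txt first. The `my-crawler` user agent and the helper names are placeholders.

```python
from urllib.robotparser import RobotFileParser


def build_robots(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser, url, user_agent="my-crawler"):
    """True when the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


def polite_delay(parser, user_agent="my-crawler", default=1.0):
    """Use the site's Crawl-delay when it states one, else a default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default
```

Filtering the URL queue through `allowed()` before fetching, and feeding `polite_delay()` into your pacing, keeps the crawl within what the site has published.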
Key takeaways
- Scale is the trigger, not the bug. Your code did not break at request 10,000; the site finally had enough signal to profile your traffic, so every fix is about looking more like a real browser and surviving inevitable failures.
- Rotate and pace together. A rotating pool of mixed residential and datacenter IPs sidesteps rate limits, but only when paired with jittered throttling, since rotating faster on robotic timing just burns addresses.
- Consistency beats cleverness on detection. Headers, cookies, and TLS must agree with the browser you claim to be, and a stale session or a contradictory fingerprint is what gets a long run flagged.
- Validate before you trust a 200. Silent failures (empty bodies, challenge pages, and drifted selectors) are caught by defensive parsing, field validation, and per-domain monitoring, not by hope.
- Offload the undifferentiated layer. Rotation, fingerprints, rendering, and retries are infrastructure you can rent so the curve stays flat at scale, leaving your team on the extraction and logic that actually matter.
Frequently Asked Questions (FAQs)
Why does my scraper work in testing but fail at scale?
Early tests do not generate enough traffic to trip a site's thresholds, so even a sloppy scraper passes. Once you run sustained volume, your traffic becomes easy to profile and small inconsistencies in headers, timing, fingerprint, and session behavior accumulate into a confident bot verdict. The code did not change; you simply crossed the point where the defenses had enough signal to act.
Why am I getting 200 OK responses but the data is missing?
That is usually a silent block or unrendered content. The server returns a valid status, but the body is a placeholder, a challenge page, or an empty JavaScript shell rather than the real content. Validate the response before parsing: check the body size and look for tell-tale titles like "Just a moment" so a silent failure becomes a loud one instead of a null in your dataset.
Does rotating proxies fix rate limiting on its own?
Not by itself. Rotation spreads requests so no single IP trips a per-IP limit, but if you keep the same robotic timing and header set across the pool, the pattern is still detectable and you just burn addresses faster. Pair rotation with jittered pacing and realistic, consistent requests so each address looks like an ordinary visitor.
How should retries be handled so they do not make blocking worse?
Retry with exponential backoff and jitter, cap the number of attempts, and respect any Retry-After header. Immediate, aggressive retries create a storm that amplifies traffic exactly when the site is already pushing back, which accelerates the block. Also skip retrying status codes like 404 that will never succeed.
When should I render JavaScript instead of fetching raw HTML?
Render when the data you need is painted by JavaScript after load or when the site relies on scripts to set session cookies or unlock the real HTML. Before reaching for a headless browser, check whether the page loads its data from an internal JSON API you can call directly, since that is faster and far more stable. Raw fetches are fine when the content is already present in the source.
When is it worth offloading to a managed crawling API?
When the maintenance of rotation, fingerprints, rendering, and challenge handling starts costing more than the data is worth, or when you are scaling across many sites and cannot keep patching each one. A managed layer carries that infrastructure behind a single request, so your team stays focused on extraction and business logic rather than the undifferentiated work of getting requests through.

