A scraper gets blocked when its traffic does not match what the target expects from a real visitor. The request arrives from an address with a poor reputation, carries a thin or contradictory set of headers, fires faster than any human could click, or asks for a page the site never serves to bots. Any one of those tells is enough for a modern anti-bot system to return a 403, a CAPTCHA, or an empty page instead of the data you came for.
This guide covers the practical tactics that keep a scraper looking like a browser: rotating residential IPs, sending realistic headers and user-agents, throttling and randomising your timing, honoring robots.txt, rendering JavaScript when the page needs it, handling CAPTCHAs, managing sessions and cookies, and watching status codes so you back off before a soft throttle hardens into a ban. None is a silver bullet, but stacked in the right order they take most targets from a wall of blocks to a steady stream of 200s.
Why scrapers get blocked
Before the tactics, it helps to know what you are up against. Anti-bot systems flag automated traffic on three broad signals, and almost every block traces back to one of them.
- Fingerprint. A real browser carries a consistent bundle of signals: a full header set, a TLS handshake that matches the user-agent it claims, a JavaScript runtime, cookies that persist across requests. A default HTTP client carries almost none of that, and a half-spoofed one carries signals that contradict each other. Either way, it stands out.
- Rate. Sites count requests per IP and per session over time. Traffic that arrives faster than a human could generate it, or at a perfectly regular interval, reads as a script no matter how clean each individual request looks.
- IP reputation. Addresses in known datacenter ranges, on shared blocklists, or with a history of abuse are treated with suspicion from the first request. The IP you come from sets your starting credibility before you send a single header.
Every tactic below works by repairing one of these signals. Apply the cheap ones first, measure your block rate, and only reach for the heavy machinery when a target actually forces you to.
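Measuring block rate can be as simple as tallying response codes into a few buckets. The sketch below is one hedged way to do that; the bucket names and the `classify` and `block_rate` helpers are illustrative assumptions, not part of any library.

```python
from collections import Counter


def classify(status_code: int) -> str:
    """Map an HTTP status code to a coarse outcome bucket (hypothetical names)."""
    if 200 <= status_code < 300:
        return "ok"
    if status_code in (429, 503):
        return "throttled"  # rate signal: slow down
    if status_code == 403:
        return "blocked"    # likely fingerprint or IP reputation
    return "other"


def block_rate(status_codes) -> float:
    """Fraction of responses that were throttled or blocked."""
    counts = Counter(classify(code) for code in status_codes)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["throttled"] + counts["blocked"]) / total
```

Tracking the number per target and per tactic shows whether a change actually moved it, which is more reliable than judging by feel.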
Rotate residential IPs
The most common block is also the simplest: too many requests from one address. A site counts hits per IP and starts returning 429s or a block page once you cross its threshold. Spread the same volume of requests across many IPs and no single address ever trips the limit. This is why scraping infrastructure is mostly proxy infrastructure: the proxy makes the request on your behalf, so the target sees its address instead of yours.
The type of IP matters as much as the rotation. Datacenter IPs are fast and cheap, but they sit in hosting ranges any target can identify with a quick lookup, so they read as automated on hardened sites. Residential IPs exit through real consumer connections and look like ordinary visitors, at higher cost and lower speed. The full tradeoff is in datacenter vs residential proxies. Buy exactly as much trust as the target demands and not a tier more: residential for the strict sites, datacenter for the tolerant ones.
Rotating by hand means maintaining a pool of addresses and cycling through them per request, then pruning the ones that get burned. A rotating gateway hides that behind a single endpoint and swaps the exit IP for you, either fresh per request or sticky per session when you need to hold one identity across several pages.
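Rotating by hand might look roughly like the sketch below: a round-robin pool that skips addresses marked as burned. The `ProxyPool` class and the proxy addresses are hypothetical; the addresses come from the reserved documentation range and will not route anywhere.

```python
import itertools


class ProxyPool:
    """Round-robin pool that skips proxies marked as burned."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        self._burned = set()
        self._cycle = itertools.cycle(self._proxies)

    def next(self) -> str:
        # Try each proxy at most once per call so we never loop forever.
        for _ in range(len(self._proxies)):
            proxy = next(self._cycle)
            if proxy not in self._burned:
                return proxy
        raise RuntimeError("no healthy proxies left in the pool")

    def burn(self, proxy: str) -> None:
        """Prune an address that has started returning blocks."""
        self._burned.add(proxy)


# Placeholder addresses from the documentation range 203.0.113.0/24.
pool = ProxyPool([
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
])
```

With the requests library, the address returned by `pool.next()` can be passed as `proxies={"http": proxy, "https": proxy}` on each call. A rotating gateway replaces all of this with a single proxy URL.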
Send realistic headers and user-agents
A default HTTP client gives itself away in the first headers it sends. Python's requests library announces User-Agent: python-requests/2.x and ships almost no other headers, while a real browser sends a dozen in a specific order. A site that does nothing more than read the User-Agent header will block the default client and pass the browser.
Set a current, real browser user-agent, and rotate through a small pool of them rather than hammering one string forever. Then send the headers that always travel alongside it: Accept, Accept-Language, Accept-Encoding, and a plausible Referer. The goal is not one magic header but internal consistency. A Chrome user-agent paired with Firefox-style Accept values is more suspicious than no spoofing at all.
```python
import requests

# Keep every header consistent with the browser named in the User-Agent.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    # Advertise "br" only if the brotli package is installed; without it,
    # requests cannot decode a Brotli-compressed body.
    "Accept-Encoding": "gzip, deflate",
    "Referer": "https://www.google.com/",
}

# Always set a timeout so a stalled connection cannot hang the scraper.
resp = requests.get("https://example.com", headers=headers, timeout=10)
```
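To rotate user-agents without breaking consistency, one option is to rotate whole header profiles rather than individual fields. The sketch below assumes two hand-written profiles; the `PROFILES` list and `pick_profile` helper are hypothetical, and the header values are illustrative, so capture current ones from a real browser's developer tools before relying on them.

```python
import random

# Each profile is a complete, self-consistent header set for one browser.
# Values are illustrative; real browsers change them between releases.
PROFILES = [
    {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    },
    {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) "
            "Gecko/20100101 Firefox/125.0"
        ),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
    },
]


def pick_profile(rng=random) -> dict:
    """Return a copy of one whole profile; never mix fields across profiles."""
    return dict(rng.choice(PROFILES))
```

Picking one profile per session, rather than per request, keeps the identity stable for as long as the cookies that go with it.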
One layer deeper sits the TLS fingerprint. Before any HTTP byte is sent, your client opens a TLS handshake whose exact shape forms a signature often summarized as a JA3 hash. A real Chrome produces one well-known signature; a Python client produces a completely different one. When you send a Chrome user-agent over a Python TLS stack, the two disagree and a fingerprint check flags the mismatch no matter how perfect your headers are. Closing that gap means using a client that mimics a browser's handshake, or driving a real browser engine that produces a genuine one for free.
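To make the idea concrete: JA3 joins five ClientHello fields (TLS version, cipher suites, extensions, elliptic curves, and point formats) as decimal values and takes the MD5 of the resulting string. The sketch below computes that hash from already-parsed fields; the `ja3_string` and `ja3_hash` helpers are hypothetical, and the two sets of parameters are made up for illustration rather than captured from real clients.

```python
import hashlib


def ja3_string(version, ciphers, extensions, curves, point_formats) -> str:
    """Build the JA3 input: five comma-separated fields, lists dash-joined."""
    fields = [
        str(version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return ",".join(fields)


def ja3_hash(version, ciphers, extensions, curves, point_formats) -> str:
    """MD5 hex digest of the JA3 string."""
    raw = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii")).hexdigest()


# Made-up ClientHello parameters for two clients; real values differ.
browser_like = ja3_hash(771, [4865, 4866, 4867], [0, 23, 65281, 10, 11], [29, 23, 24], [0])
script_like = ja3_hash(771, [4866, 4867, 4865], [0, 11, 10], [29, 23], [0])
```

Because the order of ciphers and extensions feeds into the hash, two clients that support the same features but list them differently still produce different signatures, which is why headers alone cannot hide the underlying TLS stack.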
Throttle and randomise your timing
Even spread across many IPs, a scraper that fires requests on a fixed cadence reads as automated. A perfectly regular 500ms gap between requests is itself a fingerprint, because humans do not click like a metronome. Add a randomised delay between requests instead of a constant one, and keep concurrency to a level the target can absorb without noticing the spike.
The long-standing advice to use an irregular, human-like scraping pattern still holds: vary your intervals, do not crawl pages in a rigid sequence, and avoid sending many requests simultaneously against the same host. The other half of timing is reducing load you do not need to generate at all. Cache pages you have already fetched so you never request them twice, and scrape only the content you actually need rather than the whole site.
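A minimal sketch of those three habits (random delays, shuffled order, and caching) follows. The `jittered_delay`, `PageCache`, and `crawl` names are hypothetical, the delay bounds are arbitrary starting points, and the `fetch` callable stands in for whatever HTTP call your scraper makes.

```python
import random
import time


def jittered_delay(base: float = 2.0, spread: float = 1.5, rng=random) -> float:
    """Return a delay in seconds drawn uniformly from [base, base + spread]."""
    return base + rng.uniform(0, spread)


class PageCache:
    """In-memory cache so each URL is fetched at most once per run."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable: url -> body
        self._pages = {}

    def get(self, url: str):
        if url not in self._pages:
            self._pages[url] = self._fetch(url)
        return self._pages[url]


def crawl(urls, cache: PageCache, sleep=time.sleep) -> dict:
    """Visit URLs in shuffled order, pausing a random interval after each."""
    order = list(urls)
    random.shuffle(order)  # avoid a rigid crawl sequence
    results = {}
    for url in order:
        results[url] = cache.get(url)
        sleep(jittered_delay())
    return results
```

The `sleep` parameter exists only so the pause can be swapped out in tests; in normal use the default `time.sleep` applies.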
Respect robots.txt and avoid honeypots
Before any evasion technique, read the site's robots.txt. It declares which paths the operator is willing to have crawled and, sometimes, a Crawl-delay (a non-standard but widely recognised directive) that tells you the minimum interval they expect between requests. Honoring it is partly courtesy and partly self-preservation: ignoring declared rules is the fastest way to get flagged, and it is the line where terms-of-service questions start. Check the site's terms as well; if it explicitly forbids scraping, that is a signal to reconsider the target.
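Python's standard library can do the robots.txt check. The sketch below parses an inline example file so it runs without network access; the rules and the "my-scraper" agent name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Example rules; a real file would be downloaded from the target site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch() answers whether the named agent may request the URL.
allowed = parser.can_fetch("my-scraper", "https://example.com/products/1")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/report")

# crawl_delay() returns the declared delay in seconds, or None if absent.
delay = parser.crawl_delay("my-scraper")
```

Against a live site, `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` fetches and parses the file in place of the inline string.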
A related trap is the honeypot, a link hidden from human eyes by CSS (display:none, zero size, or off-screen positioning) but still present in the HTML. A naive crawler that follows every <a> tag walks straight into it and instantly outs itself as a bot, because no real user could have clicked a link they cannot see. Follow only the links a rendered browser would actually show, and skip anything visually hidden.
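Without a rendered browser, a crawler can at least skip links that are hidden by inline attributes. The sketch below is a partial filter under that assumption: the `VisibleLinkParser` class is hypothetical, it only inspects the `hidden` attribute and inline `style` on the link itself, and it will miss links hidden by a stylesheet, a hidden parent element, zero size, or off-screen positioning.

```python
from html.parser import HTMLParser

# Inline-style fragments that hide an element (compared with spaces removed).
HIDDEN_MARKERS = ("display:none", "visibility:hidden")


class VisibleLinkParser(HTMLParser):
    """Collect hrefs from <a> tags that are not hidden by inline attributes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        style = (attr.get("style") or "").replace(" ", "").lower()
        hidden = "hidden" in attr or any(m in style for m in HIDDEN_MARKERS)
        if attr.get("href") and not hidden:
            self.links.append(attr["href"])
```

Feeding a page through `VisibleLinkParser().feed(html)` leaves the surviving hrefs in `.links`. For targets that hide traps through CSS classes, only a real rendering engine can tell what is actually visible.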
Render JavaScript like a browser
Plenty of pages return nearly empty HTML and build their real content with JavaScript after load. Fetch one of those with a plain HTTP client and you get a shell with no data. Some sites go further and serve a JavaScript challenge: a small script that must run and pass before the real page is delivered, which a non-browser client can never clear.
For both cases you need a real browser engine. A headless browser such as Playwright, Puppeteer, or Selenium driving Chrome loads the page, runs its scripts, and hands you the DOM a user would see. It also produces a genuine browser TLS fingerprint and a real navigator object, so it clears a class of checks a raw client cannot. The cost is weight: a headless browser uses far more CPU and memory per page than a simple request, so reserve it for pages that genuinely need rendering. For a fuller walkthrough see how to crawl JavaScript websites and the Python-specific guide to scraping JavaScript pages with Python.
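One rough way to decide whether a page needs the headless path is to fetch it with a plain client first and check how much visible text came back. The heuristic below is only a sketch: the `probably_needs_rendering` helper is hypothetical, and the 200-character threshold is an arbitrary guess that should be tuned per target.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Count visible text characters and <script> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.text_chars = 0
        self.script_tags = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
            if tag == "script":
                self.script_tags += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_chars += len(data.strip())


def probably_needs_rendering(html: str, min_text_chars: int = 200) -> bool:
    """Guess that a page is a JavaScript shell: scripts present, little text."""
    counter = _TextCounter()
    counter.feed(html)
    return counter.script_tags > 0 and counter.text_chars < min_text_chars
```

A check like this lets a scraper send most URLs through the cheap HTTP path and escalate only the shells to a headless browser.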
Handle CAPTCHAs
A CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is a challenge a site shows when it suspects a request is automated. Many sites integrate algorithms that score each visitor and trigger a CAPTCHA when the score looks robotic. Once you hit one, no amount of header tuning gets you the data on that request.
The durable fix is to stop triggering them in the first place: a clean residential IP, a consistent fingerprint, and a human-like rate keep your robot score low enough that the challenge never fires. When a target shows them anyway, a managed scraping endpoint that solves or sidesteps CAPTCHAs server-side is far more reliable than wiring a solver into your own stack. If you want the detail, how to bypass CAPTCHAs in web scraping covers the options.
Once a target needs more than a clean IP, the Crawling API owns the whole stack for you: it rotates across a large pool of datacenter, residential, and mobile exits, sends a believable fingerprint, renders JavaScript when the page requires it, and handles CAPTCHAs and blocks server-side. If you only need the rotating IP layer, the Smart AI Proxy routes ordinary requests through the same network from one endpoint. You send a URL and get the finished result. Run your real target through it on the free tier first.
Manage sessions and cookies
Many sites set a cookie on the first visit and expect to see it on every request after. Discard cookies between requests and you look like a fresh, suspiciously stateless visitor each time, which trips behavioral checks that assume a real user accumulates state as they browse. Use a session that persists cookies across requests so a multi-step flow (search, paginate, open a detail page) carries the same identity throughout.
Sessions interact with IP rotation, so coordinate the two. If you rotate to a new IP mid-session, the cookie that was issued to your old address now arrives from a different one, which is itself a flag. Hold a sticky IP for the duration of a logical session, then rotate when you start a fresh one. The example below uses a requests.Session to keep cookies and headers consistent across calls.
```python
import requests

# `headers` is the dict defined in the earlier header example.
session = requests.Session()
session.headers.update(headers)

# Cookies set on the first call ride along on the rest.
session.get("https://example.com/search?q=phones", timeout=10)
session.get("https://example.com/search?q=phones&page=2", timeout=10)
```
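Coordinating sessions with rotation comes down to pinning each logical session to one exit address until it ends. A minimal sketch of that bookkeeping, with a hypothetical `StickyProxies` class and caller-chosen session IDs:

```python
import itertools


class StickyProxies:
    """Pin each logical session to one proxy until the session ends."""

    def __init__(self, proxies):
        proxies = list(proxies)
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)
        self._assigned = {}

    def proxy_for(self, session_id: str) -> str:
        # First call assigns the next proxy; later calls reuse it.
        if session_id not in self._assigned:
            self._assigned[session_id] = next(self._cycle)
        return self._assigned[session_id]

    def end(self, session_id: str) -> None:
        """Release the pin so the next session with this ID rotates."""
        self._assigned.pop(session_id, None)
```

With requests, the pinned address can be applied once to the same `requests.Session` that holds the cookies, for example through `session.proxies.update({"http": proxy, "https": proxy})`, so cookies and IP always travel together.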
Watch status codes and back off
Your scraper should treat HTTP status codes as live feedback, not just success or failure. A run of 429 (Too Many Requests) or 503 responses is the server telling you to slow down. Honor it: back off exponentially, respect the Retry-After header when the server sends one, and treat a burst of 429s as a signal to lower your overall rate rather than retry harder. Hammering a rate-limited endpoint at full speed is exactly how a soft throttle becomes a hard ban.
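Lowering the overall rate, as opposed to retrying a single request, can be sketched as a delay that doubles on every throttle response and shrinks slowly on success. The `AdaptiveDelay` class and its default numbers are assumptions chosen for illustration.

```python
class AdaptiveDelay:
    """Grow the inter-request delay on throttling, shrink it slowly on success."""

    def __init__(self, start: float = 1.0, floor: float = 0.5, ceiling: float = 60.0):
        self.delay = start
        self.floor = floor
        self.ceiling = ceiling

    def record(self, status_code: int) -> float:
        """Update the delay from one response and return the new value."""
        if status_code in (429, 503):
            # Back off fast: double the delay, capped at the ceiling.
            self.delay = min(self.delay * 2, self.ceiling)
        elif 200 <= status_code < 300:
            # Recover slowly: trim 10% per success, never below the floor.
            self.delay = max(self.delay * 0.9, self.floor)
        return self.delay
```

Calling `record()` after every response and sleeping for the returned value before the next request keeps the scraper near the fastest rate the target tolerates without repeatedly crossing it.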
Other codes carry their own meaning. A 403 usually means a fingerprint or IP-reputation block, so changing the request rate will not help; you need a better IP or a more believable fingerprint. A sudden 200 that returns a CAPTCHA page instead of content is a block in disguise, so validate the body, not just the code.
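```python
import random
import time


def fetch(session, url, tries=4):
    """GET a URL, backing off on 429/503 and raising on other errors."""
    for attempt in range(tries):
        resp = session.get(url, timeout=10)
        if resp.status_code == 200:
            return resp
        if resp.status_code in (429, 503):
            # Retry-After may be seconds or an HTTP date; fall back to
            # exponential backoff when it is missing or not a plain integer.
            retry_after = resp.headers.get("Retry-After", "")
            wait = int(retry_after) if retry_after.isdigit() else 2 ** attempt
            # Jitter keeps parallel workers from retrying in lockstep.
            time.sleep(wait + random.uniform(0, 1))
            continue
        resp.raise_for_status()
    raise RuntimeError("exhausted retries")
```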
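Validating the body can start as a simple check for block-page phrases and suspiciously short responses. The sketch below is a rough heuristic: the `looks_blocked` helper, the marker phrases, and the 500-character minimum are all assumptions, and a page that legitimately mentions a CAPTCHA would be a false positive.

```python
# Phrases that often appear on challenge or block pages (illustrative list).
BLOCK_MARKERS = (
    "captcha",
    "unusual traffic",
    "access denied",
    "verify you are human",
)


def looks_blocked(status_code: int, body: str, min_length: int = 500) -> bool:
    """Return True if the response is probably a block rather than content."""
    if status_code != 200:
        return True
    lowered = body.lower()
    if any(marker in lowered for marker in BLOCK_MARKERS):
        return True
    # A near-empty 200 is usually a shell or a silent block.
    return len(body) < min_length
```

A stricter alternative is to assert that an element you expect on every good page, such as a product title, is present, and treat its absence as a block.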
Scraping responsibly
Staying unblocked and scraping responsibly are the same discipline seen from two angles. Read each site's terms of service and robots.txt and respect what they declare, keep to public pages rather than anything behind a login, and hold your request rate to a level the target can serve without strain. Cache what you have already fetched so you do not re-request it, and pull only the data you actually need. A scraper that behaves like a considerate visitor is both far less likely to be blocked and far less likely to cause a problem worth blocking.
Key takeaways
- Blocks trace to three signals. Fingerprint, rate, and IP reputation cover nearly every block, and every tactic works by repairing one of them.
- Rotate residential IPs first. Most blocks are per-IP rate limits, so spreading requests across a pool of believable addresses is the cheapest, highest-impact fix.
- Keep your signals consistent. Realistic headers, a matching TLS fingerprint, persistent cookies, and a randomised rate are more convincing together than any one of them alone.
- Respect the site. Honor robots.txt and terms of service, avoid honeypots, stay on public data, and back off the moment a 429 or 503 tells you to.
- Offload when it gets hard. When a target fights back across rendering, CAPTCHAs, and reputation at once, a managed crawling API or smart proxy absorbs the whole stack so you do not maintain it yourself.
Frequently Asked Questions (FAQs)
Why does my web scraper keep getting blocked?
Because its traffic does not look like a real browser on at least one of three axes: fingerprint, rate, or IP reputation. The request might come from a flagged datacenter IP, carry a thin or contradictory set of headers, or arrive faster and more regularly than a human could click. Anti-bot systems need only one of those tells to return a 403, a CAPTCHA, or an empty page.
What is the single most effective way to avoid blocked web scraping requests?
Rotating across a pool of good IPs, ideally residential ones for strict targets. The most common block is a per-IP rate limit, and spreading the same volume of requests across many addresses means no single one ever crosses the threshold. It is the cheapest fix with the largest impact, which is why it is usually the first technique to apply before tuning headers or timing.
Is changing the user-agent enough to stop blocks?
On the least defended sites, sometimes; on anything serious, no. A realistic user-agent has to be paired with the full set of headers a browser sends, a TLS fingerprint that matches that browser, persistent cookies, and a believable request rate. A spoofed user-agent over a default HTTP-client TLS stack is a contradiction that fingerprint checks catch easily.
How should I handle a 429 Too Many Requests response?
Slow down rather than retry harder. Back off exponentially, respect the Retry-After header when the server sends one, and treat a run of 429s as a signal to lower your overall request rate. Hammering a rate-limited endpoint at full speed is how a temporary throttle turns into a permanent ban.
Do I need a headless browser to avoid getting blocked?
Only when the page builds its content with JavaScript after load or serves a JavaScript challenge a plain client cannot pass. A headless browser renders the page and produces a genuine browser fingerprint, which clears checks a raw request cannot, but it costs far more CPU and memory per page. For static HTML, a well-configured HTTP request is faster, cheaper, and just as unblocked.
When does a managed scraping API make more sense than building my own?
When a target fights back across several layers at once. Maintaining a residential proxy pool, header and TLS rotation, cookie sessions, backoff logic, a headless fleet, and a CAPTCHA path is a real engineering burden, and a new challenge can break it overnight. A crawling API or smart proxy absorbs all of that behind one request, so you trade per-request cost and some control for not running anti-bot infrastructure yourself.
