A scraper gets blocked because it does not look like the traffic the target expects. Real browsers carry a consistent set of signals: a plausible IP, a full header set, a TLS fingerprint that matches the User-Agent they claim, and a request rhythm that is not a metronome. Strip any of those away and a modern anti-bot system notices. Most of the work in scraping without getting blocked is putting those signals back, in the right order, and only paying for the heavy ones when the target actually demands them.
This piece walks the techniques in roughly the order they pay off: rotate your IP, send a believable request, smooth out your rate, honor what the site declares, render JavaScript when the page needs it, and hand the whole thing to a managed endpoint when a few headers no longer cut it. None is a silver bullet. Stacked in the right order they get you from a 403 wall to a steady 200, on most targets, most of the time.
The fastest wins, in order
| Technique | Stops which block | Effort |
|---|---|---|
| Rotate your IP | Per-IP rate limits, hard IP bans | Low |
| Send realistic headers + User-Agent | Naive bot fingerprinting | Low |
| Throttle and back off | Velocity-based detection, 429s | Low |
| Match TLS to your User-Agent | Fingerprint mismatch checks | Medium |
| Render JavaScript | Empty HTML, JS challenges | Medium |
| Hand it to a managed API | The whole stack at once | Lowest, paid |
Start at the top, measure your block rate, and only climb when the target makes you. Reaching for a headless browser fleet to scrape a static price page is wasted effort; reaching for plain requests against a hardened login wall is wasted requests.
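The "measure, then climb" rule can be sketched as a small decision helper. This is a minimal illustration, not a standard tool: the tier names simply mirror the table above, and both the set of statuses counted as blocks and the 5% threshold are assumptions you would tune per target.

```python
# Hypothetical tier names mirroring the table above; not a library API.
TIERS = [
    "rotate_ip",
    "realistic_headers",
    "throttle_and_backoff",
    "match_tls",
    "render_javascript",
    "managed_api",
]

BLOCK_STATUSES = {403, 429, 503}  # assumption: statuses treated as "blocked"


def block_rate(statuses):
    """Fraction of responses in a sample that look like blocks."""
    if not statuses:
        return 0.0
    blocked = sum(1 for code in statuses if code in BLOCK_STATUSES)
    return blocked / len(statuses)


def next_tier(current, statuses, threshold=0.05):
    """Stay on the current tier unless the block rate exceeds the threshold."""
    if block_rate(statuses) <= threshold:
        return current
    index = TIERS.index(current)
    return TIERS[min(index + 1, len(TIERS) - 1)]


sample = [200] * 90 + [429] * 10            # 10% of the sample was blocked
print(next_tier("rotate_ip", sample))       # climbs one rung
print(next_tier("rotate_ip", [200] * 50))   # stays where it is
```

Status codes alone undercount blocks, since some sites serve a block page with a 200, so in practice you would likely also check the response body for the content you expected.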
Rotate your IP
The single most common block is the simplest: too many requests from one address. A site counts requests per IP and starts returning 429s or a block page once you cross its threshold. Spread those requests across many IPs and no single address ever trips the limit. This is why scraping infrastructure is mostly proxy infrastructure: the proxy makes the request for you, so the target sees its IP, not yours.
The IP you rotate through matters as much as the rotation itself. Datacenter IPs are fast and cheap but sit in known hosting ranges, so a target that runs an ASN lookup flags them instantly. Residential IPs exit from real consumer connections and read as ordinary visitors, at higher cost and lower speed. The full tradeoff is in datacenter vs residential proxies, and the static-residential middle ground in ISP vs residential proxies. Buy exactly as much trust as the target demands and not a tier more.
Rotating IPs by hand means maintaining a list and cycling through it per request. A rotating proxy gateway hides that behind one endpoint and swaps the exit IP for you, either per request or sticky per session when you need to hold one identity.
    # Rotate exits through a single gateway endpoint.
    # The gateway picks a fresh IP; your logic stays here.
    import requests

    proxies = {
        "http": "http://_USER_TOKEN_:@smartproxy.crawlbase.com:8012",
        "https": "http://_USER_TOKEN_:@smartproxy.crawlbase.com:8012",
    }

    # verify=False turns off certificate verification for this request;
    # keep it only if the gateway requires it.
    resp = requests.get("https://example.com/product/123", proxies=proxies, verify=False)
    print(resp.status_code)
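For comparison, here is roughly what rotating by hand looks like when you manage the list yourself. It is a sketch under stated assumptions: the proxy addresses are placeholders from a documentation IP range, and `ProxyRotator` is a hypothetical helper, not part of `requests`. It covers both modes mentioned above, a fresh exit per request and a sticky exit per session.

```python
import itertools
import zlib

# Placeholder proxy URLs (documentation address range); use your own pool.
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


class ProxyRotator:
    """Hypothetical helper that hands out proxies dicts for requests."""

    def __init__(self, pool):
        self._pool = list(pool)
        self._cycle = itertools.cycle(self._pool)

    @staticmethod
    def _as_proxies(proxy_url):
        # requests expects one entry per URL scheme.
        return {"http": proxy_url, "https": proxy_url}

    def per_request(self):
        """Round-robin: every call returns the next exit in the pool."""
        return self._as_proxies(next(self._cycle))

    def sticky(self, session_id):
        """Same session id always maps to the same exit."""
        index = zlib.crc32(session_id.encode("utf-8")) % len(self._pool)
        return self._as_proxies(self._pool[index])


rotator = ProxyRotator(PROXY_POOL)
print(rotator.per_request()["https"])
print(rotator.per_request()["https"])
print(rotator.sticky("cart-42") == rotator.sticky("cart-42"))
```

You would pass the returned dict as `proxies=` to `requests.get`. A real pool also needs health checks and eviction of banned exits, which is the maintenance a gateway takes off your hands.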
Send a request a real browser would send
A default HTTP client gives itself away in the first line. The Python requests library sends User-Agent: python-requests/2.x and almost no other headers; a real browser sends a dozen, in a specific order. Sites that do nothing more than read those headers will block the first and pass the second.
Set a current, real browser User-Agent and rotate through a small pool of them rather than hammering one string. Then send the headers that always travel with it: Accept, Accept-Language, Accept-Encoding, and a plausible Referer. The goal is not a single magic header; it is internal consistency. A Chrome User-Agent paired with Firefox-style Accept headers is more suspicious than no spoofing at all.
    import requests

    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        # Advertising "br" only works if a Brotli decoder (the brotli package)
        # is installed; otherwise the body may come back undecoded.
        "Accept-Encoding": "gzip, deflate, br",
        "Referer": "https://www.google.com/",
    }

    resp = requests.get("https://example.com", headers=headers)
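Rotating User-Agents safely means rotating whole header sets, so that each User-Agent only ever travels with the Accept headers of the same browser. A minimal sketch of that idea follows; `BROWSER_PROFILES` and `pick_headers` are hypothetical names, and the header strings are illustrative values that go stale, so capture current ones from real browsers before relying on them.

```python
import random

# Illustrative profiles; the exact strings are assumptions and will age.
BROWSER_PROFILES = [
    {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
        "Accept": (
            "text/html,application/xhtml+xml,application/xml;q=0.9,"
            "image/avif,image/webp,image/apng,*/*;q=0.8"
        ),
        "Accept-Language": "en-US,en;q=0.9",
    },
    {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) "
            "Gecko/20100101 Firefox/125.0"
        ),
        "Accept": (
            "text/html,application/xhtml+xml,application/xml;q=0.9,"
            "image/avif,image/webp,*/*;q=0.8"
        ),
        "Accept-Language": "en-US,en;q=0.5",
    },
]


def pick_headers(rng=random, referer=None):
    """Return one complete profile; never mix fields across browsers."""
    headers = dict(rng.choice(BROWSER_PROFILES))  # copy so the pool stays intact
    if referer:
        headers["Referer"] = referer
    return headers


headers = pick_headers(referer="https://www.google.com/")
print(headers["User-Agent"])
```

Picking a profile once per session, rather than per request, keeps a single visitor from appearing to switch browsers mid-visit.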
Match your TLS fingerprint to your User-Agent
Headers are the obvious layer; TLS is the one that catches scrapers who fixed the headers and stopped there. Before a single HTTP byte is sent, your client opens a TLS handshake, and the exact shape of that handshake (cipher order, extensions, supported groups) forms a fingerprint commonly summarized as a JA3 hash. Real Chrome produces a small, well-known family of fingerprints (recent versions shuffle their extension order, which is why order-insensitive schemes such as JA4 exist alongside JA3). Python's requests produces a completely different one. When you send a Chrome User-Agent over a Python TLS stack, the two disagree, and a fingerprint check flags the mismatch no matter how perfect your headers are.
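To make the fingerprint idea concrete, here is a sketch of how a JA3 hash is assembled: five fields from the ClientHello are written as decimal numbers, joined with dashes inside each field and commas between fields, then hashed with MD5. The numeric values below are made up for illustration and do not correspond to any real browser or library.

```python
import hashlib


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the comma-separated JA3 input string from ClientHello fields."""
    groups = (ciphers, extensions, curves, point_formats)
    fields = [str(version)]
    fields += ["-".join(str(value) for value in group) for group in groups]
    return ",".join(fields)


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """MD5 of the JA3 string, as a 32-character hex digest."""
    text = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(text.encode("ascii")).hexdigest()


# Two hypothetical clients offering the same ciphers in a different order.
client_a = ja3_hash(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0])
client_b = ja3_hash(771, [4867, 4866, 4865], [0, 23, 65281], [29, 23, 24], [0])
print(client_a)
print(client_b)
print(client_a == client_b)
```

Because order is part of the input, two clients that support the same features still hash differently, which is why changing headers alone does nothing to this layer.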
The fix is to make the handshake itself look like a browser. Use a client that mimics a real browser's TLS profile (in Python, curl_cffi with its impersonate option is the common choice), or drive a real browser engine, which produces a genuine handshake for free. This is where do-it-yourself scraping starts getting expensive, and where a managed endpoint that already handles fingerprints starts looking attractive.
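A minimal sketch of the curl_cffi route, assuming the package is installed. `fetch_as_browser` is a hypothetical wrapper name, and the profile names accepted by `impersonate` depend on the curl_cffi version you have, so treat `"chrome"` as an assumption and check what your installed version supports.

```python
def fetch_as_browser(url, profile="chrome", timeout=30):
    """Fetch a URL with a browser-like TLS handshake via curl_cffi."""
    # Imported lazily so this module still loads where curl_cffi is absent.
    from curl_cffi import requests as browser_requests

    # Deliberately no custom User-Agent here: the impersonation profile
    # should supply headers that agree with the handshake it produces.
    return browser_requests.get(url, impersonate=profile, timeout=timeout)


# Example (performs a real network request):
#   resp = fetch_as_browser("https://example.com")
#   print(resp.status_code)
```

The design point is that the User-Agent and the handshake come from the same profile. Overriding the User-Agent by hand would reintroduce exactly the mismatch this section warns about.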
Anti-bot systems rarely block on one bad signal; they block on signals that contradict each other. A datacenter IP with a perfect browser header set, a Chrome User-Agent with a Python TLS fingerprint, a desktop User-Agent with a mobile client hint (Sec-CH-UA-Mobile: ?1): each contradiction is a flag. Aim for a request where the IP, headers, TLS, and behavior all tell the same story.
Throttle and back off
Even across many IPs, a scraper that fires requests faster than any human could click reads as automated. Add a randomized delay between requests rather than a fixed one (a fixed 500ms gap is itself a fingerprint), and keep concurrency to a level the target can absorb without noticing.
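A randomized delay and a concurrency cap can be sketched with the standard library alone. The function names and default numbers here are assumptions chosen for illustration; the right delay and worker count depend entirely on the target.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def jittered_delay(base=1.5, spread=1.0, rng=random):
    """Return a delay in [base - spread, base + spread], never below zero."""
    return max(0.0, base + rng.uniform(-spread, spread))


def polite_map(fetch, urls, max_workers=3, base=1.5, spread=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL with capped concurrency and random gaps."""

    def task(url):
        sleep(jittered_delay(base, spread))  # each worker pauses before fetching
        return fetch(url)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, urls))  # results keep the input order


# Example with a stand-in fetch function and sleeping disabled:
pages = polite_map(lambda url: f"fetched {url}", ["/a", "/b", "/c"], sleep=lambda s: None)
print(pages)
```

Passing `sleep` as a parameter keeps the pacing logic testable without real waiting. In production you would leave the default `time.sleep` in place.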
More important than the steady-state delay is how you react to pushback. When a server returns 429 or 503, it is telling you to slow down. Honor it: back off exponentially, respect the Retry-After header when present, and treat a burst of 429s as a signal to drop your overall rate, not to retry harder. Retrying a rate-limited endpoint at full speed is how a soft throttle becomes a hard ban.
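    import random
    import time

    import requests

    def fetch(url, headers, tries=4):
        for attempt in range(tries):
            resp = requests.get(url, headers=headers, timeout=30)
            if resp.status_code in (429, 503):
                # Prefer the server's Retry-After in seconds. It can also be an
                # HTTP-date or missing, in which case back off exponentially.
                try:
                    wait = int(resp.headers.get("Retry-After", ""))
                except ValueError:
                    wait = 2 ** attempt
                time.sleep(wait + random.uniform(0, 1))  # jitter avoids lockstep retries
                continue
            resp.raise_for_status()  # other 4xx/5xx errors are not retried
            return resp
        raise RuntimeError("exhausted retries")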
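Per-request retries handle a single 429; lowering your overall rate after a burst needs state that outlives one request. One way to sketch that is a throttle that multiplies its delay on pushback and relaxes it slowly on success. `AdaptiveThrottle` and its default numbers are assumptions for illustration, not a library class.

```python
class AdaptiveThrottle:
    """Raise the delay sharply on pushback, lower it gently on success."""

    def __init__(self, min_delay=0.5, max_delay=60.0, factor=2.0, recovery=0.9):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.factor = factor        # multiplier applied after a 429 or 503
        self.recovery = recovery    # multiplier applied after any other status
        self.delay = min_delay

    def record(self, status_code):
        """Update and return the delay to use before the next request."""
        if status_code in (429, 503):
            self.delay = min(self.max_delay, self.delay * self.factor)
        else:
            self.delay = max(self.min_delay, self.delay * self.recovery)
        return self.delay


throttle = AdaptiveThrottle()
for status in (200, 429, 429, 429, 200, 200):
    print(status, round(throttle.record(status), 2))
```

Sharing one throttle across all workers hitting the same host means a burst of 429s slows the whole crawl, instead of each worker retrying on its own schedule.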
If you are stuck deciphering which status code means what, proxy status error codes walks through the common ones and what each is actually telling you.
Honor robots.txt and stay on public data
Alongside the evasion techniques, one discipline keeps you out of trouble: read the site's robots.txt, respect its Crawl-delay directive and disallowed paths, and scrape public pages rather than anything behind a login. This is partly courtesy and partly self-preservation. Authenticated scraping ties every request to an account the site can ban in one click, and ignoring declared rules is both the fastest way to get flagged and the line where legal questions start.
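Python's standard library can read robots.txt for you. The sketch below parses an inline sample so it runs offline; the crawler name and the sample rules are assumptions. Against a live site you would call `parser.set_url(...)` followed by `parser.read()` instead of `parser.parse(...)`.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-scraper"  # hypothetical crawler name

# Inline sample rules so the sketch runs without network access.
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 5
Disallow: /account/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())


def allowed(url):
    """True if the parsed rules permit this crawler to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)


def crawl_delay(default=1.0):
    """Declared Crawl-delay in seconds, or a fallback if none is declared."""
    delay = parser.crawl_delay(USER_AGENT)
    return default if delay is None else delay


print(allowed("https://example.com/products/123"))
print(allowed("https://example.com/account/orders"))
print(crawl_delay())
```

Checking `allowed(url)` before queuing a URL, and using `crawl_delay()` as the floor for your request spacing, covers both halves of what the file declares.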
A related trap is the honeypot: a link hidden from human eyes by CSS (display:none, zero size, off-screen positioning) but present in the HTML. A naive crawler that follows every <a> walks straight into it and outs itself as a bot. Only follow links a rendered browser would actually show, and skip anything visually hidden.
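A partial defense that works on raw HTML is to skip links hidden by inline styles or the `hidden` attribute, including links nested inside a hidden ancestor. This sketch uses only the standard library and is deliberately limited: it cannot see rules in external stylesheets, zero-size elements, or off-screen positioning, which need a rendered browser to evaluate. `VisibleLinkParser` and `visible_links` are hypothetical names.

```python
import re
from html.parser import HTMLParser

HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class VisibleLinkParser(HTMLParser):
    """Collect hrefs of <a> tags not hidden inline or by an ancestor."""

    def __init__(self):
        super().__init__()
        self.stack = []   # (tag, is_hidden) for each currently open element
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        parent_hidden = bool(self.stack) and self.stack[-1][1]
        hidden = (
            parent_hidden
            or "hidden" in attrs
            or bool(HIDDEN_STYLE.search(attrs.get("style") or ""))
        )
        if tag == "a" and not hidden and attrs.get("href"):
            self.links.append(attrs["href"])
        if tag not in VOID_TAGS:  # void elements never get a closing tag
            self.stack.append((tag, hidden))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def visible_links(html):
    parser = VisibleLinkParser()
    parser.feed(html)
    return parser.links


page = (
    '<div><a href="/real">Shop</a>'
    '<div style="display: none"><a href="/trap">x</a></div>'
    '<a href="/hidden" hidden>y</a></div>'
)
print(visible_links(page))
```

Treat this as a first filter for plain HTTP crawls. When you are already rendering pages in a browser, asking the browser whether an element is visible is the more reliable check.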
Render JavaScript when the page needs it
Plenty of pages return almost empty HTML and build their real content with JavaScript after load. Fetch one of those with a plain HTTP client and you get a shell with no data. Worse, some sites serve a JavaScript challenge: a small script that must run and pass before the real page is delivered, which a non-browser client can never clear.
For both cases you need a real browser engine. A headless browser (Playwright, Puppeteer, or Selenium driving Chrome) loads the page, runs its scripts, and hands you the DOM the user would see. It also produces a genuine browser TLS fingerprint and a real navigator object, so it clears a class of checks a raw client cannot. The cost is weight: a headless browser uses far more CPU and memory per page than an HTTP request, so reserve it for pages that genuinely need rendering. For a deeper walkthrough see web scraping with Python and Selenium.
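One way to reserve the browser for pages that need it is to fetch with a plain client first and escalate only when the response looks like an empty shell. The sketch below assumes Playwright is installed for the rendering step. The 200-character threshold and the function names are assumptions, and the shell check is a rough heuristic, not a reliable detector.

```python
import re

_SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.I | re.S)
_TAGS = re.compile(r"<[^>]+>")


def visible_text_length(html):
    """Rough count of text characters left after dropping scripts and tags."""
    text = _TAGS.sub(" ", _SCRIPT_STYLE.sub(" ", html))
    return len(" ".join(text.split()))


def looks_like_js_shell(html, min_chars=200):
    """Heuristic: very little visible text suggests client-side rendering."""
    return visible_text_length(html) < min_chars


def render_with_browser(url, timeout_ms=30000):
    """Load the page in headless Chromium and return the rendered DOM."""
    # Imported lazily so the cheap path works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()


def fetch_html(url, http_get, render=render_with_browser, min_chars=200):
    """Try the cheap client first; pay for a browser only on a shell page."""
    html = http_get(url)
    if looks_like_js_shell(html, min_chars):
        html = render(url)
    return html
```

Here `http_get` is any callable that returns the page body as text, for example `lambda u: requests.get(u, timeout=30).text`. Launching a browser per page is the simplest form; a long-lived browser with fresh pages is cheaper once you render many URLs.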
One caveat: a default headless browser is detectable. The navigator.webdriver flag, missing or odd plugin lists, and headless-specific quirks all leak. Stealth plugins paper over the common tells, but it is an arms race, and on a hardened target it is often the moment to stop maintaining your own fleet.
When to hand it to a managed API
Each technique above is a layer you build and maintain: a proxy pool, a header rotator, a TLS-mimicking client, a backoff policy, a headless fleet with stealth patches. On tolerant targets you may need only the first two. On a hardened one, you end up assembling and babysitting all of them, and a CAPTCHA or a new JS challenge can break the whole pipeline overnight.
A crawling API collapses that stack into one request. You send a URL; the provider picks the IP origin, sends a consistent fingerprint, renders the page when a browser is required, retries on blocks server-side, and returns the finished HTML. The tradeoff is honest: you pay per request and give up some low-level control, in exchange for not running anti-bot infrastructure as a second job.
When a target needs more than a clean IP, the Crawling API owns the whole stack: it rotates across a 140M+ IP pool of datacenter, residential, and mobile exits, sends a believable fingerprint, renders JavaScript when the page requires it, and retries on blocks server-side. You send a URL and get the result. Run your real target through it on the free tier first.
    # Send the URL; rotation, fingerprint, rendering,
    # and retries are handled server-side.
    import requests

    resp = requests.get(
        "https://api.crawlbase.com/",
        params={
            "token": "_YOUR_TOKEN_",
            "url": "https://example.com/product/123",
            "javascript": "true",  # render the page in a browser
        },
    )
    print(resp.text)
Whether you build or buy, the proxy question underneath does not go away. If you are still choosing the IP layer, best proxies for web scrapers maps target types to the proxy type that fits, and how to use rotating proxies covers wiring rotation into your code.
Key takeaways
- Blocks come from inconsistency. Make your IP, headers, TLS, and timing all tell the same story; one contradiction is enough to get flagged.
- Rotate IPs first. Most blocks are per-IP rate limits, and spreading requests across a pool is the cheapest, highest-impact fix.
- Fix headers and TLS together. A browser User-Agent over a Python TLS stack is more suspicious than no spoofing at all.
- Respect the site. Honor robots.txt, back off on 429s, avoid honeypots, and stay on public data.
- Render only when needed, and offload when it gets hard. Reserve headless browsers for JS-heavy pages, and reach for a managed API once a target fights back across every layer.
Frequently Asked Questions (FAQs)
What is the most common reason a scraper gets blocked?
Too many requests from one IP address. Sites count requests per IP and start returning 429s or block pages once you cross a threshold. Rotating requests across a pool of IPs so no single address trips the limit is the single highest-impact fix, which is why IP rotation is usually the first technique to apply.
Is changing the User-Agent enough to avoid blocks?
On the least defended sites, sometimes. On anything serious, no. A realistic User-Agent has to be paired with the full set of headers a browser sends, a TLS fingerprint that matches that browser, and a believable request rate. A spoofed User-Agent over a default HTTP-client TLS stack is a contradiction that fingerprint checks catch easily.
Do I always need a headless browser to scrape?
No. A headless browser is only needed when the page builds its content with JavaScript after load, or serves a JavaScript challenge a non-browser client cannot pass. For static HTML, a plain HTTP request is far faster and cheaper. Reserve the headless browser for pages that genuinely require rendering, since it costs much more CPU and memory per page.
How do I handle a 429 Too Many Requests response?
Slow down rather than retry harder. Back off exponentially, respect the Retry-After header when the server sends one, and treat a run of 429s as a signal to lower your overall request rate. Hammering a rate-limited endpoint at full speed is how a temporary throttle turns into a hard ban.
Should I scrape data behind a login?
Avoid it where you can. Authenticated requests tie every call to an account the site can ban instantly, and they raise legal and terms-of-service questions that public-page scraping does not. Read the site's robots.txt, stay on public data, and skip honeypot links hidden from real users.
When does a managed scraping API make more sense than building my own?
When the target fights back across multiple layers at once. Maintaining a proxy pool, header rotation, a TLS-mimicking client, backoff logic, and a headless fleet with stealth patches is a real engineering burden, and a new CAPTCHA or challenge can break it overnight. A crawling API absorbs all of that behind one request, so you trade per-request cost and some control for not running anti-bot infrastructure yourself.
