Every crawler engineer has had this conversation. A site that worked yesterday now returns 403. The same code, the same proxy, the same headers, and somewhere, quietly, a system has decided your traffic is not human. The instinct is to find the one thing that broke. The reality is that nothing broke. The detection improved.
Modern anti-bot systems are not gatekeepers. They are probability engines. Each request is evaluated across dozens of independent signal surfaces, and a single composite score decides whether you're allowed through, soft-throttled, presented with a challenge, or returned a poisoned response. Understanding that score (what feeds into it, how the weights shift, and where the easy wins for evasion actually live) is the difference between a crawler that survives a quarter and one that survives a decade.
This piece is the systems-level view. Not a list of headers to spoof. Not "five proxies to try." A map of the territory, the way we explain it internally at Crawlbase when a new engineer joins the platform team.
The three detection surfaces
Bot detection happens at three layers, and they don't share a vocabulary. A fingerprint that passes the first layer can fail the second. A session that passes both can fail the third on its eighteenth request. The layers are roughly:
- Transport-layer fingerprinting: what your TLS handshake reveals about the stack underneath
- Protocol-layer inspection: what your HTTP behavior says about the client
- Behavioral modeling: what your session looks like over time
They are evaluated independently and combined statistically. We'll walk through each.
TLS fingerprinting
The first signal arrives before your HTTP request is even parsed. When your client opens a TLS connection, it sends a ClientHello message that lists the cipher suites it supports, the extensions it advertises, the elliptic curves it prefers, and the order of all of these. That ordering is determined by the implementation: OpenSSL produces one fingerprint, Go's crypto/tls another, and Chrome's BoringSSL-based stack yet another. JA4 hashes the relevant components into a single string.
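To see why the handshake is so identifying, here is a minimal sketch of a JA3-style fingerprint, the older and simpler predecessor of JA4. It is not JA4 itself, which uses a different, more structured format, and it omits details a real implementation handles, such as stripping GREASE values. The code points below are illustrative inputs, not a capture from a real client.

```python
import hashlib


def ja3_style_fingerprint(tls_version, ciphers, extensions, curves, point_formats):
    """Build a JA3-style string and its MD5 digest from ClientHello fields.

    Every list holds integer code points in the order the client sent them.
    Order is preserved, so two stacks offering the same ciphers in a
    different order produce different digests.
    """
    fields = [
        str(tls_version),
        "-".join(str(c) for c in ciphers),
        "-".join(str(e) for e in extensions),
        "-".join(str(g) for g in curves),
        "-".join(str(p) for p in point_formats),
    ]
    raw = ",".join(fields)
    return raw, hashlib.md5(raw.encode("ascii")).hexdigest()


# Same cipher suites, different preference order (illustrative values only).
client_a = ja3_style_fingerprint(771, [4865, 4866, 4867], [0, 10, 11], [29, 23, 24], [0])
client_b = ja3_style_fingerprint(771, [4866, 4865, 4867], [0, 10, 11], [29, 23, 24], [0])

print(client_a[0])
print(client_a[1] == client_b[1])
```

The two clients differ only in the order of their first two cipher suites, yet their digests do not match. That is the property a detector relies on: the digest identifies the library, not the request.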
The consequence: if you make a request from Python's requests library through a residential proxy, the proxy provides a beautifully residential IP, but the TLS fingerprint announces "OpenSSL via Python 3.11" to anyone listening. The detection system doesn't need to look at your IP, your headers, or your behavior. The handshake alone tells it what you are.
The popular workaround, patching requests to use a Chrome-like cipher suite list, fails for the same reason it succeeds: every crawler now does it. Anti-bot vendors maintain inventories of "Chrome but actually Python" fingerprints. The mismatch between the claimed user-agent and the real handshake is itself a signal.
HTTP profile inspection
Past the TLS layer, the HTTP request itself is interrogated. Not the content: the shape. Real browsers send headers in a specific order. Chrome sends :method before :authority; Firefox reverses two of the lower-priority entries. HTTP/2 introduces frame ordering and stream prioritization that vary by client.
Here's what an HTTP/2 fingerprint actually looks like to a detector:
```jsonc
{
  "akamai_hash": "1:65536;3:1000;4:6291456;6:262144|15663105|0|m,a,s,p",
  "h2_settings": {
    "HEADER_TABLE_SIZE": 65536,
    "MAX_CONCURRENT_STREAMS": 1000,
    "INITIAL_WINDOW_SIZE": 6291456,
    "MAX_HEADER_LIST_SIZE": 262144
  },
  "header_order": [":method", ":authority", ":scheme", ":path"],
  "pseudo_headers": "m,a,s,p",  // Chrome canonical
  "frame_priority": [256, 255, 254, 253, 252]
}
```
That blob is enough to identify "Chrome 122 on macOS" with high confidence, independent of the user-agent string. The Akamai hash, in particular, is the de facto standard for HTTP/2 fingerprinting and is checked by virtually every major CDN.
The trap here is more subtle than at the TLS layer. You can rotate IPs all day; you can swap user-agents per request. But if your HTTP/2 client always advertises the same INITIAL_WINDOW_SIZE regardless of which browser you claim to be, you'll fail the consistency check long before you fail the fingerprint check.
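A consistency check of this kind can be sketched in a few lines. The parser below assumes the four-part `SETTINGS|WINDOW_UPDATE|PRIORITY|pseudo-header order` layout of the hash shown earlier. The `EXPECTED` table and the `is_consistent` helper are hypothetical, and the Firefox row holds placeholder values for illustration rather than measured ones.

```python
def parse_h2_fingerprint(fp):
    """Split an Akamai-style HTTP/2 fingerprint string into its four parts."""
    settings_part, window_update, priority, pseudo_order = fp.split("|")
    settings = {}
    for pair in settings_part.split(";"):
        setting_id, value = pair.split(":")
        settings[int(setting_id)] = int(value)
    return {
        "settings": settings,              # SETTINGS id -> advertised value
        "window_update": int(window_update),
        "priority": priority,
        "pseudo_order": pseudo_order,
    }


# SETTINGS identifier 4 is INITIAL_WINDOW_SIZE in the HTTP/2 specification.
INITIAL_WINDOW_SIZE = 4

# Hypothetical expectations per claimed browser family.
EXPECTED = {
    "chrome": {"initial_window_size": 6291456, "pseudo_order": "m,a,s,p"},
    "firefox": {"initial_window_size": 131072, "pseudo_order": "m,p,a,s"},  # placeholder
}


def is_consistent(claimed_family, fp):
    """True when the observed HTTP/2 behavior matches the claimed browser."""
    observed = parse_h2_fingerprint(fp)
    expected = EXPECTED[claimed_family]
    return (
        observed["settings"].get(INITIAL_WINDOW_SIZE) == expected["initial_window_size"]
        and observed["pseudo_order"] == expected["pseudo_order"]
    )


observed_fp = "1:65536;3:1000;4:6291456;6:262144|15663105|0|m,a,s,p"
print(is_consistent("chrome", observed_fp))   # claim matches the handshake
print(is_consistent("firefox", observed_fp))  # claim contradicts the handshake
```

The same observed fingerprint passes when the user-agent claims Chrome and fails when it claims Firefox. Rotating the user-agent without rotating the HTTP/2 stack underneath it creates exactly that contradiction.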
"The goal is not to avoid blocks. The goal is to make your system antifragile to them."
From our internal engineering handbook
Behavioral signals
The third surface is the one that grows over time and is, by far, the hardest to fake convincingly at scale. Detection systems build a model of what a session looks like. Real users navigate. They click backwards. They open a link in a new tab, leave it for forty seconds, then close it. They request favicon.ico on the first hit and not again. They occasionally fail to load a stylesheet and request it twice. Their inter-request timing has jitter that follows a recognizable distribution: log-normal in most measurements.
Crawlers, especially production-grade ones tuned for throughput, do almost none of this. We request the article. We extract the data. We move on. We do not browse. We do not idle. We do not click ads we have no interest in.
If your crawler's request intervals have a coefficient of variation below 0.3, you are visibly automated even with perfect fingerprints. Real users sit at roughly 0.8–1.4. The fix isn't to add random sleeps; it's to model arrival times as a stochastic process and sample from it.
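As a minimal sketch of sampling from a timing model, the code below draws inter-request delays from a log-normal distribution parameterized by a target mean and coefficient of variation. It relies on the log-normal identity CV = sqrt(exp(sigma^2) - 1). The 8-second mean and the target CV of 1.0 are assumptions chosen for illustration. A production model would likely also condition on page type and session state.

```python
import math
import random
import statistics


def lognormal_params(mean_seconds, cv):
    """Return (mu, sigma) of the underlying normal for a target mean and CV."""
    sigma_sq = math.log(1.0 + cv * cv)            # from CV^2 = exp(sigma^2) - 1
    mu = math.log(mean_seconds) - sigma_sq / 2.0  # from mean = exp(mu + sigma^2 / 2)
    return mu, math.sqrt(sigma_sq)


def sample_delays(n, mean_seconds=8.0, cv=1.0, seed=None):
    """Draw n inter-request delays, in seconds, from the log-normal model."""
    rng = random.Random(seed)
    mu, sigma = lognormal_params(mean_seconds, cv)
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]


def coefficient_of_variation(samples):
    """Sample standard deviation divided by sample mean."""
    return statistics.stdev(samples) / statistics.mean(samples)


delays = sample_delays(20000, mean_seconds=8.0, cv=1.0, seed=42)
observed_cv = coefficient_of_variation(delays)
print(round(statistics.mean(delays), 2), round(observed_cv, 2))
```

In use, a crawler would sleep for the next sampled delay between requests within a session, and monitor the coefficient of variation of its real intervals the same way.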
The joint probability model
Here is the thing that took us longest to internalize, and the reason most "anti-detect" advice ages poorly: the three layers are not each checked against their own pass/fail threshold. Their scores are combined.
A modern detection pipeline outputs something like:
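```python
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    THROTTLE = "throttle"
    CHALLENGE = "challenge"
    BLOCK = "block"


def verdict(request, session) -> Verdict:
    # score_tls, score_http and score_session stand in for the detector's
    # per-surface models. Higher scores mean "more bot-like".
    tls_score = score_tls(request.handshake)    # 0.0 to 1.0
    http_score = score_http(request.profile)    # 0.0 to 1.0
    behavior_score = score_session(session)     # 0.0 to 1.0

    # Weights are tuned per-customer, per-route, per-hour.
    composite = (
        0.30 * tls_score
        + 0.25 * http_score
        + 0.45 * behavior_score
    )

    if composite > 0.85:
        return Verdict.BLOCK
    if composite > 0.60:
        return Verdict.CHALLENGE
    if composite > 0.40:
        return Verdict.THROTTLE
    return Verdict.ALLOW
```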
Three things follow from this structure that most engineers miss.
First, you do not need to win on every surface. A scraper with a slightly off TLS signature, slightly weird headers, and excellent session behavior can score lower than a scraper with perfect TLS and headers but obvious session patterns. The behavioral score is weighted highest in nearly every modern system we have inspected.
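To make that concrete, here is a small worked example that reuses the weights and thresholds from the verdict sketch above. The two score profiles are invented for illustration, not measurements.

```python
WEIGHTS = {"tls": 0.30, "http": 0.25, "behavior": 0.45}


def composite(scores):
    """Weighted sum of per-surface scores; higher means more bot-like."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def tier(score):
    """Map a composite score to the same tiers as the verdict sketch."""
    if score > 0.85:
        return "BLOCK"
    if score > 0.60:
        return "CHALLENGE"
    if score > 0.40:
        return "THROTTLE"
    return "ALLOW"


# Slightly off TLS and headers, but convincing session behavior.
rough_but_humanlike = {"tls": 0.50, "http": 0.50, "behavior": 0.10}
# Near-perfect TLS and headers, but obviously scripted sessions.
polished_but_robotic = {"tls": 0.05, "http": 0.05, "behavior": 0.90}

for name, scores in [("rough", rough_but_humanlike), ("polished", polished_but_robotic)]:
    value = composite(scores)
    print(name, round(value, 4), tier(value))
```

The rough-but-humanlike client lands at about 0.32 and is allowed, while the polished-but-robotic one lands at about 0.43 and is throttled, despite scoring ten times better on the two fingerprint surfaces.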
Second, the thresholds shift. The same composite score that allowed traffic at 2am UTC may challenge it at 11am. The same score that allows on the public catalog page may block on the checkout API. Treating the system as static is the source of most "it worked yesterday" outages.
Third, and this is the strategic insight, blocks are not the worst outcome. A challenge gives you feedback. A block gives you feedback. The throttle tier is where data integrity dies quietly. You get back responses; they look correct; but pricing data is being subtly perturbed, inventory counts are stale, and you discover three weeks later that 12% of your dataset is poisoned. Designing for blocks is easy. Designing to detect that you've been quietly downgraded is the hard part.
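One way to notice a quiet downgrade is to keep re-fetching a handful of records whose true values you already know and to measure how far the responses drift. The sketch below is a minimal version of that idea. The `CANARIES` table, the example.com URLs, and `fake_fetch` are hypothetical stand-ins for a real canary set and a real crawl function.

```python
def field_drift(expected, observed):
    """Fraction of expected fields that are missing or different in observed."""
    if not expected:
        return 0.0
    mismatches = sum(1 for key, value in expected.items() if observed.get(key) != value)
    return mismatches / len(expected)


def canary_drift(canaries, fetch):
    """Average field drift across known-good records.

    `canaries` maps URL -> expected parsed record.
    `fetch` is any callable that maps a URL to a parsed record (a dict).
    """
    drifts = [field_drift(expected, fetch(url)) for url, expected in canaries.items()]
    return sum(drifts) / len(drifts)


# Hypothetical known-good records.
CANARIES = {
    "https://example.com/item/1": {"price": "19.99", "stock": "in_stock"},
    "https://example.com/item/2": {"price": "5.00", "stock": "in_stock"},
}


def fake_fetch(url):
    # Stand-in for a real crawl: item 2 comes back 200 OK with a perturbed price.
    if url.endswith("/item/2"):
        return {"price": "5.49", "stock": "in_stock"}
    return dict(CANARIES[url])


drift = canary_drift(CANARIES, fake_fetch)
print(drift)
```

Every response in this example would show up as a success in a status-code dashboard, yet a quarter of the checked fields are wrong. Alerting on that drift, rather than on 403s, is what surfaces the throttle tier.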
Building for the antifragile case
Most evasion engineering tries to be invisible. We have come to believe that is the wrong goal. Invisibility is a moving target maintained by an adversary with more resources than you have. Resilience is the goal: building a system whose performance degrades gracefully when blocked, recovers automatically when conditions change, and surfaces honest signal about its own data quality.
Concretely, this means a few things in our infrastructure:
- Multi-layer fingerprint diversity. Not one Chrome impersonation, but a population of them, sampled appropriately. A proxy pool isn't a list of IPs; it's a joint distribution over (IP, TLS-stack, HTTP-profile, geolocation).
- Real-time scoring of our own traffic. We treat each outbound request as a draw from an unknown distribution, and we measure the response distribution. If p(200) on a given route drops below baseline, that route is auto-quarantined for a diagnostic crawl before production traffic resumes.
- Adversarial validation. We periodically scrape known-good content and check that what we got back matches what we expected. The drift between expected and observed is a far better health signal than HTTP status codes.
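The real-time scoring bullet can be sketched as a rolling success-rate tracker per route. This is a simplified illustration, not our production implementation: the `RouteHealth` class, the window size, the tolerance, and the minimum sample count are all assumptions.

```python
from collections import deque


class RouteHealth:
    """Rolling estimate of p(200) for one route, compared against a baseline."""

    def __init__(self, baseline, window=200, tolerance=0.10, min_samples=50):
        self.baseline = baseline        # expected p(200) under normal conditions
        self.tolerance = tolerance      # allowed drop before quarantining
        self.min_samples = min_samples  # avoid reacting to a handful of requests
        self.outcomes = deque(maxlen=window)

    def record(self, status_code):
        self.outcomes.append(1 if status_code == 200 else 0)

    def success_rate(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def should_quarantine(self):
        if len(self.outcomes) < self.min_samples:
            return False
        return self.success_rate() < self.baseline - self.tolerance


health = RouteHealth(baseline=0.97)
for _ in range(60):
    health.record(200)
healthy_verdict = health.should_quarantine()    # success rate is 1.0

for _ in range(40):
    health.record(403)
degraded_verdict = health.should_quarantine()   # success rate is 0.6

print(healthy_verdict, degraded_verdict)
```

A scheduler would check `should_quarantine()` before dispatching production traffic to a route and divert to a diagnostic crawl while it returns true.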
Anti-bot vendors optimize for catching the median scraper. The median scraper is loud: same fingerprint across millions of requests, no behavioral modeling, no validation. A crawler that is statistically indistinguishable from human traffic on the dimensions the detector measures doesn't need to be invisible. It just needs to be unremarkable.
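Avoiding the single-fingerprint pattern means sampling whole client profiles rather than mixing attributes independently. The sketch below illustrates the joint-distribution idea with a hypothetical three-profile population. The profile names and weights are invented, and a real pool would be far larger and weighted to match the traffic mix of the target site.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientProfile:
    ip_pool: str
    tls_stack: str
    http_profile: str
    geo: str


# Each entry is a coherent combination plus its sampling weight (hypothetical).
POPULATION = [
    (ClientProfile("residential-us", "chrome-like", "chrome-h2", "US"), 0.5),
    (ClientProfile("residential-de", "firefox-like", "firefox-h2", "DE"), 0.3),
    (ClientProfile("mobile-us", "safari-like", "safari-h2", "US"), 0.2),
]


def sample_profile(rng):
    """Draw one complete profile, so TLS, HTTP and geo always agree."""
    profiles = [profile for profile, _ in POPULATION]
    weights = [weight for _, weight in POPULATION]
    return rng.choices(profiles, weights=weights, k=1)[0]


rng = random.Random(7)
draws = [sample_profile(rng) for _ in range(1000)]
print(draws[0])
```

Sampling each dimension on its own would eventually pair a Chrome-like handshake with a Firefox-like HTTP/2 profile, which is the kind of mismatch detectors look for. Sampling the tuple keeps every draw internally consistent.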
What the numbers look like
To make this concrete: on our own infrastructure across the last quarter, on a sample of the top 500 most-crawled domains, we observed roughly the following.
- Single-fingerprint clients (default Python/Node libraries) hit a block rate of 47% within the first 100 requests against a Cloudflare-protected route.
- Fingerprint-matched clients with no behavioral modeling hit 22%.
- Fingerprint-matched clients with behavioral modeling and per-route adaptive timing hit 3.1%.
- The same population running through Crawlbase's smart-routing layer hit 0.4%.
The drop from the second tier to the third, from 22% to 3.1%, is the one that matters. The behavioral model isn't marginal. It is the difference between a crawler that needs constant babysitting and one that runs unattended for months.
The 0.4% figure above is our smart-routing layer doing the fingerprint diversity and behavioral modeling described in this section, without you maintaining any of it.
What to remember when you build the next crawler
- Bot detection is a joint probability across three independent surfaces. You do not need to win on each. You need to not lose badly on any.
- Behavioral signal is weighted highest in modern systems. The hours of work spent perfecting TLS fingerprints pay less than an afternoon spent modeling realistic session behavior.
- Throttle is more dangerous than block. A block tells you something is wrong. A throttle silently poisons your dataset.
- Design for resilience, not invisibility. Invisibility is an arms race you will lose eventually. Resilience compounds.
- Measure your own data quality, not just your status codes. 200 OK is necessary but not sufficient. The shape of the response is the real signal.
Anti-bot evasion is, in the end, less about beating the system and more about understanding what game the system is actually playing. The systems are smarter than they were five years ago. They will be smarter still in five more. The teams that succeed are the ones that stop treating detection as a wall to climb and start treating it as a constraint to design around, the same way you'd design around network latency, or database load, or any other property of the physical universe you operate in.
The web is not chaos. It has structure. We map it. You build with it.
