
How it works

Every Crawling API request takes a target URL and returns the page that target would have served to a real browser at the right geography, with the right device profile, after any anti-bot challenges have been resolved. Three things happen in sequence on every call:

  1. Routing. The request is sent through a residential or datacenter exit node — automatically by default, or in a specific country if you pass country=. Sticky sessions are available so a sequence of calls reuses the same IP.
  2. Rendering. If you authenticate with a JavaScript token, the URL is loaded in a real headless browser. Page-wait, scroll, click, and AJAX-idle controls let you wait for the actual content rather than the initial HTML shell.
  3. Anti-bot bypass. Cloudflare, PerimeterX, DataDome, hCaptcha, and other common challenges are solved server-side. You get the post-challenge HTML, not the challenge page.

The same endpoint covers all three. Pass only the parameters you need — there's no separate "JS-rendering API" or "anti-bot API". If you don't pass JS-token-only parameters, the request takes the cheap, fast path; the moment you do, the request shifts to the rendering path. Pricing is the same per successful response either way.

Tokens

Authentication uses one of two token types — both live on a single account, both authenticate the same endpoint:

  • Normal Token (TCP) — for static HTML or JSON responses where you don't need a browser. Faster, cheaper, used for the majority of straightforward scrape targets.
  • JavaScript Token — for SPAs, React/Vue/Angular apps, lazy-loaded feeds, and any target that hides content behind client-side rendering. Required to use page_wait, ajax_wait, scroll, and css_click_selector.

If a Normal-token request returns an empty body or a 525 (challenge couldn't be solved), the standard fix is to retry on the JavaScript token — most modern targets need a browser even when their initial HTML looks complete. See Authentication for the full token-management flow.
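
A minimal sketch of that fallback with plain requests (NORMAL_TOKEN and JS_TOKEN stand in for your two tokens; pc_status is read from the response headers described under Response):

import requests

API = 'https://api.crawlbase.com/'

def fetch(target_url):
    # First attempt on the Normal token (the cheap, fast path).
    res = requests.get(API, params={'token': 'NORMAL_TOKEN', 'url': target_url}, timeout=90)
    # Empty body or unsolved challenge (525): retry once on the JavaScript token.
    if not res.text or res.headers.get('pc_status') == '525':
        res = requests.get(API, params={'token': 'JS_TOKEN', 'url': target_url}, timeout=90)
    return res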

Concurrency & pricing

Every request that returns pc_status: 200 counts against your monthly quota. Failed requests (timeouts, blocks, 5xx from the target) are free — retries against a flaky upstream don't surprise your bill. Concurrency limits scale with your plan; the response includes a remaining header you can use to back off proactively before hitting the cap. Long-running crawls (heavy JS rendering, large page_wait) should use the async mode below to release the concurrency slot the moment the request is queued.

Client timeouts. Average response time is 4–10 seconds per request, but tail-latency requests (heavy SPAs, scroll_interval=60, slow upstream sites) can take longer. Set your client timeout to at least 90 seconds so legitimate slow responses don't time out before they arrive.

Other client-side recommendations. Send Accept-Encoding: gzip on every request — payloads are non-trivial (full HTML pages or markdown) and gzip typically cuts them to a third of the wire size. If you're using Scrapy, disable the DNS cache so the API host stays resolvable across long-lived crawls.
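
As a sketch of those client-side settings in Python (requests already sends Accept-Encoding: gzip by default and decompresses transparently; the explicit header just makes the intent visible, and DNSCACHE_ENABLED is Scrapy's DNS-cache switch):

import requests

res = requests.get(
    'https://api.crawlbase.com/',
    params={'token': 'YOUR_TOKEN', 'url': 'https://example.com'},
    headers={'Accept-Encoding': 'gzip'},  # compressed transfer; requests decompresses the body for you
    timeout=90,                           # generous timeout so tail-latency responses still arrive
)

# In a Scrapy project, disable the resolver cache in settings.py for long-lived crawls:
# DNSCACHE_ENABLED = False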

Endpoint

GET https://api.crawlbase.com/?token=YOUR_TOKEN&url=ENCODED_URL
# All requests are GET. The url parameter must be fully URL-encoded.
# Body is returned as the target page's content (HTML, JSON, image, etc).
# Metadata is returned as response headers (pc_status, original_status, url, rid).

Quickstart

curl 'https://api.crawlbase.com/?token=YOUR_TOKEN&url=https%3A%2F%2Fgithub.com%2Fanthropic'
from crawlbase import CrawlingAPI

api = CrawlingAPI({'token': 'YOUR_TOKEN'})
res = api.get('https://github.com/anthropic')
print(res['body'])
const { CrawlingAPI } = require('crawlbase');
const api = new CrawlingAPI({ token: 'YOUR_TOKEN' });
const res = await api.get('https://github.com/anthropic');
console.log(res.body);
require 'crawlbase'
api = Crawlbase::API.new(token: 'YOUR_TOKEN')
res = api.get('https://github.com/anthropic')
puts res.body
<?php
use Crawlbase\CrawlingAPI;
$api = new CrawlingAPI(['token' => 'YOUR_TOKEN']);
$res = $api->get('https://github.com/anthropic');
echo $res->body;
package main

import (
    "fmt"
    "github.com/crawlbase/crawlbase-go"
)

func main() {
    api, _ := crawlbase.NewCrawlingAPI("YOUR_TOKEN")
    res, _ := api.Get("https://github.com/anthropic", nil)
    fmt.Println(res.Body)
}

Request

Every Crawling API request is a single HTTP call to the endpoint. Most requests are GETs — pass the query parameters below to control rendering, geo, output format, and async behavior. Use POST when you need to send a form or JSON body, and PUT for raw payload uploads.

Request parameters

All parameters are passed as query string values. Only token and url are required.

Required

token
string (required)
Your Normal or JavaScript token. See Authentication.
url
string (required)
The fully URL-encoded target URL. Must include the scheme (http:// or https://).

Routing & geo

Pick where the request originates and what device the target sees. Routing matters for storefronts, SERPs, and any site that localises content by IP — the German Amazon catalog isn't reachable from a US exit even with the right URL, and Google SERPs are localised by both geography and the hl/gl URL params combined with the IP. Set country explicitly and the right currency, language, and inventory show up automatically.

country
string (optional)
Two-letter ISO country code (US, GB, DE, JP, …) to route the crawl through that country's exit nodes. Defaults to automatic geo selection.
device
desktop | tablet | mobile (default: desktop)
Emulate the User-Agent and viewport of the chosen device class.
user_agent
string (optional)
Override the User-Agent header. Use sparingly — defaults are tuned for each target.
tor_network
boolean (default: false)
Route the request over the Tor network so you can crawl .onion sites. Leave off for any clearnet target — Tor exits are slower and noisier than the residential pool.
Country may be auto-overridden

Crawlbase may override the country parameter to auto-select a proxy based on the URL — this gives the best success rate on most sites. Contact support if you need to disable automatic proxy selection.

Specifying a country can reduce the number of successful requests, so use it only when geolocation actually matters for the page you're crawling. Some sites (notably Amazon) are routed via dedicated proxies regardless of the country you pass — every country is allowed for those domains even if it's not in the supported list below.

You have access to the following countries:

Australia (AU), Brazil (BR), Canada (CA), Switzerland (CH), China (CN), Germany (DE), Spain (ES), Finland (FI), France (FR), United Kingdom (GB), India (IN), Japan (JP), Mexico (MX), Netherlands (NL), Norway (NO), Poland (PL), Russia (RU), Seychelles (SC), Sweden (SE), Turkey (TR), Ukraine (UA), United States (US)

Headers & cookies

Forward your own request headers and cookies through to the target site, or pin a sticky session so Set-Cookie values from one call replay on the next. Useful when the target needs an Accept-Language, a CSRF cookie, or a logged-in session that needs to survive across the requests in a flow.

request_headers
string (optional)
URL-encoded list of headers to forward, pipe-separated: accept-language:en-GB|accept-encoding:gzip. Pair with get_headers=true to also surface the target's response headers.
set_cookies
string (optional)
Cookies to forward to the target, in standard Cookie-header form: key1=value1; key2=value2.
cookies_session
string (optional)
Sticky cookie session — Crawlbase replays the cookies returned from previous calls on every subsequent call sharing the same value. Any string up to 32 chars; a new value starts a new session. Sessions expire 300 seconds after the last call.

Allowed headers. Not every header you pass via request_headers will reach the target site — Crawlbase strips a small set by default. To verify what actually goes out, send a test request to https://postman-echo.com/headers and inspect what the echo service receives. If you need an additional header authorised for your token, contact support with the header name(s).
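
For illustration, a two-call sticky session that also forwards an Accept-Language header (the session ID and target URLs are made up; any string up to 32 characters works as the session value):

import requests

API = 'https://api.crawlbase.com/'
common = {
    'token': 'YOUR_TOKEN',
    'cookies_session': 'demo-session-1',         # same value on both calls => same IP and cookie jar
    'request_headers': 'accept-language:en-GB',  # pipe-separate multiple headers
}

# The first call picks up the target's Set-Cookie values; the second call,
# made within 300 seconds, replays them automatically.
first  = requests.get(API, params={**common, 'url': 'https://shop.example.com/login'})
second = requests.get(API, params={**common, 'url': 'https://shop.example.com/account'})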

JavaScript rendering

These parameters require a JavaScript token. They control how the headless browser waits for content before capturing the DOM. If you find yourself reaching for several at once, the order to think about is: page_wait first (a fixed delay for predictable animations), then ajax_wait (drop the fixed delay if the page emits network requests after mount), then scroll (only if the content you need is below the fold), then css_click_selector (only if a button or accordion gates the data).

A common pitfall: setting page_wait too high "just in case". Every extra millisecond is concurrency you can't use elsewhere. Start at 0, increase only when you see truncated output, and consider ajax_wait as a smarter alternative — it returns as soon as the network goes idle rather than blocking on a fixed timeout.
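
A sketch of that progression with the Python SDK from the Quickstart (JS token required; the target URL is illustrative):

from crawlbase import CrawlingAPI

api = CrawlingAPI({'token': 'JS_TOKEN'})

# Prefer network-idle detection: returns as soon as the SPA stops fetching.
res = api.get('https://spa.example.com', {'ajax_wait': True})

# Fall back to a fixed delay only for predictable animations that fire no network requests.
res = api.get('https://spa.example.com', {'page_wait': 2000})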

page_wait
int, milliseconds (default: 0)
Wait this many milliseconds after page load before capturing. Useful for content that animates in.
ajax_wait
boolean (default: false)
Wait until the network is idle (no requests for ~500ms). Best for SPAs that fetch data after mount.
css_click_selector
string (optional)
CSS selector — Crawlbase will click the matching element before capturing. URL-encode special characters.
scroll
boolean (default: false)
Scroll to the bottom of the page before capturing. Triggers lazy-load.
scroll_interval
int, seconds (default: 10)
Maximum seconds to spend scrolling. Combined with scroll=true.
screenshot
boolean (default: false)
Capture a JPEG of the rendered page. The URL comes back as screenshot_url in the response headers (or the JSON body when format=json) and expires after one hour. For multi-shot or full-page workflows reach for the dedicated Screenshots API instead.

Screenshot output options. When screenshot=true, the default capture is the full rendered page. To narrow it to just the viewport, append mode=viewport; pair it with width and height (pixels) to constrain the capture. Both default to the screen dimensions and only take effect with mode=viewport. Example: &screenshot=true&mode=viewport&width=1200&height=800.

How scroll is billed. Scroll-enabled requests are billed by total server-side processing time. The first 8 seconds (page load + scrolling combined) count as 1 request; every additional 5 seconds beyond that adds 1 more billed request. A 20s scroll = 1 (first 8s) + 1 (9–13s) + 1 (14–18s) + 1 (19–20s, partial blocks count in full) = 4 billed requests. If the page completes before scroll_interval, only the actual processing time is billed.

The maximum scroll_interval is 60 seconds — past 60s scrolling stops and the response is returned. When you set scroll_interval=60, keep the client-side connection open for at least 90 seconds so the response has time to come back. Combining scroll with page_wait increases the total processing time and therefore the billed request count.
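
That billing rule reduces to a small helper, sketched here under the assumption (as in the worked example above) that partial 5-second blocks bill in full:

import math

def billed_requests(processing_seconds: float) -> int:
    # First 8 seconds (page load + scrolling) count as 1 request;
    # every further 5-second block, even partial, adds 1 more.
    extra = max(0.0, processing_seconds - 8)
    return 1 + math.ceil(extra / 5)

print(billed_requests(6))   # 1
print(billed_requests(20))  # 4, matching the 20-second example above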

The css_click_selector parameter only takes effect when you're using the JavaScript token (it runs inside the headless browser before the DOM is captured). It accepts any fully specified, valid CSS selector — for example an ID like #some-button, a class like .some-other-button, or an attribute selector like [data-tab-item="tab1"]. Always URL-encode the value so special characters survive the query string intact.

If the selector is not found on the page the request fails with pc_status 595. To still receive a response when the click target may be absent, append a universally-found selector as a fallback — comma-separated. For example #some-button,body falls back to clicking body when #some-button doesn't exist.

Multiple selectors. To click several elements in sequence before the capture, separate them with a pipe (|) character. URL-encode the whole value, including the pipe. For example, clicking #start-button and then .next-page-link looks like #start-button|.next-page-link in raw form, or %23start-button%7C.next-page-link URL-encoded. The clicks happen in the order given. If any selector in the chain is missing the same pc_status 595 rule applies, so the ,body fallback pattern works per selector.
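
For example, a two-step click chain with a per-selector body fallback might be encoded like this (selectors and target URL are illustrative):

from urllib.parse import quote
import requests

# Click #start-button (falling back to body), then .next-page-link (falling back to body).
selector = '#start-button,body|.next-page-link,body'
encoded = quote(selector, safe='')  # '%23start-button%2Cbody%7C.next-page-link%2Cbody' if you build the query by hand

res = requests.get('https://api.crawlbase.com/', params={
    'token': 'JS_TOKEN',
    'url': 'https://example.com/list',
    'css_click_selector': selector,  # requests URL-encodes this automatically, pipe included
})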

Need to run custom JavaScript inside the page before Crawlbase captures the DOM (e.g. dispatch a synthetic event, mutate state, force a fetch)? That's a per-account feature gated on your use case — contact support with what you're trying to do and we'll wire it up.

Async & storage

Async mode flips the API from "block until I have your page" to "queue this and tell me when it's done." The endpoint returns immediately with an rid; the actual result is delivered to a webhook you specify, or stored in Cloud Storage and fetched later by the same rid. This is the right mode for batch jobs and slow targets — async releases your concurrency slot the moment the request is queued, so you can keep submitting while crawls are still running. For high-volume jobs (millions of URLs), use the Enterprise Crawler which sits in front of this same async pipeline with retries, rate management, and result delivery.

Async mode is currently linkedin.com only

The async=true flag is currently supported only for linkedin.com URLs. If you need async crawls on other domains, contact support with the target domain so we can enable it for your token.

async
boolean (default: false)
Return immediately with an rid instead of blocking. Result is delivered to callback if set, or available via Cloud Storage by rid.
callback
URL (optional)
Webhook URL to receive the crawl result. Required when async=true if you don't want to poll.
store
boolean (default: false)
Persist the crawled page in Cloud Storage. Returns an rid in addition to the body.

Output format

The default response is the raw page body — exactly what a browser would receive after rendering and anti-bot resolution. For most pipelines that's the right shape (your downstream parser handles HTML directly). Use format=json when you want metadata (status, final URL, RID, headers) bundled into a single envelope rather than split across response headers and the body. Use scraper= or autoparse=true when the target is one we already have a parser for — you skip the parsing step entirely and get clean structured fields back instead of raw markup.

format
html | json | md (default: html)
Choose the response envelope. html returns the raw page with metadata in the response headers. json wraps the page plus all metadata into a single JSON object. md converts the page to GitHub-Flavored Markdown — pair with md_readability=true to strip nav/sidebar/ads first.
md_readability
boolean (default: false)
Only meaningful with format=md. When true, Crawlbase runs a readability pass over the page before converting to Markdown — drops the chrome (nav, sidebar, footer, ad slots) and keeps the main article content. Best fit for converting blog posts and articles into clean LLM context.
pretty
boolean (default: false)
Only meaningful with format=json. Pretty-prints the JSON envelope with indentation and newlines for human reading; leave off in production to keep responses small.
scraper
string (optional)
Apply a built-in scraper to extract structured data instead of returning HTML. Example: amazon-product-details.
autoparse
boolean (default: false)
Auto-detect the page type and apply the matching scraper. Convenience for "give me JSON when you can".

Response control

These parameters change what the response contains or how Crawlbase decides a request succeeded. Use get_headers and get_cookies when you need the target site's response headers or Set-Cookie values surfaced back to you (they're stripped by default). Use custom_success_codes when the target legitimately returns a non-2xx status that your pipeline should treat as a clean fetch — without it, Crawlbase will retry those responses on your behalf.

get_headers
boolean (default: false)
Surface the target site's response headers. They come back prefixed as original_header_* response headers, or grouped under original_headers when format=json.
get_cookies
boolean (default: false)
Surface the target site's Set-Cookie values. They come back as original_set_cookie in the response headers, or under the same key when format=json.
custom_success_codes
string (optional)
Comma-separated list of HTTP status codes to treat as successful — e.g. custom_success_codes=403,429,503. Crawlbase won't retry these, and the original status is preserved in original_status. Use it when the target legitimately returns these codes for your endpoint (auth-gated APIs, region-blocked pages you still want the body of).
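
A sketch combining the three (the target URL is illustrative; with custom_success_codes a 403 comes back as a normal response instead of being retried):

import requests

res = requests.get('https://api.crawlbase.com/', params={
    'token': 'YOUR_TOKEN',
    'url': 'https://api.example.com/private',
    'get_headers': 'true',
    'get_cookies': 'true',
    'custom_success_codes': '403,429',
})

print(res.headers.get('original_status'))      # e.g. 403, preserved rather than retried
print(res.headers.get('original_set_cookie'))  # the target's Set-Cookie values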

POST requests

Use POST when the target endpoint expects a request body — form submissions, JSON APIs, GraphQL, anything that doesn't fit in a query string. Same endpoint, same parameters, same response shape as GET; only the HTTP method and the body change.

POST is Normal-token only

POST requests work with the Normal token only. The JavaScript token (and the JS-rendering parameters page_wait, ajax_wait, scroll, css_click_selector) are GET-only — when you need to submit a form on a JS-rendered page, use the JavaScript token with css_click_selector to drive the form button instead of POSTing to the form URL directly.

The default Content-Type is application/x-www-form-urlencoded. Pass the form fields as the request body — Crawlbase forwards them to the target unchanged.

curl 'https://api.crawlbase.com/?token=YOUR_TOKEN&url=https%3A%2F%2Fpostman-echo.com%2Fpost' \
  --data-urlencode 'parameter1=testing some post data' \
  --data-urlencode 'parameter2=here goes some data'
import requests
from urllib.parse import quote_plus

url = quote_plus('https://postman-echo.com/post')
res = requests.post(
    f'https://api.crawlbase.com/?token=YOUR_TOKEN&url={url}',
    data={'parameter1': 'value', 'parameter2': 'another value'},
)
print(res.status_code, res.text)
const url = encodeURIComponent('https://postman-echo.com/post');
const body = new URLSearchParams({ parameter1: 'value', parameter2: 'another' });

const res = await fetch(`https://api.crawlbase.com/?token=YOUR_TOKEN&url=${url}`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
  body,
});
console.log(res.status, await res.text());
require 'net/http'

uri = URI('https://api.crawlbase.com')
uri.query = URI.encode_www_form(token: 'YOUR_TOKEN', url: 'https://postman-echo.com/post')

res = Net::HTTP.post_form(uri, 'parameter1' => 'value', 'parameter2' => 'another')
puts res.code, res.body
<?php
$url  = 'https://postman-echo.com/post';
$body = http_build_query(['parameter1' => 'value', 'parameter2' => 'another']);

$ch = curl_init('https://api.crawlbase.com/?token=YOUR_TOKEN&url=' . urlencode($url));
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
echo curl_exec($ch);
package main

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
    "strings"
)

func main() {
    target := url.QueryEscape("https://postman-echo.com/post")
    body   := strings.NewReader("parameter1=value&parameter2=another")

    res, _ := http.Post(
        "https://api.crawlbase.com/?token=YOUR_TOKEN&url="+target,
        "application/x-www-form-urlencoded",
        body,
    )
    out, _ := io.ReadAll(res.Body)
    fmt.Println(string(out))
}
Don't abuse this

POST cannot be used to spam or otherwise harm target websites. Crawlbase actively monitors for abusive patterns; accounts caught using POST for spam, credential stuffing, or other malicious traffic will be suspended and reported.

POST with a JSON body

Override the default form-urlencoded content type with post_content_type. URL-encode the value (e.g. application/json becomes application%2Fjson). The body is forwarded to the target unchanged — encode it as JSON yourself.

curl 'https://api.crawlbase.com/?token=YOUR_TOKEN&url=https%3A%2F%2Fpostman-echo.com%2Fpost&post_content_type=application%2Fjson%3Bcharset%3DUTF-8' \
  --request POST \
  --header 'Content-Type: application/json' \
  --data '{"param1":"value","param2":"another"}'
import json, requests
from urllib.parse import quote_plus

url = quote_plus('https://postman-echo.com/post')
res = requests.post(
    f'https://api.crawlbase.com/?token=YOUR_TOKEN'
    f'&url={url}'
    f'&post_content_type=application/json',
    data=json.dumps({'param1': 'value', 'param2': 'another'}),
    headers={'Content-Type': 'application/json'},
)
print(res.status_code, res.text)
const url  = encodeURIComponent('https://postman-echo.com/post');
const ct   = encodeURIComponent('application/json;charset=UTF-8');
const body = JSON.stringify({ param1: 'value', param2: 'another' });

const res = await fetch(
  `https://api.crawlbase.com/?token=YOUR_TOKEN&url=${url}&post_content_type=${ct}`,
  { method: 'POST', headers: { 'Content-Type': 'application/json' }, body },
);
console.log(res.status, await res.text());
require 'net/http'
require 'json'

uri = URI('https://api.crawlbase.com')
uri.query = URI.encode_www_form(
  token: 'YOUR_TOKEN',
  url: 'https://postman-echo.com/post',
  post_content_type: 'application/json'
)

req = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json')
req.body = { param1: 'value', param2: 'another' }.to_json

res = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |h| h.request(req) }
puts res.code, res.body
<?php
$url  = 'https://postman-echo.com/post';
$ct   = urlencode('application/json;charset=UTF-8');
$body = json_encode(['param1' => 'value', 'param2' => 'another']);

$ch = curl_init(
    'https://api.crawlbase.com/?token=YOUR_TOKEN'
    . '&url=' . urlencode($url)
    . '&post_content_type=' . $ct
);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type: application/json']);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
echo curl_exec($ch);
package main

import (
    "bytes"
    "fmt"
    "io"
    "net/http"
    "net/url"
)

func main() {
    target := url.QueryEscape("https://postman-echo.com/post")
    ct     := url.QueryEscape("application/json;charset=UTF-8")
    body   := bytes.NewBufferString(`{"param1":"value","param2":"another"}`)

    res, _ := http.Post(
        "https://api.crawlbase.com/?token=YOUR_TOKEN&url="+target+"&post_content_type="+ct,
        "application/json",
        body,
    )
    out, _ := io.ReadAll(res.Body)
    fmt.Println(string(out))
}

Note: the target site decides whether to accept the body. Crawlbase forwards the request honestly — if the target returns 4xx because the body shape is wrong, that surfaces in original_status, not in pc_status. See Errors for the branching pattern.
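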

PUT requests

PUT works the same way as POST — same endpoint, same parameters, same body-encoding rules. The only difference is the HTTP method.

curl 'https://api.crawlbase.com/?token=YOUR_TOKEN&url=https%3A%2F%2Fapi.example.com%2Fresource%2F42&post_content_type=application%2Fjson' \
  --request PUT \
  --header 'Content-Type: application/json' \
  --data '{"name":"updated","status":"active"}'
import requests
from urllib.parse import quote_plus

url = quote_plus('https://api.example.com/resource/42')
res = requests.put(
    f'https://api.crawlbase.com/?token=YOUR_TOKEN&url={url}&post_content_type=application/json',
    data='{"name":"updated","status":"active"}',
    headers={'Content-Type': 'application/json'},
)
print(res.status_code, res.text)
const url  = encodeURIComponent('https://api.example.com/resource/42');
const ct   = encodeURIComponent('application/json');
const body = JSON.stringify({ name: 'updated', status: 'active' });

const res = await fetch(
  `https://api.crawlbase.com/?token=YOUR_TOKEN&url=${url}&post_content_type=${ct}`,
  { method: 'PUT', headers: { 'Content-Type': 'application/json' }, body },
);
console.log(res.status, await res.text());
require 'net/http'
require 'json'

uri = URI('https://api.crawlbase.com')
uri.query = URI.encode_www_form(
  token: 'YOUR_TOKEN',
  url: 'https://api.example.com/resource/42',
  post_content_type: 'application/json'
)

req = Net::HTTP::Put.new(uri, 'Content-Type' => 'application/json')
req.body = { name: 'updated', status: 'active' }.to_json

res = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |h| h.request(req) }
puts res.code, res.body
<?php
$url  = 'https://api.example.com/resource/42';
$ct   = urlencode('application/json');
$body = json_encode(['name' => 'updated', 'status' => 'active']);

$ch = curl_init(
    'https://api.crawlbase.com/?token=YOUR_TOKEN'
    . '&url=' . urlencode($url)
    . '&post_content_type=' . $ct
);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'PUT');
curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type: application/json']);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
echo curl_exec($ch);
package main

import (
    "bytes"
    "fmt"
    "io"
    "net/http"
    "net/url"
)

func main() {
    target := url.QueryEscape("https://api.example.com/resource/42")
    ct     := url.QueryEscape("application/json")
    body   := bytes.NewBufferString(`{"name":"updated","status":"active"}`)

    req, _ := http.NewRequest(
        "PUT",
        "https://api.crawlbase.com/?token=YOUR_TOKEN&url="+target+"&post_content_type="+ct,
        body,
    )
    req.Header.Set("Content-Type", "application/json")

    res, _ := http.DefaultClient.Do(req)
    out, _ := io.ReadAll(res.Body)
    fmt.Println(string(out))
}

Like POST, PUT requires the Normal token. Use post_content_type to control the body's media type if it isn't form-urlencoded.

Don't use POST/PUT to spam

Crawlbase actively monitors POST and PUT traffic. Sending request bodies that target third-party sites you don't own — comment spam, fraudulent form submissions, scripted account creation — gets the originating account suspended on first detection. Use these verbs for legitimate API integrations, your own staging and production endpoints, and explicitly permitted automation.

Response

Successful responses return the target page in the body. Metadata lives in the response headers.

Headers

pc_status
Crawlbase status code. 200 = success.
original_status
HTTP status from the target site.
url
Final URL after redirects.
rid
Request ID. Returned when async=true or store=true.
content-type
MIME type of the body (text/html, application/json, image/png, etc).
original_header_*
Returned when get_headers=true. Each header from the target site arrives with an original_header_ prefix (e.g. original_header_x_frame_options). Grouped under original_headers when format=json.
screenshot_url
Returned when screenshot=true. Temporary JPEG URL for the rendered page; expires one hour after the crawl.
original_set_cookie
Returned when get_cookies=true. Concatenated Set-Cookie values from the target site's response.
domain_complexity (also X-Domain-Complexity)
The complexity tier of the crawled domain — one of standard, moderate, or complex. Reflects the resources required to bypass the site's protections and maps directly onto the pricing tier billed for the request. See complexity tiers below.
storage_url
Returned when the request was made with store=true. Pointer to the stored copy of the response in Crawlbase Cloud Storage; pair with rid to retrieve later.
Content-Type
text/markdown; charset=utf-8 when the request was made with format=md; the standard text/html or application/json otherwise.
X-Markdown-Flavor
Markdown dialect of the response body — currently GitHub Flavored Markdown (GFM). Only emitted when format=md.
X-Markdown-Features
Comma-separated list of GFM features used in the body (e.g. tables,lists). Lets you pick a parser with the right extensions enabled. Only emitted when format=md.
X-Markdown-Base-URL
Host of the resolved URL (after any redirects). Useful for resolving relative links in the markdown body. Only emitted when format=md.
X-Markdown-Generator
Identifies the converter — value is ProxyCrawl-API. Only emitted when format=md.
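
Reading that metadata from a plain HTTP client looks like this (a sketch with requests, using the header names listed above):

import requests

res = requests.get('https://api.crawlbase.com/', params={
    'token': 'YOUR_TOKEN',
    'url': 'https://github.com/crawlbase',
})

print(res.headers.get('pc_status'))            # '200' on success
print(res.headers.get('original_status'))      # what the target site returned
print(res.headers.get('url'))                  # final URL after redirects
print(res.headers.get('x-domain-complexity'))  # pricing tier for this request (lookup is case-insensitive)
print(len(res.text))                           # the page body itself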

HTML response

The default. format=html (or no format at all) returns the raw page body in the HTTP body, with metadata in the response headers (url, original_status, pc_status, X-Domain-Complexity, plus any original_header_* entries you opted into via get_headers=true).

GET 'https://api.crawlbase.com/?token=YOUR_TOKEN&url=https%3A%2F%2Fgithub.com%2Fcrawlbase&format=html'

Response:
  Headers:
    url: https://github.com/crawlbase
    original_status: 200
    pc_status: 200
    X-Domain-Complexity: standard

  Body:
    <!doctype html><html>
      <head>...</head>
      <body>... (full page HTML) ...</body>
    </html>

JSON response

Set format=json to get the same data as a single JSON object instead:

GET 'https://api.crawlbase.com/?token=YOUR_TOKEN&url=https%3A%2F%2Fgithub.com%2Fcrawlbase&format=json'

Response:
  {
    "original_status": 200,
    "pc_status": 200,
    "url": "https://github.com/crawlbase",
    "domain_complexity": "standard",
    "body": "<!doctype html><html>... (full page HTML) ...</html>"
  }

Markdown response

format=md returns the page already converted to GitHub Flavored Markdown in the body, with Content-Type: text/markdown; charset=utf-8 and a block of X-Markdown-* metadata headers (Flavor, Features, Base-URL, Generator) alongside the usual url / original_status / pc_status. Pair it with md_readability=true when you want main-content extraction (article body, no chrome) before the conversion runs — see the md_readability parameter.

GET 'https://api.crawlbase.com/?token=YOUR_TOKEN&url=https%3A%2F%2Fgithub.com%2Fcrawlbase&format=md'

Response:
  Headers:
    Content-Type: text/markdown; charset=utf-8
    X-Markdown-Flavor: GitHub Flavored Markdown (GFM)
    X-Markdown-Features: tables,lists
    X-Markdown-Base-URL: github.com
    X-Markdown-Generator: ProxyCrawl-API
    url: https://github.com/crawlbase
    original_status: 200
    pc_status: 200

  Body:
    # crawlbase
    ... (markdown text of the page) ...

Billable requests

Crawlbase only charges requests where pc_status is 200 and original_status is one of:

200 OK
201 Created
204 No Content
301 Moved Permanently
302 Found — only when the redirect was followed and returned content
404 Not Found
410 Gone

Any other original_status is free, and so is any non-200 pc_status. Use this list when reconciling a usage invoice against your application logs.
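
If you reconcile invoices in code, the rule reduces to a small predicate (a sketch, not an official billing formula):

BILLABLE_ORIGINAL_STATUSES = {200, 201, 204, 301, 302, 404, 410}

def is_billable(pc_status: int, original_status: int) -> bool:
    # Charged only when Crawlbase reports success AND the target's status is in the billable set.
    return pc_status == 200 and original_status in BILLABLE_ORIGINAL_STATUSES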

Domain complexity tiers

The domain_complexity field (also returned as the X-Domain-Complexity response header) tells you how hard it was to crawl the target domain — and what pricing tier the request fell into.

  • standard — easy to crawl, minimal protection. Lowest pricing tier.
  • moderate — moderate anti-bot protection that needs specialised handling. Intermediate pricing tier.
  • complex — advanced protection requiring specialised resources. Highest pricing tier.

For tier-specific pricing see your subscription plan or contact sales.

Common patterns

JS-rendered SPA with scroll

curl 'https://api.crawlbase.com/?token=JS_TOKEN' \
  --data-urlencode 'url=https://feed.example.com' \
  --data-urlencode 'page_wait=2000' \
  --data-urlencode 'scroll=true' \
  --data-urlencode 'scroll_interval=15' -G
from crawlbase import CrawlingAPI
api = CrawlingAPI({'token': 'JS_TOKEN'})
res = api.get('https://feed.example.com', {
    'page_wait': 2000,
    'scroll': True,
    'scroll_interval': 15,
})
const { CrawlingAPI } = require('crawlbase');
const api = new CrawlingAPI({ token: 'JS_TOKEN' });

const res = await api.get('https://feed.example.com', {
  page_wait: 2000,
  scroll: true,
  scroll_interval: 15,
});
console.log(res.body);
require 'crawlbase'

api = Crawlbase::API.new(token: 'JS_TOKEN')
res = api.get('https://feed.example.com',
  page_wait: 2000,
  scroll: true,
  scroll_interval: 15
)
puts res.body
<?php
use Crawlbase\CrawlingAPI;

$api = new CrawlingAPI(['token' => 'JS_TOKEN']);
$res = $api->get('https://feed.example.com', [
    'page_wait' => 2000,
    'scroll' => true,
    'scroll_interval' => 15,
]);
echo $res->body;
package main

import (
    "fmt"
    "log"
    "github.com/crawlbase/crawlbase-go"
)

func main() {
    api, err := crawlbase.NewCrawlingAPI("JS_TOKEN")
    if err != nil {
        log.Fatal(err)
    }
    res, _ := api.Get("https://feed.example.com", map[string]string{
        "page_wait":       "2000",
        "scroll":          "true",
        "scroll_interval": "15",
    })
    fmt.Println(res.Body)
}

Geo-routed request

# Get the German version of a localized site
curl 'https://api.crawlbase.com/?token=YOUR_TOKEN' \
  --data-urlencode 'url=https://www.amazon.com/dp/B08N5WRWNW' \
  --data-urlencode 'country=DE' -G
from crawlbase import CrawlingAPI

api = CrawlingAPI({'token': 'YOUR_TOKEN'})
res = api.get('https://www.amazon.com/dp/B08N5WRWNW', {'country': 'DE'})
print(res['body'])
const { CrawlingAPI } = require('crawlbase');
const api = new CrawlingAPI({ token: 'YOUR_TOKEN' });

const res = await api.get('https://www.amazon.com/dp/B08N5WRWNW', { country: 'DE' });
console.log(res.body);
require 'crawlbase'

api = Crawlbase::API.new(token: 'YOUR_TOKEN')
res = api.get('https://www.amazon.com/dp/B08N5WRWNW', country: 'DE')
puts res.body
<?php
use Crawlbase\CrawlingAPI;

$api = new CrawlingAPI(['token' => 'YOUR_TOKEN']);
$res = $api->get('https://www.amazon.com/dp/B08N5WRWNW', ['country' => 'DE']);
echo $res->body;
package main

import (
    "fmt"
    "log"
    "github.com/crawlbase/crawlbase-go"
)

func main() {
    api, err := crawlbase.NewCrawlingAPI("YOUR_TOKEN")
    if err != nil {
        log.Fatal(err)
    }
    res, _ := api.Get("https://www.amazon.com/dp/B08N5WRWNW", map[string]string{
        "country": "DE",
    })
    fmt.Println(res.Body)
}

Async crawl with webhook

curl 'https://api.crawlbase.com/?token=YOUR_TOKEN' \
  --data-urlencode 'url=https://example.com' \
  --data-urlencode 'async=true' \
  --data-urlencode 'callback=https://your-app.com/webhook' -G

# → returns immediately: { "rid": "a1B2c3D4e5F6" }
# → result POSTed to your callback when ready
from crawlbase import CrawlingAPI

api = CrawlingAPI({'token': 'YOUR_TOKEN'})
res = api.get('https://example.com', {
    'async': 'true',
    'callback': 'https://your-app.com/webhook',
})
print(res['rid'])  # → returned immediately; result POSTed to callback later
const { CrawlingAPI } = require('crawlbase');
const api = new CrawlingAPI({ token: 'YOUR_TOKEN' });

const res = await api.get('https://example.com', {
  async: true,
  callback: 'https://your-app.com/webhook',
});
console.log(res.rid); // → returned immediately; result POSTed to callback later
require 'crawlbase'

api = Crawlbase::API.new(token: 'YOUR_TOKEN')
res = api.get('https://example.com',
  async: true,
  callback: 'https://your-app.com/webhook'
)
puts res.rid # → returned immediately; result POSTed to callback later
<?php
use Crawlbase\CrawlingAPI;

$api = new CrawlingAPI(['token' => 'YOUR_TOKEN']);
$res = $api->get('https://example.com', [
    'async' => 'true',
    'callback' => 'https://your-app.com/webhook',
]);
echo $res->rid; // → returned immediately; result POSTed to callback later
package main

import (
    "fmt"
    "log"
    "github.com/crawlbase/crawlbase-go"
)

func main() {
    api, err := crawlbase.NewCrawlingAPI("YOUR_TOKEN")
    if err != nil {
        log.Fatal(err)
    }
    res, _ := api.Get("https://example.com", map[string]string{
        "async":    "true",
        "callback": "https://your-app.com/webhook",
    })
    fmt.Println(res.RID) // → returned immediately; result POSTed to callback later
}
When to use async

Async releases your concurrency slot the moment the request is queued, so a long crawl doesn't tie up budget. Use it for slow targets (heavy JS, long page_wait) when you need to push high volume.

Proxy mode

The same Crawling API can be invoked as an HTTP/HTTPS proxy instead of a REST endpoint — useful when you have an existing scraper, browser-automation script, or HTTP client that already supports proxy configuration and you'd rather drop Crawlbase in front of it than rewrite the request layer.

Point your client at smartproxy.crawlbase.com:8001 (HTTPS, recommended) or smartproxy.crawlbase.com:8000 (HTTP) and pass your token as the proxy username. All Crawling API features — JS rendering, anti-bot bypass, geo-routing — apply identically; the only difference is the request shape.

Proxy mode vs. Smart AI Proxy

Two products share the same hostname but use different ports — easy to mix up. Capabilities are essentially the same on both (country routing, device emulation, sessions, custom headers, JS rendering via CrawlbaseAPI-* controls); they differ in the subscription you're billed against and the concurrency / thread tier that subscription provides:

  • Crawling API in proxy mode (this section) → ports 8000 / 8001. Routes through your Crawling API plan: same monthly quota, same concurrency budget, same per-success billing as REST-mode calls. Pick this when you already pay for the Crawling API and want a proxy-shaped interface alongside the REST endpoint.
  • Smart AI Proxy (separate product, see Smart Proxy) → ports 8012 / 8013. A separate SKU with its own subscription and its own thread / concurrency model, sized for proxy-first scraping pipelines that already run high thread counts. Same network and same control headers underneath — the choice is which contract and concurrency shape fit your usage.

Rule of thumb: pick the product whose subscription you already hold (or whose pricing model fits your traffic shape). The capability surface is the same; the ports just route you to the right billing + concurrency lane.

Quickstart

A first call from your shell — Normal token, HTTPS proxy:

# HTTPS proxy (recommended)
curl -x 'https://YOUR_TOKEN@smartproxy.crawlbase.com:8001' \
  -k 'https://httpbin.org/ip'

# HTTP alternative
curl -x 'http://YOUR_TOKEN@smartproxy.crawlbase.com:8000' \
  -k 'https://httpbin.org/ip'
import requests

proxies = {
    'http':  'http://YOUR_TOKEN@smartproxy.crawlbase.com:8000',
    'https': 'http://YOUR_TOKEN@smartproxy.crawlbase.com:8000',
}
res = requests.get('https://httpbin.org/ip', proxies=proxies, verify=False)
print(res.status_code, res.text)
const { HttpsProxyAgent } = require('https-proxy-agent');

// The `agent` option works with node-fetch; Node's built-in fetch ignores it.
const agent = new HttpsProxyAgent('http://YOUR_TOKEN@smartproxy.crawlbase.com:8000');
const res = await fetch('https://httpbin.org/ip', { agent });
console.log(res.status, await res.text());
require 'net/http'

uri  = URI('https://httpbin.org/ip')
http = Net::HTTP.new(uri.host, uri.port,
  'smartproxy.crawlbase.com', 8000, 'YOUR_TOKEN', '')
http.use_ssl = true
http.verify_mode = OpenSSL::SSL::VERIFY_NONE

res = http.get(uri.request_uri)
puts res.code, res.body
<?php
$ch = curl_init('https://httpbin.org/ip');
curl_setopt($ch, CURLOPT_PROXY,         'smartproxy.crawlbase.com:8000');
curl_setopt($ch, CURLOPT_PROXYUSERPWD,  'YOUR_TOKEN:');
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
echo curl_exec($ch);
package main

import (
    "crypto/tls"
    "fmt"
    "io"
    "net/http"
    "net/url"
)

func main() {
    proxyURL, _ := url.Parse("http://YOUR_TOKEN@smartproxy.crawlbase.com:8000")
    client := &http.Client{
        Transport: &http.Transport{
            Proxy:           http.ProxyURL(proxyURL),
            TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
        },
    }
    res, _ := client.Get("https://httpbin.org/ip")
    out, _ := io.ReadAll(res.Body)
    fmt.Println(string(out))
}

For JS-rendered targets, swap in your JavaScript token:

curl -x 'https://YOUR_JS_TOKEN@smartproxy.crawlbase.com:8001' \
  -k 'https://spa.example.com'
import requests

proxies = {
    'http':  'http://YOUR_JS_TOKEN@smartproxy.crawlbase.com:8000',
    'https': 'http://YOUR_JS_TOKEN@smartproxy.crawlbase.com:8000',
}
res = requests.get('https://spa.example.com', proxies=proxies, verify=False)
print(res.status_code)
const { HttpsProxyAgent } = require('https-proxy-agent');

const agent = new HttpsProxyAgent('http://YOUR_JS_TOKEN@smartproxy.crawlbase.com:8000');
const res = await fetch('https://spa.example.com', { agent });
console.log(res.status);
require 'net/http'

uri  = URI('https://spa.example.com')
http = Net::HTTP.new(uri.host, uri.port,
  'smartproxy.crawlbase.com', 8000, 'YOUR_JS_TOKEN', '')
http.use_ssl = true
http.verify_mode = OpenSSL::SSL::VERIFY_NONE

res = http.get(uri.request_uri)
puts res.code
<?php
$ch = curl_init('https://spa.example.com');
curl_setopt($ch, CURLOPT_PROXY,         'smartproxy.crawlbase.com:8000');
curl_setopt($ch, CURLOPT_PROXYUSERPWD,  'YOUR_JS_TOKEN:');
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);
package main

import (
    "crypto/tls"
    "fmt"
    "net/http"
    "net/url"
)

func main() {
    proxyURL, _ := url.Parse("http://YOUR_JS_TOKEN@smartproxy.crawlbase.com:8000")
    client := &http.Client{
        Transport: &http.Transport{
            Proxy:           http.ProxyURL(proxyURL),
            TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
        },
    }
    res, _ := client.Get("https://spa.example.com")
    fmt.Println(res.Status)
}

Rate limits

Default rate limit in proxy mode is 20 requests per second (~1.7M req/day). Concurrency-based clients should think in threads rather than RPS — at typical Crawling API latency (~4s for an Amazon product page) that converts to roughly 80 concurrent threads. Faster targets convert to fewer threads.

If you hit the cap, contact support with your use case to negotiate higher concurrency.

Errors & retries

The Crawling API surfaces two status codes on every response: original_status (what the target site returned) and pc_status (what Crawlbase made of it after applying anti-bot, redirect, and content-validation rules). They can disagree — a target might return 200 with an empty body, in which case original_status is 200 but pc_status is 520. Always branch on pc_status when deciding whether to retry.

The most common Crawling-API-specific failures:

  • 422: url missing or not URL-encoded. Encode the URL before sending. Most clients (libcurl --data-urlencode, Python requests, Node fetch) handle this automatically — but hand-built query strings often miss it.
  • 520: empty response from target. Retry once. If still empty, switch from Normal to JS token — many sites serve an empty shell to non-browser user agents and rely on JS to populate.
  • 521: target site is down / unreachable. Treat like a transient upstream error. Backoff + retry; if persistent over minutes, the site is genuinely down.
  • 522: connection timed out reaching the target. Retry with backoff. Try a different country if the target is geo-flaky.
  • 523: origin unreachable from the chosen exit. Retry without country (let auto-routing pick) or with a different country.
  • 525: anti-bot challenge couldn't be solved. Switch from Normal to JS token. If already on JS, retry; if persistent, escalate to support — usually means the target rolled out a new challenge variant.
  • 595: selector not found. The page loaded successfully but the CSS selector you passed via css_click_selector didn't match any element. Append a fallback to the selector (#start-button,body) so the click still lands on a known element. See the css_click_selector notes for the full pattern.
  • 599: internal Crawlbase error. Retry. If a request hits this consistently, contact support with the rid.

Full HTTP + pc_status reference is in Status Codes; Error handling covers the recommended retry-with-backoff loop and the SDK helpers that implement it for you in each language.

A concrete example. The most common reason pc_status diverges from original_status is a CAPTCHA: the target site returns a 200 (the captcha page rendered fine) but Crawlbase recognises the response as an interstitial and surfaces pc_status: 503 so you can route around it instead of treating the captcha HTML as your data.

Non-standard pc_status codes. Codes outside the usual HTTP range — 601, 999, and similar — are internal markers used by the Crawlbase engineering team. They're surfaced in the response only to help you debug when contacting support; you don't need to handle them in application code.

Retry strategy

The simple version: retry transient errors (5xx) with exponential backoff up to a cap (typically 3-5 attempts), don't retry client errors (4xx — they won't fix themselves), and switch token type once on the first 520/525 before retrying further. The SDK helpers implement this loop with sensible defaults; for a custom client, the rule of thumb is:

  • First retry: ~1s after failure
  • Second retry: ~3s after failure
  • Third retry: ~10s after failure
  • After that: log + alert; persistent failures usually mean a target-side change rather than transient networking

All retries against this API are free — only successful responses (pc_status: 200) count against your quota. That makes aggressive backoff cheap; the only real cost of retrying is the latency you add to your pipeline.
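
A sketch of that loop with plain requests (token placeholders and delays follow the rule of thumb above; the SDK helpers implement the same idea with their own defaults):

import time
import requests

API = 'https://api.crawlbase.com/'
DELAYS = [0, 1, 3, 10]  # seconds to wait before each attempt

def crawl(url, token='NORMAL_TOKEN', js_token='JS_TOKEN'):
    pc = None
    for delay in DELAYS:
        time.sleep(delay)
        res = requests.get(API, params={'token': token, 'url': url}, timeout=90)
        pc = int(res.headers.get('pc_status', 0))
        if pc == 200:
            return res
        if pc in (520, 525) and token != js_token:
            token = js_token  # promote to the JS token once, then keep retrying
        elif 400 <= pc < 500:
            break             # client errors won't fix themselves
    raise RuntimeError(f'giving up on {url} (last pc_status: {pc})')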

Performance & best practices

A few patterns recur across customers running this API at scale. Adopting them up front avoids the most common support-ticket categories.

  • Use the cheapest token that works. Don't default to the JavaScript token "just in case" — Normal token requests are faster and use less concurrency. Promote to JS only when the Normal response is empty or challenge-blocked.
  • Prefer ajax_wait over page_wait. Fixed delays burn concurrency on every request, even fast ones. ajax_wait returns the moment the page goes network-idle — typically faster on average and only slower on truly long-loading pages.
  • Push high volume through async + webhook. Synchronous mode is the right default for ad-hoc and interactive use. For batch jobs over a few hundred URLs, the async mode (or the Enterprise Crawler) keeps your concurrency budget free for new submissions while existing crawls finish.
  • Reuse sessions for stateful flows. If your target requires a logged-in session or cart cookies, hold a session ID and pass it on subsequent requests so the same exit IP and cookie jar are reused. See Authentication for the session-cookie pattern.
  • Watch the remaining header. Backoff before you hit your concurrency cap rather than discovering it through 429s — the response carries the number of slots left, so a healthy client sleeps proactively instead of reacting to errors.