
How the SDK is shaped

The Python SDK is a thin, dependency-light wrapper around the same HTTP API documented in API Reference. Every Crawling API parameter you'd append as a query string in a raw HTTP call is reachable from the SDK as a keyword in the options dict — names, defaults, and behavior all map one-to-one. There is no parameter the SDK adds; there is no parameter the SDK hides.
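
As a sketch of that one-to-one mapping (example.com and the parameter values here are illustrative; country and page_wait are both documented Crawling API parameters):

# Raw API call: ...?token=TOKEN&url=<encoded target>&country=US&page_wait=2000
# The same request through the SDK, with identical names and values:
from crawlbase import CrawlingAPI

api = CrawlingAPI({'token': 'TOKEN'})
res = api.get('https://example.com', {'country': 'US', 'page_wait': 2000})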

What you get for using it instead of requests directly:

  • URL encoding, parameter validation, and response parsing handled out of the box — your application code reads like product code, not HTTP plumbing.
  • A single client class per Crawlbase API, all sharing the same constructor / call shape so once you've used one, you've used all of them.
  • Sensible defaults (90-second timeout, JSON parsing of format=json responses, automatic UTF-8 decoding) that match what most teams configure by hand on their first integration.
  • A small surface area to learn — five client classes, two verbs (get / post), one response shape.

The SDK is open source, MIT-licensed, and accepts community PRs at github.com/crawlbase/crawlbase-python. Most reported issues land in a release within a sprint.

Install

Latest version on PyPI. Requires Python 3.7+; tested through Python 3.13.

pip install crawlbase

# Or via Poetry / uv / pip-tools
poetry add crawlbase
uv add crawlbase

Source on GitHub. Issues + PRs welcome.

Authentication

Every Crawlbase API authenticates with the same token model — there's no separate API key per product. Two token types live on a single account:

  • Normal Token (TCP) — for static HTML, JSON endpoints, anything that doesn't need a browser. Faster + cheaper.
  • JavaScript Token — for SPAs, lazy-loaded feeds, and any target that hides content behind client-side rendering. Required to use page_wait, ajax_wait, scroll, and css_click_selector.

Use environment variables in production rather than hard-coding tokens. The SDK doesn't read env vars itself — that's a deliberate choice so you stay in control of where credentials come from — but the idiomatic pattern is:

import os
from crawlbase import CrawlingAPI

# Pick the right token at instantiation; the SDK doesn't switch
# tokens per-call, so keep two clients if you alternate.
api = CrawlingAPI({'token': os.environ['CRAWLBASE_TOKEN']})
js  = CrawlingAPI({'token': os.environ['CRAWLBASE_JS_TOKEN']})

res = api.get('https://github.com/anthropic')
res = js.get('https://feed.example.com', {'page_wait': 2000})

Full token model + dashboard locations on the Authentication page.

Quickstart

Three lines from import to crawled HTML:

from crawlbase import CrawlingAPI

api = CrawlingAPI({'token': 'YOUR_TOKEN'})
res = api.get('https://github.com/anthropic')

if res['status_code'] == 200:
    print(res['body'])

Branch on status_code (the HTTP status of the SDK's request to Crawlbase) and pc_status (the Crawlbase verdict — see Errors below) when deciding whether to retry. The body is bytes by default; pass 'format': 'json' to receive a JSON envelope instead of raw page content.
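
For instance, requesting the envelope (a minimal sketch; if your SDK version already parses format=json bodies, per the defaults bullet above, the json.loads step is redundant):

import json

res = api.get('https://github.com/anthropic', {'format': 'json'})
envelope = json.loads(res['body'])  # keys: pc_status, original_status, url, body
print(envelope['pc_status'])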

All APIs in one package

Every Crawlbase API has a matching client class. Same constructor, same get / post verbs. Pick the class by what you're doing; behind the scenes they all hit a different endpoint of the same platform.

from crawlbase import (
    CrawlingAPI,    # general-purpose page fetch (HTML / JSON / etc.)
    ScraperAPI,     # parsed JSON for supported sites (Amazon, Google, etc.)
    LeadsAPI,       # domain-scoped email extraction (legacy)
    ScreenshotsAPI, # screenshots of any URL
    StorageAPI,     # Cloud Storage CRUD
)

token = {'token': 'YOUR_TOKEN'}

crawl   = CrawlingAPI(token)
scraper = ScraperAPI(token)
leads   = LeadsAPI(token)
shots   = ScreenshotsAPI(token)
storage = StorageAPI(token)

# Push high-volume async jobs to the Enterprise Crawler via the
# Crawling API: api.get(url, {'async': True, 'callback': '...',
# 'crawler': 'YourCrawler'}). See /docs/crawler for the queue
# workflow.

Common patterns

JavaScript rendering

For SPAs, lazy-loaded feeds, and pages where the initial HTML is empty, instantiate with the JavaScript token and pass any combination of page_wait, ajax_wait, scroll, and css_click_selector. A useful order of operations: a fixed wait (page_wait), then network-idle (ajax_wait), then scroll for lazy-loaded content, then css_click_selector for any gating UI element.

api = CrawlingAPI({'token': 'YOUR_JS_TOKEN'})
res = api.get('https://spa.example.com', {
    'page_wait': 2000,
    'ajax_wait': True,
    'scroll': True,
})
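
When content sits behind a click (a cookie banner or a load-more button, say), add css_click_selector; the selector below is a made-up example:

res = api.get('https://spa.example.com', {
    'page_wait': 2000,
    'css_click_selector': 'button.load-more',  # hypothetical gating element
})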

Use a built-in scraper

Skip the parser entirely on supported sites. Pass 'scraper': 'NAME' and the response body becomes a JSON string with the structured fields documented on the per-scraper page.

import json
from crawlbase import ScraperAPI

api = ScraperAPI({'token': 'YOUR_TOKEN'})
res = api.get(
    'https://www.amazon.com/dp/B08N5WRWNW',
    {'scraper': 'amazon-product-details'}
)
data = json.loads(res['body'])
print(data['name'], data['price'])

Geo-routing

Pass a two-letter ISO country code as 'country' to route the crawl through that country's exit nodes. Use it any time the target serves localized content based on IP — most retailers, all SERPs, geo-restricted streaming pages.

api = CrawlingAPI({'token': 'YOUR_TOKEN'})

# Fetch the amazon.com listing from a German IP to see DE-localized content
res = api.get(
    'https://www.amazon.com/dp/B08N5WRWNW',
    {'country': 'DE'}
)

Retries with backoff

The recommended retry shape: exponential backoff capped at 3-5 attempts, retry on transient errors only (5xx or empty body), don't retry on 4xx (the request shape is wrong and won't fix itself).

import time, random
from crawlbase import CrawlingAPI

api = CrawlingAPI({'token': 'YOUR_TOKEN'})

def crawl(url, attempts=5):
    for i in range(attempts):
        res = api.get(url)
        # 200 from Crawlbase + non-empty body from the target
        if res['status_code'] == 200 and int(res.get('pc_status', 0)) == 200:
            return res
        # Don't bother retrying client errors (4xx)
        if 400 <= res['status_code'] < 500:
            raise ValueError(f"client error {res['status_code']}: {url}")
        # Exponential backoff with jitter
        time.sleep(random.uniform(0, 2 ** i))
    raise RuntimeError(f'Failed: {url}')

Async crawls + webhooks

Fire-and-forget mode. The SDK call returns immediately with an rid; Crawlbase POSTs the result to your callback URL when the page is ready. Useful for batch jobs and slow targets where you don't want a synchronous request to occupy a concurrency slot for 30+ seconds.

api = CrawlingAPI({'token': 'YOUR_TOKEN'})
res = api.get('https://example.com', {
    'async': True,
    'callback': 'https://your-app.com/webhook',
})
rid = res['rid']  # use this to correlate the eventual webhook delivery

# Webhook handler (Flask / FastAPI / etc.) receives a POST with:
#   { rid, url, original_status, pc_status, body }
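
A minimal receiver sketch using Flask. It assumes the payload arrives as JSON shaped like the comment above (verify the exact delivery format on the Crawler docs page); process() stands in for your own handler:

from flask import Flask, request

app = Flask(__name__)

@app.route('/webhook', methods=['POST'])
def crawlbase_webhook():
    payload = request.get_json(force=True)  # assumed JSON delivery; see lead-in
    if int(payload.get('pc_status', 0)) == 200:
        process(payload['rid'], payload['body'])  # process() is your hook
    return ('', 204)  # acknowledge fast; do heavy work off the request path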

For very high volumes (millions of URLs), use the Enterprise Crawler which sits in front of this same async pipeline with retries, rate management, and result delivery.

Sticky sessions

Some flows need the same residential IP across multiple calls — a checkout, a paginated search, a logged-in session. Pass 'cookies_session' with a stable identifier and Crawlbase reuses the same exit node for ~30 minutes.

api = CrawlingAPI({'token': 'YOUR_JS_TOKEN'})

session = f'checkout-{user_id}'  # user_id comes from your application
api.get('https://shop.example.com/cart',     {'cookies_session': session})
api.get('https://shop.example.com/checkout', {'cookies_session': session})
api.get('https://shop.example.com/confirm',  {'cookies_session': session})

Errors & retries

The Crawlbase platform surfaces two status codes on every response: the SDK's own status_code (the HTTP status of the request to Crawlbase itself) and pc_status (Crawlbase's verdict on the target — see the Crawling API errors table for the full list). Always branch on pc_status when deciding whether to retry — a target can return 200 with empty body, in which case status_code is 200 but pc_status is 520.

# use(), retry_with_js_token(), schedule_retry(), and log below are
# placeholders for your application's own handlers.
res = api.get(url)
pc = int(res.get('pc_status', 0))

if pc == 200:
    use(res['body'])
elif pc in (520, 525):
    # 520 = empty body, 525 = anti-bot couldn't be solved.
    # Switch to JS token and retry.
    retry_with_js_token(url)
elif pc in (521, 522, 523):
    # Target unreachable or timed out. Retry with backoff.
    schedule_retry(url)
else:
    log.error('crawl failed', extra={'url': url, 'pc_status': pc})

All retries against the platform are free — only successful responses (pc_status: 200) count against your quota. That makes aggressive backoff cheap; the only real cost of retrying is added latency.

Performance & best practices

  • Reuse a single client per token. The constructor is cheap but each instance opens its own connection pool. Build it once at module scope, share it across calls.
  • Use the cheapest token that works. Don't default to the JavaScript token "just in case" — Normal-token requests are faster and use less concurrency. Promote to JS only when the Normal response is empty or anti-bot-blocked.
  • Prefer ajax_wait over page_wait. Fixed delays burn concurrency on every request, even fast ones. ajax_wait returns the moment the page goes network-idle.
  • For batch jobs: async + webhook, or push to the Enterprise Crawler. Synchronous mode is the right default for ad-hoc and interactive use; for sustained high-volume submission switch to async so your concurrency slot frees up the moment a request is queued rather than when it completes.
  • Watch the remaining response header. It carries the number of concurrency slots you have left — a healthy client backs off proactively before hitting the cap rather than reacting to 429s; see the sketch below.
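
A proactive-throttle sketch built on that header. The header name follows the bullet above, and the floor and pause values are arbitrary assumptions:

import time
from crawlbase import CrawlingAPI

api = CrawlingAPI({'token': 'YOUR_TOKEN'})

def polite_get(url, options=None, floor=5, pause=1.0):
    # Sleep briefly when few concurrency slots remain, instead of
    # waiting to react to 429s.
    res = api.get(url, options or {})
    remaining = res['headers'].get('remaining')  # header name per the bullet above
    if remaining is not None and int(remaining) < floor:
        time.sleep(pause)  # arbitrary cooldown; tune to your workload
    return res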

Method reference

All client classes share the same surface. Constructor takes a single options dict; verbs mirror the underlying HTTP methods.

CrawlingAPI({'token': T, 'timeout': N})  (constructor)
    Initialize a client with your token. Optional: 'timeout' in seconds (default 90) — applies to the SDK's HTTP call to Crawlbase, not the upstream crawl.

.get(url, options=None)  (method)
    Send a GET. options is a dict mapping any Crawling API parameter to its value. Returns a response dict.

.post(url, data, options=None)  (method)
    Send a POST. data is the body — pass a dict for form-encoded, a string for raw. options works the same as .get.
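
For illustration, both data shapes accepted by .post (the target URL is a placeholder):

from crawlbase import CrawlingAPI

api = CrawlingAPI({'token': 'YOUR_TOKEN'})

# A dict body is form-encoded by the SDK:
res = api.post('https://example.com/search', {'q': 'laptops'})

# A string body is passed through raw:
res = api.post('https://example.com/search', '{"q": "laptops"}')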

Response shape (dict, all keys present even when their value is empty):

status_code (int)
    HTTP status of the SDK's request to Crawlbase. 200 means the request was accepted; check pc_status for the target outcome.
pc_status (int)
    Crawlbase verdict on the target. Branch on this for retry decisions.
original_status (int)
    HTTP status the target site returned to Crawlbase.
url (str)
    Final URL after target-side redirects.
body (bytes | str)
    Page content (or JSON string when format=json / scraper= was used).
headers (dict)
    Response headers from the target site.
rid (str)
    Request ID (when async=true or store=true).