The Crawler is a managed queue: you push URLs in and read results out. Its lifecycle has three steps, covered in order below: Setup (configure the queue), Push (enqueue URLs), and Pull (receive results).

Setup

Create a named queue in your dashboard. Each crawler holds up to 100K URLs. Create one queue per workload — they don't share state. At creation you pick:

  • A unique name (you choose it: product-monitor, news-feed, etc.)
  • A delivery mode — either a callback URL (Crawlbase POSTs each result to that webhook) or Cloud Storage (results are persisted automatically and you fetch them via the Storage API on your own schedule). Picked once at creation; the two modes are exclusive — the same crawler doesn't do both.
  • A token type (Normal or JavaScript)
  • A concurrency limit (default 20, raisable on request)

No per-request store flag

Storage delivery is a property of the crawler, not the push. If the crawler was created in Storage mode, every result lands in Cloud Storage automatically — you don't need to set store=true on each push, and webhook-mode crawlers can't opt in per request.

Push

Send URLs to the crawler's queue. The push returns immediately with an rid so your client can move on; the actual crawl happens in the background at the crawler's configured concurrency. Pass callback=true to opt the request into queue delivery instead of running it inline.

GET https://api.crawlbase.com/?token=…&crawler=NAME&callback=true&url=…
curl 'https://api.crawlbase.com/?token=YOUR_TOKEN&crawler=product-monitor&callback=true&url=https%3A%2F%2Fexample.com%2Fp%2F12345'
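
# Python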
from crawlbase import CrawlerAPI

api = CrawlerAPI({'token': 'YOUR_TOKEN'})
res = api.push(
    'https://example.com/p/12345',
    {'crawler': 'product-monitor'}
)
print(res['rid'])
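
// Node.js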
const { CrawlerAPI } = require('crawlbase');
const api = new CrawlerAPI({ token: 'YOUR_TOKEN' });

const res = await api.push(
  'https://example.com/p/12345',
  { crawler: 'product-monitor' }
);
console.log(res.rid);

Push response is small — just confirmation that the URL is queued.

{ "rid": "a1B2c3D4e5F6" }
Push throughput and queue caps

Push rate is capped at 30 URLs/sec per token by default. Each crawler holds up to 100K URLs in its waiting queue, and the combined total across all of your crawlers is capped at 1,000,000; once you cross that, pushes pause and you get an email — drain the queue (or purge) and pushes resume automatically.
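
If a producer script pushes large batches, staying under that cap is the client's job. A minimal throttling sketch in Python, assuming the requests library and the GET endpoint shown above:

import time
import requests

TOKEN = 'YOUR_TOKEN'
CRAWLER = 'product-monitor'
MAX_PER_SEC = 30  # default push cap per token

def push_batch(urls):
    rids = []
    for i, url in enumerate(urls):
        res = requests.get('https://api.crawlbase.com/', params={
            'token': TOKEN,
            'crawler': CRAWLER,
            'callback': 'true',
            'url': url,  # requests handles the URL-encoding
        })
        res.raise_for_status()
        rids.append(res.json()['rid'])  # push response is just the rid
        if (i + 1) % MAX_PER_SEC == 0:
            time.sleep(1)  # crude throttle: at most MAX_PER_SEC pushes per second
    return rids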

Pull

How completed crawls reach you. Two channels, picked once at crawler creation:

Webhook

When the crawler was created with a callback URL, Crawlbase POSTs each result to that webhook the moment the crawl finishes. The page content arrives in the request body and the metadata in the request headers; no polling, no client-side state.

# POST https://your-app.com/webhook
# Content-Type: text/html  (or application/json if a scraper is used)
# pc_status: 200
# original_status: 200
# rid: a1B2c3D4e5F6
# url: https://example.com/p/12345

…

Your webhook should:

  • Be publicly reachable from Crawlbase servers.
  • Accept POST and respond with 200, 201, or 204 within 200ms.
  • Be idempotent on rid — duplicate deliveries can happen on retry.
  • Acknowledge before processing — kick the work off async if it takes longer than the response window.

The body shape follows the format parameter you set on push:

format=html
  • Content-Type: text/plain
  • Body is the HTML of the page
  • Headers carry metadata: Original-Status, PC-Status, rid, url

format=json
  • Content-Type: gzip/json
  • Body is JSON: { pc_status, original_status, rid, url, body }
  • Headers carry the same metadata; the fields are also repeated in the body

Both shapes arrive gzip-compressed (Content-Encoding: gzip) — your handler needs to decompress before parsing. The exception is Zapier webhooks, which can't read gzipped bodies; Crawlbase detects Zapier callback URLs and skips compression.
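
Putting the webhook requirements together, here is a minimal handler sketch using Flask (an assumption; any framework works). It acknowledges immediately, dedupes on rid, decompresses the gzipped body, and hands the real work to a background thread. The in-memory seen_rids set is a stand-in for whatever durable store you'd use in production:

import gzip
import threading

from flask import Flask, request

app = Flask(__name__)
seen_rids = set()  # stand-in for a durable store (Redis, a DB table, ...)

def process(rid, url, body):
    ...  # the slow part (parse, persist, notify) runs off the request thread

@app.route('/webhook', methods=['POST'])
def webhook():
    rid = request.headers.get('rid')
    if rid is None or rid in seen_rids:
        return '', 200                    # idempotent on rid: swallow duplicates
    seen_rids.add(rid)

    raw = request.get_data()
    if request.headers.get('Content-Encoding') == 'gzip':
        raw = gzip.decompress(raw)        # deliveries arrive gzip-compressed

    url = request.headers.get('url')
    threading.Thread(target=process, args=(rid, url, raw)).start()
    return '', 200                        # acknowledge before processing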

Failed deliveries are charged

Every retry counts as a successful crawl for billing purposes — Crawlbase already paid the proxy / browser cost. Keep your webhook reliable; the cheapest way to reduce credit burn is to stop dropping deliveries, not to fight the retry policy.

Testing. When you're wiring up the handler for the first time and want to inspect the exact payload shape for a real URL, create a Storage-mode crawler alongside your webhook one and push the same URLs to both. Pull from Cloud Storage by RID and you have a frozen reference to compare your webhook receipts against — useful for catching decompression bugs and metadata-handling mistakes before they hit production traffic.
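
A possible comparison script for that setup. The file layout is hypothetical (it assumes your webhook handler writes each decompressed payload to webhook_payloads/<rid>.html), and note that the same URL pushed to two crawlers returns two different RIDs, so record both at push time:

import requests

storage_rid = 'a1B2c3D4e5F6'   # rid from the push to the Storage-mode crawler
webhook_rid = 'g7H8i9J0k1L2'   # rid from the push to the webhook-mode crawler (hypothetical)

# frozen reference copy, fetched from Cloud Storage
ref = requests.get('https://api.crawlbase.com/storage',
                   params={'token': 'YOUR_TOKEN', 'rid': storage_rid}).content

# what the webhook handler stored for the same URL (hypothetical path)
with open(f'webhook_payloads/{webhook_rid}.html', 'rb') as f:
    got = f.read()

print('match' if ref == got else 'mismatch: check gzip handling and metadata parsing')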

Monitoring bot. Crawlbase polls your webhook on a schedule to detect outages. If the bot can't reach your endpoint or you stop returning 2xx, the crawler pauses itself automatically and resumes once your endpoint comes back. The probe is a regular POST with a JSON body, distinguishable by its User-Agent:

POST https://your-app.com/webhook
User-Agent: Crawlbase Monitoring Bot 1.0
Content-Type: application/json

{ "monitor": true }

Treat probes as a no-op and return 200. Don't process them as crawl results — there's no real RID to act on.
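
In the handler sketch above, that is one early return; the User-Agent string is the one documented here:

MONITOR_UA = 'Crawlbase Monitoring Bot 1.0'

# first check inside webhook(), before the rid logic
if request.headers.get('User-Agent') == MONITOR_UA:
    return '', 200   # health probe: acknowledge, don't treat as a crawl result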

Protecting the endpoint. A random-string path (yourdomain.com/2340JOiow43djoqe21rjosi) is already most of the protection in practice — the URL is unlikely to be discovered. For belt-and-braces, layer one or more of:

  • A query-string token (?token=…) that the webhook checks before accepting the body.
  • A custom header sent via callback_headers on push (e.g. X-Webhook-Token|s3kret) and verified server-side.
  • A method check that rejects anything other than POST.
  • A sanity check that rejects anything missing the expected metadata headers (PC-Status, Original-Status, rid).

We don't recommend IP allowlisting — Crawlbase pushes from many IPs and the set rotates without notice.
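
As a concrete version of those header checks inside the handler sketch above (X-Webhook-Token and s3kret are the illustrative values from the list; keep the monitoring-bot check before these, since probes carry neither the secret nor the metadata headers):

# after the monitoring-bot check, before any processing
if request.headers.get('X-Webhook-Token') != 's3kret':
    return '', 403   # shared secret from callback_headers missing or wrong
if 'rid' not in request.headers:
    return '', 400   # not a crawl delivery: expected metadata headers absent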

Cloud Storage

When the crawler was created with Storage as its delivery mode, every result is persisted to Cloud Storage automatically — no per-push flag, no webhook. Your consumer fetches results on its own schedule via the Storage API. Use this when downstream is batched, when you can't run an HTTPS endpoint, or when you want a stable URL for each crawled page.

Push the same way you would for a webhook-mode crawler — the only difference is where the result ends up. Once a URL finishes crawling, fetch by RID:

# Single fetch by RID
curl 'https://api.crawlbase.com/storage?token=YOUR_TOKEN&rid=a1B2c3D4e5F6'

# Or batch-drain up to 100 RIDs at once with auto_delete=true
# so storage stays small.
curl -X POST 'https://api.crawlbase.com/storage/bulk?token=YOUR_TOKEN' \
  -H 'Content-Type: application/json' \
  -d '{ "rids": ["RID1","RID2","RID3"], "auto_delete": true }'

To know when a result is ready, either poll Find Job for a specific RID, watch Stats for the queued / completed counters, or just batch-drain /storage/bulk on a schedule and let "RID not found" tell you what's still pending.
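
A sketch of the batch-drain approach with the requests library. The shape of the bulk response is an assumption here (one JSON entry per found RID, with its body); adjust the field handling to what you actually receive. handle_result is a hypothetical downstream consumer:

import time
import requests

def handle_result(item):
    ...   # hypothetical downstream consumer

pending = {'RID1', 'RID2', 'RID3'}   # RIDs collected from push responses

while pending:
    res = requests.post(
        'https://api.crawlbase.com/storage/bulk',
        params={'token': 'YOUR_TOKEN'},
        json={'rids': list(pending)[:100], 'auto_delete': True},
    )
    for item in res.json():
        if item.get('body'):          # found: the crawl finished (auto_delete removed it from storage)
            handle_result(item)
            pending.discard(item.get('rid'))
    time.sleep(60)                    # anything not returned yet is still pending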

Delivery mode is set at creation

The two modes are exclusive and bound to the crawler when you create it. A webhook crawler doesn't write to Storage; a Storage crawler doesn't fire a webhook. To switch, create a new crawler with the other mode and migrate your push traffic to it.

Management API

A small REST surface for monitoring and managing your crawlers — get stats, purge a queue, pause/unpause, find or delete a single job by RID. All endpoints live under /crawler/<TOKEN>/... and authenticate by token in the path (no query-string token needed).

Token in path, not query string

Unlike the Crawling API, these endpoints expect the token in the URL path. For JavaScript-token crawlers, swap the Normal token for your JS token in every example below.

Stats

Summary across all your crawlers — concurrency, queue depth, completed/failed counts, and a history breakdown.

GET https://api.crawlbase.com/crawler/<TOKEN>/stats
# All-time summary
curl 'https://api.crawlbase.com/crawler/YOUR_TOKEN/stats'

# Same, filtered to a date range (YYYY-MM-DD bounds, inclusive)
curl 'https://api.crawlbase.com/crawler/YOUR_TOKEN/stats?history_from=2026-04-01&history_to=2026-04-30'

Purge a crawler

Empties the crawler's queue immediately — every still-pending URL is dropped. Use this to recover from a runaway producer or to clear a batch you no longer want to process. There's no undo.

POST https://api.crawlbase.com/crawler/<TOKEN>/<NAME>/purge
curl -X POST 'https://api.crawlbase.com/crawler/YOUR_TOKEN/product-monitor/purge'

Purge is immediate and total

Every queued URL in that crawler is dropped — there's no soft-delete or recovery. If you only need to drop a single URL, use Delete Job instead.

Delete a single job

Drop one URL from the queue by its RID — the request ID returned when you pushed the URL.

POST https://api.crawlbase.com/crawler/<TOKEN>/<NAME>/delete_job
curl -X POST 'https://api.crawlbase.com/crawler/YOUR_TOKEN/product-monitor/delete_job?rid=YOUR_RID'

Find a job by RID

Look up where a request stands. Returns QUEUED with the queued metadata if it's still pending, or NOT_QUEUED if it's already crawled (or never made it onto the queue).

GET https://api.crawlbase.com/crawler/<TOKEN>/<NAME>/find_by_rid/<RID>
curl 'https://api.crawlbase.com/crawler/YOUR_TOKEN/product-monitor/find_by_rid/YOUR_RID'
# While the request is still pending:
{
  "status": "QUEUED",
  "request_info": {
    "rid": "YOUR_RID",
    "url": "YOUR_URL",
    "retry": 3,
    "created_at": 1600494969.189415
  }
}

# Once crawled (or never queued):
{
  "status": "NOT_QUEUED",
  "request_info": {
    "rid": "YOUR_RID"
  }
}
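
Because NOT_QUEUED also covers the crawled case, find_by_rid doubles as a crude completion poll for RIDs you know you pushed. A minimal sketch, assuming the requests library:

import time
import requests

def wait_until_crawled(rid, crawler='product-monitor', interval=30):
    url = f'https://api.crawlbase.com/crawler/YOUR_TOKEN/{crawler}/find_by_rid/{rid}'
    while requests.get(url).json()['status'] == 'QUEUED':
        time.sleep(interval)   # still waiting for a worker
    # NOT_QUEUED here means the crawl finished (for a RID you actually pushed)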

Pause and unpause

Stop a crawler from picking up new work without losing its queue. Pushed URLs continue to enqueue, but the crawler stops processing them until you unpause. Useful for maintenance windows or backing off when a downstream system is unhealthy.

# Pause — stops the crawler picking up new work
curl -X POST 'https://api.crawlbase.com/crawler/YOUR_TOKEN/product-monitor/pause'

# Unpause — resumes processing
curl -X POST 'https://api.crawlbase.com/crawler/YOUR_TOKEN/product-monitor/unpause'
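
For the unhealthy-downstream case, a watchdog sketch; the /healthz URL and the 60-second cadence are assumptions, not part of the API:

import time
import requests

BASE = 'https://api.crawlbase.com/crawler/YOUR_TOKEN/product-monitor'

paused = False
while True:
    healthy = requests.get('https://your-app.com/healthz').ok  # hypothetical downstream health check
    if not healthy and not paused:
        requests.post(f'{BASE}/pause')     # stop processing, keep the queue intact
        paused = True
    elif healthy and paused:
        requests.post(f'{BASE}/unpause')   # resume
        paused = False
    time.sleep(60)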

Parameters

crawler
string, required
Name of the crawler from your dashboard.

url
string, required (push)
URL to enqueue. URL-encode it.

callback_headers
string, optional
Extra headers to include on the webhook delivery, format name|value|name|value. Useful for passing IDs back to your handler. Webhook-mode crawlers only; ignored when the crawler delivers to Storage.

queue_timeout
integer (minutes), optional
Maximum time the request may sit in the queue before a worker picks it up. Range 1 to 10080 (1 minute to 7 days). Once a worker starts the crawl, this timer no longer applies. If the request expires waiting, you get a callback with HTTP 504 and pc_status=699. Omit (or set to 0) to disable. Aggressive values raise the failure rate; pick what reflects how long the result is actually useful to you.

All Crawling API params
optional
Pass page_wait, scroll, country, scraper, etc.; they're applied to each crawl.
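
For example, a push that combines several of these; the parameter values are illustrative, not recommendations:

import requests

res = requests.get('https://api.crawlbase.com/', params={
    'token': 'YOUR_TOKEN',
    'crawler': 'product-monitor',
    'callback': 'true',
    'url': 'https://example.com/p/12345',
    'page_wait': 2000,       # Crawling API param, applied to this crawl
    'country': 'US',
    'queue_timeout': 120,    # give up if still queued after 2 hours
})
print(res.json()['rid'])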

When to use Crawler vs the API

  • Crawler: any time you have more than a few hundred URLs to process, especially across long time horizons. The queue handles retries, scheduling, and concurrency.
  • Direct Crawling API: when you need the result inline — page rendering for a user-facing request, AI agent fetching context, etc.