Enterprise Crawler
Push URLs into a managed queue, let Crawlbase run them at high concurrency, get results delivered to your webhook or persisted to Cloud Storage for you to pull. No client-side scheduling, no retry logic, no concurrency math.
The Crawler is a managed queue you push URLs into and read results out of. Three lifecycle steps — Setup (configure the queue), Push (enqueue URLs), Pull (receive results) — covered in order below.
Setup
Create a named queue in your dashboard. Each crawler holds up to 100K URLs. Create one queue per workload — they don't share state. At creation you pick:
- A unique name (you choose it: `product-monitor`, `news-feed`, etc.)
- A delivery mode — either a callback URL (Crawlbase POSTs each result to that webhook) or Cloud Storage (results are persisted automatically and you fetch them via the Storage API on your own schedule). Picked once at creation; the two modes are exclusive — the same crawler doesn't do both.
- A token type (Normal or JavaScript)
- A concurrency limit (default 20, raisable on request)
Storage delivery is a property of the crawler, not the push. If the crawler was created in Storage mode, every result lands in Cloud Storage automatically — you don't need to set store=true on each push, and webhook-mode crawlers can't opt in per request.
Push
Send URLs to the crawler's queue. The push returns immediately with an rid so your client can move on; the actual crawl happens in the background at the crawler's configured concurrency. Pass callback=true to opt the request into queue delivery instead of running it inline.
```
curl 'https://api.crawlbase.com/?token=YOUR_TOKEN&crawler=product-monitor&callback=true&url=https%3A%2F%2Fexample.com%2Fp%2F12345'
```

```python
from crawlbase import CrawlerAPI

api = CrawlerAPI({'token': 'YOUR_TOKEN'})
res = api.push(
    'https://example.com/p/12345',
    {'crawler': 'product-monitor'}
)
print(res['rid'])
```

```javascript
const { CrawlerAPI } = require('crawlbase');
const api = new CrawlerAPI({ token: 'YOUR_TOKEN' });
const res = await api.push(
    'https://example.com/p/12345',
    { crawler: 'product-monitor' }
);
console.log(res.rid);
```

The push response is small — just confirmation that the URL is queued.

```json
{ "rid": "a1B2c3D4e5F6" }
```

Push rate is capped at 30 URLs/sec per token by default. Each crawler holds up to 100K URLs in its waiting queue, and the combined total across all of your crawlers is capped at 1,000,000; once you cross that, pushes pause and you get an email — drain the queue (or purge) and pushes resume automatically.
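To stay under those caps, a producer can pace its own pushes. A minimal sketch — the function names and pacing strategy are ours, the URL shape follows the curl example above, and actually sending the GET is left to whatever HTTP client you already use:

```python
import time
from urllib.parse import urlencode

API = "https://api.crawlbase.com/"

def push_request_url(token: str, crawler: str, target: str) -> str:
    # urlencode percent-encodes the target URL, as in the curl example above
    qs = urlencode({
        "token": token,
        "crawler": crawler,
        "callback": "true",
        "url": target,
    })
    return f"{API}?{qs}"

def push_all(targets, token, crawler, rate=30):
    """Yield one push URL per target, paced under the default 30 URLs/sec cap."""
    for t in targets:
        yield push_request_url(token, crawler, t)
        time.sleep(1.0 / rate)
```

Because the push returns only an `rid`, there is no reason to wait on responses — fire each push, record the `rid`, and move on.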
Pull
How completed crawls reach you. Two channels, picked once at crawler creation:
Webhook
When the crawler was created with a callback URL, Crawlbase POSTs each result to that webhook the moment the crawl finishes. Body in the request body, metadata in the request headers — no polling, no client-side state.
```
# POST https://your-app.com/webhook
# Content-Type: text/html (or application/json if scraper used)
# pc_status: 200
# original_status: 200
# rid: a1B2c3D4e5F6
# url: https://example.com/p/12345
…
```

Your webhook should:

- Be publicly reachable from Crawlbase servers.
- Accept `POST` and respond with `200`, `201`, or `204` within 200ms.
- Be idempotent on `rid` — duplicate deliveries can happen on retry.
- Acknowledge before processing — kick the work off async if it takes longer than the response window.
The body shape follows the `format` parameter you set on push:

| format=html | format=json |
|---|---|
| `Content-Type: text/plain` | `Content-Type: gzip/json` |
| Body is the HTML of the page | Body is JSON: `{ pc_status, original_status, rid, url, body }` |
| Headers carry metadata: `Original-Status`, `PC-Status`, `rid`, `url` | Headers carry the same metadata; the fields are also in the body |
Both shapes arrive gzip-compressed (Content-Encoding: gzip) — your handler needs to decompress before parsing. The exception is Zapier webhooks, which can't read gzipped bodies; Crawlbase detects Zapier callback URLs and skips compression.
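The decompress-then-normalize step can be sketched as a pure function your handler calls before doing anything else. `parse_delivery` is our name, not an SDK call, and we assume header lookup should be case-insensitive since the examples above mix `rid` with `PC-Status`:

```python
import gzip
import json

def parse_delivery(headers: dict, raw: bytes) -> dict:
    """Decompress one webhook delivery and merge header metadata into a dict."""
    h = {k.lower(): v for k, v in headers.items()}
    if h.get("content-encoding") == "gzip":
        raw = gzip.decompress(raw)
    if "json" in h.get("content-type", ""):
        result = json.loads(raw)          # format=json: fields already in the body
    else:
        result = {"body": raw.decode("utf-8", errors="replace")}
    # Header metadata fills in whatever the body didn't carry
    for key in ("rid", "url", "pc-status", "original-status"):
        result.setdefault(key.replace("-", "_"), h.get(key))
    return result
```

Keeping this as a standalone function also makes it easy to test against a frozen Cloud Storage reference, as suggested below.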
Every retry counts as a successful crawl for billing purposes — Crawlbase already paid the proxy / browser cost. Keep your webhook reliable; the cheapest way to reduce credit burn is to stop dropping deliveries, not to fight the retry policy.
Testing. When you're wiring up the handler for the first time and want to inspect the exact payload shape for a real URL, create a Storage-mode crawler alongside your webhook one and push the same URLs to both. Pull from Cloud Storage by RID and you have a frozen reference to compare your webhook receipts against — useful for catching decompression bugs and metadata-handling mistakes before they hit production traffic.
Monitoring bot. Crawlbase polls your webhook on a schedule to detect outages. If the bot can't reach your endpoint or you stop returning 2xx, the crawler pauses itself automatically and resumes once your endpoint comes back. The probe is a regular POST with a JSON body, distinguishable by its User-Agent:
```
POST https://your-app.com/webhook
User-Agent: Crawlbase Monitoring Bot 1.0
Content-Type: application/json

{ "monitor": true }
```

Treat probes as a no-op and return 200. Don't process them as crawl results — there's no real RID to act on.
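The probe check belongs at the very top of the handler, before any real processing. A sketch — `is_monitor_probe` is our name, and the two signals it checks (the User-Agent and the `{"monitor": true}` body) are the ones shown above:

```python
import json

def is_monitor_probe(headers: dict, body: bytes) -> bool:
    """True if this request is the monitoring bot, not a crawl result."""
    if headers.get("User-Agent", "").startswith("Crawlbase Monitoring Bot"):
        return True
    try:
        # Crawl results are HTML or gzipped JSON with an rid — a bare
        # {"monitor": true} body only ever comes from the probe.
        return json.loads(body).get("monitor") is True
    except (ValueError, AttributeError):
        return False
```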
Protecting the endpoint. A random-string path (yourdomain.com/2340JOiow43djoqe21rjosi) is already most of the protection in practice — the URL is unlikely to be discovered. For belt-and-braces, layer one or more of:
- A query-string token (`?token=…`) the webhook checks before accepting the body.
- A custom header sent via `callback_headers` on push (e.g. `X-Webhook-Token|s3kret`) and verified server-side.
- Reject anything that isn't `POST`.
- Reject anything missing the expected metadata headers (`Pc-Status`, `Original-Status`, `rid`).
We don't recommend IP allowlisting — Crawlbase pushes from many IPs and the set rotates without notice.
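Those checks are mechanical enough to sketch in one gate function. The header names come from the list above; the `s3kret` value mirrors the `callback_headers` example and is a placeholder for your own secret:

```python
REQUIRED_HEADERS = ("pc-status", "original-status", "rid")

def accept(method: str, headers: dict) -> bool:
    """Gate a webhook request before reading its body."""
    h = {k.lower(): v for k, v in headers.items()}
    if method != "POST":
        return False
    # Crawl deliveries always carry the metadata headers
    if any(name not in h for name in REQUIRED_HEADERS):
        return False
    # Shared secret sent via callback_headers on push
    return h.get("x-webhook-token") == "s3kret"
```

Note that the monitoring probe won't carry the metadata headers, so check for probes before applying a gate like this.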
Cloud Storage
When the crawler was created with Storage as its delivery mode, every result is persisted to Cloud Storage automatically — no per-push flag, no webhook. Your consumer fetches results on its own schedule via the Storage API. Use this when downstream is batched, when you can't run an HTTPS endpoint, or when you want a stable URL for each crawled page.
Push the same way you would for a webhook-mode crawler — the only difference is where the result ends up. Once a URL finishes crawling, fetch by RID:
```
# Single fetch by RID
curl 'https://api.crawlbase.com/storage?token=YOUR_TOKEN&rid=a1B2c3D4e5F6'

# Or batch-drain up to 100 RIDs at once with auto_delete=true
# so storage stays small.
curl -X POST 'https://api.crawlbase.com/storage/bulk?token=YOUR_TOKEN' \
  -H 'Content-Type: application/json' \
  -d '{ "rids": ["RID1","RID2","RID3"], "auto_delete": true }'
```

To know when a result is ready, either poll Find Job for a specific RID, watch Stats for the queued / completed counters, or just batch-drain `/storage/bulk` on a schedule and let "RID not found" tell you what's still pending.
The two modes are exclusive and bound to the crawler when you create it. A webhook crawler doesn't write to Storage; a Storage crawler doesn't fire a webhook. To switch, create a new crawler with the other mode and migrate your push traffic to it.
Management API
A small REST surface for monitoring and managing your crawlers — get stats, purge a queue, pause/unpause, find or delete a single job by RID. All endpoints live under /crawler/<TOKEN>/... and authenticate by token in the path (no query-string token needed).
Unlike the Crawling API, these endpoints expect the token in the URL path. For JavaScript-token crawlers, swap the Normal token for your JS token in every example below.
Stats
Summary across all your crawlers — concurrency, queue depth, completed/failed counts, and a history breakdown.
```
# All-time summary
curl 'https://api.crawlbase.com/crawler/YOUR_TOKEN/stats'

# Same, filtered to a date range (YYYY-MM-DD bounds, inclusive)
curl 'https://api.crawlbase.com/crawler/YOUR_TOKEN/stats?history_from=2026-04-01&history_to=2026-04-30'
```

Purge a crawler
Empties the crawler's queue immediately — every still-pending URL is dropped. Use this to recover from a runaway producer or to clear a batch you no longer want to process. There's no undo.
```
curl -X POST 'https://api.crawlbase.com/crawler/YOUR_TOKEN/product-monitor/purge'
```

Every queued URL in that crawler is dropped — there's no soft-delete or recovery. If you only need to drop a single URL, use Delete Job instead.
Delete a single job
Drop one URL from the queue by its RID — the request ID returned when you pushed the URL.
```
curl -X POST 'https://api.crawlbase.com/crawler/YOUR_TOKEN/product-monitor/delete_job?rid=YOUR_RID'
```

Find a job by RID
Look up where a request stands. Returns QUEUED with the queued metadata if it's still pending, or NOT_QUEUED if it's already crawled (or never made it onto the queue).
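A polling loop over this endpoint might look like the sketch below. `wait_for`, the 30-second interval, and the attempt cap are our choices, not API requirements; only the URL shape and the two status strings come from the docs:

```python
import json
import time
import urllib.request

def is_done(response: dict) -> bool:
    # NOT_QUEUED means already crawled (or never made it onto the queue)
    return response.get("status") == "NOT_QUEUED"

def wait_for(token: str, crawler: str, rid: str,
             every: float = 30.0, attempts: int = 20) -> bool:
    """Poll find_by_rid until the job leaves the queue, or give up."""
    url = f"https://api.crawlbase.com/crawler/{token}/{crawler}/find_by_rid/{rid}"
    for _ in range(attempts):
        with urllib.request.urlopen(url) as resp:
            if is_done(json.loads(resp.read())):
                return True
        time.sleep(every)
    return False
```

For more than a handful of RIDs, prefer Stats counters or a scheduled `/storage/bulk` drain over per-RID polling.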
```
curl 'https://api.crawlbase.com/crawler/YOUR_TOKEN/product-monitor/find_by_rid/YOUR_RID'
```

```json
{
  "status": "QUEUED",
  "request_info": {
    "rid": "YOUR_RID",
    "url": "YOUR_URL",
    "retry": 3,
    "created_at": 1600494969.189415
  }
}
```

```json
{
  "status": "NOT_QUEUED",
  "request_info": {
    "rid": "YOUR_RID"
  }
}
```

Pause and unpause
Stop a crawler from picking up new work without losing its queue. Pushed URLs continue to enqueue, but the crawler stops processing them until you unpause. Useful for maintenance windows or backing off when a downstream system is unhealthy.
```
# Pause — stops the crawler picking up new work
curl -X POST 'https://api.crawlbase.com/crawler/YOUR_TOKEN/product-monitor/pause'

# Unpause — resumes processing
curl -X POST 'https://api.crawlbase.com/crawler/YOUR_TOKEN/product-monitor/unpause'
```

Parameters
Push requests accept a few parameters beyond `url` and `crawler`:

- `callback_headers` — extra headers in `name|value|name|value` format. Useful for passing IDs back to your handler. Webhook-mode crawlers only — ignored when the crawler delivers to Storage.
- A queue expiry, in minutes — `1`–`10080` (1 minute to 7 days). Once a worker starts the crawl, this timer no longer applies. If the request expires while waiting, you get a callback with HTTP 504 and `pc_status=699`. Omit (or set to `0`) to disable. Aggressive values raise the failure rate — pick a value that reflects how long the result is actually useful to you.
- Crawling API parameters — `page_wait`, `scroll`, `country`, `scraper`, etc. pass through and are applied to each crawl.

When to use Crawler vs the API
- Crawler: any time you have more than a few hundred URLs to process, especially across long time horizons. The queue handles retries, scheduling, and concurrency.
- Direct Crawling API: when you need the result inline — page rendering for a user-facing request, AI agent fetching context, etc.

