Error Handling
Crawling at scale means errors happen. Build for them up front and you'll spend your time shipping features, not babysitting retries.
Three classes of error
Every Crawlbase error falls into one of three buckets, and each needs a different response.
- 429, 500, 503, 522, 599: always retry with backoff.
- 404, 410, 451: don't retry — the page genuinely doesn't exist or isn't accessible. Mark the URL as failed and move on.
- 401, 402, 403, 422: retrying won't help — fix the request, the token, or the account.

Production retry pattern
The pattern that holds up under load: exponential backoff with full jitter, capped attempts, and a dead-letter destination for terminal failures.
```python
import logging
import random
import time

from crawlbase import CrawlingAPI

api = CrawlingAPI({'token': 'YOUR_TOKEN'})
log = logging.getLogger('crawler')

TRANSIENT = {429, 500, 503, 522, 599}            # retry with backoff
TERMINAL = {401, 402, 403, 404, 410, 422, 451}   # never retry


class PermanentFailure(Exception):
    """Raised for statuses that will never succeed on retry."""
    pass


def crawl(url, max_attempts=5, base=0.5, cap=30):
    for attempt in range(max_attempts):
        res = api.get(url)
        status = res['status_code']
        pc_status = res['pc_status']
        if status == 200 and pc_status == 200:
            return res
        if status in TERMINAL or pc_status in TERMINAL:
            log.warning(f'Terminal error {status}/{pc_status} for {url}')
            raise PermanentFailure(url, status)
        # Transient — sleep with full jitter, then retry
        wait = min(cap, base * (2 ** attempt))
        wait = random.uniform(0, wait)
        log.info(f'Attempt {attempt + 1} got {status}; sleeping {wait:.2f}s')
        time.sleep(wait)
    raise RuntimeError(f'Exhausted retries for {url}')
```

The same pattern in JavaScript:

```javascript
const { CrawlingAPI } = require('crawlbase');

const api = new CrawlingAPI({ token: process.env.CRAWLBASE_TOKEN });

const TRANSIENT = new Set([429, 500, 503, 522, 599]);            // retry with backoff
const TERMINAL = new Set([401, 402, 403, 404, 410, 422, 451]);   // never retry

async function crawl(url, { maxAttempts = 5, base = 500, cap = 30000 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await api.get(url);
    const status = res.statusCode;
    if (status === 200 && res.pcStatus === 200) return res;
    if (TERMINAL.has(status) || TERMINAL.has(res.pcStatus)) {
      throw new Error(`Permanent failure ${status} for ${url}`);
    }
    // Transient: sleep with full jitter, then retry
    const wait = Math.random() * Math.min(cap, base * 2 ** attempt);
    await new Promise(r => setTimeout(r, wait));
  }
  throw new Error(`Exhausted retries for ${url}`);
}
```

Dead-letter queue
When retries are exhausted, don't drop the URL silently. Push it somewhere a human can review it.
- For Crawler API users: failures are automatically retried up to your configured count, then delivered to your webhook with the failure metadata. No DLQ to build.
- For direct API users: on terminal failure, write the URL + status + last response body to a separate queue or table. Review weekly.
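That queue or table doesn't need to be fancy. Here is a minimal sketch for direct API users, assuming the `crawl` and `PermanentFailure` definitions from the retry pattern above; the SQLite file, the table schema, and the `crawl_or_park` wrapper are illustrative, not part of the Crawlbase SDK:

```python
import sqlite3

dlq = sqlite3.connect('dead_letter.db')
dlq.execute("""
    CREATE TABLE IF NOT EXISTS dead_letter (
        url       TEXT PRIMARY KEY,
        status    INTEGER,
        detail    TEXT,
        failed_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

def crawl_or_park(url):
    """Run the retry loop; park terminal or exhausted URLs for weekly review."""
    try:
        return crawl(url)  # retry loop from the section above
    except PermanentFailure as exc:
        _, status = exc.args                 # PermanentFailure(url, status)
        row = (url, status, 'terminal')
    except RuntimeError:
        row = (url, None, 'retries exhausted')
    dlq.execute(
        'INSERT OR REPLACE INTO dead_letter (url, status, detail) VALUES (?, ?, ?)',
        row,
    )
    dlq.commit()
    return None
```

Swap SQLite for whatever queue or table your review tooling already reads, and add a column for the last response body if you want it on hand during review.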
Cap retries at 5 or so. A URL that fails 5 times in a row almost certainly will fail 50 times. Save the cycles for new work.
What to monitor
The four signals every Crawlbase-using system should chart:
| Signal | Where it comes from | Alert when |
|---|---|---|
| Success rate | pc_status == 200 / total | < 95% sustained for 10 min |
| P95 latency | request duration | > 15s sustained |
| 429 rate | HTTP status histogram | > 5% sustained — slow down or request a higher concurrency limit |
| Retry count distribution | your retry loop | P95 > 2 — something's degrading upstream |
Tag every metric with the target domain so you can spot when a single site is poisoning your overall numbers.
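If you don't have a metrics stack yet, per-domain counters inside the crawl process are enough to start. A sketch, assuming the same `status`/`pc_status` values the retry loop sees; `record` and `success_rate` are illustrative names, and in production you'd push the same domain tags to StatsD, Prometheus, or whatever you already run:

```python
from collections import defaultdict
from urllib.parse import urlparse

# Outcome counters keyed by target domain: an in-process stand-in
# for a real metrics client.
counts = defaultdict(lambda: {'ok': 0, 'fail': 0, 'status_429': 0})

def record(url, status, pc_status):
    domain = urlparse(url).netloc
    c = counts[domain]
    if status == 200 and pc_status == 200:
        c['ok'] += 1
    else:
        c['fail'] += 1
    if status == 429:
        c['status_429'] += 1

def success_rate(domain):
    c = counts[domain]
    total = c['ok'] + c['fail']
    return c['ok'] / total if total else 1.0
```

Call `record` on every response, including the ones your retry loop eventually recovers, or the success rate will look better than it really is.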
Making retries safe
Crawlbase requests are inherently idempotent — a GET on the same URL with the same token returns the same kind of result every time. You can retry freely without worrying about duplicate side effects.
Two notes:
- Async + store: if you use `&async=true&store=true`, each retry consumes a credit and creates a new `rid`. Dedupe on your end if needed.
- Webhooks: Crawler API webhooks may be delivered more than once on failure. Make your webhook handler idempotent on `rid`; a sketch follows below.
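A minimal sketch of that handler, assuming Flask and that the delivery's `rid` is readable from a request header (confirm the exact field against your own Crawler webhook payloads); `process_page` and the in-memory `seen_rids` set are placeholders for your parsing logic and for a durable store such as Redis or a unique database index:

```python
from flask import Flask, request

app = Flask(__name__)
seen_rids = set()  # placeholder: use Redis SETNX or a unique DB index in production

@app.route('/crawlbase-webhook', methods=['POST'])
def handle_delivery():
    rid = request.headers.get('rid', '')
    if not rid or rid in seen_rids:
        # Duplicate or malformed delivery: acknowledge so it isn't re-sent, do nothing.
        return '', 200
    seen_rids.add(rid)
    process_page(rid, request.get_data())  # your parsing/storage logic (hypothetical)
    return '', 200
```

The important property is that replaying the same `rid` changes nothing; everything else about the handler is up to you.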

