Error Handling
Crawling at scale means errors happen. Build for them up front and you'll spend your time shipping features, not babysitting retries.
Three classes of error
Every Crawlbase error falls into one of three buckets, and each needs a different response.
- 429, 500, 503, 522, 599: always retry with backoff.
- 404, 410, 451: don't retry — the page genuinely doesn't exist or isn't accessible. Mark the URL as failed and move on.
- 401, 402, 403, 422: retrying won't help — fix the request, the token, or the account.

Production retry pattern
The pattern that holds up under load: exponential backoff with full jitter, capped attempts, and a dead-letter destination for terminal failures.
```python
import logging
import random
import time

from crawlbase import CrawlingAPI

api = CrawlingAPI({'token': 'YOUR_TOKEN'})
log = logging.getLogger('crawler')

TRANSIENT = {429, 500, 503, 522, 599}            # retry with backoff
TERMINAL = {401, 402, 403, 404, 410, 422, 451}   # never retry


class PermanentFailure(Exception):
    """Raised for statuses that will never succeed on retry."""
    pass


def crawl(url, max_attempts=5, base=0.5, cap=30):
    for attempt in range(max_attempts):
        res = api.get(url)
        status = res['status_code']
        pc_status = res['pc_status']
        if status == 200 and pc_status == 200:
            return res
        if status in TERMINAL or pc_status in TERMINAL:
            log.warning(f'Terminal error {status}/{pc_status} for {url}')
            raise PermanentFailure(url, status)
        # Transient — sleep with full jitter, then retry
        wait = min(cap, base * (2 ** attempt))
        wait = random.uniform(0, wait)
        log.info(f'Attempt {attempt + 1} got {status}; sleeping {wait:.2f}s')
        time.sleep(wait)
    raise RuntimeError(f'Exhausted retries for {url}')
```

The same pattern in JavaScript:

```javascript
const { CrawlingAPI } = require('crawlbase');

const api = new CrawlingAPI({ token: process.env.CRAWLBASE_TOKEN });

const TRANSIENT = new Set([429, 500, 503, 522, 599]);            // retry with backoff
const TERMINAL = new Set([401, 402, 403, 404, 410, 422, 451]);   // never retry

async function crawl(url, { maxAttempts = 5, base = 500, cap = 30000 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await api.get(url);
    const status = res.statusCode;
    if (status === 200 && res.pcStatus === 200) return res;
    if (TERMINAL.has(status) || TERMINAL.has(res.pcStatus)) {
      throw new Error(`Permanent failure ${status} for ${url}`);
    }
    // Transient: sleep with full jitter, then retry
    const wait = Math.random() * Math.min(cap, base * 2 ** attempt);
    await new Promise(r => setTimeout(r, wait));
  }
  throw new Error(`Exhausted retries for ${url}`);
}
```

Dead-letter queue
When retries are exhausted, don't drop the URL silently. Push it somewhere a human can review it.
- For Crawler API users: failures are automatically retried up to your configured count, then delivered to your webhook with the failure metadata. No DLQ to build.
- For direct API users: on terminal failure, write the URL + status + last response body to a separate queue or table. Review weekly.
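That queue or table doesn't need to be fancy. Here is a minimal sketch for direct API users, assuming the `crawl` and `PermanentFailure` definitions from the retry pattern above; the SQLite file, the table schema, and the `crawl_or_park` wrapper are illustrative, not part of the Crawlbase SDK:

```python
import sqlite3

dlq = sqlite3.connect('dead_letter.db')
dlq.execute("""
    CREATE TABLE IF NOT EXISTS dead_letter (
        url       TEXT PRIMARY KEY,
        status    INTEGER,
        detail    TEXT,
        failed_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

def crawl_or_park(url):
    """Run the retry loop; park terminal or exhausted URLs for weekly review."""
    try:
        return crawl(url)  # retry loop from the section above
    except PermanentFailure as exc:
        _, status = exc.args                 # PermanentFailure(url, status)
        row = (url, status, 'terminal')
    except RuntimeError:
        row = (url, None, 'retries exhausted')
    dlq.execute(
        'INSERT OR REPLACE INTO dead_letter (url, status, detail) VALUES (?, ?, ?)',
        row,
    )
    dlq.commit()
    return None
```

Swap SQLite for whatever queue or table your review tooling already reads, and add a column for the last response body if you want it on hand during review.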
Cap retries at 5 or so. A URL that fails 5 times in a row almost certainly will fail 50 times. Save the cycles for new work.
What to monitor
The four signals every Crawlbase-using system should chart:
| Signal | Where it comes from | Alert when |
|---|---|---|
| Success rate | pc_status == 200 / total | < 95% sustained for 10 min |
| P95 latency | request duration | > 15s sustained |
| 429 rate | HTTP status histogram | > 5% sustained — slow down or request a higher concurrency limit |
| Retry count distribution | your retry loop | P95 > 2 — something's degrading upstream |
Tag every metric with the target domain so you can spot when a single site is poisoning your overall numbers.
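If you don't have a metrics stack yet, per-domain counters inside the crawl process are enough to start. A sketch, assuming the same `status`/`pc_status` values the retry loop sees; `record` and `success_rate` are illustrative names, and in production you'd push the same domain tags to StatsD, Prometheus, or whatever you already run:

```python
from collections import defaultdict
from urllib.parse import urlparse

# Outcome counters keyed by target domain: an in-process stand-in
# for a real metrics client.
counts = defaultdict(lambda: {'ok': 0, 'fail': 0, 'status_429': 0})

def record(url, status, pc_status):
    domain = urlparse(url).netloc
    c = counts[domain]
    if status == 200 and pc_status == 200:
        c['ok'] += 1
    else:
        c['fail'] += 1
    if status == 429:
        c['status_429'] += 1

def success_rate(domain):
    c = counts[domain]
    total = c['ok'] + c['fail']
    return c['ok'] / total if total else 1.0
```

Call `record` on every response, including the ones your retry loop eventually recovers, or the success rate will look better than it really is.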
Making retries safe
Crawlbase requests are inherently idempotent — a GET on the same URL with the same token returns the same kind of result every time. You can retry freely without worrying about duplicate side effects.
Two notes:
- Async + store: if you use `&async=true&store=true`, each retry consumes a credit and creates a new `rid`. Dedupe on your end if needed.
- Webhooks: Crawler API webhooks may be delivered more than once on failure. Make your webhook handler idempotent on `rid`; a sketch follows below.
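A minimal sketch of that handler, assuming Flask and that the delivery's `rid` is readable from a request header (confirm the exact field against your own Crawler webhook payloads); `process_page` and the in-memory `seen_rids` set are placeholders for your parsing logic and for a durable store such as Redis or a unique database index:

```python
from flask import Flask, request

app = Flask(__name__)
seen_rids = set()  # placeholder: use Redis SETNX or a unique DB index in production

@app.route('/crawlbase-webhook', methods=['POST'])
def handle_delivery():
    rid = request.headers.get('rid', '')
    if not rid or rid in seen_rids:
        # Duplicate or malformed delivery: acknowledge so it isn't re-sent, do nothing.
        return '', 200
    seen_rids.add(rid)
    process_page(rid, request.get_data())  # your parsing/storage logic (hypothetical)
    return '', 200
```

The important property is that replaying the same `rid` changes nothing; everything else about the handler is up to you.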

