
How the SDK is shaped

The Node SDK is a thin wrapper around the same HTTP API documented in API Reference. Every Crawling API parameter you'd append as a query string in a raw HTTP call is reachable from the SDK as a key in the options object — names, defaults, and behavior all map one-to-one. There is no parameter the SDK adds; there is no parameter it hides.
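
For instance, page_wait travels identically both ways; the SDK just assembles the query string for you (endpoint shape per the API Reference):

// Raw HTTP:
//   GET https://api.crawlbase.com/?token=YOUR_TOKEN&url=https%3A%2F%2Fexample.com&page_wait=2000

// SDK: same parameter name, same semantics
import { CrawlingAPI } from 'crawlbase';

const api = new CrawlingAPI({ token: 'YOUR_TOKEN' });
const res = await api.get('https://example.com', { page_wait: 2000 });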

What you get for using it instead of fetch / axios directly:

  • URL encoding, parameter validation, and response parsing handled out of the box — code reads like product code, not HTTP plumbing.
  • Both ESM (import { CrawlingAPI } from 'crawlbase') and CommonJS (const { CrawlingAPI } = require('crawlbase')) supported.
  • A single client class per Crawlbase API, all sharing the same constructor / call shape.
  • Sensible defaults (90-second timeout, automatic JSON parsing of format=json responses, UTF-8 decoding) that match what most teams configure by hand on their first integration.

The SDK is open source, MIT-licensed, and accepts community PRs at github.com/crawlbase/crawlbase-node.

Install

Latest version on npm. Works with Node.js 16+; install with any major package manager.

npm install crawlbase

# Or via pnpm / yarn / bun
pnpm add crawlbase
yarn add crawlbase
bun add crawlbase

Source on GitHub. Issues + PRs welcome.

Authentication

Every Crawlbase API authenticates with the same token model. Two token types live on a single account:

  • Normal Token (TCP) — for static HTML, JSON endpoints, anything that doesn't need a browser. Faster + cheaper.
  • JavaScript Token — for SPAs, lazy-loaded feeds, anything that hides content behind client-side rendering. Required to use page_wait, ajax_wait, scroll, and css_click_selector.

Use environment variables in production. The SDK doesn't read env vars itself — that's deliberate so you stay in control of where credentials come from — but the idiomatic pattern is:

import { CrawlingAPI } from 'crawlbase';

// Pick the right token at instantiation; the SDK doesn't switch
// tokens per-call, so keep two clients if you alternate.
const api = new CrawlingAPI({ token: process.env.CRAWLBASE_TOKEN });
const js  = new CrawlingAPI({ token: process.env.CRAWLBASE_JS_TOKEN });

await api.get('https://github.com/anthropic');
await js.get('https://feed.example.com', { page_wait: 2000 });

Full token model + dashboard locations on the Authentication page.

Quickstart

Three lines from import to crawled HTML. Both ESM and CommonJS work:

// ESM
import { CrawlingAPI } from 'crawlbase';

const api = new CrawlingAPI({ token: 'YOUR_TOKEN' });
const res = await api.get('https://github.com/anthropic');

if (res.statusCode === 200) {
  console.log(res.body);
}

// CommonJS — same shape
// const { CrawlingAPI } = require('crawlbase');

Branch on response.statusCode (the HTTP status of the SDK's request to Crawlbase) and response.headers.pc_status (the Crawlbase verdict on the target — see Errors below) when deciding whether to retry. Pass { format: 'json' } to receive a JSON envelope instead of raw page content.
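
A sketch of that flow (the envelope fields beyond the two status codes are covered in the response-shape reference at the bottom of this page):

const res = await api.get('https://github.com/anthropic', { format: 'json' });

if (res.statusCode === 200 && Number(res.headers.pc_status) === 200) {
  // res.json is the pre-parsed envelope when the Content-Type is JSON;
  // fall back to parsing the body string otherwise.
  const data = res.json ?? JSON.parse(res.body);
  console.log(data);
}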

All APIs in one package

Every Crawlbase API has a matching client class. Same constructor, same get / post verbs.

import {
  CrawlingAPI,    // general-purpose page fetch
  ScraperAPI,     // parsed JSON for supported sites
  LeadsAPI,       // domain-scoped email extraction (legacy)
  ScreenshotsAPI, // screenshots of any URL
} from 'crawlbase';

const token = { token: 'YOUR_TOKEN' };

const crawl   = new CrawlingAPI(token);
const scraper = new ScraperAPI(token);
const leads   = new LeadsAPI(token);
const shots   = new ScreenshotsAPI(token);

// Push high-volume async jobs to the Enterprise Crawler via the
// Crawling API: api.get(url, { async: true, callback: '...',
// crawler: 'YourCrawler' }). See /docs/crawler for the queue
// workflow.

Common patterns

JavaScript rendering

For SPAs, lazy-loaded feeds, and pages where the initial HTML is empty, instantiate with the JavaScript token and pass any combination of page_wait, ajax_wait, scroll, and css_click_selector. A useful order of escalation: a fixed wait first, then a network-idle wait, then scrolling for lazy-loaded content, then a click on any gating UI element.

const api = new CrawlingAPI({ token: 'YOUR_JS_TOKEN' });
const res = await api.get('https://spa.example.com', {
  page_wait: 2000,
  ajax_wait: true,
  scroll: true,
});
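
If the content sits behind a gating element (a cookie banner, a "load more" button), add css_click_selector. The selector below is hypothetical; use one that matches your target page:

const gated = await api.get('https://spa.example.com/feed', {
  page_wait: 1000,
  ajax_wait: true,
  scroll: true,
  css_click_selector: 'button.load-more',  // hypothetical selector
});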

Use a built-in scraper

Skip the parser entirely on supported sites. Pass scraper: 'NAME' and the response body becomes a JSON string with the structured fields documented on the per-scraper page.

import { ScraperAPI } from 'crawlbase';

const api = new ScraperAPI({ token: 'YOUR_TOKEN' });
const res = await api.get(
  'https://www.amazon.com/dp/B08N5WRWNW',
  { scraper: 'amazon-product-details' }
);
const data = JSON.parse(res.body);
console.log(data.name, data.price);

Geo-routing

Pass a two-letter ISO country code (e.g. country: 'DE') to route the crawl through that country's exit nodes. Use it any time the target serves localized content based on IP.

const api = new CrawlingAPI({ token: 'YOUR_TOKEN' });

// Hit the German Amazon catalog from a German residential IP
const res = await api.get(
  'https://www.amazon.com/dp/B08N5WRWNW',
  { country: 'DE' }
);

Retry with backoff

The recommended retry shape: exponential backoff capped at 3-5 attempts, retry on transient errors only (5xx or empty body), don't retry on 4xx.

const api = new CrawlingAPI({ token: 'YOUR_TOKEN' });
const sleep = ms => new Promise(r => setTimeout(r, ms));

async function crawl(url, attempts = 5) {
  for (let i = 0; i < attempts; i++) {
    const res = await api.get(url);
    if (res.statusCode === 200 && Number(res.headers.pc_status) === 200) {
      return res;
    }
    if (res.statusCode >= 400 && res.statusCode < 500) {
      throw new Error(`client error ${res.statusCode}: ${url}`);
    }
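    // Full-jitter exponential backoff: wait a random slice of 2^i seconds.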
    await sleep(Math.random() * (2 ** i) * 1000);
  }
  throw new Error(`Failed: ${url}`);
}

Async crawls + webhooks

Fire-and-forget mode. The SDK call resolves immediately with an rid; Crawlbase POSTs the result to your callback URL when the page is ready. Useful for batch jobs and slow targets.

const api = new CrawlingAPI({ token: 'YOUR_TOKEN' });
const res = await api.get('https://example.com', {
  async: true,
  callback: 'https://your-app.com/webhook',
});
const rid = res.headers.rid;  // correlate the eventual webhook delivery

// Your Express / Fastify / Hono webhook receives a POST with:
//   { rid, url, original_status, pc_status, body }
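
A minimal Express receiver for that payload (a sketch; processPage is a placeholder for your own handler, and the field names follow the comment above):

import express from 'express';

const app = express();
app.use(express.json());

app.post('/webhook', (req, res) => {
  const { rid, url, pc_status, body } = req.body;
  res.sendStatus(200);  // ack fast; process off the request path
  if (Number(pc_status) === 200) {
    processPage(rid, url, body);  // your handler
  } else {
    // failed crawl: requeue or log using the rid you stored at submit time
  }
});

app.listen(3000);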

For very high volumes (millions of URLs), use the Enterprise Crawler which sits in front of this same async pipeline with retries, rate management, and result delivery.

Sticky sessions

Some flows need the same residential IP across multiple calls — a checkout, a paginated search, a logged-in session. Pass cookies_session with a stable identifier and Crawlbase reuses the same exit node for ~30 minutes.

const api = new CrawlingAPI({ token: 'YOUR_JS_TOKEN' });

const session = `checkout-${userId}`;
await api.get('https://shop.example.com/cart',     { cookies_session: session });
await api.get('https://shop.example.com/checkout', { cookies_session: session });
await api.get('https://shop.example.com/confirm',  { cookies_session: session });

Errors & retries

The Crawlbase platform surfaces two status codes on every response: the SDK's own response.statusCode (HTTP status of the request to Crawlbase itself) and the pc_status response header (Crawlbase's verdict on the target — see the Crawling API errors table for the full list). The Node SDK exposes response headers as a plain object on response.headers, so the verdict reads as response.headers.pc_status. Always branch on that when deciding whether to retry — a target can return 200 with empty body, in which case response.statusCode is 200 but response.headers.pc_status is 520.

const res = await api.get(url);
const pc = Number(res.headers.pc_status);

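// use(), retryWithJsToken(), scheduleRetry(), and logger stand in for
// your own handlers.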
if (pc === 200) {
  use(res.body);
} else if (pc === 520 || pc === 525) {
  // 520 = empty body, 525 = anti-bot couldn't be solved.
  // Switch to JS token and retry.
  await retryWithJsToken(url);
} else if ([521, 522, 523].includes(pc)) {
  // Target unreachable or timed out. Retry with backoff.
  scheduleRetry(url);
} else {
  logger.error('crawl failed', { url, pc_status: pc });
}

All retries against the platform are free — only successful responses (pc_status: 200) count against your quota. That makes aggressive retrying cheap; the only real cost is added latency.

Performance & best practices

  • Reuse a single client per token. The constructor is cheap but each instance opens its own underlying connection pool. Build it once at module scope, share it across calls.
  • Use the cheapest token that works. Don't default to the JavaScript token "just in case" — Normal-token requests are faster and use less concurrency. Promote to JS only when the Normal response is empty or anti-bot-blocked.
  • Prefer ajax_wait over page_wait. Fixed delays burn concurrency on every request, even fast ones.
  • For batch jobs: async + webhook, or push to the Enterprise Crawler. Synchronous mode is the right default for ad-hoc and interactive use; for sustained high-volume submission switch to async so your concurrency slot frees up the moment a request is queued.
  • Watch the remaining response header. It carries the number of concurrency slots you have left — a healthy client backs off proactively before hitting the cap (see the sketch after this list).
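
A sketch of that proactive backoff, assuming the header surfaces as a plain number on response.headers.remaining:

const res = await api.get(url);

// Illustrative threshold and delay; tune them to your plan's concurrency cap.
const remaining = Number(res.headers.remaining);
if (Number.isFinite(remaining) && remaining < 5) {
  await new Promise(resolve => setTimeout(resolve, 1000));
}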

Method reference

All client classes share the same surface. Constructor takes a single options object; verbs mirror the underlying HTTP methods. Every method returns a Promise.

new CrawlingAPI({ token, timeout })
constructor
Initialize a client with your token. Optional: timeout in milliseconds (default 90000).
.get(url, options?)
method
Send a GET. options maps any Crawling API parameter to its value.
.post(url, data, options?)
method
Send a POST. data is the body — pass an object for form-encoded, a string for raw.
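
A quick illustration of the two data shapes (httpbin.org as a stand-in target):

// Object data is form-encoded; string data is sent raw.
await api.post('https://httpbin.org/post', { q: 'laptops', page: '2' });
await api.post('https://httpbin.org/post', JSON.stringify({ q: 'laptops' }));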

Response shape (object, all properties present even when empty):

response.statusCode
number
HTTP status of the SDK's request to Crawlbase.
response.body
string
Page content (or JSON string when format=json / scraper= was used). UTF-8 decoded by default.
response.url
string
Final URL after target-side redirects.
response.headers
object
Lower-cased response headers. Crawlbase-specific status fields are exposed here:
  • response.headers.pc_status — Crawlbase verdict on the target (branch on this for retry decisions).
  • response.headers.original_status — HTTP status the target site returned to Crawlbase.
  • response.headers.rid — Request ID (when the call carried async: true or store: true).
response.json
object | undefined
Pre-parsed JSON when the response Content-Type is JSON. Parsed once by the SDK so you don't have to.