PHP
Official PHP client for the Crawlbase platform. PSR-compatible, Composer-installable, works with PHP 7.4+ — same package, every API, sensible defaults.
How the SDK is shaped
The PHP SDK is a thin wrapper around the same HTTP API documented in API Reference. Every Crawling API parameter you'd append as a query string in a raw HTTP call is reachable from the SDK as a key in the options array — names, defaults, and behavior all map one-to-one. There is no parameter the SDK adds; there is no parameter it hides.
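Concretely, a raw call like https://api.crawlbase.com/?token=TOKEN&url=URL&country=DE&page_wait=1000 maps to the options array below (the values are illustrative; the keys are the same documented query-parameter names):

$api = new \Crawlbase\CrawlingAPI(['token' => 'YOUR_JS_TOKEN']);
// Each key mirrors the query-string parameter of the raw HTTP call:
$res = $api->get('https://example.com', [
    'country' => 'DE',    // &country=DE
    'page_wait' => 1000,  // &page_wait=1000 (JavaScript token required)
]);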
What you get for using it instead of cURL or Guzzle directly:
- URL encoding, parameter validation, and response parsing handled out of the box.
- PSR-4 autoloading — drop into any modern PHP framework (Laravel, Symfony, Slim) without ceremony.
- A single client class per Crawlbase API, all sharing the same constructor / call shape.
- Sensible defaults (90-second timeout, automatic JSON parsing of format=json responses, UTF-8-encoded bodies).
Source on github.com/crawlbase/crawlbase-php. Issues + PRs welcome.
Install
Latest version on Packagist. Requires PHP 7.4+; tested through PHP 8.3.
composer require crawlbase/crawlbase
# Or add to composer.json directly:
# "crawlbase/crawlbase": "^1.0"Authentication
Every Crawlbase API authenticates with the same token model. Two token types live on a single account:
- Normal Token (TCP) — for static HTML, JSON endpoints, anything that doesn't need a browser. Faster + cheaper.
- JavaScript Token — for SPAs, lazy-loaded feeds, anything that hides content behind client-side rendering. Required to use page_wait, ajax_wait, scroll, and css_click_selector.
Use environment variables (or your framework's config — Laravel config(), Symfony parameters) in production. The SDK doesn't read env vars itself — that's deliberate so you stay in control of where credentials come from. Pattern:
<?php
require 'vendor/autoload.php';
use Crawlbase\CrawlingAPI;
// Pick the right token at instantiation; the SDK doesn't switch
// tokens per-call, so keep two clients if you alternate.
$api = new CrawlingAPI(['token' => getenv('CRAWLBASE_TOKEN')]);
$js = new CrawlingAPI(['token' => getenv('CRAWLBASE_JS_TOKEN')]);
$api->get('https://github.com/anthropic');
$js->get('https://feed.example.com', ['page_wait' => 2000]);

Full token model + dashboard locations on the Authentication page.
Quickstart
Three lines from autoload to crawled HTML:
<?php
require 'vendor/autoload.php';
$api = new \Crawlbase\CrawlingAPI(['token' => 'YOUR_TOKEN']);
$res = $api->get('https://github.com/anthropic');
if ($res->statusCode === 200) {
    echo $res->body;
}

Branch on ->statusCode (the HTTP status of the request to Crawlbase itself) and ->headers->pc_status (the Crawlbase verdict — see Errors below) when deciding whether to retry. Pass ['format' => 'json'] to receive a JSON envelope instead of raw page content.
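A minimal sketch of that JSON path, assuming the documented envelope fields (original_status, pc_status, body):

$res = $api->get('https://github.com/anthropic', ['format' => 'json']);
$envelope = json_decode($res->body, true);
// The verdicts that otherwise travel as headers sit inside the envelope:
if ((int) $envelope['pc_status'] === 200) {
    echo $envelope['body']; // the crawled page itself
}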
All APIs in one package
Every Crawlbase API has a matching client class. Same constructor, same get / post verbs.
<?php
use Crawlbase\{CrawlingAPI, ScraperAPI, LeadsAPI, ScreenshotsAPI, StorageAPI};
$token = ['token' => 'YOUR_TOKEN'];
$crawl = new CrawlingAPI($token); // general-purpose page fetch
$scraper = new ScraperAPI($token); // parsed JSON for supported sites
$leads = new LeadsAPI($token); // domain-scoped email extraction (legacy)
$shots = new ScreenshotsAPI($token); // screenshots of any URL
$storage = new StorageAPI($token); // Cloud Storage CRUD
// Push high-volume async jobs to the Enterprise Crawler via the Crawling API:
// $api->get($url, ['async' => true, 'callback' => '...', 'crawler' => 'YourCrawler']).
// See /docs/crawler for the queue workflow.

Common patterns
JavaScript rendering
For SPAs, lazy-loaded feeds, and pages where the initial HTML is empty, instantiate with the JavaScript token and pass any combination of page_wait, ajax_wait, scroll, and css_click_selector. A useful order of operations: a fixed wait first, then network-idle, then a scroll for lazy-loaded content, then a click for any gating UI element.
$api = new \Crawlbase\CrawlingAPI(['token' => 'YOUR_JS_TOKEN']);
$res = $api->get('https://spa.example.com', [
    'page_wait' => 2000,
    'ajax_wait' => true,
    'scroll' => true,
]);
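When content hides behind a gating element (a cookie wall, a "Load more" button), add css_click_selector. The selector below is hypothetical:

$res = $api->get('https://spa.example.com/feed', [
    'page_wait' => 1000,
    'css_click_selector' => 'button.load-more', // hypothetical selector for the gating button
]);

Use a built-in scraper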
Skip the parser entirely on supported sites. Pass 'scraper' => 'NAME' and the response body becomes a JSON string with the structured fields documented on the per-scraper page.
<?php
use Crawlbase\ScraperAPI;
$api = new ScraperAPI(['token' => 'YOUR_TOKEN']);
$res = $api->get('https://www.amazon.com/dp/B08N5WRWNW',
    ['scraper' => 'amazon-product-details']);
$data = json_decode($res->body, true);
echo $data['name'] . ' - ' . $data['price'];

Geo-routing
Pass 'country' => 'ISO' to route the crawl through that country's exit nodes. Use it any time the target serves localized content based on IP.
$api = new \Crawlbase\CrawlingAPI(['token' => 'YOUR_TOKEN']);
// Fetch the listing as Amazon serves it to a German residential IP
$res = $api->get('https://www.amazon.com/dp/B08N5WRWNW', ['country' => 'DE']);

Retry with backoff
The recommended retry shape: exponential backoff capped at 3-5 attempts, retry on transient errors only (5xx or empty body), don't retry on 4xx.
<?php
use Crawlbase\CrawlingAPI;
function crawl(CrawlingAPI $api, string $url, int $attempts = 5) {
    for ($i = 0; $i < $attempts; $i++) {
        $res = $api->get($url);
        if ($res->statusCode === 200 && (int) $res->headers->pc_status === 200) {
            return $res;
        }
        if ($res->statusCode >= 400 && $res->statusCode < 500) {
            // Client errors won't heal on retry.
            throw new RuntimeException("client error {$res->statusCode}: $url");
        }
        // Full jitter: sleep a random interval between 0 and 2^i seconds.
        usleep((int) (mt_rand() / mt_getrandmax() * pow(2, $i) * 1_000_000));
    }
    throw new RuntimeException("Failed: $url");
}

Async crawls + webhooks
Fire-and-forget mode. The SDK call returns immediately with an rid; Crawlbase POSTs the result to your callback URL when the page is ready. Useful for batch jobs and slow targets.
$api = new \Crawlbase\CrawlingAPI(['token' => 'YOUR_TOKEN']);
$res = $api->get('https://example.com', [
    'async' => true,
    'callback' => 'https://your-app.com/webhook',
]);
// With async, the immediate response body is a small JSON document carrying the request id:
$rid = json_decode($res->body, true)['rid']; // correlate the eventual webhook delivery
// Your Laravel / Symfony / Slim webhook receives a POST with:
// { rid, url, original_status, pc_status, body }

For very high volumes (millions of URLs), use the Enterprise Crawler, which sits in front of this same async pipeline.
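On the receiving side, a minimal plain-PHP sketch of that webhook endpoint, assuming the payload shape above; store_result() stands in for your own persistence hook:

<?php
// webhook.php: registered as the 'callback' URL
$payload = json_decode(file_get_contents('php://input'), true);
if (!isset($payload['rid'])) {
    http_response_code(400); // not a delivery we can correlate
    exit;
}
store_result($payload['rid'], $payload['body']); // persist, keyed by rid
http_response_code(200); // ack fast; do heavy processing out of band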
Sticky sessions
Some flows need the same residential IP across multiple calls. Pass cookies_session with a stable identifier and Crawlbase reuses the same exit node for ~30 minutes.
$api = new \Crawlbase\CrawlingAPI(['token' => 'YOUR_JS_TOKEN']);
$session = "checkout-{$userId}";
$api->get('https://shop.example.com/cart', ['cookies_session' => $session]);
$api->get('https://shop.example.com/checkout', ['cookies_session' => $session]);
$api->get('https://shop.example.com/confirm', ['cookies_session' => $session]);

Errors & retries
The platform surfaces two status codes on every response: the SDK's own ->statusCode (HTTP status of the request to Crawlbase itself) and ->headers->pc_status (Crawlbase's verdict on the target — see the Crawling API errors table for the full list). Always branch on ->headers->pc_status when deciding whether to retry — a target can return 200 with empty body, in which case ->statusCode is 200 but ->headers->pc_status is 520.
$res = $api->get($url);
$pc = (int) $res->headers->pc_status;
// use_body(), retry_with_js_token(), schedule_retry(), and $logger are your own application hooks.
switch (true) {
    case $pc === 200:
        use_body($res->body);
        break;
    case in_array($pc, [520, 525], true):
        // 520 = empty body, 525 = anti-bot couldn't be solved.
        // Switch to JS token and retry.
        retry_with_js_token($url);
        break;
    case in_array($pc, [521, 522, 523], true):
        // Target unreachable or timed out. Retry with backoff.
        schedule_retry($url);
        break;
    default:
        $logger->error('crawl failed', ['url' => $url, 'pc_status' => $pc]);
}

All retries against the platform are free — only successful responses (pc_status: 200) count against your quota.
Performance & best practices
- Reuse a single client per token. Build it once at app boot (Laravel service provider, Symfony service container) and inject everywhere — each instance opens its own connection. See the provider sketch after this list.
- Use the cheapest token that works. Don't default to the JavaScript token "just in case" — Normal-token requests are faster and use less concurrency. Promote to JS only when the Normal response is empty or anti-bot-blocked.
- Prefer ajax_wait over page_wait. Fixed delays burn concurrency on every request, even fast ones.
- For batch jobs: async + webhook, or push to the Enterprise Crawler. Queue workers calling the SDK synchronously will saturate your concurrency cap; async + webhook releases the slot the moment a request is queued.
- Watch the remaining response header. It carries the number of concurrency slots you have left.
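A minimal sketch of the boot-once pattern as a Laravel service provider. The config keys (services.crawlbase.token, services.crawlbase.js_token) are illustrative; wire them to the env vars from the Authentication section:

<?php
namespace App\Providers;

use Crawlbase\CrawlingAPI;
use Illuminate\Support\ServiceProvider;

class CrawlbaseServiceProvider extends ServiceProvider
{
    public function register(): void
    {
        // One client per token, built once and injected everywhere.
        $this->app->singleton('crawlbase.normal', fn () =>
            new CrawlingAPI(['token' => config('services.crawlbase.token')]));
        $this->app->singleton('crawlbase.js', fn () =>
            new CrawlingAPI(['token' => config('services.crawlbase.js_token')]));
    }
}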
Method reference
All client classes share the same surface. Constructor takes an options array; verbs mirror the underlying HTTP methods.
Constructor options:
- 'token' — required; your Normal or JavaScript token.
- 'timeout' — in seconds (default 90).

Verb arguments:
- $options maps any Crawling API parameter to its value.
- $data is the body — pass an array for form-encoded, a string for raw.

Response shape — public properties on the response object returned from each verb:
- ->statusCode — HTTP status of the request to Crawlbase itself.
- ->body — the returned content (a JSON string when format=json / scraper= was used).
- ->headers->pc_status — Crawlbase verdict on the target (branch on this for retry decisions).
- ->headers->original_status — HTTP status the target site returned to Crawlbase.
- ->headers->storage_url / ->headers->rid — set when the call carried 'store' => true.
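A short sketch of the post verb under the shape above; the target URL is illustrative, and the exact signature is worth checking against the repo if in doubt:

<?php
use Crawlbase\CrawlingAPI;

$api = new CrawlingAPI(['token' => 'YOUR_TOKEN', 'timeout' => 120]);

// Array $data goes form-encoded; a string would be sent raw.
// httpbin.org/post is an illustrative target that echoes what it receives.
$res = $api->post('https://httpbin.org/post', ['query' => 'crawlbase']);
if ($res->statusCode === 200 && (int) $res->headers->pc_status === 200) {
    echo $res->body;
}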

