
How the SDK is shaped

The Java SDK is a thin wrapper around the same HTTP API documented in API Reference. Every Crawling API parameter you'd append as a query string in a raw HTTP call is reachable as a HashMap<String, Object> option — names, defaults, and behavior all map one-to-one.

One quirk worth knowing up front: the Java SDK exposes the response state on the API instance itself, not on a returned value object. Calls like api.get(url) return void; you read the result via api.getStatusCode(), api.getBody(), and so on. This differs from the Python / Node / Ruby / PHP SDKs, which return a response object; keep that in mind and the rest of the surface is straightforward.

What you get for using it instead of HttpClient / OkHttp directly:

  • URL encoding, parameter validation, and response parsing handled out of the box.
  • A single client class per Crawlbase API, all sharing the same constructor / call shape.
  • Idiomatic Java — runtime exceptions for transport failures (no checked exceptions to declare).
  • Sensible defaults (90-second timeout, automatic decoding of JSON / gzip responses).

Source on github.com/crawlbase/crawlbase-java.

Install

Latest version on Maven Central. Requires JDK 8+; tested through JDK 21.

<!-- pom.xml -->
<dependency>
  <groupId>com.crawlbase</groupId>
  <artifactId>crawlbase-java-sdk-pom</artifactId>
  <version>1.1</version>
</dependency>

<!-- Or build.gradle -->
implementation 'com.crawlbase:crawlbase-java-sdk-pom:1.1'

Authentication

Every Crawlbase API authenticates with the same token model. Two token types live on a single account:

  • Normal Token (TCP) — for static HTML, JSON endpoints, anything that doesn't need a browser. Faster + cheaper.
  • JavaScript Token — for SPAs, lazy-loaded feeds, anything that hides content behind client-side rendering. Required to use page_wait, ajax_wait, scroll, and css_click_selector.

Keep tokens in environment variables or your Spring config in production; the SDK reads neither on its own, so pass the token explicitly at construction. Pattern:

import java.util.*;
import com.crawlbase.*;

// Pick the right token at instantiation; the SDK doesn't switch
// tokens per-call, so keep two clients if you alternate.
API api = new API(System.getenv("CRAWLBASE_TOKEN"));
API js = new API(System.getenv("CRAWLBASE_JS_TOKEN"));

api.get("https://github.com/anthropic");

HashMap<String, Object> opts = new HashMap<>();
opts.put("page_wait", 2000);
js.get("https://feed.example.com", opts);

Full token model + dashboard locations on the Authentication page.

Quickstart

Three lines from import to crawled HTML. Note that response state lives on the API instance:

import com.crawlbase.*;

API api = new API("YOUR_TOKEN");
api.get("https://github.com/anthropic");

if (api.getStatusCode() == 200) {
    System.out.println(api.getBody());
}

Branch on api.getStatusCode() (the HTTP status of the SDK's request to Crawlbase) and api.getCrawlbaseStatus() (the Crawlbase verdict; see Errors below) when deciding whether to retry. Pass a HashMap with "format" → "json" to receive a JSON envelope instead of raw page content, as in the sketch below.
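
A minimal sketch of the JSON-envelope path, reusing the api client from above:

HashMap<String, Object> opts = new HashMap<>();
opts.put("format", "json");
api.get("https://github.com/anthropic", opts);

// getBody() is now a JSON string wrapping the page content
// alongside its URL and status fields.
System.out.println(api.getBody());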

All APIs in one artifact

Each Crawlbase product has a matching client class. Same constructor (single token string), same method shape.

import com.crawlbase.*;

String token = "YOUR_TOKEN";

API crawl = new API(token); // Crawling API: general-purpose page fetch
ScraperAPI scraper = new ScraperAPI(token); // parsed JSON for supported sites
LeadsAPI leads = new LeadsAPI(token); // domain-scoped email extraction (legacy)
ScreenshotsAPI shots = new ScreenshotsAPI(token); // screenshots; body is base64-encoded image bytes

// Push high-volume async jobs to the Enterprise Crawler via the Crawling API:
// api.get(url, options) where options carries `callback=true` + `crawler=YourCrawler`.
// See /docs/crawler for the queue-management workflow.
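
That options shape, sketched (the crawler name is a placeholder for one you create in the dashboard):

HashMap<String, Object> crawlerOpts = new HashMap<>();
crawlerOpts.put("callback", true);
crawlerOpts.put("crawler", "YourCrawler"); // placeholder crawler name
crawl.get("https://example.com", crawlerOpts);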

Common patterns

JavaScript rendering

For SPAs, lazy-loaded feeds, and pages where the initial HTML is empty, instantiate with the JavaScript token and pass any combination of page_wait, ajax_wait, scroll, and css_click_selector. Order to think about: a fixed wait, then network-idle, then scroll for lazy-load, then click for any gating UI element.

API api = new API("YOUR_JS_TOKEN");

HashMap<String, Object> opts = new HashMap<>();
opts.put("page_wait", 2000);
opts.put("ajax_wait", true);
opts.put("scroll", true);

api.get("https://spa.example.com", opts);

Use a built-in scraper

Skip the parser entirely on supported sites. Pass "scraper" → "NAME" and the body becomes a JSON string with the structured fields documented on the per-scraper page.

import com.crawlbase.*;
import com.fasterxml.jackson.databind.*;
import java.util.*;

API api = new API("YOUR_TOKEN");

HashMap<String, Object> opts = new HashMap<>();
opts.put("scraper", "amazon-product-details");
api.get("https://www.amazon.com/dp/B08N5WRWNW", opts);

ObjectMapper mapper = new ObjectMapper();
// readValue throws the checked JsonProcessingException; declare or handle it in real code.
Map<String, Object> data = mapper.readValue(api.getBody(), Map.class);
System.out.println(data.get("name") + " - " + data.get("price"));

Geo-routing

Pass "country" → "ISO" to route the crawl through that country's exit nodes. Use it any time the target serves localized content based on IP.

API api = new API("YOUR_TOKEN");

// Hit the German Amazon catalog from a German residential IP
HashMap<String, Object> opts = new HashMap<>();
opts.put("country", "DE");
api.get("https://www.amazon.com/dp/B08N5WRWNW", opts);

Retry with backoff

The recommended retry shape: exponential backoff capped at 3-5 attempts, retry on transient errors only (5xx or empty body), don't retry on 4xx.

import com.crawlbase.*;
import java.util.concurrent.ThreadLocalRandom;

public boolean crawl(API api, String url, int attempts) throws InterruptedException {
    for (int i = 0; i < attempts; i++) {
        api.get(url);
        if (api.getStatusCode() == 200 && api.getCrawlbaseStatus() == 200) {
            return true;
        }
        if (api.getStatusCode() >= 400 && api.getStatusCode() < 500) {
            throw new RuntimeException("client error " + api.getStatusCode() + ": " + url);
        }
        // Exponential backoff with jitter
        long ms = (long) (ThreadLocalRandom.current().nextDouble() * Math.pow(2, i) * 1000);
        Thread.sleep(ms);
    }
    return false;
}

Async crawls + webhooks

Fire-and-forget mode. Pass "async" → true with a "callback" URL; the call returns immediately and Crawlbase POSTs the result to your webhook when the page is ready. Useful for batch jobs and slow targets.

API api = new API("YOUR_TOKEN");

HashMap<String, Object> opts = new HashMap<>();
opts.put("async", true);
opts.put("callback", "https://your-app.com/webhook");
api.get("https://example.com", opts);

// api.getBody() now contains a JSON envelope with { rid: ... } —
// use that to correlate the eventual webhook delivery.
//
// Your Spring / Jakarta servlet receives a POST with:
// { rid, url, original_status, pc_status, body }
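
A minimal receiving endpoint, sketched with Spring Boot (a framework assumption; any servlet stack works, and the field names follow the envelope above):

import java.util.Map;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

@RestController
public class CrawlWebhookController {

    @PostMapping("/webhook")
    public ResponseEntity<Void> receive(@RequestBody Map<String, Object> payload) {
        Object rid = payload.get("rid");   // correlate with the rid from the async call
        Object body = payload.get("body"); // crawled page content
        // Persist body keyed by rid, then ack fast; do heavy processing off-thread.
        return ResponseEntity.ok().build();
    }
}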

For very high volumes (millions of URLs), use the Enterprise Crawler which sits in front of this same async pipeline.

Sticky sessions

Some flows need the same residential IP across multiple calls. Pass cookies_session with a stable identifier and Crawlbase reuses the same exit node for ~30 minutes.

API api = new API("YOUR_JS_TOKEN");

String session = "checkout-" + userId;
HashMap<String, Object> opts = new HashMap<>();
opts.put("cookies_session", session);

api.get("https://shop.example.com/cart", opts);
api.get("https://shop.example.com/checkout", opts);
api.get("https://shop.example.com/confirm", opts);

Errors & retries

The platform surfaces two status codes on every response: the SDK's own api.getStatusCode() (HTTP status of the request to Crawlbase itself) and api.getCrawlbaseStatus() (Crawlbase's verdict on the target; see the Crawling API errors table for the full list). Always branch on getCrawlbaseStatus() when deciding whether to retry: a target can return 200 with an empty body, in which case getStatusCode() is 200 but getCrawlbaseStatus() is 520.

api.get(url);
int pc = api.getCrawlbaseStatus();

switch (pc) {
    case 200:
        useBody(api.getBody());
        break;
    case 520: case 525:
        // 520 = empty body, 525 = anti-bot couldn't be solved.
        // Switch to the JS token and retry.
        retryWithJsToken(url);
        break;
    case 521: case 522: case 523:
        // Target unreachable or timed out. Retry with backoff.
        scheduleRetry(url);
        break;
    default:
        log.error("crawl failed url={} crawlbase_status={}", url, pc);
}

Note that all SDK methods throw RuntimeException (not checked exceptions) on transport failures. Wrap your retry loop accordingly, as sketched below.
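
For instance, around the crawl helper from the retry section:

try {
    boolean delivered = crawl(api, url, 5);
} catch (RuntimeException e) {
    // Transport-level failure (DNS, connect, read timeout); log and decide whether to re-queue.
} catch (InterruptedException e) {
    Thread.currentThread().interrupt(); // restore interrupt status from Thread.sleep
}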

All retries against the platform are free — only successful responses (crawlbaseStatus: 200) count against your quota.

Performance & best practices

  • Reuse a single client per token. Define it as a Spring bean / CDI singleton — each instance opens its own underlying HTTP client. Don't construct one per request.
  • Use the cheapest token that works. Don't default to the JavaScript token "just in case" — Normal-token requests are faster and use less concurrency.
  • Prefer ajax_wait over page_wait. Fixed delays burn concurrency on every request, even fast ones.
  • Mind the shared state on the API instance. Because response data lives on the api object (not a return value), do not share one API instance across multiple threads making concurrent calls: a second thread's api.get() will overwrite the first thread's response state mid-read. Pool one instance per worker thread (see the sketch after this list), or guard with a mutex.
  • For batch jobs: async + webhook, or push to the Enterprise Crawler. Thread pools blocking on synchronous calls saturate concurrency caps quickly; async + webhook releases the slot the moment a request is queued.
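
One way to get per-thread instances, sketched with ThreadLocal (assumes a single shared token; the SDK does no pooling of its own):

// Each worker thread lazily builds a private API instance, so response
// state written by get() never crosses threads.
private static final ThreadLocal<API> CLIENT =
        ThreadLocal.withInitial(() -> new API(System.getenv("CRAWLBASE_TOKEN")));

void worker(String url) {
    API api = CLIENT.get();
    api.get(url);
    // Safe to read api.getStatusCode() / api.getBody(); no other thread shares this instance.
}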

Method reference

All client classes share the same surface. Constructors take a token string; verbs mirror the underlying HTTP methods and write response state onto the api instance.

  • new API(String token) (constructor): Initialize a Crawling API client. Optional second-argument constructors set timeout / proxy. Same shape for ScraperAPI, LeadsAPI, ScreenshotsAPI.
  • api.get(String url) (method): Send a GET. Returns void; read the response via getters.
  • api.get(String url, HashMap<String, Object> options) (method): Send a GET with options. options maps any Crawling API parameter name to its value.
  • api.post(String url, HashMap<String, Object> data) (method): Send a POST. data is the form-encoded body. An optional third argument takes options.

Response state — getters on the api instance after a call:

  • api.getStatusCode() (int): HTTP status of the SDK's request to Crawlbase.
  • api.getCrawlbaseStatus() (int): Crawlbase's verdict on the target. Branch on this for retry decisions.
  • api.getOriginalStatus() (int): HTTP status the target site returned to Crawlbase.
  • api.getBody() (String): Page content (or a JSON string when format=json / scraper= was used). For ScreenshotsAPI this is a base64-encoded image; use Base64.getDecoder().decode(...) to convert, as in the sketch below.
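
For example, a minimal sketch that saves a screenshot to disk (file name and error handling are illustrative):

import com.crawlbase.*;
import java.util.Base64;
import java.nio.file.*;

ScreenshotsAPI shots = new ScreenshotsAPI("YOUR_TOKEN");
shots.get("https://example.com");

byte[] image = Base64.getDecoder().decode(shots.getBody());
Files.write(Paths.get("screenshot.jpg"), image); // throws IOException; handle or declare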