Ruby
Official Ruby gem for the Crawlbase platform. Idiomatic Ruby across Ruby 2.7+ and JRuby — same gem, every API, sensible defaults that match what most Rails apps configure by hand.
How the SDK is shaped
The Ruby gem is a thin wrapper around the same HTTP API documented in API Reference. Every Crawling API parameter you'd append as a query string in a raw HTTP call is reachable from the gem as a keyword on the call — names, defaults, and behavior all map one-to-one. There is no parameter the gem adds; there is no parameter it hides.
What you get for using it instead of Net::HTTP / Faraday directly:
- URL encoding, parameter validation, and response parsing handled out of the box — application code stays focused on the business logic.
- Idiomatic Ruby surface — keyword args, snake_case parameter names, exception-raising for transport failures, plain-old-Ruby response objects.
- A single client class per Crawlbase API, all sharing the same constructor / call shape.
- Sensible defaults (90-second timeout, automatic JSON parsing of `format=json` responses, UTF-8-encoded bodies) that match what most teams configure by hand on their first integration.
Source on github.com/crawlbase/crawlbase-ruby. Issues + PRs welcome.
Install
Latest version on RubyGems. Tested on Ruby 2.7, 3.0, 3.1, 3.2, 3.3 + JRuby.
```shell
gem install crawlbase

# Or in your Gemfile
gem 'crawlbase'
```

Authentication
Every Crawlbase API authenticates with the same token model. Two token types live on a single account:
- Normal Token (TCP) — for static HTML, JSON endpoints, anything that doesn't need a browser. Faster + cheaper.
- JavaScript Token — for SPAs, lazy-loaded feeds, anything that hides content behind client-side rendering. Required to use `page_wait`, `ajax_wait`, `scroll`, and `css_click_selector`.
Use Rails credentials (Rails.application.credentials.crawlbase_token) or environment variables in production. The gem doesn't read either itself — that's deliberate so you stay in control of where credentials come from. Pattern:
```ruby
require 'crawlbase'

# Pick the right token at instantiation; the gem doesn't switch
# tokens per-call, so keep two clients if you alternate.
api = Crawlbase::API.new(token: ENV.fetch('CRAWLBASE_TOKEN'))
js  = Crawlbase::API.new(token: ENV.fetch('CRAWLBASE_JS_TOKEN'))

api.get('https://github.com/anthropic')
js.get('https://feed.example.com', page_wait: 2000)
```

Full token model + dashboard locations on the Authentication page.
Quickstart
Three lines from require to crawled HTML:
```ruby
require 'crawlbase'

api = Crawlbase::API.new(token: 'YOUR_TOKEN')
res = api.get('https://github.com/anthropic')
puts res.body if res.status_code == 200
```

Branch on `.status_code` (the HTTP status of the request to Crawlbase itself) and `.pc_status` (the Crawlbase verdict on the target — see Errors below) when deciding whether to retry. Pass `format: 'json'` to receive a JSON envelope instead of raw page content.
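With `format: 'json'`, the body arrives as a JSON envelope rather than raw HTML. A minimal sketch of unpacking it with stdlib `json` — the field names (`url`, `original_status`, `pc_status`, `body`) follow the envelope shape referenced elsewhere on this page; verify the exact shape against a live response:

```ruby
require 'json'

# Unpack a format=json envelope into [pc_status, page_body].
# Assumes the envelope carries at least 'pc_status' and 'body'.
def unpack_envelope(raw)
  data = JSON.parse(raw)
  [data.fetch('pc_status').to_i, data.fetch('body')]
end

sample = '{"url":"https://example.com","original_status":200,' \
         '"pc_status":200,"body":"<html>...</html>"}'
pc, html = unpack_envelope(sample)
# pc => 200, html => "<html>...</html>"
```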
All APIs in one gem
Every Crawlbase API has a matching class. Same constructor, same get / post verbs.
```ruby
require 'crawlbase'

token = { token: 'YOUR_TOKEN' }

crawl   = Crawlbase::API.new(**token)            # general-purpose page fetch
scraper = Crawlbase::ScraperAPI.new(**token)     # parsed JSON for supported sites
leads   = Crawlbase::LeadsAPI.new(**token)       # domain-scoped email extraction (legacy)
shots   = Crawlbase::ScreenshotsAPI.new(**token) # screenshots of any URL
storage = Crawlbase::StorageAPI.new(**token)     # Cloud Storage CRUD

# Push high-volume async jobs to the Enterprise Crawler via the Crawling API:
#   api.get(url, async: true, callback: '...', crawler: 'YourCrawler')
# See /docs/crawler for the queue workflow.
```

Common patterns
JavaScript rendering
For SPAs, lazy-loaded feeds, and pages where the initial HTML is empty, instantiate with the JavaScript token and pass any combination of page_wait, ajax_wait, scroll, and css_click_selector. Order to think about: a fixed wait, then network-idle, then scroll for lazy-load, then click for any gating UI element.
```ruby
api = Crawlbase::API.new(token: 'YOUR_JS_TOKEN')
res = api.get('https://spa.example.com',
              page_wait: 2000,
              ajax_wait: true,
              scroll: true)
```

Use a built-in scraper
Skip the parser entirely on supported sites. Pass scraper: 'NAME' and the response body becomes a JSON string with the structured fields documented on the per-scraper page.
```ruby
require 'crawlbase'
require 'json'

api = Crawlbase::ScraperAPI.new(token: 'YOUR_TOKEN')
res = api.get('https://www.amazon.com/dp/B08N5WRWNW',
              scraper: 'amazon-product-details')

data = JSON.parse(res.body)
puts data['name'], data['price']
```

Geo-routing
Pass `country: 'ISO'` (a two-letter country code) to route the crawl through that country's exit nodes. Use it any time the target serves localized content based on IP.

```ruby
api = Crawlbase::API.new(token: 'YOUR_TOKEN')

# Hit the German Amazon catalog from a German residential IP
res = api.get('https://www.amazon.com/dp/B08N5WRWNW', country: 'DE')
```

Retry with backoff
The recommended retry shape: exponential backoff capped at 3-5 attempts, retry on transient errors only (5xx or empty body), don't retry on 4xx.
```ruby
require 'crawlbase'

api = Crawlbase::API.new(token: 'YOUR_TOKEN')

def crawl(api, url, attempts: 5)
  attempts.times do |i|
    res = api.get(url)
    return res if res.status_code == 200 && res.pc_status.to_i == 200
    raise "client error: #{res.status_code}" if (400..499).cover?(res.status_code)
    sleep(rand * (2**i)) # exponential backoff with jitter
  end
  raise "Failed: #{url}"
end
```

Async crawls + webhooks
Fire-and-forget mode. The gem call returns immediately with an rid; Crawlbase POSTs the result to your callback URL when the page is ready. Useful for batch jobs and slow targets.
```ruby
api = Crawlbase::API.new(token: 'YOUR_TOKEN')

res = api.get('https://example.com',
              async: true,
              callback: 'https://your-app.com/webhook')

rid = res.rid # correlate the eventual webhook delivery

# Your Rails / Sinatra webhook receives a POST with:
#   { rid, url, original_status, pc_status, body }
```

For very high volumes (millions of URLs), use the Enterprise Crawler, which sits in front of this same async pipeline.
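The webhook body can be handled in a few lines of plain Ruby before any framework routing. A sketch under the payload shape shown above — `process_crawl_result` and `schedule_retry` are placeholders for your own handler and retry queue:

```ruby
require 'json'

# Decide what to do with one webhook delivery. Returns the rid so the
# caller can correlate it with the job that queued the crawl.
def handle_crawl_webhook(raw_body)
  payload = JSON.parse(raw_body)
  rid = payload.fetch('rid')
  if payload['pc_status'].to_i == 200
    process_crawl_result(rid, payload['body'])   # your handler
  else
    schedule_retry(payload['url'])               # your retry queue
  end
  rid
end
```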
Sticky sessions
Some flows need the same residential IP across multiple calls. Pass cookies_session with a stable identifier and Crawlbase reuses the same exit node for ~30 minutes.
```ruby
api = Crawlbase::API.new(token: 'YOUR_JS_TOKEN')

session = "checkout-#{user_id}"
api.get('https://shop.example.com/cart',     cookies_session: session)
api.get('https://shop.example.com/checkout', cookies_session: session)
api.get('https://shop.example.com/confirm',  cookies_session: session)
```

Errors & retries
The platform surfaces two status codes on every response: the gem's own .status_code (HTTP status of the request to Crawlbase itself) and .pc_status (Crawlbase's verdict on the target — see the Crawling API errors table for the full list). Always branch on .pc_status when deciding whether to retry — a target can return 200 with empty body, in which case .status_code is 200 but .pc_status is 520.
```ruby
res = api.get(url)
pc  = res.pc_status.to_i

case pc
when 200
  use(res.body)
when 520, 525
  # 520 = empty body, 525 = anti-bot couldn't be solved.
  # Switch to JS token and retry.
  retry_with_js_token(url)
when 521, 522, 523
  # Target unreachable or timed out. Retry with backoff.
  schedule_retry(url)
else
  Rails.logger.error("crawl failed url=#{url} pc_status=#{pc}")
end
```

All retries against the platform are free — only successful responses (`pc_status: 200`) count against your quota.
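The branching above can be factored into a pure function, which keeps the retry policy unit-testable and reusable across workers. A sketch using the codes discussed in this section — the mapping to actions is a suggested policy, not something the gem enforces:

```ruby
# Map a (status_code, pc_status) pair to a retry decision.
#   :use           -> consume the body
#   :retry_js      -> re-crawl with the JavaScript token
#   :retry_backoff -> transient target failure, retry later
#   :give_up       -> client error or unknown, don't retry
def crawl_action(status_code, pc_status)
  return :give_up if (400..499).cover?(status_code)
  case pc_status.to_i
  when 200           then :use
  when 520, 525      then :retry_js       # empty body / unsolved anti-bot
  when 521, 522, 523 then :retry_backoff  # unreachable or timed out
  else                    :give_up
  end
end
```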
Performance & best practices
- Reuse a single client per token. The constructor is cheap but each instance opens its own connection. Build it once at app boot (Rails initializer is the natural spot), share it across requests.
- Use the cheapest token that works. Don't default to the JavaScript token "just in case" — Normal-token requests are faster and use less concurrency. Promote to JS only when the Normal response is empty or anti-bot-blocked.
- Prefer `ajax_wait` over `page_wait`. Fixed delays burn concurrency on every request, even fast ones.
- For batch jobs: async + webhook, or push to the Enterprise Crawler. Sidekiq workers calling the gem synchronously will saturate your concurrency cap; async + webhook releases the slot the moment a request is queued.
- Watch the `remaining` response header. It carries the number of concurrency slots you have left — back off proactively before hitting the cap rather than reacting to 429s.
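The first bullet — one client per token, built once at boot — can be packaged as a small cache. A sketch (`ClientCache` is a hypothetical helper, not part of the gem; in a Rails initializer you would pass `->(t) { Crawlbase::API.new(token: t) }` as the factory):

```ruby
# Process-wide, per-token client reuse. The factory block builds a client
# for a token; the cache guarantees each token gets exactly one instance.
class ClientCache
  def initialize(&factory)
    @factory = factory
    @clients = {}
    @mutex   = Mutex.new
  end

  # Returns the same client object for the same token, thread-safely.
  def for(token)
    @mutex.synchronize { @clients[token] ||= @factory.call(token) }
  end
end

# In config/initializers/crawlbase.rb (assumed layout):
#   CRAWLBASE = ClientCache.new { |t| Crawlbase::API.new(token: t) }
# Then anywhere in the app:
#   CRAWLBASE.for(ENV.fetch('CRAWLBASE_TOKEN')).get(url)
```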
Method reference
All client classes share the same surface. Constructor takes keyword arguments; verbs mirror the underlying HTTP methods.
- Constructor — keyword arguments only; `timeout` in seconds (default 90).
- `get` / `post` — `options` maps any Crawling API parameter to its value; returns a response object. For `post`, `data` is the body — pass a hash for form-encoded, a string for raw.

Response shape — methods on the response object:

- `status_code` — HTTP status of the request to Crawlbase itself.
- `pc_status` — Crawlbase's verdict on the target (see Errors & retries).
- `body` — the response content (a JSON string when `format=json` / `scraper=` was used).
- `rid` — the request id (present when `async: true` or `store: true` was used).
