Ruby
Official Ruby gem for the Crawlbase platform. Idiomatic Ruby across Ruby 2.7+ and JRuby — same gem, every API, sensible defaults that match what most Rails apps configure by hand.
How the SDK is shaped
The Ruby gem is a thin wrapper around the same HTTP API documented in API Reference. Every Crawling API parameter you'd append as a query string in a raw HTTP call is reachable from the gem as a keyword on the call — names, defaults, and behavior all map one-to-one. There is no parameter the gem adds; there is no parameter it hides.
What you get for using it instead of Net::HTTP / Faraday directly:
- URL encoding, parameter validation, and response parsing handled out of the box — application code stays focused on the business logic.
- Idiomatic Ruby surface — keyword args, snake_case parameter names, exception-raising for transport failures, plain-old-Ruby response objects.
- A single client class per Crawlbase API, all sharing the same constructor / call shape.
- Sensible defaults (90-second timeout, automatic JSON parsing of `format=json` responses, UTF-8-encoded bodies) that match what most teams configure by hand on their first integration.
Source on github.com/crawlbase/crawlbase-ruby. Issues + PRs welcome.
Install
Latest version on RubyGems. Tested on Ruby 2.7, 3.0, 3.1, 3.2, 3.3 + JRuby.
```shell
gem install crawlbase

# Or in your Gemfile
gem 'crawlbase'
```

Authentication
Every Crawlbase API authenticates with the same token model. Two token types live on a single account:
- Normal Token (TCP) — for static HTML, JSON endpoints, anything that doesn't need a browser. Faster + cheaper.
- JavaScript Token — for SPAs, lazy-loaded feeds, anything that hides content behind client-side rendering. Required to use `page_wait`, `ajax_wait`, `scroll`, and `css_click_selector`.
Use Rails credentials (Rails.application.credentials.crawlbase_token) or environment variables in production. The gem doesn't read either itself — that's deliberate so you stay in control of where credentials come from. Pattern:
```ruby
require 'crawlbase'

# Pick the right token at instantiation; the gem doesn't switch
# tokens per-call, so keep two clients if you alternate.
api = Crawlbase::API.new(token: ENV.fetch('CRAWLBASE_TOKEN'))
js  = Crawlbase::API.new(token: ENV.fetch('CRAWLBASE_JS_TOKEN'))

api.get('https://github.com/anthropic')
js.get('https://feed.example.com', page_wait: 2000)
```

Full token model + dashboard locations on the Authentication page.
Quickstart
Three lines from require to crawled HTML:
```ruby
require 'crawlbase'

api = Crawlbase::API.new(token: 'YOUR_TOKEN')
res = api.get('https://github.com/anthropic')
puts res.body if res.status_code == 200
```

Branch on `.status_code` (the HTTP status of the request to Crawlbase itself) and `.pc_status` (the Crawlbase verdict on the target — see Errors below) when deciding whether to retry. Pass `format: 'json'` to receive a JSON envelope instead of raw page content.
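With `format: 'json'`, the body arrives as a JSON envelope rather than raw HTML. A minimal sketch of unpacking it with stdlib `json` — the field names (`url`, `original_status`, `pc_status`, `body`) follow the envelope shape referenced elsewhere on this page; verify the exact shape against a live response:

```ruby
require 'json'

# Unpack a format=json envelope into [pc_status, page_body].
# Assumes the envelope carries at least 'pc_status' and 'body'.
def unpack_envelope(raw)
  data = JSON.parse(raw)
  [data.fetch('pc_status').to_i, data.fetch('body')]
end

sample = '{"url":"https://example.com","original_status":200,' \
         '"pc_status":200,"body":"<html>...</html>"}'
pc, html = unpack_envelope(sample)
# pc => 200, html => "<html>...</html>"
```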
All APIs in one gem
Every Crawlbase API has a matching class. Same constructor, same get / post verbs.
```ruby
require 'crawlbase'

token = { token: 'YOUR_TOKEN' }

crawl   = Crawlbase::API.new(**token)            # general-purpose page fetch
scraper = Crawlbase::ScraperAPI.new(**token)     # parsed JSON for supported sites
leads   = Crawlbase::LeadsAPI.new(**token)       # domain-scoped email extraction (legacy)
shots   = Crawlbase::ScreenshotsAPI.new(**token) # screenshots of any URL
storage = Crawlbase::StorageAPI.new(**token)     # Cloud Storage CRUD

# Push high-volume async jobs to the Enterprise Crawler via the Crawling API:
#   api.get(url, async: true, callback: '...', crawler: 'YourCrawler')
# See /docs/crawler for the queue workflow.
```

Common patterns
JavaScript rendering
For SPAs, lazy-loaded feeds, and pages where the initial HTML is empty, instantiate with the JavaScript token and pass any combination of page_wait, ajax_wait, scroll, and css_click_selector. Order to think about: a fixed wait, then network-idle, then scroll for lazy-load, then click for any gating UI element.
```ruby
api = Crawlbase::API.new(token: 'YOUR_JS_TOKEN')
res = api.get('https://spa.example.com',
              page_wait: 2000,
              ajax_wait: true,
              scroll: true)
```

Use a built-in scraper
Skip the parser entirely on supported sites. Pass scraper: 'NAME' and the response body becomes a JSON string with the structured fields documented on the per-scraper page.
```ruby
require 'crawlbase'
require 'json'

api = Crawlbase::ScraperAPI.new(token: 'YOUR_TOKEN')
res = api.get('https://www.amazon.com/dp/B08N5WRWNW',
              scraper: 'amazon-product-details')

data = JSON.parse(res.body)
puts data['name'], data['price']
```

Geo-routing
Pass `country: 'ISO'` (a two-letter country code) to route the crawl through that country's exit nodes. Use it any time the target serves localized content based on IP.

```ruby
api = Crawlbase::API.new(token: 'YOUR_TOKEN')

# Hit the German Amazon catalog from a German residential IP
res = api.get('https://www.amazon.com/dp/B08N5WRWNW', country: 'DE')
```

Retry with backoff
The recommended retry shape: exponential backoff capped at 3-5 attempts, retry on transient errors only (5xx or empty body), don't retry on 4xx.
```ruby
require 'crawlbase'

api = Crawlbase::API.new(token: 'YOUR_TOKEN')

def crawl(api, url, attempts: 5)
  attempts.times do |i|
    res = api.get(url)
    return res if res.status_code == 200 && res.pc_status.to_i == 200
    raise "client error: #{res.status_code}" if (400..499).cover?(res.status_code)
    sleep(rand * (2**i)) # exponential backoff with jitter
  end
  raise "Failed: #{url}"
end
```

Async crawls + webhooks
Fire-and-forget mode. The gem call returns immediately with an rid; Crawlbase POSTs the result to your callback URL when the page is ready. Useful for batch jobs and slow targets.
```ruby
api = Crawlbase::API.new(token: 'YOUR_TOKEN')

res = api.get('https://example.com',
              async: true,
              callback: 'https://your-app.com/webhook')

rid = res.rid # correlate the eventual webhook delivery

# Your Rails / Sinatra webhook receives a POST with:
#   { rid, url, original_status, pc_status, body }
```

For very high volumes (millions of URLs), use the Enterprise Crawler, which sits in front of this same async pipeline.
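The webhook body can be handled in a few lines of plain Ruby before any framework routing. A sketch under the payload shape shown above — `process_crawl_result` and `schedule_retry` are placeholders for your own handler and retry queue:

```ruby
require 'json'

# Decide what to do with one webhook delivery. Returns the rid so the
# caller can correlate it with the job that queued the crawl.
def handle_crawl_webhook(raw_body)
  payload = JSON.parse(raw_body)
  rid = payload.fetch('rid')
  if payload['pc_status'].to_i == 200
    process_crawl_result(rid, payload['body'])   # your handler
  else
    schedule_retry(payload['url'])               # your retry queue
  end
  rid
end
```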
Sticky sessions
Some flows need the same residential IP across multiple calls. Pass cookies_session with a stable identifier and Crawlbase reuses the same exit node for ~30 minutes.
```ruby
api = Crawlbase::API.new(token: 'YOUR_JS_TOKEN')

session = "checkout-#{user_id}"
api.get('https://shop.example.com/cart',     cookies_session: session)
api.get('https://shop.example.com/checkout', cookies_session: session)
api.get('https://shop.example.com/confirm',  cookies_session: session)
```

Errors & retries
The platform surfaces two status codes on every response: the gem's own .status_code (HTTP status of the request to Crawlbase itself) and .pc_status (Crawlbase's verdict on the target — see the Crawling API errors table for the full list). Always branch on .pc_status when deciding whether to retry — a target can return 200 with empty body, in which case .status_code is 200 but .pc_status is 520.
```ruby
res = api.get(url)
pc  = res.pc_status.to_i

case pc
when 200
  use(res.body)
when 520, 525
  # 520 = empty body, 525 = anti-bot couldn't be solved.
  # Switch to JS token and retry.
  retry_with_js_token(url)
when 521, 522, 523
  # Target unreachable or timed out. Retry with backoff.
  schedule_retry(url)
else
  Rails.logger.error("crawl failed url=#{url} pc_status=#{pc}")
end
```

All retries against the platform are free — only successful responses (`pc_status: 200`) count against your quota.
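The branching above can be factored into a pure function, which keeps the retry policy unit-testable and reusable across workers. A sketch using the codes discussed in this section — the mapping to actions is a suggested policy, not something the gem enforces:

```ruby
# Map a (status_code, pc_status) pair to a retry decision.
#   :use           -> consume the body
#   :retry_js      -> re-crawl with the JavaScript token
#   :retry_backoff -> transient target failure, retry later
#   :give_up       -> client error or unknown, don't retry
def crawl_action(status_code, pc_status)
  return :give_up if (400..499).cover?(status_code)
  case pc_status.to_i
  when 200           then :use
  when 520, 525      then :retry_js       # empty body / unsolved anti-bot
  when 521, 522, 523 then :retry_backoff  # unreachable or timed out
  else                    :give_up
  end
end
```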
Performance & best practices
- Reuse a single client per token. The constructor is cheap but each instance opens its own connection. Build it once at app boot (Rails initializer is the natural spot), share it across requests.
- Use the cheapest token that works. Don't default to the JavaScript token "just in case" — Normal-token requests are faster and use less concurrency. Promote to JS only when the Normal response is empty or anti-bot-blocked.
- Prefer `ajax_wait` over `page_wait`. Fixed delays burn concurrency on every request, even fast ones.
- For batch jobs: async + webhook, or push to the Enterprise Crawler. Sidekiq workers calling the gem synchronously will saturate your concurrency cap; async + webhook releases the slot the moment a request is queued.
- Watch the `remaining` response header. It carries the number of concurrency slots you have left — back off proactively before hitting the cap rather than reacting to 429s.
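The first bullet — one client per token, built once at boot — can be packaged as a small cache. A sketch (`ClientCache` is a hypothetical helper, not part of the gem; in a Rails initializer you would pass `->(t) { Crawlbase::API.new(token: t) }` as the factory):

```ruby
# Process-wide, per-token client reuse. The factory block builds a client
# for a token; the cache guarantees each token gets exactly one instance.
class ClientCache
  def initialize(&factory)
    @factory = factory
    @clients = {}
    @mutex   = Mutex.new
  end

  # Returns the same client object for the same token, thread-safely.
  def for(token)
    @mutex.synchronize { @clients[token] ||= @factory.call(token) }
  end
end

# In config/initializers/crawlbase.rb (assumed layout):
#   CRAWLBASE = ClientCache.new { |t| Crawlbase::API.new(token: t) }
# Then anywhere in the app:
#   CRAWLBASE.for(ENV.fetch('CRAWLBASE_TOKEN')).get(url)
```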
Method reference
All client classes share the same surface. Constructor takes keyword arguments; verbs mirror the underlying HTTP methods.
- Constructor — keyword arguments only; `timeout` in seconds (default 90).
- `get` / `post` — `options` maps any Crawling API parameter to its value; returns a response object. For `post`, `data` is the body — pass a hash for form-encoded, a string for raw.

Response shape — methods on the response object:

- `status_code` — HTTP status of the request to Crawlbase itself.
- `pc_status` — Crawlbase's verdict on the target (see Errors & retries).
- `body` — the response content (a JSON string when `format=json` / `scraper=` was used).
- `rid` — the request id (present when `async: true` or `store: true` was used).
