# Parameters
The API accepts the following parameters. Only `token` and `url` are mandatory; the rest are optional.
# token
- Required
- Type
string
This parameter is required for all calls.
This is your authentication token. You have two tokens: one for normal requests and another for JavaScript requests.
Use the JavaScript token when the content you need to crawl is generated via JavaScript, either because it's a JavaScript-built page (React, Angular, etc.) or because the content is dynamically generated in the browser.
Normal token
_USER_TOKEN_
JavaScript token
_JS_TOKEN_
- curl
- ruby
- node
- php
- python
- go
curl "https://api.crawlbase.com/?token=_USER_TOKEN_&url=https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories"
# url
- Required
- Type
string
This parameter is required for all calls.
You will need a URL to crawl. Make sure it starts with http or https and that it is fully encoded.
For example, for the following URL: https://github.com/crawlbase?tab=repositories
the url parameter should be encoded when calling the API, like this: https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories
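For reference, the encoding above can be reproduced with Python's standard `urllib.parse.quote`, passing `safe=""` so that reserved characters like `:`, `/`, `?`, and `=` are escaped as well:

```python
from urllib.parse import quote

# Encode the target URL so it can be passed safely as the `url` parameter.
# safe="" forces reserved characters such as ':' and '/' to be percent-encoded.
target = "https://github.com/crawlbase?tab=repositories"
encoded = quote(target, safe="")
print(encoded)  # https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories
```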
- curl
- ruby
- node
- php
- python
- go
curl "https://api.crawlbase.com/?token=_USER_TOKEN_&url=https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories"
# format
- Optional
- Type
string
Indicates the response format, either `json` or `html`. Defaults to `html`.
If format `html` is used, Crawlbase will send you back the response parameters in the headers (see HTML response below).
- curl
- ruby
- node
- php
- python
- go
curl "https://api.crawlbase.com/?token=_USER_TOKEN_&url=https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories&format=json"
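As a sketch of how such a request URL can be assembled in Python (using the `_USER_TOKEN_` placeholder from the examples above), the standard `urllib.parse.urlencode` takes care of percent-encoding every parameter, including the target URL:

```python
from urllib.parse import urlencode

# Build the query string; urlencode percent-encodes each value,
# so the target URL does not need to be pre-encoded by hand.
params = {
    "token": "_USER_TOKEN_",  # placeholder for your real token
    "url": "https://github.com/crawlbase?tab=repositories",
    "format": "json",
}
request_url = "https://api.crawlbase.com/?" + urlencode(params)
print(request_url)
```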
# pretty
- Optional
- Type
boolean
If you're expecting a `json` response, you can improve its readability by adding `&pretty=true`.
- curl
- ruby
- node
- php
- python
- go
curl "https://api.crawlbase.com/?token=_USER_TOKEN_&url=https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories&format=json&pretty=true"
# user_agent
- Optional
- Type
string
If you want to make the request with a custom user agent, you can pass it here and our servers will forward it to the requested URL.
We recommend NOT using this parameter and letting our artificial intelligence handle it.
- curl
- ruby
- node
- php
- python
- go
curl "https://api.crawlbase.com/?token=_USER_TOKEN_&user_agent=Mozilla%2F5.0+%28Macintosh%3B+Intel+Mac+OS+X+10_12_5%29+AppleWebKit%2F603.2.4+%28KHTML%2C+like+Gecko%29+Version%2F10.1.1+Safari%2F603.2.4&url=https%3A%2F%2Fpostman-echo.com%2Fheaders"
# page_wait
- Optional
- Type
number
If you are using the JavaScript token, you can optionally pass the `page_wait` parameter to wait a number of milliseconds before the browser captures the resulting HTML code.
This is useful in cases where the page takes a few seconds to render or some AJAX needs to load before the HTML is captured.
- curl
- ruby
- node
- php
- python
- go
curl "https://api.crawlbase.com/?token=_JS_TOKEN_&page_wait=1000&url=https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories"
# ajax_wait
- Optional
- Type
boolean
If you are using the JavaScript token, you can optionally pass the `ajax_wait` parameter to wait for AJAX requests to finish before getting the HTML response.
- curl
- ruby
- node
- php
- python
- go
curl "https://api.crawlbase.com/?token=_JS_TOKEN_&ajax_wait=true&url=https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories"
# css_click_selector
- Optional
- Type
string
# Single CSS Selector
If you are using the JavaScript token, you can optionally pass the `css_click_selector` parameter to click an element on the page before the browser captures the resulting HTML code.
This parameter accepts a fully specified and valid CSS selector. For example, you can use an ID selector such as `#some-button`, a class selector like `.some-other-button`, or an attribute selector such as `[data-tab-item="tab1"]`. It is important to ensure that the CSS selector is properly encoded to avoid errors.
Please note, if the selector is not found on the page, the request will fail with `pc_status` 595. To receive a response even when a selector is not found, you can append a universally found selector, like `body`, as a fallback. For example: `#some-button,body`.
# Multiple CSS Selectors
To accommodate scenarios where multiple elements may need to be clicked sequentially before capturing the page content, the `css_click_selector` parameter can now accept multiple CSS selectors. Separate each selector with a pipe (`|`) character. Ensure the entire value, including separators, is URL-encoded to avoid any parsing issues.
Suppose you want to click a button with the ID `start-button` and then a link with the class `next-page-link`. You would construct your `css_click_selector` parameter like this:
- Original selectors: `#start-button|.next-page-link`
- URL-encoded: `%23start-button%7C.next-page-link`
Append this parameter to your API request to ensure both elements are clicked in the order specified.
Please ensure all selectors provided are valid and present on the page to avoid errors. If any selector is not found, the request will adhere to the error handling specified above, failing with `pc_status` 595 unless a fallback selector is included.
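The pipe-joined value and its encoding can be reproduced with Python's standard `urllib.parse.quote`; this reproduces exactly the encoded value shown above:

```python
from urllib.parse import quote

# Join the selectors to click, in order, with a pipe separator,
# then percent-encode the whole value for use in the query string.
selectors = ["#start-button", ".next-page-link"]
value = quote("|".join(selectors), safe="")
print(value)  # %23start-button%7C.next-page-link
```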
- curl
- ruby
- node
- php
- python
- go
curl "https://api.crawlbase.com/?token=_JS_TOKEN_&css_click_selector=%5Bdata-tab-item%3D%22overview%22%5D&page_wait=1000&url=https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories"
# device
- Optional
- Type
string
Optionally, if you don't want to specify a user_agent but you do want the requests to come from a specific device type, you can use this parameter.
There are two options available: `desktop` and `mobile`.
- curl
- ruby
- node
- php
- python
- go
curl "https://api.crawlbase.com/?token=_USER_TOKEN_&device=mobile&url=https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories"
# get_cookies
- Optional
- Type
boolean
Optionally, if you need to get the cookies that the original website sets on the response, you can use the `&get_cookies=true` parameter.
The cookies will come back in the header (or in the JSON response if you use `&format=json`) as `original_set_cookie`.
- curl
- ruby
- node
- php
- python
- go
curl "https://api.crawlbase.com/?token=_USER_TOKEN_&get_cookies=true&url=https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories"
# get_headers
- Optional
- Type
boolean
Optionally, if you need to get the headers that the original website sets on the response, you can use the `&get_headers=true` parameter.
The headers will come back in the response as `original_header_name` by default. When `&format=json` is passed, the headers will come back as `original_headers`.
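As an illustration (not Crawlbase client code), the prefixed headers can be picked out of a response's header mapping like this; the `headers` dict below is a made-up stand-in for whatever your HTTP client returns:

```python
# Illustrative only: `headers` stands in for the response headers
# your HTTP client exposes when &get_headers=true is used.
headers = {
    "content-type": "text/html",
    "original_content_type": "text/html; charset=utf-8",
    "original_server": "GitHub.com",
}

# Collect every header the original website set (prefixed with "original_").
original = {k: v for k, v in headers.items() if k.startswith("original_")}
print(original)
```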
- curl
- ruby
- node
- php
- python
- go
curl "https://api.crawlbase.com/?token=_USER_TOKEN_&get_headers=true&url=https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories"
# request_headers
- Optional
- Type
string
Optionally, if you need to send request headers to the original website, you can use the `&request_headers=EncodedRequestHeaders` parameter.
Example request headers: `accept-language:en-GB|accept-encoding:gzip`
Example encoded: `&request_headers=accept-language%3Aen-GB%7Caccept-encoding%3Agzip`
Please note that not all request headers are allowed by the API. We recommend that you test the headers you send using this testing URL: https://postman-echo.com/headers
If you need to send additional headers which are not allowed by the API, please let us know the header names and we will authorize them for your token.
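The `name:value` pairs joined by pipes can be built and encoded with the standard library; this reproduces the encoded example above:

```python
from urllib.parse import quote

# Headers are written as name:value pairs joined by a pipe,
# then percent-encoded as a single query-string value.
headers = {"accept-language": "en-GB", "accept-encoding": "gzip"}
raw = "|".join(f"{name}:{value}" for name, value in headers.items())
encoded = quote(raw, safe="")
print(encoded)  # accept-language%3Aen-GB%7Caccept-encoding%3Agzip
```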
- curl
- ruby
- node
- php
- python
- go
curl "https://api.crawlbase.com/?token=_USER_TOKEN_&request_headers=accept-language%3Aen-GB%7Caccept-encoding%3Agzip&url=https%3A%2F%2Fpostman-echo.com%2Fheaders"
# set_cookies
- Optional
- Type
string
Optionally, if you need to send cookies to the original website, you can use the `&cookies=EncodedCookies` parameter.
Example cookies: `key1=value1; key2=value2; key3=value3`
Example encoded: `&cookies=key1%3Dvalue1%3B%20key2%3Dvalue2%3B%20key3%3Dvalue3`
We recommend that you test the cookies you send using this testing URL: https://postman-echo.com/cookies
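The cookie string uses the standard `key=value; key=value` form; encoding it with `urllib.parse.quote` reproduces the encoded example above:

```python
from urllib.parse import quote

# Cookies use the standard "key=value; key=value" form,
# percent-encoded as one query-string value.
cookies = "key1=value1; key2=value2; key3=value3"
encoded = quote(cookies, safe="")
print(encoded)  # key1%3Dvalue1%3B%20key2%3Dvalue2%3B%20key3%3Dvalue3
```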
- curl
- ruby
- node
- php
- python
- go
curl "https://api.crawlbase.com/?token=_USER_TOKEN_&cookies=key1%3Dvalue1%3B%20key2%3Dvalue2%3B%20key3%3Dvalue3&url=https%3A%2F%2Fpostman-echo.com%2Fcookies"
# cookies_session
- Optional
- Type
string
If you need to send the cookies that come back on every request to all subsequent calls, you can use the `&cookies_session=` parameter.
The `&cookies_session=` parameter can be any value. Simply send a new value to create a new cookies session (this will allow you to send the returned cookies from the subsequent calls to the next API calls with that cookies-session value). The value can be a maximum of 32 characters, and sessions expire 300 seconds after the last API call.
- curl
- ruby
- node
- php
- python
- go
curl "https://api.crawlbase.com/?token=_USER_TOKEN_&cookies_session=1234abcd&url=https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories"
# screenshot
- Optional
- Type
boolean
If you are using the JavaScript token, you can optionally pass the `&screenshot=true` parameter to get a screenshot in `JPEG` format of the whole crawled page.
Crawlbase will send you back the `screenshot_url` in the response headers (or in the JSON response if you use `&format=json`).
The `screenshot_url` expires in one hour.
Note: When using the `screenshot=true` parameter, you can customize the screenshot output with these additional parameters:
- `mode`: Set to `viewport` to capture only the viewport instead of the full page. Default is `fullpage`.
- `width`: Specify maximum width in pixels (only works with `mode=viewport`). Default is screen width.
- `height`: Specify maximum height in pixels (only works with `mode=viewport`). Default is screen height.
Example: `&screenshot=true&mode=viewport&width=1200&height=800`
- curl
- ruby
- node
- php
- python
- go
curl "https://api.crawlbase.com/?token=_JS_TOKEN_&screenshot=true&url=https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories"
# store
- Optional
- Type
boolean
Optionally pass the `&store=true` parameter to store a copy of the API response in the Crawlbase Cloud Storage.
Crawlbase will send you back the `storage_url` in the response headers (or in the JSON response if you use `&format=json`).
- curl
- ruby
- node
- php
- python
- go
curl "https://api.crawlbase.com/?token=_USER_TOKEN_&store=true&url=https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories"
# scraper
- Optional
- Type
string
Returns the information parsed according to the specified scraper. Check the list of all the available data scrapers to see which one to choose.
The response will come back as JSON.
Please note: `scraper` is an optional parameter. If you don't use it, you will receive back the full HTML of the page so you can scrape it freely.
- curl
- ruby
- node
- php
- python
- go
curl "https://api.crawlbase.com/?token=_USER_TOKEN_&scraper=amazon-product-details&url=https%3A%2F%2Fwww.amazon.com%2Fdp%2FB0B7CBZZ16"
# async
- Optional
- Type
boolean
- Currently only linkedin.com is supported with this parameter. Talk to us if you require other domains in async mode.
Optionally pass the `&async=true` parameter to crawl the requested URL asynchronously. Crawlbase will store the resulting page in the Crawlbase Cloud Storage.
As a result of a call with `async=true`, Crawlbase will send you back the request identifier `rid` in the JSON response. You will need to store the RID to retrieve the document from storage. With the RID, you can then use the Cloud Storage API to retrieve the resulting page.
You can use the `async=true` parameter in combination with other API parameters, for example `&async=true&autoparse=true`.
Example of a request with `async=true`:
- curl
- ruby
- node
- php
- python
- go
curl "https://api.crawlbase.com/?token=_USER_TOKEN_&async=true&url=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Fcrawlbase"
Example of a response with `async=true`:
{ "rid": "1e92e8bff32c31c2728714d4" }
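A minimal sketch of extracting the RID from that JSON response body, so it can be stored for the later Cloud Storage lookup:

```python
import json

# Parse the async response shown above and keep the RID for later
# retrieval of the crawled page from Cloud Storage.
response_body = '{ "rid": "1e92e8bff32c31c2728714d4" }'
rid = json.loads(response_body)["rid"]
print(rid)  # 1e92e8bff32c31c2728714d4
```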
# autoparse
- Optional
- Type
boolean
Optionally, if you need to get the scraped data of the page that you requested, you can pass the `&autoparse=true` parameter.
The response will come back as JSON. The structure of the response varies depending on the URL that you sent.
Please note: `&autoparse=true` is an optional parameter. If you don't use it, you will receive back the full HTML of the page so you can scrape it freely.
- curl
- ruby
- node
- php
- python
- go
curl "https://api.crawlbase.com/?token=_USER_TOKEN_&autoparse=true&url=https%3A%2F%2Fwww.amazon.com%2Fdp%2FB0B7CBZZ16"
# country
- Optional
- Type
string
If you want your requests to be geolocated from a specific country, you can use the `&country=` parameter, like `&country=US` (two-character country code).
Please take into account that specifying a country can reduce the number of successful requests you get back, so use it wisely and only when geolocated crawls are required.
Also note that some websites, like Amazon, are routed via different special proxies, and all countries are allowed regardless of whether they appear in the list.
You have access to the following countries:
- curl
- ruby
- node
- php
- python
- go
curl "https://api.crawlbase.com/?token=_USER_TOKEN_&country=US&url=https%3A%2F%2Fpostman-echo.com%2Fip"
# tor_network
- Optional
- Type
boolean
If you want to crawl onion websites over the Tor network, you can pass the `&tor_network=true` parameter.
- curl
- ruby
- node
- php
- python
- go
curl "https://api.crawlbase.com/?token=_USER_TOKEN_&tor_network=true&url=https%3A%2F%2Fwww.facebookcorewwwi.onion%2F"
# scroll
- Optional
- Type
boolean
If you are using the JavaScript token, you can optionally pass `&scroll=true` to the API. By default, this will scroll the page for a `scroll_interval` of 10 seconds.
If you want to scroll for more than 10 seconds, send `&scroll=true&scroll_interval=20`. Those parameters instruct the browser to scroll for 20 seconds after loading the page. The maximum scroll interval is 60 seconds; after 60 seconds of scrolling, the system captures the data and brings it back to you.
The default scroll interval is 10 seconds. Every 5 seconds of successful scrolling counts as an extra JS request on the Crawling API. For example, if you send a `scroll_interval` of 20, our system tries to scroll the page for a maximum of 20 seconds; if it was only able to scroll for 10 seconds, only 2 extra requests are consumed instead of 4.
Note: Please make sure to keep your connection open for up to 90 seconds if you intend to scroll for 60 seconds.
Important: Some domains require higher system timeouts, which are set automatically. When combined with the `scroll` and `page_wait` parameters, this may result in additional request counts. Contact support if you need to optimize these settings for specific domains.
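The billing rule above (one extra JS request per 5 seconds of successful scrolling) can be sketched as follows; the function name is ours, not part of the API, and rounding up per started 5-second block is our reading of the rule:

```python
import math

def extra_js_requests(scrolled_seconds: int) -> int:
    """Extra JS requests consumed for the seconds actually scrolled,
    at one extra request per (started) 5-second block."""
    return math.ceil(scrolled_seconds / 5)

# scroll_interval=20 requested, but the page only scrolled for 10 seconds:
print(extra_js_requests(10))  # 2 extra requests consumed, not 4
print(extra_js_requests(20))  # 4 extra requests
```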
- curl
- ruby
- node
- php
- python
- go
curl "https://api.crawlbase.com/?token=_JS_TOKEN_&scroll=true&url=https%3A%2F%2Fwww.reddit.com%2Fsearch%2F%3Fq%3Dcrawlbase"