# Webhook receiving
To receive the data pushed from your crawler, you will need to create a webhook endpoint on your server.
Your server webhook should...
- Be publicly reachable from Crawlbase servers
- Be ready to receive `POST` calls
- Respond within 200ms with a status code `200`, `201` or `204` without content
The way the data is structured will depend on the format you specified when pushing the URL with the `format` parameter: `&format=html` (which is the default) or `&format=json`.
The Crawler engine will send the data back to your callback endpoint via the `POST` method with `gzip` compression.
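As an illustration, here is a minimal sketch of such an endpoint in Python with Flask. The framework and the endpoint path are assumptions for the example, not part of the Crawler API; the point is to accept `POST`, handle `gzip` bodies, and answer within the 200ms budget.

```python
import gzip

from flask import Flask, request

app = Flask(__name__)

# Hypothetical path; any publicly reachable URL on your server works.
@app.route("/crawlbase-webhook", methods=["POST"])
def crawlbase_webhook():
    raw = request.get_data()
    # The Crawler delivers gzip-compressed bodies (except to Zapier hooks).
    if request.headers.get("Content-Encoding") == "gzip":
        raw = gzip.decompress(raw)

    # Hand `raw` off to a queue or background worker instead of processing
    # it inline, so the endpoint can respond quickly.
    return "", 204  # 200, 201 or 204 without content are all accepted


if __name__ == "__main__":
    app.run(port=8080)
```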
Note: Make sure that your callback is available at all times. Whenever we deliver to your callback and your server fails to return a successful response, we retry crawling the page and then retry the delivery. Those retries are considered successful requests, so they are charged.
Note: If you are using Zapier webhooks, the Crawler does not send the data compressed, as Zapier hooks do not work with gzip compression.
# Request examples
Here are examples of what your server webhook can expect to receive from the Crawlbase Crawler.
# Format HTML
This is what you receive when you call the API with `&format=html`.
Headers:
"Content-Type" => "text/plain"
"Content-Encoding" => "gzip"
"Original-Status" => 200
"PC-Status" => 200
"rid" => "The RID you received in the push call"
"url" => "The URL which was crawled"
Body:
The HTML of the page
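Continuing the Flask sketch above, an `&format=html` delivery could be unpacked like this (the helper name is hypothetical):

```python
import gzip

from flask import Request


def parse_html_delivery(req: Request) -> dict:
    """Unpack a &format=html delivery: metadata in headers, HTML in the body."""
    body = req.get_data()
    if req.headers.get("Content-Encoding") == "gzip":
        body = gzip.decompress(body)
    return {
        "rid": req.headers.get("rid"),
        "url": req.headers.get("url"),
        "original_status": req.headers.get("Original-Status"),
        "pc_status": req.headers.get("PC-Status"),
        "html": body.decode("utf-8"),  # assumes UTF-8 encoded pages
    }
```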
# Format JSON
This is what you receive when you call the API with `&format=json`.
Headers:
"Content-Type" => "gzip/json"
"Content-Encoding" => "gzip"
Body:
{
pc_status: 200,
original_status: 200,
rid: "The RID you received in the push call",
url: "The URL which was crawled",
body: "The HTML of the page"
}
Please note that `pc_status` and `original_status` must be checked. You can read more about each of them in their respective sections of the documentation.
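As a sketch under the same Flask assumptions, an `&format=json` delivery can be decompressed, parsed, and gated on both status fields; `process` is a hypothetical handler of yours:

```python
import gzip
import json

from flask import Request


def handle_json_delivery(req: Request) -> None:
    data = json.loads(gzip.decompress(req.get_data()))
    # Only treat the page as successfully crawled when both statuses are OK.
    if data["pc_status"] == 200 and data["original_status"] == 200:
        process(data["rid"], data["url"], data["body"])  # hypothetical
    else:
        print(f"Delivery for RID {data['rid']} needs attention: "
              f"pc_status={data['pc_status']}, "
              f"original_status={data['original_status']}")
```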
# Testing integration
When creating your webhook, it can be helpful to see the exact response for a specific URL.
To help with this, you can configure Crawlbase Storage in your crawlers for testing purposes and inspect the stored responses there.
# Monitoring bot
The Crawler will monitor your webhook URL to know its status: if the webhook is down, the Crawler will pause and resume automatically when your webhook comes back up.
Our monitoring bot will keep sending requests to your webhook endpoint. Make sure to ignore those requests, answering them with a `200` status code.
- Monitoring requests come as `POST` requests with a JSON body, just like the non-monitoring calls.
- Monitoring requests come with the user agent `Crawlbase Monitoring Bot 1.0`, so you can easily detect them and answer with status `200`, as in the sketch below.
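A sketch of such a short-circuit at the top of the webhook handler (Flask assumed, as in the earlier examples):

```python
from flask import request

MONITORING_UA = "Crawlbase Monitoring Bot 1.0"


def is_monitoring_request() -> bool:
    """True for Crawlbase monitoring pings, which only need a 200 reply."""
    return request.headers.get("User-Agent") == MONITORING_UA


# At the top of the webhook handler:
#     if is_monitoring_request():
#         return "", 200
```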
# Protecting your webhook
If you use a random endpoint like `yourdomain.com/2340JOiow43djoqe21rjosi`, it is unlikely to be discovered, but in any case you can protect the webhook endpoint with the following methods (or several of them combined):
- Send a custom header containing some token with your request, and check for its presence in your webhook.
- Use some URL parameter and check its presence on the webhook request, like: `yourdomain.com/2340JOiow43djoqe21rjosi?token=1234`
- Only accept `POST` requests.
- Check for some of the expected headers (for example `Pc-Status`, `Original-Status`, `rid`, etc.). A sketch combining several of these checks follows below.
We don't recommend IP whitelisting, as our crawlers can push from many different IPs and the IPs can change without prior notification.
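For illustration, a sketch of a token check covering both variants (the header name `X-Webhook-Token` and the token value are hypothetical), using a constant-time comparison:

```python
import hmac

from flask import abort, request

EXPECTED_TOKEN = "1234"  # hypothetical; keep the real value in configuration


def require_token() -> None:
    """Reject requests lacking the shared token; 404 keeps the endpoint hidden."""
    token = request.args.get("token") or request.headers.get("X-Webhook-Token")
    if token is None or not hmac.compare_digest(token, EXPECTED_TOKEN):
        abort(404)
```

Call `require_token()` at the top of the webhook handler, before any other processing.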