Businesses that want to stay ahead and make smarter decisions depend on web data more than ever. Crawlbase makes this easy with powerful tools for web scraping. One of its best products, the Crawlbase Crawler, helps you collect data asynchronously without waiting for the response. You can send URLs to it using the Crawlbase Crawling API, and instead of waiting or constantly checking for results, the Crawler automatically sends the scraped data to your server using a webhook—all in real-time. This means faster data collection with less effort.

In this blog, we’ll take a closer look at the Crawlbase Crawler and how its asynchronous processing and webhook integration make large-scale web scraping smooth and hassle-free. By the end of this blog, you’ll understand how to set up and use Crawlbase Crawler effectively.

Creating the Crawlbase Crawler

To use the Crawler, you must first create it from your Crawlbase account dashboard. Depending on your needs, you can create one of two types of Crawler: TCP or JavaScript. Use the TCP Crawler to crawl static pages. Use the JS Crawler when the content you need is generated via JavaScript, either because the page is built with a JavaScript framework (React, Angular, etc.) or because the content is rendered dynamically in the browser.

For this example, we will create a TCP Crawler from the dashboard.

Image showing Create Crawler Page without options

To create a Crawler, you need a destination for its output: either your own webhook or the Crawlbase Storage API. If you would rather not build your own webhook, Crawlbase offers a seamless alternative through its Storage API, which stores the data your Crawler generates securely.

Crawlbase Storage API option

By setting up your Crawler to use the Storage webhook endpoint, you can securely store your crawled data with added privacy and control—without worrying about storage limits. To do this, simply select the Crawlbase Storage option when creating your Crawler.

If you prefer not to use Crawlbase Storage, you can specify your own webhook endpoint to receive the data directly. The steps below explain how to create a webhook that meets Crawlbase Crawler’s requirements using the Python Django framework.

1. Creating a Webhook

A webhook is an HTTP-based callback mechanism that allows one system to send real-time data to another when a specific event occurs. In the case of Crawlbase Crawler, the webhook should:

  1. Be publicly reachable from Crawlbase servers
  2. Be ready to receive HTTP POST calls
  3. Respond within 200ms with a status code of 200, 201, or 204 and no content in the body

Let’s create a simple webhook receiver using the Python Django framework. Make sure you have Python and Django installed, then follow these steps:

STEP 1

Create a new Django project and app using the following commands:

# Command to create the Project:

django-admin startproject webhook_project

# Go to webhook_project directory using terminal:

cd webhook_project

# Create webhook_app:

python manage.py startapp webhook_app

STEP 2

In the webhook_app directory, open the views.py file and define a view to receive the webhook data:

# webhook_app/views.py

from django.views.decorators.csrf import csrf_exempt
from django.http import HttpResponse
import gzip


def save_data_to_file(data):
    # Append the received data to a file named 'webhook_data.txt'
    with open('webhook_data.txt', 'a') as file:
        file.write(str(data) + '\n')


@csrf_exempt
def webhook_receiver(request):
    if request.method == 'POST':
        try:
            # The Crawler delivers the body gzip-compressed
            decompressed_data = gzip.decompress(request.body)
        except OSError:
            return HttpResponse('Error: Unable to decompress data', status=400)

        # Convert the decompressed bytes to a string (or process them further)
        data_string = decompressed_data.decode('latin1')

        # Save the data to the file
        save_data_to_file(data_string)

        # Return 204 (no content) to the Crawler
        return HttpResponse(status=204)

    # Reject anything other than POST
    return HttpResponse(status=405)

The webhook_receiver function is decorated with @csrf_exempt to allow external services to send data without CSRF protection. It attempts to decompress Gzip-encoded data from the request body and, if successful, decodes it (assumed to be HTML) into a string. The data is then appended to a file named webhook_data.txt.

While this example simplifies things by saving the scraped HTML to a single file, in practice, you can extract and process any type of data from the HTML received via the webhook as needed.
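For example, if you wanted to pull a product title out of the received HTML, a parser such as BeautifulSoup works well. The sketch below assumes beautifulsoup4 is installed and that the page contains an element with id "productTitle" (an assumption about Amazon's product-page markup, not something Crawlbase guarantees):

# A minimal parsing sketch (assumes: pip install beautifulsoup4)
from bs4 import BeautifulSoup

def extract_title(html):
    soup = BeautifulSoup(html, 'html.parser')
    # 'productTitle' is an assumption about the Amazon product page markup
    title_tag = soup.find(id='productTitle')
    return title_tag.get_text(strip=True) if title_tag else 'Title not found'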

STEP 3

Configure URL routing. In the webhook_project directory, edit the urls.py file to add a URL pattern for the webhook receiver:

# webhook_project/urls.py

from django.contrib import admin
from django.urls import path
from webhook_app.views import webhook_receiver

urlpatterns = [
    path('admin/', admin.site.urls),
    path('webhook/crawlbase/', webhook_receiver, name='webhook_receiver'),
]

STEP 4

Start the Django development server to test the webhook receiver:

# Command to start the development server
# Note: On Linux/macOS you may need to use python3 instead of python
python manage.py runserver

The app will start running on localhost port 8000.

Start Django development server
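Before exposing the endpoint, you can sanity-check it locally by POSTing a gzip-compressed body, the same way the Crawler will. The sketch below assumes the requests package is installed; the payload is just a stand-in for real crawled HTML:

# Local smoke test for the webhook (assumes: pip install requests)
import gzip
import requests

# Stand-in for the gzip-compressed HTML the Crawler would send
payload = gzip.compress(b'<html><body>Test page</body></html>')

response = requests.post(
    'http://localhost:8000/webhook/crawlbase/',
    data=payload,
    headers={'Content-Type': 'text/plain', 'Content-Encoding': 'gzip'},
)
print(response.status_code)  # Expect 204 if the view decompressed and saved the body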

After creating the webhook, the next step is to make it publicly available on the internet.

For this example, we are using ngrok. Since our webhook is running on localhost port 8000, we need to run ngrok against port 8000.
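With ngrok installed, exposing the local server takes a single command:

# Start an ngrok tunnel that forwards a public URL to local port 8000
ngrok http 8000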

ngrok console

After running ngrok on port 8000, it provides a public forwarding URL that we can use when creating the Crawler. With the free version of ngrok, this link expires after 2 hours.

Creating Crawlbase Crawler with Webhook

Now, let’s create a Crawler from the dashboard.

Create new Crawler from Crawlbase Crawler dashboard

Start by giving your Crawler a unique name, like “test-crawler” in our case, and specify your webhook URL in the callback option. In this example, the webhook URL is the public ngrok forwarding URL followed by the webhook route (/webhook/crawlbase/).

Pushing URLs to the Crawler

Now that you’ve created the “test-crawler”, the next step is to push the URLs you want it to crawl. To do this, you’ll need to use the Crawlbase Crawling API, along with two additional parameters: crawler=YourCrawlerName and callback=true. By default, you can push up to 30 URLs per second to the Crawler. If you need to increase this limit, you can request a change by contacting Crawlbase customer support.

Here’s an example in Python that uses the Crawlbase Python library to push URLs to the Crawler.

# To install the crawlbase library:
# pip install crawlbase

# Importing CrawlingAPI
from crawlbase import CrawlingAPI

# Initializing CrawlingAPI with your TCP token
api = CrawlingAPI({'token': 'YOUR_CRAWLBASE_TOKEN'})

# Using random Amazon URLs for the example
urls = [
    'https://www.amazon.com/AIRLITE-Microphone-Licensed-Microsoft-Accessories-x/dp/B08JR8HF2G',
    'https://www.amazon.com/Cabinet-Stainless-Kitchen-Drawer-Handles/dp/B07SPXKNXN',
    'https://www.amazon.com/Mkono-Decorative-Decoration-Organizer-Farmhouse/dp/B08292QMQR',
]

for url in urls:
    # Asynchronously sending the crawling request with the crawler and callback options
    response = api.get(url, options={'callback': 'true', 'crawler': 'test-crawler'})

    # Printing the content of the response body (contains the RID)
    print(response['body'])

After running the code, the Crawling API will push all URLs to the Crawler queue.

Example Output:

b'{"rid":"d756c32b0999b1c0507e364f"}'
b'{"rid":"455ee207f6907fbd6168ac1e"}'
b'{"rid":"e9eb6ce579dec207e8973615"}'

For every URL you push to the Crawler using the Crawling API, you’ll receive an RID (Request ID). You can use this RID to track your request. Once the Crawler processes the HTML data, it will automatically be pushed to the webhook you specified when creating the Crawler, keeping the process asynchronous.

The Crawler offers APIs that allow you to perform different actions, such as Find, Delete, Pause, Unpause, etc. You can learn more about them here.

Note: The total number of pages in all Crawler waiting queues is capped at 1 million. If the combined queues exceed this limit, your Crawler push will temporarily pause, and you will be notified via email. The push will automatically resume once the number of pages in the queue drops below 1 million.
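Given the default push rate of 30 URLs per second and the queue cap above, large batches benefit from a simple client-side throttle. Below is a minimal sketch that reuses the api object and option names from the earlier push example; adjust per_second if Crawlbase support raises your limit:

import time

# Push URLs in batches, staying under the default limit of 30 URLs per second
def push_urls(api, urls, crawler_name, per_second=30):
    for i, url in enumerate(urls):
        api.get(url, options={'callback': 'true', 'crawler': crawler_name})
        # Pause for a second after each batch of `per_second` pushes
        if (i + 1) % per_second == 0:
            time.sleep(1)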

Receiving data from the Crawler

After you push the URLs, the Crawler crawls the page associated with each URL and pushes the response, with the crawled HTML as the body, to your webhook.

Headers:
"Content-Type" => "text/plain"
"Content-Encoding" => "gzip"
"Original-Status" => 200
"PC-Status" => 200
"rid" => "The RID you received in the push call"
"url" => "The URL which was crawled"

Body:
The HTML of the page

The default format of the response is HTML. If you want to receive the response in JSON format, pass the query parameter “format=json” to the Crawling API when pushing URLs to the Crawler. The JSON response will look like this:

Headers:
"Content-Type" => "gzip/json"
"Content-Encoding" => "gzip"

Body:
{
  "pc_status": 200,
  "original_status": 200,
  "rid": "The RID you received in the push call",
  "url": "The URL which was crawled",
  "body": "The HTML of the page"
}
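With the Crawlbase Python library used earlier, this simply means adding the parameter to the options dictionary when pushing, alongside the callback and crawler options (a sketch, assuming the library forwards these options as query parameters, as it does in the push example above):

# Ask the Crawler to deliver the result to the webhook as JSON instead of raw HTML
response = api.get(url, options={
    'callback': 'true',
    'crawler': 'test-crawler',
    'format': 'json',
})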

Since we only pushed 3 URLs to the Crawler in the previous example, we received 3 requests from the Crawler on our webhook.

Receive data from Crawler

Since the webhook_receiver function saves the request body to a .txt file, we can see all the HTML content in that file, like this:

Scraped HTML data

Once you get the HTML at your webhook, you can scrape anything from it, depending on your needs.

Important Note: You can update the Webhook URL for your Crawler at any time through your Crawlbase dashboard. If the Crawler sends a response to your webhook but your server doesn’t return a successful response, the Crawler will automatically retry crawling the page and reattempt delivery. These retries are counted as successful requests and will be charged. Additionally, if your webhook goes down, the Crawlbase Monitoring Bot will detect it and pause the Crawler. The Crawler will resume once the webhook is back online. For any changes to these settings, you can contact Crawlbase technical support.

For a more comprehensive understanding, refer to Crawlbase Crawler documentation.

Enhanced Callback Functionality with Custom Headers

In addition to the standard callback mechanism, Crawlbase provides an optional feature that allows you to receive custom headers through the “callback_headers” parameter. This enhancement empowers you to pass additional data for identification purposes, facilitating a more personalized and efficient integration with your systems.

Custom Header Format:

The format for custom headers is as follows:

HEADER-NAME:VALUE|HEADER-NAME2:VALUE2|and-so-on

It’s crucial to ensure proper encoding for seamless data transfer and interpretation.

Usage Example

For these header and value pairs: { "id": 123, "type": "etc" }

&callback_headers=id%3A123%7Ctype%3Aetc
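In Python, this encoding can be produced with the standard library; the sketch below uses the example id and type pair shown above:

from urllib.parse import quote

# Build the custom headers string and URL-encode it for the callback_headers parameter
custom_headers = 'id:123|type:etc'
print(quote(custom_headers, safe=''))  # id%3A123%7Ctype%3Aetc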

Receiving Custom Headers

The Crawler will send all the custom headers in the header section of the response, so you can easily access them along with your crawled data.

Headers:
"Content-Type" => "gzip/json"
"Content-Encoding" => "gzip"
"id" => 123
"type" => "etc"
Body:
{
  "pc_status": 200,
  "original_status": 200,
  "rid": "The RID you received in the push call",
  "url": "The URL which was crawled",
  "body": "The HTML of the page"
}
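Inside the Django view created earlier, these values are available on the request object; a small sketch (the id and type names are just the example headers above):

# Inside webhook_receiver, read the custom headers the Crawler sends back
custom_id = request.headers.get('id')      # e.g. "123"
custom_type = request.headers.get('type')  # e.g. "etc"
rid = request.headers.get('rid')           # the RID returned by the original push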

With this upgrade, you have greater flexibility and control over the information you receive through callbacks. By leveraging custom headers, you can tailor the callback data to your specific requirements, making it easier than ever to align the service with your unique needs.

Conclusion

Crawlbase Crawler provides a robust and efficient solution for web crawling and data scraping. With its powerful asynchronous capabilities, Crawlbase helps businesses quickly collect large amounts of data, receive real-time updates, and manage the data extraction process smoothly. Crawlbase Crawler is a popular tool for businesses that need to scrape large amounts of data, helping them stay ahead in today’s fast-moving digital world.

That said, while Crawlbase Crawler is a powerful tool, it’s essential to use it responsibly. Always ensure you comply with website terms of service, follow ethical scraping practices, and respect the guidelines of responsible data extraction. By doing so, we can all contribute to a healthy and sustainable web ecosystem. Let’s make the most of the web—responsibly and effectively.

Frequently Asked Questions

Q: What are the benefits of using the Crawlbase Crawler?

  1. Efficiency: The Crawler’s asynchronous capabilities allow for faster data extraction from websites, saving valuable time and resources.
  2. Ease of Use: With its user-friendly design, the Crawler simplifies the process of pushing URLs and receiving crawled data through webhooks.
  3. Scalability: The Crawler can efficiently handle large volumes of data, making it ideal for scraping extensive websites and dealing with substantial datasets.
  4. Real-time Updates: By setting the scroll time variable, you can control when the Crawler sends back the scraped website, providing real-time access to the most recent data.
  5. Data-Driven Decision Making: The Crawler empowers users with valuable insights from web data, aiding in data-driven decision-making and competitive advantage.

Q: How does Crawlbase Crawler make web scraping asynchronous?

Crawlbase Crawler makes web scraping asynchronous by allowing users to push URLs to the Crawler and continue working without waiting for the scraping process to finish. When you submit URLs, the Crawler adds them to a queue and processes them in the background. It returns a Request ID (rid) instead of the scraped data, so you can track the progress while the Crawler works. Once the data is ready, it is automatically pushed to your specified webhook, allowing you to receive the results without needing to wait for the scraping to complete. This asynchronous approach speeds up the process and improves efficiency.

Q: Do I need to use Python to use the Crawlbase Crawler?

No, you do not need to use Python exclusively to use the Crawlbase Crawler. The Crawler provides multiple libraries for various programming languages, enabling users to interact with it using their preferred language. Whether you are comfortable with Python, JavaScript, Java, Ruby, or other programming languages, Crawlbase has you covered. Additionally, Crawlbase offers APIs that allow users to access the Crawler’s capabilities without relying on specific libraries, making it accessible to a wide range of developers with different language preferences and technical backgrounds. This flexibility ensures that you can seamlessly integrate the Crawler into your projects and workflows using the language that best suits your needs.