Businesses that want to stay ahead and make smarter decisions depend on web data more than ever. Crawlbase makes this easy with powerful tools for web scraping. One of its best products, the Crawlbase Crawler, helps you collect data asynchronously without waiting for the response. You can send URLs to it using the Crawlbase Crawling API, and instead of waiting or constantly checking for results, the Crawler automatically sends the scraped data to your server using a webhook—all in real-time. This means faster data collection with less effort.

In this blog, we’ll take a closer look at the Crawlbase Crawler and how its asynchronous processing and webhook integration make large-scale web scraping smooth and hassle-free. By the end of this blog, you’ll understand how to set up and use Crawlbase Crawler effectively.

Creating the Crawlbase Crawler

To use the Crawler, you must first create it from your Crawlbase account dashboard. Depending on your needs, you can create one of two types of Crawler: TCP or JavaScript. Use the TCP Crawler to crawl static pages. Use the JS Crawler when the content you need is generated via JavaScript, either because the page is built with a JavaScript framework (React, Angular, etc.) or because the content is rendered dynamically in the browser.

For this example, we will create a TCP Crawler from the dashboard.

Image showing Create Crawler Page without options

To create a Crawler, you need a destination for its output: either your own webhook or the Crawlbase Storage API. If you would rather not build your own webhook, Crawlbase offers a seamless alternative through its Storage API, which stores the data your Crawler generates securely.

Crawlbase Storage API option

By setting up your Crawler to use the Storage webhook endpoint, you can securely store your crawled data with added privacy and control—without worrying about storage limits. To do this, simply select the Crawlbase Storage option when creating your Crawler.

If you prefer not to use Crawlbase Storage, you can specify your own webhook endpoint to receive the data directly. The steps below explain how to create a webhook that meets Crawlbase Crawler’s requirements using the Python Django framework.

1. Creating a Webhook

A webhook is an HTTP-based callback mechanism that allows one system to send real-time data to another when a specific event occurs. In the case of Crawlbase Crawler, the webhook should:

  1. Be publicly reachable from Crawlbase servers
  2. Be ready to receive HTTP POST calls
  3. Respond within 200ms with a status code of 200, 201, or 204 and no content in the body

Let’s create a simple webhook receiver using the Python Django framework. Make sure you have Python and Django installed, then follow these steps:

STEP 1

Create a new Django project and app using the following commands:

# Command to create the Project:

django-admin startproject webhook_project

# Go to webhook_project directory using terminal:

cd webhook_project

# Create webhook_app:

python manage.py startapp webhook_app

STEP 2

In the webhook_app directory, open the views.py file and define a view to receive the webhook data:

# webhook_app/views.py

from django.views.decorators.csrf import csrf_exempt
from django.http import HttpResponse
import gzip


def save_data_to_file(data):
    # Append the received data to a file named 'webhook_data.txt'
    with open('webhook_data.txt', 'a') as file:
        file.write(str(data) + '\n')


@csrf_exempt
def webhook_receiver(request):
    if request.method == 'POST':
        try:
            # The Crawler delivers the body gzip-compressed
            decompressed_data = gzip.decompress(request.body)
        except OSError:
            return HttpResponse('Error: Unable to decompress data', status=400)

        # Convert the decompressed bytes to a string (or process them further)
        data_string = decompressed_data.decode('latin1')

        # Save the data to the file
        save_data_to_file(data_string)

        # Return 204 (no content) to the Crawler
        return HttpResponse(status=204)

    # Reject anything other than POST
    return HttpResponse(status=405)

The webhook_receiver function is decorated with @csrf_exempt to allow external services to send data without CSRF protection. It attempts to decompress Gzip-encoded data from the request body and, if successful, decodes it (assumed to be HTML) into a string. The data is then appended to a file named webhook_data.txt.

While this example simplifies things by saving the scraped HTML to a single file, in practice, you can extract and process any type of data from the HTML received via the webhook as needed.
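For example, if you wanted to pull a product title out of the received HTML, a parser such as BeautifulSoup works well. The sketch below assumes beautifulsoup4 is installed and that the page contains an element with id "productTitle" (an assumption about Amazon's product-page markup, not something Crawlbase guarantees):

# A minimal parsing sketch (assumes: pip install beautifulsoup4)
from bs4 import BeautifulSoup

def extract_title(html):
    soup = BeautifulSoup(html, 'html.parser')
    # 'productTitle' is an assumption about the Amazon product page markup
    title_tag = soup.find(id='productTitle')
    return title_tag.get_text(strip=True) if title_tag else 'Title not found'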

STEP 3

Configure URL routing. In the webhook_project directory, edit the urls.py file to add a URL pattern for the webhook receiver:

# webhook_project/urls.py

from django.contrib import admin
from django.urls import path
from webhook_app.views import webhook_receiver

urlpatterns = [
    path('admin/', admin.site.urls),
    path('webhook/crawlbase/', webhook_receiver, name='webhook_receiver'),
]

STEP 4

Start the Django development server to test the webhook receiver:

# Command to start the development server
# Note: On Linux/macOS you may need to use python3 instead of python
python manage.py runserver

The app will start running on localhost port 8000.

Start Django development server
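Before exposing the endpoint, you can sanity-check it locally by POSTing a gzip-compressed body, the same way the Crawler will. The sketch below assumes the requests package is installed; the payload is just a stand-in for real crawled HTML:

# Local smoke test for the webhook (assumes: pip install requests)
import gzip
import requests

# Stand-in for the gzip-compressed HTML the Crawler would send
payload = gzip.compress(b'<html><body>Test page</body></html>')

response = requests.post(
    'http://localhost:8000/webhook/crawlbase/',
    data=payload,
    headers={'Content-Type': 'text/plain', 'Content-Encoding': 'gzip'},
)
print(response.status_code)  # Expect 204 if the view decompressed and saved the body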

After creating the webhook, the next step is to make it publicly available on the internet.

For this example, we are using ngrok. Since our webhook is running on localhost port 8000, we need to run ngrok against port 8000.
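With ngrok installed, exposing the local server takes a single command:

# Start an ngrok tunnel that forwards a public URL to local port 8000
ngrok http 8000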

ngrok console

After running ngrok on port 8000, it provides a public forwarding URL that we can use when creating the Crawler. With the free version of ngrok, this link expires after 2 hours.

Creating Crawlbase Crawler with Webhook

Now, let’s create a Crawler from the dashboard.

Create new Crawler from Crawlbase Crawler dashboard

Start by giving your Crawler a unique name, like “test-crawler” in our case, and specify your webhook URL in the callback option. In this example, the webhook URL is the public ngrok forwarding URL followed by the webhook route (/webhook/crawlbase/).

Pushing URLs to the Crawler

Now that you’ve created the “test-crawler”, the next step is to push the URLs you want it to crawl. To do this, you’ll need to use the Crawlbase Crawling API, along with two additional parameters: crawler=YourCrawlerName and callback=true. By default, you can push up to 30 URLs per second to the Crawler. If you need to increase this limit, you can request a change by contacting Crawlbase customer support.

Here’s an example in Python that uses the Crawlbase Python library to push URLs to the Crawler.

# To install the crawlbase library:
# pip install crawlbase

# Importing CrawlingAPI
from crawlbase import CrawlingAPI

# Initializing CrawlingAPI with your TCP token
api = CrawlingAPI({'token': 'YOUR_CRAWLBASE_TOKEN'})

# Using random Amazon URLs for the example
urls = [
    'https://www.amazon.com/AIRLITE-Microphone-Licensed-Microsoft-Accessories-x/dp/B08JR8HF2G',
    'https://www.amazon.com/Cabinet-Stainless-Kitchen-Drawer-Handles/dp/B07SPXKNXN',
    'https://www.amazon.com/Mkono-Decorative-Decoration-Organizer-Farmhouse/dp/B08292QMQR',
]

for url in urls:
    # Asynchronously sending the crawling request with the crawler and callback options
    response = api.get(url, options={'callback': 'true', 'crawler': 'test-crawler'})

    # Printing the content of the response body (contains the RID)
    print(response['body'])

After running the code, the Crawling API will push all URLs to the Crawler queue.

Example Output:

b'{"rid":"d756c32b0999b1c0507e364f"}'
b'{"rid":"455ee207f6907fbd6168ac1e"}'
b'{"rid":"e9eb6ce579dec207e8973615"}'

For every URL you push to the Crawler using the Crawling API, you’ll receive an RID (Request ID). You can use this RID to track your request. Once the Crawler processes the HTML data, it will automatically be pushed to the webhook you specified when creating the Crawler, keeping the process asynchronous.

The Crawler offers APIs that allow you to perform different actions, such as Find, Delete, Pause, Unpause, etc. You can learn more about them here.

Note: The total number of pages in all Crawler waiting queues is capped at 1 million. If the combined queues exceed this limit, your Crawler push will temporarily pause, and you will be notified via email. The push will automatically resume once the number of pages in the queue drops below 1 million.
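Given the default push rate of 30 URLs per second and the queue cap above, large batches benefit from a simple client-side throttle. Below is a minimal sketch that reuses the api object and option names from the earlier push example; adjust per_second if Crawlbase support raises your limit:

import time

# Push URLs in batches, staying under the default limit of 30 URLs per second
def push_urls(api, urls, crawler_name, per_second=30):
    for i, url in enumerate(urls):
        api.get(url, options={'callback': 'true', 'crawler': crawler_name})
        # Pause for a second after each batch of `per_second` pushes
        if (i + 1) % per_second == 0:
            time.sleep(1)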

Receiving data from the Crawler

After you push the URLs, the Crawler crawls the page associated with each URL and pushes the response, with the crawled HTML as the body, to your webhook.

Headers:
"Content-Type" => "text/plain"
"Content-Encoding" => "gzip"
"Original-Status" => 200
"PC-Status" => 200
"rid" => "The RID you received in the push call"
"url" => "The URL which was crawled"

Body:
The HTML of the page

The default format of the response is HTML. If you want to receive the response in JSON format, pass the query parameter “format=json” to the Crawling API when pushing URLs to the Crawler. The JSON response will look like this:

Headers:
"Content-Type" => "gzip/json"
"Content-Encoding" => "gzip"

Body:
{
  "pc_status": 200,
  "original_status": 200,
  "rid": "The RID you received in the push call",
  "url": "The URL which was crawled",
  "body": "The HTML of the page"
}
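With the Crawlbase Python library used earlier, this simply means adding the parameter to the options dictionary when pushing, alongside the callback and crawler options (a sketch, assuming the library forwards these options as query parameters, as it does in the push example above):

# Ask the Crawler to deliver the result to the webhook as JSON instead of raw HTML
response = api.get(url, options={
    'callback': 'true',
    'crawler': 'test-crawler',
    'format': 'json',
})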

Since we only pushed 3 URLs to the Crawler in the previous example, we received 3 requests from the Crawler on our webhook.

Receive data from Crawler

Since the webhook_receiver function saves the request body to a .txt file, we can see all the HTML content in that file, like this:

Scraped HTML data

Once you get the HTML at your webhook, you can scrape anything from it, depending on your needs.

Important Note: You can update the Webhook URL for your Crawler at any time through your Crawlbase dashboard. If the Crawler sends a response to your webhook but your server doesn’t return a successful response, the Crawler will automatically retry crawling the page and reattempt delivery. These retries are counted as successful requests and will be charged. Additionally, if your webhook goes down, the Crawlbase Monitoring Bot will detect it and pause the Crawler. The Crawler will resume once the webhook is back online. For any changes to these settings, you can contact Crawlbase technical support.

For a more comprehensive understanding, refer to Crawlbase Crawler documentation.

Enhanced Callback Functionality with Custom Headers

In addition to the standard callback mechanism, Crawlbase provides an optional feature that allows you to receive custom headers through the “callback_headers” parameter. This enhancement empowers you to pass additional data for identification purposes, facilitating a more personalized and efficient integration with your systems.

Custom Header Format:

The format for custom headers is as follows:

HEADER-NAME:VALUE|HEADER-NAME2:VALUE2|and-so-on

It’s crucial to ensure proper encoding for seamless data transfer and interpretation.

Usage Example

For these header and value pairs: { "id": 123, "type": "etc" }

&callback_headers=id%3A123%7Ctype%3Aetc
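In Python, this encoding can be produced with the standard library; the sketch below uses the example id and type pair shown above:

from urllib.parse import quote

# Build the custom headers string and URL-encode it for the callback_headers parameter
custom_headers = 'id:123|type:etc'
print(quote(custom_headers, safe=''))  # id%3A123%7Ctype%3Aetc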

Receiving Custom Headers

The Crawler will send all the custom headers in the header section of the response, so you can easily access them along with your crawled data.

Headers:
"Content-Type" => "gzip/json"
"Content-Encoding" => "gzip"
"id" => 123
"type" => "etc"
Body:
{
  "pc_status": 200,
  "original_status": 200,
  "rid": "The RID you received in the push call",
  "url": "The URL which was crawled",
  "body": "The HTML of the page"
}
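Inside the Django view created earlier, these values are available on the request object; a small sketch (the id and type names are just the example headers above):

# Inside webhook_receiver, read the custom headers the Crawler sends back
custom_id = request.headers.get('id')      # e.g. "123"
custom_type = request.headers.get('type')  # e.g. "etc"
rid = request.headers.get('rid')           # the RID returned by the original push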

With this upgrade, you have greater flexibility and control over the information you receive through callbacks. By leveraging custom headers, you can tailor the callback data to your specific requirements, making it easier than ever to align the service with your unique needs.

Conclusion

Crawlbase Crawler provides a robust and efficient solution for web crawling and data scraping. With its powerful asynchronous capabilities, Crawlbase helps businesses quickly collect large amounts of data, receive real-time updates, and manage the data extraction process smoothly. Crawlbase Crawler is a popular tool for businesses that need to scrape large amounts of data, helping them stay ahead in today’s fast-moving digital world.

That said, while Crawlbase Crawler is a powerful tool, it’s essential to use it responsibly. Always ensure you comply with website terms of service, follow ethical scraping practices, and respect the guidelines of responsible data extraction. By doing so, we can all contribute to a healthy and sustainable web ecosystem. Let’s make the most of the web—responsibly and effectively.

Frequently Asked Questions

Q: What are the benefits of using the Crawlbase Crawler?

  1. Efficiency: The Crawler’s asynchronous capabilities allow for faster data extraction from websites, saving valuable time and resources.
  2. Ease of Use: With its user-friendly design, the Crawler simplifies the process of pushing URLs and receiving crawled data through webhooks.
  3. Scalability: The Crawler can efficiently handle large volumes of data, making it ideal for scraping extensive websites and dealing with substantial datasets.
  4. Real-time Updates: By setting the scroll time variable, you can control when the Crawler sends back the scraped website, providing real-time access to the most recent data.
  5. Data-Driven Decision Making: The Crawler empowers users with valuable insights from web data, aiding in data-driven decision-making and competitive advantage.

Q: How does Crawlbase Crawler make web scraping asynchronous?

Crawlbase Crawler makes web scraping asynchronous by allowing users to push URLs to the Crawler and continue working without waiting for the scraping process to finish. When you submit URLs, the Crawler adds them to a queue and processes them in the background. It returns a Request ID (rid) instead of the scraped data, so you can track the progress while the Crawler works. Once the data is ready, it is automatically pushed to your specified webhook, allowing you to receive the results without needing to wait for the scraping to complete. This asynchronous approach speeds up the process and improves efficiency.

Q: Do I need to use Python to use the Crawlbase Crawler?

No, you do not need to use Python exclusively to use the Crawlbase Crawler. The Crawler provides multiple libraries for various programming languages, enabling users to interact with it using their preferred language. Whether you are comfortable with Python, JavaScript, Java, Ruby, or other programming languages, Crawlbase has you covered. Additionally, Crawlbase offers APIs that allow users to access the Crawler’s capabilities without relying on specific libraries, making it accessible to a wide range of developers with different language preferences and technical backgrounds. This flexibility ensures that you can seamlessly integrate the Crawler into your projects and workflows using the language that best suits your needs.