Most scraping tutorials show you how to fetch one page and parse it on the spot. That synchronous loop works fine until you need thousands of pages, at which point your script spends its life waiting: submit a URL, block until the response comes back, parse, repeat. Retries, queues, proxy rotation, and rendering all pile onto the same thread, and one slow target stalls the whole run. At scale you want a different shape entirely.

This guide shows you how to extract data using the Crawlbase Crawler, the asynchronous, push-based product built for bulk jobs. Instead of waiting on each request, you push a batch of URLs to the Crawler and it crawls them at scale on its own infrastructure, then delivers each finished result to a webhook endpoint you control. Submission and retrieval are decoupled, so your code is never blocked waiting on a page. By the end you will have a working callback server, a named crawler, and a script that pushes URLs and receives parsed data on the other side.

Synchronous vs asynchronous: pick the right tool

Crawlbase gives you two ways to fetch a page, and the difference is about timing, not capability. The Crawling API is synchronous: you send a request, you wait, the rendered HTML comes back in the same response. It is perfect when you need one page right now and want the result inline.

The Crawler is the asynchronous layer built on top of that same engine. You push a URL and get back an immediate acknowledgment with a request id, nothing more. The actual crawling happens in the background on Crawlbase servers, and when a page is ready the result is POSTed to your callback URL. You are never holding a connection open, so you can fire off thousands of URLs in seconds and let the results stream back to your endpoint as they finish.

When to reach for the Crawler

Use the Crawling API for interactive, one-off fetches where you want the page inline. Reach for the Crawler when you are crawling in bulk: large lists, recurring jobs, or anything where blocking on each request would cripple throughput. The Crawler absorbs queues, retries, proxy rotation, and JavaScript rendering for you, and hands back finished data through your webhook.

How the push model works

The flow has three moving parts and it helps to hold all three in your head before writing code.

First, a crawler: a named configuration you create once in the dashboard. It ties a callback URL to a request type (plain or JavaScript) so the engine knows where to deliver results and how to render. Second, the push request: you call the API with your token, the target URL, and your crawler name, and it returns a JSON acknowledgment containing a unique request id (the RID). Third, the callback: when the page is crawled, Crawlbase sends an HTTP POST to your callback URL with the page content and the same RID, so you can match each delivery back to the URL you submitted.
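
To make the second part concrete, here is a minimal sketch of what a push request could look like on the wire. It assumes the API lives at https://api.crawlbase.com/ and takes token, url, callback, and crawler as query parameters, mirroring the options passed to the client later in this guide. build_push_url is a hypothetical helper, and nothing is sent over the network.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed base endpoint for the Crawlbase API; confirm it against the docs.
API_BASE = "https://api.crawlbase.com/"

def build_push_url(token: str, target_url: str, crawler: str) -> str:
    """Build the URL for an asynchronous push (hypothetical helper)."""
    params = {
        "token": token,
        "url": target_url,    # urlencode percent-encodes the target URL
        "callback": "true",   # deliver to the webhook instead of inline
        "crawler": crawler,   # name of the crawler created in the dashboard
    }
    return f"{API_BASE}?{urlencode(params)}"

push_url = build_push_url(
    "YOUR_CRAWLBASE_TOKEN", "https://books.toscrape.com/", "my-crawler"
)
print(push_url)

# Round-trip check: the target URL survives encoding intact.
print(parse_qs(urlsplit(push_url).query)["url"][0])
```

The target URL has to be percent-encoded because it travels inside another URL's query string; urlencode takes care of that here.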

Your callback endpoint has to meet two conditions. It must be publicly reachable by Crawlbase servers, and it must answer fast: respond to the POST within about 200 milliseconds with a 200, 201, or 204 status. The content arrives GZIP-compressed and, by default, as HTML; you can ask for parsed JSON instead by setting the format on the request. Because the work is asynchronous, your job on the receiving side is simply to acknowledge quickly and hand the payload off to a queue or a database, not to do heavy processing inline.
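
As an illustration of the receiving side, the sketch below decodes a delivery body using only the standard library. It decompresses the bytes only when they still start with the gzip magic number, so it is safe whether or not something upstream already unpacked them. decode_delivery and its fmt argument are hypothetical names, and the sample payload is generated locally rather than taken from a real delivery.

```python
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def decode_delivery(raw: bytes, fmt: str = "html"):
    """Return the delivered content as text, or as a parsed object for JSON."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    text = raw.decode("utf-8", errors="replace")
    return json.loads(text) if fmt == "json" else text

# Simulate a compressed delivery locally; no network involved.
sample = gzip.compress(b"<html><title>A Light in the Attic</title></html>")
html = decode_delivery(sample)
print(html)
```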

What you will build

A complete round trip in Python. You will stand up a small Flask webhook that receives crawled pages, expose it to the internet so Crawlbase can reach it, create a named crawler in the dashboard pointed at that public URL, and finally push target URLs using the official crawlbase client. We will use public test pages so you can run every step before pointing it at real targets.

Set up the environment

You need Python 3.8 or later. Confirm your version, create a virtual environment so the dependencies stay isolated, then install the two libraries: Flask for the webhook server and the official Crawlbase client for pushing requests.

bash
python --version

python -m venv crawler_env
source crawler_env/bin/activate

pip install flask crawlbase

On Windows, activate the environment with crawler_env\Scripts\activate instead of the source line. You will also need your Crawlbase token from the dashboard. Crawlbase offers two token types: the normal token for plain HTTP fetches, and the JavaScript token for pages that build their content client-side and therefore have to be loaded in a real browser. Pick the one that matches the sites you are targeting; pages that only show their data after JavaScript runs need the JavaScript token.

Step 1: Build the webhook that receives crawled data

The callback endpoint is where finished pages land. Create a file called webhook.py. The handler reads the POST body, logs the RID so you can correlate it with the push response, and returns 200 immediately. Flask does not decompress request bodies on its own, so the handler undoes the GZIP encoding itself before treating the content as text.

python
import gzip

from flask import Flask, request

app = Flask(__name__)

@app.route("/crawlbase", methods=["POST"])
def webhook():
    rid = request.headers.get("rid")
    original_url = request.headers.get("original_url")

    # Read the raw bytes and decompress them if they are still gzip-encoded
    # (gzip streams always start with the bytes 0x1f 0x8b).
    raw = request.get_data()
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    body = raw.decode("utf-8", errors="replace")

    print(f"Received RID {rid} for {original_url}")
    print(f"Payload size: {len(body)} characters")

    # Hand the payload to a queue or database here; keep this fast.
    return "", 200

if __name__ == "__main__":
    app.run(port=3000)

A few details matter here. Crawlbase sends the request id in a rid header and the crawled URL in original_url, so you never have to guess which submission a delivery belongs to. The handler does no heavy work: it acknowledges and returns. The two-hundred-millisecond response window is strict, so anything slow (parsing, writing to a slow store, calling another service) belongs on a background queue, not inside the request. Launch the server and leave it running in its own terminal.
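
One way to keep slow work out of the request is an in-process queue drained by a background thread. The sketch below shows that hand-off with the standard library only; acknowledge, worker, and the processed list are hypothetical stand-ins for your handler, your processing code, and your data store. An in-memory queue loses its contents if the process restarts, so treat this as a shape to adapt rather than a production design.

```python
import queue
import threading

deliveries = queue.Queue()   # in-process buffer between handler and worker
processed = []               # stands in for a database or warehouse

def acknowledge(rid: str, body: str) -> int:
    """What the handler does: enqueue and return a status code right away."""
    deliveries.put((rid, body))
    return 200

def worker() -> None:
    """Drain the queue in the background; slow work lives here."""
    while True:
        item = deliveries.get()
        if item is None:                     # sentinel: stop the worker
            deliveries.task_done()
            break
        rid, body = item
        processed.append((rid, len(body)))   # placeholder for parsing/storing
        deliveries.task_done()

thread = threading.Thread(target=worker, daemon=True)
thread.start()

status = acknowledge("e2bbac4e7ea9a4c4be57d2a4", "<html>...</html>")

# Shut down cleanly for the demo; a real handler never waits on the worker.
deliveries.put(None)
deliveries.join()
thread.join()
print(status, processed)
```

In the Flask handler, the call to acknowledge would sit where the hand-off comment is, and the worker thread would start once at application startup.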

bash
python webhook.py

Step 2: Expose the local server to the internet

Crawlbase servers must be able to reach your callback, and a server on localhost is not reachable from the outside. During development the simplest fix is a tunneling tool like ngrok, which gives your local port a public HTTPS URL. With the webhook still running on port 3000, open a second terminal and start the tunnel.

bash
ngrok http 3000

ngrok prints a public forwarding URL, something like https://random-id.ngrok-free.app. Your full callback URL is that host plus the route from the Flask app, so https://random-id.ngrok-free.app/crawlbase. Keep this terminal open too; the URL changes each time you restart the tunnel. In production you would point the crawler at a real, stable endpoint on your own infrastructure instead.

Production note

Tunneling is a development convenience, not a deployment strategy. For real workloads, host the webhook on a service with a stable public URL and verify each incoming request before trusting it, for example by checking that the RID matches one you actually pushed. Treat the callback as an untrusted public endpoint, because that is what it is.

Step 3: Create a crawler in the dashboard

A push request needs a named crawler so the engine knows where to deliver results. In your Crawlbase dashboard, go to the Crawler section and create a new crawler. You give it a unique name, paste your public callback URL (the ngrok URL plus /crawlbase), and choose the request type: normal for plain HTML or JavaScript for client-rendered pages. Save it, and the crawler shows up in your list, ready to receive pushes.

The name you choose is the value you pass on every push request, so keep it simple and memorable. A common pattern is one crawler per project or per data source, each pointed at a route your server can distinguish.

Step 4: Push URLs to the Crawler

With the webhook live, the tunnel open, and a crawler created, you are ready to push. The official crawlbase client wraps the API, so a push is a single get call with two required options, callback set to true and crawler set to the name you registered, plus an optional format if you want JSON delivered instead of HTML. Create a file called push.py.

python
from crawlbase import CrawlingAPI

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

target = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"

response = api.get(target, {
    "callback": "true",
    "crawler": "my-crawler",
    "format": "json",
})

print(response["body"])

Run it with python push.py. The response is not the page content. It is an immediate acknowledgment with the request id, which is exactly what asynchronous means: the call returns before the crawl finishes. You get back something like this.

json
{ "rid": "e2bbac4e7ea9a4c4be57d2a4" }

Shortly afterward, the crawled page arrives at your webhook; how long that takes depends on the target site and on how much work is already queued. Check the terminal running webhook.py and you will see the same RID printed, confirming the round trip closed: the engine crawled the page in the background and POSTed the finished result to your callback. Setting format to json on the push means the delivered payload is parsed JSON rather than raw HTML, which is usually what you want for downstream processing.
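
To correlate deliveries later, you will want the RID as a plain string rather than a printed blob. The sketch below pulls it out of an acknowledgment body shaped like the example above. extract_rid is a hypothetical helper, and it accepts either bytes or text because the exact type of the client's body value is assumed here, not verified.

```python
import json

def extract_rid(ack_body) -> str:
    """Pull the request id out of a push acknowledgment body (bytes or str)."""
    if isinstance(ack_body, (bytes, bytearray)):
        ack_body = ack_body.decode("utf-8")
    data = json.loads(ack_body)
    rid = data.get("rid")
    if not rid:
        # No rid usually means the push was rejected; surface it loudly.
        raise ValueError(f"No rid in acknowledgment: {data!r}")
    return rid

rid = extract_rid(b'{ "rid": "e2bbac4e7ea9a4c4be57d2a4" }')
print(rid)
```

In push.py this would be called as extract_rid(response["body"]), with the result stored next to the URL it belongs to.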

Pushing in bulk

One URL proves the wiring; the point of the Crawler is volume. Pushing a list is just a loop, and because each call returns immediately you can submit a large batch in seconds without waiting on any single crawl. The Crawler has a generous push queue, so you keep feeding it and let results arrive at the webhook on their own schedule.

python
from crawlbase import CrawlingAPI

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

urls = [
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
    "https://books.toscrape.com/catalogue/soumission_998/index.html",
]

for url in urls:
    response = api.get(url, {
        "callback": "true",
        "crawler": "my-crawler",
        "format": "json",
    })
    print(f"Pushed {url} -> {response['body']}")

Each iteration returns its own RID, and your webhook receives a separate POST per URL as each crawl completes. Store the RID list on the push side and reconcile it against the deliveries on the callback side so you can spot anything that never came back and re-push it. That reconciliation loop is the backbone of a reliable bulk pipeline, and it slots naturally into a larger scalable web data pipeline.
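
The reconciliation step can be sketched as a comparison between what was pushed and what has been delivered. The version below assumes you keep a mapping from RID to the URL and push time, plus a set of RIDs seen by the webhook. find_missing, the rid-1 style ids, and the 600-second cutoff are all illustrative choices, not values from Crawlbase.

```python
import time

def find_missing(pushed: dict, delivered: set, now: float,
                 max_age: float = 600.0) -> list:
    """Return URLs pushed more than max_age seconds ago with no delivery yet."""
    return [
        url
        for rid, (url, pushed_at) in pushed.items()
        if rid not in delivered and now - pushed_at > max_age
    ]

now = time.time()
base = "https://books.toscrape.com/catalogue"

# rid -> (url, time of push); the ids are made up for the demo.
pushed = {
    "rid-1": (f"{base}/a-light-in-the-attic_1000/index.html", now - 900),
    "rid-2": (f"{base}/tipping-the-velvet_999/index.html", now - 900),
    "rid-3": (f"{base}/soumission_998/index.html", now - 30),
}
delivered = {"rid-1"}   # RIDs the webhook has recorded so far

to_repush = find_missing(pushed, delivered, now)
print(to_repush)
```

Only the second URL is flagged: the first was delivered, and the third was pushed too recently to count as lost. The age cutoff keeps you from re-pushing pages that are simply still in the queue.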

Validating and using the harvested data

Receiving data is not the same as trusting it. Before the payload reaches your warehouse, validate it on the callback side: confirm the RID matches one you pushed, check the status code Crawlbase reports for the crawl, and verify the fields you expect are present and non-empty. A page can come back successfully at the HTTP level while a redesign or a soft block leaves the content you care about missing, so a quick schema check catches silent gaps early.
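
A schema check of this kind can be very small. The sketch below assumes you have already assembled each delivery into a record with rid, status, and data keys; that record shape, the validate_record name, and the title and price fields are hypothetical and should be swapped for whatever your own parsing step produces.

```python
def validate_record(record: dict, pending_rids: set,
                    required_fields: tuple) -> list:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    if record.get("rid") not in pending_rids:
        problems.append("rid was never pushed")
    if record.get("status") != 200:
        problems.append(f"crawl status {record.get('status')}")
    data = record.get("data") or {}
    for field in required_fields:
        value = data.get(field)
        if value is None or str(value).strip() == "":
            problems.append(f"missing field: {field}")
    return problems

pending = {"e2bbac4e7ea9a4c4be57d2a4"}
good = {
    "rid": "e2bbac4e7ea9a4c4be57d2a4",
    "status": 200,
    "data": {"title": "A Light in the Attic", "price": "51.77"},
}
# Successful at the HTTP level, but the content we care about is gone.
soft_block = {
    "rid": "e2bbac4e7ea9a4c4be57d2a4",
    "status": 200,
    "data": {"title": "", "price": None},
}

good_problems = validate_record(good, pending, ("title", "price"))
soft_problems = validate_record(soft_block, pending, ("title", "price"))
print(good_problems)
print(soft_problems)
```

Returning a list of problems instead of a boolean makes it easy to log why a record was rejected and to decide per problem whether to re-push or drop it.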

Once validated, the harvested data feeds the usual business needs: price and inventory monitoring across competitors, lead and contact enrichment, market and sentiment research, training sets for models, or keeping an internal catalog in sync with external sources. Because the Crawler delivers results continuously rather than in one blocking batch, it fits naturally into streaming and incremental pipelines where fresh data lands as soon as each page is crawled. For more on getting clean results at volume, see how to scrape websites without getting blocked.

If you would rather route your own traffic instead of using the push model, the Smart AI Proxy gives you the same rotating residential IPs as a drop-in endpoint, and the Crawling API returns pre-parsed JSON for supported sites when you want structured fields without managing parsing yourself.

Recap

Key takeaways

  • Asynchronous by design. The Crawler decouples submission from retrieval: you push URLs and get an immediate RID, then results are POSTed to your webhook as each crawl finishes.
  • Three moving parts. A named crawler in the dashboard, a push request carrying your token and crawler name, and a callback endpoint that receives the data and the matching RID.
  • Answer fast. Your webhook must be publicly reachable and respond within about 200 milliseconds with a 2xx, so acknowledge and offload heavy work to a queue.
  • Built for bulk. Because each push returns instantly, you can submit thousands of URLs in seconds and let the engine handle queues, retries, proxies, and rendering.
  • Validate before you trust. Reconcile RIDs and check that expected fields are present so silent gaps from redesigns or soft blocks do not slip into your data.

Frequently Asked Questions (FAQs)

What is the difference between the Crawler and the Crawling API?

The Crawling API is synchronous: you send a request and the rendered page comes back in the same response, which is ideal for one-off, interactive fetches. The Crawler is the asynchronous layer on top of the same engine: you push a URL, get an immediate request id, and the finished page is delivered later to your webhook. Use the Crawling API for inline results and the Crawler for bulk jobs where blocking on each request would limit throughput.

Why does my webhook need to be publicly accessible?

Crawlbase servers deliver crawled pages by sending an HTTP POST to your callback URL, so they have to be able to reach it over the internet. A server on localhost is invisible from outside your machine, which is why you expose it during development with a tunneling tool like ngrok. In production you host the webhook on a service with a stable public URL.

What does the push request return?

It returns a small JSON acknowledgment containing a unique request id, the RID, not the page content. That is the asynchronous contract: the call returns immediately while the crawl runs in the background. The actual page arrives later at your webhook, carrying the same RID in its headers so you can match each delivery back to the URL you submitted.

How fast does my callback have to respond?

Within about 200 milliseconds, with a 200, 201, or 204 status code. Crawlbase expects a quick acknowledgment, so your handler should read the payload, hand it to a queue or database, and return. Anything slow, such as parsing or writing to a slow store, belongs on a background worker rather than inside the request.

Can I receive parsed JSON instead of raw HTML?

Yes. By default the Crawler delivers HTML, but you can set the format to JSON on the push request, and the payload arrives parsed. Pick whichever shape your downstream code prefers; JSON is usually easier to work with for structured extraction, while HTML is handy when you want to run your own parser over the full page.

How do I make a bulk crawl reliable?

Track the RID for every URL you push and reconcile that list against the deliveries that reach your webhook. Anything that never arrives can be re-pushed. On the receiving side, validate each payload by confirming the RID, checking the reported crawl status, and verifying that expected fields are present, so a successful-looking response with missing content does not slip through unnoticed.
