A web data pipeline is only as reliable as its collection layer. When dashboards go stale or numbers stop adding up, the cause is rarely the analytics code. More often it is the front of the pipeline: a scraper that broke after a site redesign, requests that started getting blocked, or pages that load in a browser but return an empty shell to a plain HTTP fetch. If acquisition is fragile, the whole pipeline inherits that fragility.
This guide shows you how to build a scalable web data pipeline with Crawlbase doing the collection and standard ETL tooling doing the rest. You will collect pages with the Crawling API for on-demand work and the async Crawler for high-volume jobs, transform and validate the raw HTML, load clean rows into storage, and schedule the whole thing with monitoring. Every step has runnable code you can adapt.
What a scalable web data pipeline looks like
The pattern is the classic ETL shape with one important separation of concerns. Crawlbase sits at the front as an ingestion layer and handles everything that makes scraping unstable: JavaScript rendering, IP rotation, request routing, and block mitigation. Your systems handle parsing, validation, storage, and analytics. The flow reads left to right:
Web -> Crawlbase (collect) -> Transform + Validate -> Storage -> BI / ML
The reason to draw the boundary here is durability. External websites are not stable dependencies; they ship layout changes, run experiments, and deploy anti-bot defenses without warning. By putting a managed collection layer in front, a site change becomes a configuration concern instead of a pipeline outage. Crawlbase gives you two collection tools for two workload shapes, and a production pipeline usually uses both.
- Crawling API for real-time, on-demand retrieval of known URLs. You send a URL, it returns the page.
- Async Crawler for large-scale, fire-and-forget collection. You push URLs, it fetches them asynchronously and POSTs results to your webhook.
This is the same separation any serious ecommerce web scraping operation ends up with: a fast path for targeted lookups and a bulk path for coverage. If you are new to the proxy mechanics underneath, reading up on what a proxy server is gives useful background, though the point of a managed API is that you do not have to manage any of it.
Step 1: Collect with the Crawling API
The Crawling API takes a URL plus your token and returns the rendered page. You send an HTTP GET; it routes the request through a rotating IP pool, optionally renders JavaScript when you pass a JS token, and hands back the HTML (or parsed JSON). The simplest possible call is a single curl:
curl 'https://api.crawlbase.com/?token=YOUR_TOKEN&url=https%3A%2F%2Fexample.com%2Fproducts'
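The target URL in the `url` parameter has to be percent-encoded, which is easy to get wrong by hand. A minimal sketch of building the same request URL with the standard library, assuming only the endpoint and the `token` and `url` parameters shown in the curl call; `build_request_url` is a hypothetical helper, not part of the Crawlbase client:

```python
from urllib.parse import urlencode

API_ENDPOINT = 'https://api.crawlbase.com/'  # endpoint taken from the curl example

def build_request_url(token, target_url, **params):
    """Return the full request URL with every value percent-encoded."""
    query = {'token': token, 'url': target_url, **params}
    return f'{API_ENDPOINT}?{urlencode(query)}'

print(build_request_url('YOUR_TOKEN', 'https://example.com/products'))
```

The printed URL matches the one in the curl command, so the helper can be checked against a request that is known to work.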
In a pipeline you want a small, reusable collector instead of raw curl. Install the official client and wrap the call so the rest of the pipeline gets clean HTML and never thinks about tokens or retries. Use the JS token for client-side-rendered pages and the normal token for static HTML.
python3 -m venv venv && source venv/bin/activate
pip install crawlbase
from crawlbase import CrawlingAPI

api = CrawlingAPI({'token': 'YOUR_TOKEN'})

def collect(url, render=False):
    """Fetch one URL through the Crawling API and return its HTML as text."""
    # Options become query-string parameters, so the flag is sent as 'true'
    options = {'ajax_wait': 'true', 'page_wait': 3000} if render else {}
    response = api.get(url, options)
    status = response['status_code']
    if status != 200:
        # Fail loudly so a bad fetch never reaches the transform step
        raise RuntimeError(f'collect failed for {url}: {status}')
    return response['body'].decode('utf-8')

html = collect('https://example.com/products', render=True)
print(len(html), 'characters')
Two details make this pipeline-grade rather than a toy. First, it checks status_code and raises on anything that is not a clean fetch, so a bad page surfaces loudly instead of poisoning your warehouse with empty rows. Second, the render flag keeps your call sites honest about which pages need JavaScript: only pay the render cost where the content actually demands it. This collector is the unit your scheduler will call for every known URL.
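The collector raises on a failed fetch but does not retry on its own. One way to add retries is a small wrapper with exponential backoff. This is a sketch under assumptions: `with_retries` and its parameters are hypothetical names, and the delays are illustrative rather than tuned for any particular site.

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception wait base_delay * 2**n, then try again."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Hypothetical call site, wrapping the collector defined earlier:
# html = with_retries(lambda: collect('https://example.com/products'))
```

Passing `sleep` as a parameter keeps the wrapper testable without real waiting.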
Crawlbase gives you two tokens. The normal token returns static HTML fast and cheap; the JS token renders the page in a real browser first, which you need for client-side-rendered sites. Reach for the JS token only when a page returns an empty shell to a plain fetch, and pair it with ajax_wait and page_wait so late-loading content has time to appear.
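Deciding when a page "returns an empty shell" can be automated with a rough heuristic: count the visible text and treat very small counts as a sign that the content is rendered client-side. The sketch below uses only the standard library; `looks_like_empty_shell` and the 200-character threshold are assumptions for illustration, not Crawlbase behavior.

```python
from html.parser import HTMLParser

class _TextCounter(HTMLParser):
    """Counts visible text characters, ignoring script and style content."""
    def __init__(self):
        super().__init__()
        self.chars = 0
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chars += len(data.strip())

def looks_like_empty_shell(html, min_text_chars=200):
    """Return True when the page has less visible text than the threshold."""
    counter = _TextCounter()
    counter.feed(html)
    counter.close()
    return counter.chars < min_text_chars
```

A call site could fetch with the normal token first and fall back to a rendered fetch only when this check returns True.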
Step 2: Scale volume with the async Crawler
The Crawling API is synchronous: one request, one response, and your code waits. That is exactly right for a few hundred known URLs. For tens of thousands, blocking on each call does not scale. The async Crawler flips the model. You push URLs into a named crawler, the request returns immediately with a Request ID, Crawlbase fetches the page in the background, and when it is done it POSTs the result to your callback endpoint. Nothing in your code blocks waiting for pages.
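A quick back-of-the-envelope calculation shows why sequential blocking stops working at volume. The 3 seconds per fetch used here is an assumed, illustrative figure, not a measured Crawlbase latency.

```python
def sequential_hours(url_count, seconds_per_fetch):
    """Wall-clock hours if each fetch waits for the previous one to finish."""
    return url_count * seconds_per_fetch / 3600

for count in (300, 50_000):
    print(f'{count:>6} URLs -> {sequential_hours(count, 3.0):.2f} h')
```

Under that assumption, 300 URLs finish in about 15 minutes, while 50,000 URLs take roughly 42 hours when fetched one after another.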
You opt into async mode by adding two parameters to the same endpoint: callback=true and crawler=YourCrawlerName (you create the crawler once in the dashboard and point it at your webhook URL). Pushing a URL looks like this:
curl 'https://api.crawlbase.com/?token=YOUR_TOKEN&callback=true&crawler=my-pipeline&url=https%3A%2F%2Fexample.com%2Fp%2F123'
Instead of the page body you get back a Request ID, which means the URL is queued:
{ "rid": "1e92e8bf4618772871c14d4" }
From your side, pushing a large batch is a tight loop. The point is throughput: you fire all the URLs without waiting for any of them to finish, and the queue absorbs the work.
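import json

from crawlbase import CrawlingAPI

api = CrawlingAPI({'token': 'YOUR_TOKEN'})

def push_batch(urls):
    """Queue each URL on the async Crawler and print its Request ID."""
    # Same parameters as the curl call above, sent as query-string values
    options = {'callback': 'true', 'crawler': 'my-pipeline'}
    for url in urls:
        response = api.get(url, options)
        # The body is the JSON text {"rid": "..."}, so parse it before indexing
        rid = json.loads(response['body'])['rid']
        print(f'queued {url} -> {rid}')

push_batch([
    'https://example.com/p/123',
    'https://example.com/p/124',
    'https://example.com/p/125',
])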
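Because results arrive later and out of order, it helps to remember which Request ID belongs to which URL so that missing callbacks can be spotted. A minimal in-memory sketch follows; `PendingRequests` is a hypothetical helper, and a production pipeline would likely back it with a durable store rather than a dict.

```python
class PendingRequests:
    """In-memory map of Request ID -> URL for pushes awaiting a callback."""
    def __init__(self):
        self._pending = {}

    def pushed(self, rid, url):
        """Record that a URL was queued under this Request ID."""
        self._pending[rid] = url

    def completed(self, rid):
        """Return the URL for rid and forget it; None if the rid is unknown."""
        return self._pending.pop(rid, None)

    def outstanding(self):
        """Return a copy of everything still waiting for a callback."""
        return dict(self._pending)

pending = PendingRequests()
pending.pushed('rid-1', 'https://example.com/p/123')
pending.pushed('rid-2', 'https://example.com/p/124')
pending.completed('rid-1')
print(pending.outstanding())
```

Anything left in `outstanding()` after a reasonable wait is a candidate for re-pushing.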
The other half of async is the callback handler. Crawlbase POSTs the crawled page to the webhook you registered with the crawler, sending the HTML in the request body and metadata (the Request ID, original URL, and status) in the headers. Your handler should do the bare minimum: acknowledge fast with a 200 and hand the payload to your transform step. Doing heavy parsing inline risks the delivery timing out and being retried.
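const express = require('express')
const app = express()

// Crawlbase POSTs raw HTML; capture the body as text
app.use(express.text({ type: '*/*', limit: '10mb' }))

app.post('/crawlbase/callback', (req, res) => {
  const rid = req.headers['rid']
  const url = req.headers['url']
  // Node lowercases header names; accept either spelling of the status header
  const status = req.headers['original-status'] ?? req.headers['original_status']

  // ack immediately, process out of band
  res.sendStatus(200)
  enqueueForTransform({ rid, url, status, html: req.body })
})

// Placeholder hand-off: replace with a push to your own queue or worker
function enqueueForTransform(job) {
  setImmediate(() => console.log('queued for transform:', job.rid, job.url))
}

app.listen(8080, () => console.log('callback listening on :8080'))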
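For a Python stack, the same "acknowledge fast, process later" idea can be sketched without committing to a web framework. Assumptions here: the metadata headers are named `rid`, `url`, and an original-status header (hyphen or underscore spelling both tolerated), the body may arrive gzip-compressed, and `transform_queue` stands in for whatever job queue you actually run.

```python
import gzip
import queue

transform_queue = queue.Queue()  # stand-in for a real job queue

def parse_callback(headers, body):
    """Normalize one callback delivery into a dict for the transform step."""
    # Lowercase names and unify '-' and '_' so either spelling is accepted
    h = {k.lower().replace('-', '_'): v for k, v in headers.items()}
    if body[:2] == b'\x1f\x8b':  # gzip magic bytes
        body = gzip.decompress(body)
    return {
        'rid': h.get('rid'),
        'url': h.get('url'),
        'status': int(h.get('original_status', 0)),
        'html': body.decode('utf-8', errors='replace'),
    }

def handle_callback(headers, body):
    """Queue the payload and return the HTTP status code to send back."""
    transform_queue.put(parse_callback(headers, body))
    return 200
```

A worker can then drain `transform_queue` at its own pace, so slow parsing never delays the acknowledgement.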
If you would rather not run a webhook at all, point the crawler at Crawlbase Cloud Storage and poll it instead; the trade-off is a small delay versus zero infrastructure. Either way, the async model lets you collect millions of pages without your application ever blocking on a fetch.
One token covers both halves of collection: synchronous calls for known URLs and async pushes for volume, with rendering, rotating IPs, and block mitigation handled server-side. Start on the free tier, wire the callback to a throwaway endpoint, and watch results arrive before you build the rest of the pipeline.
Step 3: Transform and validate the raw HTML
Collection gives you HTML. The transform step turns that HTML into clean, typed records and throws out anything that does not pass a quality bar. This is where a lot of pipelines quietly rot: a job reports success, but the rows it wrote are empty because a selector drifted. Validate explicitly so a parsing failure looks like a failure.
Parse with whatever fits your stack; the example uses BeautifulSoup. The function extracts fields, normalizes them into native types, and refuses to emit a record with a missing name or an unparseable price.
import re

from bs4 import BeautifulSoup

def transform(html, source_url):
    """Parse product cards into typed records; raise if the page yields none."""
    soup = BeautifulSoup(html, 'html.parser')
    records = []
    for card in soup.select('.product-card'):
        name = card.select_one('.title')
        price = card.select_one('.price')
        if name is None or price is None:
            continue  # skip incomplete cards, do not emit junk
        title = name.get_text(strip=True)
        if not title:
            continue  # a blank name counts as missing
        digits = re.sub(r'[^\d.]', '', price.get_text())
        try:
            amount = float(digits)
        except ValueError:
            continue  # empty or malformed price such as '' or '1.2.3'
        records.append({
            'name': title,
            'price': amount,
            'source_url': source_url,
        })
    if not records:
        raise ValueError(f'no records parsed from {source_url} (selectors may have drifted)')
    return records
The shape that matters: clean each field into a native type (a float price, a stripped string), drop incomplete records instead of writing blanks, and raise when a whole page yields nothing so a drifted selector is caught the same day it breaks rather than weeks later in a report. If you want to skip parsing entirely for supported sites, the Crawling API returns structured JSON directly and this step becomes a passthrough.
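Cleaning a price into a float gets harder once sites mix formats such as `$1,299.00` and `1.299,00 €`. The sketch below uses one heuristic, stated as an assumption: a separator followed by exactly one or two trailing digits is the decimal mark, and every other separator is a thousands separator. `parse_price` is a hypothetical helper and will misread unusual formats.

```python
import re

def parse_price(text):
    """Return a float from a price string, or None if no digits are found."""
    # Keep digits and separators; drop a trailing '.' or ',' left by punctuation
    cleaned = re.sub(r'[^\d.,]', '', text).rstrip('.,')
    if not cleaned:
        return None
    whole, frac = cleaned, ''
    match = re.search(r'[.,](\d{1,2})$', cleaned)
    if match:
        # Separator plus one or two final digits: treat as the decimal part
        whole, frac = cleaned[:match.start()], match.group(1)
    digits = re.sub(r'[.,]', '', whole)  # remaining separators are thousands marks
    return float(f'{digits or 0}.{frac or 0}')

for sample in ('$1,299.00', '1.299,00 €', '19.99', '1,299', 'Call for price'):
    print(repr(sample), '->', parse_price(sample))
```

Returning None instead of raising lets the caller skip the record, which matches the "drop incomplete records" rule described above.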
Step 4: Load into storage
With validated records in hand, write them somewhere queryable. The destination depends on scale and use: a relational database like PostgreSQL for transactional access, a warehouse like BigQuery for analytics, a search store, or a streaming platform downstream. SQLite is enough to show the pattern, and the pattern is what generalizes: upsert on a stable key so re-running the pipeline updates existing rows instead of duplicating them.
import sqlite3
from datetime import datetime, timezone

def load(records, db_path='pipeline.db'):
    """Upsert records so a re-run updates rows instead of duplicating them."""
    conn = sqlite3.connect(db_path)
    try:
        # One page can yield many products, so the key is (source_url, name);
        # keying on source_url alone would keep only the last card per page
        conn.execute('''
            CREATE TABLE IF NOT EXISTS products (
                source_url TEXT NOT NULL,
                name TEXT NOT NULL,
                price REAL NOT NULL,
                collected_at TEXT NOT NULL,
                PRIMARY KEY (source_url, name)
            )''')
        now = datetime.now(timezone.utc).isoformat()
        for r in records:
            conn.execute('''
                INSERT INTO products (source_url, name, price, collected_at)
                VALUES (?, ?, ?, ?)
                ON CONFLICT(source_url, name) DO UPDATE SET
                    price=excluded.price,
                    collected_at=excluded.collected_at
            ''', (r['source_url'], r['name'], r['price'], now))
        conn.commit()
    finally:
        conn.close()  # release the connection even if an insert fails
The upsert is what makes the load step idempotent: running the same batch twice leaves the table in the same state, which is exactly what you want when a scheduler retries a failed run. The collected_at timestamp gives you a freshness signal you will use for monitoring in the next step. Swap the SQLite calls for your warehouse client and the logic carries over unchanged.
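Idempotency is easy to verify in isolation. The sketch below runs the same batch twice against an in-memory SQLite database and counts the rows. It assumes a composite `(source_url, name)` key so that several products from one page can coexist; the `upsert` helper and the sample records are illustrative only.

```python
import sqlite3

def upsert(conn, records, collected_at):
    """Insert records, updating price and timestamp when the key already exists."""
    conn.executemany('''
        INSERT INTO products (source_url, name, price, collected_at)
        VALUES (:source_url, :name, :price, :collected_at)
        ON CONFLICT(source_url, name) DO UPDATE SET
            price = excluded.price,
            collected_at = excluded.collected_at
    ''', [{**r, 'collected_at': collected_at} for r in records])
    conn.commit()

conn = sqlite3.connect(':memory:')
conn.execute('''
    CREATE TABLE products (
        source_url TEXT NOT NULL,
        name TEXT NOT NULL,
        price REAL NOT NULL,
        collected_at TEXT NOT NULL,
        PRIMARY KEY (source_url, name)
    )''')

batch = [
    {'name': 'Widget', 'price': 9.99, 'source_url': 'https://example.com/products'},
    {'name': 'Gadget', 'price': 19.5, 'source_url': 'https://example.com/products'},
]
upsert(conn, batch, '2024-01-01T00:00:00+00:00')
upsert(conn, batch, '2024-01-01T06:00:00+00:00')  # simulated scheduler retry

row_count = conn.execute('SELECT COUNT(*) FROM products').fetchone()[0]
print(row_count, 'rows after two runs')
```

The table holds two rows after both runs, and each row carries the timestamp of the second run.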
Step 5: Automate, schedule, and monitor
The pieces compose into one pipeline function, and that function is what your scheduler calls. Wiring collect, transform, and load together with a per-URL try/except keeps one bad page from killing an entire run.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger('pipeline')

def run_pipeline(urls):
    """Run collect -> transform -> load per URL and report the tally."""
    ok, failed = 0, 0
    for url in urls:
        try:
            html = collect(url, render=True)
            records = transform(html, url)
            load(records)
            ok += 1
        except Exception as err:
            # One bad page is logged and counted, not allowed to stop the run
            failed += 1
            log.error('pipeline failed for %s: %s', url, err)
    log.info('run complete: %d ok, %d failed', ok, failed)
    if failed > ok:
        raise RuntimeError('majority of URLs failed, check upstream')
To run it on a schedule, the simplest option is cron. This entry runs the pipeline every six hours and appends output to a log you can tail or ship to your monitoring stack:
# run the pipeline every 6 hours
0 */6 * * * /path/to/venv/bin/python /path/to/run.py >> /var/log/pipeline.log 2>&1
Cron is fine for a handful of jobs. Once you have dependencies between steps, retries, and backfills, graduate to a workflow orchestrator like Apache Airflow or Prefect, which give you DAGs, automatic retries, and a UI for run history. With the async Crawler, the collection side needs no fetch loop of its own: the scheduled job only pushes URLs, and results stream into your callback as they finish.
Monitoring is the difference between a pipeline you trust and one you babysit. Track three things at minimum. Volume: row counts per run, so a sudden drop flags a collection problem. Freshness: the collected_at timestamps you stored, so you can alert when data goes stale. Failure rate: the ok-versus-failed tally from each run, so a creeping increase warns you a target site is changing before everything breaks. Pair that with sensible scraping hygiene: the practices for scraping websites without getting blocked are what keep the collection layer healthy at scale.
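The three signals can be checked with a few comparisons after each run. This is a sketch under assumptions: `health_alerts` is a hypothetical helper, and the thresholds (half the baseline volume, 12 hours of staleness, a 20% failure rate) are placeholders to tune for your own data.

```python
from datetime import datetime, timedelta, timezone

def health_alerts(row_count, baseline_rows, newest_collected_at, ok, failed,
                  now=None, min_volume_ratio=0.5,
                  max_age=timedelta(hours=12), max_failure_rate=0.2):
    """Return a list of alert strings; an empty list means the run looks healthy."""
    now = now or datetime.now(timezone.utc)
    alerts = []
    # Volume: compare this run's row count with a typical run
    if baseline_rows and row_count < baseline_rows * min_volume_ratio:
        alerts.append(f'volume: {row_count} rows vs baseline {baseline_rows}')
    # Freshness: age of the newest collected_at value (ISO 8601 with offset)
    age = now - datetime.fromisoformat(newest_collected_at)
    if age > max_age:
        alerts.append(f'freshness: newest row is {age} old')
    # Failure rate: share of URLs that failed in this run
    total = ok + failed
    if total and failed / total > max_failure_rate:
        alerts.append(f'failure rate: {failed}/{total} URLs failed')
    return alerts
```

The returned strings can go to a log line, an email, or whatever alerting channel the monitoring stack already uses.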
Key takeaways
- Collection is the weak link. Put a managed ingestion layer in front so a site change is a config tweak, not a pipeline outage.
- Two collection modes. The Crawling API serves synchronous, known-URL lookups; the async Crawler pushes high volume and POSTs results to your webhook without blocking.
- Validate in transform. Clean fields into native types, drop incomplete records, and raise when a page yields nothing so drifted selectors fail loudly.
- Make the load idempotent. Upsert on a stable key so retries and re-runs update rows instead of duplicating them.
- Schedule and monitor. Cron or an orchestrator drives runs; track volume, freshness, and failure rate to catch problems early.
Frequently Asked Questions (FAQs)
How do I build a scalable web data pipeline with Crawlbase?
Use Crawlbase as the collection layer and standard ETL tooling for the rest. Collect pages with the Crawling API for known URLs and the async Crawler for high volume, transform the returned HTML into validated typed records, load them into storage with an idempotent upsert, and schedule the run with cron or an orchestrator while monitoring volume, freshness, and failure rate. Crawlbase handles rendering, IP rotation, and block mitigation so your code only deals with clean data.
When should I use the Crawling API versus the async Crawler?
Use the Crawling API when you have a known list of URLs and want the page back immediately, which suits backend services, monitoring jobs, and real-time lookups. Use the async Crawler when you are collecting at high volume or want fire-and-forget delivery: you push URLs, get a Request ID instantly, and Crawlbase POSTs each result to your callback as it finishes. Many pipelines run both, the API for targeted retrieval and the Crawler for broad coverage.
How does the async Crawler callback work?
You create a named crawler in the dashboard and point it at your webhook URL, then push URLs with callback=true and crawler=YourCrawlerName. Each push returns a Request ID immediately. When Crawlbase finishes fetching a page, it sends an HTTP POST to your webhook with the HTML in the body and metadata in the headers. Your handler should return a 200 quickly and process the payload out of band so the delivery does not time out.
Do I still need to manage proxies or handle anti-bot defenses?
No. The Crawling API and Crawler route requests through a rotating IP pool, render JavaScript when you pass a JS token, and apply block mitigation server-side. You send a URL and get a page back, so you skip running a proxy pool and a headless browser fleet yourself. If you only need raw rotating IPs for your own stack, the Smart AI Proxy exposes the same network as a standard proxy endpoint.
How do I keep my pipeline from writing empty or bad data?
Validate in the transform step. Check the response status on collection and raise on anything that is not a clean fetch, then in parsing drop records with missing required fields and raise when an entire page yields zero records, since that usually means a selector drifted. Make the load idempotent with an upsert so retries do not duplicate rows, and store a collection timestamp so you can monitor freshness and alert when data goes stale.
Can this pipeline handle millions of pages?
Yes. The bottleneck in a naive design is blocking on each synchronous fetch, which the async Crawler removes by queuing work and delivering results via callback. Push large batches without waiting, let the queue absorb the load, and process results as they arrive. For very large or ongoing programs, an enterprise plan adds the throughput and support that high-volume collection needs.