Scraping one page is easy. Scraping thousands across many different sites in one run is where most setups fall apart: a plain loop blocks on every request, IPs get banned a few hundred calls in, duplicate URLs waste credits, and a single slow page stalls the whole job. The problem is not parsing. It is throughput, blocking, and bookkeeping at scale.
This guide shows you how to scrape multiple websites at once in Python the way a production job actually runs. You will use the Crawling API to fetch and render each page behind a real browser and a trusted IP, and the asynchronous Crawler to push thousands of URLs onto a managed queue that crawls them concurrently and delivers results to a webhook. We will cover queueing many URLs, controlling concurrency, deduplicating work, and collecting the rows into one place.
Why a single loop does not scale
A first scraper is almost always one loop: read a URL, fetch it, parse it, save it, repeat. That works for ten pages and breaks at ten thousand. Each fetch blocks the next, so total time is the sum of every request. Send those requests from one IP and a site starts returning 403s and CAPTCHAs within a few hundred calls. And nothing in the loop notices that you already crawled half the list yesterday, so you pay to re-fetch pages you already have.
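The cost of blocking is easy to see in a toy timing run. The sketch below is illustrative only: `fake_fetch` is a hypothetical stand-in that sleeps instead of touching a real site, and the delays are made-up numbers.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(delay):
    # Stand-in for a network call: sleep instead of hitting a real site.
    time.sleep(delay)
    return delay

delays = [0.05] * 8  # eight simulated pages, 50 ms each

# Serial: each fetch waits for the previous one, so the delays add up.
start = time.perf_counter()
for d in delays:
    fake_fetch(d)
serial_seconds = time.perf_counter() - start

# Concurrent: the waits overlap, so the total is close to the slowest page.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(fake_fetch, delays))
concurrent_seconds = time.perf_counter() - start

print(f"serial: {serial_seconds:.2f}s, concurrent: {concurrent_seconds:.2f}s")
```

The serial total grows with the number of pages, while the concurrent total tracks the slowest single page as long as there are enough workers.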
Scaling across many sites at once means solving three separate problems. You need concurrency, so slow pages do not block fast ones. You need unblocking, so rotating IPs and real browser rendering keep you off the ban list. And you need bookkeeping, so duplicate URLs get skipped and finished results land in one store. The rest of this guide maps each problem to a tool and shows the code.
Keep the boundary clear. The Crawling API fetches and renders one page per call: it runs the JavaScript, rotates the IP, and hands back finished HTML. The async Crawler is the queue on top: you push many URLs, it crawls them concurrently, retries failures, and POSTs each result to a webhook you host. Use the API for a bounded batch you wait on, the Crawler for a large fire-and-collect job.
What you will build
Two runnable patterns over a list of URLs spanning different sites. First, a concurrent batch that loops a deduplicated URL set through the Crawling API and writes every result to a JSON file, which is the right shape for hundreds to a few thousand pages you want in hand when the run finishes. Second, an async push to the Crawler for jobs that run into the tens of thousands, where blocking on each fetch is no longer an option. Both use the official crawlbase Python client.
Set up the environment
You need Python 3.8 or later. Confirm your version, create a virtual environment so dependencies stay isolated, then install the client.
    python --version
    python -m venv scrape_env
    source scrape_env/bin/activate
    pip install crawlbase
On Windows, activate the environment with scrape_env\Scripts\activate instead of the source line. The crawlbase package is the official client and wraps both the Crawling API and the async Crawler, so you do not assemble HTTP calls by hand. Grab two tokens from your Crawlbase dashboard after signing up: a normal token for static pages and a JavaScript (JS) token for client-rendered ones. Read them from environment variables rather than hard-coding them.
    export CRAWLBASE_TOKEN=your_normal_token_here
    export CRAWLBASE_JS_TOKEN=your_js_token_here
The normal token fetches static HTML and is cheaper and faster. The JS token renders the page in a real browser first, which you need for any site that loads content client-side. When you scrape across many different sites at once, you will hit both kinds, so a common pattern is to default to the JS token and drop to the normal one for targets you know are static.
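One way to express that default-to-JS rule is a small lookup keyed on hostname. This is a sketch, not part of the crawlbase client: `pick_token` and `STATIC_HOSTS` are hypothetical names, and the allowlist entry is an assumption you would replace with hosts you have checked yourself.

```python
import os
from urllib.parse import urlparse

# Hypothetical allowlist: hosts you have verified serve their content as static HTML.
STATIC_HOSTS = {"books.toscrape.com"}

def pick_token(url, normal_token, js_token, static_hosts=STATIC_HOSTS):
    """Return the normal token for known-static hosts, the JS token otherwise."""
    host = urlparse(url).hostname or ""
    return normal_token if host in static_hosts else js_token

# Placeholders keep the sketch runnable when the variables are not set.
normal = os.environ.get("CRAWLBASE_TOKEN", "normal-token-placeholder")
js = os.environ.get("CRAWLBASE_JS_TOKEN", "js-token-placeholder")

for url in ["https://books.toscrape.com/index.html", "https://quotes.toscrape.com/js/"]:
    kind = "normal" if pick_token(url, normal, js) == normal else "js"
    print(f"{url} -> {kind} token")
```

Defaulting unknown hosts to the JS token trades some cost for safety: a static page still comes back correctly through a browser render, while a client-rendered page fetched without one comes back empty.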
Build a deduplicated URL set
Before any fetching, clean the input. A real target list, stitched together from sitemaps, category pages, and previous runs, is full of duplicates and stale entries. Deduplicating up front is the single cheapest optimization you can make, because the request you never send costs nothing. Normalize each URL and keep a set of what you have already crawled.
    import json
    import os
    from urllib.parse import urlparse, urlunparse

    def normalize(url):
        parts = urlparse(url.strip())
        # Drop fragments and trailing slashes so near-duplicates collapse.
        path = parts.path.rstrip("/") or "/"
        return urlunparse((parts.scheme, parts.netloc, path, "", parts.query, ""))

    raw_urls = [
        "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
        "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
        "https://quotes.toscrape.com/page/1/",
        "https://quotes.toscrape.com/page/1/#top",  # duplicate after normalizing
    ]

    targets = sorted({normalize(u) for u in raw_urls})
    print(f"{len(targets)} unique URLs to crawl")
The set comprehension collapses exact and near-duplicate URLs in one line. For a run that resumes across days, persist the crawled set to disk and subtract it from targets at the start of each run so you never re-fetch a page you already have. That is the bookkeeping layer doing its job before you spend a single credit.
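Persisting the crawled set can be as simple as a JSON file of normalized URLs. The sketch below assumes a file named `crawled.json` and three hypothetical helpers (`load_seen`, `save_seen`, `pending`); a database or key-value store would serve the same role on a larger job.

```python
import json
import tempfile
from pathlib import Path

def load_seen(path):
    """Return the set of URLs recorded by earlier runs (empty on the first run)."""
    p = Path(path)
    return set(json.loads(p.read_text())) if p.exists() else set()

def save_seen(path, seen):
    """Write the crawled set back as a sorted JSON list."""
    Path(path).write_text(json.dumps(sorted(seen), indent=2))

def pending(targets, seen):
    """Keep only the targets that no earlier run has crawled."""
    return [u for u in targets if u not in seen]

# Demo in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    seen_path = Path(tmp) / "crawled.json"  # hypothetical file name
    save_seen(seen_path, {"https://quotes.toscrape.com/page/1"})
    todo = pending(
        ["https://quotes.toscrape.com/page/1", "https://quotes.toscrape.com/page/2"],
        load_seen(seen_path),
    )

print(todo)
```

In a real run you would add each URL to the set only after its result is safely stored, then call `save_seen` at the end, so an interrupted job re-fetches at most the pages that were still in flight.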
Scrape the batch concurrently with the Crawling API
Now fetch the set. The naive version is a serial loop, but serial is exactly what does not scale, so run the requests through a thread pool. Each call to the Crawling API is I/O-bound, waiting on the network, which is the case a thread pool handles well. A modest worker count keeps many requests in flight without hammering any one site too hard.
    from concurrent.futures import ThreadPoolExecutor, as_completed
    from crawlbase import CrawlingAPI

    api = CrawlingAPI({"token": os.environ["CRAWLBASE_JS_TOKEN"]})

    def fetch(url):
        options = {"ajax_wait": "true", "page_wait": 2000}
        response = api.get(url, options)
        status = response["headers"].get("pc_status")
        return {
            "url": url,
            "status": status,
            "html": response["body"].decode("utf-8", "ignore"),
        }

    results = []
    with ThreadPoolExecutor(max_workers=10) as pool:
        futures = {pool.submit(fetch, u): u for u in targets}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results.append(future.result())
            except Exception as err:
                print(f"Failed {url}: {err}")

    print(f"Collected {len(results)} pages")
Two details make this robust. The pc_status header carries Crawlbase's own status code for the request, reported separately from the original_status the target returned, so you can tell a real 200 from a soft failure and decide whether to keep the row. And wrapping future.result() in a try/except means one bad URL logs and moves on instead of killing the whole batch. The Crawling API handles the rendering and IP rotation per call, so the only thing your code manages is concurrency.
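Logging a failure is enough for a first pass, but it is often useful to collect the failed URLs and run them again. The sketch below is one possible shape, not the client's own behavior: `run_batch` is a hypothetical helper, and `flaky_fetch` simulates a page that times out once so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_batch(fetch, urls, max_workers=10):
    """Run fetch over urls concurrently; return (results, failed_urls)."""
    results, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, u): u for u in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                item = future.result()
            except Exception:
                failed.append(url)  # hard failure: the call itself raised
                continue
            if item.get("status") == "200":
                results.append(item)
            else:
                failed.append(url)  # soft failure: a response, but not a clean 200
    return results, failed

# Simulated fetch: the "flaky" URL raises on its first attempt only.
attempts = {}

def flaky_fetch(url):
    attempts[url] = attempts.get(url, 0) + 1
    if "flaky" in url and attempts[url] == 1:
        raise TimeoutError("simulated timeout")
    return {"url": url, "status": "200", "html": "<html></html>"}

urls = ["https://example.com/a", "https://example.com/flaky"]
first_pass, failed = run_batch(flaky_fetch, urls)
second_pass, still_failed = run_batch(flaky_fetch, failed)
print(f"{len(first_pass) + len(second_pass)} collected, {len(still_failed)} still failing")
```

Swapping `flaky_fetch` for the real `fetch` function gives a two-pass batch; URLs that fail both passes are worth inspecting by hand rather than retrying indefinitely.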
One call fetches and renders a page behind a real browser and a rotating residential IP, so a batch across many different sites stays unblocked without you running a headless fleet or a proxy pool. Start with a public page on the free tier, then scale the same loop to thousands of URLs.
Parse and save the collected results
You now have raw HTML for every page in results. Parsing varies per site, but the collection step is the same: pull the fields you want and write one structured record per page. Keep the URL and a capture timestamp on every row so the output doubles as an audit trail of what ran and when.
    from datetime import datetime, timezone
    from bs4 import BeautifulSoup

    def parse_title(html):
        soup = BeautifulSoup(html, "html.parser")
        title = soup.find("title")
        return title.text.strip() if title else None

    rows = []
    for item in results:
        if item["status"] != "200":
            continue
        rows.append({
            "url": item["url"],
            "title": parse_title(item["html"]),
            "captured_at": datetime.now(timezone.utc).isoformat(),
        })

    with open("scraped.json", "w") as f:
        json.dump(rows, f, indent=2)

    print(f"Wrote {len(rows)} rows to scraped.json")
This step needs beautifulsoup4 installed alongside the client (pip install beautifulsoup4). Skipping rows that did not return a clean 200 keeps soft failures, such as an empty body or a challenge page, out of your dataset, which is the kind of silent corruption that poisons a large crawl. For well-known targets like major retailers or marketplaces, you can skip hand-written parsing entirely and let the Crawling API return pre-parsed JSON instead.
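Because parsing varies per site, a multi-site run usually needs a way to pick a parser by host. The sketch below shows one possible dispatch table. It uses the standard library's `html.parser` so it runs without extra installs; `TagText`, `first_text`, `PARSERS`, and `parse_page` are hypothetical names, and the choice of `h1` for one host is an assumption made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class TagText(HTMLParser):
    """Collect the text inside the first occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.inside = False
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag and not self.done:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == self.tag and self.inside:
            self.inside, self.done = False, True

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)

def first_text(html, tag):
    parser = TagText(tag)
    parser.feed(html)
    return "".join(parser.chunks).strip() or None

# Hypothetical per-host parsers; unknown hosts fall back to the <title> tag.
PARSERS = {
    "books.toscrape.com": lambda html: {"title": first_text(html, "h1")},
}

def parse_page(url, html):
    host = urlparse(url).hostname or ""
    parser = PARSERS.get(host, lambda h: {"title": first_text(h, "title")})
    return parser(html)

sample = "<html><head><title>Site</title></head><body><h1>Book name</h1></body></html>"
print(parse_page("https://books.toscrape.com/catalogue/x/index.html", sample))
print(parse_page("https://quotes.toscrape.com/page/1", sample))
```

The same table works with BeautifulSoup-based parsers: each entry only has to accept HTML and return a dict, so adding a site means adding one function.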
Scale past the batch with the async Crawler
The thread-pool batch is the right tool up to a few thousand URLs you want to wait on. Past that, blocking your process while tens of thousands of pages crawl is no longer practical, and that is where the asynchronous Crawler takes over. It is a push-based managed queue: you submit URLs through the same client, each gets a request ID, the system crawls them concurrently and retries failures for you, then POSTs each finished page to a webhook on your server.
    from crawlbase import CrawlingAPI

    crawler = CrawlingAPI({"token": os.environ["CRAWLBASE_JS_TOKEN"]})

    # Push each URL to the async Crawler; results arrive at your webhook.
    for url in targets:
        response = crawler.post(url, {
            "callback": "https://your-app.example.com/webhook",
            "callback_headers": "X-Job-Id:bulk-run-01",
        })
        body = json.loads(response["body"])
        print(f"Queued {url} as request {body['rid']}")
Each post returns a request ID (rid) you can log to track the job. The Crawler crawls the queue in the background with its own concurrency and retry logic, so your script finishes the moment every URL is submitted instead of waiting on the crawl. When a page completes, the system POSTs the result to your callback URL, and the callback_headers field lets you tag a run so the receiving handler knows which job a delivery belongs to.
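Logging the request IDs pays off when you need to know what is still outstanding. The sketch below is a minimal in-memory ledger, assuming you record each rid at submit time and mark it off in the webhook; `JobLedger` and the sample rid values are hypothetical, and a long-running job would keep this in a database rather than in memory.

```python
class JobLedger:
    """Track which queued requests are still waiting on a delivery."""

    def __init__(self):
        self.pending = {}    # rid -> url, queued but not yet delivered
        self.delivered = {}  # rid -> url, received at the webhook

    def record_queued(self, rid, url):
        self.pending[rid] = url

    def record_delivered(self, rid):
        # Returns the URL for a known rid, or None for an unknown or repeated one.
        url = self.pending.pop(rid, None)
        if url is not None:
            self.delivered[rid] = url
        return url

ledger = JobLedger()
ledger.record_queued("rid-001", "https://quotes.toscrape.com/page/1")
ledger.record_queued("rid-002", "https://quotes.toscrape.com/page/2")
ledger.record_delivered("rid-001")
print(f"{len(ledger.pending)} still pending")
```

Anything left in `pending` after the queue drains is a candidate for resubmission, and a `None` return from `record_delivered` flags a delivery you did not expect.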
Collect the deliveries
The async model inverts collection: instead of pulling pages, you receive them. Your webhook runs the same parse-and-save logic from the batch version, only the trigger changes. A minimal handler in Flask looks like this.
    import json
    from datetime import datetime, timezone

    from flask import Flask, request

    # parse_title is the helper defined in the batch version above.

    app = Flask(__name__)

    @app.route("/webhook", methods=["POST"])
    def webhook():
        rid = request.headers.get("rid")
        original_url = request.headers.get("original_url")
        html = request.get_data(as_text=True)
        row = {
            "url": original_url,
            "title": parse_title(html),
            "captured_at": datetime.now(timezone.utc).isoformat(),
        }
        # One appended line per delivery keeps each write self-contained.
        with open("bulk.jsonl", "a") as f:
            f.write(json.dumps(row) + "\n")
        return "", 200
Appending to a JSON Lines file means each delivery is one self-contained write, so concurrent callbacks never clobber each other the way a single re-serialized JSON array would. The Crawler sends the original URL and request ID in the headers of each delivery, so the same parse_title and row shape from the batch version carry straight over. This is what lets the pipeline grow from a few thousand pages to hundreds of thousands without your process ever sitting and waiting.
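Reading the file back is the mirror image of writing it: one `json.loads` per line. The sketch below also keeps only the last row per URL, on the assumption that a retried or repeated delivery can append the same page twice; `load_rows` is a hypothetical helper and the sample rows are made up.

```python
import json
import tempfile
from pathlib import Path

def load_rows(path):
    """Read a JSON Lines file; the last row seen for each URL wins."""
    latest = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            latest[row["url"]] = row
    return list(latest.values())

# Demo against a temporary file that contains one repeated URL.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "bulk.jsonl"
    path.write_text(
        '{"url": "https://example.com/a", "title": "Old"}\n'
        '{"url": "https://example.com/b", "title": "B"}\n'
        '{"url": "https://example.com/a", "title": "New"}\n',
        encoding="utf-8",
    )
    rows = load_rows(path)

print(rows)
```

Deduplicating at read time keeps the webhook itself as simple as possible: it only ever appends, and cleanup happens once when you consume the file.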
At volume you cannot eyeball a crawl, so lean on the built-in monitoring. The Crawlbase dashboard tracks request volume, success and failure rates, and credits used, and the live monitor shows queue depth in real time. A creeping rise in failures usually means one target started challenging traffic, and you want to catch that within minutes, not after a run finishes with half the rows missing.
Concurrency, rate limits, and staying unblocked
More workers is not always faster. Push concurrency too high and you either exhaust your plan's request rate or hammer a single domain hard enough to trigger its defenses, which slows the run down with retries. The fix is to control concurrency per domain rather than globally: ten in-flight requests spread across ten sites is gentle, while ten against one site is aggressive. Group your URL set by host and cap how many you keep in flight against any one of them.
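A per-host cap can be sketched with one semaphore per hostname. Everything here is illustrative: `PER_HOST_LIMIT`, `host_slot`, and `fetch_with_cap` are hypothetical names, the limit of 2 is an arbitrary choice, and `fake_fetch` sleeps instead of making a request so the example can measure peak in-flight counts.

```python
import threading
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

PER_HOST_LIMIT = 2  # assumed cap; tune per target
_host_slots = defaultdict(lambda: threading.Semaphore(PER_HOST_LIMIT))
_slots_lock = threading.Lock()

def host_slot(url):
    host = urlparse(url).hostname or ""
    with _slots_lock:  # guard semaphore creation across threads
        return _host_slots[host]

def fetch_with_cap(fetch, url):
    # Wait for a free slot on this URL's host before fetching.
    with host_slot(url):
        return fetch(url)

# Instrumented stand-in for a real fetch: records peak in-flight calls per host.
in_flight = defaultdict(int)
peak = defaultdict(int)
counter_lock = threading.Lock()

def fake_fetch(url):
    host = urlparse(url).hostname
    with counter_lock:
        in_flight[host] += 1
        peak[host] = max(peak[host], in_flight[host])
    time.sleep(0.02)
    with counter_lock:
        in_flight[host] -= 1
    return url

urls = [f"https://a.example/{i}" for i in range(6)]
urls += [f"https://b.example/{i}" for i in range(6)]

with ThreadPoolExecutor(max_workers=10) as pool:
    done = list(pool.map(lambda u: fetch_with_cap(fake_fetch, u), urls))

print(dict(peak))
```

One trade-off of this design is that a worker waiting on a busy host still occupies a pool thread. Interleaving the URL list by host before submitting reduces that idle time.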
Because the Crawling API and the Crawler both rotate residential IPs and render behind a real browser server-side, the heaviest part of staying unblocked is handled for you. If you would rather route your own client through a rotating pool, the Smart AI Proxy gives you the same residential IP rotation as a drop-in proxy endpoint. Either way, pace requests, vary targets, and watch the status codes so you can back off the moment a site starts pushing back. The full playbook lives in the guide "How to scrape websites without getting blocked".
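Backing off when a site pushes back usually means retrying with a growing delay. The sketch below shows exponential backoff with jitter under stated assumptions: `fetch_with_backoff` is a hypothetical wrapper, the result dict follows the `status` shape used earlier in this guide, and the `sleep` parameter exists only so the example can run without actually waiting.

```python
import random
import time

def fetch_with_backoff(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Retry a fetch until it returns status "200" or the retries run out."""
    for attempt in range(retries + 1):
        item = fetch(url)
        if item.get("status") == "200":
            return item
        if attempt < retries:
            # Double the wait each attempt and add jitter so workers do not sync up.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
    return item  # last response, still not a clean 200

# Simulated target that challenges the first two requests.
responses = iter(["429", "429", "200"])

def fake_fetch(url):
    return {"url": url, "status": next(responses)}

waits = []
result = fetch_with_backoff(fake_fetch, "https://example.com/page", sleep=waits.append)
print(result["status"], [round(w, 2) for w in waits])
```

Passing `sleep=waits.append` records the delays instead of sleeping; in a real run you would leave the default so the waits actually happen.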
Scrape responsibly
Scraping at scale is a responsibility, not just a capability. Stick to publicly available data; do not scrape content behind a login, paywalled material, or anything personal or copyrighted without a clear right to it. Read each site's robots.txt and terms of service, and honor the access rules they state. And rate-limit yourself: spacing requests and capping concurrency per domain keeps you off block lists and keeps your load off a site's servers. Restraint is not only the ethical choice, it is the operational one, because a job that respects limits stays online far longer than one that does not.
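Checking robots.txt can be automated as a pre-filter on your target list with the standard library's `urllib.robotparser`. The sketch below parses an inline example file so it runs offline; the rules, the `my-crawler` user agent, and the URLs are all made up for illustration. Against a real site you would load the file with `set_url()` and `read()` instead.

```python
from urllib.robotparser import RobotFileParser

# Example rules, invented for this sketch.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

agent = "my-crawler"  # hypothetical user agent name
allowed = rp.can_fetch(agent, "https://example.com/catalogue/page-1.html")
blocked = rp.can_fetch(agent, "https://example.com/private/data.html")
delay = rp.crawl_delay(agent)

print(f"catalogue allowed: {allowed}, private allowed: {blocked}, crawl delay: {delay}")
```

Running this check while you build the deduplicated URL set means disallowed pages never reach the queue, and a stated crawl delay gives you a starting point for per-domain pacing.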
Key takeaways
- Split the problem. Scraping many sites at once is three problems, not one: concurrency, unblocking, and bookkeeping. Map each to a tool instead of cramming them into one loop.
- Deduplicate before you fetch. Normalize URLs and skip ones you already crawled, because the cheapest request is the one you never send.
- Use a thread pool for bounded batches. The Crawling API call is I/O-bound, so a modest pool of workers collects hundreds to a few thousand pages far faster than a serial loop.
- Push to the async Crawler at scale. For tens of thousands of URLs, submit them to the queue and receive results at a webhook, so concurrency, retries, and monitoring come for free.
- Control concurrency per domain. Spread load across hosts and cap in-flight requests per site so you stay unblocked instead of triggering defenses.
- Scrape responsibly. Public data only, respect robots.txt and terms of service, and rate-limit yourself so the job keeps running.
Frequently Asked Questions (FAQs)
How do I scrape multiple websites at once in Python?
Build a deduplicated set of URLs, then fetch them concurrently rather than in a serial loop. For a bounded batch, run the Crawling API through a ThreadPoolExecutor so slow pages do not block fast ones, and collect each result into a list you write to disk. For very large jobs, push the URLs to the asynchronous Crawler instead, which queues and crawls them in the background and delivers each finished page to a webhook you host.
What is the difference between the Crawling API and the async Crawler?
The Crawling API is synchronous: you send one URL and wait for the rendered page in the response, which is ideal for a single scrape or a small concurrent batch. The async Crawler is built for scale: you push many URLs, it crawls them in the background with its own concurrency and retries, and it POSTs each result to your webhook. Both share the same rendering and anti-block backbone, so you pick the one that fits your throughput.
How do I avoid getting blocked when scraping many sites?
Rotate IPs and render pages behind a real browser, and pace your requests so you do not overload any single domain. The Crawling API and Crawler handle IP rotation and rendering server-side, so most blocking is taken care of. If you route your own client, use a rotating endpoint like the Smart AI Proxy, control concurrency per domain, and watch status codes so you can back off when a site starts challenging traffic.
How do I handle duplicate URLs across thousands of pages?
Normalize each URL by stripping fragments and trailing slashes, then store them in a set so exact and near-duplicates collapse automatically. For runs that resume over time, persist the set of already-crawled URLs to disk and subtract it from your target list at the start of each run. That bookkeeping keeps you from paying to re-fetch pages you already have.
How many concurrent requests should I run?
Start modest, around ten workers, and tune from there based on your plan's request rate and how the targets respond. The number that matters is concurrency per domain, not the global total: ten requests spread across ten sites is gentle, while ten against one site is aggressive. Group URLs by host and cap how many you keep in flight against any single one to stay unblocked.
Is it legal to scrape thousands of websites?
Scraping publicly available data is generally accepted, but the legality depends on each site's terms of service, copyright, and data-protection laws like GDPR and CCPA. Stay on public data, avoid content behind logins or paywalls and anything personal or copyrighted, follow robots.txt, and rate-limit yourself. When in doubt about a specific target, check its terms and get legal advice before you run a large job against it.
