Scraping one product page is a tutorial. Scraping a few million of them, every day, without your pipeline grinding to a halt is a systems problem. Large-scale e-commerce scraping is what powers price monitoring, catalog and assortment tracking, competitor benchmarking, and demand signals: the kind of analysis that needs the whole shelf, not a sample. The jump from hundreds of pages to millions does not just add work; it changes what can break and how you have to build.
This guide is scoped to public data: product listings, prices, availability, ratings, and review counts that anyone can see without logging in. It does not touch user accounts, carts, orders, or personal data. We will walk through why teams do this, the challenges that only show up at volume, and an architecture that holds up, with the Crawlbase Crawling API, async Crawler, and Smart AI Proxy framed as the scale layer.
Why scrape e-commerce at large scale
A single product page tells you what one price looks like right now. The value at scale comes from breadth and history: thousands of SKUs tracked across many retailers, sampled often enough to see movement. That is what turns raw listings into decisions. Four use cases drive most large-scale e-commerce scraping work.
- Price monitoring. Track competitor prices across a catalog over time so you can reprice, spot promotions, and catch undercutting the day it happens rather than the week after.
- Catalog and assortment tracking. Map which products a retailer carries, what is in stock, and how listings are described, so you can find gaps in your own assortment and watch competitors expand into new categories.
- Competitor benchmarking. Compare your prices, ratings, and review velocity against rivals on the same SKUs to see where you win and where you are losing the buy box.
- Demand signals. Review counts, rating trends, and stock churn are public proxies for what is selling. Watching them across many stores surfaces rising products before they show up in your own sales data.
This is the same shape of problem as any e-commerce web scraping job. The difference at scale is purely one of volume, and volume is exactly where the easy approaches fall apart.
What changes when you go from hundreds to millions of pages
A script that scrapes a few hundred pages on your laptop will not survive being pointed at a few million. The problems below are mild at small volume and become the whole job at large scale.
Volume and concurrency
Fetching pages one at a time is fine for a demo and hopeless for a catalog. Millions of pages means you need many requests in flight at once, which means request scheduling, backpressure, and a way to not lose work when a worker dies mid-run. Doing this synchronously from a single process is the first thing that breaks.
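To make "many requests in flight at once" concrete, here is a minimal sketch of a bounded worker pool in plain Node.js. The name mapWithConcurrency and the result shape are illustrative choices, not part of any library, and a real pipeline would also persist progress so a crashed worker does not lose the batch.

```javascript
// Run `worker` over `items` with at most `limit` calls in flight at once.
// Each result is { ok: true, value } or { ok: false, error }, in input order,
// so one failed page never aborts the whole batch.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length)
  let next = 0

  // Each lane pulls the next unclaimed index until the list is exhausted.
  async function lane() {
    while (next < items.length) {
      const index = next++ // claimed synchronously, so lanes never collide
      try {
        results[index] = { ok: true, value: await worker(items[index], index) }
      } catch (error) {
        results[index] = { ok: false, error }
      }
    }
  }

  const lanes = Array.from({ length: Math.min(limit, items.length) }, lane)
  await Promise.all(lanes)
  return results
}
```

You would call it as mapWithConcurrency(urls, 50, fetchPage), where fetchPage is whatever function fetches a single URL. The limit is the backpressure: raising it increases throughput until the target site or your proxy pool starts pushing back.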
Anti-bot defenses
Large retailers run sophisticated bot detection. Datacenter IPs, repetitive request patterns, and missing browser fingerprints get challenged with CAPTCHAs or blocked outright. The more pages you pull, the more traffic you generate from a given source, and the faster you trip those defenses. What worked for a hundred requests gets you banned at a hundred thousand.
IP rotation
The answer to blocking is spreading requests across many IPs so no single address looks abusive. At small scale a handful of proxies is enough. At large scale you need a deep pool of residential proxies and a rotation strategy that keeps any one IP under the rate limits, which is real infrastructure to build and keep healthy.
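If you do run your own pool, the core of a rotation strategy can be sketched in a few lines. This is an illustrative sketch only: ProxyRotator, the proxy hostnames, and the fixed per-IP interval are all assumptions, and a production pool also needs health checks and ban detection.

```javascript
// Hands out the proxy that has been idle longest and enforces a minimum gap
// between uses of the same proxy, so no single IP exceeds a chosen rate.
class ProxyRotator {
  constructor(proxies, minIntervalMs) {
    this.minIntervalMs = minIntervalMs
    this.slots = proxies.map((proxy) => ({ proxy, readyAt: 0 }))
  }

  // Returns the proxy to use and how long the caller should wait before
  // sending, given the current time in milliseconds.
  next(now = Date.now()) {
    const slot = this.slots.reduce((a, b) => (b.readyAt < a.readyAt ? b : a))
    const startAt = Math.max(now, slot.readyAt)
    slot.readyAt = startAt + this.minIntervalMs
    return { proxy: slot.proxy, waitMs: startAt - now }
  }
}

// Hypothetical pool: two proxies, at most one request per second per proxy.
const rotator = new ProxyRotator(
  ['http://proxy-a.example:8000', 'http://proxy-b.example:8000'],
  1000
)
```

With two proxies and a one-second interval, the pool tops out at two requests per second. That arithmetic is why large crawls need deep pools: total throughput is roughly pool size divided by the per-IP interval.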
Freshness
Price data is only useful if it is current. A full catalog crawl that takes three days gives you prices that are three days stale, which for repricing is worthless. Freshness forces you to crawl fast and on a schedule, which pushes concurrency and anti-bot pressure even higher.
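A quick back-of-the-envelope calculation shows how a freshness window translates into concurrency. The sketch below assumes steady-state traffic and applies Little's law (requests in flight = arrival rate × average latency); the function name and the example numbers are illustrative, not measurements.

```javascript
// Estimate the request rate and concurrency needed to crawl `pages` URLs
// inside `windowHours`, given an average per-request latency and the share
// of requests expected to need one retry.
function requiredCapacity({ pages, windowHours, avgLatencySeconds, retryRate = 0 }) {
  const totalRequests = pages * (1 + retryRate)
  const requestsPerSecond = totalRequests / (windowHours * 3600)
  // Little's law: in-flight requests = arrival rate * time each one takes.
  const concurrency = Math.ceil(requestsPerSecond * avgLatencySeconds)
  return { requestsPerSecond, concurrency }
}

// Example: 2 million pages, daily refresh, 5 s per rendered page, 5% retries.
const plan = requiredCapacity({
  pages: 2000000,
  windowHours: 24,
  avgLatencySeconds: 5,
  retryRate: 0.05,
})
```

Under those assumed numbers the crawl needs roughly 24 requests per second sustained, which means about 122 requests in flight at any moment. Halve the window and both figures double.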
Data quality
At volume, a small percentage of malformed pages, layout variants, and partial loads becomes thousands of bad rows. Site structures differ across retailers and change without notice, so extraction that assumed one layout silently returns empty fields. Without validation and monitoring, you will not notice until the bad data is already in a report.
At a hundred pages, a 5% failure rate is five retries you barely notice. At a million pages, it is fifty thousand failures that have to be detected, retried, and reconciled without losing data or double-counting. Large-scale e-commerce scraping is won or lost on retries, monitoring, and idempotency far more than on clever parsing.
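Retries are worth getting right early. One common approach, sketched here under assumed defaults, is exponential backoff with full jitter: each failed attempt waits a random time up to a ceiling that doubles, so fifty thousand failures do not all retry at the same instant. The name withRetries and its options are illustrative.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// Call `task` until it succeeds or `attempts` is exhausted. Between attempts,
// wait a random delay in [0, ceiling), where the ceiling doubles each time
// and is capped at `maxDelayMs`. `wait` is injectable so tests can skip sleeping.
async function withRetries(
  task,
  { attempts = 4, baseDelayMs = 500, maxDelayMs = 15000, wait = sleep } = {}
) {
  let lastError
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await task(attempt)
    } catch (error) {
      lastError = error
      if (attempt === attempts) break
      const ceiling = Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1))
      await wait(Math.random() * ceiling)
    }
  }
  throw lastError // every attempt failed; let the caller record the failure
}
```

Wrap a single-page fetch with it, for example withRetries(() => fetchPage(url)), and log whatever still throws so the failure can be reconciled later instead of silently dropped.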
The architecture that handles scale
A pipeline that survives millions of pages separates into four stages, each of which can scale independently: discover URLs, fetch pages, parse into structured rows, and store with validation. Treating them as one monolithic script is what makes small scrapers impossible to grow.
- URL discovery. Walk category and search pages to build the list of product URLs you need. This is a crawl in its own right and the input to everything downstream.
- Fetching. Pull each URL behind rotating IPs, with rendering when the page needs it, retries on failure, and enough concurrency to hit your freshness target.
- Parsing. Turn HTML into clean rows (name, price, currency, availability, rating, review count), either with your own selectors or with auto-parsing.
- Storage and validation. Write rows to a queryable store, validate them on the way in, and flag anomalies so quality problems surface immediately.
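The four stages above can be sketched as a pipeline whose stages are injected as functions. This is a deliberately small, single-process illustration: runPipeline and the stage names are hypothetical, and in production each hand-off between stages would typically be a queue so the stages can scale separately.

```javascript
// Wire four independent stages together. Each stage is passed in, so any one
// can be swapped (for example, fetchPage for a hosted API) without touching
// the others. Failures are counted, not thrown, so one bad page cannot stop a run.
async function runPipeline({ discover, fetchPage, parse, store }) {
  const stats = { discovered: 0, fetched: 0, stored: 0, failed: 0 }
  for await (const url of discover()) {
    stats.discovered++
    try {
      const html = await fetchPage(url) // stage 2: fetch
      stats.fetched++
      const row = parse(html, url) // stage 3: parse into a structured row
      await store(row) // stage 4: validate and persist
      stats.stored++
    } catch (error) {
      stats.failed++
    }
  }
  return stats
}
```

The returned counters are the start of monitoring: a widening gap between discovered and stored is the earliest sign that fetching or parsing has started to fail.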
The fetching stage is where most of the difficulty lives, and it is the stage Crawlbase is built to absorb. Instead of running a headless browser fleet and a proxy pool yourself, you make calls against the Crawling API and let it handle rendering, rotation, and unblocking server-side.
One fetch with the Crawling API
Here is a single synchronous fetch of a public category page, followed by a parse. The Crawling API takes your token and the target URL, routes the request through a trusted IP, and returns the HTML for you to extract from.
const { CrawlingAPI } = require('crawlbase')
const cheerio = require('cheerio')

const api = new CrawlingAPI({ token: 'YOUR_CRAWLBASE_TOKEN' })
const categoryURL = 'https://example-store.com/c/laptops?page=1'

async function scrapeCategory(url) {
  // Wait for async content, then hold 3 seconds so late-rendering prices load.
  const response = await api.get(url, { ajax_wait: true, page_wait: 3000 })
  const $ = cheerio.load(response.body)
  const products = []

  // Placeholder selectors: replace them with the ones your target page uses.
  $('.product-card').each((i, el) => {
    products.push({
      name: $(el).find('.product-title').text().trim(),
      price: $(el).find('.price').text().trim(),
      inStock: $(el).find('.availability').text().trim(),
      rating: $(el).find('.rating').attr('data-value'),
    })
  })
  return products
}

scrapeCategory(categoryURL)
  .then((rows) => console.log(rows))
  .catch((err) => console.error('Fetch failed:', err.message))
The two options matter for modern stores. ajax_wait tells the API to wait for asynchronous content, and page_wait holds for a fixed number of milliseconds so late-rendering prices appear before the HTML comes back. The selectors are placeholders: inspect your target in dev tools and map each field to a real one.
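The fetch above returns price as raw text such as "$1,299.99", which is awkward to compare or chart. A minimal normalization sketch follows. It assumes US-style formatting (comma thousands separator, dot decimal) and only three currency symbols; parsePrice is an illustrative helper, and stores that format prices differently need their own rules.

```javascript
// Convert scraped price text into { currency, amount }, or null when no
// recognizable price is present (for example "Out of stock").
function parsePrice(text) {
  const symbols = { $: 'USD', '€': 'EUR', '£': 'GBP' }
  // A currency symbol, optional space, digits with optional commas, optional decimals.
  const match = String(text).match(/([$€£])\s*(\d[\d,]*(?:\.\d+)?)/)
  if (!match) return null
  return {
    currency: symbols[match[1]],
    amount: Number(match[2].replace(/,/g, '')),
  }
}
```

Returning null rather than a guessed value matters at volume: a row with no parseable price should be flagged for review, not stored as zero.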
Skip the selectors with the Scraper API
Writing and maintaining selectors for every retailer is its own tax. The Crawlbase Scraper API returns structured JSON for supported e-commerce pages directly, so you get name, price, availability, and ratings as fields without parsing HTML yourself. At scale, fewer custom parsers means fewer things that silently break when a site changes its markup.
Large-scale e-commerce scraping needs rendering, rotation, and unblocking on every request, at volume. The Crawling API runs the page behind residential IPs server-side and hands you finished HTML or auto-parsed JSON, so you skip operating a headless fleet and a proxy pool yourself. Point it at a public category page on the free tier first.
Go asynchronous with the Crawler
Synchronous calls work for thousands of pages. For millions, blocking on each request and managing retries by hand does not scale. The async Crawler flips the model: you push URLs to it, it crawls them in the background with rotation and retries handled for you, and it delivers each result to a webhook you control as it finishes. Your code stops waiting on responses and just receives parsed pages.
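const { CrawlingAPI } = require('crawlbase')

const api = new CrawlingAPI({ token: 'YOUR_CRAWLBASE_TOKEN' })

const urls = [
  'https://example-store.com/p/sku-1001',
  'https://example-store.com/p/sku-1002',
  'https://example-store.com/p/sku-1003',
]

// Push each URL to the async Crawler. `callback: true` requests asynchronous
// delivery, and `crawler` names the Crawler (a placeholder here) whose
// configured webhook, e.g. https://your-app.com/webhook, receives the results.
// Check the parameter names against the current Crawler docs for your account.
async function queueAll(list) {
  for (const url of list) {
    await api.get(url, { callback: true, crawler: 'YOUR_CRAWLER_NAME' })
  }
}

queueAll(urls)
  .then(() => console.log('Queued', urls.length, 'URLs'))
  .catch((err) => console.error('Queueing failed:', err.message))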
This is the pattern that fits a freshness schedule: enqueue your whole URL list, let the Crawler chew through it concurrently, and process results as they land. Failed pages are retried inside the service, so your webhook handler only deals with finished work. That is the difference between a script and a pipeline.
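On the receiving side, the webhook handler can stay small. The sketch below uses only Node's built-in http and zlib modules. The header names (rid for the request ID, url for the crawled address) and the gzip handling are assumptions about the delivery format, so confirm them against the Crawler documentation before relying on them. Deliveries are de-duplicated by request ID because webhooks can be re-sent.

```javascript
const http = require('http')
const zlib = require('zlib')

const seenRequestIds = new Set() // in production, use a persistent store

// Decide what to do with one delivery. Pure function, so it is easy to test.
function handleDelivery(headers, rawBody, seen = seenRequestIds) {
  const rid = headers['rid'] // assumed header carrying the request ID
  if (!rid) return { accepted: false, reason: 'missing rid' }
  if (seen.has(rid)) return { accepted: false, reason: 'duplicate' }
  const body =
    headers['content-encoding'] === 'gzip' ? zlib.gunzipSync(rawBody) : rawBody
  seen.add(rid)
  return { accepted: true, rid, url: headers['url'], html: body.toString('utf8') }
}

// Build (but do not start) an HTTP server that hands accepted pages to `onResult`.
function createWebhookServer(onResult) {
  return http.createServer((req, res) => {
    const chunks = []
    req.on('data', (chunk) => chunks.push(chunk))
    req.on('end', () => {
      let outcome
      try {
        outcome = handleDelivery(req.headers, Buffer.concat(chunks))
      } catch (error) {
        res.writeHead(400).end() // body could not be decoded
        return
      }
      res.writeHead(200).end() // acknowledge quickly, then do the real work
      if (outcome.accepted) onResult(outcome)
    })
  })
}
```

Start it with createWebhookServer(saveRow).listen(3000), where saveRow is your own parse-and-store function. Acknowledging before processing keeps slow parsing from causing delivery timeouts.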
When you want the proxy, not the parser
Some teams already have a scraping stack and only need the unblocking layer. The Crawlbase Smart AI Proxy is a single endpoint that sits in front of your existing HTTP client or headless browser and routes traffic through a rotating residential pool, so you keep your code and gain the rotation. If you want to know how rotation works under the hood, the related piece on rotating residential proxies covers the mechanics.
Keeping data quality high at volume
Clean data at scale is a process, not a parser. A few habits keep a large crawl trustworthy.
- Validate on write. Reject or flag rows with a missing price, an unparseable currency, or a name that is empty, so bad pages do not pollute the dataset.
- Track extraction rate. If the share of pages that yield a full record drops, a site likely changed its markup. Alert on it rather than discovering it in a report.
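- Make the pipeline idempotent. Key rows on URL plus a crawl-run timestamp (the scheduled run or snapshot date, not the moment of each individual fetch) so retries and re-runs overwrite the same row instead of double-counting, which matters the moment you add retries.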
- Sample and spot-check. Pull a handful of rows against the live page periodically to confirm prices and availability still line up with reality.
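The first three habits can be sketched together in a few lines. Everything here is illustrative: the row shape, the validation rules, and the in-memory Map standing in for a real database are assumptions. Rows are keyed on URL plus a per-run crawl date (a coarse timestamp), so a retry within the same run overwrites rather than duplicates.

```javascript
// Return a list of problems; an empty list means the row is usable.
function validateRow(row) {
  const problems = []
  if (!row.name || !row.name.trim()) problems.push('empty name')
  if (typeof row.price !== 'number' || !(row.price > 0)) problems.push('missing price')
  if (!/^[A-Z]{3}$/.test(row.currency || '')) problems.push('unparseable currency')
  return problems
}

// Same URL + same crawl date = same key, so re-ingesting is an overwrite.
function rowKey(row, crawlDate) {
  return `${row.url}|${crawlDate}`
}

// Validate on write, upsert valid rows, and report the extraction rate
// (share of pages that produced a complete record) for alerting.
function ingest(rows, crawlDate, store = new Map()) {
  const flagged = []
  let complete = 0
  for (const row of rows) {
    const problems = validateRow(row)
    if (problems.length > 0) {
      flagged.push({ url: row.url, problems })
      continue
    }
    store.set(rowKey(row, crawlDate), row)
    complete++
  }
  const extractionRate = rows.length > 0 ? complete / rows.length : 0
  return { store, flagged, extractionRate }
}
```

A sensible next step is to alert when extractionRate for a retailer falls well below its recent average, since a sudden drop usually means the markup changed rather than that products vanished.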
The honest part: ToS and robots
Scraping large commercial retailers sits in a legal gray area, and whether it is allowed depends on the platform's terms of service, your jurisdiction, and what you do with the data. Many retailers restrict automated access in their terms, so scraping can run against those terms regardless of how careful your tooling is. None of the tooling here changes that; it just makes the technical part work.
A few lines worth holding to. Collect only public data: listings, prices, availability, and ratings anyone can see without an account. Respect each site's robots.txt and its stated rate expectations, and keep your request volume low enough that you are not straining anyone's servers. Never collect personal data, anything tied to individual user accounts, or anything behind a login. If you plan to reuse the data commercially, get permission or an official data agreement rather than assuming silence is consent. For the broader playbook, see how to scrape websites without getting blocked.
Key takeaways
- Scale changes the problem. Going from hundreds to millions of pages turns scraping into a reliability problem: concurrency, retries, freshness, and idempotency matter more than parsing.
- The hard parts are anti-bot and rotation. Big retailers block datacenter IPs and repetitive patterns, so a deep residential pool and smart rotation are non-negotiable at volume.
- Split the pipeline into stages. Discover URLs, fetch, parse, then store with validation, so each stage scales on its own.
- Use Crawlbase as the scale layer. The Crawling API handles rendering and unblocking, the async Crawler runs millions of URLs in the background with retries, and Smart AI Proxy gives you rotation for an existing stack.
- Validate quality continuously. Check rows on write, watch your extraction rate, and spot-check against live pages so bad data surfaces fast.
- Stay on public data. Respect ToS and robots.txt; no accounts, no personal data, no actions behind a login.
Frequently Asked Questions (FAQs)
What counts as large-scale e-commerce scraping?
There is no hard line, but the term usually means crawling product catalogs at a volume where a single script and a few proxies stop working: tens of thousands to millions of pages, refreshed on a schedule. At that point the work shifts from writing selectors to running reliable infrastructure, with concurrency, IP rotation, retries, and data validation doing most of the heavy lifting.
How do I avoid getting blocked when scraping millions of product pages?
Spread requests across a large pool of rotating residential IPs so no single address trips a rate limit, keep your per-IP rate low, and render pages when the site needs JavaScript. The Crawling API and Smart AI Proxy handle rotation and unblocking server-side; if you build your own stack, that is the part to invest in. Watch your status codes and back off the moment challenges start appearing.
Should I use the synchronous Crawling API or the async Crawler?
Use the synchronous Crawling API for interactive or smaller jobs where you want the response immediately. Use the async Crawler for large batches: you push URLs, it crawls them in the background with rotation and retries handled for you, and it pushes each finished result to your webhook. For millions of pages on a freshness schedule, the async model is what keeps your code from blocking on every request.
How do I keep scraped price data fresh?
Crawl on a schedule tight enough for your use case, which for repricing often means daily or faster. Freshness pushes up concurrency and anti-bot pressure, so a service that handles rotation and retries lets you crawl the whole catalog inside your window. Queue the full URL list to the async Crawler and process results as they land rather than waiting on a serial run.
Do I have to write parsers for every retailer?
Not necessarily. The Scraper API returns structured JSON for supported e-commerce pages, so you get name, price, availability, and ratings as fields without writing selectors. For sites it does not cover, fetch the HTML with the Crawling API and parse with a library like Cheerio. Fewer custom parsers means fewer things that break silently when a site updates its markup.
Is it legal to scrape e-commerce sites at scale?
It depends on the site's terms of service, your jurisdiction, and your purpose, and many retailers restrict automated access. Keep strictly to public listing data, respect robots.txt and rate expectations, and never touch accounts, personal data, or actions behind a login. For commercial reuse, get permission or an official data agreement rather than relying on a scraper.
