Prices on a marketplace move constantly, and a single page view only tells you what something costs right now. Price intelligence is the discipline of turning that moving target into structured data you can track: pull competitor and marketplace prices from public product pages, normalize them into clean rows, store them over time, and read the trend. This guide shows you how to use web scraping for price intelligence end to end, with runnable Python you can point at real listings today.
To keep this honest, the whole walkthrough stays on public product data: names, prices, currencies, and listing URLs that anyone can see without logging in. It does not touch user accounts, login-walled pages, checkout actions, or personal data. There is a short ToS note near the end that is not boilerplate, so read it before you run this at volume.
What price intelligence actually needs
It is easy to think of this as "scrape a price." In practice a useful price-intelligence system has four jobs, and the scraping is only the first one.
- Collect prices from the public pages you care about, reliably enough to run on a schedule.
- Normalize the messy raw values (currency symbols, thousands separators, "from" prices) into clean numbers.
- Store each observation with a timestamp so you have history, not just a snapshot.
- Analyze the history: compare across sources, compute averages, and flag moves worth acting on.
This is the same shape of problem as any ecommerce web scraping job. The difference with price intelligence is that the value lives in the time series, so the collection has to be repeatable and the data has to land somewhere you can query later.
Why collection is the hard part
If you point a bare HTTP client at a major marketplace search page, you usually get one of two disappointing results: a 200 response with almost no product data in the body, or a block. Two things work against you. Many marketplaces render their listings in the browser with JavaScript, so the initial HTML is a shell that fills in only after scripts run. And they flag automated traffic quickly: datacenter IPs and request patterns that do not look like a real browser get challenged before they ever see the rendered content.
So reliable collection needs two things in one request: a renderer for client-side pages, and an IP the platform reads as a real visitor. You can assemble that yourself with a headless browser plus a pool of residential proxies, but keeping that fleet healthy is most of the work. The Crawling API folds both into a single call. For the big marketplaces it also ships ready-made parsers, so you can skip writing selectors entirely.
The Crawling API returns the raw rendered HTML for any URL, which you then parse yourself. The Scraper API and the Crawling API's built-in scrapers go one step further: for supported sites like Amazon and eBay they return clean JSON, so there is no HTML parsing to maintain. This guide uses the built-in scrapers for collection and falls back to raw HTML when a target is not supported.
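When a target is not covered by a built-in scraper, you are left parsing the rendered HTML yourself. The sketch below shows one way to do that with only the standard library. The class name `product-price` and the sample markup are hypothetical; a real page needs whatever class or attribute that site actually uses.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every element that carries a given CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0        # > 0 while we are inside a matching element
        self.results = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1   # nested element inside a match
        elif self.wanted_class in classes:
            self.depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.results.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self.depth:
            self._buffer.append(data)


def extract_text(html, wanted_class):
    """Return the text of each element whose class list contains wanted_class."""
    parser = ClassTextExtractor(wanted_class)
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical markup, standing in for a rendered page body.
sample_html = (
    '<div class="item"><span class="product-price">$709.00</span></div>'
    '<div class="item"><span class="product-price"> $1,138.00<br></span></div>'
)
print(extract_text(sample_html, "product-price"))
```

The extracted strings are still raw price text, so they go through the same normalization as the JSON results.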
Set up the project
You need Python 3 installed. Create a directory, then install the three libraries this walkthrough uses: requests for HTTP, price_parser to normalize currency strings, and pandas for the analysis step.
```shell
mkdir price-intelligence && cd price-intelligence
python -m venv .venv && source .venv/bin/activate
pip install requests price_parser pandas
```
You also need a Crawlbase account and an API token, which you get from the dashboard after signing up. New accounts come with free requests, so you can test everything below before committing to anything. Drop the token in wherever you see YOUR_CRAWLBASE_TOKEN.
Collect prices from a marketplace
Start with the search pages you want to track. For a product like a phone, an Amazon and an eBay search for the same query give you two competing sources to compare. Because both are supported by the Crawling API's built-in scrapers, you pass a scraper parameter and get structured product JSON back instead of HTML.
```python
import requests
import urllib.parse

API_TOKEN = "YOUR_CRAWLBASE_TOKEN"
API_ENDPOINT = "https://api.crawlbase.com/"


def collect(url, scraper, country="US"):
    """Fetch one search page through the API and return its parsed product list."""
    params = {
        "token": API_TOKEN,
        "url": url,          # requests URL-encodes this value for us
        "scraper": scraper,
        "country": country,
    }
    resp = requests.get(API_ENDPOINT, params=params, timeout=90)
    resp.raise_for_status()
    return resp.json()["body"]["products"]


def search_url(host, path, query):
    """Build a marketplace search URL from a host, a path prefix, and a query."""
    q = urllib.parse.quote_plus(query)
    return f"https://www.{host}/{path}{q}"
```
The scraper parameter is what does the heavy lifting: amazon-serp and ebay-serp tell the API to return parsed product lists rather than raw markup. The country parameter routes the request through an IP in that region, which matters because prices and availability are localized. One small wrapper now drives both sources.
```python
def collect_amazon(query, country="US"):
    url = search_url("amazon.com", "s?k=", query)
    return collect(url, "amazon-serp", country)


def collect_ebay(query, country="US"):
    url = search_url("ebay.com", "sch/i.html?_nkw=", query)
    return collect(url, "ebay-serp", country)
```
Each call returns a list of product dictionaries. The shape differs by source (Amazon gives you name and a flat price string; eBay nests the current price under price.current.to), which is exactly why the next step exists.
Normalize into one clean shape
Raw price data is never analysis-ready. You get currency symbols, thousands separators, "from" ranges, and a different field layout per source. Normalize at capture so everything downstream sees the same columns: a source, a product name, a numeric price, a currency, and the listing URL. Normalizing once, here, is what keeps the storage and analysis code simple.
```python
from price_parser import Price


def to_row(source, name, raw_price, url):
    """Turn one raw listing into the common row shape, or None if it has no price."""
    parsed = Price.fromstring(raw_price or "")
    if parsed.amount is None:
        return None
    return {
        "source": source,
        "product": (name or "").strip(),
        "price": float(parsed.amount),
        "currency": parsed.currency or "",
        "url": url,
    }


def normalize(query, country="US"):
    """Collect from both sources and return a single list of uniform rows."""
    rows = []
    for item in collect_amazon(query, country):
        # Amazon results carry a flat price string
        row = to_row("Amazon", item["name"], item.get("price"), item["url"])
        if row:
            rows.append(row)
    for item in collect_ebay(query, country):
        # eBay nests the current price under price.current.to
        raw = item["price"]["current"]["to"]
        row = to_row("eBay", item["title"], raw, item["url"])
        if row:
            rows.append(row)
    return rows
```
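price_parser handles the parsing for you: it reads "£1,138.00" or "$709.00" and hands back a numeric amount plus the currency as it was written in the string, which may be a symbol such as "$" or a code such as "USD". A price-comparison job therefore never has to strip symbols or separators by hand, though you may want to map symbols to ISO codes if your sources mix the two. After this step every observation has the same fields regardless of where it came from.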
```json
[
  {
    "source": "Amazon",
    "product": "Apple iPhone 15 Pro Max 256GB",
    "price": 1138.0,
    "currency": "USD",
    "url": "https://www.amazon.com/dp/B0DGTJ6Y1S"
  },
  {
    "source": "eBay",
    "product": "Apple iPhone 15 Pro Max 256GB Blue Titanium",
    "price": 709.0,
    "currency": "USD",
    "url": "https://www.ebay.com/itm/236096139018"
  }
]
```
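If a source reports a symbol rather than an ISO code, one extra mapping step keeps the currency column uniform. This is a minimal sketch, not part of price_parser: the `SYMBOL_TO_ISO` table, the `ISO_CODES` set, and the `to_iso_currency` helper are all hypothetical names, and the table covers only a few currencies. The dollar sign is ambiguous, so the sketch takes a default you would set from the region you requested.

```python
# Hypothetical lookup tables; extend them for the currencies you actually see.
SYMBOL_TO_ISO = {"£": "GBP", "€": "EUR", "¥": "JPY"}
ISO_CODES = {"USD", "GBP", "EUR", "JPY", "CAD", "AUD"}


def to_iso_currency(raw, dollar_default="USD"):
    """Map a currency symbol or code to an ISO code, or "" if unknown.

    "$" is shared by several currencies, so the caller chooses what it
    means via dollar_default (for example, based on the request region).
    """
    value = (raw or "").strip()
    if value.upper() in ISO_CODES:
        return value.upper()
    if value == "$":
        return dollar_default
    return SYMBOL_TO_ISO.get(value, "")


print(to_iso_currency("$"))                        # USD
print(to_iso_currency("£"))                        # GBP
print(to_iso_currency("$", dollar_default="CAD"))  # CAD
```

Returning an empty string for unknown values, rather than guessing, makes unmapped currencies easy to spot in the stored data.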
Store each run with a timestamp
A single normalized list is a snapshot. Price intelligence is about the trend, so every run has to land in storage with a timestamp attached. A flat CSV with an appended captured_at column is enough to start, and it loads straight into pandas or a spreadsheet later.
```python
import csv
import os
from datetime import datetime, timezone

FIELDS = ["captured_at", "source", "product", "price", "currency", "url"]


def store(rows, path="price_history.csv"):
    """Append one run to the CSV, stamping every row with the same UTC time."""
    stamp = datetime.now(timezone.utc).isoformat()
    new_file = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()  # header only once, when the file is created
        for row in rows:
            writer.writerow({"captured_at": stamp, **row})


if __name__ == "__main__":
    # normalize() is the function defined in the previous step
    rows = normalize("Apple iPhone 15 Pro Max 256GB", country="US")
    store(rows)
    print(f"stored {len(rows)} rows")
```
Run this on a schedule (a cron job every few hours, or hourly if your tier allows it) and price_history.csv grows into a real time series. When you outgrow a flat file, write the same rows into a database table instead; the normalized shape means nothing else changes. If you are collecting across many products and regions, the asynchronous Crawler lets you push large batches of URLs and receive results via webhook rather than blocking on each request.
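The move from a flat file to a database table can be sketched with the standard library's sqlite3 module. The table name `price_history` and the `store_sqlite` helper are assumptions made for illustration; the columns mirror the CSV fields, and the sample rows are made up.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS price_history (
    captured_at TEXT NOT NULL,
    source      TEXT NOT NULL,
    product     TEXT NOT NULL,
    price       REAL NOT NULL,
    currency    TEXT NOT NULL,
    url         TEXT NOT NULL
)
"""

INSERT = """
INSERT INTO price_history (captured_at, source, product, price, currency, url)
VALUES (:captured_at, :source, :product, :price, :currency, :url)
"""


def store_sqlite(rows, captured_at, conn):
    """Insert one run of normalized rows, all sharing the same timestamp."""
    conn.execute(SCHEMA)
    conn.executemany(INSERT, [{"captured_at": captured_at, **row} for row in rows])
    conn.commit()
    return len(rows)


# Made-up rows in the normalized shape, written to an in-memory database.
demo_rows = [
    {"source": "Amazon", "product": "Example Phone", "price": 1138.0,
     "currency": "USD", "url": "https://example.com/a"},
    {"source": "eBay", "product": "Example Phone", "price": 709.0,
     "currency": "USD", "url": "https://example.com/b"},
]
conn = sqlite3.connect(":memory:")
stored = store_sqlite(demo_rows, "2024-05-01T12:00:00+00:00", conn)
cheapest = conn.execute("SELECT source, MIN(price) FROM price_history").fetchone()
print(stored, cheapest)
```

Swapping `":memory:"` for a file path persists the history between runs, and pandas can read the table back with `pd.read_sql_query`.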
Analyze: compare sources and spot moves
With history on disk, the analysis is short. Load the CSV into pandas, group by source, and compare. Here is the classic price-intelligence question: for a given product, where is it cheaper right now, and by how much?
```python
import pandas as pd

df = pd.read_csv("price_history.csv", parse_dates=["captured_at"])

# Latest run only, for a head-to-head comparison
latest = df[df["captured_at"] == df["captured_at"].max()]
by_source = latest.groupby("source")["price"].agg(["mean", "min", "count"]).round(2)
print(by_source)

# Day-over-day move per source, from the stored history
daily = df.set_index("captured_at").groupby("source")["price"]
trend = daily.resample("D").mean().dropna().round(2)  # skip days with no observations
# Compute the change within each source, so one source's first day is never
# compared against another source's last day
print(trend.groupby(level="source").pct_change().round(3))
```
The first block tells you who is cheaper today; the second turns your stored history into a daily trend and a percentage change, which is the signal you actually act on. A drop past a threshold can trigger an alert; a steady climb tells you the market is moving and your own pricing may be due for a look. Everything here is plain pandas because the hard work happened upstream in collection and normalization.
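The threshold alert described above can be sketched in plain Python once you have one mean price per day for a source. The function name `flag_moves`, the 5% default threshold, and the sample figures are illustrative assumptions, not values from any real listing.

```python
def flag_moves(daily_means, threshold=0.05):
    """Return (day, fractional_change) for each day-over-day move whose
    absolute size is at least `threshold`.

    daily_means: list of (day, mean_price) tuples, sorted by day.
    """
    alerts = []
    for (_, previous), (day, current) in zip(daily_means, daily_means[1:]):
        if previous == 0:
            continue  # a change from zero is undefined
        change = (current - previous) / previous
        if abs(change) >= threshold:
            alerts.append((day, round(change, 3)))
    return alerts


# Made-up daily means for one source.
history = [
    ("2024-05-01", 1000.0),
    ("2024-05-02", 990.0),   # -1.0%: below the threshold
    ("2024-05-03", 900.0),   # -9.1%: flagged
]
print(flag_moves(history))
```

Keeping the sign on the change lets the caller treat drops and climbs differently, for example alerting on an undercut but only logging a rise.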
Optional: layer AI on top
You do not need machine learning to do price intelligence, but two problems get easier with it once you are collecting at scale.
The first is product matching. The same item is titled differently on every site ("iPhone 15 Pro Max 256GB" vs "Apple iPhone 15 Pro Max (256 GB) Blue Titanium"), so comparing like for like means clustering listings that refer to the same product. Embedding the titles and grouping by similarity does this far better than string matching, and it is the difference between a real comparison and noise.
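To make the grouping step concrete without depending on a particular embedding model, the sketch below swaps in token-overlap (Jaccard) similarity as a stand-in for embedding similarity. It is a cruder measure than the embeddings recommended above, but the clustering logic has the same shape: score each title against a group and join it when the score clears a threshold. The function names, the 0.5 threshold, and the third sample title are assumptions.

```python
import re


def tokens(title):
    """Lowercase a title and split it into word and number tokens,
    so "256GB" and "256 GB" produce the same tokens."""
    return set(re.findall(r"[a-z]+|\d+", title.lower()))


def jaccard(a, b):
    """Share of tokens the two titles have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def group_titles(titles, threshold=0.5):
    """Greedy grouping: each title joins the first group whose first member
    is similar enough, otherwise it starts a new group."""
    groups = []
    for title in titles:
        for group in groups:
            if jaccard(group[0], title) >= threshold:
                group.append(title)
                break
        else:
            groups.append([title])
    return groups


titles = [
    "iPhone 15 Pro Max 256GB",
    "Apple iPhone 15 Pro Max (256 GB) Blue Titanium",
    "Samsung Galaxy S24 Ultra 256GB",
]
for group in group_titles(titles):
    print(group)
```

With an embedding model, `jaccard` would be replaced by cosine similarity between title vectors while `group_titles` stays the same.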
The second is anomaly detection. Over a long enough history, most price moves are normal seasonal drift. A simple rolling statistic (flag any observation more than a few standard deviations from a product's trailing mean) catches the genuine events, a sudden undercut or a pricing error, without you watching a dashboard. Start with that rule; reach for a model only when the simple version stops being enough.
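The rolling rule can be written in a few lines with the standard library. This is a sketch under assumed parameters: a 7-observation trailing window and a cutoff of 3 standard deviations are common starting points, not values from any source, and the price series is made up.

```python
from statistics import mean, stdev


def flag_anomalies(prices, window=7, z_threshold=3.0):
    """Return the indexes of prices that sit more than z_threshold standard
    deviations away from the mean of the preceding `window` observations."""
    flagged = []
    for i in range(window, len(prices)):
        trailing = prices[i - window:i]
        mu = mean(trailing)
        sigma = stdev(trailing)
        if sigma == 0:
            continue  # flat history: a z-score cannot be computed
        if abs(prices[i] - mu) / sigma > z_threshold:
            flagged.append(i)
    return flagged


# Made-up series for one product: stable around 100, with one sudden drop.
series = [100, 101, 99, 100, 102, 98, 100, 101, 60, 100]
print(flag_anomalies(series))
```

Because the window only looks backwards, the outlier itself inflates the deviation for the next few observations, which makes the rule briefly less sensitive right after an event.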
Staying unblocked at scale
Even with rendering and IPs handled by the API, a few habits keep a recurring collection job healthy, and they apply to any hard commercial target.
- Pace your requests. The Crawling API's default rate is generous for e-commerce, but hammering the same search in a tight loop still invites throttling. Spread runs out and vary your queries. If you start seeing 429s, that is the rate-limit signal.
- Lean on rotation. A pool of residential proxies spreads requests across many real-user IPs so no single address trips a limit. The API does this for you; if you build your own stack, this is the part to get right. The Smart AI Proxy exposes the same rotation as a standard proxy endpoint if you prefer that integration.
- Read the status codes. You are not charged for failed requests, so a failed crawl is cheap to retry. A run that starts returning challenges is telling you the current tier is no longer enough.
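The pacing and status-code habits above can be combined into a small retry wrapper with exponential backoff. This is a generic sketch rather than Crawlbase-specific behavior: which statuses count as retryable (429 and 5xx here), the delays, and the `fetch_with_retry` name are assumptions. The `fetch` argument is any zero-argument callable that returns a `(status_code, body)` pair, so the wrapper can be exercised without a network call.

```python
import time


def fetch_with_retry(fetch, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch() until it returns status 200, backing off between tries.

    Retries on 429 and 5xx; any other non-200 status raises immediately.
    Delays double each time: base_delay, 2 * base_delay, 4 * base_delay, ...
    """
    for attempt in range(1, max_attempts + 1):
        status, body = fetch()
        if status == 200:
            return body
        if status != 429 and status < 500:
            raise RuntimeError(f"non-retryable status {status}")
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"gave up after {max_attempts} attempts")


# Simulated responses: throttled, then a server error, then success.
responses = iter([(429, None), (503, None), (200, "ok")])
delays = []
result = fetch_with_retry(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

In real use, `fetch` would wrap the `requests.get` call and return `(resp.status_code, resp.text)`, and `sleep` would be left at its default.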
For the full playbook, see how to scrape websites without getting blocked. If your collection is growing past a few products into thousands of SKUs across regions, large-scale e-commerce scraping covers the architecture for that volume.
The honest part: ToS and public data
Scraping a large commercial marketplace sits in a legal gray area, and whether it is allowed depends on the platform's terms of service, your jurisdiction, and what you do with the data. Most marketplace terms restrict automated access, so collection can run against those terms regardless of how careful your tooling is. None of the code here changes that; it just makes the technical part work.
A few lines worth holding to. Collect only public data: product names, prices, currencies, and listing URLs that anyone can see without an account. Respect each site's robots.txt and its stated rate expectations, and keep your volume low enough that you are not straining anyone's servers. Never collect personal data, including anything tied to individual seller or buyer accounts. And if you plan to reuse the data commercially, get permission or an official data agreement rather than assuming silence is consent. This guide is scoped to public listing data on purpose, because that is the line that keeps the work defensible.
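Checking robots.txt before a run can be automated with the standard library's `urllib.robotparser`. The sketch below parses rules from a string so it works offline; the rules and the example.com URLs are made up for illustration, and the `allowed_by_robots` helper is a hypothetical name.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules: cart pages are off limits, everything else is allowed.
rules = """\
User-agent: *
Disallow: /cart
Allow: /
"""
print(allowed_by_robots(rules, "https://example.com/s?k=phone"))
print(allowed_by_robots(rules, "https://example.com/cart/view"))
```

Against a live site, `RobotFileParser.set_url()` followed by `read()` fetches the file for you. A passing check only covers robots.txt; it says nothing about the site's terms of service.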
Key takeaways
- Price intelligence is four jobs, not one. Collect, normalize, store with a timestamp, then analyze. The scraping is only the first step.
- Reliable collection needs rendering and a trusted IP. The Crawling API does both in one call, and its built-in scrapers return clean JSON for supported marketplaces so there is no HTML parsing to maintain.
- Normalize at capture. Parse currency strings into numbers once, into one shape, and every storage and analysis step stays simple.
- The value is the time series. Append each run with a captured_at stamp so you can read trends and day-over-day moves, not just a snapshot.
- AI is optional polish. Embeddings help match the same product across sites; a rolling-stat rule flags real price anomalies. Reach for them only when the simple version stops scaling.
- Stay on public data. Respect ToS and robots.txt; no accounts, no personal data.
Frequently Asked Questions (FAQs)
What is web scraping for price intelligence?
It is the practice of automatically collecting prices from public product pages, normalizing them into clean numbers, and tracking them over time so you can compare competitors and spot market moves. The scraping gathers the raw observations; the intelligence comes from storing a time series and analyzing the trend rather than reading a single snapshot.
Do I have to parse HTML to collect prices?
Not for the big marketplaces. The Crawling API's built-in scrapers (and the Scraper API) return parsed product JSON for supported sites like Amazon and eBay, so you skip selectors entirely. You only fall back to parsing raw HTML when a target site is not covered, in which case the API still hands you the rendered page to work with.
How often should I collect prices?
It depends on how fast your market moves and your request budget. Hourly is plenty for most catalogs; fast-moving categories may want more, slow ones less. Whatever the cadence, append every run with a timestamp so you build real history. Pace requests and vary queries so a recurring job does not look like a burst attack.
How do I compare the same product across different sites?
Titles differ on every marketplace, so exact string matching fails. Normalize each listing into the same fields at capture, then match products by similarity rather than identical text. For a handful of SKUs a manual mapping works; at scale, embedding the titles and clustering by similarity is the reliable approach.
Will I get blocked collecting prices at scale?
You can, if you send scraper-shaped traffic from a single IP. Keep the per-IP rate low, vary your search parameters, and route through rotating residential IPs so no one address trips a limit. The Crawling API and Smart AI Proxy manage rotation and a trusted IP pool for you; if you build your own stack, that is the part to invest in. You are not charged for failed requests, so retrying a blocked crawl is cheap.
Is it legal to scrape prices for price intelligence?
It depends on the target's terms of service, your jurisdiction, and your purpose, and most marketplace terms restrict automated access. Keep strictly to public listing data (names, prices, currencies, URLs), respect robots.txt and rate expectations, and never touch accounts or personal data. For commercial reuse, get permission or an official data agreement rather than relying on a scraper.

