Web Scraping for Price Intelligence

Prices on a marketplace move constantly, and a single page view only tells you what something costs right now. Price intelligence is the discipline of turning that moving target into structured data you can track: pull competitor and marketplace prices from public product pages, normalize them into clean rows, store them over time, and read the trend. This guide shows you how to use web scraping for price intelligence end to end, with runnable Python you can point at real listings today.

To keep this honest, the whole walkthrough stays on public product data: names, prices, currencies, and listing URLs that anyone can see without logging in. It does not touch user accounts, login-walled pages, checkout actions, or personal data. There is a short ToS note near the end that is not boilerplate, so read it before you run this at volume.

What price intelligence actually needs

It is easy to think of this as "scrape a price." In practice a useful price-intelligence system has four jobs, and the scraping is only the first one.

Collect prices from the public pages you care about, reliably enough to run on a schedule.
Normalize the messy raw values (currency symbols, thousands separators, "from" prices) into clean numbers.
Store each observation with a timestamp so you have history, not just a snapshot.
Analyze the history: compare across sources, compute averages, and flag moves worth acting on.

This is the same shape of problem as any ecommerce web scraping job. The difference with price intelligence is that the value lives in the time series, so the collection has to be repeatable and the data has to land somewhere you can query later.
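As a rough sketch of how those four jobs fit together, the skeleton below wires them up as plain functions using only the standard library. The `fake_collect` stub, the shop names, and the sample prices are invented for illustration; the real collector and a proper price parser come later in this guide.

```python
from datetime import datetime, timezone
from statistics import mean

def fake_collect():
    # Hypothetical stand-in for a real collector: raw strings as a page might show them.
    return [("ShopA", "Widget", "$1,099.00"), ("ShopB", "Widget", "$1,049.50")]

def normalize_raw(raw):
    # Strip the symbol and thousands separator; real data needs a proper parser.
    return [
        {"source": s, "product": p, "price": float(v.replace("$", "").replace(",", ""))}
        for s, p, v in raw
    ]

def store_run(rows, history):
    # Stamp every row with the same capture time so a run can be identified later.
    stamp = datetime.now(timezone.utc).isoformat()
    history.extend({"captured_at": stamp, **row} for row in rows)

def analyze(history):
    # Average price per source across everything stored so far.
    by_source = {}
    for row in history:
        by_source.setdefault(row["source"], []).append(row["price"])
    return {source: round(mean(prices), 2) for source, prices in by_source.items()}

history = []
store_run(normalize_raw(fake_collect()), history)
summary = analyze(history)
```

Each stage only depends on the shape of the previous stage's output, which is what lets you swap the in-memory list for a CSV or a database later without touching the analysis.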

Why collection is the hard part

If you point a bare HTTP client at a major marketplace search page, you usually get one of two disappointing results: a 200 response with almost no product data in the body, or a block. Two things work against you. Many marketplaces render their listings in the browser with JavaScript, so the initial HTML is a shell that fills in only after scripts run. And they flag automated traffic quickly: datacenter IPs and request patterns that do not look like a real browser get challenged before they ever see the rendered content.
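One way to notice the "200 but empty" case early is to count price-like strings in the response body. The check below is a heuristic sketch, not a detection method any platform documents: the regular expression, the `min_prices` threshold, and both sample strings are assumptions chosen for illustration.

```python
import re

# Matches strings such as "$709.00", "£1,138.00" or "€ 899".
PRICE_PATTERN = re.compile(r"[$£€]\s?\d[\d,]*(?:\.\d{2})?")

def looks_like_empty_shell(html, min_prices=3):
    """Return True when the body contains fewer price-like strings than expected."""
    return len(PRICE_PATTERN.findall(html)) < min_prices

# A JavaScript shell: the listings would only appear after scripts run.
shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
# A rendered fragment with visible prices.
rendered = "<ul><li>$709.00</li><li>$1,138.00</li><li>$899.99</li></ul>"
```

Running a check like this on every response lets a scheduled job fail loudly instead of quietly storing zero rows.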

So reliable collection needs two things in one request: a renderer for client-side pages, and an IP the platform reads as a real visitor. You can assemble that yourself with a headless browser plus a pool of residential proxies, but keeping that fleet healthy is most of the work. The Crawling API folds both into a single call. For the big marketplaces it also ships ready-made parsers, so you can skip writing selectors entirely.

Two ways to collect

The Crawling API returns the raw rendered HTML for any URL, which you then parse yourself. The Scraper API and the Crawling API's built-in scrapers go one step further: for supported sites like Amazon and eBay they return clean JSON, so there is no HTML parsing to maintain. This guide uses the built-in scrapers for collection and falls back to raw HTML when a target is not supported.
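For the raw-HTML fallback, a minimal sketch using only the standard library might look like the following. The `<span class="price">` markup is hypothetical; every real site uses its own structure, so treat the tag and class name as placeholders to replace after inspecting the page.

```python
from html.parser import HTMLParser

class PriceSpanParser(HTMLParser):
    """Collect the text of every <span class="price"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "price" in classes:
            self.in_price = True

    def handle_endtag(self, tag):
        # Simplification: assumes price spans do not contain nested spans.
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price and data.strip():
            self.prices.append(data.strip())

def extract_prices(html):
    parser = PriceSpanParser()
    parser.feed(html)
    return parser.prices

sample = '<div><span class="price">$709.00</span><span class="name">Phone</span></div>'
```

This is the code you end up maintaining per site when no built-in scraper exists, which is why the guide prefers the JSON route wherever it is available.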

Set up the project

You need Python 3 installed. Create a directory, then install the three libraries this walkthrough uses: requests for HTTP, price_parser to normalize currency strings, and pandas for the analysis step.

bash

mkdir price-intelligence && cd price-intelligence
python -m venv .venv && source .venv/bin/activate
pip install requests price_parser pandas

You also need a Crawlbase account and an API token, which you get from the dashboard after signing up. New accounts come with free requests, so you can test everything below before committing to anything. Drop the token in wherever you see YOUR_CRAWLBASE_TOKEN.
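If you would rather not paste the token into source files, one common pattern is to read it from the environment. This is a sketch; the variable name `CRAWLBASE_TOKEN` is an assumption of this example, not something the API requires.

```python
import os

def load_token(env_var="CRAWLBASE_TOKEN"):
    """Read the API token from the environment; the variable name is an assumption."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return token
```

You could then write `API_TOKEN = load_token()` in place of the hard-coded string, which keeps the token out of version control.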

Collect prices from a marketplace

Start with the search pages you want to track. For a product like a phone, an Amazon and an eBay search for the same query give you two competing sources to compare. Because both are supported by the Crawling API's built-in scrapers, you pass a scraper parameter and get structured product JSON back instead of HTML.

python

import requests
import urllib.parse

API_TOKEN = "YOUR_CRAWLBASE_TOKEN"
API_ENDPOINT = "https://api.crawlbase.com/"

def collect(url, scraper, country="US"):
    params = {
        "token": API_TOKEN,
        "url": url,
        "scraper": scraper,
        "country": country,
    }
    resp = requests.get(API_ENDPOINT, params=params, timeout=90)
    resp.raise_for_status()
    return resp.json()["body"]["products"]

def search_url(host, path, query):
    q = urllib.parse.quote_plus(query)
    return f"https://www.{host}/{path}{q}"

The scraper parameter is what does the heavy lifting: amazon-serp and ebay-serp tell the API to return parsed product lists rather than raw markup. The country parameter routes the request through an IP in that region, which matters because prices and availability are localized. A small wrapper per source now drives both.

python

def collect_amazon(query, country="US"):
    url = search_url("amazon.com", "s?k=", query)
    return collect(url, "amazon-serp", country)

def collect_ebay(query, country="US"):
    url = search_url("ebay.com", "sch/i.html?_nkw=", query)
    return collect(url, "ebay-serp", country)

Each call returns a list of product dictionaries. The shape differs by source (Amazon gives you name and a flat price string; eBay nests the current price under price.current.to), which is exactly why the next step exists.
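Because the nesting differs by source and individual listings can be missing fields, a small defensive lookup helper can keep one odd item from crashing a whole run. The helper name `dig` and both sample items are illustrative assumptions; only the price.current.to path comes from the description above.

```python
def dig(item, *keys, default=None):
    """Walk nested dictionaries, returning default as soon as a key is missing."""
    current = item
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

# Illustrative items: one complete, one with the price block missing.
ebay_item = {"title": "Phone", "price": {"current": {"to": "$709.00"}}}
incomplete_item = {"title": "Phone", "price": None}
```

Calling `dig(item, "price", "current", "to")` returns the raw price string when it exists and `None` otherwise, so the caller can skip the listing instead of raising a KeyError or TypeError.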

Normalize into one clean shape

Raw price data is never analysis-ready. You get currency symbols, thousands separators, "from" ranges, and a different field layout per source. Normalize at capture so everything downstream sees the same columns: a source, a product name, a numeric price, a currency, and the listing URL. Normalizing once, here, is what keeps the storage and analysis code simple.

python

from price_parser import Price

def to_row(source, name, raw_price, url):
    parsed = Price.fromstring(raw_price or "")
    if parsed.amount is None:
        return None
    return {
        "source": source,
        "product": name.strip(),
        "price": float(parsed.amount),
        "currency": parsed.currency or "",
        "url": url,
    }

def normalize(query, country="US"):
    rows = []
    for item in collect_amazon(query, country):
        row = to_row("Amazon", item["name"], item.get("price"), item["url"])
        if row: rows.append(row)
    for item in collect_ebay(query, country):
        raw = item["price"]["current"]["to"]
        row = to_row("eBay", item["title"], raw, item["url"])
        if row: rows.append(row)
    return rows

price_parser handles the currency parsing for you: it reads "£1,138.00" or "$709.00" and hands back a clean numeric amount plus the currency symbol it found, so a price-comparison job never has to strip symbols or separators by hand. Note that the currency field holds the symbol as written (for example "$"), not an ISO code; map it to a code yourself if your analysis needs one. After this step every observation looks the same regardless of where it came from.

json

[
  {
    "source": "Amazon",
    "product": "Apple iPhone 15 Pro Max 256GB",
    "price": 1138.0,
    "currency": "$",
    "url": "https://www.amazon.com/dp/B0DGTJ6Y1S"
  },
  {
    "source": "eBay",
    "product": "Apple iPhone 15 Pro Max 256GB Blue Titanium",
    "price": 709.0,
    "currency": "$",
    "url": "https://www.ebay.com/itm/236096139018"
  }
]
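In price_parser, the currency attribute holds the symbol or code as it appeared in the text (for example "$"), so if you want ISO 4217 codes in the currency column, a small mapping step helps. The sketch below is one possible approach; the two lookup tables are deliberately tiny and are assumptions of this example, and using the request country to resolve "$" is a heuristic rather than a guarantee.

```python
SYMBOL_TO_CODE = {"£": "GBP", "€": "EUR", "₹": "INR"}
# "$" is shared by several currencies, so fall back on the country you requested.
DOLLAR_BY_COUNTRY = {"US": "USD", "CA": "CAD", "AU": "AUD"}

def currency_code(symbol, country="US"):
    """Map a currency symbol to an ISO 4217 code, using country to resolve '$'."""
    symbol = (symbol or "").strip()
    if symbol == "$":
        return DOLLAR_BY_COUNTRY.get(country, "USD")
    if len(symbol) == 3 and symbol.isalpha():
        return symbol.upper()  # already a code such as "usd" or "EUR"
    return SYMBOL_TO_CODE.get(symbol, "")
```

Unknown symbols come back as an empty string, which makes unmapped rows easy to find and fix rather than silently mislabeled.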

Store each run with a timestamp

A single normalized list is a snapshot. Price intelligence is about the trend, so every run has to land in storage with a timestamp attached. A flat CSV with an appended captured_at column is enough to start, and it loads straight into pandas or a spreadsheet later.

python

import csv, os
from datetime import datetime, timezone

FIELDS = ["captured_at", "source", "product", "price", "currency", "url"]

def store(rows, path="price_history.csv"):
    stamp = datetime.now(timezone.utc).isoformat()
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        for row in rows:
            writer.writerow({"captured_at": stamp, **row})

if __name__ == "__main__":
    rows = normalize("Apple iPhone 15 Pro Max 256GB", country="US")
    store(rows)
    print(f"stored {len(rows)} rows")

Run this on a schedule (a cron job every few hours, or hourly if your tier allows it) and price_history.csv grows into a real time series. When you outgrow a flat file, write the same rows into a database table instead; the normalized shape means nothing else changes. If you are collecting across many products and regions, the asynchronous Crawler lets you push large batches of URLs and receive results via webhook rather than blocking on each request.
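When the flat file stops being enough, the same rows can go into a table. Here is a minimal sketch using SQLite from the standard library; the table name `price_history` and the column types are assumptions that mirror the CSV fields used earlier.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS price_history (
    captured_at TEXT NOT NULL,
    source      TEXT NOT NULL,
    product     TEXT NOT NULL,
    price       REAL NOT NULL,
    currency    TEXT,
    url         TEXT
)
"""

def store_sqlite(rows, conn):
    """Append one run of normalized rows, all sharing a single UTC timestamp."""
    stamp = datetime.now(timezone.utc).isoformat()
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO price_history VALUES "
        "(:captured_at, :source, :product, :price, :currency, :url)",
        [{"captured_at": stamp, **row} for row in rows],
    )
    conn.commit()
    return stamp
```

Pass it a connection such as `sqlite3.connect("prices.db")`; because the row shape matches the CSV version, the collection and normalization code does not change.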

Analyze: compare sources and spot moves

With history on disk, the analysis is short. Load the CSV into pandas, group by source, and compare. Here is the classic price-intelligence question: for a given product, where is it cheaper right now, and by how much?

python

import pandas as pd

df = pd.read_csv("price_history.csv", parse_dates=["captured_at"])

# Latest run only, for a head-to-head comparison
latest = df[df["captured_at"] == df["captured_at"].max()]
by_source = latest.groupby("source")["price"].agg(["mean", "min", "count"]).round(2)
print(by_source)

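# Day-over-day move per source, from the stored history
daily = df.set_index("captured_at").groupby("source")["price"]
trend = daily.resample("D").mean().round(2)
# Compute the change within each source, so the first day of one source
# is never compared against the last day of another
print(trend.groupby(level="source").pct_change().round(3))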

The first block tells you who is cheaper today; the second turns your stored history into a daily trend and a percentage change, which is the signal you actually act on. A drop past a threshold can trigger an alert; a steady climb tells you the market is moving and your own pricing may be due for a look. Everything here is plain pandas because the hard work happened upstream in collection and normalization.
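The threshold alert mentioned above can be sketched in a few lines of plain Python. The 5% default threshold and the sample dates and prices are invented for illustration; in practice the input would be the daily means for one source and product from your stored history.

```python
def flag_drops(daily_means, threshold=0.05):
    """Return (day, pct_change) pairs where the price fell by at least threshold."""
    alerts = []
    for (_, prev), (day, curr) in zip(daily_means, daily_means[1:]):
        if prev <= 0:
            continue  # skip bad data rather than divide by zero
        change = (curr - prev) / prev
        if change <= -threshold:
            alerts.append((day, round(change, 3)))
    return alerts

# Hypothetical daily mean prices for one source, oldest first.
ebay_daily = [("2024-05-01", 720.0), ("2024-05-02", 715.0), ("2024-05-03", 650.0)]
```

With this sample, `flag_drops(ebay_daily)` returns `[("2024-05-03", -0.091)]`: the small dip on the second day stays below the threshold, and the roughly 9% drop on the third day is flagged.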

Optional: layer AI on top

You do not need machine learning to do price intelligence, but two problems get easier with it once you are collecting at scale.

The first is product matching. The same item is titled differently on every site ("iPhone 15 Pro Max 256GB" vs "Apple iPhone 15 Pro Max (256 GB) Blue Titanium"), so comparing like for like means clustering listings that refer to the same product. Embedding the titles and grouping by similarity does this far better than string matching, and it is the difference between a real comparison and noise.
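To show the shape of similarity-based matching without pulling in an embedding model, the sketch below swaps in a much simpler technique: Jaccard overlap on title tokens. It is a baseline stand-in, not the embedding approach described above, and the 0.5 threshold is an assumption you would tune on your own listings.

```python
import re

def tokens(title):
    """Lowercase a title and split it into tokens, joining '256 GB' into '256gb'."""
    text = re.sub(r"(\d+)\s+(gb|tb)\b", r"\1\2", title.lower())
    return set(re.findall(r"[a-z0-9]+", text))

def jaccard(a, b):
    """Share of tokens two titles have in common, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def same_product(a, b, threshold=0.5):
    """Treat two titles as the same product when their token overlap is high enough."""
    return jaccard(a, b) >= threshold

short_title = "iPhone 15 Pro Max 256GB"
long_title = "Apple iPhone 15 Pro Max (256 GB) Blue Titanium"
```

The two example titles share 5 of 8 distinct tokens, so they score 0.625 and match. Token overlap cannot tell close variants apart (a "Pro" and a "Pro Max" title still overlap heavily), which is the gap embeddings or stricter model-number rules are meant to close.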

The second is anomaly detection. Over a long enough history, most price moves are normal seasonal drift. A simple rolling statistic (flag any observation more than a few standard deviations from a product's trailing mean) catches the genuine events, a sudden undercut or a pricing error, without you watching a dashboard. Start with that rule; reach for a model only when the simple version stops being enough.
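The rolling rule described above can be sketched with the standard library alone. The 7-observation window, the threshold of 3 standard deviations, and the sample price series are all assumptions for illustration.

```python
from statistics import mean, stdev

def flag_anomalies(prices, window=7, z_threshold=3.0):
    """Return indexes whose price is more than z_threshold deviations from the trailing mean."""
    flagged = []
    for i in range(window, len(prices)):
        trailing = prices[i - window:i]
        mu, sigma = mean(trailing), stdev(trailing)
        if sigma == 0:
            # A perfectly flat history: any change at all is unusual.
            if prices[i] != mu:
                flagged.append(i)
            continue
        if abs(prices[i] - mu) / sigma > z_threshold:
            flagged.append(i)
    return flagged

# Hypothetical price history for one product, oldest first.
history_prices = [709, 712, 705, 710, 708, 711, 707, 709, 520, 710]
```

For this series, `flag_anomalies(history_prices)` returns `[8]`, the index of the sudden 520 observation. The price after it is not flagged, because the outlier inflates the trailing standard deviation for the next few windows; that side effect is worth knowing about before relying on the rule.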

Staying unblocked at scale

Even with rendering and IPs handled by the API, a few habits keep a recurring collection job healthy, and they apply to any hard commercial target.

Pace your requests. The Crawling API's default rate is generous for e-commerce, but hammering the same search in a tight loop still invites throttling. Spread runs out and vary your queries. If you start seeing 429s, that is the rate-limit signal.
Lean on rotation. A pool of residential proxies spreads requests across many real-user IPs so no single address trips a limit. The API does this for you; if you build your own stack, this is the part to get right. The Smart AI Proxy exposes the same rotation as a standard proxy endpoint if you prefer that integration.
Read the status codes. You are not charged for failed requests, so a failed crawl is cheap to retry. A run that starts returning challenges is telling you the current tier is no longer enough.
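The pacing and retry habits above can be sketched as a small backoff wrapper. The `fetch` callable is a hypothetical function you supply that returns a (status, body) pair, and the set of retryable status codes and the delay values are assumptions of this example rather than documented API behavior.

```python
import time

# Assumed set of statuses worth retrying: rate limits and transient server errors.
RETRYABLE = {429, 500, 502, 503, 504}

def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) -> (status, body), retrying retryable statuses with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        status, body = fetch(url)
        if status == 200:
            return body
        if status not in RETRYABLE or attempt == max_attempts:
            raise RuntimeError(
                f"giving up on {url}: status {status} after {attempt} attempt(s)"
            )
        # Wait 2s, 4s, 8s, ... before the next attempt.
        sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the function easy to test, and capping the attempts means a persistently blocked target fails fast instead of looping forever.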

For the full playbook, see how to scrape websites without getting blocked. If your collection is growing past a few products into thousands of SKUs across regions, large-scale e-commerce scraping covers the architecture for that volume.

The honest part: ToS and public data

Scraping a large commercial marketplace sits in a legal gray area, and whether it is allowed depends on the platform's terms of service, your jurisdiction, and what you do with the data. Most marketplace terms restrict automated access, so collection can run against those terms regardless of how careful your tooling is. None of the code here changes that; it just makes the technical part work.

A few lines worth holding to. Collect only public data: product names, prices, currencies, and listing URLs that anyone can see without an account. Respect each site's robots.txt and its stated rate expectations, and keep your volume low enough that you are not straining anyone's servers. Never collect personal data, including anything tied to individual seller or buyer accounts. And if you plan to reuse the data commercially, get permission or an official data agreement rather than assuming silence is consent. This guide is scoped to public listing data on purpose, because that is the line that keeps the work defensible.
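Checking robots.txt can be automated with Python's standard library. The sketch below assumes you have already downloaded the file's text; the sample rules and the shop.example URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Check a URL against robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical robots.txt content for an example shop.
sample_robots = """\
User-agent: *
Disallow: /checkout/
Disallow: /account/
"""
```

Running this check before a URL enters your collection queue keeps disallowed paths out of scheduled jobs. It only covers robots.txt; the terms of service still need a human read.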

Recap

Key takeaways

Price intelligence is four jobs, not one. Collect, normalize, store with a timestamp, then analyze. The scraping is only the first step.
Reliable collection needs rendering and a trusted IP. The Crawling API does both in one call, and its built-in scrapers return clean JSON for supported marketplaces so there is no HTML parsing to maintain.
Normalize at capture. Parse currency strings into numbers once, into one shape, and every storage and analysis step stays simple.
The value is the time series. Append each run with a captured_at stamp so you can read trends and day-over-day moves, not just a snapshot.
AI is optional polish. Embeddings help match the same product across sites; a rolling-stat rule flags real price anomalies. Reach for them only when the simple version stops scaling.
Stay on public data. Respect ToS and robots.txt; no accounts, no personal data.

Frequently Asked Questions (FAQs)

What is web scraping for price intelligence?

It is the practice of automatically collecting prices from public product pages, normalizing them into clean numbers, and tracking them over time so you can compare competitors and spot market moves. The scraping gathers the raw observations; the intelligence comes from storing a time series and analyzing the trend rather than reading a single snapshot.

Do I have to parse HTML to collect prices?

Not for the big marketplaces. The Crawling API's built-in scrapers (and the Scraper API) return parsed product JSON for supported sites like Amazon and eBay, so you skip selectors entirely. You only fall back to parsing raw HTML when a target site is not covered, in which case the API still hands you the rendered page to work with.

How often should I collect prices?

It depends on how fast your market moves and your request budget. Hourly is plenty for most catalogs; fast-moving categories may want more, slow ones less. Whatever the cadence, append every run with a timestamp so you build real history. Pace requests and vary queries so a recurring job does not look like a burst attack.

How do I compare the same product across different sites?

Titles differ on every marketplace, so exact string matching fails. Normalize each listing into the same fields at capture, then match products by similarity rather than identical text. For a handful of SKUs a manual mapping works; at scale, embedding the titles and clustering by similarity is the reliable approach.

Will I get blocked collecting prices at scale?

You can, if you send scraper-shaped traffic from a single IP. Keep the per-IP rate low, vary your search parameters, and route through rotating residential IPs so no one address trips a limit. The Crawling API and Smart AI Proxy manage rotation and a trusted IP pool for you; if you build your own stack, that is the part to invest in. You are not charged for failed requests, so retrying a blocked crawl is cheap.

Is it legal to scrape prices for price intelligence?

It depends on the target's terms of service, your jurisdiction, and your purpose, and most marketplace terms restrict automated access. Keep strictly to public listing data (names, prices, currencies, URLs), respect robots.txt and rate expectations, and never touch accounts or personal data. For commercial reuse, get permission or an official data agreement rather than relying on a scraper.

Ian Kalvin

Technical Support Engineer · Crawlbase

Technical support engineer at Crawlbase, writing from the front line of what actually breaks in production scraping and proxy setups.

Neil Zamora

Senior Architect · Crawlbase

Senior architect at Crawlbase, focused on the systems behind large-scale crawling: proxy rotation, anti-bot resilience, and the APIs that hide that complexity.
