Build a Price Comparison Engine with Python

The same product almost never costs the same at two retailers. A laptop listed for one price on a marketplace might be ten percent cheaper on a brand store, and that gap moves daily. Checking each site by hand does not scale past a couple of products, which is exactly the kind of repetitive lookup a script should own.

This tutorial shows you how to build a price comparison tool in Python that pulls the same product from two or more public ecommerce pages, normalises the data into one clean shape, matches the listings across sources, stores the result, and reports the lowest price. We fetch every page through the Crawling API so rendering and IP rotation are handled for us, and we keep the whole walkthrough scoped to public catalog data: product name, price, currency, retailer, and the product URL.

What you will build

A small Python project that turns a handful of product URLs into a tidy comparison table. Each scraped listing becomes a record with a fixed set of fields, so two retailers that describe a product differently still line up. By the end you will have:

Name. The product title as the retailer lists it.
Price. A numeric value parsed out of the messy on-page text.
Currency. The currency code so you never compare dollars to euros.
Retailer. Which source the listing came from.
URL. The exact page, so a human can verify the deal.
Stored output. Every record written to JSON and CSV, plus a lowest-price summary.

Why a plain request fails on retail pages

You can get a long way with the requests library on simple sites, but modern ecommerce pages fight back in two ways. First, many of them render the price client-side: the initial HTML is a near-empty shell and JavaScript fills in the catalog after load. A plain requests.get only sees the shell, so the price you want is simply not in the text you downloaded.

Second, retailers watch for automated traffic. A few requests from one datacenter IP in a recognisable pattern are enough to get you rate-limited, CAPTCHA-walled, or blocked. To compare prices across several stores you need each page rendered and each request to arrive from a trusted, rotating address. That is the combination the Crawling API gives you in a single call, which is why we route every fetch through it instead of hitting the sites directly. If JavaScript rendering on retail pages is new to you, our guide to scraping JavaScript pages with Python covers the background.
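
To make the first failure mode concrete, here is a minimal sketch that compares two hypothetical HTML strings: the near-empty shell a plain HTTP client might receive, and the same page after JavaScript has run. The markup, the price class, and the extract_price_text helper are illustrative assumptions, not taken from any real retailer.

```python
import re

# Hypothetical shell: roughly what a plain HTTP client might get back
# from a client-rendered page before any JavaScript has run.
SHELL_HTML = (
    '<html><body><div id="root"></div>'
    '<script src="/app.js"></script></body></html>'
)

# Hypothetical rendered version of the same page.
RENDERED_HTML = (
    '<html><body><div id="root">'
    '<h1 class="product-title">Smartphone XYZ</h1>'
    '<span class="price">$279.99</span>'
    '</div></body></html>'
)

# Deliberately naive pattern, used only to show whether the price is present.
PRICE_PATTERN = re.compile(r'class="price">([^<]+)<')


def extract_price_text(html):
    """Return the raw price text if the markup contains it, else None."""
    match = PRICE_PATTERN.search(html)
    return match.group(1) if match else None


print(extract_price_text(SHELL_HTML))     # None: the price is not in the shell
print(extract_price_text(RENDERED_HTML))  # $279.99
```

No parser can recover a price that was never in the downloaded text, which is why the fetch step has to return rendered HTML.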

Prerequisites

Three things before you write code.

Python 3.8 or later. Check with python --version. Any recent 3.x release works; download the current version from the official Python site if you need it.

A Crawlbase account and token. Sign up for a free account and copy your token from the dashboard. You get a normal token (for static HTML) and a JavaScript token (for client-rendered pages); keep both handy.

Basic Python. You should be comfortable running a script from the terminal, installing packages with pip, and reading a dictionary. Familiarity with HTML and CSS selectors helps, since you will adapt them to the pages you target.
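
One way to keep both tokens handy without hard-coding them is to read them from environment variables. The sketch below assumes the variable names CRAWLBASE_TOKEN and CRAWLBASE_JS_TOKEN, which are made up for this example; use whatever names suit your setup.

```python
import os


def load_tokens(env=None):
    """Read the normal and JavaScript tokens from environment variables.

    The variable names are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    tokens = {
        "normal": env.get("CRAWLBASE_TOKEN"),
        "javascript": env.get("CRAWLBASE_JS_TOKEN"),
    }
    # Fail early with a clear message instead of sending an empty token.
    missing = [kind for kind, value in tokens.items() if not value]
    if missing:
        raise RuntimeError(f"Missing token(s): {', '.join(missing)}")
    return tokens
```

Calling load_tokens() once at startup keeps secrets out of the script and out of version control.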

Set up the project

Create a folder and install the dependencies. We use the official crawlbase client for fetching, beautifulsoup4 for parsing HTML, and pandas for storing and displaying the comparison. The json and csv modules are built in, so there is nothing to install for them.

bash

mkdir price-comparison && cd price-comparison

pip install crawlbase beautifulsoup4 pandas

If you want to refresh on the parsing side specifically, our walkthrough of how to use BeautifulSoup in Python goes deeper than the snippets below.

Step 1: Fetch the first source

Start with a single page from one retailer. The Crawling API client takes your token, fetches the URL behind a rotating residential IP, and hands back the finished HTML. For a JavaScript-rendered page, pass ajax_wait and page_wait so the API waits for async content and holds a moment after load before capturing. Use your JavaScript token for those pages and your normal token for plain static ones.

python

from crawlbase import CrawlingAPI

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

def fetch_page(url):
    response = api.get(url, {"ajax_wait": "true", "page_wait": 3000})
    if response["status_code"] == 200:
        return response["body"].decode("utf-8")
    raise RuntimeError(f"Crawl failed: {response['status_code']}")

html = fetch_page("https://store-a.example.com/product/smartphone-xyz")
print(html[:500])

Running this prints the first 500 characters of rendered HTML. If you see the real page markup rather than an empty shell, the fetch worked and the API handled both rendering and the IP for you. The status_code check keeps a soft failure from flowing downstream as if it were good HTML.
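
A failed crawl is often transient, so you may want to retry before giving up. The following is a sketch, not part of the Crawlbase client: fetch_with_retry is a hypothetical helper that wraps any function shaped like fetch_page (returns HTML or raises RuntimeError) and backs off between attempts. The attempt count and delays are arbitrary defaults.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on RuntimeError with exponential backoff.

    `fetch` is any callable shaped like fetch_page. `sleep` is injectable
    so tests can skip the real waiting.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except RuntimeError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # waits 1s, 2s, 4s, ...
```

With this in place, fetch_with_retry(fetch_page, url) can stand in for a direct fetch_page(url) call.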

Crawlbase Crawling API

That single fetch_page call you just ran did the two hard jobs for you: it rendered the JavaScript so the price was actually in the HTML, and it sent the request through a rotating residential IP so the retailer did not block you. You skip standing up a headless browser fleet and a proxy pool of your own, which matters the moment you fetch from a second and third store on a schedule. Point it at the pages that were coming back empty.

Step 2: Fetch the second source

A comparison needs at least two retailers. The fetch logic does not change: the same fetch_page works for any URL, because the API handles whatever each site throws at it. What changes is the parsing, since every retailer marks up its title and price differently. Write one small parser per source that reads the page and returns a normalised record.

python

import re
from bs4 import BeautifulSoup

def parse_price(text):
    # Pull a number out of strings like "$279.99" or "1,299.00 USD"
    match = re.search(r"[\d,]+\.?\d*", text.replace(",", ""))
    return float(match.group()) if match else None

def parse_store_a(html, url):
    soup = BeautifulSoup(html, "html.parser")
    return {
        "name": soup.select_one("h1.product-title").get_text(strip=True),
        "price": parse_price(soup.select_one(".price").get_text()),
        "currency": "USD",
        "retailer": "Store A",
        "url": url,
    }

def parse_store_b(html, url):
    soup = BeautifulSoup(html, "html.parser")
    return {
        "name": soup.select_one("#productName").get_text(strip=True),
        "price": parse_price(soup.select_one("span.amount").get_text()),
        "currency": "USD",
        "retailer": "Store B",
        "url": url,
    }

Both parsers return the same five keys, which is the whole point: name, price, currency, retailer, and url. The selectors (h1.product-title, .price, #productName, span.amount) are placeholders. Open each target in your browser's dev tools, find the elements that actually hold the title and price, and swap them in. The parse_price helper does the dirty work of turning "$279.99" into the number 279.99 so prices are comparable.
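
Both parsers above hard-code "USD". If your sources span more than one market, a small helper can guess the currency from the same price text. This is a rough sketch under stated assumptions: the symbol table and code list are illustrative and incomplete, detect_currency is a hypothetical name, and "$" is treated as USD even though several currencies share that symbol.

```python
import re

# Assumed mapping for this sketch; extend it for the markets you track.
# "$" is ambiguous (USD, CAD, AUD, ...), so it falls back to USD here.
SYMBOL_TO_CODE = {"$": "USD", "€": "EUR", "£": "GBP"}

# An explicit three-letter code that is not part of a longer word.
CODE_PATTERN = re.compile(r"(?<![A-Z])(USD|EUR|GBP|CAD|AUD)(?![A-Z])")


def detect_currency(text, default=None):
    """Guess a currency code from on-page price text, or return default."""
    # An explicit code such as "1,299.00 USD" wins over a bare symbol.
    match = CODE_PATTERN.search(text.upper())
    if match:
        return match.group(1)
    for symbol, code in SYMBOL_TO_CODE.items():
        if symbol in text:
            return code
    return default


print(detect_currency("$279.99"))       # USD
print(detect_currency("1,299.00 USD"))  # USD
print(detect_currency("€249.00"))       # EUR
```

When a retailer only ever sells in one currency, keeping the hard-coded value in its parser is simpler and less error-prone than guessing.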

Selectors drift

Retailers redesign often, and a selector that worked last month can quietly start returning None. Treat per-source selectors as something you maintain. When a price comes back empty, re-inspect the live page and update the one parser, not the rest of the pipeline.
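
A cheap way to notice drift early is to validate every record before it is stored. The sketch below is one possible check, with find_problems as a hypothetical helper name: it flags missing or empty fields and prices that are not positive numbers, so a broken selector shows up as a named problem rather than a silent gap.

```python
REQUIRED_FIELDS = ("name", "price", "currency", "retailer", "url")


def find_problems(record):
    """Return a list of human-readable problems with one normalised record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            problems.append(f"missing or empty field: {field}")
    price = record.get("price")
    # A parsed price should be a positive number.
    if price is not None and (not isinstance(price, (int, float)) or price <= 0):
        problems.append(f"suspicious price: {price!r}")
    return problems
```

An empty list means the record looks sane; anything else tells you which retailer's parser to re-inspect.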

Step 3: Normalise and match across sources

Now wire fetching and parsing together. Map each URL to the parser that understands it, build a list of normalised records, and group records that describe the same product so you can compare like with like. The simplest reliable match is an explicit product key you assign per URL, rather than guessing from slightly different titles.

python

# Each entry: (product key, retailer URL, parser)
sources = [
    ("Smartphone XYZ", "https://store-a.example.com/product/smartphone-xyz", parse_store_a),
    ("Smartphone XYZ", "https://store-b.example.com/p/smartphone-xyz", parse_store_b),
]

def collect(sources):
    products = {}
    for key, url, parser in sources:
        try:
            record = parser(fetch_page(url), url)
        except Exception as err:
            print(f"Skipping {url}: {err}")
            continue
        products.setdefault(key, []).append(record)
    return products

The result is a dictionary keyed by product, with a list of per-retailer records under each key: a product name mapped to a list of stores and prices, where every record is scraped live and normalised rather than read from a static file. Because each fetch and parse is wrapped in try/except, one dead URL is logged and the run continues instead of crashing halfway through.

Step 4: Compare and store the results

With matched records in hand, finding the lowest price is a short loop, and storing the output is a few lines. The comparison logic is deliberately simple: walk each product's records and keep the cheapest, skipping any record whose price failed to parse.

python

def find_lowest_price(records):
    lowest_price = float("inf")
    best = None
    for record in records:
        if record["price"] is not None and record["price"] < lowest_price:
            lowest_price = record["price"]
            best = record
    return best
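
find_lowest_price compares raw numbers, which is only meaningful when every record shares a currency. If your sources mix currencies, one cautious option is to keep a separate winner per currency instead of converting. The sketch below shows that idea; lowest_per_currency is a hypothetical helper, not part of the script above.

```python
def lowest_per_currency(records):
    """Return {currency: cheapest record}, ignoring records with no price."""
    best = {}
    for record in records:
        price = record["price"]
        if price is None:
            continue  # the parser could not read a price for this listing
        currency = record["currency"]
        if currency not in best or price < best[currency]["price"]:
            best[currency] = record
    return best
```

For an all-USD comparison this returns a single entry that matches what find_lowest_price reports.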

Then assemble the full script. It collects every source, prints the best deal per product, and writes the flat list of records to both JSON and CSV with pandas so you can open the comparison in a spreadsheet or feed it to another tool.

python

import json
import pandas as pd

def main():
    products = collect(sources)
    rows = []

    for name, records in products.items():
        rows.extend(records)
        best = find_lowest_price(records)
        if best:
            print(f"Product: {name}")
            print(f" - Best Price: {best['currency']} {best['price']} at {best['retailer']}")

    # Store the full comparison as JSON and CSV
    with open("comparison.json", "w") as f:
        json.dump(rows, f, indent=2)

    pd.DataFrame(rows).to_csv("comparison.csv", index=False)
    print(f"Saved {len(rows)} records.")

if __name__ == "__main__":
    main()

Save the snippets from Steps 1 to 4 in a single file, for example compare.py (you can drop the two demo lines at the end of Step 1), and run it with python compare.py. You get a best-price line per product in the terminal and two files on disk: comparison.json for programmatic use and comparison.csv for a spreadsheet. If you would rather store the history of prices over time, swap the CSV write for an SQLite insert with the built-in sqlite3 module; the record shape stays the same, so nothing upstream changes.
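
The SQLite variant could look roughly like this. It is a sketch under assumptions: the prices table, its column layout, and the fetched_at timestamp column are choices made for this example, and save_to_sqlite is a hypothetical helper.

```python
import sqlite3
from datetime import datetime, timezone


def save_to_sqlite(conn, rows):
    """Append records to a price-history table, stamping each with the run time."""
    fetched_at = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            """CREATE TABLE IF NOT EXISTS prices (
                   name TEXT, price REAL, currency TEXT,
                   retailer TEXT, url TEXT, fetched_at TEXT
               )"""
        )
        conn.executemany(
            "INSERT INTO prices VALUES "
            "(:name, :price, :currency, :retailer, :url, :fetched_at)",
            [{**row, "fetched_at": fetched_at} for row in rows],
        )
```

Open the connection once with sqlite3.connect("prices.db") (the file name is an assumption) and call save_to_sqlite(conn, rows) at the end of main. Each run then adds rows instead of overwriting the previous result.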

What the output looks like

The JSON file holds one object per scraped listing, all sharing the five normalised fields. For two retailers selling the same phone it looks like this:

json

[
  {
    "name": "Smartphone XYZ",
    "price": 299.99,
    "currency": "USD",
    "retailer": "Store A",
    "url": "https://store-a.example.com/product/smartphone-xyz"
  },
  {
    "name": "Smartphone XYZ",
    "price": 279.99,
    "currency": "USD",
    "retailer": "Store B",
    "url": "https://store-b.example.com/p/smartphone-xyz"
  }
]

And the terminal summary, driven by find_lowest_price, reads:

bash

Product: Smartphone XYZ
 - Best Price: USD 279.99 at Store B
Saved 2 records.

The CSV holds the same rows in flat form, one header line then one line per listing, which opens straight into any spreadsheet for sorting and charting.
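
If you would rather not pull in pandas just to write that file, the built-in csv module produces the same flat layout. This is a minimal sketch; write_csv and the FIELDS list are names introduced for the example.

```python
import csv
import io

FIELDS = ["name", "price", "currency", "retailer", "url"]


def write_csv(rows, handle):
    """Write normalised records to an open text file handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Demo against an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_csv(
    [{"name": "Smartphone XYZ", "price": 279.99, "currency": "USD",
      "retailer": "Store B", "url": "https://store-b.example.com/p/smartphone-xyz"}],
    buffer,
)
print(buffer.getvalue())
```

When writing to a real file, open it with newline="" as the csv documentation recommends, so rows are not separated by blank lines on Windows.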

Scaling to more products and sources

The design scales by adding rows to the sources list, not by rewriting logic. To track ten products across four retailers, add a parser per new retailer and one tuple per (product, retailer) pair. Run the script on a schedule (a cron job once or twice a day is plenty for price tracking) and append each run to a dated file or an SQLite table to build price history.
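
Once the list grows, writing every tuple by hand gets tedious. One possible approach, sketched below, is to keep a nested mapping of product to retailer to URL and expand it into the tuples that collect expects. The catalog layout and the build_sources helper are assumptions for illustration.

```python
def build_sources(catalog, parsers):
    """Expand {product: {retailer: url}} into (product key, url, parser) tuples.

    `parsers` maps each retailer name to its parser function.
    """
    sources = []
    for product, urls in catalog.items():
        for retailer, url in urls.items():
            # A KeyError here means a retailer was added without a parser.
            sources.append((product, url, parsers[retailer]))
    return sources
```

The catalog itself can then live in a JSON or YAML file, so adding a product never touches the code.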

A few habits keep it healthy at volume. Pace your requests so you are not fetching dozens of pages in a tight burst. Handle each fetch in isolation, which the try/except in collect already does, so one failing page never kills the run. And cache fetched HTML to disk while you iterate on selectors, so you are not re-hitting live sites on every code change. If a page later changes from static to JavaScript-rendered, the Crawling API absorbs that for you, since rendering is just an option on the same call. For pages that match a known layout, the Crawling API can auto-parse the product fields and return structured JSON directly, which lets you skip writing a per-source parser when a supported template fits.
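
The caching and pacing habits can be combined in one small wrapper. The sketch below assumes a local html_cache folder and a two-second pause, both arbitrary, and cached_fetch is a hypothetical helper that accepts any function shaped like fetch_page.

```python
import hashlib
import time
from pathlib import Path


def cached_fetch(url, fetch, cache_dir="html_cache", delay=2.0, sleep=time.sleep):
    """Return HTML for url, reading from disk when a cached copy exists.

    Only real fetches are followed by a pause, so cache hits stay instant.
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    # Hash the URL to get a safe, fixed-length file name.
    path = cache / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    path.write_text(html, encoding="utf-8")
    sleep(delay)  # pace live requests
    return html
```

Use it while you iterate on selectors, and clear the folder (or bypass the wrapper) for scheduled runs, where you always want fresh prices.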

Building a price tracker responsibly

A price comparison tool only touches the same public catalog pages a shopper would, but that does not make anything fair game. Read each retailer's terms of service and check its robots.txt before you point a loop at it, and keep your request volume modest: a couple of pulls a day per product is enough to track prices and is far less intrusive than constant hammering. Stick to public product data (name, price, currency, URL) and stay out of anything behind a login, anything personal, and copyrighted media you would redistribute.
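
The robots.txt check can be automated with the standard library's urllib.robotparser. The sketch below parses rules supplied as text so it runs without network access; the example rules and the is_allowed helper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Check a URL against robots.txt rules supplied as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /cart
Disallow: /account
"""

print(is_allowed(EXAMPLE_ROBOTS, "https://store-a.example.com/product/smartphone-xyz"))  # True
print(is_allowed(EXAMPLE_ROBOTS, "https://store-a.example.com/cart/checkout"))           # False
```

Against a live site, RobotFileParser can load the file itself through its set_url and read methods. A passing check does not replace reading the retailer's terms of service.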

Where a retailer offers an official product API or an affiliate data feed, prefer it. Those channels are built for exactly this use, usually return cleaner structured data than scraping, and keep you on the right side of the relationship. Treat scraping as the fallback for sources that have no such option, and you get a tool that is both reliable and considerate of the sites it depends on. For the broader anti-block playbook, our guide on how to scrape websites without getting blocked goes into the details.

Recap

Key takeaways

Normalise before you compare. Map every retailer's page into the same five fields (name, price, currency, retailer, URL) so different markup still lines up.
One parser per source, one shared pipeline. Fetching is identical across sites; only the per-source selectors change, which keeps the rest of the code stable.
The Crawling API handles rendering and IPs. A single fetch_page call returns rendered HTML behind a rotating residential IP, so you skip a headless browser fleet and a proxy pool.
Store flat, in JSON and CSV. A flat record list serialises cleanly to both, and swaps to SQLite when you want price history, with no upstream changes.
Scale by data, not code. Add products and retailers by extending the sources list and running on a schedule, not by rewriting the comparison logic.

Frequently Asked Questions (FAQs)

What is a price comparison tool?

A price comparison tool collects the price of the same product from several retailers and shows you which is cheapest. In code terms, it scrapes each retailer's public product page, normalises the data into one shape, matches listings that describe the same item, and reports the lowest price. The Python project in this tutorial does exactly that across two or more sources.

Why use the Crawling API instead of plain requests?

Plain requests only downloads the initial HTML, so it misses prices that load client-side with JavaScript, and a single IP gets blocked once you fetch from several stores repeatedly. The Crawling API renders the page and routes each request through a rotating residential IP in one call, so the price is actually in the HTML you parse and you stay unblocked across sources.

How do I match the same product across different retailers?

The most reliable approach is an explicit product key you assign per URL, as the sources list does in this tutorial, rather than trying to match slightly different product titles automatically. For larger catalogs you can match on a shared identifier like a model number, GTIN, or SKU when the retailers expose one, which is more dependable than fuzzy title matching.
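
As a rough illustration of identifier-based matching, the sketch below groups records on a gtin field and uses the standard GS1 check digit to discard mistyped codes. The gtin field is an assumption: the five-field records in this tutorial do not include it, so a parser would have to extract it from pages that expose one.

```python
from collections import defaultdict


def is_valid_gtin(code):
    """Validate a GTIN-8/12/13/14 using the GS1 mod-10 check digit."""
    if not (code.isascii() and code.isdigit()) or len(code) not in (8, 12, 13, 14):
        return False
    digits = [int(ch) for ch in code]
    body, check = digits[:-1], digits[-1]
    # Weights alternate 3, 1, 3, ... starting from the digit next to the check digit.
    total = sum(d * (3 if i % 2 == 0 else 1) for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check


def group_by_gtin(records):
    """Group records on a hypothetical 'gtin' field, skipping invalid codes."""
    groups = defaultdict(list)
    for record in records:
        gtin = str(record.get("gtin", "")).strip()
        if is_valid_gtin(gtin):
            groups[gtin].append(record)
    return dict(groups)
```

Records without a valid code are left out of the result, so you can still fall back to an explicit product key for those.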

Should I store the data as JSON or CSV?

Use both, since they serve different needs. JSON preserves the full nested structure and is easy to read back into another script, while CSV is flat and opens directly in a spreadsheet for sorting and charting. The script in this tutorial writes both from the same record list. For tracking prices over time, move to SQLite so you can query history without juggling many files.

Can I track prices automatically over time?

Yes. Run the script on a schedule, for example a cron job once or twice a day, and append each run's records to a dated file or an SQLite table. Modest, regular pulls are enough to spot price drops and build a history without hammering any retailer. Keep the cadence light and respect each site's terms.
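
Once you have two runs stored, spotting a drop is a comparison between them. The sketch below matches listings on URL and returns those whose price went down; find_price_drops and the previous_price key are names invented for this example.

```python
def find_price_drops(previous, current):
    """Compare two runs and return listings whose price went down.

    Listings are matched on URL, which identifies one retailer page.
    """
    old_prices = {r["url"]: r["price"] for r in previous if r["price"] is not None}
    drops = []
    for record in current:
        old = old_prices.get(record["url"])
        new = record["price"]
        if old is not None and new is not None and new < old:
            drops.append({**record, "previous_price": old})
    return drops
```

Feed it the records from yesterday's file or table and today's run, then print or email whatever it returns.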

Do I need a headless browser like Selenium for this?

Not when you fetch through the Crawling API. A headless browser such as Selenium or Playwright can render JavaScript pages, but each instance is a full browser that costs memory and CPU, and you still have to manage proxies yourself at scale. The Crawling API renders the page server-side and rotates IPs for you, so a plain Python script with BeautifulSoup is enough.

Hassan Rehan

Software Engineer · Crawlbase

Software engineer at Crawlbase writing hands-on guides on rotating proxies, scraping, and the practical details of wiring proxies into real code.
