Plenty of pages are worth watching for changes: a competitor's pricing page, a product's stock status, a policy or terms document, a job board, a release-notes page. The information is public, but the change is the signal you actually care about, and refreshing a tab by hand does not scale past a page or two. What you want is a script that checks for you and tells you only when something is different.

This guide shows you how to build a website change tracker in Python the reliable way. You build a small, runnable tool that fetches a page through the Crawling API, extracts the meaningful text, computes a SHA-256 fingerprint, stores each snapshot on disk, compares the new fingerprint against the last one to detect a change, and runs the whole check on a schedule. The approach is generic: it works on any public page you point it at, not one specific site.

What you will build

A Python script that takes one or more public URLs, retrieves each page through the Crawling API, reduces it to comparable text, fingerprints that text, and reports whether the page changed since the last run. Each component is a small function so you can read and reuse it. The pieces are:

  • Fetcher: retrieves the page HTML through the Crawling API so blocks and JavaScript rendering are handled for you.
  • Extractor: strips scripts, styles, navigation, and footers, leaving the readable body text.
  • Fingerprint: a SHA-256 hash of the cleaned text, so one changed word produces a completely different value.
  • Store: two JSON files, one mapping each URL to its last fingerprint and one keeping the last extracted text for diffing.
  • Comparator: loads the previous fingerprint, compares it with the new one, and reports changed or unchanged.
  • Scheduler: a loop with a sleep interval, or a cron entry, so the check runs on its own.

Why a plain request is not enough

You could write this with a bare HTTP client and skip an API entirely, and on a simple static page it would even work. The trouble starts on real targets. Many sites throttle or block automated requests outright: a datacenter IP hitting the same URL on a fixed interval is an easy pattern to flag, and a monitor is automated traffic by definition. Other pages build their content in the browser with JavaScript, so the raw HTML a plain request returns is a near-empty shell, and your fingerprint ends up tracking the shell instead of the content you meant to watch.

So a dependable tracker needs two things from each fetch: an IP the site reads as a real visitor, and, when the page is client-rendered, a browser that runs the scripts before the HTML comes back. You can assemble that yourself with a headless browser and a pool of rotating residential proxies, but keeping that stack healthy is most of the work. The Crawling API folds both into one call: send it the URL, optionally a JavaScript token, and it returns finished HTML for you to fingerprint.

There is a second reason the comparison itself has to be careful, separate from how you fetch. Raw HTML changes constantly in ways you do not care about: inline scripts, ad slots, embedded timestamps, CSRF tokens, dynamic widgets. If you hash the raw response you get a false positive on almost every run. Reducing the page to its readable text first is what makes the fingerprint a real signal instead of noise.
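To make the false-positive problem concrete, here is a minimal sketch that hashes two responses whose only differences are an inline script value and a CSRF token. It uses the standard-library html.parser as a stand-in for a real extractor so it runs with no dependencies; the VisibleText class, the visible_text and sha helpers, and both sample pages are made up for illustration.

```python
import hashlib
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def visible_text(html: str) -> str:
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so layout-only differences disappear.
    return " ".join(" ".join(parser.parts).split())


def sha(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Two made-up responses: same visible content, different per-request noise.
page_a = ('<html><body><script>var loadedAt = 1700000000;</script>'
          '<input type="hidden" name="csrf" value="abc123">'
          '<p>Pro plan: $49 per month</p></body></html>')
page_b = ('<html><body><script>var loadedAt = 1700000060;</script>'
          '<input type="hidden" name="csrf" value="xyz789">'
          '<p>Pro plan: $49 per month</p></body></html>')

print(sha(page_a) == sha(page_b))                              # False: raw HTML differs
print(sha(visible_text(page_a)) == sha(visible_text(page_b)))  # True: content is the same
```

The raw hashes disagree even though a reader would see the same page, while the hashes of the extracted text agree.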

Prerequisites

A few things need to be in place first. None of them take long.

Basic Python. You should be comfortable writing and running a script and installing packages with pip. If BeautifulSoup is new to you, our guide to using BeautifulSoup in Python covers the parsing this tutorial assumes.

Python 3.10 or later. Confirm your version with python --version. The code uses the str | None type-hint syntax, which needs 3.10. If you do not have it, install it from python.org.

A Crawlbase account and token. Sign up, open your dashboard, and copy your token from the account docs page. The free tier includes 1,000 requests, which is plenty to test a tracker. Treat the token like a password and keep it out of version control: the code below reads it from the CRAWLBASE_TOKEN environment variable.

Set up the project

Create a virtual environment so dependencies stay isolated, then install the two third-party libraries. The hashing, storage, scheduling, and diffing all come from the Python standard library (hashlib, json, time, and difflib), so there is nothing extra to install for those.

bash
python --version

python -m venv tracker_env
source tracker_env/bin/activate

pip install requests beautifulsoup4

On Windows, activate the environment with tracker_env\Scripts\activate instead of the source line. Two dependencies do the work: requests sends the HTTP call to the Crawling API, and beautifulsoup4 parses the returned HTML so you can pull out the readable text.

Step 1: Fetch the page through the Crawling API

Start by confirming you can retrieve the page at all. The function below reads your token, builds the Crawling API request URL with the target page URL-encoded into it, sends the request, and returns the HTML. Checking the response status keeps failures loud instead of silent.

python
import os
from urllib.parse import quote
import requests

CRAWLBASE_API_URL = "https://api.crawlbase.com"

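def fetch_page(url: str, token: str | None = None) -> str:
    # Prefer an explicit token, else fall back to the environment variable.
    api_token = token or os.environ.get("CRAWLBASE_TOKEN", "")
    if not api_token:
        raise ValueError("Set CRAWLBASE_TOKEN or pass token=")
    # safe="" percent-encodes "/" too, so the whole target URL is encoded.
    api_url = f"{CRAWLBASE_API_URL}/?token={api_token}&url={quote(url, safe='')}"
    response = requests.get(api_url, timeout=30)  # do not hang on a stalled request
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.text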

if __name__ == "__main__":
    html = fetch_page("https://example.com")
    print(html[:300])

Run this with your token set (export CRAWLBASE_TOKEN="your_token") and you should see the first few hundred characters of real page HTML. That single confirmation matters: it proves the request is reaching the page and coming back with content before you build anything on top of it. The timeout=30 and raise_for_status() calls are deliberate, and the error-handling section later builds on both. If the target renders its content with JavaScript, use your JavaScript token in place of the standard one so the page is rendered before the HTML is returned.
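The URL-encoding step is easy to get wrong when the target has its own query string. The sketch below builds the same request URL with urllib.parse.urlencode and decodes it again to confirm the target survives intact. The build_api_url helper, the DEMO_TOKEN value, and the sample target are illustrative only; the token and url parameter names are the ones used in fetch_page.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

CRAWLBASE_API_URL = "https://api.crawlbase.com"


def build_api_url(target_url: str, token: str) -> str:
    # urlencode percent-encodes reserved characters in each value, so the
    # "?", "&", and "=" inside the target cannot be mistaken for API parameters.
    query = urlencode({"token": token, "url": target_url})
    return f"{CRAWLBASE_API_URL}/?{query}"


target = "https://example.com/search?q=widgets&page=2"
api_url = build_api_url(target, "DEMO_TOKEN")
print(api_url)

# Decoding the query string recovers the original target exactly.
decoded = parse_qs(urlsplit(api_url).query)
print(decoded["url"][0] == target)  # True
```

Without encoding, the &page=2 part of the target would be read as a separate parameter of the API call and the wrong page would be fetched.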

Crawlbase Crawling API

That first fetch_page call returned real HTML without you managing a single proxy. The Crawling API handles the blocking, throttling, and CAPTCHA challenges that sink a bare request loop, and rotates through trusted IPs server-side, so a long-running monitor keeps getting clean HTML to fingerprint instead of getting flagged. Add a JavaScript token and it renders client-side pages before returning them. Point it at a public page on the free tier first.

Step 2: Extract and fingerprint the content

Comparing raw HTML is unreliable, so the next step reduces the page to readable text and then hashes it. The extractor loads the HTML into BeautifulSoup, drops the elements that change without meaning anything (script, style, nav, footer), pulls the visible text, and collapses whitespace so cosmetic reflows do not register as changes.

python
import hashlib
from bs4 import BeautifulSoup

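def extract_monitorable_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements that churn between requests without changing the content.
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    text = soup.get_text(separator=" ", strip=True)
    # Collapse runs of whitespace so layout-only reflows hash identically.
    return " ".join(text.split())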

def content_fingerprint(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

The fingerprint is a SHA-256 hash of that cleaned text. A hash is a fixed-length string derived from the input, and any change to the input, even a single character, produces a completely different output. That property is exactly what a tracker wants: instead of storing and byte-comparing whole pages, you store one 64-character string per URL and compare those. The comparison is fast, the storage is tiny, and small edits are still caught. Pair this with our general Python scraping guide if you want to extend the extractor to target a specific region of the page rather than the whole body.
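As a quick illustration of that property, the sketch below fingerprints two strings that differ by a single character. It repeats content_fingerprint so the block runs on its own; the sample pricing strings are made up.

```python
import hashlib


def content_fingerprint(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


before = content_fingerprint("Pro plan: $49 per month")
after = content_fingerprint("Pro plan: $59 per month")

print(len(before), len(after))  # 64 64: the length never depends on the input
print(before == after)          # False: one changed character, different hash

# The same input always gives the same fingerprint, which is what makes
# "no change" detectable across runs.
print(content_fingerprint("Pro plan: $49 per month") == before)  # True
```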

Step 3: Store snapshots and compare

To detect a change, the tool has to remember the last run. Two small JSON files hold the state: snapshots.json maps each URL to its last fingerprint, and snapshots_text.json keeps the last extracted text so you can show a human-readable diff when something moves. The load function returns an empty dict on the first run rather than failing.

python
import json
from pathlib import Path

def load_json(path: str | Path) -> dict[str, str]:
    p = Path(path)
    if not p.exists():
        return {}
    with open(p, encoding="utf-8") as f:
        return json.load(f)

def save_json(data: dict[str, str], path: str | Path) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)

def check_for_change(url: str, current_hash: str, snapshots: dict[str, str]) -> bool:
    previous = snapshots.get(url)
    if previous is None:
        return True
    return previous != current_hash

The comparison logic is the heart of the tracker, and it is deliberately simple. check_for_change looks up the URL's stored fingerprint. If there is none, this is the first time you have seen the page, so it reports a change and the new fingerprint gets saved. If there is one, it returns whether the two differ. The first run on any URL always reports changed for exactly this reason, which is expected, not a bug.
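A short walk-through shows the three outcomes in sequence. This sketch repeats check_for_change so it runs standalone and feeds it placeholder fingerprints ("aaa", "bbb") rather than real hashes.

```python
def check_for_change(url: str, current_hash: str, snapshots: dict[str, str]) -> bool:
    previous = snapshots.get(url)
    if previous is None:
        return True
    return previous != current_hash


snapshots: dict[str, str] = {}
url = "https://example.com/pricing"
results = []

# Three consecutive runs: first sight, same content, edited content.
for observed in ["aaa", "aaa", "bbb"]:
    results.append(check_for_change(url, observed, snapshots))
    snapshots[url] = observed  # remember this run for the next comparison

print(results)  # [True, False, True]
```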

Now wire the pieces together into one run. The function below loops the URLs, fetches and fingerprints each, decides whether it changed, emits a unified diff against the stored text when it did, and saves the updated state at the end so the next run has something to compare against. The diff uses the standard-library difflib module, no extra dependency.

python
import difflib

def run_once(urls: list[str], hash_path="snapshots.json",
             text_path="snapshots_text.json") -> None:
    snapshots = load_json(hash_path)
    snapshot_texts = load_json(text_path)

    for url in urls:
        html = fetch_page(url)
        text = extract_monitorable_text(html)
        if not text:
            # Nothing extracted: skip rather than record a spurious change.
            print(f"[warn] empty text, skipping {url}")
            continue
        fingerprint = content_fingerprint(text)

        if check_for_change(url, fingerprint, snapshots):
            print(f"[changed] {url}")
            # Word-level diff against the stored text, trimmed to 500 characters.
            old = snapshot_texts.get(url, "")
            diff = difflib.unified_diff(
                old.split(), text.split(),
                lineterm="", n=0)
            print(" ".join(diff)[:500])
        else:
            print(f"[no change] {url}")

        # Record the latest state so the next run compares against it.
        snapshots[url] = fingerprint
        snapshot_texts[url] = text

    save_json(snapshots, hash_path)
    save_json(snapshot_texts, text_path)

Saving both the fingerprint and the text is what lets future runs detect a change and explain it. The fingerprint answers "did anything change," and the stored text lets difflib answer "what changed." If you only ever need the yes/no signal, you can drop the text file and keep the fingerprint map alone.
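To see what the word-level diff contains before it is joined and trimmed, the sketch below calls difflib.unified_diff the same way run_once does. The word_diff helper name and the sample sentences are illustrative.

```python
import difflib


def word_diff(old: str, new: str) -> list[str]:
    # Splitting on whitespace makes each word one "line" of the diff, and
    # n=0 drops the unchanged context words around each edit.
    return list(difflib.unified_diff(old.split(), new.split(), lineterm="", n=0))


lines = word_diff("Pro plan costs $49 per month",
                  "Pro plan costs $59 per month")
for line in lines:
    print(repr(line))
# '--- '
# '+++ '
# '@@ -4 +4 @@'
# '-$49'
# '+$59'
```

The two header lines are empty because no file names are passed, the @@ line says the edit is at word 4, and the remaining lines are the removed and added words. With an empty old text, as on a first run, every word of the page appears as an addition.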

Storage beyond JSON

JSON files are perfect for a handful of URLs and easy to inspect by hand. Once you are tracking hundreds of pages, swap the load and save functions for SQLite via the standard-library sqlite3 module: it handles concurrent reads, scales to large URL lists, and keeps all state in one portable file. The rest of the script does not change.

Step 4: Run the tracker on a schedule

A change tracker is only useful if it runs on its own. There are two clean ways to do that. The first is built into the script: an optional interval loop that re-checks every URL every N seconds until you stop it. The second is to let the operating system run a single pass on a timer with cron. Below is the CLI entry point with both the one-shot and interval modes.

python
import argparse
import time

def main() -> None:
    parser = argparse.ArgumentParser(description="Website change tracker")
    parser.add_argument("urls", nargs="+", help="public URLs to monitor")
    parser.add_argument("--interval", type=float, metavar="SECONDS",
        help="re-check every SECONDS (e.g. 3600 for hourly); Ctrl+C to stop")
    args = parser.parse_args()

    while True:
        run_once(args.urls)
        if args.interval is None:
            break
        time.sleep(args.interval)

if __name__ == "__main__":
    main()

Run a single check, or keep it looping hourly:

bash
export CRAWLBASE_TOKEN="your_token"

# one pass, then exit
python tracker.py https://example.com

# check every hour until you stop it
python tracker.py https://example.com --interval 3600

The interval loop is the simplest option and keeps the process in one place, which is handy while you are testing. For an unattended production setup, cron is usually the better fit: it survives reboots and does not tie up a terminal. A crontab entry that runs a single pass every hour and appends output to a log looks like this:

bash
# run at the top of every hour (a crontab entry must stay on one line)
0 * * * * cd /path/to/project && CRAWLBASE_TOKEN=your_token ./tracker_env/bin/python tracker.py https://example.com >> tracker.log 2>&1

On Windows the equivalent is Task Scheduler running the same one-pass command on a trigger. Either way, drop the --interval flag when cron or Task Scheduler owns the timing, since the scheduler already handles the repeat.

What the output looks like

The script prints one line per URL per run, plus a trimmed word-level diff when a page changed. The first time you check a URL it always reports changed, because there is no stored fingerprint yet; the diff on that run lists the page's words as additions, and the snapshot is written for next time:

bash
# first run: no snapshot exists yet, so every word shows as added
[changed] https://example.com
---  +++  @@ -0,0 +1,3 @@ +Old +pricing +copy

# later run, content edited
[changed] https://example.com
---  +++  @@ -1 +1 @@ -Old +New

# later run, nothing moved
[no change] https://example.com

The state on disk is just as readable. snapshots.json is a flat map of URL to fingerprint, which is all the comparison needs:

json
{
  "https://example.com": "3e1f9c...a7d2",
  "https://example.com/pricing": "b04c88...11ef"
}

Handling failures and scaling up

A long-running monitor will hit failures, and how it handles them decides whether it keeps running. Three cases come up constantly:

  • Timeouts. The requests.get(timeout=30) call raises an exception if the API does not answer in time, so wrap the fetch and retry with exponential backoff rather than letting one slow response kill the run.
  • HTTP errors. raise_for_status() turns 4xx and 5xx responses into exceptions; log the status and the URL, then skip that URL and carry on with the rest.
  • Empty extractions. If extract_monitorable_text returns an empty string, skip the comparison and log a warning instead of recording a spurious change, which the if not text guard in run_once already does.

python
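import time
import requests

def fetch_with_retry(url: str, retries: int = 3,
                     backoff: float = 2.0) -> str:
    # Wait backoff**attempt seconds (1, 2, 4, ...) between failed attempts.
    for attempt in range(retries):
        try:
            return fetch_page(url)
        except requests.exceptions.RequestException:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            time.sleep(backoff ** attempt)
    raise ValueError("retries must be at least 1")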
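Retrying covers transient errors, but a URL that keeps failing should not take the rest of the list down with it. The sketch below shows one way to isolate failures per URL. The fetch_all_safely name is hypothetical, and it takes the fetch function as an argument so the example can run with a fake fetcher instead of the network; in the tracker you would pass fetch_with_retry and narrow the except clause to requests.exceptions.RequestException.

```python
from typing import Callable


def fetch_all_safely(
    urls: list[str], fetch: Callable[[str], str]
) -> tuple[dict[str, str], dict[str, str]]:
    """Return (pages, errors): HTML for URLs that worked, messages for the rest."""
    pages: dict[str, str] = {}
    errors: dict[str, str] = {}
    for url in urls:
        try:
            pages[url] = fetch(url)
        except Exception as exc:  # broad here only to keep the sketch dependency-free
            errors[url] = f"{type(exc).__name__}: {exc}"
    return pages, errors


def fake_fetch(url: str) -> str:
    # Stand-in for fetch_with_retry: one URL always times out.
    if "slow" in url:
        raise TimeoutError("no response in 30s")
    return f"<p>content of {url}</p>"


pages, errors = fetch_all_safely(
    ["https://example.com/a", "https://example.com/slow", "https://example.com/b"],
    fake_fetch,
)
print(sorted(pages))  # ['https://example.com/a', 'https://example.com/b']
print(errors)         # {'https://example.com/slow': 'TimeoutError: no response in 30s'}
```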

Scaling from one page to many follows naturally. Tracking more URLs is just a longer list passed to run_once. To speed up large lists, fetch in parallel with concurrent.futures.ThreadPoolExecutor since the work is I/O-bound. To track hundreds of pages, move state from JSON to SQLite as noted above. And if any of your targets render content client-side, switch the fetch to a JavaScript token so the page is rendered before you extract from it: our notes on scraping JavaScript pages with Python cover when that is necessary.
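A minimal sketch of the parallel fetch is below. The fetch_parallel name and the max_workers default are assumptions, and the fetch function is passed in so the example runs offline with a fake; in the tracker you would pass fetch_page or a retrying wrapper.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def fetch_parallel(urls: list[str], fetch: Callable[[str], str],
                   max_workers: int = 4) -> dict[str, str]:
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so zip pairs each URL with its
        # own HTML. An exception raised inside fetch surfaces here, so pass a
        # fetch function that already handles the failures you expect.
        return dict(zip(urls, pool.map(fetch, urls)))


def fake_fetch(url: str) -> str:
    return f"<p>content of {url}</p>"


urls = [f"https://example.com/page/{n}" for n in range(1, 6)]
pages = fetch_parallel(urls, fake_fetch)
print(len(pages))      # 5
print(pages[urls[0]])  # <p>content of https://example.com/page/1</p>
```

Keep max_workers small: the goal is to overlap network waits, not to multiply the load on the sites you are watching.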

Tracking changes responsibly

A change tracker is automated traffic, so run it the way you would want one run against your own site. Poll at a cadence that matches how often the page actually changes: every 15 to 60 minutes for fast-moving news or dashboards, every few hours for pricing and listings, and daily or weekly for policy and documentation pages. Checking a static page every minute adds request cost and load without improving detection, so pick the slowest interval that still catches what you need.
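If different pages deserve different cadences, one option is a per-URL interval table plus a check for which URLs are due. The sketch below is an assumption layered on top of the script, not part of it: the CHECK_INTERVALS values are example picks from the ranges above, and due_urls is a hypothetical helper.

```python
import time

# Hypothetical cadence table, in seconds, picked from the ranges above.
CHECK_INTERVALS = {
    "https://example.com/news": 30 * 60,          # fast-moving page
    "https://example.com/pricing": 4 * 60 * 60,   # pricing or listings
    "https://example.com/terms": 24 * 60 * 60,    # policy or documentation
}


def due_urls(last_checked: dict[str, float], now: float) -> list[str]:
    """URLs never checked, or last checked at least one interval ago."""
    due = []
    for url, interval in CHECK_INTERVALS.items():
        last = last_checked.get(url)
        if last is None or now - last >= interval:
            due.append(url)
    return due


now = time.time()
last_checked = {
    "https://example.com/news": now - 45 * 60,     # 45 minutes ago: due
    "https://example.com/pricing": now - 60 * 60,  # 1 hour ago: not due yet
}
print(due_urls(last_checked, now))
# ['https://example.com/news', 'https://example.com/terms']
```

A frequent cron entry can then call run_once with only the due URLs, so slow-changing pages are not fetched more often than they need to be.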

Stay on the right side of the source as well. Track only public pages, the ones anyone can load without an account, and read the site's terms of service and robots.txt before you point a recurring job at it; treat both as the boundary for what you collect. Keep your request volume low enough that you are not straining the server, space out checks across targets instead of hammering one URL, and back off when you see errors or challenges instead of retrying harder. If a site offers an official API or change feed, prefer it: it is the path the site intends for this, and it is usually more stable than parsing HTML.

Recap

Key takeaways

  • Fingerprints beat raw comparison. Hashing cleaned text with SHA-256 turns "did this page change" into a fast comparison of two 64-character strings instead of whole pages.
  • Extract before you hash. Stripping scripts, styles, nav, and footers and collapsing whitespace is what stops timestamps and ad slots from triggering false positives.
  • Store the text, not just the hash. Keeping the last extracted text alongside the fingerprint lets difflib show exactly what changed, not just that something did.
  • The fetch layer is where blocks happen. The Crawling API handles rendering and trusted-IP rotation, so a long-running monitor keeps getting clean HTML instead of getting flagged.
  • Schedule it and be polite. A sleep loop or a cron entry runs the check on its own; match the interval to how often the page really changes and respect the source's terms and robots.txt.

Frequently Asked Questions (FAQs)

Why fingerprint the text instead of comparing the raw HTML?

Raw HTML changes on almost every request in ways you do not care about: inline scripts, ad slots, embedded timestamps, and CSRF tokens all shift while the actual content stays the same. Comparing raw HTML gives you a false positive nearly every run. Reducing the page to readable text first, then hashing that, makes the fingerprint track the content you meant to watch instead of the noise around it.

Does this work on JavaScript-heavy websites?

Yes, with one change. Use a JavaScript token with the Crawling API instead of the standard one. That renders the full page in a real browser before returning HTML, so client-side content is present when BeautifulSoup extracts the text. Without it, a client-rendered page returns a near-empty frame and your fingerprint ends up tracking the frame rather than the content.

Can it monitor multiple pages at once?

Yes. Pass several URLs on the command line and the script processes them in turn, keeping one fingerprint per URL in the snapshot file. For large lists, fetch in parallel with concurrent.futures.ThreadPoolExecutor since the work is I/O-bound, and consider moving storage to SQLite so the state scales cleanly.

How do I get an alert when something changes?

The core script reports changes to stdout, which is enough if cron mails you its output. To push a real alert, call out wherever check_for_change returns True: post to a Slack or Discord webhook, send an email through a transactional API, or hit any HTTP endpoint. It is a few lines added to the changed branch of run_once.
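As a rough sketch of the webhook option, the block below builds a JSON POST with the standard library. The build_alert_request and send_alert names and the hooks.example.invalid URL are placeholders. The payload uses a text field, which Slack incoming webhooks accept; other services expect different field names (Discord uses content), so adjust the payload to your target.

```python
import json
import urllib.request


def build_alert_request(webhook_url: str, page_url: str,
                        diff_preview: str) -> urllib.request.Request:
    # Trim the diff so a large page change does not produce a huge message.
    payload = {"text": f"Page changed: {page_url}\n{diff_preview[:500]}"}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(webhook_url: str, page_url: str, diff_preview: str,
               timeout: float = 10.0) -> int:
    request = build_alert_request(webhook_url, page_url, diff_preview)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Build (but do not send) a request to inspect what would go over the wire.
request = build_alert_request(
    "https://hooks.example.invalid/placeholder",
    "https://example.com/pricing",
    "-$49 +$59",
)
print(request.get_method())               # POST
print(json.loads(request.data)["text"])
```

In run_once, a call to send_alert would sit in the changed branch next to the diff print, wrapped in its own try/except so a failed alert does not abort the check.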

What is the best storage option for tracking hundreds of URLs?

Replace the JSON files with SQLite via the standard-library sqlite3 module. It handles concurrent reads, scales to large URL lists, and keeps all state in one portable file. The load and save functions are the only code that changes; the fetch, extract, fingerprint, and compare logic stay exactly the same.

How often should the tracker run?

Match the interval to how often the page actually changes. Fast-moving news pages and dashboards justify every 15 to 60 minutes; pricing and product listings are usually fine at a few hours; policy and documentation pages can be checked daily or weekly. Running far more often than the page changes only adds request cost and load on the source without catching anything you would otherwise miss.
