In one sentence: This tutorial shows you how to build a Python website monitoring script that fetches pages via Crawlbase, generates SHA-256 content fingerprints, and alerts you when anything changes; no proxy infrastructure required.

To build a website change tracker, the easiest approach is to compare the current version of a page with a previously saved one. A script fetches the page, extracts the relevant text, and generates a fingerprint from that content. The next time it runs, it performs the same steps again and checks whether the fingerprint still matches. If it does not, something on the page changed.

In this tutorial, we will build a Python script that handles this workflow. It retrieves page HTML through the Crawlbase Crawling API, extracts the readable text from the page, and generates a SHA-256 hash of the cleaned content. That hash is stored locally so the script can compare it the next time the page is checked.

By the end, you’ll have a working change tracker that can monitor one or several URLs, store snapshots, output structured results, and run automatically on a schedule.

How Website Change Tracking Works

Website change tracking follows a repeatable six-step pipeline that converts raw page content into a comparable signal.

  • Step 1 — Fetch the page content. Retrieve the full HTML of the target URL. Using a reliable API like Crawlbase avoids blocks and ensures JavaScript-rendered content is included.
  • Step 2 — Extract the part of the page you want to monitor. Strip out navigation, scripts, footers, and ads. You want only the meaningful body text.
  • Step 3 — Normalize the text. Collapse whitespace, remove formatting artifacts, and standardize encoding so that cosmetic changes don’t trigger false positives.
  • Step 4 — Generate a content fingerprint. A content fingerprint is a fixed-length cryptographic hash (SHA-256 in this tutorial) derived from the cleaned page text. Even a single word change produces a completely different hash, making fingerprints a fast and storage-efficient way to detect updates.
  • Step 5 — Compare with the stored fingerprint. Load the fingerprint saved from the last run and compare it to the one you just generated. If they differ, the page has changed.
  • Step 6 — Record or report the result. Save the new fingerprint for the next run and optionally emit a diff showing exactly what changed.

The main challenge is avoiding false positives. Raw HTML often includes elements that change frequently, such as scripts, advertisements, timestamps, or dynamic widgets. Comparing cleaned text instead of raw HTML produces more accurate results.
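
Before building each piece in full, the fingerprint-and-compare core (steps 4 to 6) can be sketched in a few lines. This is a minimal standalone illustration, not the tutorial's final code:

```python
import hashlib

def fingerprint(text: str) -> str:
    # Step 4: fixed-length SHA-256 hash of the cleaned page text
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def has_changed(url: str, cleaned_text: str, stored: dict[str, str]) -> bool:
    # Steps 5-6: compare against the fingerprint from the last run,
    # then record the new one for the next run
    current = fingerprint(cleaned_text)
    changed = stored.get(url) != current
    stored[url] = current
    return changed

store: dict[str, str] = {}
print(has_changed("https://example.com", "hello world", store))  # True: first run
print(has_changed("https://example.com", "hello world", store))  # False: unchanged
print(has_changed("https://example.com", "hello there", store))  # True: content changed
```

A first run always reports a change because there is no stored fingerprint yet, which is exactly the behavior the full script exhibits later in this guide.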

Why Use Crawlbase for Page Tracking

You could build a tracking script using direct HTTP requests, but many websites block or throttle automated requests. Some pages also rely heavily on JavaScript, meaning the raw HTML returned by a standard request may not contain the actual content.

Crawlbase solves these problems by handling page retrieval for you.

Key advantages include:

  • Reliable page retrieval across a wide range of websites
  • Built-in handling for blocking, throttling, and CAPTCHAs
  • JavaScript rendering via the JS token
  • No proxy infrastructure to manage or maintain
  • Consistent HTML output that’s suitable for repeatable comparison

Your monitoring script focuses only on extracting and comparing content while Crawlbase acts as the retrieval layer.

Prerequisites and Technical Requirements

Before starting, make sure your environment includes the following.

Environment requirements:

  • Python: version 3.10 or later
  • Crawlbase API token: free tier includes 1,000 requests
  • Operating system: Linux, macOS, or Windows

The tutorial uses these Python packages and standard-library modules:

  • requests: HTTP requests to the Crawlbase API
  • beautifulsoup4: HTML parsing and text extraction
  • hashlib: SHA-256 fingerprint generation (standard library)
  • json: local snapshot storage (standard library)
  • difflib: generating human-readable diffs (standard library)

Step 1: Install Dependencies

From the project directory, download requirements.txt, and run:

pip install -r requirements.txt

This will install dependencies such as requests (v2.28.0) and beautifulsoup4 (v4.11.0).

Step 2: Fetch a Web Page Using Crawlbase

The next step is verifying that you can retrieve the page HTML successfully.

The script sends a request to the Crawlbase Crawling API and returns the response content.

Get the complete code example on ScraperHub - fetch.py

import os
from urllib.parse import quote

import requests

CRAWLBASE_API_URL = "https://api.crawlbase.com"  # Crawling API base endpoint

def fetch_page(url: str, token: str | None = None) -> str:
    api_token = token or os.environ.get("CRAWLBASE_TOKEN", "")
    if not api_token:
        raise ValueError("Crawlbase token required: set CRAWLBASE_TOKEN or pass token=")
    api_url = f"{CRAWLBASE_API_URL}/?token={api_token}&url={quote(url)}"
    response = requests.get(api_url, timeout=30)
    response.raise_for_status()
    return response.text

This function:

• Reads the Crawlbase token
• Sends the target URL to the Crawling API
• Retrieves the page HTML
• Returns the content for processing

Using Crawlbase ensures the monitoring and tracking tool receives reliable HTML output.

Step 3: Extract the Content to Track

Comparing raw HTML is unreliable because pages contain many elements that change frequently.

To reduce noise, the script extracts readable page text and removes unnecessary elements.

Code example on ScraperHub - extract.py

from bs4 import BeautifulSoup

def extract_monitorable_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    text = soup.get_text(separator=" ", strip=True)
    return " ".join(text.split())

This function performs several steps:

• Removes scripts and styles
• Removes navigation and footer elements
• Extracts readable text
• Normalizes whitespace

The result is a consistent text representation of the page content.
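
As a quick stdlib-only illustration of why the final whitespace step matters, two texts that differ only in formatting normalize to the same string, and therefore produce the same fingerprint:

```python
def normalize(text: str) -> str:
    # Collapse every run of whitespace (spaces, tabs, newlines) to a single space
    return " ".join(text.split())

# Reflowed or re-indented text yields the same normalized string,
# so cosmetic formatting changes do not register as content changes.
v1 = "Price:\n    $19.99\tIn stock"
v2 = "Price: $19.99 In stock"
assert normalize(v1) == normalize(v2)
```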

Step 4: Generate a Content Fingerprint

Instead of storing entire page snapshots, the tool generates a fingerprint using a cryptographic hash.

A hash converts text into a fixed-length string. If the content changes, the hash changes as well.

Example (fingerprint.py):

import hashlib

def content_fingerprint(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

This creates a SHA-256 fingerprint of the cleaned text.

Benefits of using hashes:

• Fast comparison
• Minimal storage requirements
• Reliable detection of small changes

Even a small change to the text will produce a different hash.
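
A quick demonstration of that sensitivity, reusing the same function:

```python
import hashlib

def content_fingerprint(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

h1 = content_fingerprint("Price: $19.99")
h2 = content_fingerprint("Price: $19.98")  # a single character differs
print(h1 == h2)   # False: the two hashes share no resemblance
print(len(h1))    # 64 hex characters regardless of input size
```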

Step 5: Store Previous Snapshots

To detect updates, the tool must remember the fingerprints from previous runs.

The tool stores two snapshot files:

  • snapshots.json - Stores URL → fingerprint mappings.

  • snapshots_text.json - Stores the normalized text for each page so differences can be shown when content changes.

Example (storage.py):

import json
from pathlib import Path

def load_snapshots(path: str | Path) -> dict[str, str]:
    p = Path(path)
    if not p.exists():
        return {}
    with open(p, encoding="utf-8") as f:
        return json.load(f)

def save_snapshots(snapshots: dict[str, str], path: str | Path) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(snapshots, f, indent=2)

When the monitor runs again, it loads the stored fingerprints and compares them with the newly generated ones.

Step 6: Compare Current vs Previous Version

Once the current fingerprint is generated, the script compares it with the stored fingerprint.

Example (monitor.py):

def check_for_change(url: str, current_hash: str, snapshots: dict[str, str]) -> bool:
    previous = snapshots.get(url)
    if previous is None:
        return True
    return previous != current_hash

If the fingerprints are different, the script reports a change.

Possible results:

  • Changed
  • No change

The first time a URL is checked, the script always reports Changed because no previous snapshot exists yet. The current fingerprint and page text are then stored for future comparisons.

When the page content changes, the tool also generates a unified diff showing what changed. Example output might look like this:

--- previous
+++ current
- Old sentence
+ New sentence

This diff is generated using Python’s difflib module and helps identify exactly what changed between page versions.
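
A minimal sketch of how such a diff can be produced with `difflib.unified_diff` (the helper name `text_diff` is illustrative, not necessarily the repository's):

```python
import difflib

def text_diff(previous: str, current: str) -> str:
    # splitlines() gives difflib line-level units; lineterm="" keeps output clean
    lines = difflib.unified_diff(
        previous.splitlines(), current.splitlines(),
        fromfile="previous", tofile="current", lineterm="",
    )
    return "\n".join(lines)

print(text_diff("Old sentence", "New sentence"))
# --- previous
# +++ current
# @@ -1 +1 @@
# -Old sentence
# +New sentence
```

In practice you would diff the stored normalized text from snapshots_text.json against the freshly extracted text.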

Step 7: Save Updated Snapshot

After checking for changes, the script updates the stored snapshot so future runs can detect new updates.

In monitor.py, the script stores both the fingerprint and the extracted text.

snapshots[url] = fingerprint
snapshot_texts[url] = text
save_snapshots(snapshots, snapshots_path)
save_snapshot_texts(snapshot_texts, texts_path)

Saving both values allows the tool to detect future changes and generate readable diffs.

Step 8: Run the Monitor on a Schedule

Monitoring tools are most useful when they run automatically.

Several scheduling approaches are common:

  • Cron jobs on Linux or macOS
  • Windows Task Scheduler
  • Cloud-based job schedulers

This tool also supports built-in interval scheduling.

Example CLI configuration in main.py:

parser.add_argument("--interval", type=float, metavar="SECONDS",
                    help="Run continuously: re-check all URLs every SECONDS "
                         "(e.g. 3600 for hourly). Ctrl+C to stop.")
# ...
while True:
    results = run_once(args.url, args.snapshots, args.json)
    if args.interval is None:
        break
    time.sleep(args.interval)

Example usage:

python main.py https://example.com --interval 3600

Full Working Script

The complete implementation combines all components into a modular monitoring and tracking tool.

ScraperHub Repository layout:

  • fetch.py: fetch HTML using Crawlbase
  • extract.py: clean HTML and normalize text
  • fingerprint.py: generate SHA-256 fingerprint
  • storage.py: load and store snapshot data
  • monitor.py: compare snapshots and detect changes
  • main.py: CLI entry point and scheduler

How to Run the Script

Set your Crawlbase token first.

export CRAWLBASE_TOKEN="your_token"

Then run the script.

python main.py https://targeturl.com

To monitor multiple pages:

python main.py https://targeturl1.com https://targeturl2.com ...

The first run always reports Changed, since no snapshot exists yet.

Error Handling Strategies

A production Python website monitoring script needs to handle three common failure modes gracefully.

  • Network timeouts: The requests.get(timeout=30) call raises requests.exceptions.Timeout if the Crawlbase API does not respond within 30 seconds. Wrap fetch calls in a try/except and implement exponential backoff for retries:
import time

import requests

def fetch_with_retry(url: str, token: str, retries: int = 3, backoff: float = 2.0) -> str:
    for attempt in range(retries):
        try:
            return fetch_page(url, token)
        except requests.exceptions.Timeout:
            if attempt < retries - 1:
                time.sleep(backoff ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
            else:
                raise

  • HTTP errors: response.raise_for_status() surfaces 4xx/5xx responses as exceptions. Log the status code and URL, then skip the affected URL rather than halting the entire run.
  • Malformed HTML: BeautifulSoup handles most broken HTML gracefully, but extremely malformed pages can produce empty text. Add a check after extraction: if extract_monitorable_text() returns an empty string, skip the fingerprint comparison and log a warning rather than recording a spurious change.

Scaling the Tool for Multiple URLs

The tutorial focuses on a minimal implementation, but the system can be extended for larger monitoring workloads.

Possible improvements include:

• Tracking many pages simultaneously
• Parallel request processing
• Using databases instead of JSON storage
• Adding structured logging and retries

These changes make the system more robust for production environments.
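
As one illustration, parallel request processing is available from the standard library alone. `check_url` below is a stub standing in for the fetch-extract-compare steps covered earlier:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def check_url(url: str) -> tuple[str, bool]:
    # Hypothetical per-URL worker: fetch, extract, fingerprint, compare.
    # Stubbed out here so the sketch is self-contained.
    return url, True

urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]

# Threads suit this workload because each check is I/O-bound (waiting on HTTP)
with ThreadPoolExecutor(max_workers=5) as pool:
    futures = {pool.submit(check_url, u): u for u in urls}
    for future in as_completed(futures):
        url, changed = future.result()
        print(url, "changed" if changed else "unchanged")
```

Keep `max_workers` below your Crawlbase rate limit so parallel checks do not trigger throttling.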

Limitations and Best Practices

A simple change tracker works well for many pages, but real websites can introduce a few complications.

Dynamic content

Some sites load content with JavaScript after the initial page request. If the part of the page you want to track is generated this way, a normal Crawlbase request may not return the full content. In that case, switch to the Crawlbase JavaScript token so the page is rendered before the HTML is returned.

Authentication

For pages that require a login, the request must include valid session cookies.

Fix: Pass authenticated cookies via the Crawlbase cookies parameter so the crawler accesses the logged-in version of the page.

Rate limits

  • Default limit: 20 requests per second
  • For most monitoring workloads, this is sufficient
  • Contact Crawlbase support to request a higher limit for large-scale jobs

Monitoring intervals

Choose check frequency based on how often the page actually changes:

  • News sites/dashboards: every 15–60 minutes
  • Product listings/pricing: every 1–6 hours
  • Policy pages/documentation: daily or weekly

Running checks too frequently adds request costs without improving detection accuracy.

What’s Next

With the script from this guide, you already have a working website change tracker built with Python and Crawlbase. From here, you can extend it depending on your needs. For example, you could add alert notifications, store results in a database, or monitor a larger list of URLs in parallel.

If you want to try it yourself, create a Crawlbase account and use the 1,000 free requests to test the tracker and start monitoring and tracking pages right away.

Frequently Asked Questions

Can this monitor multiple pages at once?

Yes. Pass multiple URLs to the CLI: python main.py https://site1.com https://site2.com. The script processes them sequentially by default; enable parallel processing with ThreadPoolExecutor for faster runs across large URL lists.

How often should checks run?

It depends on how frequently the content changes. Hourly is a reasonable default for most monitoring use cases. High-frequency pages (live scores, breaking news) may warrant checks every 10–15 minutes. Static documentation pages are fine with daily checks.

Does it work on JavaScript-heavy websites?

Yes, with one configuration change. Use the Crawlbase JavaScript token instead of the standard token. This renders the full page in a headless browser before returning HTML, ensuring dynamic content is captured.

Can it send alerts when something changes?

The core script outputs change results to stdout and optionally to a JSON file. Integrating alerts requires a small extension — call an email API (e.g., SendGrid), post to a Slack webhook, or trigger any HTTP endpoint when check_for_change() returns True.

What’s the best storage option for tracking hundreds of URLs?

Replace the default JSON files with SQLite using the sqlite3 standard library module. It handles concurrent reads, scales to large URL lists, and keeps all state in a single portable file.