Plenty of the data worth collecting on the modern web never shows up in the page source. A product grid that fills in as you scroll, a table that updates when you change a filter, a dashboard that loads its numbers a beat after the layout: all of these use AJAX (Asynchronous JavaScript and XML) to fetch content in the background and slot it into a page that has already loaded. It makes for a smooth experience, and it quietly breaks the simplest scrapers.

This guide shows you how to scrape data from AJAX-driven websites with Python. You will build a small, runnable scraper that renders the page through the Crawling API, waits for the asynchronous content to arrive, captures the data the page loads over XHR, parses it, and exports clean JSON and CSV. The walkthrough uses a neutral placeholder listing so you can follow the mechanics and then point the same flow at your own target.

What you will build

A Python script that fetches an AJAX page through Crawlbase, reads the data the page loaded asynchronously, and turns each item into a structured record. The running example is a generic public listing where each card carries a name, a price, and a category. We pull these fields per item:

  • Name: the title shown on each listing card.
  • Price: the numeric price rendered for the item.
  • Category: the group or tag the item belongs to.
  • Link: the URL of the item's detail page.

You will see two routes to the same data: replicating the AJAX call directly, and rendering the full page. Both end in the same export step.

Why a plain request fails on AJAX pages

Request an AJAX-driven URL with a bare HTTP client and you get back status 200 and almost none of the data you came for. The reason is timing. The server sends a thin HTML shell, the browser runs the page's JavaScript, and only then does the script fire the background requests (the AJAX calls) that return the real content and inject it into the DOM. A plain requests.get stops at the shell: it never runs the JavaScript, so it never triggers the follow-up calls, and the body you parse is mostly empty layout.
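To make the timing problem concrete, here is a minimal sketch that parses the kind of shell a bare HTTP client receives. The markup in SHELL_HTML, the item-card class name, and the CardCounter helper are all hypothetical stand-ins, not taken from a real site; the point is only that the container is empty until a script fills it.

```python
from html.parser import HTMLParser

# Hypothetical shell: what a bare HTTP client might receive before any
# JavaScript has run. The container is empty; the script would fill it later.
SHELL_HTML = """
<html>
  <body>
    <div id="items"></div>
    <script src="/static/app.js"></script>
  </body>
</html>
"""

class CardCounter(HTMLParser):
    """Counts item cards and script tags in a chunk of HTML."""

    def __init__(self):
        super().__init__()
        self.cards = 0
        self.scripts = 0

    def handle_starttag(self, tag, attrs):
        # The class attribute may be missing or hold several class names.
        classes = (dict(attrs).get("class") or "").split()
        if "item-card" in classes:
            self.cards += 1
        if tag == "script":
            self.scripts += 1

def count_cards(html):
    """Return (number of item cards, number of script tags)."""
    counter = CardCounter()
    counter.feed(html)
    return counter.cards, counter.scripts

cards, scripts = count_cards(SHELL_HTML)
print(f"cards found: {cards}, scripts waiting to run: {scripts}")
```

The status code would still be 200 in a case like this, which is why checking the status alone does not tell you the data is missing.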

There are two honest ways around this. The first is to find the AJAX endpoint the page calls in the background and request it directly, which is fast because you skip rendering. The second is to render the whole page in a real browser so the asynchronous content loads, then parse the finished HTML, which is the route you want when the endpoint is signed or hard to reproduce. The crawling JavaScript websites guide covers the rendering side in depth.

Normal token vs JS token

Crawlbase offers two token types. The normal token fetches static HTML; the JavaScript (JS) token renders the page in a real browser first. Because AJAX content only appears after scripts run, the rendered-page route uses the JS token, paired with ajax_wait and page_wait so the API waits for the background calls to finish before it captures the page. The normal token is enough when you request a JSON endpoint directly, since there is nothing to render.

Prerequisites

A few things need to be in place before writing any code.

Basic Python. You should be comfortable writing and running a Python script and installing packages with pip. If parsing is new to you, the BeautifulSoup guide pairs well with this tutorial.

Python 3.8 or later. Confirm your version with python --version. If you do not have it, install it from python.org and make sure Python is on your PATH.

A Crawlbase account and JS token. Sign up, open your dashboard, and copy your JavaScript (JS) token. Crawlbase includes 1,000 free requests to start, which is plenty for this guide. Treat the token like a password, and keep it out of version control.
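One common way to keep the token out of version control is to read it from an environment variable instead of pasting it into the script. The sketch below assumes a variable named CRAWLBASE_JS_TOKEN; both that name and the load_token helper are illustrative choices, not something Crawlbase requires.

```python
import os

def load_token(var_name="CRAWLBASE_JS_TOKEN"):
    """Return the token stored in an environment variable, or raise if unset."""
    token = os.environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return token

# Usage, once the variable is set in your shell:
#   api = CrawlingAPI({"token": load_token()})
```

Failing early with a clear message is usually easier to debug than a request that goes out with an empty token.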

Set up the project

Create a virtual environment so project dependencies stay isolated, then install the libraries the scraper needs.

bash
python --version

python -m venv ajax_env
source ajax_env/bin/activate

pip install crawlbase beautifulsoup4

On Windows, activate the environment with ajax_env\Scripts\activate instead of the source line. Two dependencies do the work: crawlbase is the official client for the Crawling API, and beautifulsoup4 parses the returned HTML when you go the rendered-page route. Both json and csv ship with the standard library, so there is nothing more to install for the export step.

Step 1: Identify the AJAX request

Before any code, find the background call the page makes. Open the target in Chrome, right-click and choose Inspect (or press Ctrl+Shift+I), and switch to the Network tab. Filter for XHR (labeled Fetch/XHR in current Chrome), which isolates the XMLHttpRequest and fetch calls from images and stylesheets, then reload the page. As the content fills in, the request that carried it appears. Click it to read the request URL, its query parameters, and the JSON it returned.

For the placeholder used here, the page loads its items from a JSON endpoint that looks like this:

text
https://example.com/api/items?page=1&limit=20

That endpoint returns the same data the page shows, only as clean JSON instead of rendered HTML. When a call like this exists and is reachable, requesting it directly is the simplest path. When it is signed, tied to a session, or otherwise awkward to reproduce, you render the page instead. Both routes follow below.
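Once you have read the request URL and its query parameters in the Network tab, it helps to treat them as data rather than as one hard-coded string. A small sketch using only the standard library follows; build_endpoint and read_params are hypothetical helper names, and the parameters are the placeholder's page and limit.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def build_endpoint(base, **params):
    """Join a base URL and query parameters into a full endpoint URL."""
    return f"{base}?{urlencode(params)}"

def read_params(url):
    """Return the query parameters of a URL as a flat dict of strings."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}

url = build_endpoint("https://example.com/api/items", page=1, limit=20)
print(url)
print(read_params(url))
```

Keeping the parameters separate makes it simple to change the page number or page size later without string surgery.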

Step 2: Fetch the AJAX endpoint through Crawlbase

Even a clean JSON endpoint can rate-limit or block automated traffic from a datacenter IP. Routing the call through Crawlbase gives you a trusted IP and built-in rotation, so the request reads like a real visitor. Import the CrawlingAPI class, initialize it with your token, and request the endpoint. Checking pc_status before you parse keeps failures loud instead of silent.

python
import json
from crawlbase import CrawlingAPI

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

endpoint = "https://example.com/api/items?page=1&limit=20"

def fetch_json(url):
    response = api.get(url)
    if response["headers"]["pc_status"] == "200":
        return json.loads(response["body"].decode("utf-8"))
    print(f"Request failed: {response['headers']['pc_status']}")
    return None

if __name__ == "__main__":
    data = fetch_json(endpoint)
    print(data if data else "No data returned")

Save the script as ajax_scraper.py, run it with python ajax_scraper.py, and you should see the raw JSON the page would have loaded in the browser, fetched in a single call without rendering anything. That confirms the endpoint is reachable before you write a line of parsing.
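If you would rather have failures stop the run than print a message, one option is to raise an exception instead. The sketch below assumes the response shape used in the script above (a dict with headers and a bytes body); FetchError and decode_json_response are hypothetical names introduced here, and the fake responses at the bottom exist only so the example runs offline.

```python
import json

class FetchError(Exception):
    """Raised when a response cannot be turned into JSON data."""

def decode_json_response(response):
    """Validate pc_status, then decode the body as JSON or raise FetchError."""
    status = response.get("headers", {}).get("pc_status")
    if str(status) != "200":
        raise FetchError(f"pc_status was {status!r}")
    body = response.get("body", b"")
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    try:
        return json.loads(body)
    except json.JSONDecodeError as exc:
        # A block page or error page is often HTML, not JSON.
        raise FetchError("body was not valid JSON") from exc

# Offline stand-ins for real API responses.
ok = {"headers": {"pc_status": "200"}, "body": b'{"items": []}'}
blocked = {"headers": {"pc_status": "200"}, "body": b"<html>Access denied</html>"}

print(decode_json_response(ok))
try:
    decode_json_response(blocked)
except FetchError as err:
    print(f"failed: {err}")
```

This also covers the case where the status looks fine but the body is an HTML error page, which json.loads alone would surface as a less descriptive exception.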

Crawlbase Crawling API

The fetch above reached an AJAX endpoint without you running a browser or managing IPs, which is exactly what the Crawling API handles. Give it a normal token for a clean JSON endpoint, or the JS token plus ajax_wait and page_wait when you need the full page rendered. It rotates residential IPs server-side and returns finished content, so you skip running a headless fleet and a proxy pool yourself. Point it at a public page on the free tier first.

Step 3: Parse the JSON response

The endpoint returns structured JSON, so there is no HTML to parse. Walk the object to the list of items and pull the fields you want. The exact key names depend on your target, so inspect the response from Step 1 and map them. For the placeholder, the items sit under an items key, each with name, price, category, and url.

python
def parse_items(data):
    records = []
    for item in data.get("items", []):
        records.append({
            "name": item.get("name"),
            "price": item.get("price"),
            "category": item.get("category"),
            "link": item.get("url"),
        })
    return records

Using dict.get instead of square-bracket access means a missing key returns None rather than raising a KeyError, so one malformed item does not end the run. Feed the JSON from Step 2 into parse_items and you get a tidy list of records, ready to export.
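To see that behavior without making a request, you can run the parser against a small hand-written payload. The sample data below is invented for illustration and reuses the placeholder's key names; the second item is deliberately missing two fields.

```python
def parse_items(data):
    """Map each raw item to the four-field record used in this guide."""
    records = []
    for item in data.get("items", []):
        records.append({
            "name": item.get("name"),
            "price": item.get("price"),
            "category": item.get("category"),
            "link": item.get("url"),
        })
    return records

# Invented sample payload; the second item lacks category and url.
sample = {
    "items": [
        {
            "name": "Wireless Keyboard",
            "price": "49.00",
            "category": "Accessories",
            "url": "https://example.com/items/wireless-keyboard",
        },
        {"name": "Standing Desk", "price": "299.00"},
    ]
}

records = parse_items(sample)
for record in records:
    print(record)
```

The incomplete item still produces a record, with None in the missing fields, so you can decide later whether to drop or repair it.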

Step 4: Render the full page when there is no clean endpoint

Sometimes the AJAX call is signed, tied to a session cookie, or split across several requests, and replicating it is more trouble than it is worth. In that case, render the whole page with the JS token, let Crawlbase wait for the asynchronous content, then parse the finished markup with BeautifulSoup the same as any static page.

python
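from bs4 import BeautifulSoup
from crawlbase import CrawlingAPI

# Rendering needs the JS token, so this route uses its own client.
js_api = CrawlingAPI({"token": "YOUR_CRAWLBASE_JS_TOKEN"})

# ajax_wait holds for background calls; page_wait is a pause in milliseconds.
RENDER_OPTIONS = {
    "ajax_wait": "true",
    "page_wait": 5000,
}

def fetch_rendered(url):
    response = js_api.get(url, RENDER_OPTIONS)
    if response["headers"]["pc_status"] == "200":
        return response["body"].decode("utf-8")
    print(f"Request failed: {response['headers']['pc_status']}")
    return None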

def parse_cards(html):
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for card in soup.select("div.item-card"):
        link = card.select_one("a.item-link")
        records.append({
            "name": text_of(card, "h2.item-name"),
            "price": text_of(card, "span.item-price"),
            "category": text_of(card, "span.item-category"),
            "link": link["href"] if link else None,
        })
    return records

The two wait options carry the load here. ajax_wait tells the API to hold for asynchronous content to finish loading, and page_wait adds a fixed pause in milliseconds after load so late-rendering cards appear before capture. Five seconds is a reasonable start; raise it if items come back thin. The parse_cards helper then reads each div.item-card and maps the same four fields, so its records have the same shape as those from parse_items. The values arrive as page text, so a price may carry a currency symbol or thousands separator. The text_of helper used here is defined in the full script below.
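Because the rendered route yields text while a JSON endpoint may yield numbers, it can help to normalize prices before comparing the two. The helper below is a sketch under one stated assumption: the dot is the decimal separator and commas only group thousands. normalize_price is a hypothetical name, and locales that write 1.299,00 would need different handling.

```python
import re
from decimal import Decimal, InvalidOperation

def normalize_price(value):
    """Return the price as a Decimal, or None when no number can be read.

    Assumes '.' is the decimal separator and ',' only groups thousands.
    """
    if value is None:
        return None
    # Drop currency symbols, spaces, and thousands separators.
    cleaned = re.sub(r"[^\d.]", "", str(value))
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None

for raw in ["$1,299.00", "49.00", 49, "N/A", None]:
    print(repr(raw), "->", normalize_price(raw))
```

Decimal avoids float rounding on money values. The json module does not serialize Decimal, so convert with str() before writing the export.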

Step 5: Handle pagination and assemble the script

One page is rarely the whole dataset. Most AJAX listings paginate through a query parameter (here page), so you loop over page numbers, collect records from each, and stop when a page comes back empty. Wire that loop together with the fetch, parse, and export steps into one runnable script.

python
import csv
import json
import time
from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})
BASE = "https://example.com/api/items?limit=20&page="

def fetch_json(url):
    response = api.get(url)
    if response["headers"]["pc_status"] == "200":
        return json.loads(response["body"].decode("utf-8"))
    print(f"Request failed: {response['headers']['pc_status']}")
    return None

def text_of(node, selector):
    el = node.select_one(selector)
    return el.get_text(strip=True) if el else None

def parse_items(data):
    records = []
    for item in data.get("items", []):
        records.append({
            "name": item.get("name"),
            "price": item.get("price"),
            "category": item.get("category"),
            "link": item.get("url"),
        })
    return records

def collect_all(max_pages=5):
    all_records = []
    for page in range(1, max_pages + 1):
        data = fetch_json(f"{BASE}{page}")
        if not data:
            break
        records = parse_items(data)
        if not records:
            break
        all_records.extend(records)
        time.sleep(2)
    return all_records

def save_outputs(records):
    # Write UTF-8 explicitly so non-ASCII names survive on every platform.
    with open("items.json", "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
    if not records:
        return
    with open("items.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)

def main():
    records = collect_all(max_pages=5)
    save_outputs(records)
    print(f"Saved {len(records)} items")

if __name__ == "__main__":
    main()

The script walks up to five pages of the AJAX endpoint, parses each into records, stops as soon as a page returns nothing, and paces the loop with a two-second sleep. save_outputs writes both a JSON file and a CSV, using the keys of the first record as the header. If your target has no clean endpoint, swap fetch_json plus parse_items for the fetch_rendered plus parse_cards pair from Step 4; the export step does not change.
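The stop condition is easier to trust once you have watched it work without a network. The sketch below pulls the loop into a function that takes the page fetcher as an argument, so a fake can stand in for the real call; collect_pages and fake_fetch are hypothetical names, and the fake serves two pages of invented data followed by empty pages.

```python
def collect_pages(fetch_page, max_pages=5):
    """Collect records page by page until a page is empty or the cap is hit."""
    all_records = []
    for page in range(1, max_pages + 1):
        records = fetch_page(page)
        if not records:
            break
        all_records.extend(records)
    return all_records

calls = []

def fake_fetch(page):
    """Stand-in for fetch plus parse: two pages of data, then nothing."""
    calls.append(page)
    pages = {
        1: [{"name": "Wireless Keyboard"}, {"name": "Standing Desk"}],
        2: [{"name": "Desk Lamp"}],
    }
    return pages.get(page, [])

records = collect_pages(fake_fetch, max_pages=5)
print(f"collected {len(records)} records, pages requested: {calls}")
```

In a real run you would pass a function that fetches and parses one page, and keep the sleep between calls.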

What the output looks like

Run the full script with python ajax_scraper.py and you get a clean structured record per item, ready for analysis, a database, or a spreadsheet.

json
[
  {
    "name": "Wireless Keyboard",
    "price": "49.00",
    "category": "Accessories",
    "link": "https://example.com/items/wireless-keyboard"
  },
  {
    "name": "Standing Desk",
    "price": "299.00",
    "category": "Furniture",
    "link": "https://example.com/items/standing-desk"
  }
]

The matching CSV carries the same columns, one row per item, which drops straight into pandas or any spreadsheet for filtering by price band or category. To take the analysis further, using pandas to analyze data picks up where this export leaves off, and JSON vs CSV covers which format suits which job.
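If you only need a quick filter, the standard library is enough before reaching for pandas. This sketch reads CSV text shaped like the export and keeps rows inside a price band. The inline CSV_TEXT mirrors the sample output above, and filter_by_price is a hypothetical helper that assumes the price column holds plain numeric text.

```python
import csv
import io

# Inline copy of the sample export so the example runs without a file.
CSV_TEXT = (
    "name,price,category,link\n"
    "Wireless Keyboard,49.00,Accessories,https://example.com/items/wireless-keyboard\n"
    "Standing Desk,299.00,Furniture,https://example.com/items/standing-desk\n"
)

def filter_by_price(rows, low, high):
    """Keep rows whose price falls within [low, high]."""
    return [row for row in rows if low <= float(row["price"]) <= high]

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
under_100 = filter_by_price(rows, 0, 100)
print([row["name"] for row in under_100])
```

To run it against the real export, replace the StringIO with open("items.csv", newline="", encoding="utf-8").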

Staying unblocked at scale

Even with rendering and a trusted IP handled, an AJAX target watches for scraper-shaped traffic. A few habits keep a longer run healthy.

  • Pace your requests. Firing calls in a tight loop is the fastest way to get throttled. The two-second sleep above is the floor, not the ceiling; widen it for larger jobs.
  • Lean on rotation. A pool of residential IPs spreads requests across many real-user addresses so no single one trips a rate limit. The Crawling API handles this for you.
  • Read the status codes. A run that starts returning non-200 pc_status values is telling you the current rate or IP tier is no longer enough. Treat that as a signal to back off.
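One way to act on the pacing and status-code habits above is exponential backoff: wait longer after each consecutive failure, up to a cap. The sketch below is a generic pattern, not a Crawlbase feature. backoff_delay and should_back_off are hypothetical names, and the base, factor, and cap values are arbitrary starting points.

```python
import random

def should_back_off(pc_status):
    """Treat any non-200 pc_status as a signal to slow down."""
    return str(pc_status) != "200"

def backoff_delay(attempt, base=2.0, factor=2.0, cap=60.0, jitter=0.0):
    """Seconds to wait before retry number `attempt` (0-based).

    The delay grows geometrically from `base` and never exceeds `cap`;
    `jitter` adds up to that many random seconds so retries do not align.
    """
    delay = min(cap, base * (factor ** attempt))
    return delay + random.uniform(0, jitter)

for attempt in range(6):
    print(attempt, backoff_delay(attempt))
```

In a retry loop you would call time.sleep(backoff_delay(attempt, jitter=1.0)) whenever should_back_off returns True, and reset the attempt counter after a success.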

For larger crawls, the async Crawler queues requests and delivers results to a webhook, which suits running many AJAX pages without holding open connections. For the broader playbook, see how to scrape websites without getting blocked and scraping JavaScript pages with Python.

Scraping responsibly

Keep this work scoped to public data, and treat the target's rules as the boundary. Read the site's terms of service and its robots.txt before you point a scraper at it, and collect only data that any visitor can see without an account. Pace your requests so you are not straining the server, and never scrape anything behind a login or attempt to bypass authentication. When the data involves identifiable people, privacy law such as GDPR or CCPA applies, so avoid personal or contact details unless you have a clear lawful basis to collect them. If a target offers an official API for the data you need, that is usually the cleaner and more durable route than scraping the rendered page.
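Checking robots.txt can be automated with the standard library's urllib.robotparser. The sketch below parses an invented robots.txt from a string so it runs offline. The rules, the my-scraper user agent, and the build_parser helper are all hypothetical; against a live site you would load the real file instead.

```python
from urllib.robotparser import RobotFileParser

# Invented rules for illustration; a real site publishes its own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Allow: /
"""

def build_parser(robots_text):
    """Return a parser loaded with the given robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)
for url in (
    "https://example.com/items/standing-desk",
    "https://example.com/account/orders",
):
    print(url, "->", parser.can_fetch("my-scraper", url))
```

A passing robots.txt check is not permission on its own; the terms of service and the privacy rules above still apply.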

Recap

Key takeaways

  • AJAX content loads after the shell. A plain request stops at the initial HTML and never runs the scripts that fetch the real data, so the body you parse is mostly empty.
  • Two routes reach the same data. Replicate the background XHR endpoint directly for speed, or render the full page with the JS token when the endpoint is signed or awkward to reproduce.
  • Wait for the content. On the rendered route, ajax_wait and page_wait hold for asynchronous calls to finish before Crawlbase captures the page.
  • Normalise then export. Map both routes to the same record shape, paginate through the query parameter, and write the results to JSON and CSV from one function.
  • Scrape responsibly. Respect terms of service and robots.txt, stay on public data, pace requests, and apply GDPR or CCPA rules whenever personal data is involved.

Frequently Asked Questions (FAQs)

What is AJAX and why does it make scraping harder?

AJAX (Asynchronous JavaScript and XML) is a technique that lets a page fetch content in the background and update part of the DOM without reloading. It makes scraping harder because the data is not in the initial HTML; it arrives only after the browser runs the page's JavaScript and the background calls return. A plain HTTP request never runs that JavaScript, so it captures a thin shell with the real content missing.

Can I scrape AJAX content without rendering a browser?

Often, yes. Filter your dev tools Network tab for XHR and find the request that carries the data. If that endpoint is reachable, you can request it directly and parse the JSON it returns, which is faster than rendering the page. When the endpoint is signed, tied to a session, or split across several calls, rendering with a JS token is the more reliable route.

Do I need the normal token or the JS token?

It depends on the route. For a clean JSON endpoint you found in the Network tab, the normal token is enough because there is nothing to render. To load a full page whose content only appears after scripts run, use the JS token together with ajax_wait and page_wait so Crawlbase waits for the asynchronous calls to finish before capturing the HTML.

What do ajax_wait and page_wait actually do?

ajax_wait tells the API to hold until the page's asynchronous requests have finished loading, rather than capturing the moment the initial HTML arrives. page_wait adds a fixed pause in milliseconds after load, which covers content that renders a beat late. Five seconds is a sensible starting point; raise it if items come back thin and lower it once you confirm the page settles faster.

My parsed list is empty. What went wrong?

Check three things in order. First, confirm pc_status came back 200; a non-200 value means the request failed. Second, on the endpoint route, re-inspect the JSON keys, since they may differ from the placeholder names used here. Third, on the rendered route, increase page_wait and re-check your CSS selectors against the live page, since class names on generated markup change without notice.

How do I scale this to many pages?

Loop over the pagination parameter and stop when a page returns no items, as the collect_all function above does, keeping a short sleep between requests. For large jobs, move to the async Crawler so requests queue and results arrive at a webhook instead of holding open connections, and lean on built-in IP rotation so no single address trips a rate limit.
