How to Scrape JavaScript Pages With Python

You write a few lines of Python, point requests at a product listing or a search results page, hand the response to BeautifulSoup, and get back almost nothing. The title is there, the layout is there, but the data you actually wanted is missing. This is the single most common wall people hit when they try to scrape JavaScript pages with Python: the page renders its content in the browser, after the initial HTML arrives, so a plain HTTP fetch only ever sees the empty shell.

This guide explains why that happens, walks through the three real ways to get the rendered data (headless browsers, the underlying JSON API, and a rendering API), and shows a clean, runnable example that fetches a finished page through the Crawling API and parses it with BeautifulSoup. By the end you will know which approach fits which job and how to keep a run from getting blocked.

Why requests plus BeautifulSoup returns an empty shell

To see the problem instead of reading about it, fetch a client-rendered page the naive way and look at what comes back.

python

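import requests
from bs4 import BeautifulSoup

url = "https://example-shop.com/search?q=smartwatch"
# A plain GET returns only the HTML the server sends; no JavaScript runs.
html = requests.get(url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# On a client-rendered page these nodes are created later by scripts,
# so the selector finds nothing in the raw response.
products = soup.select("[data-product-title]")
print(f"found {len(products)} products")
# found 0 products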

Status code 200, a full-looking HTML document, and zero products. The reason is the page's lifecycle. The server sends a lightweight HTML skeleton: a few div mount points, some <script> tags, maybe a loading spinner. Only when those scripts execute does the browser call an API, receive the product data as JSON, and build the DOM nodes that hold it. The requests library does not run JavaScript. It downloads the skeleton and stops, so the product nodes never exist for BeautifulSoup to find.

Every approach below fixes this in one of two ways: either it lets the JavaScript run and parses the rendered result, or it goes straight to the data source the scripts read from. The approaches differ in how they get there and what that costs you in speed, infrastructure, and the odds of being blocked.

How to tell quickly

Right-click the page and choose "View Source" to see the raw HTML the server sent, which is exactly what requests gets. Then open dev tools and look at the Elements panel, which shows the live DOM after scripts run. If your target data appears in Elements but not in View Source, the page is client-rendered and a plain fetch will not work.
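The same check can be scripted. The sketch below assumes you have copied a piece of text that is visible in the browser (a product title, say) and have the raw response body as a string. The name looks_client_rendered is a hypothetical helper, not part of any library.

```python
import html


def looks_client_rendered(raw_html: str, visible_text: str) -> bool:
    """Return True when text seen in the browser is absent from the raw HTML.

    raw_html is the body a plain HTTP fetch returned; visible_text is a string
    copied from the rendered page. Hypothetical helper for illustration only.
    """
    # Decode entities such as &amp; so the comparison matches what a browser shows.
    haystack = html.unescape(raw_html).casefold()
    return visible_text.casefold() not in haystack


# Two hand-written documents standing in for "View Source" and "Elements".
skeleton = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
rendered = '<html><body><div id="root"><h2>Aero Fit Smartwatch 2</h2></div></body></html>'

print(looks_client_rendered(skeleton, "Aero Fit Smartwatch 2"))  # True
print(looks_client_rendered(rendered, "Aero Fit Smartwatch 2"))  # False
```

If the function returns False for a page you believed was client-rendered, the text may be sitting in an embedded script or JSON blob in the raw HTML, which is worth inspecting before you reach for a renderer.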

Approach 1: drive a real browser with Selenium or Playwright

The most direct fix is to use a tool that actually runs a browser. Selenium and Playwright both launch a real browser such as Chromium (headless or visible), load the URL, let the page's scripts run, and let you read the rendered DOM. Because a genuine browser engine executes the JavaScript, the data that was missing from a plain fetch is now present.

A minimal Playwright example looks like this:

python

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example-shop.com/search?q=smartwatch")
    page.wait_for_selector("[data-product-title]")
    html = page.content()
    browser.close()

soup = BeautifulSoup(html, "html.parser")
titles = [t.get_text(strip=True) for t in soup.select("[data-product-title]")]
print(titles)

The key line is wait_for_selector. Rather than guessing with a fixed sleep, you tell the browser to wait until the element you care about actually exists, which is both faster and more reliable. Selenium offers the same idea through its WebDriverWait and expected-conditions helpers.
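The idea behind waiting for a condition rather than sleeping can be sketched without any browser at all. The helper below is a hypothetical illustration of the polling pattern, not how Playwright or Selenium implement their waits internally; wait_until, probe, and the simulated find_product_title are names invented for this example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(
    probe: Callable[[], Optional[T]],
    timeout: float = 10.0,
    interval: float = 0.25,
) -> T:
    """Call probe repeatedly until it returns a non-None value or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result is not None:
            return result  # stop as soon as the condition holds
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Simulated page: the element "appears" on the third check.
checks = {"count": 0}


def find_product_title() -> Optional[str]:
    checks["count"] += 1
    return "Aero Fit Smartwatch 2" if checks["count"] >= 3 else None


title = wait_until(find_product_title, timeout=2.0, interval=0.01)
print(title, "after", checks["count"], "checks")
# Aero Fit Smartwatch 2 after 3 checks
```

The wait returns the moment the element exists, so a fast page costs almost nothing, while a slow page gets the full timeout before the run gives up with a clear error.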

This approach works, and it is the right tool when you need to click, scroll, fill forms, or step through multi-page flows. But it carries real costs. Each browser instance can use hundreds of megabytes of RAM and a substantial share of a CPU core, so running many in parallel is expensive. Setup is fiddly: you manage browser binaries, driver versions, and a brittle dependency chain. And rendering alone does not make you invisible. A headless browser on a datacenter IP, with a default automation fingerprint, can be flagged and blocked by serious anti-bot systems much as a raw request is. Rendering solves the JavaScript problem; it does nothing for the detection problem. For a fuller comparison of the engines, see choosing a headless browser for web scraping and this walkthrough of scraping dynamic content with Selenium and BeautifulSoup.

Approach 2: skip the browser and call the underlying API

Here is the insight most tutorials miss. When a client-rendered page builds itself, it almost always fetches its data from a backend JSON endpoint. If you can find that endpoint, you can call it directly and skip rendering entirely, getting clean structured JSON with no browser at all.

To find it, open dev tools, go to the Network tab, filter to Fetch/XHR, and reload the page. You are looking for a request whose response contains your data, usually a URL with /api/, /graphql, or a query-heavy path. Once you spot it, replicate it in Python.

python

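import requests

# Endpoint and parameters as observed in the Network tab.
api = "https://example-shop.com/api/search"
params = {"q": "smartwatch", "page": 1}
headers = {"Accept": "application/json"}

response = requests.get(api, params=params, headers=headers, timeout=30)
response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
data = response.json()
for item in data["results"]:
    print(item["title"], item["price"])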

When this works, it is by far the most efficient option: no browser overhead, structured data instead of HTML you have to parse, and built-in pagination through the API's own parameters. It is always worth ten minutes in the Network tab before you reach for anything heavier.
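Pagination through the API's own parameters might look like the sketch below. It assumes the endpoint returns a "results" list and signals the end with an empty page, as in the example above; a real endpoint may use a cursor or a total count instead. The names iter_items and fake_fetch are hypothetical, and the fetch step is passed in as a function so the loop can be tried without a network call.

```python
from typing import Callable, Dict, Iterator, List


def iter_items(fetch_page: Callable[[int], Dict], max_pages: int = 50) -> Iterator[Dict]:
    """Yield items page by page until a page comes back empty.

    fetch_page(page_number) must return the decoded JSON for that page.
    The "results" key is an assumption about the endpoint's response shape.
    """
    for page in range(1, max_pages + 1):
        payload = fetch_page(page)
        results: List[Dict] = payload.get("results", [])
        if not results:
            break  # an empty page is treated as the end of the list
        yield from results


# Stand-in for the real endpoint: two pages of data, then an empty page.
fake_pages = {
    1: {"results": [{"title": "Aero Fit Smartwatch 2", "price": "$129.00"}]},
    2: {"results": [{"title": "Pulse Sport Band Pro", "price": "$89.99"}]},
    3: {"results": []},
}


def fake_fetch(page: int) -> Dict:
    return fake_pages.get(page, {"results": []})


titles = [item["title"] for item in iter_items(fake_fetch)]
print(titles)
# ['Aero Fit Smartwatch 2', 'Pulse Sport Band Pro']
```

Against a live endpoint, fetch_page would wrap the requests.get call shown earlier with the page number substituted into params. The max_pages cap keeps a misbehaving endpoint from looping forever.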

The catch is that it does not always work. The endpoint may require a signed token, a session cookie, or a specific header set that the page generates dynamically. It may be protected by the same anti-bot layer as the page itself. And it can change without notice, since an internal API carries no stability promise. When the API is reachable, take it. When it is locked down, you are back to needing a rendered page, which brings us to the third approach.
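One way to notice a locked-down endpoint early is to check the response before trusting it as JSON. The sketch below is a hypothetical helper, decode_api_response, that takes the status code, Content-Type header, and body text as plain values; the interpretation of each failure in the comments is a common pattern, not a guarantee about any particular site.

```python
import json
from typing import Any, Optional


def decode_api_response(status_code: int, content_type: str, body: str) -> Optional[Any]:
    """Return parsed JSON, or None when the response is not usable JSON."""
    if status_code != 200:
        return None  # 401/403 often means a token, cookie, or header is missing
    if "json" not in content_type.lower():
        return None  # an HTML body here is commonly a challenge or login page
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return None  # truncated or malformed payload


print(decode_api_response(200, "application/json; charset=utf-8", '{"results": []}'))
# {'results': []}
print(decode_api_response(403, "text/html", "<html>Access denied</html>"))
# None
```

With requests, the three arguments would come from response.status_code, response.headers.get("Content-Type", ""), and response.text. A None result is the signal to fall back to a rendered page.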

Approach 3: render through the Crawling API and parse the result

The two previous approaches each solve half the problem. A headless browser renders but does not hide you. A direct API call is clean but often blocked. What you usually want is both at once: a real browser that executes the page's JavaScript, sitting behind an IP the site reads as a genuine visitor, returning finished HTML in a single call so your Python stays simple.

That is what the Crawling API does. You send it a URL with a JavaScript token, it loads the page in a real browser on its side, rotates through residential IPs server-side, and hands back the fully rendered HTML. You never run a browser fleet or maintain a proxy pool; you make one HTTP request and parse the response with the same BeautifulSoup you already know.

Why the JS token

Crawlbase offers two token types. The normal token fetches static HTML; the JavaScript (JS) token renders the page in a real browser first. For a client-rendered target you need the JS token, otherwise you get back the same empty shell a plain fetch returns and there is nothing to parse.

Install the official client and BeautifulSoup, then fetch the rendered page.

bash

python -m venv scraper_env
source scraper_env/bin/activate

pip install crawlbase beautifulsoup4

On Windows, activate the environment with scraper_env\Scripts\activate instead of the source line. Now fetch the page with the JS token and the two wait options that matter for client-rendered content.

python

from crawlbase import CrawlingAPI

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_JS_TOKEN"})

def crawl(page_url):
    options = {"ajax_wait": "true", "page_wait": 5000}
    response = api.get(page_url, options)
    if response["status_code"] == 200:
        return response["body"].decode("utf-8")
    print(f"Request failed: {response['status_code']}")
    return None

if __name__ == "__main__":
    page_url = "https://example-shop.com/search?q=smartwatch"
    html = crawl(page_url)
    print(html[:500] if html else "No HTML returned")

The two wait options do the work for a client-rendered target. ajax_wait tells the API to wait for asynchronous requests to settle before capturing the page, and page_wait holds for a fixed number of milliseconds after load so late-rendering elements appear. Five seconds is a sensible start; raise it if your fields come back empty. Run this and you should see real markup in the first 500 characters, not the skeleton a plain fetch returns. That confirms rendering works before you write a single selector.
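Raising page_wait when fields come back empty can be automated. The sketch below assumes a variant of crawl that accepts the options dictionary as a second argument and returns the HTML or None; crawl_with_longer_waits, fake_fetch, and the specific wait values are illustrative choices, so check the API's documented limits before settling on real numbers.

```python
from typing import Callable, Optional, Sequence


def crawl_with_longer_waits(
    fetch: Callable[[str, dict], Optional[str]],
    page_url: str,
    marker: str,
    waits_ms: Sequence[int] = (5000, 8000, 12000),
) -> Optional[str]:
    """Retry with a larger page_wait until marker appears in the HTML.

    marker is any string that only exists once the data has rendered,
    for example an attribute name used by the product cards.
    """
    for wait in waits_ms:
        options = {"ajax_wait": "true", "page_wait": wait}
        html = fetch(page_url, options)
        if html and marker in html:
            return html
    return None  # every attempt returned a skeleton or failed


# Stand-in for the real API call: content only "renders" with a wait of 8 s or more.
calls = []


def fake_fetch(url: str, options: dict) -> Optional[str]:
    calls.append(options["page_wait"])
    if options["page_wait"] >= 8000:
        return '<div class="product-card"><span data-product-title>Aero Fit Smartwatch 2</span></div>'
    return '<div id="root"></div>'


html = crawl_with_longer_waits(
    fake_fetch, "https://example-shop.com/search?q=smartwatch", "data-product-title"
)
print(calls)
# [5000, 8000]
```

Each retry is a separate request, so keep the list of waits short and start from the smallest value that usually works.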

Parse the rendered HTML with BeautifulSoup

Once crawl returns rendered HTML, the parsing step is ordinary BeautifulSoup, because the JavaScript has already run server-side and the data nodes are present. Guard each field access with a None check so one missing element does not crash the run.

python

import json
from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_JS_TOKEN"})

def crawl(page_url):
    options = {"ajax_wait": "true", "page_wait": 5000}
    response = api.get(page_url, options)
    if response["status_code"] == 200:
        return response["body"].decode("utf-8")
    return None

def parse_products(html):
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for card in soup.select("div.product-card"):
        title = card.select_one("[data-product-title]")
        price = card.select_one("span.price")
        items.append({
            "title": title.get_text(strip=True) if title else None,
            "price": price.get_text(strip=True) if price else None,
        })
    return items

def main():
    url = "https://example-shop.com/search?q=smartwatch"
    html = crawl(url)
    if not html:
        return
    products = parse_products(html)
    print(json.dumps(products, indent=2))

if __name__ == "__main__":
    main()

Save the script as scraper.py, run it with python scraper.py, and you get a clean structured list, ready to write to JSON, CSV, or a database.

json

[
  {
    "title": "Aero Fit Smartwatch 2",
    "price": "$129.00"
  },
  {
    "title": "Pulse Sport Band Pro",
    "price": "$89.99"
  }
]
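Writing that list to CSV takes only the standard library. The sketch below assumes the same two fields, title and price, that parse_products emits; write_products_csv is a hypothetical helper, and it writes to any text stream so the example can run against an in-memory buffer.

```python
import csv
import io
from typing import Dict, List, Optional, TextIO


def write_products_csv(products: List[Dict[str, Optional[str]]], out: TextIO) -> None:
    """Write product dicts as CSV rows with a header line."""
    writer = csv.DictWriter(out, fieldnames=["title", "price"])
    writer.writeheader()
    for row in products:
        # Missing fields (None) are written as empty cells.
        writer.writerow({"title": row.get("title") or "", "price": row.get("price") or ""})


products = [
    {"title": "Aero Fit Smartwatch 2", "price": "$129.00"},
    {"title": "Pulse Sport Band Pro", "price": None},
]

buffer = io.StringIO()
write_products_csv(products, buffer)
print(buffer.getvalue())
```

To write a file instead, open it with open("products.csv", "w", newline="", encoding="utf-8") and pass the file object as out; the newline="" argument is what the csv module expects so that row endings are not doubled on Windows.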

Selectors drift

Class names and data attributes change as sites redesign, so a selector that worked last month can return nothing today. When a field comes back as None, re-inspect the live page in dev tools and update the selector. Periodic selector maintenance is normal for any production scraper, not a sign something is broken.
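A cheap way to catch drift is to measure how often each field comes back as None after a run. The sketch below is one possible check, using a hypothetical helper named missing_field_rates on the list of dictionaries that parse_products returns; what rate counts as alarming is a judgment call for your own data.

```python
from typing import Dict, List, Optional


def missing_field_rates(items: List[Dict[str, Optional[str]]]) -> Dict[str, float]:
    """Return, per field, the fraction of items where the value is None."""
    if not items:
        return {}
    fields = sorted({key for item in items for key in item})
    return {
        field: sum(item.get(field) is None for item in items) / len(items)
        for field in fields
    }


# Titles parse fine, but the price selector has stopped matching.
items = [
    {"title": "Aero Fit Smartwatch 2", "price": None},
    {"title": "Pulse Sport Band Pro", "price": None},
]

rates = missing_field_rates(items)
print(rates)
# {'price': 1.0, 'title': 0.0}
```

A field that is missing on every item points at a broken selector rather than genuinely absent data, which tells you where to look in dev tools first.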

Common pitfalls when scraping JavaScript pages

A few issues account for most failed runs against client-rendered targets. Knowing them in advance saves a lot of debugging.

Capturing too early. The most frequent mistake is parsing before the content exists. In a scripted browser, wait for a specific selector rather than a blind fixed delay. With the Crawling API, combine ajax_wait with a page_wait long enough for the slowest element to appear.
Content behind interaction. Some data only appears after a scroll, a tab click, or a "load more" press. A direct fetch or a single render will not trigger that. This is where a browser you script step by step, or rendering with a scroll instruction, earns its cost.
Lazy-loaded and paginated lists. Infinite-scroll pages load in chunks as you scroll. Either drive the scroll in a browser or, better, find the paginated API behind it and request each page directly.
Getting blocked despite rendering. Rendering is not stealth. A datacenter IP or an obvious automation fingerprint still gets challenged. Residential IP rotation is what actually keeps a run alive at volume.

Choosing an approach

There is no single right tool, only the right tool for the job in front of you.

Reach for a direct API call first. If the Network tab reveals an open JSON endpoint, it is the cleanest and fastest path, with no rendering overhead at all. Always check before doing anything heavier.

Use a scripted browser when you need interaction. Logins, multi-step forms, clicks, and scroll-triggered content all call for Selenium or Playwright, where you control the session step by step. Accept the memory and setup cost as the price of that control.

Use a rendering API when you need finished HTML at scale without getting blocked. When the job is "fetch many JavaScript pages reliably and parse them," the Crawling API removes the two hardest parts, running browsers and rotating IPs, and leaves you with one HTTP call plus BeautifulSoup. If you would rather route your own browser traffic through a rotating pool, the Smart AI Proxy (also called the AI Proxy) gives you residential rotation as a drop-in proxy endpoint. For a broader tour of these patterns, see how to crawl JavaScript websites.
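The order of those checks can be written down as a small function. This is only a restatement of the guidance above in code form; choose_approach and its two boolean inputs are hypothetical, and real projects will weigh more factors than these.

```python
def choose_approach(has_open_json_api: bool, needs_interaction: bool) -> str:
    """Return the first approach that fits, in the order the guide recommends."""
    if has_open_json_api:
        return "direct API call"  # cheapest path, so it is checked first
    if needs_interaction:
        return "scripted browser"  # clicks, logins, scroll-triggered content
    return "rendering API"  # finished HTML at scale, no interaction needed


print(choose_approach(has_open_json_api=True, needs_interaction=False))   # direct API call
print(choose_approach(has_open_json_api=False, needs_interaction=True))   # scripted browser
print(choose_approach(has_open_json_api=False, needs_interaction=False))  # rendering API
```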

Recap

Key takeaways

Plain fetches see only the skeleton. requests does not run JavaScript, so client-rendered data is missing from the HTML it downloads.
Three real fixes exist. Drive a real browser, call the underlying JSON API directly, or render through an API that returns finished HTML.
Check for an open API first. A direct JSON endpoint is the fastest, cleanest path when it is reachable, with zero rendering cost.
Rendering is not stealth. A headless browser on a datacenter IP still gets blocked; residential IP rotation is what keeps a run alive.
The Crawling API folds both together. A JS token renders the page behind a trusted IP in one call; ajax_wait and page_wait control how long it waits before BeautifulSoup parses the result.

Frequently Asked Questions (FAQs)

Why does requests return no data on a JavaScript page?

Because requests downloads only the HTML the server sends and never executes JavaScript. A client-rendered page ships a thin skeleton and then builds its real content in the browser by calling an API after load. Since that step never happens in a plain fetch, the data nodes do not exist when BeautifulSoup parses the response, so your selectors match nothing.

Do I always need a headless browser to scrape JavaScript pages with Python?

No. A headless browser is one option, but it is often the heaviest. Before launching Selenium or Playwright, open the Network tab and look for the JSON endpoint the page calls. If it is reachable, calling it directly with requests is faster and cleaner. Reach for a browser, or a rendering API, only when no open endpoint is available or the data requires interaction.

What is the difference between ajax_wait and page_wait?

ajax_wait tells the Crawling API to wait until the page's asynchronous (XHR/fetch) requests have settled before capturing the HTML, which is what fills in client-rendered data. page_wait adds a fixed delay in milliseconds after load, giving late-rendering elements extra time to appear. Use both for client-rendered targets, and raise page_wait if fields come back empty.

Why does my headless browser still get blocked?

Because rendering and stealth are separate problems. Running a real browser solves the JavaScript execution problem, but the request still comes from a recognizable IP and automation fingerprint. Anti-bot systems flag datacenter IPs and default headless signatures regardless of rendering. Rotating residential IPs, which the Crawling API and Smart AI Proxy provide, is what addresses the blocking side.

Can I use BeautifulSoup with the Crawling API?

Yes, and that is the intended workflow. The Crawling API returns fully rendered HTML, so you parse it with BeautifulSoup exactly as you would any static page. The difference is that the JavaScript has already run server-side, so the data nodes your selectors target are present in the HTML you receive.

How do I scrape JavaScript pages that load more content on scroll?

Infinite-scroll pages load in chunks as the user scrolls, so a single fetch or render captures only the first batch. You have two options: script the scroll in Selenium or Playwright and wait for each batch, or find the paginated API the scroll triggers in the Network tab and request each page directly. The direct-API route is usually faster and more reliable when the endpoint is reachable.

Hassan Rehan

Software Engineer · Crawlbase

Software engineer at Crawlbase writing hands-on guides on rotating proxies, scraping, and the practical details of wiring proxies into real code.
