Playwright Web Scraping Guide

Playwright web scraping is a popular choice for anyone who needs to pull data from pages that only assemble themselves after JavaScript runs. Built by Microsoft, Playwright drives a real browser, waits for content the way a person would, and exposes one consistent API across Chromium, Firefox, and WebKit. That combination makes it far less brittle than the older automation tools most scrapers started on.

This guide is a hands-on walkthrough of Playwright web scraping in Python, with a short Node note where the API differs. You will install Playwright and its browsers, launch headless Chromium, navigate to a page, wait for the right selector, extract text and attributes, handle a "load more" interaction and pagination, take a screenshot, and capture a JSON network response. We close with the operational reality: Playwright still gets blocked at scale, and that is where a managed render-and-rotate service earns its place.

Why Playwright over older automation tools

If you have written scrapers with Selenium or raw Puppeteer, the first thing you notice in Playwright is that the flaky sleep() calls disappear. A few design decisions are responsible.

Auto-waiting. Before Playwright clicks or fills an element, it runs actionability checks, waiting until the element is visible, stable, enabled, and able to receive events. You stop sprinkling arbitrary delays through your code, and the resulting scrapers are dramatically more reliable on slow or animated pages.
Three browser engines, one API. The same script runs against Chromium, Firefox, or WebKit. When a site behaves differently in one engine, you switch with a one-word change instead of rewriting your driver setup.
Robust selectors. Beyond CSS and XPath, Playwright ships locators and text selectors that resolve lazily and re-query the DOM at action time, so they survive re-renders that would break a cached element handle.
Async by design. The API is built around async I/O, which makes it natural to run many pages in parallel within a single browser process when you scale up.
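The "many pages in parallel" point can be sketched roughly as follows. This is a minimal illustration, not a tuned crawler: the helper names gather_limited and scrape_titles and the default limit of 3 are assumptions made for the example, and Playwright is imported inside the function only so the concurrency helper can be reused and tested without a browser installed.

```python
import asyncio

async def gather_limited(urls, worker, limit=3):
    """Run worker(url) for every URL, with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def run(url):
        async with semaphore:
            return await worker(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(run(u) for u in urls))

async def scrape_titles(urls, limit=3):
    """Open each URL in its own page inside one browser and collect titles."""
    # Imported lazily so gather_limited stays usable without Playwright.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)

        async def title_of(url):
            page = await browser.new_page()
            try:
                await page.goto(url)
                return await page.title()
            finally:
                await page.close()

        try:
            return await gather_limited(urls, title_of, limit)
        finally:
            await browser.close()
```

Calling asyncio.run(scrape_titles([...])) with a list of URLs would drive the whole thing; the semaphore is what keeps a long URL list from opening hundreds of pages at once.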

For background on why a real browser is sometimes unavoidable, see headless browsers for web scraping. If you have an existing Selenium stack and want a side-by-side comparison, scraping dynamic content with Selenium and BeautifulSoup covers that path.

Prerequisites

You need three things before writing any code, and none of them take long.

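Python 3.8 or later. Confirm your version with python --version, and check the Playwright package page if you are on an older interpreter, since recent releases may require a newer minimum. Playwright also has a first-class Node.js binding if you prefer JavaScript; the concepts in this guide map one to one, and a short Node example appears later.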

Comfort with selectors. You should be able to open your browser dev tools, inspect an element, and read off a CSS selector. Extraction is mostly a selector exercise once the page has rendered.

A target you are allowed to scrape. Use a site whose terms permit it, keep to public data, and respect robots.txt and sensible rate limits. The techniques here are general; the responsibility for where you point them is yours.

Install Playwright and its browsers

Create a virtual environment so dependencies stay isolated, install the Playwright package, then run its installer to download the browser binaries. That second step is the one people forget; the pip package alone does not bundle the browsers.

bash

python -m venv pw_env
source pw_env/bin/activate

pip install playwright
playwright install chromium

On Windows, activate the environment with pw_env\Scripts\activate instead of the source line. The playwright install chromium command downloads a pinned Chromium build; pass no argument to fetch all three engines. If you ever see an error about a missing executable, it almost always means this install step was skipped.
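If you want to confirm the install before writing a scraper, a small check along these lines can help. It is a sketch under assumptions: the function names are made up for this example, and the broad except is deliberate because the goal is only to report whether Chromium starts.

```python
import importlib.util

def playwright_installed() -> bool:
    """True if the playwright package is importable in this environment."""
    return importlib.util.find_spec("playwright") is not None

def chromium_launches() -> bool:
    """True if headless Chromium starts; False if the package or binary is missing."""
    if not playwright_installed():
        return False
    from playwright.sync_api import sync_playwright

    try:
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            browser.close()
        return True
    except Exception as exc:
        # Most often this is the missing-executable error from a skipped install step.
        print(f"Chromium did not launch: {exc}")
        return False
```

If chromium_launches() returns False while playwright_installed() returns True, rerunning playwright install chromium is the first thing to try.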

Launch a browser and open a page

Start with the smallest useful script: launch headless Chromium, open a page, navigate to a URL, and read the title. The synchronous API keeps the first example readable; we move to async when it matters for scale.

python

from playwright.sync_api import sync_playwright

def main():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://quotes.toscrape.com/js/")
        print(page.title())
        browser.close()

if __name__ == "__main__":
    main()

A few notes on the choices here. headless=True runs without a visible window, which is what you want for unattended jobs; flip it to False while developing so you can watch the browser work. The chosen URL is a deliberately JavaScript-rendered demo page: the quotes only appear after a script runs, which is exactly the case where a plain HTTP request returns an empty container and Playwright shines.

Context vs page

For anything beyond a one-off, create a browser context with browser.new_context() before new_page(). A context is an isolated session with its own cookies, storage, and user agent, so you can run several independent pages without their state leaking into each other. Calling browser.new_page() directly, as above, creates a fresh context behind the scenes that lives and dies with that page, which is fine for a single page.
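As a rough sketch of that isolation, the helper below opens each page in its own context. The names open_isolated_pages and demo_isolation, the cookie values, and the example.com URL are all illustrative assumptions; new_context, add_cookies, and cookies are the Playwright calls doing the work.

```python
def open_isolated_pages(browser, count, **context_options):
    """Open `count` pages, each inside its own fresh browser context."""
    pages = []
    for _ in range(count):
        # Options such as locale or user_agent are passed straight to new_context.
        context = browser.new_context(**context_options)
        pages.append(context.new_page())
    return pages

def demo_isolation():
    """Set a cookie in one context and print the cookie count of both."""
    # Imported lazily so open_isolated_pages can be reused without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        first, second = open_isolated_pages(browser, 2)
        first.context.add_cookies(
            [{"name": "session", "value": "a", "url": "https://example.com"}]
        )
        # The second context should not see the cookie set on the first.
        print(len(first.context.cookies()), len(second.context.cookies()))
        browser.close()
```

One context per logical session (per account, per region, per proxy) is a common way to split things up.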

Wait for a selector, then extract text and attributes

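This is the heart of Playwright web scraping. Instead of guessing how long the page needs, you wait for the specific element that signals the data is present, then read it. Playwright auto-waits before actions such as clicks, but a bulk read of many elements does not wait for anything, so the example below waits explicitly for the first match and then queries the rest.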

python

from playwright.sync_api import sync_playwright

def scrape_quotes(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)

        # Block until the JavaScript has rendered at least one quote.
        page.wait_for_selector("div.quote")

        results = []
        for quote in page.query_selector_all("div.quote"):
            text = quote.query_selector("span.text").inner_text()
            author = quote.query_selector("small.author").inner_text()
            # query_selector returns None when nothing matches, so guard the link.
            anchor = quote.query_selector("a")
            link = anchor.get_attribute("href") if anchor else None
            results.append({"text": text, "author": author, "link": link})

        browser.close()
        return results

The line that matters most is page.wait_for_selector("div.quote"). It blocks until at least one quote element is present and visible (the default wait state), which means the JavaScript has run and the data is there. After that, query_selector_all returns every matching element, and inner_text() and get_attribute() pull text and attributes respectively. Reading the href off the anchor shows the attribute case; reading the quote and author shows the text case. No fixed sleeps anywhere.
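The same extraction can also be written with locators, which re-query the DOM each time they are used. The sketch below assumes a reasonably recent Playwright release (Locator.all() is not available in older ones) and reuses the selectors from the example above; the function name extract_quotes is just for illustration.

```python
def extract_quotes(page):
    """Locator-based variant: wait for the first quote, then read every quote."""
    quotes = page.locator("div.quote")
    # wait_for() defaults to waiting until the element is visible.
    quotes.first.wait_for()

    rows = []
    # all() takes a snapshot of the current matches; it does not wait by itself.
    for quote in quotes.all():
        links = quote.locator("a")
        rows.append({
            "text": quote.locator("span.text").inner_text(),
            "author": quote.locator("small.author").inner_text(),
            # .first avoids a strictness error when a quote holds several links.
            "link": links.first.get_attribute("href") if links.count() else None,
        })
    return rows
```

It takes an already-navigated page, so it slots into the script above in place of the query_selector loop.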

Handle "load more" clicks and pagination

Real targets rarely show everything at once. Two patterns cover most of them: a "load more" button that appends content in place, and numbered or "next" pagination that swaps the page. Playwright handles both because it can click and then wait for the result.

For an in-place "load more" button, click it in a loop until it disappears, waiting after each click for new content to settle.

python

def load_all(page):
    while True:
        button = page.query_selector("button.load-more")
        if not button or not button.is_visible():
            break
        button.click()
        page.wait_for_load_state("networkidle")

For classic pagination, follow the "next" link until it is gone, scraping each page as you go. Because the loop queries the page afresh on every iteration, no element handle is carried across a navigation, so stale handles are not a concern.

python

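def scrape_all_pages(page, url):
    page.goto(url)
    rows = []
    while True:
        # Wait for this page's quotes to render before reading them.
        page.wait_for_selector("div.quote")
        for q in page.query_selector_all("div.quote span.text"):
            rows.append(q.inner_text())
        # Look up the "next" link on every page; it is absent on the last one.
        next_link = page.query_selector("li.next a")
        if not next_link:
            break
        next_link.click()
        # Let the new document finish loading before the next iteration.
        page.wait_for_load_state()
    return rows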

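Note wait_for_load_state("networkidle") in the first snippet: it waits until there have been no network connections for at least 500 ms, a good signal that lazily loaded content has arrived. Use it after actions that trigger background fetches, but expect it to stall on pages that poll or stream continuously; waiting for a specific element is the safer signal there.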
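On pages where the network never goes quiet, one alternative for the load-more loop is to wait for the item count to grow instead. The sketch below is an assumption-laden variant: the function name, the selectors you pass in, the 10 second timeout, and the max_clicks cap are all illustrative choices, and the broad except stands in for Playwright's timeout error.

```python
def load_all_items(page, item_selector, button_selector, max_clicks=50):
    """Click a load-more button until it vanishes, stops adding items, or the cap is hit."""
    items = page.locator(item_selector)
    button = page.locator(button_selector)
    clicks = 0
    while clicks < max_clicks and button.is_visible():
        before = items.count()
        button.click()
        clicks += 1
        try:
            # Resolve as soon as more items exist than before the click.
            page.wait_for_function(
                "([sel, n]) => document.querySelectorAll(sel).length > n",
                arg=[item_selector, before],
                timeout=10_000,
            )
        except Exception:
            # Nothing new arrived within the timeout; assume the list is complete.
            break
    return items.count()
```

A call such as load_all_items(page, "div.quote", "button.load-more") returns the final item count, which is handy to log. The cap guards against a button that never disappears.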

Take a screenshot

Screenshots are useful for debugging a scraper that returns empty results and for archiving what a page looked like at capture time. Playwright captures the visible viewport by default, or the full scrollable page with one flag.

python

page.screenshot(path="page.png", full_page=True)

When a run comes back with no data, a full-page screenshot taken right before extraction usually tells you why in seconds: a cookie wall, a CAPTCHA, or a block page sitting where your content should be.
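That habit can be wrapped in a small helper, sketched below. The function name, the debug_shots directory, and the file naming scheme are assumptions for the example; page.screenshot with full_page=True is the same call shown above.

```python
from datetime import datetime, timezone
from pathlib import Path

def snapshot_if_empty(page, rows, directory="debug_shots"):
    """Save a full-page screenshot when extraction returned nothing.

    Returns the screenshot path, or None when rows is non-empty.
    """
    if rows:
        return None
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    # A UTC timestamp keeps files from successive runs apart.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = out_dir / f"empty_{stamp}.png"
    page.screenshot(path=str(path), full_page=True)
    return path
```

Call it right after extraction, before closing the browser, so the image reflects what the scraper actually saw.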

Capture network and JSON responses

Often the cleanest data is not in the HTML at all but in a JSON API the page calls in the background. Rather than parse rendered markup, you can listen to network responses and grab that JSON directly. This is faster and far less fragile than scraping the DOM, because the API shape changes less often than the layout.

python

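captured = []

def on_response(response):
    # Keep only successful responses from the endpoint of interest.
    if "/api/" in response.url and response.ok:
        try:
            captured.append(response.json())
        except Exception:
            # The body was not valid JSON (or was unavailable); skip it.
            pass

# Register the handler before navigating so early responses are not missed.
page.on("response", on_response)
page.goto("https://example.com/listings")
page.wait_for_load_state("networkidle")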

The page.on("response", ...) hook fires for every network response. Filtering by URL fragment isolates the calls you care about, and response.json() parses the body for you. Open the Network tab in dev tools first to find which endpoint carries the data, then match it here. If a site is heavy on these XHR calls, see how to scrape JavaScript pages with Python for more on the API-first approach.
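When you only need one specific call, waiting for that single response can be tidier than collecting everything. The sketch below uses page.expect_response with a predicate; the function names, the "/api/" fragment, the content-type check, and the 15 second timeout are assumptions for illustration.

```python
def is_json_api_response(response, fragment="/api/"):
    """True for a successful response whose URL contains fragment and whose body is JSON."""
    content_type = response.headers.get("content-type", "")
    return fragment in response.url and response.ok and "json" in content_type.lower()

def fetch_api_json(page, url, fragment="/api/", timeout=15_000):
    """Navigate to url and return the parsed body of the first matching response."""
    # Start waiting before navigating so the response cannot slip past.
    with page.expect_response(
        lambda r: is_json_api_response(r, fragment), timeout=timeout
    ) as response_info:
        page.goto(url)
    return response_info.value.json()
```

If no matching response arrives within the timeout, expect_response raises, which is usually a clearer failure than an empty list.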

The same script in Node.js

If your stack is JavaScript, the Node binding mirrors the Python one almost exactly. The method names match apart from camelCase, everything is promise-based, and you install browsers the same way with npx playwright install chromium.

javascript

const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto("https://quotes.toscrape.com/js/");
  await page.waitForSelector("div.quote");
  const texts = await page.$$eval("div.quote span.text", els => els.map(e => e.textContent));
  console.log(texts);
  await browser.close();
})();

The Python wait_for_selector becomes waitForSelector, and $$eval runs a function in the page to extract many elements at once. Pick whichever language your team already maintains; the scraping logic is identical.

The stealth reality: Playwright still gets blocked

Here is the part most tutorials skip. Driving a real browser solves rendering, but it does not make you invisible. Modern anti-bot systems look at far more than whether JavaScript runs. They fingerprint the browser, inspect TLS and HTTP/2 signatures, score behavioral signals, and rate-limit by IP. A vanilla headless Playwright run carries tells, and at any real volume the bigger problem is your IP: a handful of datacenter addresses hammering the same host gets flagged fast.

You can fight this. People add stealth plugins, randomize user agents and viewports, slow requests down, and wire in a proxy pool. Each helps, and each is a maintenance burden. Running a fleet of headless browsers is itself operational overhead: they are memory-hungry, they crash, they need a pinned browser version, and parallelizing them across machines is real infrastructure work. Doing all of that and keeping a healthy rotating proxy pool on top is, frankly, most of the job.
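To make those knobs concrete, here is a rough sketch of randomizing the user agent and viewport per context, pausing between requests, and attaching a proxy. Everything specific in it is a placeholder assumption: the user agent strings, the viewport sizes, the delay range, and the helper names. It illustrates the mechanics only and is not a claim that these settings defeat any particular anti-bot system.

```python
import random
import time

# Placeholder values for illustration; supply current, realistic ones yourself.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]
VIEWPORTS = [(1366, 768), (1536, 864), (1920, 1080)]

def random_context_options(rng=random):
    """Pick a user agent and viewport to pass to browser.new_context()."""
    width, height = rng.choice(VIEWPORTS)
    return {
        "user_agent": rng.choice(USER_AGENTS),
        "viewport": {"width": width, "height": height},
    }

def polite_pause(min_s=1.0, max_s=3.0, rng=random):
    """Sleep for a random interval between requests and return the delay used."""
    delay = rng.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

def new_randomized_context(browser, proxy_server=None):
    """Create a context with randomized options and, optionally, a proxy."""
    options = random_context_options()
    if proxy_server:
        # proxy_server is a hypothetical URL such as "http://proxy.example:8080".
        options["proxy"] = {"server": proxy_server}
    return browser.new_context(**options)
```

Each of these pieces is easy to write once; the maintenance cost described above comes from keeping the values current and the proxy pool healthy.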

For the deeper playbook on staying unblocked, see how to scrape websites without getting blocked.

Crawlbase Crawling API

When Playwright starts hitting blocks at scale, the Crawling API takes over the hard part. It renders the page in a real browser and routes the request through rotating residential IPs server-side, then hands you finished HTML or parsed data in one call, so you skip running a headless fleet and a proxy pool yourself. You can still keep Playwright locally for the interaction-heavy flows that genuinely need a driver.

Where the managed API fits, and where Playwright still wins

This is not Playwright versus a managed API; it is knowing which tool fits which job. Reach for the Crawling API when your bottleneck is blocks, CAPTCHAs, or IP reputation, when you are crawling many pages and do not want to operate browser and proxy infrastructure, or when you just need rendered HTML back reliably and at volume. Because rendering and rotation happen server-side, you make a simple request and parse the result, with no fleet to babysit.

Keep Playwright local when the task is genuinely interactive: multi-step forms, authenticated flows behind a login you control, drag-and-drop, file uploads, or anything where you need to script a precise sequence of user actions and watch the result. The two compose well. Many teams prototype and handle interaction-heavy flows in Playwright, then route their high-volume fetch traffic through the managed API once blocks become the limiting factor. If you want IP rotation as a drop-in endpoint while keeping your own browser, the Smart AI Proxy gives you residential rotation behind a standard proxy interface.

Recap

Key takeaways

Playwright fixes the flakiness. Auto-waiting, three browser engines behind one API, lazy locators, and async make it more reliable than older drivers for rendered pages.
Wait, then extract. Use wait_for_selector to confirm the data has rendered, then read text with inner_text() and attributes with get_attribute().
Click and paginate natively. Loop a "load more" button until it disappears, or follow a "next" link, waiting on networkidle after actions that fetch.
Grab the JSON when you can. Listening on page.on("response", ...) for a background API call is faster and less fragile than parsing the DOM.
Rendering is not stealth. Playwright still gets fingerprinted and IP-blocked at scale; a managed render-and-rotate API removes the fleet and proxy overhead, while Playwright stays ideal for interaction-heavy local flows.

Frequently Asked Questions (FAQs)

Is Playwright good for web scraping?

Yes. Playwright drives a real browser, so it handles JavaScript-rendered pages that a plain HTTP request cannot. Its auto-waiting removes most of the timing flakiness that plagues older tools, it supports Chromium, Firefox, and WebKit through one API, and its lazy locators survive re-renders. For interaction-heavy or client-rendered targets it is one of the strongest options available.

Should I use Playwright with Python or Node.js?

Either works; the API is nearly identical across both. Method names differ only in casing (wait_for_selector in Python becomes waitForSelector in Node), and both install browsers with a single command. Pick the language your team already maintains so the scraper fits the rest of your stack.

How do I wait for content to load in Playwright?

Wait for the specific element that signals the data is present with page.wait_for_selector("your.selector"), which blocks until that element is present and visible. For background fetches triggered by a click, use page.wait_for_load_state("networkidle") to wait until network activity quiets down. Avoid fixed sleeps; Playwright's auto-waiting and these explicit waits are more reliable.

Can you get blocked while scraping with Playwright?

Yes. Running a real browser solves rendering but not detection. Anti-bot systems fingerprint the browser, inspect network signatures, and rate-limit by IP, so vanilla headless runs get flagged and datacenter IPs get blocked at volume. Slowing down, randomizing fingerprints, and rotating residential IPs all help; a managed Crawling API folds rendering and rotation together so you do not maintain that stack yourself.

How do I capture API or JSON data with Playwright?

Attach a handler with page.on("response", ...), filter responses by URL fragment to find the endpoint that carries the data, and call response.json() on it. Use the Network tab in your browser dev tools to identify the right call first. Reading the underlying JSON is faster and far less brittle than parsing rendered HTML.

When should I use a Crawling API instead of Playwright?

Switch to the Crawling API when blocks, CAPTCHAs, or IP reputation become your bottleneck, or when you are crawling many pages and do not want to run browser and proxy infrastructure. It renders and rotates IPs server-side and returns finished HTML in one call. Keep Playwright for genuinely interactive local flows like authenticated multi-step forms, and route high-volume fetch traffic through the API.

Muhammad Atif

Senior Full Stack Developer · Crawlbase

Senior full stack developer at Crawlbase, building the platform and writing about scraping architecture, proxies, and data pipelines.

Ian Kalvin

Technical Support Engineer · Crawlbase

Technical support engineer at Crawlbase, writing from the front line of what actually breaks in production scraping and proxy setups.
