A headless browser is a real browser engine that runs without a visible window: it loads pages, runs JavaScript, applies CSS, and builds the same DOM Chrome or Firefox would, but it does it in the background under your script's control. For headless browser web scraping, that matters because so much of the modern web only exists after JavaScript runs. A plain HTTP request hands you the initial HTML shell; a headless browser hands you the page a human actually sees.
This guide is a practical, runnable walkthrough. You will spin up a modern headless stack in Node (Puppeteer, then Playwright), load a JavaScript-heavy page, wait for the right content, extract structured data, and capture a screenshot. Then we get honest about where this approach hurts at scale, and show the one-call alternative: rendering a page server-side through the Crawlbase Crawling API with a JavaScript token.
What a headless browser actually is
A normal browser draws pixels to a screen. A headless browser skips the visible UI but keeps everything underneath: the JavaScript engine, the layout engine, the network stack, cookies, and the full DOM. You drive it programmatically instead of clicking, so it is ideal for automated testing, generating screenshots, and scraping pages that build themselves in the client.
Modern headless Chrome and Firefox ship the same rendering code as their visible counterparts, so a page behaves the same way it would for a real visitor. That fidelity is the whole point: when a site loads its content with fetch calls after the initial response, only something that runs that JavaScript will ever see the data.
Why JavaScript-heavy sites break plain HTTP scrapers
If you request a single-page app or an infinite-scroll listing with a bare HTTP client, you usually get status 200 and a near-empty body. The markup you want is not in that response. It gets injected after the browser runs the page's scripts, makes its XHR or fetch calls, and renders the result into the DOM.
Tools like Cheerio or Beautiful Soup parse whatever HTML you give them, but they cannot run JavaScript, so they only see that empty shell. A headless browser closes the gap: it executes the page exactly like a real visitor's browser, then lets you read the finished DOM. For static, server-rendered pages you do not need this overhead, but for anything client-rendered it is the difference between data and an empty array.
Reach for a headless browser when the content you want appears only after scripts run, when you need to click or scroll to reveal more, or when you want a screenshot. For static HTML that already contains your data, a plain HTTP fetch plus a parser is faster and cheaper. Match the tool to the page, not the other way around.
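One quick way to make that call is to fetch the raw HTML once and check whether your data is already in it. The sketch below is a rough heuristic under stated assumptions, not a definitive test: it assumes Node 18+ (for the built-in fetch) and that you know a marker string, such as a class name, that surrounds your target content. The helper names `rawHtmlContains` and `needsBrowser` are hypothetical.

```javascript
// Rough heuristic: does the raw (unrendered) HTML already contain the
// marker we expect around our data, e.g. class="quote"?
function rawHtmlContains(html, marker) {
  return html.includes(marker)
}

// Fetch the page without running any JavaScript and report whether a
// headless browser is probably needed. Uses the global fetch in Node 18+.
async function needsBrowser(url, marker) {
  const res = await fetch(url)
  const html = await res.text()
  return !rawHtmlContains(html, marker)
}

// Local demo with two hand-written snippets (no network involved).
const serverRendered = '<div class="quote"><span class="text">Hi</span></div>'
const clientShell = '<div id="app"></div><script src="/bundle.js"></script>'

console.log(rawHtmlContains(serverRendered, 'class="quote"')) // true
console.log(rawHtmlContains(clientShell, 'class="quote"')) // false
```

Treat a positive result as a hint only: a marker can also show up inside inline script source, in which case the string is present but the element is not rendered yet.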
Set up the project
You need Node.js (version 18 or newer) and npm installed. Confirm both, create a project, and install Puppeteer. Puppeteer downloads a compatible Chrome build for you on install, so there is nothing else to configure.
    node --version
    npm --version
    mkdir headless-scraper && cd headless-scraper
    npm init -y
    npm install puppeteer
One thing worth knowing up front: a headless browser is heavy. Each instance is a full Chrome process with its own memory and CPU footprint. That is fine on your laptop for one page at a time, and it becomes the central scaling problem the moment you want hundreds of pages in parallel. Hold that thought; we come back to it.
Launch a headless browser and load a page
The core loop with Puppeteer is always the same: launch the browser, open a new page, navigate to a URL, do your work, then close the browser so you do not leak processes. Here is the minimal version that loads a page and prints its title.
    const puppeteer = require('puppeteer')

    async function run() {
      const browser = await puppeteer.launch({ headless: true })
      const page = await browser.newPage()

      await page.goto('https://quotes.toscrape.com/js/', {
        waitUntil: 'networkidle2',
      })

      console.log(await page.title())
      await browser.close()
    }

    run().catch((err) => console.error(err))
Save the script as index.js and run it with node index.js. The target here, the JavaScript version of Quotes to Scrape, renders its quotes client-side on purpose, so it is a clean test bed: a plain fetch returns HTML with no rendered quote elements, while the headless browser sees real content. The waitUntil: 'networkidle2' option tells goto not to resolve until there have been no more than two open network connections for at least 500 ms, which is your first and bluntest waiting strategy.
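One detail the launch-and-close loop glosses over is what happens when the work in the middle throws: the close call is skipped and the Chrome process lingers. A minimal sketch of one way to guard against that is a small wrapper with try/finally. The helper name `withBrowser` is hypothetical, and it takes the launch function as an argument so it does not assume any particular library.

```javascript
// Run `task` with a browser and always close it, even if `task` throws.
// `launch` is any function that returns a promise for a browser-like object
// with a close() method, e.g. () => puppeteer.launch({ headless: true }).
async function withBrowser(launch, task) {
  const browser = await launch()
  try {
    return await task(browser)
  } finally {
    await browser.close()
  }
}

// Usage with Puppeteer (assumes `puppeteer` is installed and required):
//
// const title = await withBrowser(
//   () => puppeteer.launch({ headless: true }),
//   async (browser) => {
//     const page = await browser.newPage()
//     await page.goto('https://quotes.toscrape.com/js/')
//     return page.title()
//   },
// )
```

The error from the task still propagates to the caller; the wrapper only makes sure the browser is closed first.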
Wait for the right content, not a fixed timer
Waiting is where most headless scrapers go wrong. A fixed sleep is fragile: too short and you parse before the data arrives, too long and every run crawls. The better approach is to wait for a specific signal that the content you want is actually present.
Puppeteer gives you several options, in rough order of preference:
- waitForSelector blocks until a specific element appears in the DOM. This is the most reliable signal because it ties the wait to the data you care about.
- waitForFunction blocks until an arbitrary JavaScript condition is true, for example a list reaching a certain length. Use it when presence alone is not enough.
- waitUntil on goto (load, domcontentloaded, networkidle2) controls when navigation resolves. Good as a baseline, weak as your only guarantee.
Prefer waiting for a selector over a hard timer wherever you can. It is both faster on average and far more robust when the network is slow.
    await page.goto('https://quotes.toscrape.com/js/', {
      waitUntil: 'domcontentloaded',
    })

    // Block until the first quote is actually in the DOM.
    await page.waitForSelector('.quote')

    // Or wait for a richer condition: at least 10 quotes loaded.
    await page.waitForFunction(() => {
      return document.querySelectorAll('.quote').length >= 10
    })
Extract structured data from the rendered DOM
Once the content is present, page.evaluate runs a function inside the page's own context, where you have the full DOM and the standard browser APIs. Whatever you return is serialized back to your Node script. This keeps extraction simple: you write ordinary querySelectorAll code as if you were in the browser console.
    const quotes = await page.evaluate(() => {
      const cards = document.querySelectorAll('.quote')
      return Array.from(cards).map((card) => ({
        text: card.querySelector('.text').innerText.trim(),
        author: card.querySelector('.author').innerText.trim(),
        tags: Array.from(card.querySelectorAll('.tag')).map((t) => t.innerText),
      }))
    })

    console.log(quotes)
The result is a clean array of objects you can write to JSON, push to a database, or feed into a pipeline. A trimmed sample of the output looks like this:
    [
      {
        "text": "The world as we have created it is a process of our thinking.",
        "author": "Albert Einstein",
        "tags": ["change", "deep-thoughts", "thinking"]
      },
      {
        "text": "It is our choices that show what we truly are.",
        "author": "J.K. Rowling",
        "tags": ["abilities", "choices"]
      }
    ]
Capture a screenshot
One thing only a real rendering engine can give you is a faithful screenshot, which is useful for visual QA, archiving a page's state, or debugging a scrape that returned nothing. Puppeteer captures the viewport or the full scrollable page in one call.
    await page.screenshot({
      path: 'quotes.png',
      fullPage: true,
    })
If screenshots are the main thing you need at volume, running and maintaining a browser fleet just to take pictures is overkill. The Crawlbase Screenshots API renders the page server-side and returns the image directly, with no browser to manage on your side.
The same job in Playwright
Playwright, maintained by Microsoft, is the other modern choice. It drives Chromium, Firefox, and WebKit from one API, and its auto-waiting behavior makes a lot of the explicit waits above unnecessary: actions like click and locator reads wait for the element to be ready by default. The structure mirrors Puppeteer closely, so porting between them is straightforward.
    const { chromium } = require('playwright')

    async function run() {
      const browser = await chromium.launch({ headless: true })
      const page = await browser.newPage()

      await page.goto('https://quotes.toscrape.com/js/')
      await page.waitForSelector('.quote')

      const quotes = await page.$$eval('.quote .text', (els) =>
        els.map((el) => el.innerText.trim()),
      )

      console.log(quotes)
      await browser.close()
    }

    run().catch((err) => console.error(err))
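Playwright also offers a locator-based style, which some teams prefer over selector strings passed to $$eval. The sketch below shows one way the same extraction might look. It assumes it is handed a Playwright Page, and the function name `extractQuoteTexts` is hypothetical. It waits for the first match explicitly, on the assumption that reading all matches at once may return immediately with whatever is present.

```javascript
// Extract quote texts using Playwright locators.
// `page` is expected to be a Playwright Page, e.g. from browser.newPage().
async function extractQuoteTexts(page) {
  const texts = page.locator('.quote .text')

  // Wait until at least one matching element is attached and visible.
  await texts.first().waitFor()

  // Read innerText from every match, then tidy whitespace.
  const all = await texts.allInnerTexts()
  return all.map((t) => t.trim())
}

// Usage inside the run() function above, after page.goto(...):
//
// const quotes = await extractQuoteTexts(page)
// console.log(quotes)
```

Taking the page as a parameter keeps the extraction logic separate from browser setup, so the same function can be reused across pages.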
Both libraries are excellent. For the same ideas in a different language and tool, the guide on web scraping with Python and Selenium walks through an equivalent Selenium and Python build.
Where headless scraping hurts at scale
Everything above works beautifully for one page on your machine. The trouble starts when you need volume, and it shows up in two distinct ways.
First, resources. Every headless browser instance is a full Chrome process eating hundreds of megabytes of RAM. Running a handful in parallel is fine; running enough to scrape thousands of pages an hour means standing up a fleet, managing memory leaks and zombie processes, recycling crashed instances, and paying for the servers underneath. The browser that was a one-line launch call becomes infrastructure.
Second, anti-bot defenses. Serious commercial sites do not just render content; they actively look for automation. Default headless browsers leak signals (the navigator.webdriver flag, missing or odd browser fingerprints, datacenter IPs) that detection systems read instantly. You end up bolting on stealth plugins, rotating residential proxies so requests come from real-user IPs, and solving CAPTCHAs, and each of those is its own ongoing maintenance burden. The scraping itself stops being the hard part.
For the broader playbook on staying unblocked, see how to scrape websites without getting blocked. The short version: rendering is solvable on your own, but rendering reliably, at scale, from IPs a target trusts is a different and much larger problem.
The one-call alternative: Crawling API with a JS token
This is the pain point a managed API removes. Instead of running and hardening your own browser fleet, you send a URL to the Crawlbase Crawling API with a JavaScript token. The API renders the page in a real browser on its side, behind a rotating pool of trusted residential IPs, and returns the finished HTML for you to parse. Rendering and the IP problem collapse into a single request.
Install the client and make one call. Sign up for a Crawlbase account, grab your JavaScript token from the dashboard, and drop it in where you see YOUR_CRAWLBASE_JS_TOKEN.
npm install crawlbase cheerio
    const { CrawlingAPI } = require('crawlbase')
    const cheerio = require('cheerio')

    const api = new CrawlingAPI({ token: 'YOUR_CRAWLBASE_JS_TOKEN' })

    const options = {
      ajax_wait: true,
      page_wait: 5000,
    }

    async function scrape() {
      const response = await api.get('https://quotes.toscrape.com/js/', options)
      const $ = cheerio.load(response.body)

      const quotes = []
      $('.quote').each((i, el) => {
        quotes.push({
          text: $(el).find('.text').text().trim(),
          author: $(el).find('.author').text().trim(),
        })
      })

      console.log(quotes)
    }

    scrape().catch((err) => console.error(err))
The waiting strategies you learned with Puppeteer have direct equivalents here. The ajax_wait option tells the API to wait for asynchronous content before returning, and page_wait holds for a fixed number of milliseconds after load so late-rendering elements appear. For pages that reveal content behind a button, css_click_selector takes a URL-encoded CSS selector and clicks it after rendering, the same idea as a Puppeteer page.click followed by a wait.
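To show what "URL-encoded" means in practice, here is a sketch that builds a raw request URL by hand with these options. It rests on assumptions you should verify against the current Crawlbase documentation: the endpoint `https://api.crawlbase.com/` and passing options as query parameters are taken from the public docs as understood at the time of writing, `buildCrawlUrl` is a hypothetical helper, and `#load-more` is a made-up selector for illustration.

```javascript
// Build a Crawling API request URL by hand.
// ASSUMPTION: endpoint and parameter names should be checked against the
// current Crawlbase docs before relying on this.
function buildCrawlUrl(token, targetUrl, options = {}) {
  const pairs = Object.entries({ token, url: targetUrl, ...options }).map(
    // encodeURIComponent percent-encodes characters such as '#', ':' and '/'.
    ([key, value]) =>
      `${encodeURIComponent(key)}=${encodeURIComponent(String(value))}`,
  )
  return `https://api.crawlbase.com/?${pairs.join('&')}`
}

const requestUrl = buildCrawlUrl(
  'YOUR_CRAWLBASE_JS_TOKEN',
  'https://quotes.toscrape.com/js/',
  {
    ajax_wait: true,
    page_wait: 5000,
    css_click_selector: '#load-more', // hypothetical selector
  },
)

console.log(requestUrl)
```

If you use the client library instead of raw URLs, check whether it encodes option values for you, since encoding a selector twice would break it.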
Crawlbase offers two token types. The normal token fetches static HTML; the JavaScript (JS) token renders the page in a real browser first. For any client-rendered page, like the one above, you need the JS token. The normal token would return the same empty shell a plain fetch does.
Render JavaScript-heavy pages behind trusted residential IPs in a single call. The Crawling API takes a JS token, runs the page in a real browser server-side, rotates IPs for you, and returns finished HTML, so you skip running a headless fleet, a proxy pool, and a CAPTCHA stack yourself. Try it on the free tier first.
Which approach should you choose?
Both have a place, and the decision is mostly about volume and how hard the target defends itself.
Run your own headless browser when you need fine-grained control of a page: complex multi-step interactions, logging into your own accounts for testing, generating screenshots for a small set of pages, or scraping a handful of friendly sites where blocking is not a concern. The control is unmatched and the cost is low at small scale.
Reach for the managed Crawling API when you are scraping at volume, when the target actively blocks bots, or when you simply do not want to own browser and proxy infrastructure. If you need raw IP rotation without rendering, the Smart AI Proxy covers that; if you want parsed JSON for supported sites instead of raw HTML, the Crawling API handles extraction too. The point is to spend your time on the data, not on keeping a fleet alive.
Key takeaways
- A headless browser runs the full page. It executes JavaScript and builds the real DOM, so it sees content a plain HTTP fetch never will.
- Wait for a selector, not a timer. waitForSelector and waitForFunction tie the wait to the data you want and are far more robust than a fixed sleep.
- Extraction happens in page context. page.evaluate (or Playwright's $$eval) runs DOM code in the page and returns clean structured objects.
- Scale is the real cost. Browser fleets eat memory, and anti-bot defenses force stealth, proxies, and CAPTCHA handling on top.
- A JS token collapses both problems. The Crawling API renders server-side behind trusted IPs and returns finished HTML in one call.
- Match the tool to the job. DIY for control at small scale; a managed API for volume and hard targets.
Frequently Asked Questions (FAQs)
What is a headless browser in web scraping?
A headless browser is a real browser engine, such as Chrome or Firefox, that runs without a visible window. In web scraping it loads a page, runs its JavaScript, and builds the same DOM a human would see, which lets you extract content that only appears after scripts run. You drive it from code instead of clicking, so it is ideal for JavaScript-heavy sites that a plain HTTP request cannot read.
Should I use Puppeteer or Playwright for headless scraping?
Both are excellent and very similar. Puppeteer focuses on Chrome and Firefox and is simple to start with. Playwright drives Chromium, Firefox, and WebKit from one API and has stronger built-in auto-waiting, which removes a lot of manual wait code. Pick Playwright if you need cross-browser coverage or like its locator model; pick Puppeteer for a lean Chrome-only setup. The concepts in this guide apply to either.
What is the best way to wait for content in a headless browser?
Wait for a specific element rather than a fixed timer. Use waitForSelector to block until the element you want is in the DOM, or waitForFunction for a richer condition like a list reaching a certain length. Fixed sleeps are fragile: too short and you parse early, too long and every run drags. Tying the wait to your target data is both faster on average and more reliable.
Why do headless browsers get blocked?
Default headless browsers leak automation signals: the navigator.webdriver flag, unusual or missing fingerprints, and datacenter IP addresses that detection systems flag instantly. Serious sites watch for exactly these. Mitigating it means adding stealth configuration, rotating residential proxies so requests come from real-user IPs, and handling CAPTCHAs, each of which is ongoing work. A managed API that renders behind trusted IPs handles this for you.
Can I take screenshots with a headless browser?
Yes. Both Puppeteer and Playwright capture the viewport or the full scrollable page in one call, which is useful for visual QA, archiving, and debugging empty scrapes. If screenshots are your main need at volume, running a browser fleet just for images is overkill; the Screenshots API renders server-side and returns the image directly with no browser to manage.
When should I use the Crawling API instead of running my own browser?
Use your own headless browser for fine-grained control at small scale or on friendly sites. Switch to the Crawling API when you scrape at volume, hit aggressive anti-bot defenses, or do not want to own browser and proxy infrastructure. With a JS token it renders the page server-side behind rotating residential IPs and returns finished HTML in one call, so you skip the fleet, the proxy pool, and the CAPTCHA stack.
