If you live in JavaScript, you already have most of what you need to pull structured data off the web. Node.js ships a fast runtime, a huge package ecosystem, and an async I/O model that handles many requests without breaking a sweat. Put two small libraries on top of it and you have a working scraper in a few dozen lines.
This guide shows you how to build a web scraper with Node.js the practical way. We start with the standard stack, axios for HTTP and cheerio for jQuery-style HTML parsing, and build a scraper that fetches a page, selects the fields you want, loops over a list, and writes the results to JSON and CSV. Then we get honest about where plain HTTP runs out of road (JavaScript-rendered pages and blocks at scale) and what to do about it.
What you will build
A small Node.js script that takes a URL, downloads the HTML, parses it with cheerio, and extracts a clean record per item. We will use a generic product-listing layout as the running example, since that pattern (a repeating card with a title, a price, and a link) covers most real scraping jobs. By the end you will have:
- A single-page fetcher built on axios and cheerio.
- An extractor that maps CSS selectors to fields.
- A loop that walks paginated list pages and collects every row.
- Output written to both JSON and CSV.
- A drop-in upgrade path for pages that block you or render client-side.
Why Node.js for scraping
Node.js runs JavaScript outside the browser on Chrome's V8 engine, which JIT-compiles JavaScript to machine code at runtime. Its non-blocking, event-driven model is a natural fit for scraping, where you spend most of your time waiting on network responses: you can have many requests in flight on a single thread without spinning up a thread per connection. Add the npm ecosystem, where almost every parsing, queuing, or storage need already has a battle-tested package, and you have a runtime built for this kind of work. Companies like Netflix and PayPal run Node.js in production for the same reasons.
The two libraries that do the heavy lifting for static scraping are axios (a promise-based HTTP client) and cheerio (a lightweight parser that gives you jQuery selectors over server-side HTML, with no browser attached). If you want a refresher on the request side specifically, see how to make HTTP requests in Node.js with the Fetch API.
Prerequisites
Nothing exotic. You need three things before writing code.
Basic JavaScript and Node.js. You should be comfortable writing a script, running it from the terminal, and installing packages with npm. Async/await will make the code read cleanly, so a working knowledge of promises helps.
Node.js 18 or later. Check your version with node --version. If you do not have it, install the current LTS from nodejs.org.
A code editor. Anything works; VS Code is the common pick.
Set up the project
Create a folder, initialize a project, and install the two dependencies.
    mkdir node-scraper && cd node-scraper
    npm init -y
    npm install axios cheerio
To use modern import syntax, add "type": "module" to your package.json. If you would rather stick with require, the code below works the same way with CommonJS; just swap each import line for its require equivalent, such as const axios = require("axios").
Step 1: Fetch a page
Start by downloading the raw HTML. axios returns the response body on response.data. Setting a realistic User-Agent header makes the request look like a browser rather than a default Node client, which many sites treat with suspicion.
    import axios from "axios";

    const fetchPage = async (url) => {
      const { data } = await axios.get(url, {
        headers: {
          "User-Agent":
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        },
        timeout: 15000,
      });
      return data;
    };
The timeout keeps a slow or dead host from hanging your run forever. axios rejects the promise on any non-2xx status, so wrapping calls in try/catch (we do that later) keeps failures visible instead of silent.
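Because a timeout or a non-2xx status surfaces as a rejected promise, a transient failure can often be recovered by simply trying again. The following is a minimal sketch of a retry wrapper with exponential backoff. The `withRetry` and `wait` names, the option names, and the delay values are illustrative assumptions, not part of axios.

```javascript
// Hypothetical helper: retries any promise-returning function with exponential backoff.
const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const withRetry = async (fn, { retries = 3, baseDelayMs = 500 } = {}) => {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === retries) break;
      // With the defaults this waits 500 ms, then 1000 ms, then 2000 ms.
      await wait(baseDelayMs * 2 ** attempt);
    }
  }
  // Every attempt failed: surface the most recent error to the caller.
  throw lastError;
};

// Usage, assuming the fetchPage function from this step:
// const html = await withRetry(() => fetchPage("https://example.com/products"));
```

This sketch retries every error the same way. In practice you would probably skip retries for responses such as 404 that will not change on a second attempt.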
Step 2: Parse and extract with cheerio
cheerio loads an HTML string and hands you a $ function that behaves like jQuery. You select elements by CSS selector and read their text or attributes. The pattern is: find the repeating container, then for each one pull the fields you care about.
    import * as cheerio from "cheerio";

    const parseProducts = (html) => {
      const $ = cheerio.load(html);
      const products = [];
      $(".product-card").each((_, el) => {
        const card = $(el);
        products.push({
          title: card.find("h2.title").text().trim(),
          price: card.find(".price").text().trim(),
          url: card.find("a").attr("href"),
        });
      });
      return products;
    };
Two details matter here. .text() returns the combined text of an element, so .trim() removes the whitespace that markup leaves around it. Reading a link uses .attr("href") rather than .text(), because the value you want lives in the attribute, not the visible text. Adjust the selectors (.product-card, h2.title, .price) to match the page you are actually targeting; inspect it in your browser's dev tools to find the right ones.
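The extracted values are still raw strings: the price carries its currency symbol, and an href is often relative to the page it came from. A small post-processing step can tidy both. The sketch below is one way to do it. `normalizeProduct` is a hypothetical helper, and the price handling assumes a dot as the decimal separator and commas as thousands separators, which will not hold for every locale.

```javascript
// Hypothetical helper: cleans one raw record produced by parseProducts.
const normalizeProduct = (raw, pageUrl) => {
  // Keep digits and the decimal point only, so "$1,299.00" becomes 1299.
  const numeric = Number.parseFloat(String(raw.price ?? "").replace(/[^0-9.]/g, ""));
  return {
    title: raw.title,
    price: Number.isNaN(numeric) ? null : numeric,
    // new URL(relative, base) resolves "/p/42" against the page it was found on.
    url: raw.url ? new URL(raw.url, pageUrl).href : null,
  };
};

console.log(
  normalizeProduct(
    { title: "Desk Lamp", price: "$1,299.00", url: "/p/42" },
    "https://example.com/products?page=2"
  )
);
// → { title: 'Desk Lamp', price: 1299, url: 'https://example.com/p/42' }
```

Returning null for a missing price or link, rather than an empty string or NaN, makes gaps easy to spot in the JSON output.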
Class names change when a site ships a redesign, and a selector that worked last month can quietly return empty strings. Treat selectors as something you maintain, not set once. When a field comes back blank, re-inspect the live page and update it. Periodic selector upkeep is normal for any production scraper.
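One way to notice selector drift early is to count blank fields after every run and compare the numbers with what you normally see. A minimal sketch, where `countBlankFields` is a hypothetical helper and the sample rows are made up for illustration:

```javascript
// Hypothetical helper: counts how many rows have an empty value for each field.
const countBlankFields = (rows) => {
  const blanks = {};
  for (const row of rows) {
    for (const [field, value] of Object.entries(row)) {
      const isBlank =
        value === undefined || value === null || String(value).trim() === "";
      if (isBlank) blanks[field] = (blanks[field] ?? 0) + 1;
    }
  }
  return blanks;
};

const sample = [
  { title: "Desk Lamp", price: "$19.99", url: "/p/1" },
  { title: "Floor Lamp", price: "", url: "/p/2" },
  { title: "Wall Lamp", price: "", url: undefined },
];
console.log(countBlankFields(sample));
// → { price: 2, url: 1 }
```

A field that is suddenly blank on most rows usually points to a changed selector rather than missing data.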
Step 3: Loop over list pages
One page is a demo. A real job walks pagination. Most paginated listings expose a page number in the URL (?page=2) or a "next" link. The simplest robust approach is to iterate a known page range, fetch each, parse it, and stop when a page returns no items.
    const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

    const scrapeAll = async (baseUrl, maxPages = 10) => {
      const all = [];
      for (let page = 1; page <= maxPages; page++) {
        try {
          const html = await fetchPage(`${baseUrl}?page=${page}`);
          const rows = parseProducts(html);
          if (rows.length === 0) break;
          all.push(...rows);
          console.log(`Page ${page}: ${rows.length} items`);
        } catch (err) {
          console.error(`Page ${page} failed: ${err.message}`);
        }
        await sleep(1000);
      }
      return all;
    };
The sleep between requests is not decoration. A one-second pause keeps you from hammering the server in a tight loop; it is both polite and the simplest way to avoid getting throttled. The try/catch means one bad page logs an error and the run continues instead of crashing on page 4 of 200.
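The loop builds each URL by appending ?page=N to baseUrl, which only works when baseUrl has no query string of its own. If your listing URL already carries parameters, the built-in URL class can set the page number without clobbering them. A small sketch, where `buildPageUrl` is a hypothetical helper and the parameter name "page" is an assumption about the target site:

```javascript
// Hypothetical helper: returns baseUrl with the page parameter set or replaced.
const buildPageUrl = (baseUrl, page, param = "page") => {
  const url = new URL(baseUrl);
  url.searchParams.set(param, String(page));
  return url.href;
};

console.log(buildPageUrl("https://example.com/products", 2));
// → https://example.com/products?page=2
console.log(buildPageUrl("https://example.com/products?sort=price", 3));
// → https://example.com/products?sort=price&page=3
```

Inside scrapeAll you would then call fetchPage(buildPageUrl(baseUrl, page)) in place of the template string.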
Step 4: Write to JSON and CSV
Collected data is only useful once it leaves memory. JSON is the no-dependency default; Node's built-in fs module writes it directly. A flat CSV is just as easy by hand and opens straight into a spreadsheet.
    import { writeFileSync } from "fs";

    const saveJson = (rows, file) =>
      writeFileSync(file, JSON.stringify(rows, null, 2));

    const saveCsv = (rows, file) => {
      const headers = Object.keys(rows[0]);
      const escape = (v) => `"${String(v ?? "").replace(/"/g, '""')}"`;
      const lines = [
        headers.join(","),
        ...rows.map((r) => headers.map((h) => escape(r[h])).join(",")),
      ];
      writeFileSync(file, lines.join("\n"));
    };
The escape helper wraps each value in quotes and doubles any internal quotes, which is the CSV rule that keeps a comma or quote inside a product title from shifting your columns. For anything more involved (nested data, large volumes) reach for a library like csv-stringify, but for a flat record set this is enough.
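To see the quoting rule in action, here is the same escape logic applied to one awkward row. The row itself is made up for illustration, and the helper is renamed `escapeCsv` so the snippet stands alone.

```javascript
// Same quoting rule as the escape helper in saveCsv, shown on one tricky row.
const escapeCsv = (v) => `"${String(v ?? "").replace(/"/g, '""')}"`;

const row = { title: 'Lamp, 12" shade', price: "$19.99", url: undefined };
const line = Object.keys(row)
  .map((key) => escapeCsv(row[key]))
  .join(",");

console.log(line);
// → "Lamp, 12"" shade","$19.99",""
```

The comma inside the title stays within its quoted field, the inner quote is doubled, and the missing url becomes an empty quoted field, so the row still has exactly three columns.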
Putting it together
Wire the four pieces into one runnable script.
    const main = async () => {
      const rows = await scrapeAll("https://example.com/products");
      if (rows.length === 0) {
        console.log("No data collected.");
        return;
      }
      saveJson(rows, "products.json");
      saveCsv(rows, "products.csv");
      console.log(`Saved ${rows.length} items.`);
    };

    main();
Run it with node scraper.js. You get a progress line per page and two files on disk. That is a complete static scraper in well under a hundred lines.
Where static HTTP runs out
The axios plus cheerio stack is fast and clean, and for server-rendered pages it is all you need. But two walls show up quickly on real targets.
JavaScript-rendered content. Many modern sites send a near-empty HTML shell and build the page in the browser with JavaScript. axios only fetches that initial shell; it does not run scripts, so cheerio finds nothing where the data should be. If your selectors return empty on a page that clearly shows content in a browser, this is almost always why.
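A rough way to confirm this from code is to measure how much visible text the downloaded HTML actually contains. The check below is a crude heuristic sketch, not a reliable detector: `looksLikeEmptyShell` is a hypothetical helper, the regex-based tag stripping is deliberately simple, and the 200-character threshold is an arbitrary assumption you would tune per site.

```javascript
// Hypothetical heuristic: true when the HTML carries very little visible text.
// The 200-character default threshold is an arbitrary assumption.
const looksLikeEmptyShell = (html, minTextLength = 200) => {
  const visibleText = html
    .replace(/<script[\s\S]*?<\/script>/gi, " ") // drop script elements
    .replace(/<style[\s\S]*?<\/style>/gi, " ") // drop style elements
    .replace(/<[^>]+>/g, " ") // drop remaining tags
    .replace(/\s+/g, " ") // collapse whitespace
    .trim();
  return visibleText.length < minTextLength;
};

const shell =
  '<html><body><div id="root"></div><script src="/app.js"></script></body></html>';
console.log(looksLikeEmptyShell(shell)); // → true
```

If this returns true for a page that shows plenty of content in your browser, the content is most likely being built client-side.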
Blocking at scale. A handful of requests from your IP is fine. A few hundred from the same datacenter address, in recognizable patterns, gets you rate-limited, CAPTCHA-walled, or blocked outright. A custom User-Agent buys you a little room; it does not solve the IP problem.
You have two ways forward. The first is a headless browser: Puppeteer or Playwright drives a real Chrome or Firefox, runs the page's JavaScript, and lets you scrape the rendered DOM. That solves rendering, but it is heavy: each instance is a full browser, it eats memory and CPU, and at scale you still have to manage a proxy pool yourself to stay unblocked. If that is the route you want, see our guide to Playwright web scraping.
The second is to offload both problems to an API.
Using the Crawling API for rendered, unblocked pages
The Crawling API folds rendering and IP rotation into a single request. You send it a URL, it fetches the page behind a trusted rotating residential IP (optionally rendering JavaScript first), and it returns finished HTML. You keep your existing cheerio parser unchanged; only the fetch step swaps out.
Install the official Node client.
npm install crawlbase
Then replace fetchPage with a version that goes through the API. Everything downstream (parse, loop, save) stays exactly as you wrote it.
    import { CrawlingAPI } from "crawlbase";
    import * as cheerio from "cheerio";

    const api = new CrawlingAPI({ token: "YOUR_CRAWLBASE_TOKEN" });

    const fetchPage = async (url) => {
      const response = await api.get(url, { ajax_wait: true, page_wait: 3000 });
      if (response.statusCode === 200 && response.pcStatus === 200) {
        return response.body;
      }
      throw new Error(`Crawl failed: ${response.statusCode} / ${response.pcStatus}`);
    };
Two things to note. The client returns both statusCode (the response from the target site) and pcStatus (whether the crawl itself succeeded); checking both keeps a soft failure from passing as good HTML. The ajax_wait and page_wait options handle JavaScript-rendered targets: ajax_wait tells the API to wait for async content, and page_wait holds a few seconds after load so late elements appear before capture. Drop both options for plain static pages and you get the same rotation benefit without the rendering overhead.
Crawlbase tokens come in two flavors. The normal token fetches static HTML and is the right choice for server-rendered pages. The JavaScript (JS) token renders the page in a real browser first, which is what you need for client-side-rendered targets. If a page comes back as an empty shell with the normal token, switch to the JS token.
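The "switch when the page comes back empty" rule can also be applied automatically: try the cheap fetch first and fall back to the rendering one only when the parser finds nothing. The sketch below is deliberately generic and calls no particular API. `fetchWithFallback`, `fetchStatic`, and `fetchRendered` are hypothetical names; you would pass in your own fetchPage variants.

```javascript
// Hypothetical helper: tries each fetcher in order until the parser finds rows.
// fetchers is an array of async (url) => html functions, cheapest first;
// parse is a function like parseProducts that returns an array.
const fetchWithFallback = async (url, fetchers, parse) => {
  for (const fetcher of fetchers) {
    const rows = parse(await fetcher(url));
    if (rows.length > 0) return rows;
  }
  // No fetcher produced any rows.
  return [];
};

// Usage, assuming fetchStatic and fetchRendered are your two fetchPage variants:
// const rows = await fetchWithFallback(url, [fetchStatic, fetchRendered], parseProducts);
```

An error thrown by a fetcher propagates to the caller here, so wrap the call in try/catch if you want a failed static fetch to fall through to the next option as well.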
Skip running a headless browser fleet and managing your own proxy pool. The Crawling API renders JavaScript when you need it, rotates through residential IPs server-side, and returns finished HTML in one call, so your cheerio parser keeps working unchanged. Start on the free tier and point it at the pages that were blocking you.
Tips for a healthy scraper
A few habits keep a Node.js scraper running smoothly, whether you stay on axios or move to the API.
- Read the site's terms and robots.txt first. Know what you are allowed to collect and at what rate before you point a loop at it.
- Pace your requests. A delay between calls and reasonable concurrency keeps you from overwhelming a server and from looking like an attack.
- Send realistic headers. A browser-like User-Agent and standard accept headers reduce the chance of being flagged as a bot.
- Handle errors per item. Wrap each fetch so one failure logs and moves on rather than killing the whole run.
- Cache during development. Save fetched HTML to disk while you iterate on selectors so you are not re-hitting the site on every code change.
- Watch the status codes. A rising rate of challenges or 4xx responses is a signal to slow down or rotate IPs, not noise to ignore.
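"Reasonable concurrency" from the list above can be enforced with a small worker pool that caps the number of requests in flight. This is a minimal sketch using only built-in promises. `mapWithConcurrency` is a hypothetical helper, and the right limit depends on the site you are scraping.

```javascript
// Hypothetical helper: runs worker over items with at most `limit` calls in flight.
const mapWithConcurrency = async (items, limit, worker) => {
  const results = new Array(items.length);
  let next = 0;
  // Each runner claims the next unprocessed index until the list is exhausted.
  const runner = async () => {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  };
  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
};

// Usage, assuming urls is an array of page URLs and fetchPage is defined as earlier:
// const pages = await mapWithConcurrency(urls, 3, (url) => fetchPage(url));
```

Results come back in the same order as the input. A rejection from any worker rejects the whole call, so in line with the per-item error handling tip, have the worker catch its own errors and return a marker value instead.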
For the full anti-block playbook, including IP rotation and fingerprinting, see how to scrape websites without getting blocked. If you would rather route your own traffic through a rotating pool instead of using the managed API, the Smart AI Proxy (also called the AI Proxy) gives you residential IP rotation as a drop-in proxy endpoint.
Key takeaways
- axios plus cheerio is the static stack. Fetch HTML with axios, parse it with cheerio's jQuery-style selectors, and you have a working scraper in under a hundred lines.
- The pattern is fetch, select, loop, save. Find the repeating container, map selectors to fields, walk pagination with a delay, and write to JSON and CSV.
- Static HTTP has two limits. It cannot run JavaScript, so it misses client-rendered content, and a single IP gets blocked at scale.
- Puppeteer and Playwright solve rendering but are heavy. A real browser per instance costs memory and CPU, and you still manage proxies yourself.
- The Crawling API folds in both. One call returns rendered HTML behind a rotating residential IP, and your cheerio parser stays unchanged.
Frequently Asked Questions (FAQs)
Is Node.js good for web scraping?
Yes. Node.js runs JavaScript on the fast V8 engine, and its non-blocking I/O model lets you keep many network requests in flight on a single thread, which is exactly what scraping needs. The npm ecosystem also gives you mature libraries for every step, from HTTP requests to HTML parsing to headless browsers, so most jobs come together quickly.
What is the difference between axios and cheerio?
They do different halves of the job. axios is an HTTP client: it fetches the raw HTML of a page over the network. cheerio is a parser: it loads that HTML string and gives you jQuery-style CSS selectors to pull out the fields you want. You almost always use them together, axios to download and cheerio to extract.
Why does cheerio return empty results on some pages?
Usually because the page renders its content client-side with JavaScript. axios fetches only the initial HTML shell, and cheerio parses what it is given, so if the data is injected by scripts after load, there is nothing there to find. The fix is to render the page first, either with a headless browser like Puppeteer or Playwright, or with the Crawling API using its JavaScript rendering options.
How do I avoid getting blocked while scraping with Node.js?
Pace your requests with a delay, send realistic browser headers, keep concurrency reasonable, and rotate IP addresses so no single one trips a rate limit. A custom User-Agent helps but does not solve the IP problem on its own. Rotating residential proxies or a managed service like the Crawling API handle the rotation for you so you do not have to maintain a proxy pool.
Should I use Puppeteer or the Crawling API?
Use Puppeteer (or Playwright) when you need fine-grained control over a real browser, like clicking through multi-step flows or capturing screenshots, and you are willing to run and scale the browsers yourself. Use the Crawling API when you mainly need rendered, unblocked HTML at scale without managing a headless fleet and a proxy pool. Many teams prototype with a headless browser and move to the API once volume and block rates climb.
Can I write scraped data to a database instead of files?
Yes. The collected records are plain JavaScript objects, so once you have the array you can insert it anywhere: a JSON file, a CSV, or a database like PostgreSQL, MongoDB, or SQLite using their Node.js drivers. The saving step is independent of the scraping logic, so swap saveJson for an insert call without touching the fetch or parse code.

