E-commerce sites are one of the richest sources of public structured data on the web. Every product listing carries a title, a price, a rating, an availability state, and an image, and that data drives price tracking, competitor research, inventory monitoring, and market trend analysis. The trouble is that modern storefronts render most of that detail with JavaScript and guard their pages against automated traffic, so a plain HTTP request tends to return a near-empty shell instead of the catalog you came for.

This guide is a step-by-step Node.js walkthrough for crawling product data from an e-commerce site the reliable way. You will build a small, runnable scraper that fetches a rendered listing page through the Crawling API, parses each product card with Cheerio, walks the pagination, and saves clean structured output as JSON or CSV. We keep the whole walkthrough scoped to public product data, and the legality section near the end is genuine, not boilerplate, so read it before you point this at any real volume.

What you will build

A Node.js script that takes a public e-commerce search or category URL, retrieves the rendered HTML through the Crawling API, and extracts a structured record for each product on the results page. We use a search query as the running example and pull these fields per item:

  • Title: the product name as shown on the card, for example "Men's Analog Wrist Watch".
  • Price: the listed price as displayed, like "Rs. 1,299".
  • Rating: the review count or star rating shown next to the product.
  • Availability: the stock or location signal a card surfaces, when present.
  • Image: the thumbnail image URL for the product.
  • Product URL: the link to the individual product page.

Why a plain request fails on e-commerce sites

If you request a storefront search URL with a bare HTTP client, you usually get a response with status 200 and only part of the listing data in the body. Two things work against you. First, most e-commerce sites render prices, ratings, and the bulk of each product card in the browser with JavaScript and AJAX, so the initial HTML is incomplete until the page's scripts run. Second, retail platforms flag automated traffic quickly: datacenter IPs and request patterns that do not look like a real browser get challenged, rate-limited, or blocked before they ever reach the rendered content.

So a working e-commerce crawler needs two things in one request: a browser that actually renders the page, and an IP the platform reads as a real visitor. You can assemble that yourself with a headless browser plus a pool of rotating residential proxies, but stitching those together and keeping them healthy is most of the work. The Crawling API folds both into a single call: you send it the URL with a JavaScript token, it renders the page behind a trusted IP, and it returns finished HTML for you to parse. For a deeper look at why client-rendered targets behave this way, see how to crawl JavaScript websites.

Why the JS token

Crawlbase offers two token types. The normal token (TCP) fetches static HTML; the JavaScript (JS) token renders the page in a real browser first. E-commerce search pages load key fields client-side, so the JS token gives you the most complete page here. With it you can also pass wait parameters like ajax_wait and page_wait to handle AJAX loading. Using the normal token can return a partial result with prices or ratings missing, leaving you nothing reliable to parse.

Prerequisites

You need a few things in place before writing any code. None of them take long.

Basic JavaScript and Node.js. You should be comfortable writing and running a Node script and installing packages with npm. If you are new to Node, the official docs and any beginner course will get you to the level this tutorial assumes. For a fuller walkthrough, see our guide on how to build a web scraper with Node.js.

Node.js 16 or later. Confirm your version with node --version. If you do not have it, install the latest LTS release from the Node.js website or through a version manager like nvm. npm ships bundled with Node.js, so installing one gives you both.

A Crawlbase account and JS token. Sign up, open your dashboard, and copy your JavaScript (JS) token from the account docs page. Crawlbase issues two tokens: the normal token for static pages and the JS token for dynamic, JavaScript-rendered pages. Treat the token like a password: it authenticates your requests, so keep it out of version control.
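
One way to keep the token out of version control is to read it from an environment variable instead of pasting it into the script. The sketch below assumes a variable named CRAWLBASE_JS_TOKEN; that name is a hypothetical choice for this example, not something the Crawlbase client requires.

```javascript
// Read the Crawlbase JS token from the environment instead of hardcoding it.
// CRAWLBASE_JS_TOKEN is a hypothetical variable name chosen for this sketch.
function loadToken(env = process.env) {
  const token = (env.CRAWLBASE_JS_TOKEN || '').trim();
  if (!token) {
    throw new Error('Set CRAWLBASE_JS_TOKEN before running the scraper.');
  }
  return token;
}

// Usage in the scraper: new CrawlingAPI({ token: loadToken() });
```

Set the variable in your shell before running the script, and the token never has to appear in a committed file.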

Set up the project

Create a project folder, initialize it as an npm package, and install the libraries the scraper needs. The -y flag on npm init accepts all defaults and writes a package.json for you.

bash
node --version

mkdir ecommerce-crawling && cd ecommerce-crawling
npm init -y

npm install crawlbase cheerio csv-writer

Three dependencies do the work: crawlbase is the official Node client for the Crawling API, cheerio parses the returned HTML with a jQuery-style API so you can pull out individual fields by CSS selector, and csv-writer turns the structured records into a CSV file at the end. If selectors are new to you, the primer on XPath and CSS selectors is a good companion.

Step 1: Fetch the rendered search page

Start by getting the finished page. Import the CrawlingAPI class, initialize it with your JS token, and request the search URL. Checking the status code before you parse keeps failures loud instead of silent. The page_wait option holds for a fixed number of milliseconds after load so late-rendering product cards appear before the page is captured.

javascript
const { CrawlingAPI } = require('crawlbase');

// Replace with your actual Crawlbase JS token
const api = new CrawlingAPI({ token: 'YOUR_CRAWLBASE_TOKEN' });

async function crawl(pageUrl) {
  const options = { ajax_wait: 'true', page_wait: 5000 };
  const response = await api.get(pageUrl, options);
  if (response.statusCode === 200) {
    return response.body;
  }
  console.error(`Request failed: ${response.statusCode}`);
  return null;
}

const searchUrl = 'https://example-shop.com/catalog/?q=watches+for+men';
crawl(searchUrl).then((html) => {
  console.log(html ? html.slice(0, 500) : 'No HTML returned');
});

The two wait options matter for a client-rendered storefront. ajax_wait tells the API to wait for asynchronous content to finish loading, and page_wait holds for 5000 milliseconds (5 seconds) after load so late-rendering elements appear before capture. Five seconds is a reasonable start; raise it if prices or ratings come back empty. The Crawling API also rotates IPs behind the scenes, so the request reads as a real visitor. Save the code as scraper.js, run it with node scraper.js, and you should see real product markup, not a stripped-down shell. That confirms rendering works before you write a single selector. If you want a refresher on the request layer itself, see how to make HTTP requests in Node.js.
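
The advice to raise page_wait when fields come back empty can be automated. The following is a minimal sketch, not part of the Crawlbase client: it assumes a hypothetical fetchFn(url, options) that resolves to an HTML string or null (for example, a variant of crawl() that accepts options), and a caller-supplied isComplete(html) check. The wait values are illustrative.

```javascript
// Retry a fetch with progressively longer page_wait values until the
// caller-supplied check says the HTML is complete.
// fetchFn(url, options) is assumed to resolve to an HTML string or null.
async function fetchWithLongerWaits(fetchFn, url, isComplete, waits = [5000, 8000, 12000]) {
  for (const pageWait of waits) {
    const html = await fetchFn(url, { ajax_wait: 'true', page_wait: pageWait });
    if (html && isComplete(html)) {
      return { html, pageWait }; // report which wait value finally worked
    }
  }
  return { html: null, pageWait: null }; // every wait value came back incomplete
}
```

Once you have a parser, a natural isComplete check is whether at least one parsed product has a non-null price. Each retry is a separate API request, so keep the list of wait values short.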

Crawlbase Crawling API

The page you just fetched needed both a real browser render and a trusted IP, in one call. The Crawling API takes a JS token, runs the storefront in a real browser, rotates through residential IPs server-side, and hands you finished HTML, so you skip running a headless browser fleet and a proxy pool yourself. Point it at a public search page on the free tier first.

Step 2: Identify the selectors and parse each product

Before extracting anything, inspect a live product card. Right-click a title, price, or rating in your browser and choose "Inspect" to open the dev tools, then read off the class names and tags that wrap each field. E-commerce sites lay each result out in a repeating block, so you select every card, then read title, price, rating, image, and the product link from inside it. Reading each field defensively keeps one missing value from crashing the run.

javascript
const cheerio = require('cheerio');

function parseSearch(html) {
  const $ = cheerio.load(html);
  const results = [];

  $('div[data-qa-locator="general-products"] div[data-qa-locator="product-item"]').each((index, element) => {
    const card = $(element);
    const product = {};

    product.productPageUrl = card.find('.mainPic--ehOdr a').attr('href') || null;
    product.thumbnailImage = card.find('.mainPic--ehOdr img').attr('src') || null;
    product.title = card.find('.info--ifj7U .title--wFj93 a').text().trim() || null;
    product.price = card.find('.info--ifj7U .price--NVB62 span').text().trim() || null;
    product.noOfReviews = card.find('.info--ifj7U .rateAndLoc--XWchq .rating__review--ygkUy').text().trim() || null;
    product.location = card.find('.info--ifj7U .rateAndLoc--XWchq .location--eh0Ro').text().trim() || null;

    if (product.title) results.push(product);
  });

  return results;
}

A few details keep this resilient. Each field falls back to null when the element is missing, which is common since not every card shows a rating or a location line. The product URL and image are read from the anchor's href and the image's src rather than their text, so they use attr instead of text. The outer selector targets the product grid by its data-qa-locator attributes, which tend to be more stable than the hashed class names, and the final if (product.title) check skips any empty or placeholder block.

Selectors drift

The hashed class names above (.mainPic--ehOdr, .title--wFj93, .price--NVB62, and the rest) are generated by the storefront's build and change without notice. Treat the selectors here as a starting template, not a contract. When a field comes back as null, re-inspect the live page in your browser's dev tools and update the selector. Periodic selector maintenance is normal for any production scraper, not a sign something is broken.
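
A cheap way to notice drift early is to measure how often each field comes back populated. The sketch below is one possible health check over the parsed records; the function names and the 0.5 threshold are assumptions made for this example and should be tuned to how sparse each field normally is on your target.

```javascript
// Compute, for each field, the fraction of records where it is populated.
function fieldFillRates(records, fields) {
  const rates = {};
  for (const field of fields) {
    const filled = records.filter((r) => r[field] !== null && r[field] !== undefined).length;
    rates[field] = records.length ? filled / records.length : 0;
  }
  return rates;
}

// Return the fields whose fill rate falls below the threshold.
// An empty result set flags every field, which usually means the
// outer card selector or the title selector stopped matching.
function suspectFields(records, fields, threshold = 0.5) {
  const rates = fieldFillRates(records, fields);
  return fields.filter((field) => rates[field] < threshold);
}
```

Running suspectFields after each crawl and logging the result turns a silent column of nulls into a visible warning that a selector needs attention.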

Step 3: Put the crawler together

Now wire the fetch and the parse into one runnable script. Fetch the rendered HTML, hand it to the parser, and print the structured records. This is the smallest end-to-end version that crawls a single search page.

javascript
const { CrawlingAPI } = require('crawlbase');
const cheerio = require('cheerio');

const api = new CrawlingAPI({ token: 'YOUR_CRAWLBASE_TOKEN' });

async function crawl(pageUrl) {
  const options = { ajax_wait: 'true', page_wait: 5000 };
  const response = await api.get(pageUrl, options);
  if (response.statusCode === 200) return response.body;
  console.error(`Request failed: ${response.statusCode}`);
  return null;
}

function parseSearch(html) {
  const $ = cheerio.load(html);
  const results = [];
  $('div[data-qa-locator="general-products"] div[data-qa-locator="product-item"]').each((index, element) => {
    const card = $(element);
    const product = {
      productPageUrl: card.find('.mainPic--ehOdr a').attr('href') || null,
      thumbnailImage: card.find('.mainPic--ehOdr img').attr('src') || null,
      title: card.find('.info--ifj7U .title--wFj93 a').text().trim() || null,
      price: card.find('.info--ifj7U .price--NVB62 span').text().trim() || null,
      noOfReviews: card.find('.info--ifj7U .rateAndLoc--XWchq .rating__review--ygkUy').text().trim() || null,
      location: card.find('.info--ifj7U .rateAndLoc--XWchq .location--eh0Ro').text().trim() || null,
    };
    if (product.title) results.push(product);
  });
  return results;
}

async function main() {
  const searchUrl = 'https://example-shop.com/catalog/?q=watches+for+men';
  const html = await crawl(searchUrl);
  if (!html) return;
  const results = parseSearch(html);
  console.log(JSON.stringify(results.slice(0, 3), null, 2));
}

main();

The main function sets the search URL, sends it through the Crawling API with a 5-second page wait so JavaScript rendering completes, and parses the returned HTML with Cheerio. The extracted records, including product URLs, images, titles, prices, review counts, and locations, are collected into an array, and the first three are logged for inspection. Run it with node scraper.js to confirm the fields come back populated.

What the output looks like

Run the full script and you get a clean array of records, one per product, ready to write to JSON, CSV, or a database.

json
[
  {
    "productPageUrl": "https://example-shop.com/products/mens-analog-watch-1.html",
    "thumbnailImage": "https://img.example-shop.com/p/mens-analog-watch-1.jpg",
    "title": "Men's Analog Wrist Watch Stainless Steel",
    "price": "Rs. 1,299",
    "noOfReviews": "(128)",
    "location": "Karachi"
  },
  {
    "productPageUrl": "https://example-shop.com/products/sport-digital-watch-2.html",
    "thumbnailImage": "https://img.example-shop.com/p/sport-digital-watch-2.jpg",
    "title": "Sport Digital Watch Waterproof",
    "price": "Rs. 899",
    "noOfReviews": "(54)",
    "location": "Lahore"
  }
]
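
The price and review count arrive as display strings, which is awkward for sorting or price tracking. The sketch below shows one way to derive numeric values from them. The helper names and the added priceValue and reviewCount fields are illustrative, and the parsing assumes a comma thousands separator and a dot decimal separator, which matches the sample above but not every locale.

```javascript
// Convert a displayed price such as "Rs. 1,299" to a number.
// Assumes commas are thousands separators and a dot is the decimal point.
function parsePrice(text) {
  if (!text) return null;
  const match = text.replace(/,/g, '').match(/\d+(?:\.\d+)?/);
  return match ? Number(match[0]) : null;
}

// Convert a review label such as "(128)" to an integer.
function parseReviewCount(text) {
  if (!text) return null;
  const match = text.replace(/,/g, '').match(/\d+/);
  return match ? parseInt(match[0], 10) : null;
}

// Keep the original strings and add numeric fields alongside them.
function normalizeProduct(product) {
  return {
    ...product,
    priceValue: parsePrice(product.price),
    reviewCount: parseReviewCount(product.noOfReviews),
  };
}
```

Keeping the raw strings next to the parsed numbers makes it easy to spot records where the parsing assumption did not hold.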

Step 4: Handle pagination across listing pages

One page of results is a demo; a real job walks the pagination. E-commerce search results are spread across many pages, so to collect a full catalog you first read the total page count, then iterate from page 1 to page N, fetching and parsing each one. The total is usually exposed by the pagination control near the bottom of the first page; you can read it off with a selector and parse it to a number. Each page URL just appends a &page= parameter to the search URL.

javascript
async function getTotalPages(query) {
  const searchUrl = `https://example-shop.com/catalog/?q=${encodeURIComponent(query)}`;
  const html = await crawl(searchUrl);
  if (!html) return 0;
  const $ = cheerio.load(html);
  const totalPages = parseInt($('ul.ant-pagination li:nth-last-child(2)').attr('title'), 10);
  return Number.isNaN(totalPages) ? 1 : totalPages;
}

async function crawlPage(query, page) {
  const searchUrl = `https://example-shop.com/catalog/?q=${encodeURIComponent(query)}&page=${page}`;
  const html = await crawl(searchUrl);
  return html ? parseSearch(html) : [];
}

async function crawlAll(query) {
  const totalPages = await getTotalPages(query);
  const results = [];
  for (let page = 1; page <= totalPages; page++) {
    const pageResults = await crawlPage(query, page);
    results.push(...pageResults);
  }
  return results;
}

The flow splits into three small functions. getTotalPages fetches the first search page and reads the page count from the pagination control, falling back to 1 if the value is missing. crawlPage fetches and parses a single page by appending the page parameter to the URL. crawlAll ties them together: it determines the total, loops from page 1 to N, and aggregates every page's results into one array. Because every results page shares the same card structure, the parseSearch function you already wrote works across all of them without changes. For very large catalogs, cap the loop at a sensible number of pages rather than crawling thousands in one run.
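
Capping the loop and pausing between pages takes only a few lines. The sketch below is one way to do it; crawlCapped, maxPages, and delayMs are names and defaults invented for this example, and crawlPageFn stands in for a wrapper such as (page) => crawlPage(query, page).

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Crawl at most maxPages pages, pausing delayMs between requests.
// crawlPageFn(page) is assumed to resolve to an array of product records.
async function crawlCapped(crawlPageFn, totalPages, { maxPages = 20, delayMs = 1000 } = {}) {
  const lastPage = Math.min(totalPages, maxPages);
  const results = [];
  for (let page = 1; page <= lastPage; page++) {
    const pageResults = await crawlPageFn(page);
    results.push(...pageResults);
    if (page < lastPage) await sleep(delayMs); // pause between pages, not after the last
  }
  return results;
}
```

For example, crawlCapped((page) => crawlPage('watches for men', page), totalPages, { maxPages: 10 }) would stop after ten pages even if the site reports hundreds.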

Step 5: Save the data as CSV

With the records collected, write them to a CSV file so you can open the data in a spreadsheet or load it into another tool. The csv-writer library lets you define headers that map to your field names and write all records in one call. JSON and CSV each have their place; the ecommerce web scraping guide covers when to reach for which.

javascript
const createCsvWriter = require('csv-writer').createObjectCsvWriter;

const csvWriter = createCsvWriter({
  path: 'ecommerce_products.csv',
  header: [
    { id: 'productPageUrl', title: 'Product Page URL' },
    { id: 'thumbnailImage', title: 'Thumbnail Image URL' },
    { id: 'title', title: 'Title' },
    { id: 'price', title: 'Price' },
    { id: 'noOfReviews', title: 'Number of Reviews' },
    { id: 'location', title: 'Location' },
  ],
});

async function saveToCsv(data) {
  await csvWriter.writeRecords(data);
}

(async () => {
  const products = await crawlAll('watches for men');
  await saveToCsv(products);
  console.log(`Saved ${products.length} products to ecommerce_products.csv`);
})();

The header array maps each field id to a human-readable column title, so the resulting CSV opens cleanly in Excel or Google Sheets with the columns labeled. To persist the same records in a database instead, the structured objects map directly onto table rows: keep the same field names as columns and insert one row per product. The shape of the data does not change, only the destination.
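
If you want JSON instead of CSV, Node's built-in fs/promises module is enough and no extra dependency is needed. This is a minimal sketch; saveToJson and the default file name ecommerce_products.json are choices made for this example.

```javascript
const fs = require('fs/promises');

// Write the records to a pretty-printed JSON file and return the record count.
async function saveToJson(data, filePath = 'ecommerce_products.json') {
  await fs.writeFile(filePath, JSON.stringify(data, null, 2), 'utf8');
  return data.length;
}
```

Call it in the same place as saveToCsv, for example await saveToJson(products), to keep a lossless copy of the records alongside the spreadsheet-friendly CSV.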

Scaling and staying unblocked

Even with rendering handled, retail platforms watch for scraper-shaped traffic. A few habits keep a run healthy, and they apply to any hard commercial target.

  • Pace your requests. Hammering pages in a tight loop is the fastest way to get throttled. Spread requests out and vary your queries instead of crawling one path at full speed.
  • Lean on rotation. A pool of residential IPs spreads requests across many real-user addresses so no single one trips a rate limit. The Crawling API handles this for you; if you roll your own stack, this is the part to get right.
  • Read the status codes. A run that starts returning challenges or errors is telling you the current rate or IP tier is no longer enough. Treat that as signal to back off, not noise to ignore.
  • Cache and dedupe. Store the rendered HTML or the parsed rows so a re-run does not re-fetch pages you already have, which keeps both your costs and your request volume down.
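
Two of these habits are easy to sketch in code: backing off when a request fails, and deduplicating records across runs. The helpers below are illustrative only. withBackoff assumes a fetchFn that resolves to HTML on success and null on failure, like the crawl() function in this guide, and the retry count and base delay are arbitrary starting values.

```javascript
const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry fetchFn() with exponentially growing pauses while it returns null.
// With the defaults, the pauses are 2s, 4s, and 8s before giving up.
async function withBackoff(fetchFn, { retries = 3, baseDelayMs = 2000 } = {}) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    const result = await fetchFn();
    if (result !== null) return result;
    if (attempt < retries) await wait(baseDelayMs * 2 ** attempt);
  }
  return null;
}

// Keep the first record seen for each product URL (falling back to the title).
function dedupeByUrl(records) {
  const seen = new Map();
  for (const record of records) {
    const key = record.productPageUrl || record.title;
    if (!seen.has(key)) seen.set(key, record);
  }
  return [...seen.values()];
}
```

A call such as withBackoff(() => crawl(searchUrl)) slows down automatically when the target starts refusing requests, and dedupeByUrl keeps repeated listings from inflating the final dataset.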

For the broader playbook, see how to scrape websites without getting blocked. If you would rather route your own traffic through a rotating pool instead of using the managed API, the Smart AI Proxy gives you the same residential IP rotation as a drop-in proxy endpoint. The fetch-then-parse pattern in this guide carries across most storefronts: only the selectors and the pagination parameter change from one site to the next.

Legality and responsible crawling

Whether crawling an e-commerce site is allowed depends on that site's terms of service, your jurisdiction, and what you do with the data. Many storefronts restrict automated access in their terms, so crawling can run against those terms regardless of how careful your tooling is. None of the code here changes that; it just makes the technical part work. Read the site's terms of service and its robots.txt, and treat both as the boundary for what you collect and how often you request it.
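
Checking robots.txt can be partly automated. The sketch below is a deliberately simplified reader, not a complete implementation of the Robots Exclusion Protocol: it looks only at the rules under "User-agent: *", treats rule values as plain path prefixes (no * or $ wildcards), and lets the longest matching rule win. The function name isPathAllowed is hypothetical, and a pass from this check says nothing about the site's terms of service.

```javascript
// Simplified robots.txt check for the "User-agent: *" group only.
// Supports plain path prefixes; the longest match wins, Allow wins ties.
function isPathAllowed(robotsTxt, path) {
  const rules = [];
  let inStarGroup = false;
  let lastWasAgent = false;
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.split('#')[0].trim(); // drop comments
    const sep = line.indexOf(':');
    if (sep === -1) continue;
    const key = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (key === 'user-agent') {
      // Consecutive User-agent lines share one group of rules.
      inStarGroup = lastWasAgent ? inStarGroup || value === '*' : value === '*';
      lastWasAgent = true;
    } else {
      lastWasAgent = false;
      if (inStarGroup && (key === 'allow' || key === 'disallow') && value) {
        rules.push({ allow: key === 'allow', prefix: value });
      }
    }
  }
  let best = null;
  for (const rule of rules) {
    if (!path.startsWith(rule.prefix)) continue;
    const longer = !best || rule.prefix.length > best.prefix.length;
    const tieAllow = best && rule.prefix.length === best.prefix.length && rule.allow;
    if (longer || tieAllow) best = rule;
  }
  return best ? best.allow : true; // no matching rule means the path is allowed
}
```

Fetch the site's robots.txt once per run, pass its text and the path you intend to crawl to this function, and skip the path when it returns false. For production use, a maintained robots.txt parser that handles wildcards and agent-specific groups is the safer choice.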

A few lines worth holding to. Collect only public product data: the title, price, rating, availability, image, and the product link that anyone can see without an account. Keep your request volume low enough that you are not straining the site's servers, and respect any rate expectations it states. Stay away from anything behind a login, including account pages, carts, and order history, and avoid personal data tied to identifiable buyers or sellers. Do not redistribute copyrighted media such as product photography or descriptions in ways the site has not licensed; reusing those commercially is a separate question from reading a public price.

For volume or commercial use, prefer an official channel where one exists. Many large retailers and marketplaces publish official product or affiliate APIs that give you guaranteed structure and clear usage rights, and those are the right tools when you need large volumes or commercial reuse. This guide is deliberately scoped to public listing and search pages because that is the line that keeps the work defensible. It does not cover anything behind authentication, personal data, private account or order information, or any attempt to bypass a sign-in. If your project needs more than public listings, an official API or a data agreement is the correct path, not a cleverer scraper.

Recap

Key takeaways

  • Most e-commerce sites render listings client-side. A plain request returns an incomplete page, so you must render it before you parse it.
  • You need rendering and a trusted IP together. The Crawling API with a JS token does both in one call; ajax_wait and page_wait control how long it waits for content.
  • Cheerio does the extraction. Select every product card, then map title, price, rating, image, and the product URL to current selectors, and expect those hashed selectors to drift.
  • Scale by walking pagination. Read the total page count, loop from page 1 to N appending the page parameter, and reuse the same parser across every page with sensible pacing.
  • Stay on public data. Respect each site's ToS and robots.txt, prefer an official product API for volume or commercial use, and never touch logins, personal data, or copyrighted media you do not have rights to.

Frequently Asked Questions (FAQs)

What is the difference between web crawling and web scraping?

Web crawling is the process of systematically navigating a site and collecting data across many pages, following links and pagination as it goes. Web scraping is the extraction of specific fields, like price or title, from a given page. In practice the two work together: the crawler walks the listing pages, and the scraping step pulls the structured fields out of each one. The script in this guide does both.

Why does a plain request return incomplete data from e-commerce sites?

Because most storefronts render prices, ratings, and the bulk of each product card client-side with JavaScript and AJAX. The initial HTML is partial until the page's scripts run in a browser, so a raw HTTP request returns status 200 with key fields missing or blank. To get a complete page you have to render it first, which is what the Crawling API's JS token handles for you.

Do I need the normal token or the JS token?

Use the JS token for e-commerce search and listing pages. The normal token fetches static HTML, which on a client-rendered storefront can come back with prices or ratings missing. The JS token renders the page in a real browser before handing back the HTML, and it lets you pass wait parameters like ajax_wait and page_wait so dynamically loaded cards are present when Cheerio parses them.

My selectors return null. What changed?

Almost certainly the site's markup. The hashed class names that storefront build tools generate change without notice, and they differ between search and individual product pages, so selectors that worked last month can break. Re-inspect a live page in your browser's dev tools and update the selectors, leaning on stable attributes like data-qa-locator where they exist. Periodic selector maintenance is normal for any production scraper.

How do I store the scraped product data?

For a quick result, write the records to a CSV file with csv-writer so the data opens in any spreadsheet. For repeated or larger runs, insert the same structured objects into a database, keeping your field names as columns and one row per product. Either way the record shape stays the same; only the destination changes, so you can start with CSV and move to a database later without rewriting the parser.

How do I avoid getting blocked while crawling e-commerce sites?

Keep your per-IP request rate low, vary your queries instead of looping one path, and route through rotating residential IPs so no single address trips a rate limit. The Crawling API manages rotation and a trusted IP pool for you; if you build your own stack, that is the part to invest in. Watch the status codes and back off when you start seeing challenges rather than pushing through them.
