Amazon is one of the richest sources of public commerce data on the web: product titles, prices, ratings, availability, and best-seller rankings that shift through the day. Pulling that data once is useful, but the real value shows up when you collect it on a schedule, so you can track how a price moves, when a listing goes out of stock, or how a ranking climbs over a week. A single manual run cannot tell you any of that.

This guide shows you how to automate Amazon scraping with JavaScript and Node.js. You will build a scraper that pulls public product and search data through the Crawling API, then wrap it in the parts that make it hands-off: a scheduled job that runs it on a cron, the asynchronous Crawler with a webhook callback for large runs, durable storage for the results, and retry handling so a single failed request does not sink the batch. We keep the whole walkthrough scoped to public product data, and the legality section near the end is not boilerplate, so read it before you point this at real volume. If you only need a one-off pull, our guides on how to scrape Amazon product data and how to scrape Amazon best sellers cover the single-run case; this post is about making those scrapes run unattended.

What you will build

A Node.js automation that scrapes public Amazon product and search pages on a schedule and stores a structured record per run. We will use a search results page as the running example and pull these fields per item:

  • ASIN: the Amazon Standard Identification Number that uniquely keys a product.
  • Title: the product name as shown on the card.
  • Price: the listed price as displayed, like "$29.99".
  • Rating: the average star rating when present.
  • Reviews: the review count shown on the card.
  • Product URL: the link to the individual product page.

Around that scraper you will add four automation pieces: a cron-driven scheduler, an asynchronous run with a webhook callback for large jobs, a JSON store keyed by run timestamp, and a retry wrapper that handles transient failures without aborting the whole batch.

Why a plain request fails on Amazon

If you request an Amazon search URL with a bare HTTP client, you rarely get the data you came for. Amazon renders much of the page in the browser and challenges automated traffic aggressively. A datacenter IP hitting product pages in a tight loop gets a CAPTCHA, a "Robot Check" interstitial, or an outright block long before you collect a useful sample. Even when a request succeeds, the markup you get back can be a stripped-down shell missing prices and ratings.

So a working Amazon scraper needs two things in one request: a page that actually renders, and an IP the platform reads as a real visitor. You can assemble that yourself with a headless browser plus a pool of rotating residential proxies, but stitching those together and keeping them healthy is most of the work, and it gets worse once you run on a schedule and the failures pile up unattended. The Crawling API folds both into a single call: you send it the URL, it fetches the page behind a trusted, rotating IP, and it returns finished HTML for you to parse.

Two tokens

Crawlbase gives you a normal token and a JavaScript (JS) token. The normal token fetches static HTML; the JS token renders the page in a real browser first, which costs more credits. Many Amazon search and product pages parse fine with the normal token, so start there and switch to the JS token only if a field you need comes back empty.
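
The "start cheap, escalate only when a field is empty" rule can be sketched as a small helper. This is a minimal illustration, not part of the Crawlbase client: fetchWithFallback, the fetcher objects, the scrapeWith function in the usage comment, and the CRAWLBASE_JS_TOKEN variable name are all hypothetical.

```javascript
// Hypothetical helper: try fetchers in order (cheapest first) and stop at the
// first one whose result has every required field on at least one item.
async function fetchWithFallback(url, fetchers, requiredFields) {
  let last = [];
  for (const { name, run } of fetchers) {
    last = await run(url);
    const complete = last.some((item) =>
      requiredFields.every(
        (field) => item[field] !== null && item[field] !== undefined && item[field] !== ''
      )
    );
    if (complete) return { used: name, items: last };
  }
  // No fetcher produced a complete item; return the last attempt for inspection.
  return { used: null, items: last };
}

// Usage sketch (scrapeWith is a hypothetical function that fetches with the
// given client and returns parsed items):
// const normal = new CrawlingAPI({ token: process.env.CRAWLBASE_TOKEN });
// const js = new CrawlingAPI({ token: process.env.CRAWLBASE_JS_TOKEN });
// fetchWithFallback(url, [
//   { name: 'normal', run: (u) => scrapeWith(normal, u) },
//   { name: 'js', run: (u) => scrapeWith(js, u) },
// ], ['title', 'price']);
```

Logging the used value over a few runs shows whether the cheaper token is enough for the pages you track.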

Prerequisites

You need a few things in place before writing any code. None of them take long.

Basic JavaScript and Node.js. You should be comfortable writing and running a Node script and installing packages with npm. If you are new to Node, our guide on how to build a web scraper with Node.js walks through the basics this tutorial assumes.

Node.js 16 or later. Confirm your version with node --version. If you do not have it, install it from the Node.js website or through a version manager like nvm.

A Crawlbase account and token. Sign up, open your dashboard, and copy your token. The free tier includes 1,000 requests with no card, which is plenty to build and schedule this. Treat the token like a password: it authenticates your requests, so keep it out of version control and read it from an environment variable.
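
Because the token comes from the environment, an unattended job is easier to debug if it refuses to start without one. A minimal sketch, assuming a hypothetical helper named requireEnv (it is not part of the Crawlbase client):

```javascript
// Hypothetical helper: read a required environment variable or fail fast.
// `env` defaults to process.env and is a parameter only to make testing easy.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (typeof value !== 'string' || value.trim() === '') {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value.trim();
}

// Usage: const token = requireEnv('CRAWLBASE_TOKEN');
```

Failing at startup gives a scheduled run one clear error instead of a string of rejected requests.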

Set up the project

Create a project folder, initialize it, and install the libraries the automation needs.

bash
node --version

mkdir amazon-automation && cd amazon-automation
npm init -y

npm install crawlbase cheerio node-cron

Three dependencies do the work: crawlbase is the official Node client for the Crawling API and the asynchronous Crawler, cheerio parses the returned HTML with a jQuery-style API so you can pull fields by CSS selector, and node-cron runs the scraper on a schedule from inside the same process. Export your token once so every script in the folder can read it:

bash
export CRAWLBASE_TOKEN='YOUR_CRAWLBASE_TOKEN'

Step 1: Fetch and parse the search page

Start with the scraper itself, since everything else automates this core. Import the CrawlingAPI class, initialize it with your token, request the search URL, and parse each result card with cheerio. Checking the status code before you parse keeps failures loud instead of silent.

javascript
const { CrawlingAPI } = require('crawlbase');
const cheerio = require('cheerio');

const api = new CrawlingAPI({ token: process.env.CRAWLBASE_TOKEN });

async function scrapeSearch(searchUrl) {
  const response = await api.get(searchUrl);
  if (response.statusCode !== 200) {
    throw new Error(`Request failed: ${response.statusCode}`);
  }
  return parseSearch(response.body);
}

function parseSearch(html) {
  const $ = cheerio.load(html);
  const items = [];

  $('div[data-asin]').each((_, el) => {
    const card = $(el);
    const asin = card.attr('data-asin');
    const title = card.find('h2 span').text().trim();
    if (!asin || !title) return;

    items.push({
      asin,
      title,
      price: card.find('.a-price .a-offscreen').first().text().trim() || null,
      rating: card.find('.a-icon-alt').first().text().trim() || null,
      reviews: card.find('.a-size-base.s-underline-text').first().text().trim() || null,
      productUrl: `https://www.amazon.com/dp/${asin}`,
    });
  });

  return items;
}

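const searchUrl = 'https://www.amazon.com/s?k=wireless+headphones';
scrapeSearch(searchUrl)
  .then((items) => {
    // Print the first three records as a quick sanity check.
    console.log(JSON.stringify(items.slice(0, 3), null, 2));
  })
  .catch((err) => {
    // Report fetch or status failures instead of leaving an unhandled rejection.
    console.error(`Scrape failed: ${err.message}`);
    process.exitCode = 1;
  });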

A few details keep this resilient. Amazon stamps each result card with a data-asin attribute, which is the most stable hook on the page, so we anchor on that and skip any card without both an ASIN and a title (sponsored slots and layout spacers often have one but not the other). The price lives in a hidden .a-offscreen span that holds the clean, formatted value, which is more reliable than scraping the split visible price. Every field falls back to null when missing, so one absent value never crashes the run. Save the file as scraper.js and run it with node scraper.js; you should see a clean array of product records.
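
The scraper stores price, rating, and reviews as display strings. If you want to compare them numerically later, a small normalization pass helps. The helpers below are a sketch with hypothetical names, and they assume US-style formatting ("$1,299.99", "4.5 out of 5 stars", "2,184"); other locales or abbreviated counts would need their own rules.

```javascript
// Each helper returns null when the input is missing or cannot be parsed.

// "$1,299.99" -> 1299.99
function parsePrice(text) {
  if (!text) return null;
  const match = text.replace(/,/g, '').match(/\d+(?:\.\d+)?/);
  return match ? Number(match[0]) : null;
}

// "4.5 out of 5 stars" -> 4.5
function parseRating(text) {
  if (!text) return null;
  const match = text.match(/^\s*(\d+(?:\.\d+)?)/);
  return match ? Number(match[1]) : null;
}

// "2,184" -> 2184. Abbreviated forms such as "2.1K" return null rather than
// a guessed number.
function parseReviewCount(text) {
  if (!text) return null;
  const match = text.trim().match(/^\(?([\d,]+)\)?$/);
  return match ? Number(match[1].replace(/,/g, '')) : null;
}
```

Keeping the original strings alongside the parsed numbers lets you re-parse old runs if the display format changes.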

Crawlbase Amazon Scraper

The scraper above works for one page at a time, but an automated job that walks hundreds of Amazon pages on a schedule is where waiting on each request synchronously starts to hurt. The asynchronous Crawler takes your URLs, fetches each one behind a rotating, trusted IP, and pushes finished pages to a webhook you control, so a scheduled batch keeps moving without you managing a headless fleet or a proxy pool. Point it at public pages on the free tier first.

Step 2: Store each run

Automation is only useful if the data persists, so write every run to disk keyed by a timestamp. That gives you a history you can diff later to see how a price or ranking moved. A flat JSON file per run is the simplest durable store and easy to load into anything else afterward.

javascript
const fs = require('fs');
const path = require('path');

function saveRun(items) {
  const dir = path.join(__dirname, 'data');
  fs.mkdirSync(dir, { recursive: true });

  // Take one timestamp so the filename and the scrapedAt field always agree.
  const scrapedAt = new Date().toISOString();
  const stamp = scrapedAt.replace(/[:.]/g, '-');
  const file = path.join(dir, `run-${stamp}.json`);

  const payload = { scrapedAt, count: items.length, items };
  fs.writeFileSync(file, JSON.stringify(payload, null, 2));
  console.log(`Saved ${items.length} items to ${file}`);
  return file;
}

The ISO timestamp sorts naturally, so listing the data directory gives you the run history in order. For a production job you would swap this for a database, but the contract is the same: one record set per run, stamped with when it was collected. Each saved file carries a scrapedAt field so a later comparison knows exactly which moment a price belongs to.
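
Reading that history back can be sketched as follows. The function names listRuns and loadLatestRuns are hypothetical, and the sketch assumes the run-<timestamp>.json naming used by saveRun.

```javascript
const fs = require('fs');
const path = require('path');

// Return the run files in `dir`, oldest first. The timestamped names sort
// chronologically as plain strings, so no date parsing is needed.
function listRuns(dir) {
  if (!fs.existsSync(dir)) return [];
  return fs
    .readdirSync(dir)
    .filter((name) => /^run-.*\.json$/.test(name))
    .sort()
    .map((name) => path.join(dir, name));
}

// Load and parse the most recent `n` runs, oldest first.
function loadLatestRuns(dir, n = 2) {
  return listRuns(dir)
    .slice(-n)
    .map((file) => JSON.parse(fs.readFileSync(file, 'utf8')));
}

// Usage: const [previous, latest] = loadLatestRuns(path.join(__dirname, 'data'));
```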

Step 3: Add retries for transient failures

A scheduled job runs unattended, so a single flaky request must not abort the batch. Wrap the fetch in a small retry helper that backs off between attempts and only gives up after a few tries. Most transient failures (a momentary block, a slow render, a network blip) clear on the next attempt.

javascript
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function scrapeWithRetry(url, attempts = 3) {
  for (let i = 1; i <= attempts; i++) {
    try {
      return await scrapeSearch(url);
    } catch (err) {
      console.warn(`Attempt ${i} failed: ${err.message}`);
      if (i === attempts) throw err;
      await sleep(2000 * i);
    }
  }
}

The backoff scales the wait with the attempt number, so the gaps grow instead of hammering a target that is already struggling: 2 seconds after the first failure, then 4 after the second. A 6 second wait only occurs if you raise attempts to four or more, because the final attempt rethrows without waiting. That rethrow lets the caller decide whether to log the error and continue with the next keyword or stop the run. This pattern is the difference between a scheduler that recovers on its own and one you babysit.
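
To see what the retry helper waits for a given attempt count, the schedule can be computed directly. This is a sketch with hypothetical helper names. The jitter function is an optional addition that the retry helper above does not use; it is shown only as one common way to stop several jobs from retrying in lockstep.

```javascript
// Waits (in ms) between attempts. There is one fewer wait than attempts,
// because the last failure rethrows instead of sleeping.
function backoffDelays(attempts, baseMs = 2000) {
  const delays = [];
  for (let i = 1; i < attempts; i++) delays.push(baseMs * i);
  return delays;
}

// Optional: shift a delay by up to +/- `spread` (a fraction of the delay).
// `random` is injectable so the result can be made deterministic in tests.
function withJitter(delayMs, spread = 0.25, random = Math.random) {
  const offset = (random() * 2 - 1) * spread * delayMs;
  return Math.round(delayMs + offset);
}

console.log(backoffDelays(3)); // [ 2000, 4000 ]
```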

Step 4: Schedule it with cron

Now make it hands-off. The node-cron package runs a function on a standard cron expression from inside your process, so you can keep the scraper, the store, and the retry logic in one place. Here we run the job every morning at 6 a.m. and collect a list of keywords each time.

javascript
const cron = require('node-cron');

const keywords = ['wireless headphones', 'mechanical keyboard', 'usb c hub'];

async function runJob() {
  console.log(`Run started at ${new Date().toISOString()}`);
  const all = [];

  for (const keyword of keywords) {
    const url =
      `https://www.amazon.com/s?k=${encodeURIComponent(keyword)}`;
    try {
      const items = await scrapeWithRetry(url);
      all.push(...items.map((it) => ({ ...it, keyword })));
    } catch (err) {
      console.error(`Skipping "${keyword}": ${err.message}`);
    }
    await sleep(3000);
  }

  saveRun(all);
}

cron.schedule('0 6 * * *', runJob);
console.log('Scheduler running. Daily job at 06:00.');

The expression 0 6 * * * means minute 0 of hour 6, every day; node-cron evaluates it in the server's local time zone unless you pass a timezone option. Each keyword runs through the retry wrapper, a failed keyword is logged and skipped rather than killing the run, and a 3 second pause between keywords keeps the pace polite. Leave this process running (under a process manager like pm2 or a systemd service in production) and it collects a fresh snapshot every morning with no further input. If you would rather not keep a Node process alive, replace the cron.schedule line with a direct call to runJob(), save the file as job.js, and trigger node job.js from a system crontab or a cloud scheduled task on the same expression.
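
A daily schedule is unlikely to overlap with itself, but if you tighten the cadence or the keyword list grows, one run can still be in flight when the next trigger fires. A minimal guard, using a hypothetical wrapper named preventOverlap, might look like this:

```javascript
// Wrap an async job so a new trigger is skipped while the previous run is
// still in flight. Resolves to true if the job ran, false if it was skipped.
function preventOverlap(job, log = console.warn) {
  let running = false;
  return async function guarded(...args) {
    if (running) {
      log('Previous run still in progress; skipping this trigger.');
      return false;
    }
    running = true;
    try {
      await job(...args);
      return true;
    } finally {
      // Clear the flag even if the job throws, so later triggers still run.
      running = false;
    }
  };
}

// Usage with the scheduler from this step:
// cron.schedule('0 6 * * *', preventOverlap(runJob));
```

The flag lives in memory, so it only protects against overlap within one process. Jobs started separately by a system crontab would need a lock file or a similar mechanism.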

Step 5: Scale up with the async Crawler and a webhook

The synchronous loop above is fine for a few keywords, but once a run covers hundreds of product URLs, waiting on each request in turn becomes the bottleneck. The asynchronous Crawler is built for this: you push URLs to it, it fetches each one behind a trusted IP in the background, and it delivers the finished page to a webhook you host. Your scheduler stops blocking on every fetch and just enqueues work, then handles results as they arrive.

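First, enqueue the URLs. You push to the Crawler through the same Crawling API client by adding two options to a normal get call: callback set to true, and crawler set to the name of a Crawler you created in your dashboard. The webhook URL is configured on that Crawler rather than passed with each request.

javascript
const { CrawlingAPI } = require('crawlbase');

const api = new CrawlingAPI({ token: process.env.CRAWLBASE_TOKEN });

// Placeholder: use the name of the Crawler you created in the dashboard, where
// you also set its webhook URL (e.g. https://your-server.com/crawlbase-webhook).
const CRAWLER_NAME = 'YourCrawlerName';

const productUrls = [
  'https://www.amazon.com/dp/B0CHX1W1XY',
  'https://www.amazon.com/dp/B09G9FPHY6',
];

async function enqueue() {
  for (const url of productUrls) {
    // callback + crawler queue the URL on the Crawler instead of fetching it inline.
    const response = await api.get(url, { callback: true, crawler: CRAWLER_NAME });
    // The push response is a small JSON document carrying the request id.
    const { rid } = response.json || JSON.parse(response.body);
    console.log(`Queued ${url} -> rid ${rid}`);
  }
}

enqueue().catch((err) => console.error(`Enqueue failed: ${err.message}`));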

Each call returns a request id (rid) immediately, so the loop finishes fast no matter how many URLs you queue. Crawlbase does the fetching in the background and, as each page completes, POSTs the finished HTML to your webhook. Your webhook receives the pages, parses them, and stores the results, exactly like the synchronous path but decoupled from enqueueing. The receiver below uses Express, which the setup step did not install, so add it first with npm install express.

javascript
const express = require('express');
const cheerio = require('cheerio');

const app = express();
app.use(express.text({ type: '*/*', limit: '5mb' }));

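// saveRun is the helper from Step 2; require or paste it into this file.
app.post('/crawlbase-webhook', (req, res) => {
  // req.body is only a string when a body was actually sent, so guard the type.
  const html = typeof req.body === 'string' ? req.body : '';
  const $ = cheerio.load(html);
  const title = $('#productTitle').text().trim();
  const price = $('.a-price .a-offscreen').first().text().trim();

  // Deliveries without a product title (empty or blocked pages) are skipped.
  if (title) saveRun([{ title, price: price || null }]);
  res.sendStatus(200);
});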

app.listen(3000, () => console.log('Webhook listening on 3000'));

The webhook URL must be publicly reachable, so during development expose your local server with a tunnel like ngrok and use that HTTPS address as your webhook URL. Always return a 200 quickly so Crawlbase knows the page was received: parsing a single page before responding is fine, but move anything slower into a background task. For a deeper look at this pattern, see how to extract data using the Crawlbase Crawler, which covers the async queue and callbacks in detail.
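
Because enqueueing and delivery are decoupled, it helps to know which queued requests have not come back yet. The sketch below keeps an in-memory map keyed by rid. The name createTracker is hypothetical, and how your webhook learns the rid of a delivery is an assumption here, so check the Crawler documentation for where that value arrives before wiring it in.

```javascript
// In-memory record of queued requests, keyed by the rid returned at enqueue.
// `now` is injectable so the timing logic can be tested with a fake clock.
function createTracker(now = Date.now) {
  const pending = new Map();
  return {
    // Call after a successful enqueue.
    queued(rid, url) {
      pending.set(rid, { url, queuedAt: now() });
    },
    // Call from the webhook. Returns the original entry, or null if unknown.
    received(rid) {
      const entry = pending.get(rid) || null;
      pending.delete(rid);
      return entry;
    },
    // Requests that have waited at least maxAgeMs without a delivery.
    overdue(maxAgeMs) {
      const cutoff = now() - maxAgeMs;
      return [...pending.entries()]
        .filter(([, entry]) => entry.queuedAt <= cutoff)
        .map(([rid, entry]) => ({ rid, url: entry.url }));
    },
    size() {
      return pending.size;
    },
  };
}
```

The map is lost on restart, so a long-running deployment would keep the same information in a file or database instead.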

What the output looks like

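The scheduled job writes a single JSON file per run under data/, stamped with when it was collected. The webhook from Step 5, as written, calls saveRun once per delivered page, so it produces one file per product with only a title and price. The envelope (scrapedAt, count, items) is the same in both cases, so anything downstream reads one format.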

json
{
  "scrapedAt": "2026-06-11T06:00:04.812Z",
  "count": 2,
  "items": [
    {
      "asin": "B0CHX1W1XY",
      "title": "Wireless Over-Ear Headphones, 40H Playtime",
      "price": "$59.99",
      "rating": "4.5 out of 5 stars",
      "reviews": "2,184",
      "productUrl": "https://www.amazon.com/dp/B0CHX1W1XY",
      "keyword": "wireless headphones"
    },
    {
      "asin": "B09G9FPHY6",
      "title": "Compact Mechanical Keyboard, Hot-Swappable",
      "price": "$45.99",
      "rating": "4.4 out of 5 stars",
      "reviews": "1,073",
      "productUrl": "https://www.amazon.com/dp/B09G9FPHY6",
      "keyword": "mechanical keyboard"
    }
  ]
}

With a run history like this, the automation pays off: load any two run files, match items by ASIN, and you have a price-over-time series you can chart or alert on. The same store feeds a best-seller tracker or a stock-change notifier without changing the scraper at all.
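
Matching two runs by ASIN can be sketched as below. The function names are hypothetical, and the price parsing assumes US-style strings like the sample output above.

```javascript
// Parse a display price like "$59.99" into a number, or null if absent.
function toNumber(price) {
  if (!price) return null;
  const match = price.replace(/,/g, '').match(/\d+(?:\.\d+)?/);
  return match ? Number(match[0]) : null;
}

// Compare two run payloads and list items whose price changed.
// Items missing from either run, or without a parseable price, are skipped.
function diffPrices(olderRun, newerRun) {
  const before = new Map(olderRun.items.map((it) => [it.asin, toNumber(it.price)]));
  const changes = [];
  for (const item of newerRun.items) {
    const oldPrice = before.get(item.asin);
    const newPrice = toNumber(item.price);
    if (oldPrice == null || newPrice == null || oldPrice === newPrice) continue;
    changes.push({
      asin: item.asin,
      title: item.title,
      from: oldPrice,
      to: newPrice,
      // Subtract in cents to avoid floating-point noise in the difference.
      change: (Math.round(newPrice * 100) - Math.round(oldPrice * 100)) / 100,
    });
  }
  return changes;
}
```

If the same ASIN appears under several keywords in one run, this sketch compares against the last occurrence; dedupe by ASIN first if that matters for your tracking.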

Staying unblocked

Even with the Crawling API handling rendering and rotation, a scheduled job that runs unattended needs habits that keep it healthy over weeks, not just one good run.

  • Pace the schedule, not just the loop. Running every few minutes around the clock looks nothing like a human and burns credits fast. A daily or hourly cadence is plenty for price and ranking tracking, and it keeps your footprint small.
  • Let rotation do its job. The Crawling API spreads requests across many residential IPs so no single one trips a rate limit. If you ever roll your own stack instead, this is the part to get right, and our guide on how to scrape websites without getting blocked covers the full playbook.
  • Watch the status codes in your logs. A scheduled run that starts returning non-200 responses is telling you something changed. Because the job logs each failure and the retry wrapper backs off, you get a paper trail instead of a silent gap in your data.
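
One way to turn "watch the logs" into a check is to summarize each run and flag it when too many keywords failed. This is a sketch under assumptions: summarizeRun is a hypothetical helper, the outcome objects are a shape you would collect yourself inside the job loop, and the 50% threshold is an arbitrary starting point.

```javascript
// Summarize per-keyword outcomes and flag a run whose failure rate is high.
// `outcomes` looks like [{ keyword, ok: true }, { keyword, ok: false }].
function summarizeRun(outcomes, maxFailureRate = 0.5) {
  const failed = outcomes.filter((o) => !o.ok);
  const failureRate = outcomes.length ? failed.length / outcomes.length : 0;
  return {
    total: outcomes.length,
    failed: failed.length,
    failureRate,
    healthy: failureRate <= maxFailureRate,
    failedKeywords: failed.map((o) => o.keyword),
  };
}
```

A run marked unhealthy several days in a row usually means a selector or page layout changed, which retries alone will not fix.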

Legal and ethical considerations

Whether scraping Amazon is allowed depends on Amazon's conditions of use, your jurisdiction, and what you do with the data. Amazon's terms restrict automated access, so scraping can run against those terms regardless of how careful your tooling is. None of the code here changes that; it just makes the technical part work. Read Amazon's Conditions of Use and its robots.txt, and treat both as the boundary for what you collect.

A few lines worth holding to. Collect only public product data: the title, price, rating, review count, ASIN, and the product link that anyone can see without an account. Respect Amazon's stated rate expectations and keep your scheduled volume low enough that you are not straining its servers, which is exactly why the cron cadence and pacing in this guide matter. Avoid personal data, including anything tied to identifiable reviewers beyond the public review text on a page. If you plan to reuse the data commercially, get permission or a licensing agreement rather than assuming silence is consent.

For volume or commercial use, Amazon offers official routes, including the Product Advertising API and the seller-facing Selling Partner API, and those are the right tools when you need large volumes, guaranteed structure, or commercial rights. This guide is deliberately scoped to public product and search pages because that is the line that keeps the work defensible. It does not cover anything behind a login, buyer or seller account data, order history, private messages, or any attempt to bypass authentication. If your project needs more than public data, Amazon's official APIs or a data agreement are the correct path, not a cleverer scraper or a faster schedule.

Recap

Key takeaways

  • Automation is what makes the data valuable. A single scrape is a snapshot; a scheduled one gives you price, ranking, and stock changes over time.
  • The Crawling API handles rendering and rotation. You anchor on Amazon's data-asin cards and pull price, rating, and reviews with cheerio, while the API deals with blocks and IPs.
  • Retries and pacing keep an unattended job healthy. A backoff wrapper recovers from transient failures, and a skipped keyword never kills the whole run.
  • Cron makes it hands-off. node-cron or a system scheduler runs the job on a fixed expression, and every run is stored as a timestamped JSON file you can diff.
  • The async Crawler scales it. For large batches, enqueue URLs with a webhook callback so fetching happens in the background instead of blocking your loop.

Frequently Asked Questions (FAQs)

How do I schedule an Amazon scraper to run automatically?

The simplest path inside Node is node-cron, which runs a function on a standard cron expression from within your process. Write your scrape-and-store logic as one function, then schedule it, for example 0 6 * * * for 6 a.m. daily. If you prefer not to keep a Node process alive, put the same logic in a script and trigger it from a system crontab or a cloud scheduled task on the same expression.

When should I use the async Crawler instead of synchronous requests?

Use the asynchronous Crawler once a run covers many URLs and waiting on each request in turn becomes the bottleneck. You enqueue URLs with a webhook callback, Crawlbase fetches each page in the background behind a trusted IP, and it POSTs the finished HTML to your endpoint. Synchronous requests are simpler and fine for a handful of pages per run.

How do I handle failures in an unattended scraping job?

Wrap each fetch in a retry helper that backs off between attempts and gives up after a few tries, then catch and log a failure at the keyword level so one bad request skips that item instead of aborting the batch. Log every non-200 status so your run leaves a paper trail. The Crawling API's rotation already prevents most blocks, so retries mainly cover transient blips.

Do I need the JS token to scrape Amazon?

Often not. Many Amazon search and product pages parse fine with the normal token, which costs fewer credits. Start with the normal token, and only switch to the JavaScript (JS) token if a field you need comes back empty because it renders client-side. Using the cheaper token where it works keeps a scheduled job affordable on the free tier.

Where should I store the scraped results?

For a small job, a timestamped JSON file per run is durable and easy to diff later to track price or ranking changes. As volume grows, move to a database so you can query across runs efficiently. Either way, keep the record shape stable and stamp each run with a scrapedAt time so a later comparison knows which moment a value belongs to.

Is it legal to automate Amazon scraping?

Automating does not change the rules; it just runs the same scrape repeatedly, which raises the importance of pacing. Stay on public product data, respect Amazon's Conditions of Use and robots.txt, keep your scheduled volume modest, and avoid personal data and anything behind a login. For volume or commercial reuse, Amazon's official Product Advertising or Selling Partner API is the sanctioned route.
