How to Scrape Customer Reviews

Customer reviews are some of the most useful public data on the web. Ratings, written feedback, and verified-purchase flags tell you what people actually think about a product, and they update constantly. The problem is that most review pages render client-side and paginate across dozens or hundreds of screens, so a plain HTTP request hands you an empty shell. This guide shows you how to scrape customer reviews with a small, runnable Python pipeline: render the JavaScript-heavy page, parse structured fields, paginate through the full set, store the results, and optionally run sentiment on top.

To keep this honest and defensible, the whole walkthrough is scoped to public reviews: the rating, title, body, date, and verified badge that anyone can see without logging in. It does not touch user accounts, login-walled content, or any personal data beyond what a platform already displays publicly. The ethics and ToS section near the end is not boilerplate, so read it before you point this at production volume.

Why scrape customer reviews

A single review tells you one person's opinion. A few thousand of them, structured into rows you can query, tell you where a product is winning and where it is quietly bleeding customers. That is the value: turning a rendered review page into clean, comparable data you can chart, track over time, or feed into a model. Teams use it for competitive benchmarking, product-gap analysis, brand monitoring, and tracking how sentiment shifts after a launch or a fix.

This is the same shape of problem as any ecommerce web scraping job. The difference with reviews is volume and pagination: the data is spread across many pages, loaded lazily, and guarded by anti-bot defenses that escalate as you go. So the approach has to handle rendering, pagination, and blocking from the first request.

Why a plain fetch fails on review pages

Request a modern review URL with a bare HTTP client and you typically get status 200 and almost no review content in the body. Two things work against you. First, most platforms render reviews in the browser with JavaScript, so the initial HTML is a skeleton that fills in only after the page's scripts run. Second, review sites flag automated traffic fast: datacenter IPs and request patterns that do not look like a real browser get challenged or blocked before they ever see the rendered content.
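One way to confirm the empty-shell problem on your own target is to fetch the page with only the standard library and count markup that looks like a review card. This is a minimal sketch: the marker pattern and the `plain_fetch` and `count_review_markers` names are illustrative assumptions, not something any specific platform guarantees.

```python
import re
import urllib.request

# Hypothetical markers; swap in attributes or class names from your target's rendered DOM.
REVIEW_MARKERS = re.compile(r'data-review-id|class="[^"]*review', re.IGNORECASE)

def count_review_markers(html):
    """Count substrings that look like review-card markup."""
    return len(REVIEW_MARKERS.findall(html))

def plain_fetch(url, timeout=10):
    """Fetch a URL with no JavaScript rendering, the way a bare HTTP client would."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `count_review_markers(plain_fetch(url))` on a client-side rendered page will often return zero or close to it, while the same count on the HTML you see in the browser's dev tools is much higher. A blocked request may raise an HTTP error instead, which is the second failure mode described above.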

So a working review scraper needs two things in one request: a browser that actually renders the page, and an IP the platform reads as a real visitor. You can assemble that yourself with a headless browser plus a pool of rotating residential proxies, but stitching those together and keeping them healthy is most of the work. The Crawlbase Crawling API folds both into a single call: you send it the URL with a JavaScript token, it renders the page behind a trusted IP, and returns the finished HTML to parse. If you would rather skip selectors on common targets, the Crawling API returns parsed fields as JSON, and for raw proxy access there is the Smart AI Proxy.

Why a JS token

Crawlbase offers two token types. The normal token fetches static HTML; the JavaScript (JS) token renders the page in a real browser first. Review pages are client-side rendered, so you need the JS token here. Using the normal token returns the same empty shell a plain fetch would, with the review cards missing.

What you extract from a review

The fields worth capturing are consistent across most platforms, even when the markup is not. Aim for these five on every review card:

rating: the star score, normalized to a number. Most sites use a 1 to 5 scale; some use 1 to 10, which you convert later.
title: the short headline a reviewer gives, when the platform has one.
body: the review text itself, the qualitative part that carries the actual signal.
date: when the review was posted, ideally from a machine-readable datetime attribute rather than display text.
verified: whether the platform marks it as a verified purchase, which lets you filter out lower-trust reviews later.

The goal is one stable schema so reviews from different sources line up without per-source cleanup. A single unified shape works well:

json

{
  "rating": 4.5,
  "title": "Exactly what I needed",
  "body": "Arrived early and works as described.",
  "date": "2026-01-10",
  "verified": true,
  "url": "https://www.example.com/product/123/reviews?page=2"
}

Once every review fits this structure, cross-product and cross-platform analysis is just filtering and grouping, not reformatting.
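As an illustration of that point, here is a small sketch that averages ratings per month, optionally counting only verified reviews. The `monthly_average_rating` helper and the sample rows are hypothetical; only the field names come from the schema above, and the grouping assumes dates are ISO-formatted strings.

```python
from collections import defaultdict

def monthly_average_rating(reviews, verified_only=True):
    """Return {"YYYY-MM": mean rating} for reviews that have a rating and a date."""
    totals = defaultdict(lambda: [0.0, 0])  # month -> [sum of ratings, count]
    for review in reviews:
        if verified_only and not review["verified"]:
            continue
        if review["rating"] is None or not review["date"]:
            continue
        bucket = totals[review["date"][:7]]  # "2026-01-10" -> "2026-01"
        bucket[0] += review["rating"]
        bucket[1] += 1
    return {month: round(total / count, 2) for month, (total, count) in totals.items()}

sample = [
    {"rating": 4.5, "date": "2026-01-10", "verified": True},
    {"rating": 2.0, "date": "2026-01-22", "verified": True},
    {"rating": 1.0, "date": "2026-01-25", "verified": False},
    {"rating": 5.0, "date": "2026-02-03", "verified": True},
    {"rating": None, "date": "2026-02-04", "verified": True},
]
print(monthly_average_rating(sample))  # {'2026-01': 3.25, '2026-02': 5.0}
```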

Set up the project

You need Python 3 and a Crawlbase account with a JS token, which you get from the dashboard after signing up. Create a project folder and install the libraries.

bash

python --version

mkdir review-scraper && cd review-scraper
python -m venv venv && source venv/bin/activate
pip install crawlbase beautifulsoup4

Two dependencies do the work: crawlbase is the official client for the Crawling API, and beautifulsoup4 parses the returned HTML. The standard-library re, csv, and json modules cover normalization and storage, so there is nothing else to install. Keep your token out of the source: export it as an environment variable and read it at runtime.

Fetch the rendered HTML

Start by getting the finished page. The Python client wraps the API in one get call. You pass two options that matter for a review site: ajax_wait tells the API to wait for asynchronous content to load, and page_wait holds for a fixed number of milliseconds after load so late-rendering review cards have time to appear. Five seconds is a reasonable starting point; raise it if results come back thin.

python

import os
from crawlbase import CrawlingAPI

# JS token renders the page in a real browser before returning HTML
api = CrawlingAPI({"token": os.environ["CRAWLBASE_JS_TOKEN"]})

options = {
    "ajax_wait": "true",
    "page_wait": 5000,
}

def fetch_html(url):
    response = api.get(url, options)
    if response["status_code"] != 200:
        raise RuntimeError(f"fetch failed: {response['status_code']}")
    return response["body"].decode("utf-8")

if __name__ == "__main__":
    url = "https://www.example.com/product/123/reviews"
    print(fetch_html(url)[:2000])

Run it and you should see real markup with review cards in it, not the empty shell a plain fetch returns. That confirms rendering is working before you write a single selector. The response["body"] comes back as bytes from the client, so decode it once and hand the string to the parser.

Parse reviews with BeautifulSoup

With the HTML in hand, load it into BeautifulSoup and walk the review cards. Each card carries the fields you want, but the class names differ by platform, so the parser uses a list of candidate selectors and takes the first that matches. Inspect the live page in your browser's dev tools to find the current selectors for your target, then map each field to one.

python

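import re
from bs4 import BeautifulSoup

def first_text(card, selectors):
    # Return the text of the first candidate selector that yields non-empty text
    for sel in selectors:
        el = card.select_one(sel)
        if el:
            text = el.get_text(separator=" ", strip=True)
            if text:
                return text
    return ""

def rating_text(card):
    # Star widgets often keep the score in an attribute and render no visible text,
    # so check data-rating and aria-label before falling back to the element text
    for sel in ["[data-rating]", ".star-rating", "[class*='star']"]:
        el = card.select_one(sel)
        if el:
            raw = (el.get("data-rating") or el.get("aria-label")
                   or el.get_text(separator=" ", strip=True))
            if raw:
                return raw
    return ""

def parse_reviews(html, source_url=""):
    soup = BeautifulSoup(html, "html.parser")
    cards = soup.select("[data-review-id], article[class*='review'], .review-card")
    reviews = []
    for card in cards:
        # Pull the first number out of strings like "4.5 out of 5 stars"
        match = re.search(r"(\d+(?:\.\d+)?)", rating_text(card))
        date_el = card.select_one("time[datetime], [data-review-date]")
        date = ""
        if date_el:
            # The element can match on either attribute, so read both
            date = date_el.get("datetime") or date_el.get("data-review-date") or ""
        reviews.append({
            "rating": float(match.group(1)) if match else None,
            "title": first_text(card, [".review-title", "h3", "[class*='title']"]),
            "body": first_text(card, [".review-body", "[data-review-text]", "p"]),
            "date": date,
            "verified": card.select_one("[class*='verified']") is not None,
            "url": source_url,
        })
    # Drop cards with no body text (layout cards, ad slots)
    return [r for r in reviews if r["body"]]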

Selectors drift

Review platforms change their class names and data attributes without notice. Treat the selectors above as a starting template, not a contract. When extraction returns empty fields, re-inspect the live page and update the candidate lists. This is normal maintenance for any production scraper, not a sign something is broken.

The first_text helper is what keeps the parser portable: give it a short list of likely selectors per field and it returns the first hit, so adapting to a new platform is mostly editing those lists rather than rewriting logic. Dropping reviews with an empty body filters out layout cards and ad slots that share the review container's class.

Paginate through every review

Fetching one page is rarely enough. Most platforms split reviews across dozens or hundreds of pages, usually with a ?page=2 style query parameter. If you only request the first page, you miss the majority of the data. The pattern is to build page URLs in a loop, fetch and parse each, and stop when a page returns no reviews or you hit a limit you set.

python

import time

def scrape_all_reviews(base_url, max_pages=25):
    all_reviews = []
    for page in range(1, max_pages + 1):
        sep = "&" if "?" in base_url else "?"
        page_url = f"{base_url}{sep}page={page}"
        html = fetch_html(page_url)
        reviews = parse_reviews(html, page_url)
        if not reviews:
            break  # empty page means we ran past the last one
        all_reviews.extend(reviews)
        print(f"page {page}: {len(reviews)} reviews")
        time.sleep(1)  # pace requests so you stay under rate limits
    return all_reviews

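A few practical notes. The max_pages cap bounds the run, so a site that keeps returning reviews for out-of-range page numbers cannot keep you fetching indefinitely. Stop as soon as a page yields zero reviews. Because fetch_html raises on any non-200 response, a single failed page aborts the run and discards what was collected so far; catch the exception inside the loop if you want to keep partial results. And if your target uses infinite scroll instead of numbered pages, add the scroll option to the Crawling API options dict rather than building page URLs; the rest of the loop is the same.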
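Some sites respond to an out-of-range page number by serving the last page again instead of an empty one, in which case the empty-page check never fires. Here is a minimal sketch of one guard, assuming a review can be identified by its date, title, and body: stop when a page contributes nothing new. The `fetch` and `parse` arguments stand in for `fetch_html` and `parse_reviews`, and `paginate_unique` is a hypothetical name.

```python
import time

def paginate_unique(base_url, fetch, parse, max_pages=25, delay=1.0):
    """Collect reviews page by page, stopping on an empty or fully repeated page."""
    seen = set()
    collected = []
    sep = "&" if "?" in base_url else "?"
    for page in range(1, max_pages + 1):
        page_url = f"{base_url}{sep}page={page}"
        fresh = []
        for review in parse(fetch(page_url), page_url):
            # Assumed identity of a review; use a platform review ID if you have one
            key = (review.get("date"), review.get("title"), review.get("body"))
            if key not in seen:
                seen.add(key)
                fresh.append(review)
        if not fresh:
            break  # nothing new: either past the end or the site repeated a page
        collected.extend(fresh)
        time.sleep(delay)  # pace requests between pages
    return collected
```

Usage would look like `paginate_unique(base, fetch_html, parse_reviews)`. Passing the two functions in also makes the loop easy to exercise with fakes before spending real requests.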

Normalize across platforms

For a single product on one platform, the parser output is usually clean enough to store directly. The moment you pull reviews from more than one source, small inconsistencies show up: one site rates out of 10, another uses relative dates like "3 days ago," and field names diverge. A thin normalization pass keeps everything comparable.

python

def normalize(review, scale=5):
    rating = review.get("rating")
    if rating is not None and scale != 5:
        # map any scale onto a common 0-5 range
        review["rating"] = round(rating / scale * 5, 2)
    review["body"] = " ".join(review["body"].split())
    return review

The function above converts ratings to one scale and collapses whitespace in the body. Relative dates and divergent field names are not handled by it; map those per source before the rows are stored. For a single platform you can skip this step; for multi-platform analysis it is what stops the data from drifting apart.
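One way to resolve relative dates, sketched under stated assumptions: the input is an English phrase of the form "N units ago" (or "today" / "yesterday"), a month is approximated as 30 days, and a year as 365. `resolve_relative_date` is a hypothetical helper, and anything it does not recognize is returned unchanged.

```python
import re
from datetime import date, timedelta

_UNIT_DAYS = {"day": 1, "week": 7, "month": 30, "year": 365}  # month and year are approximations
_RELATIVE = re.compile(r"^\s*(\d+|an?)\s+(day|week|month|year)s?\s+ago\s*$", re.IGNORECASE)

def resolve_relative_date(text, today=None):
    """Turn '3 days ago' into an ISO date string; pass other strings through."""
    today = today or date.today()
    lowered = text.strip().lower()
    if lowered == "today":
        return today.isoformat()
    if lowered == "yesterday":
        return (today - timedelta(days=1)).isoformat()
    match = _RELATIVE.match(text)
    if not match:
        return text  # already absolute, or a format this sketch does not cover
    count = 1 if match.group(1).lower() in ("a", "an") else int(match.group(1))
    days = count * _UNIT_DAYS[match.group(2).lower()]
    return (today - timedelta(days=days)).isoformat()
```

Because the result depends on when the page was scraped, pass the scrape date as `today` if you process stored HTML later.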

Store the results

Logging to the console is fine while you iterate, but you want the data on disk. CSV is the simplest target and opens in any spreadsheet; the standard-library csv module maps each dict key to a column.

python

import csv

def save_csv(reviews, path="reviews.csv"):
    fields = ["rating", "title", "body", "date", "verified", "url"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(reviews)
    print(f"saved {len(reviews)} reviews to {path}")

If you would rather query the data with SQL, write the same rows into a SQLite table with the standard-library sqlite3 module; the parsing and pagination stay identical. JSON Lines is another good option when you want to stream records into a downstream pipeline. Wire the pieces together and the whole run is a handful of calls:

python

if __name__ == "__main__":
    base = "https://www.example.com/product/123/reviews"
    reviews = scrape_all_reviews(base, max_pages=25)
    reviews = [normalize(r) for r in reviews]
    save_csv(reviews)
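For the SQLite route mentioned above, a minimal sketch with the standard-library sqlite3 module could look like this. The table name, column types, and the `save_sqlite` helper are assumptions chosen to mirror the CSV columns.

```python
import sqlite3

def save_sqlite(reviews, path="reviews.db"):
    """Append reviews to a SQLite table whose columns mirror the CSV fields."""
    conn = sqlite3.connect(path)
    try:
        with conn:  # commits on success, rolls back if an insert fails
            conn.execute(
                """CREATE TABLE IF NOT EXISTS reviews (
                       rating REAL, title TEXT, body TEXT,
                       date TEXT, verified INTEGER, url TEXT)"""
            )
            # Named placeholders read the matching keys from each review dict
            conn.executemany(
                "INSERT INTO reviews VALUES (:rating, :title, :body, :date, :verified, :url)",
                reviews,
            )
    finally:
        conn.close()
```

Once the rows are in, a query such as `SELECT AVG(rating) FROM reviews WHERE verified = 1` answers questions that would take a pivot table in a spreadsheet. Note that this sketch appends on every run, so rerunning it against the same pages stores duplicates unless you add a uniqueness constraint.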

Optional: run sentiment on the body text

Once reviews are structured, sentiment is a short add-on. A lightweight, rule-based model like VADER gives you a polarity score per review without training anything, which is enough to flag the angriest and happiest reviews and to track average sentiment over time.

python

# pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def add_sentiment(reviews):
    for r in reviews:
        score = analyzer.polarity_scores(r["body"])["compound"]
        r["sentiment"] = round(score, 3)
    return reviews

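The compound score runs from -1 (very negative) to +1 (very positive). If you write the scored reviews with save_csv, add "sentiment" to its fields list first; csv.DictWriter raises ValueError on dict keys that are not in fieldnames. For deeper work, pass the same body text to a transformer-based classifier or a hosted NLP service; the pipeline up to this point does not change.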
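To turn compound scores into labels, the cutoffs suggested in the VADER documentation are 0.05 and -0.05. A small sketch follows; `label_sentiment` and `summarize_sentiment` are hypothetical helpers, and the threshold is a convention you can tune for your data.

```python
from collections import Counter

def label_sentiment(compound, threshold=0.05):
    """Bucket a compound score into positive, negative, or neutral."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

def summarize_sentiment(reviews):
    """Count labels across reviews that already carry a 'sentiment' score."""
    return Counter(label_sentiment(r["sentiment"]) for r in reviews)
```

Running `summarize_sentiment(add_sentiment(reviews))` gives a quick distribution to compare across products or time windows.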

Scale across many products

A loop and a sleep is fine for one product. When you need reviews across hundreds of URLs, you start fighting concurrency, retries, and scheduling in your own code. That is where the Crawler comes in: instead of pulling pages one by one, you push a list of URLs to Crawlbase and it processes them in the cloud, rendering each page and delivering the finished HTML to your endpoint through a webhook. Failed requests are retried for you, so you are not babysitting a queue. For a handful of pages the Crawling API is enough; past that, the Crawler removes most of the operational overhead.

Staying unblocked

Even with rendering handled, review sites watch for scraper-shaped traffic. A few habits keep a run healthy, and they apply to any hard commercial target.

Pace your requests. Hammering the same product's reviews in a tight loop is the fastest way to get throttled. The time.sleep in the pagination loop is there on purpose; keep it.
Lean on rotation. A pool of residential proxies spreads requests across many real-user IPs so no single address trips a rate limit. The Crawling API handles this for you; if you roll your own, this is the part to get right.
Read the status codes. A run that starts returning challenges or errors is telling you the current rate or IP tier is no longer enough. Treat the response status as signal, not noise, and back off when it changes.
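The back-off advice can be sketched as a small retry wrapper with exponential delays and jitter. This assumes the wrapped function raises RuntimeError on a bad status, as `fetch_html` does; `fetch_with_backoff` and its defaults are illustrative, not tuned for any particular site.

```python
import random
import time

def fetch_with_backoff(fetch, url, retries=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url), retrying on RuntimeError with exponentially growing delays."""
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except RuntimeError:
            if attempt == retries:
                raise  # out of retries: surface the last error to the caller
            # 2s, 4s, 8s, ... plus up to 1s of jitter so retries do not align
            sleep(base_delay * 2 ** attempt + random.uniform(0, 1))
```

In the pagination loop you would call `fetch_with_backoff(fetch_html, page_url)` in place of `fetch_html(page_url)`. The `sleep` parameter exists only so the wrapper can be exercised without real waiting.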

For the broader playbook, see how to scrape websites without getting blocked. If you want to compare this managed approach with a hand-built headless stack, web scraping with Python and Selenium walks through that build.

The honest part: ToS and personal data

Scraping a large commercial site sits in a legal gray area, and whether it is allowed depends on the platform's terms of service, your jurisdiction, and what you do with the data. Many review sites restrict automated access in their terms, so scraping can run against those terms regardless of how careful your tooling is. None of the code here changes that; it just makes the technical part work.

A few lines worth holding to. Collect only public reviews: the rating, title, body, date, and verified flag that anyone can see without an account. Respect the site's robots.txt and its stated rate expectations, and keep your request volume low enough that you are not straining anyone's servers. Do not collect personal data beyond what the platform already displays publicly, and never try to deanonymize a reviewer or join their reviews to data from elsewhere. If you plan to reuse the data commercially, get permission or an official data agreement rather than assuming silence is consent.
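Checking robots.txt can be automated with the standard-library urllib.robotparser. This is a minimal sketch, assuming you identify your scraper with a user-agent string of your own choosing (`review-research-bot` here is made up); the `robots_lines` parameter is an illustrative hook for supplying rules you have already downloaded.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def robots_url_for(page_url):
    """Build the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"

def is_allowed(page_url, user_agent="review-research-bot", robots_lines=None):
    """Return True if the robots.txt rules permit user_agent to fetch page_url."""
    parser = RobotFileParser()
    if robots_lines is None:
        parser.set_url(robots_url_for(page_url))
        parser.read()  # network call: downloads and parses the live robots.txt
    else:
        parser.parse(robots_lines)  # use rules that were fetched earlier
    return parser.can_fetch(user_agent, page_url)
```

Run the check once per site before a crawl and skip any URL it rejects. A passing check is not permission under the site's terms of service; it covers only what robots.txt states.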

This guide is deliberately scoped to public review content because that is the line that keeps the work defensible. It does not cover anything behind a login, account or profile data, or actions taken as a logged-in user. If your project needs more than public reviews, the right move is an official API or a data agreement with the platform, not a cleverer scraper. For background on how managed access differs from raw scraping, ecommerce web scraping is a useful companion read.

Recap

Key takeaways

Review pages are client-side rendered. A plain fetch returns an empty shell, so you must render the page before you parse it.
Rendering and a trusted IP come together. The Crawling API with a JS token does both in one call; ajax_wait and page_wait control how long it waits for content.
Parse into one schema. Capture rating, title, body, date, and verified with candidate selectors, and expect those selectors to drift over time.
Pagination is the data. Loop page URLs, stop on an empty page, and pace requests so you stay under rate limits.
Store, then analyze. Write to CSV, SQLite, or JSON Lines, and bolt on sentiment once the data is structured.
Stay on public reviews. Respect ToS and robots.txt; no accounts, no personal data beyond what is publicly shown.

Frequently Asked Questions (FAQs)

How do I scrape customer reviews from JavaScript-heavy sites?

Most review platforms render their cards client-side, so a raw HTTP request returns status 200 with the reviews missing. You need a browser-based fetch. Send the URL to the Crawling API with a JS token and it renders the page in a real browser before handing back the HTML, so every review is present when BeautifulSoup parses it. The ajax_wait and page_wait options control how long it waits for late-loading content.

Do I need the normal token or the JS token to scrape customer reviews?

The JS token. The normal token fetches static HTML, which on a review site is the same empty shell a plain fetch returns. The JS token renders the page in a real browser first, so the review cards exist in the HTML by the time your parser runs.

How do I get all the reviews and not just the first page?

Most platforms paginate with a ?page=2 style parameter. Build the page URLs in a loop, fetch and parse each one, and stop when a page returns no reviews or you hit a limit you set. For sites that use infinite scroll instead of numbered pages, pass the Crawling API's scroll option rather than constructing page URLs; the rest of the loop stays the same.

My selectors return empty fields. What changed?

Almost certainly the platform's markup. Review sites change their class names and data attributes without notice, so selectors that worked last month can break. Re-inspect a live review page in your browser's dev tools and update the candidate selector lists in the parser. Periodic selector maintenance is normal for any production scraper.

Can I run sentiment analysis on the scraped reviews?

Yes. Once reviews are normalized into a consistent schema, the body text feeds straight into an NLP step. A rule-based model like VADER gives you a polarity score per review with no training; for more nuance, pass the same text to a transformer classifier or a hosted NLP service. The scraping pipeline does not change.

Is it legal to scrape customer reviews?

It depends on the platform's terms of service, your jurisdiction, and your purpose, and many sites restrict automated access. Keep strictly to public review content, respect robots.txt and rate expectations, and do not collect personal data beyond what is publicly displayed or try to identify individual reviewers. For commercial reuse, get permission or an official data agreement rather than relying on a scraper.

Ian Kalvin

Technical Support Engineer · Crawlbase

Technical support engineer at Crawlbase, writing from the front line of what actually breaks in production scraping and proxy setups.

Neil Zamora

Senior Architect · Crawlbase

Senior architect at Crawlbase, focused on the systems behind large-scale crawling: proxy rotation, anti-bot resilience, and the APIs that hide that complexity.
