Amazon product reviews are one of the richest public signals on the open web. The star rating, title, body text, and date on each review add up to a running record of what real buyers think about a product. That data drives sentiment analysis, product research, and competitor comparison, which is why teams want a clean, structured feed of it instead of scrolling the page by hand.
This guide shows you how to scrape Amazon reviews with Python. You build a small, runnable scraper that fetches a product's reviews page through the Crawling API, parses each review with BeautifulSoup, walks the pagination, and exports the results to JSON and CSV. The whole walkthrough stays scoped to the public review text Amazon shows any visitor, and the legality section near the end is not boilerplate, so read it before you point this at real volume.
What you will build
A Python script that takes an Amazon product's reviews URL, retrieves the rendered page through the Crawling API, and extracts a structured record per review. We will use the Meta Quest Pro as the running example and pull these fields from each review block:
- Reviewer name: the public display name shown on the review.
- Rating: the star score, for example "4.0 out of 5 stars".
- Title: the short headline the reviewer gave the review.
- Body text: the full written review.
- Date: the "Reviewed in the United States on ..." line.
The script collects those records across every page of reviews and writes them to amazon_reviews.json and amazon_reviews.csv, ready for a sentiment model, a spreadsheet, or a database.
Why a plain request fails on Amazon
If you point a bare HTTP client at an Amazon reviews URL, you rarely get the reviews. Amazon is one of the most heavily defended sites on the web against automated traffic. A datacenter IP, or a request that does not look like a real browser, gets met with a CAPTCHA, a "Robot Check" interstitial, or an outright block before you reach the review blocks. Even when a request slips through, parts of the page render through JavaScript, so a raw fetch can hand back a shell instead of finished markup.
So a working Amazon reviews scraper needs two things in one request: a browser that actually renders the page, and an IP the platform reads as a real shopper. You can assemble that yourself with a headless browser plus a pool of rotating residential proxies, but stitching those together and keeping them healthy is most of the work. The Crawling API folds both into a single call: you send it the URL, it renders the page behind a trusted residential IP, rotates addresses for you, and returns finished HTML for BeautifulSoup to parse.
Crawlbase can return either raw HTML for you to parse yourself, or pre-parsed JSON through the Scraper API's built-in amazon-product-reviews parser. This tutorial parses the HTML with BeautifulSoup so you can see exactly which selectors map to which field, then notes where the auto-parse route saves you that step.
Prerequisites
You need a few things in place before writing any code. None of them take long.
Basic Python. You should be comfortable writing and running a Python script and installing packages with pip. If you are new to the language, the "scrape a website with Python" primer or any beginner course will get you to the level this tutorial assumes.
Python 3.8 or later. Confirm your version with python --version. If you do not have it, install it from python.org or through a distribution like Anaconda.
A Crawlbase account and token. Sign up, open your dashboard, and copy your token from the account docs page. Crawlbase gives you 1,000 free requests to start with no card, and you pay only for successful requests. Treat the token like a password: it authenticates your requests, so keep it out of version control.
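One way to keep the token out of version control is to read it from an environment variable when the script starts. The sketch below is a minimal example under stated assumptions: the variable name CRAWLBASE_TOKEN and the helper load_token are choices made for this illustration, not something Crawlbase requires.

```python
import os


def load_token(env=None, var_name="CRAWLBASE_TOKEN"):
    """Return the API token from the environment, or raise a clear error.

    CRAWLBASE_TOKEN is an assumed variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return token
```

After exporting the variable in your shell, the client could be created with CrawlingAPI({"token": load_token()}) instead of a hardcoded string.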
Set up the project
Create a virtual environment so project dependencies stay isolated, then install the two libraries the scraper needs.
```bash
python --version
python -m venv amazon_env
source amazon_env/bin/activate
pip install crawlbase beautifulsoup4
```
On Windows, activate the environment with amazon_env\Scripts\activate instead of the source line. Two dependencies do the work: crawlbase is the official client for the Crawling API, and beautifulsoup4 parses the returned HTML so you can pull each field out of a review block by CSS selector.
Understanding the Amazon reviews page
Before writing selectors, open a product's reviews page in your browser, right-click a single review, and choose Inspect. Amazon wraps each review in a container marked with data-hook="review", and exposes the individual fields through stable data-hook attributes inside that container. Those hooks are far more durable than Amazon's utility class names, so you target them wherever you can.
The fields you care about map to these hooks inside each review block:
- Reviewer name: the span.a-profile-name element.
- Rating: [data-hook="review-star-rating"] (or review-star-rating-view-point on some layouts).
- Title: [data-hook="review-title"].
- Body text: [data-hook="review-body"].
- Date: [data-hook="review-date"].
Step 1: Fetch the rendered reviews page
Start by getting the finished page. Import the CrawlingAPI class, initialize it with your token, point it at a product's reviews URL, and request it. Checking the status code before you parse keeps failures loud instead of silent.
```python
from crawlbase import CrawlingAPI

# Replace the placeholder with your own Crawlbase token.
api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

REVIEWS_URL = (
    "https://www.amazon.com/Meta-Quest-Pro-Oculus/product-reviews/"
    "B09Z7KGTVW/?reviewerType=all_reviews"
)


def crawl(page_url):
    """Fetch one rendered reviews page and return its HTML, or None on failure."""
    # Wait for asynchronous content, then hold 3000 ms before the page is captured.
    options = {"ajax_wait": "true", "page_wait": 3000}
    response = api.get(page_url, options)
    if response["status_code"] == 200:
        return response["body"].decode("latin1")
    print(f"Request failed: {response['status_code']}")
    return None


if __name__ == "__main__":
    html = crawl(REVIEWS_URL)
    print(html[:500] if html else "No HTML returned")
```
The two wait options help when parts of the page load asynchronously. ajax_wait tells the API to wait for asynchronous content to finish, and page_wait holds for a fixed number of milliseconds after load so late-rendering review blocks appear before the page is captured. The body is decoded as latin1 because Amazon pages mix in characters that strict UTF-8 decoding can choke on. Run the script and you should see real review markup, not a Robot Check page. That confirms the request is getting through before you write a single selector.
That Robot Check page is exactly what Amazon throws at a bare request. The Crawling API renders the page in a real browser, rotates through residential IPs server-side, and hands you finished HTML in one call, so you skip running a headless fleet and a proxy pool yourself. Point it at a reviews URL on the free tier first, then scale.
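To make the "real review markup, not a Robot Check page" check explicit, a small guard can inspect the returned HTML before parsing. This is a rough sketch: the marker strings and the looks_blocked helper are assumptions for illustration, so compare them against an actual blocked response before relying on them.

```python
# Marker strings are assumptions based on the block pages described above;
# verify them against a real blocked response before relying on them.
BLOCK_MARKERS = (
    "robot check",
    "enter the characters you see below",
    "captcha",
)


def looks_blocked(html):
    """Return True when the HTML looks like a challenge page, not reviews."""
    if not html:
        return True
    lowered = html.lower()
    # A page that contains review containers is treated as a real result,
    # even if a reviewer happens to mention one of the marker words.
    has_reviews = 'data-hook="review"' in lowered
    has_marker = any(marker in lowered for marker in BLOCK_MARKERS)
    return has_marker and not has_reviews
```

Calling looks_blocked(html) right after crawl() gives you a clear signal to stop or retry instead of parsing a challenge page and getting an empty list.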
Step 2: Parse the reviews with BeautifulSoup
With rendered HTML in hand, load it into BeautifulSoup, find every review block, and pull each field by its data-hook selector. Wrap each block in a try/except so one malformed review does not crash the run.
```python
from bs4 import BeautifulSoup


def text_of(block, selector):
    """Return the stripped text of the first match inside block, or None."""
    el = block.select_one(selector)
    return el.get_text(strip=True) if el else None


def parse_reviews(html):
    """Extract one dict per review block found in the rendered HTML."""
    soup = BeautifulSoup(html, "html.parser")
    blocks = soup.select('div[data-hook="review"]')
    reviews = []
    for block in blocks:
        try:
            reviews.append({
                "reviewer_name": text_of(block, "span.a-profile-name"),
                "rating": text_of(block, '[data-hook="review-star-rating"]'),
                "title": text_of(block, '[data-hook="review-title"]'),
                "text": text_of(block, '[data-hook="review-body"]'),
                "date": text_of(block, '[data-hook="review-date"]'),
            })
        except Exception as e:
            # One malformed review should not stop the whole run.
            print(f"Skipped a review: {e}")
    return reviews
```
The text_of helper queries a single element inside one review block and returns None when the element is missing, instead of throwing on a .get_text() call against nothing. That keeps extraction resilient when a field is absent. The star rating selector has a known alternate: if review-star-rating returns nothing on a particular layout, swap in review-star-rating-view-point, which Amazon uses on some pages. The code above does not switch automatically, so make that change by hand when you see empty ratings. The rating arrives as a string like "4.0 out of 5 stars"; split on " out of" later if you want a bare numeric score for a model.
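Turning the rating string into a bare number could look like the sketch below. The rating_to_float helper is a hypothetical name, and the pattern assumes the English "N out of 5 stars" wording shown above; other locales may need a different expression.

```python
import re

# Assumes the English wording "4.0 out of 5 stars" shown in this guide.
RATING_RE = re.compile(r"(\d+(?:\.\d+)?)\s+out of\s+5")


def rating_to_float(rating_text):
    """Return the numeric score from a rating string, or None if absent."""
    if not rating_text:
        return None
    match = RATING_RE.search(rating_text)
    if not match:
        return None
    return float(match.group(1))
```

Returning None for anything unexpected keeps a single odd rating from crashing a later analysis step.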
Amazon revises its markup often, and the utility class names change without notice. The data-hook attributes are more durable, which is why the selectors above lean on them. When a field comes back as None for every review, re-inspect a live reviews page in your browser's dev tools and update the selector. Periodic maintenance is normal for any production scraper, not a sign something is broken. The Scraper API's amazon-product-reviews parser exists precisely so you can offload this upkeep.
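The selector swap described above can also be done in code by trying a list of selectors in order. The sketch below is one way to do it, not part of the original script: first_text and RATING_SELECTORS are hypothetical names, and block is assumed to be any object with a BeautifulSoup-style select_one() method, such as a review block from parse_reviews.

```python
def first_text(block, selectors):
    """Return stripped text from the first selector that matches, else None.

    `block` is any object exposing a BeautifulSoup-style select_one() method.
    """
    for selector in selectors:
        el = block.select_one(selector)
        if el is not None:
            return el.get_text(strip=True)
    return None


# Primary hook first, then the alternate that appears on some layouts.
RATING_SELECTORS = (
    '[data-hook="review-star-rating"]',
    '[data-hook="review-star-rating-view-point"]',
)
```

Inside the parser, the rating line would then become "rating": first_text(block, RATING_SELECTORS), and adding a new layout is a one-line change to the tuple.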
Step 3: Walk the review pagination
One page is a demo; a real job runs across every page of reviews for a product. Amazon paginates reviews with a pageNumber query parameter, so you walk pages by incrementing it and stopping when a page returns no review blocks. That avoids hardcoding a page count and naturally handles products with only a handful of reviews.
To see the pattern, compare the URLs Amazon uses:
- Page 1: .../product-reviews/B09Z7KGTVW/?reviewerType=all_reviews
- Page 2: .../product-reviews/B09Z7KGTVW/?reviewerType=all_reviews&pageNumber=2
- Page 3: .../product-reviews/B09Z7KGTVW/?reviewerType=all_reviews&pageNumber=3
```python
import time


def scrape_all_reviews(base_url, max_pages=10):
    """Walk review pages until one is empty or max_pages is reached."""
    all_reviews = []
    for page in range(1, max_pages + 1):
        # base_url already carries a query string, so append with "&".
        page_url = f"{base_url}&pageNumber={page}"
        html = crawl(page_url)
        if not html:
            break
        reviews = parse_reviews(html)
        if not reviews:
            print(f"No reviews on page {page}; stopping.")
            break
        all_reviews.extend(reviews)
        print(f"Page {page}: {len(reviews)} reviews")
        time.sleep(2)  # pace requests between pages
    return all_reviews
```
The max_pages cap keeps a run bounded so a product with thousands of reviews does not spin forever, and the empty-results break stops you early when Amazon runs out of pages. The time.sleep(2) between pages paces requests so you are not hammering the site in a tight loop, which is the fastest way to get throttled. Tune both to your volume and to the pacing guidance in the "Staying unblocked" section.
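The loop above appends &pageNumber= directly, which works because the base URL already has a query string. If you expect base URLs with no query string, or with a pageNumber already set, a more defensive builder is one option. The with_page_number helper below is a hypothetical name for this sketch and uses only the standard library.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page_number(url, page):
    """Return url with its pageNumber query parameter set to page."""
    parts = urlsplit(url)
    # Drop any existing pageNumber so the value is never duplicated.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "pageNumber"
    ]
    query.append(("pageNumber", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

In the pagination loop, page_url = with_page_number(base_url, page) would replace the f-string without changing anything else.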
Step 4: Assemble and store the data
Now wire the fetch, the parse, and the pagination into one runnable script, then write the collected reviews to both JSON and CSV. JSON keeps the nested structure for a pipeline; CSV drops straight into a spreadsheet or a pandas DataFrame for sentiment work.
```python
import csv
import json
import time

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup

# Replace the placeholder with your own Crawlbase token.
api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

REVIEWS_URL = (
    "https://www.amazon.com/Meta-Quest-Pro-Oculus/product-reviews/"
    "B09Z7KGTVW/?reviewerType=all_reviews"
)


def crawl(page_url):
    """Fetch one rendered reviews page and return its HTML, or None on failure."""
    options = {"ajax_wait": "true", "page_wait": 3000}
    response = api.get(page_url, options)
    if response["status_code"] == 200:
        return response["body"].decode("latin1")
    print(f"Request failed: {response['status_code']}")
    return None


def text_of(block, selector):
    """Return the stripped text of the first match inside block, or None."""
    el = block.select_one(selector)
    return el.get_text(strip=True) if el else None


def parse_reviews(html):
    """Extract one dict per review block found in the rendered HTML."""
    soup = BeautifulSoup(html, "html.parser")
    blocks = soup.select('div[data-hook="review"]')
    reviews = []
    for block in blocks:
        try:
            reviews.append({
                "reviewer_name": text_of(block, "span.a-profile-name"),
                "rating": text_of(block, '[data-hook="review-star-rating"]'),
                "title": text_of(block, '[data-hook="review-title"]'),
                "text": text_of(block, '[data-hook="review-body"]'),
                "date": text_of(block, '[data-hook="review-date"]'),
            })
        except Exception as e:
            print(f"Skipped a review: {e}")
    return reviews


def scrape_all_reviews(base_url, max_pages=10):
    """Walk review pages until one is empty or max_pages is reached."""
    all_reviews = []
    for page in range(1, max_pages + 1):
        page_url = f"{base_url}&pageNumber={page}"
        html = crawl(page_url)
        if not html:
            break
        reviews = parse_reviews(html)
        if not reviews:
            break
        all_reviews.extend(reviews)
        print(f"Page {page}: {len(reviews)} reviews")
        time.sleep(2)  # pace requests between pages
    return all_reviews


def save(reviews):
    """Write the collected reviews to JSON and, when non-empty, to CSV."""
    with open("amazon_reviews.json", "w", encoding="utf-8") as f:
        json.dump(reviews, f, indent=2, ensure_ascii=False)
    if reviews:
        # The CSV header comes from the keys of the first record.
        with open("amazon_reviews.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=reviews[0].keys())
            writer.writeheader()
            writer.writerows(reviews)
    print(f"Saved {len(reviews)} reviews to JSON and CSV")


def main():
    reviews = scrape_all_reviews(REVIEWS_URL)
    save(reviews)


if __name__ == "__main__":
    main()
```
Save the script as scraper.py and run it with python scraper.py. The script walks the review pages, prints a per-page count, and writes amazon_reviews.json and amazon_reviews.csv in the same directory. From there the records feed a sentiment model, a rating-trend chart, or a comparison against a competitor's product. Before any of that, it is worth running the data through a cleaning pass (see the guide on how to structure and clean web scraped data for AI and ML) so the ratings and dates land in consistent types.
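As one piece of that cleaning pass, the date line can be split into a country and a real date value. The sketch below is an assumption-laden example: parse_review_date is a hypothetical helper, and the pattern is inferred from the "Reviewed in the United States on ..." wording in this guide, so other marketplaces or languages may not match.

```python
import re
from datetime import datetime

# Pattern inferred from "Reviewed in the United States on August 2, 2023".
DATE_RE = re.compile(r"^Reviewed in (?:the )?(?P<country>.+?) on (?P<date>.+)$")


def parse_review_date(date_text):
    """Return (country, date) parsed from the review date line.

    Either element is None when that part cannot be parsed.
    """
    if not date_text:
        return None, None
    match = DATE_RE.match(date_text.strip())
    if not match:
        return None, None
    try:
        day = datetime.strptime(match.group("date"), "%B %d, %Y").date()
    except ValueError:
        # Keep the country even when the date format is unexpected.
        return match.group("country"), None
    return match.group("country"), day
```

Note that %B matches English month names only under an English or default locale, which is what this sketch assumes.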
What the output looks like
Each record is a flat object with the five fields. The JSON file looks like this:
```json
[
  {
    "reviewer_name": "Grrgoyl",
    "rating": "4.0 out of 5 stars",
    "title": "No regret",
    "text": "My 256 gb Quest 2 is in danger of running out of space, so the Pro was an easy call.",
    "date": "Reviewed in the United States on August 2, 2023"
  },
  {
    "reviewer_name": "Damian",
    "rating": "3.0 out of 5 stars",
    "title": "Excellent comfort, poor display",
    "text": "I purchased this to upgrade from my first gen Rift and the comfort is great, the display less so.",
    "date": "Reviewed in the United States on November 1, 2022"
  }
]
```
The CSV carries the same five columns: reviewer_name, rating, title, text, and date. If you would rather skip the parsing entirely, the Scraper API's amazon-product-reviews parser returns these fields pre-parsed as JSON, and it also surfaces extras like the review ID and verified-purchase flag.
Staying unblocked
Even with rendering and rotation handled for you, Amazon watches for scraper-shaped traffic. A few habits keep a run healthy, and they apply to any hard commercial target.
- Pace your requests. Spread requests out with a delay between pages instead of crawling at full speed. The time.sleep in the pagination loop is the floor, not the ceiling.
- Lean on rotation. A pool of residential IPs spreads requests across many real-user addresses so no single one trips a rate limit. The Crawling API handles this for you; if you roll your own stack, this is the part to get right.
- Read the status codes. A run that starts returning challenges or errors is telling you the current rate or IP tier is no longer enough. Treat that as a signal to back off, not noise to ignore.
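Backing off when requests start failing can be expressed as a retry wrapper with growing delays. This is a minimal sketch under stated assumptions: fetch_with_backoff is a hypothetical helper, the attempt count and delays are illustrative defaults, and fetch is any callable that returns HTML on success and None on failure, which is how the crawl() function in this guide behaves.

```python
import time


def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until it returns HTML, waiting longer after each failure.

    `fetch` returns HTML on success and None on failure. `sleep` is
    injectable so the wait can be replaced in tests.
    """
    for attempt in range(max_attempts):
        html = fetch(url)
        if html:
            return html
        if attempt < max_attempts - 1:
            # Exponential backoff: 2s, 4s, 8s with the default base_delay.
            sleep(base_delay * (2 ** attempt))
    return None
```

In the pagination loop, html = fetch_with_backoff(crawl, page_url) would give a failing page a few slower retries before the run stops.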
For the broader playbook, see the guide on how to scrape websites without getting blocked. If you want the full reviews picture across more than one retailer, the general guide on how to scrape customer reviews covers the cross-site patterns, and the guide on how to scrape Amazon product data pairs well with this one when you want the listing details alongside the reviews.
Is it legal to scrape Amazon reviews?
Whether scraping Amazon reviews is allowed depends on Amazon's terms of service, your jurisdiction, and what you do with the data. Amazon's Conditions of Use restrict automated access, so scraping can run against those terms regardless of how careful your tooling is. None of the code here changes that; it just makes the technical part work. Read Amazon's Conditions of Use and its robots.txt, and treat both as the boundary for what you collect. Amazon also runs CAPTCHA challenges to confirm a human is browsing, which is part of the same defensive posture.
A few lines worth holding to. Collect only the public review text: the rating, title, body, and date that any visitor can read on the reviews page without an account. The reviewer name shown on a review is public, but a display name is the most you should ever retain. Do not build profiles of individual reviewers, do not follow profile links to assemble someone's review history across products, and do not try to tie a display name to a real identity. Respect privacy and treat each review as a data point about the product, not about a person.
This guide is deliberately scoped to public review pages because that is the line that keeps the work defensible. It does not cover anything behind a login, account or order data, or any attempt to bypass authentication, and it does not redistribute copyrighted review media. If you need licensed or bulk access, Amazon offers official APIs and partner programs for product and review data, and that is the right tool when you need large volumes, guaranteed structure, or commercial rights. When your project needs more than public review text, an official API or a data agreement is the correct path, not a cleverer scraper.
Key takeaways
- A plain request gets blocked. Amazon meets bare HTTP traffic with a Robot Check or CAPTCHA, so you need a rendered page behind a trusted IP, which the Crawling API gives you in one call.
- Target the data-hook attributes. Each review sits in a div[data-hook="review"] block, with name, rating, title, body, and date exposed through stable data-hook selectors that outlast the utility class names.
- Paginate with pageNumber. Walk &pageNumber= until a page returns no review blocks, pace requests with a delay, and cap the page count.
- Export to JSON and CSV. JSON keeps the structure for a pipeline; CSV drops into a spreadsheet or pandas for sentiment and trend analysis.
- Stay on public review text. Respect Amazon's terms and robots.txt, keep to the public rating and text, never profile individual reviewers, and prefer an official API for licensed or bulk data.
Frequently Asked Questions (FAQs)
Why does a plain request fail on Amazon reviews?
Amazon defends hard against automated traffic. A datacenter IP or a request that does not look like a real browser gets met with a CAPTCHA, a Robot Check interstitial, or a block before it reaches the review blocks, and parts of the page render through JavaScript on top of that. The Crawling API renders the page behind a trusted residential IP, so the reviews are present when BeautifulSoup parses them.
Which fields can I extract from an Amazon review?
This scraper pulls the reviewer's public display name, the star rating, the review title, the body text, and the date. Each one maps to a data-hook attribute inside the review block: review-star-rating, review-title, review-body, and review-date, plus span.a-profile-name for the name. The Scraper API's amazon-product-reviews parser returns the same fields plus extras like the review ID and a verified-purchase flag.
How do I scrape every page of reviews?
Amazon paginates reviews with a pageNumber query parameter on the product-reviews URL. Increment it in a loop, parse each page with the same code, and stop when a page returns no review blocks. Cap the page count and add a short delay between requests so you pace the run and do not get throttled.
My selectors return None. What changed?
Almost certainly Amazon's markup. Its utility class names change without notice, which is why the selectors above target data-hook attributes instead. If the star rating comes back empty, try review-star-rating-view-point, which Amazon uses on some layouts. Re-inspect a live reviews page in your browser's dev tools and update the selector; periodic maintenance is normal for any production scraper.
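A quick way to spot which selector broke is to count missing values per field across a run. The sketch below is illustrative only: missing_field_counts and dead_selectors are hypothetical helpers that work on the list of review dicts the scraper produces.

```python
from collections import Counter


def missing_field_counts(reviews):
    """Count, per field, how many review records have a None or empty value."""
    missing = Counter()
    for review in reviews:
        for field, value in review.items():
            if value is None or value == "":
                missing[field] += 1
    return dict(missing)


def dead_selectors(reviews):
    """Return fields missing on every record, a hint that a selector broke."""
    if not reviews:
        return []
    counts = missing_field_counts(reviews)
    return sorted(field for field, n in counts.items() if n == len(reviews))
```

A field that is empty on a few reviews is usually normal; a field that shows up in dead_selectors(reviews) is the one to re-inspect in dev tools.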
Can I use the scraped reviews for sentiment analysis?
Yes, that is one of the most common reasons to collect them. Export to CSV, load it into pandas, and run the body text through a sentiment model or a rating-trend analysis. Clean the rating into a numeric value and parse the date into a real date type first so the fields are model-ready.
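Before reaching for pandas or a model, a standard-library pass over the CSV can confirm the data looks sane. The sketch below counts reviews per whole-star rating; star_histogram is a hypothetical helper, and the inline sample rows are made up to stand in for amazon_reviews.csv.

```python
import csv
import io
from collections import Counter


def star_histogram(rows):
    """Count reviews per whole-star rating from rows shaped like the CSV output."""
    counts = Counter()
    for row in rows:
        rating = (row.get("rating") or "").split(" out of")[0]
        try:
            counts[int(float(rating))] += 1
        except ValueError:
            continue  # rating missing or in an unexpected format
    return dict(sorted(counts.items()))


# Inline sample standing in for amazon_reviews.csv (values are made up).
SAMPLE_CSV = """reviewer_name,rating,title,text,date
A,4.0 out of 5 stars,t,b,d
B,3.0 out of 5 stars,t,b,d
C,4.0 out of 5 stars,t,b,d
D,,t,b,d
"""

histogram = star_histogram(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(histogram)  # {3: 1, 4: 2}
```

To run it on real output, pass csv.DictReader(open("amazon_reviews.csv", newline="", encoding="utf-8")) instead of the inline sample.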
Is it safe to store reviewer names?
Keep it minimal. The display name on a public review is public, but it is the most you should retain, and you should never use it to build a profile of an individual reviewer or link a name to a real identity. Treat each review as a data point about the product, respect privacy, and check Amazon's terms and your local data-protection rules before storing any personal field.
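If your analysis does not need the display name at all, the simplest way to keep it minimal is to drop the field before anything is written to disk. The sketch below shows one way to do that; without_reviewer_names is a hypothetical helper, and reviewer_name is the key used by the scraper in this guide.

```python
def without_reviewer_names(reviews, name_field="reviewer_name"):
    """Return copies of the review records with the display name removed."""
    return [
        {key: value for key, value in review.items() if key != name_field}
        for review in reviews
    ]
```

Calling save(without_reviewer_names(reviews)) would then store only the rating, title, text, and date, leaving the original list untouched.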

