Goodreads is one of the largest public catalogs of book data on the web, and every public book page carries the kind of reader signal that powers trend analysis, recommendation engines, and market research: a title, an author, an average rating, how many people rated it, and the text of public reviews. The catch is that Goodreads renders much of that content client-side and loads reviews asynchronously, so a plain HTTP request hands you a thin shell instead of the ratings and review text you came for.

This guide shows you how to scrape Goodreads ratings and reviews with Python the reliable way. You will build a small, runnable scraper that fetches a rendered book page through the Crawling API, parses the fields you want with BeautifulSoup, and prints a clean structured record. The whole walkthrough is scoped to public book and review data, and the legality section near the end is not boilerplate, so read it before you point this at any real volume.

What you will build

A Python script that takes a public Goodreads book URL, retrieves the rendered HTML through the Crawling API, and extracts a structured record of the book and its visible reviews. We will use a well-known public title as the running example and pull these fields:

  • Book title: the name of the book, for example "The Great Gatsby".
  • Author: the author credited on the book page.
  • Average rating: the aggregate score Goodreads computes from user ratings.
  • Ratings count: how many people have rated the book.
  • Reviews: the public review text shown on the page, with the reviewer's display name.

Why a plain fetch fails on Goodreads

If you request a Goodreads book URL with a bare HTTP client, you get a response with status 200 and almost none of the review content in the body. Two things work against you. First, Goodreads renders much of the rating and review section in the browser with JavaScript, so the initial HTML is a shell that only fills in after the page's scripts run. Second, the review list loads asynchronously and expands only when a visitor interacts with it, so even a partial render can miss the data you want unless you give the page time to settle.

So a working Goodreads scraper needs two things in one request: a browser that actually renders the page, and an IP the platform reads as a real visitor. You can assemble that yourself with a headless browser plus a pool of rotating residential proxies, but stitching those together and keeping them healthy is most of the work. The Crawling API folds both into a single call: you send it the URL with a JavaScript token, it renders the page behind a trusted IP, and it returns finished HTML for you to parse.

Why the JS token

Crawlbase offers two token types. The normal token fetches static HTML; the JavaScript (JS) token renders the page in a real browser first. Goodreads loads its ratings and reviews client-side, so you need the JS token here. Using the normal token returns the same thin shell a plain fetch would, and there is little of value to parse out of it.

Prerequisites

You need a few things in place before writing any code. None of them take long.

Basic Python. You should be comfortable writing and running a Python script and installing packages with pip. If you are new to parsing HTML, our primer on how to use BeautifulSoup in Python covers the selector basics this tutorial leans on.

Python 3.8 or later. Confirm your version with python --version. If you do not have it, install it from python.org or through a distribution like Anaconda.

A Crawlbase account and JS token. Sign up, open your dashboard, and copy your JavaScript (JS) token from the account docs page. Treat the token like a password: it authenticates your requests, so keep it out of version control.
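
One way to keep the token out of version control is to read it from an environment variable instead of pasting it into the script. The sketch below assumes a variable named CRAWLBASE_JS_TOKEN; that name and the load_token helper are choices made for this example, not something Crawlbase requires.

```python
import os


def load_token(env=None, name="CRAWLBASE_JS_TOKEN"):
    """Return the token stored in the environment, or raise a clear error.

    The variable name is an assumption made for this example.
    """
    env = os.environ if env is None else env
    token = (env.get(name) or "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {name} environment variable before running the scraper."
        )
    return token
```

With this in place, the client in the steps that follow could be created as CrawlingAPI({"token": load_token()}) instead of using a hard-coded string.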

Set up the project

Create a virtual environment so project dependencies stay isolated, then install the two libraries the scraper needs.

bash
python --version

python -m venv goodreads_env
source goodreads_env/bin/activate

pip install crawlbase beautifulsoup4

On Windows, activate the environment with goodreads_env\Scripts\activate instead of the source line. Two dependencies do the work: crawlbase is the official client for the Crawling API, and beautifulsoup4 parses the returned HTML so you can pull out individual fields by CSS selector.

Step 1: Fetch the rendered book page

Start by getting the finished page. Import the CrawlingAPI class, initialize it with your JS token, and request the book URL. Checking the status before you parse keeps failures loud instead of silent.

python
from crawlbase import CrawlingAPI

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_JS_TOKEN"})

def crawl(page_url):
    options = {"ajax_wait": "true", "page_wait": 5000}
    response = api.get(page_url, options)
    if response["status_code"] == 200:
        return response["body"].decode("utf-8")
    print(f"Request failed: {response['status_code']}")
    return None

if __name__ == "__main__":
    page_url = "https://www.goodreads.com/book/show/4671.The_Great_Gatsby"
    html = crawl(page_url)
    print(html[:500] if html else "No HTML returned")

The two wait options matter for a client-rendered target like this. ajax_wait tells the API to wait for asynchronous content to finish loading, and page_wait holds for a fixed number of milliseconds after load so late-rendering elements appear before the page is captured. Five seconds is a reasonable start; raise it if the review fields come back empty. Save the script as scraper.py, run it with python scraper.py, and you should see real book markup, not the thin shell a plain fetch returns. That confirms rendering works before you write a single selector.
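
If you would rather not tune page_wait by hand, one possible approach is to retry with progressively longer waits until the HTML looks rendered. This is a rough sketch, not part of the Crawling API: looks_rendered and fetch_until_rendered are hypothetical helpers, the marker strings are the class names the parser in this guide targets, and fetch stands for any callable you write that passes page_wait through to the API.

```python
def looks_rendered(html, markers=("ReviewCard", "RatingStatistics")):
    """Heuristic: treat the page as rendered if every marker string appears.

    The markers mirror class names used by the selectors in this guide and
    are assumptions that need updating when the markup changes.
    """
    return bool(html) and all(marker in html for marker in markers)


def fetch_until_rendered(fetch, page_url, waits=(5000, 8000, 12000)):
    """Call fetch(page_url, page_wait) with longer waits until the page looks rendered.

    `fetch` is any callable returning HTML text or None. Returns a tuple of
    (html, wait_used); html is the last response if none looked rendered.
    """
    html, used = None, None
    for wait in waits:
        html, used = fetch(page_url, wait), wait
        if looks_rendered(html):
            break
    return html, used
```

Each retry is a separate request, so keep the list of waits short.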

Crawlbase Crawling API

Goodreads needs a rendered page behind a trusted IP, in one call. The Crawling API takes a JS token, runs the page in a real browser, rotates through residential IPs server-side, and hands you finished HTML, so you skip running a headless fleet and a proxy pool yourself. Point it at a public book page on the free tier first.

Step 2: Parse the book fields with BeautifulSoup

With rendered HTML in hand, load it into BeautifulSoup and pull each field by its selector. A Goodreads book page lays out the core details in a predictable structure, so you can map title, author, average rating, and ratings count to individual selectors, then walk the review cards to collect the public review text. Helper functions that return None on a missing element keep one absent field from crashing the whole run.

python
from bs4 import BeautifulSoup

def text_of(node, selector):
    el = node.select_one(selector)
    return el.get_text(strip=True) if el else None

def scrape_book(html):
    soup = BeautifulSoup(html, "html.parser")

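    # The average rating lives in an aria-label attribute, not in visible text.
    # .get() returns None instead of raising if the attribute is absent.
    rating_el = soup.select_one("div.RatingStatistics span.RatingStars")
    average_rating = rating_el.get("aria-label") if rating_el else None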

    reviews = []
    for card in soup.select("div.ReviewsList article.ReviewCard"):
        reviews.append({
            "user": text_of(card, 'div[data-testid="name"]'),
            "review": text_of(card, "section.ReviewText span.Formatted"),
        })

    return {
        "title": text_of(soup, 'h1.H1Title a[data-testid="title"]'),
        "author": text_of(soup, "span.ContributorLink__name"),
        "average_rating": average_rating,
        "ratings_count": text_of(soup, 'span[data-testid="ratingsCount"]'),
        "reviews": reviews,
    }

The text_of helper does two useful things at once: it queries a single element within a given node and returns None when the element is missing, instead of throwing on a .get_text() call against nothing. Passing the node explicitly matters here, because each review card is its own scope and you want the reviewer name and review text read from that card, not from the first match on the whole page. The average rating is read from the aria-label attribute rather than visible text, so it is handled separately.
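
Because the rating comes from an aria-label and the count from display text, both arrive as strings such as "Rating 3.93 out of 5" and "5,432,109 ratings". If you need numbers for analysis, a small normalization step along these lines may help. The two helper names are hypothetical, and the patterns assume the string formats shown in this guide's sample output; abbreviated counts (for example with a "k" or "m" suffix) are not handled.

```python
import re


def parse_average_rating(label):
    """Pull the first decimal number out of a label such as 'Rating 3.93 out of 5'."""
    if not label:
        return None
    match = re.search(r"\d+(?:\.\d+)?", label)
    return float(match.group()) if match else None


def parse_ratings_count(text):
    """Turn a display string such as '5,432,109 ratings' into an integer."""
    if not text:
        return None
    match = re.search(r"\d[\d,]*", text)
    return int(match.group().replace(",", "")) if match else None
```

Both helpers return None for missing or unparseable input, which keeps them consistent with the way text_of treats absent elements.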

Selectors drift

Goodreads class names (the RatingStars and ReviewCard markers, the data-testid attributes, and the section wrappers) change without notice. Treat the selectors above as a starting template, not a contract. When a field comes back as None, re-inspect the live page in your browser's dev tools and update the selector. Periodic selector maintenance is normal for any production scraper, not a sign something is broken.
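
To notice selector drift early, it can help to check each parsed record for empty fields before saving it. The following is a minimal sketch that assumes the record shape returned by scrape_book; missing_fields is a hypothetical helper name.

```python
def missing_fields(record, required=("title", "author", "average_rating", "ratings_count")):
    """Return the names of required fields that are None or empty.

    An empty review list is reported as well, since it usually means the
    review card selector no longer matches.
    """
    missing = [name for name in required if not record.get(name)]
    if not record.get("reviews"):
        missing.append("reviews")
    return missing
```

Logging the returned list for every book gives you a quick signal about which selector to re-inspect first.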

Step 3: Load more reviews

The first render shows only the top slice of reviews. Goodreads reveals the rest behind a "Show more reviews" button rather than a numbered pager, so to reach deeper review text you need the page to click that button before it is captured. The Crawling API exposes a css_click_selector option that clicks a matching element during the render, which lets you pull a larger review set in the same request.

python
# Reuses the `api` client created in Step 1.
def crawl_with_more_reviews(page_url):
    options = {
        "ajax_wait": "true",
        "page_wait": 5000,
        "css_click_selector": 'button:has(span[data-testid="loadMore"])',
    }
    response = api.get(page_url, options)
    if response["status_code"] == 200:
        return response["body"].decode("utf-8")
    print(f"Request failed: {response['status_code']}")
    return None

The selector targets the button that wraps the load-more control. Clicking it once expands the visible review list before the HTML is captured, so the same scrape_book parser then sees more cards without any change. One click reveals one more batch of reviews; if those cards are missing from the captured HTML, raise page_wait to give the expanded list time to render. For background on why interaction-driven content behaves this way, see our guide on how to crawl JavaScript websites.

Step 4: Put it together

Now wire the fetch and the parse into one runnable script. Fetch the rendered HTML with the load-more click, hand it to the parser, and write the structured record to a JSON file.

python
import json
from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_JS_TOKEN"})

def crawl(page_url):
    options = {
        "ajax_wait": "true",
        "page_wait": 5000,
        "css_click_selector": 'button:has(span[data-testid="loadMore"])',
    }
    response = api.get(page_url, options)
    if response["status_code"] == 200:
        return response["body"].decode("utf-8")
    print(f"Request failed: {response['status_code']}")
    return None

def text_of(node, selector):
    el = node.select_one(selector)
    return el.get_text(strip=True) if el else None

def scrape_book(html):
    soup = BeautifulSoup(html, "html.parser")
    # Read the rating from aria-label; .get() avoids a KeyError if it is absent.
    rating_el = soup.select_one("div.RatingStatistics span.RatingStars")
    average_rating = rating_el.get("aria-label") if rating_el else None

    reviews = []
    for card in soup.select("div.ReviewsList article.ReviewCard"):
        reviews.append({
            "user": text_of(card, 'div[data-testid="name"]'),
            "review": text_of(card, "section.ReviewText span.Formatted"),
        })

    return {
        "title": text_of(soup, 'h1.H1Title a[data-testid="title"]'),
        "author": text_of(soup, "span.ContributorLink__name"),
        "average_rating": average_rating,
        "ratings_count": text_of(soup, 'span[data-testid="ratingsCount"]'),
        "reviews": reviews,
    }

def main():
    page_url = "https://www.goodreads.com/book/show/4671.The_Great_Gatsby"
    html = crawl(page_url)
    if not html:
        return
    data = scrape_book(html)
    with open("goodreads_book.json", "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    print(json.dumps(data, indent=2, ensure_ascii=False)[:600])

if __name__ == "__main__":
    main()

What the output looks like

Run the full script with python scraper.py and you get a clean structured record for the book, ready to write to JSON, CSV, or a database.

json
{
  "title": "The Great Gatsby",
  "author": "F. Scott Fitzgerald",
  "average_rating": "Rating 3.93 out of 5",
  "ratings_count": "5,432,109 ratings",
  "reviews": [
    {
      "user": "Alex",
      "review": "Charms you with some of the most elegant English prose ever published."
    },
    {
      "user": "Inge",
      "review": "There was one thing I really liked about The Great Gatsby. It was short."
    }
  ]
}
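
If CSV suits your pipeline better than nested JSON, the record can be flattened to one row per review. This is a sketch using only the standard library; the function names and the column order are choices made for this example.

```python
import csv
import io


def review_rows(record):
    """Yield one flat row per review, repeating the book-level fields."""
    for review in record.get("reviews", []):
        yield {
            "title": record.get("title"),
            "author": record.get("author"),
            "user": review.get("user"),
            "review": review.get("review"),
        }


def write_reviews_csv(records, handle):
    """Write every review from a list of book records to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=["title", "author", "user", "review"])
    writer.writeheader()
    for record in records:
        writer.writerows(review_rows(record))


def reviews_to_csv_text(records):
    """Convenience wrapper that returns the CSV as a string."""
    buffer = io.StringIO()
    write_reviews_csv(records, buffer)
    return buffer.getvalue()
```

When writing to a real file, open it with newline="" and encoding="utf-8" so the csv module controls line endings and non-ASCII review text survives.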

Scaling to many books

One book is a demo; a real job runs over a list of titles. The shape stays the same: keep a list of book URLs, fetch each through the Crawling API, parse it with the same function, and collect the rows. Because every book page shares the same structure, the parser you already wrote works across all of them without changes.

python
import json
import time

# crawl() and scrape_book() are the functions defined in the full script above.

books = [
    "https://www.goodreads.com/book/show/4671.The_Great_Gatsby",
    "https://www.goodreads.com/book/show/5470.1984",
]

results = []
for url in books:
    html = crawl(url)
    if html:
        results.append(scrape_book(html))
    time.sleep(2)

with open("goodreads_books.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)

The time.sleep(2) between requests paces the loop so you are not firing book pages back to back. To gather book URLs at scale you can scrape Goodreads public list and shelf pages with the same fetch-then-parse pattern, collecting the book links and then visiting each one. Just keep the volume reasonable and respect the rate limits covered below.
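
Collecting book links from a rendered list or shelf page could look roughly like the sketch below. It assumes those pages link to books with the same /book/show/ path seen in the URLs above, which you should confirm against the live markup. It uses only the standard library so it runs on its own; the same filter works just as well with BeautifulSoup. BookLinkCollector and extract_book_links are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class BookLinkCollector(HTMLParser):
    """Collect unique absolute URLs for links whose path starts with /book/show/."""

    def __init__(self, base_url="https://www.goodreads.com"):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        parts = urlsplit(urljoin(self.base_url, href))
        # Drop query strings and fragments so the same book is not queued twice.
        clean = f"{parts.scheme}://{parts.netloc}{parts.path}"
        if parts.path.startswith("/book/show/") and clean not in self.links:
            self.links.append(clean)


def extract_book_links(html, base_url="https://www.goodreads.com"):
    """Return book page URLs found in the HTML, in first-seen order."""
    collector = BookLinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

The returned list can be fed straight into the books loop shown above.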

Staying unblocked

Even with rendering handled, Goodreads watches for scraper-shaped traffic. A few habits keep a run healthy, and they apply to any large public target.

  • Pace your requests. Hammering book pages in a tight loop is the fastest way to get throttled. Spread requests out and vary your targets instead of crawling one path at full speed.
  • Lean on rotation. A pool of residential IPs spreads requests across many real-user addresses so no single one trips a rate limit. The Crawling API handles this for you; if you roll your own stack, this is the part to get right.
  • Read the status codes. A run that starts returning challenges or errors is telling you the current rate or IP tier is no longer enough. Treat that as signal to back off, not noise to ignore.
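
Backing off when requests start failing can be sketched as a retry loop with exponentially growing pauses. The helper below is hypothetical and works with any fetch callable that returns HTML on success and None on failure, which is how crawl() in this guide behaves. The attempt count and delays are illustrative defaults, not recommended values.

```python
import time


def fetch_with_backoff(fetch, page_url, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Retry fetch(page_url) with exponentially growing pauses between failures.

    `sleep` is injectable so the pacing can be tested without real waiting.
    Returns the HTML from the first successful attempt, or None if all fail.
    """
    for attempt in range(max_attempts):
        html = fetch(page_url)
        if html:
            return html
        if attempt < max_attempts - 1:
            # Wait base_delay, then 2x, then 4x, ... before the next try.
            sleep(base_delay * (2 ** attempt))
    return None
```

In the multi-book loop you could call fetch_with_backoff(crawl, url) in place of crawl(url), and stop the run entirely if several books in a row return None.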

For the broader playbook, see how to scrape websites without getting blocked and the deeper dive on how to bypass captchas while web scraping. If you would rather route your own traffic through a rotating pool instead of using the managed API, the Smart AI Proxy (also called the AI Proxy) gives you the same residential IP rotation as a drop-in proxy endpoint.

Is scraping Goodreads legal?

Whether scraping Goodreads is allowed depends on the Goodreads and Amazon terms of service, your jurisdiction, and what you do with the data. Goodreads is an Amazon-owned property, and its terms restrict automated access, so scraping can run against those terms regardless of how careful your tooling is. None of the code here changes that; it just makes the technical part work. Read the Goodreads Terms of Service and its robots.txt, and treat both as the boundary for what you collect.
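
Checking a URL against robots.txt rules can be automated with Python's standard urllib.robotparser module. The sketch below parses rules from a string so it can run offline; allowed_by_robots is a hypothetical helper, and in practice you would pass it the text of the live robots.txt file. A passing check covers robots.txt only and says nothing about the terms of service.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration only; these are NOT Goodreads's actual rules.
SAMPLE_RULES = """User-agent: *
Disallow: /private/
"""
```

Calling allowed_by_robots(SAMPLE_RULES, url) before each fetch lets the loop skip any path the rules disallow.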

A few lines worth holding to. Collect only public book and review data: title, author, average rating, ratings count, and the review text that anyone can see without an account. Respect Goodreads's stated rate expectations and keep your request volume low enough that you are not straining its servers. Avoid reviewer personal data beyond what is publicly displayed on the page, and do not build profiles of identifiable individuals from it. If you plan to reuse the data commercially, get permission or a licensed source rather than assuming silence is consent.

One practical note specific to Goodreads: the official Goodreads API is effectively deprecated and closed to new keys, so there is no live first-party feed to fall back on the way some platforms offer. That leaves two realistic options for public data: scraping public book pages as shown here, or sourcing the data through a licensed provider. This guide is deliberately scoped to public book-page content because that is the line that keeps the work defensible. It does not cover anything behind a login, a user's private shelves or account data, reviewer personal data beyond the public display, or any attempt to bypass authentication. If your project needs more than public book and review data, a licensed data source is the correct path, not a cleverer scraper.

Recap

Key takeaways

  • Goodreads is client-side rendered. A plain fetch returns a thin shell, so you must render the page before you parse the ratings and reviews.
  • You need rendering and a trusted IP together. The Crawling API with a JS token does both in one call; ajax_wait and page_wait control how long it waits for content.
  • Reviews load behind a button. Use css_click_selector to expand the review list during the render so the same parser sees more cards.
  • BeautifulSoup does the extraction. Map title, author, average rating, ratings count, and review text to current selectors, and expect those selectors to drift.
  • Stay on public data. Respect the Goodreads and Amazon ToS and robots.txt, prefer a licensed source for bulk or commercial use, and never touch accounts, private shelves, or reviewer personal data beyond what the page publicly displays.

Frequently Asked Questions (FAQs)

Why does a plain fetch return no reviews from Goodreads?

Because Goodreads renders its rating and review content client-side with JavaScript and loads the review list asynchronously. The initial HTML is a shell that only fills in after the page's scripts run in a browser, so a raw HTTP request returns status 200 with the review fields blank. To get real data you have to render the page first, which is what the Crawling API's JS token handles for you.

Do I need the normal token or the JS token for Goodreads?

The JS token. The normal token fetches static HTML, which on Goodreads is the same thin shell a plain fetch returns. The JS token renders the page in a real browser before handing back the HTML, so the ratings and review text are present when BeautifulSoup parses them.

How do I load more than the first few reviews?

Goodreads reveals additional reviews behind a "Show more reviews" button rather than a numbered pager. Pass a css_click_selector option to the Crawling API that targets that button, and it gets clicked during the render so the captured HTML includes the expanded list. Raise page_wait if you need the newly revealed reviews to finish rendering before capture.

My selectors return None. What changed?

Almost certainly Goodreads's markup. Its RatingStars and ReviewCard classes, data-testid attributes, and section wrappers change without notice, so selectors that worked last month can break. Re-inspect a live book page in your browser's dev tools and update the selectors. Periodic selector maintenance is normal for any production scraper.

Can I use the official Goodreads API instead?

In practice, no. The official Goodreads API is effectively deprecated and closed to new keys, so there is no live first-party feed to rely on. For public data the realistic options are scraping public book pages with the approach in this guide, or sourcing the data through a licensed provider. Either way, respect the terms of service, robots.txt, and rate limits.

How do I avoid getting blocked while scraping Goodreads?

Keep your per-IP request rate low, vary your targets instead of looping one path, and route through rotating residential IPs so no single address trips a rate limit. The Crawling API manages rotation and a trusted IP pool for you; if you build your own stack, that is the part to invest in. Watch the status codes and back off when you start seeing challenges.
