Yelp hosts hundreds of millions of crowd-sourced reviews of local businesses, and that public feedback is one of the richest signals on the open web for how a restaurant, shop, or service is actually perceived. Analysts read it to track sentiment over time, operators read it to find recurring complaints, and researchers read it to compare a whole category of businesses at once. The useful part is right there on each business page: a star rating, the review text, and the date it was posted.
This guide shows you how to crawl and scrape Yelp reviews with JavaScript and Node.js using cheerio. You build a small, runnable scraper that fetches a public Yelp business review page through the Crawling API, parses each review's rating, text, and date, handles review pagination, and exports the result as JSON and CSV. The whole walkthrough stays scoped to public review data and is built for aggregate analysis, not for profiling individual reviewers. Read the legality section near the end before you point this at any real volume.
What you will build
A Node.js script that takes a public Yelp business URL, retrieves the rendered HTML through the Crawling API, and extracts a structured record for every review on the page. We use a restaurant page as the running example and pull these fields per review:
- Rating: the star rating the reviewer left, for example "4 out of 5".
- Text: the public body of the review, the part you actually analyze.
- Date: the date the review was posted, for sorting and trend work.
Notice what is not on that list. We deliberately do not collect or key on the reviewer's name, profile link, photo, or location. Those are personal data, and the whole point of this tutorial is aggregate analysis: sentiment, recurring themes, rating distribution over time. Treat a review as a rating, a block of text, and a date, never as a dossier on a person.
Why a plain request fails on Yelp
If you request a Yelp business URL with a bare HTTP client, you rarely get the reviews back. Two things work against you. First, Yelp renders the review list in the browser with JavaScript, so the initial HTML is a near-empty shell until the page's scripts run. Second, Yelp flags automated traffic aggressively: datacenter IPs and request patterns that do not look like a real browser get challenged with a CAPTCHA, rate-limited, or blocked before they reach the rendered reviews.
So a working Yelp scraper needs two things in one request: a browser that actually renders the page, and an IP the platform reads as a real visitor. You can assemble that yourself with a headless browser plus a pool of rotating residential proxies, but stitching those together and keeping them healthy is most of the work. The Crawling API folds both into a single call: you send it the URL, it renders the page behind a trusted IP, and it returns finished HTML for you to parse with cheerio. Yelp is JS-rendered, so we request it with rendering turned on.
Everything below is structured so the output is useful in bulk and useless as a profile. We pull rating, text, and date, and we leave reviewer identity on the page. If your analysis genuinely needs to weight by reviewer, hash or pseudonymize at collection time so the stored record never ties text back to a named person.
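If you do need to group reviews by author, a keyed hash is one way to pseudonymize at collection time. This is a minimal sketch, not part of the scraper in this guide: the helper name pseudonymizeReviewer, the PSEUDONYM_KEY environment variable, and the example identifier are all assumptions for illustration. It uses Node's built-in crypto module.

```javascript
const { createHmac } = require('crypto');

// Hypothetical helper: turns a reviewer identifier into a stable pseudonym.
// `secret` is a key you store outside the dataset, so the hash cannot be
// recomputed from the published records alone.
function pseudonymizeReviewer(reviewerId, secret) {
  if (!secret) throw new Error('A secret key is required for pseudonymization');
  return createHmac('sha256', secret)
    .update(String(reviewerId))
    .digest('hex')
    .slice(0, 16); // a short prefix is enough to group reviews by author
}

// Assumption: the key comes from an environment variable named PSEUDONYM_KEY.
const key = process.env.PSEUDONYM_KEY || 'dev-only-key';

const record = {
  reviewer: pseudonymizeReviewer('example-profile-id', key),
  rating: 4,
  date: '3/14/2024',
  text: 'Example review text.',
};

console.log(record.reviewer.length); // 16
```

The same identifier always maps to the same pseudonym under one key, so you can still count reviews per author without storing a name. Pseudonymized data is generally still treated as personal data under the GDPR, so this reduces exposure rather than removing the obligation.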
Prerequisites
You need a few things in place before writing any code. None of them take long.
Basic JavaScript and Node.js. You should be comfortable writing and running a Node script and installing packages with npm. If you are new to Node, the guide to building a web scraper with Node.js covers the basics this tutorial assumes.
Node.js 16 or later. Confirm your version with node --version. If you do not have it, install it from the Node.js website or through a version manager like nvm.
A Crawlbase account and token. Sign up, open your dashboard, and copy your JavaScript request token. The free tier gives you 1,000 requests with no card. Because Yelp needs rendering, use the JavaScript token rather than the normal one. Treat the token like a password: it authenticates your requests, so keep it out of version control.
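One common way to keep the token out of version control is to read it from an environment variable instead of hard-coding it. The sketch below assumes a variable named CRAWLBASE_TOKEN and a helper named readToken; both names are illustrative choices, not something the Crawlbase client requires.

```javascript
// Assumption: the token is exported as CRAWLBASE_TOKEN before the script runs.
// The `env` parameter defaults to process.env and exists only to make the
// helper easy to test.
function readToken(env = process.env) {
  const token = (env.CRAWLBASE_TOKEN || '').trim();
  if (!token) {
    throw new Error('Set CRAWLBASE_TOKEN before running the scraper');
  }
  return token;
}

// Usage in the scraper (left as a comment so this block runs on its own):
// const api = new CrawlingAPI({ token: readToken() });
```

You would then start the script with something like CRAWLBASE_TOKEN=your_token node yelp-scraper.js, and the token never appears in a committed file.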
Set up the project
Create a project folder, initialize it, and install the two libraries the scraper needs.
    node --version
    mkdir yelp-reviews && cd yelp-reviews
    npm init -y
    npm install crawlbase cheerio
Two dependencies do the work: crawlbase is the official Node client for the Crawling API, and cheerio parses the returned HTML with a jQuery-style API so you can pull out individual fields by CSS selector. The legacy version of this tutorial used request and cheerio; we keep cheerio and the same selector approach, and swap the raw HTTP call for the official client so rendering and retries are handled for you. Create a file named yelp-scraper.js in this folder and add the code from the steps below.
Step 1: Fetch the rendered review page
Start by getting the finished page. Import the CrawlingAPI class, initialize it with your JavaScript token, and request the business URL. Yelp renders reviews client-side, so we pass pageWait to give the scripts a moment to populate the list before the API captures the HTML. Checking the status code before you parse keeps failures loud instead of silent.
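    const { CrawlingAPI } = require('crawlbase');

    const api = new CrawlingAPI({ token: 'YOUR_CRAWLBASE_TOKEN' });
    const yelpPageURL = 'https://www.yelp.com/biz/sushi-yasaka-new-york';

    api
      .get(yelpPageURL, { pageWait: 3000 })
      .then((response) => {
        if (response.statusCode === 200) {
          // Print the first 500 characters to confirm real markup came back
          console.log(response.body.slice(0, 500));
        } else {
          // Fail loudly so a block or error page is never parsed by mistake
          console.error(`Request failed with status ${response.statusCode}`);
        }
      })
      .catch((error) => console.error('API request error:', error));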
Run the script with node yelp-scraper.js and you should see real Yelp markup at the top of the body, not a stripped-down shell. That confirms rendering works before you write a single selector. The example URL points at one public restaurant page; swap in any other public business page and the same flow applies.
That first request just returned a fully rendered Yelp page without a headless browser or a proxy on your side. The Crawling API runs the page in a real browser, waits for the review list to populate, rotates through residential IPs server-side, and handles the CAPTCHAs Yelp throws at scrapers, so you get finished HTML from one call. Point it at a single public business page on the free tier first.
Step 2: Parse each review with cheerio
With rendered HTML in hand, load it into cheerio and walk the review blocks. Yelp lays each review out in a repeating container, so you select every review, then read the rating, text, and date from inside it. The legacy selectors for this page were .review.review--with-sidebar for the review container and .review-content p for the body text; we keep those and add rating and date. Reading each field defensively keeps one missing value from crashing the run.
    const cheerio = require('cheerio');

    function parseReviews(html) {
      const $ = cheerio.load(html);
      const reviews = [];

      // Legacy container selector, kept faithfully
      const blocks = $('.review.review--with-sidebar');

      blocks.each((index, element) => {
        const block = $(element);

        // Public review text
        const text = block.find('.review-content p').text().trim();

        // Rating: read it from the star widget's aria/title text
        const ratingLabel =
          block.find('[role="img"][aria-label*="star"]').attr('aria-label') ||
          block.find('.i-stars').attr('title') ||
          '';
        const ratingMatch = ratingLabel.match(/([\d.]+)\s*star/i);
        const rating = ratingMatch ? parseFloat(ratingMatch[1]) : null;

        // Date the review was posted
        const date = block
          .find('.review-content .rating-qualifier, span[class*="date"]')
          .first()
          .text()
          .trim();

        if (text) {
          reviews.push({ rating, date, text });
        }
      });

      return reviews;
    }
A few details keep this faithful to the page. The review body comes from .review-content p, exactly as in the original tutorial. The rating is not plain text on Yelp; it lives in a star-widget element whose aria-label or title reads something like "4 star rating", so a small regex pulls the number out. The date sits in the same review-content block. Each record is just { rating, date, text }, with no reviewer name, by design. If a review has no body text we skip it rather than store an empty row.
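To see the rating regex on its own, here is the same pattern pulled into a standalone function. The name parseRating and the sample labels are illustrative; the labels a live page returns may be worded differently.

```javascript
// Same pattern as in parseReviews: capture the number that precedes "star".
function parseRating(label) {
  const match = String(label || '').match(/([\d.]+)\s*star/i);
  return match ? parseFloat(match[1]) : null;
}

console.log(parseRating('4 star rating'));   // 4
console.log(parseRating('4.5 Star Rating')); // 4.5
console.log(parseRating(''));                // null
```

Returning null rather than 0 for a missing label keeps "no rating found" distinct from a real value when you later compute averages.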
Yelp's class names change without notice, and the exact rating and date selectors above may need updating on a live page. Treat them as a starting template, not a contract. When a field comes back empty, re-inspect the live page in your browser's dev tools and update the selector. If you want a refresher on building robust selectors, see the guide to crawling JavaScript websites. Periodic selector maintenance is normal for any production scraper.
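One way to notice selector drift early is to measure how often each field is actually filled in the parsed records. The helper below is a hypothetical addition, not part of the original scraper: fieldCoverage returns, per field, the fraction of records with a non-empty value.

```javascript
// Hypothetical helper: fraction of records where each field is non-empty.
function fieldCoverage(reviews, fields = ['rating', 'date', 'text']) {
  const coverage = {};
  for (const field of fields) {
    const filled = reviews.filter(
      (r) => r[field] !== null && r[field] !== undefined && r[field] !== ''
    ).length;
    coverage[field] = reviews.length === 0 ? 0 : filled / reviews.length;
  }
  return coverage;
}

const sample = [
  { rating: 5, date: '3/14/2024', text: 'Fresh fish.' },
  { rating: null, date: '', text: 'Long wait.' },
];

console.log(fieldCoverage(sample)); // { rating: 0.5, date: 0.5, text: 1 }
```

A sudden drop in one field's coverage between runs is a reasonable cue to re-inspect that selector before trusting the data.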
Step 3: Handle review pagination
A single page shows only the first batch of reviews. Yelp paginates the rest with a start query parameter that advances in steps (commonly 10 reviews per page), so ?start=10 is the second page, ?start=20 the third, and so on. To collect a business's reviews you walk those offsets until a page returns no review blocks, fetching and parsing each one with the functions you already wrote.
    async function crawlPage(pageUrl) {
      const response = await api.get(pageUrl, { pageWait: 3000 });
      if (response.statusCode === 200) return response.body;
      console.error(`Request failed: ${response.statusCode}`);
      return null;
    }

    async function crawlAllReviews(businessUrl, maxPages = 5) {
      const all = [];

      for (let page = 0; page < maxPages; page++) {
        const start = page * 10;
        const url = `${businessUrl}?start=${start}`;

        const html = await crawlPage(url);
        if (!html) break;

        const pageReviews = parseReviews(html);
        if (pageReviews.length === 0) break; // no more reviews

        all.push(...pageReviews);

        // Pace requests so we stay a polite visitor
        await new Promise((r) => setTimeout(r, 2000));
      }

      return all;
    }
The loop stops on the first empty page, on a non-200 response, or when it hits maxPages, so you never spin forever. The two-second pause between fetches keeps your request rate modest, which matters more on a hard target like Yelp than raw speed. Keep maxPages low while you develop and only raise it once the parse output looks right.
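Building the paginated URL by string concatenation works for a bare business URL, but it breaks if the URL already carries a query string. A small helper based on the standard URL class sidesteps that. This is a sketch; the name buildPageUrl, the pageSize default of 10, and the utm_source parameter in the example are assumptions for illustration.

```javascript
// Hypothetical helper: sets the `start` offset for a zero-based page index,
// preserving any query parameters the URL already has.
function buildPageUrl(businessUrl, page, pageSize = 10) {
  const url = new URL(businessUrl);
  url.searchParams.set('start', String(page * pageSize));
  return url.toString();
}

const base = 'https://www.yelp.com/biz/sushi-yasaka-new-york';

console.log(buildPageUrl(base, 1));
// https://www.yelp.com/biz/sushi-yasaka-new-york?start=10

console.log(buildPageUrl(`${base}?utm_source=example`, 2));
// https://www.yelp.com/biz/sushi-yasaka-new-york?utm_source=example&start=20
```

Because searchParams.set replaces an existing start value instead of appending a second one, the helper is also safe to call on a URL that was already paginated.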
Step 4: Assemble the full script with JSON and CSV export
Now wire fetch, parse, and pagination into one runnable script, then write the aggregate records to disk as both JSON and CSV. The CSV holds only rating, date, and text, the three fields that drive sentiment and theme analysis.
    const fs = require('fs');
    const { CrawlingAPI } = require('crawlbase');
    const cheerio = require('cheerio');

    const api = new CrawlingAPI({ token: 'YOUR_CRAWLBASE_TOKEN' });

    // parseReviews, crawlPage, and crawlAllReviews from Steps 2 and 3 go here

    function toCsv(rows) {
      const headers = ['rating', 'date', 'text'];
      // Quote every field and double any embedded quotes
      const escape = (value) => `"${String(value ?? '').replace(/"/g, '""')}"`;
      const lines = [headers.join(',')];
      for (const row of rows) {
        lines.push(headers.map((h) => escape(row[h])).join(','));
      }
      return lines.join('\n');
    }

    async function main() {
      const businessUrl = 'https://www.yelp.com/biz/sushi-yasaka-new-york';
      const reviews = await crawlAllReviews(businessUrl, 5);

      if (reviews.length === 0) {
        console.log('No reviews parsed; check your selectors.');
        return;
      }

      fs.writeFileSync('reviews.json', JSON.stringify(reviews, null, 2));
      fs.writeFileSync('reviews.csv', toCsv(reviews));
      console.log(`Saved ${reviews.length} reviews to JSON and CSV`);
    }

    // Report any uncaught failure and exit with a non-zero code
    main().catch((error) => {
      console.error('Scrape failed:', error);
      process.exitCode = 1;
    });
Paste the parseReviews, crawlPage, and crawlAllReviews functions from the earlier steps into the same file so main can call them, and remove the Step 1 test request. Keep only one copy of each require line and of the api instance: declaring the same const twice in one file is a syntax error. Run it with node yelp-scraper.js and you get two files: reviews.json with the full structured records and reviews.csv ready to open in a spreadsheet or feed into an analysis pipeline. The toCsv helper quotes every field and doubles any embedded quotes, which matters here because review text is long and frequently contains commas and line breaks.
What the output looks like
The JSON file holds one object per review, each with the rating, date, and text. No reviewer identity is present, which is exactly what you want for aggregate work.
    [
      {
        "rating": 5,
        "date": "3/14/2024",
        "text": "Consistently fresh fish and quick service at lunch. The omakase set is great value for the quality."
      },
      {
        "rating": 3,
        "date": "2/2/2024",
        "text": "Good food but the wait was long on a weekend. Worth it if you can get there early."
      }
    ]
The CSV mirrors the same rows with a header line, so it drops straight into Excel, Google Sheets, or any data pipeline that reads delimited files.
    rating,date,text
    "5","3/14/2024","Consistently fresh fish and quick service at lunch. The omakase set is great value for the quality."
    "3","2/2/2024","Good food but the wait was long on a weekend. Worth it if you can get there early."
Turn reviews into aggregate insight
The point of collecting reviews is what you learn in bulk, not any single entry. Once the data is in JSON or CSV you can compute the rating distribution, track average rating by month using the date field, and run the review text through sentiment scoring or theme clustering to surface recurring topics such as wait times, pricing, or service. None of that needs a reviewer's name, and keeping names out of the pipeline is both cleaner and safer.
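As a starting point, the rating distribution and the monthly average can be computed with plain JavaScript. This is a minimal sketch under one loud assumption: dates look like "3/14/2024" (month/day/year), as in the sample output above. A live page may format dates differently, in which case monthKey needs adjusting. The function names are illustrative.

```javascript
// Count how many reviews carry each star value.
function ratingDistribution(reviews) {
  const counts = {};
  for (const { rating } of reviews) {
    if (typeof rating !== 'number') continue; // skip missing ratings
    counts[rating] = (counts[rating] || 0) + 1;
  }
  return counts;
}

// Assumption: dates are month/day/year, e.g. "3/14/2024" -> "2024-03".
function monthKey(dateText) {
  const match = /^(\d{1,2})\/\d{1,2}\/(\d{4})$/.exec(String(dateText).trim());
  if (!match) return null;
  return `${match[2]}-${match[1].padStart(2, '0')}`;
}

// Average rating per calendar month, skipping rows with no usable date/rating.
function averageRatingByMonth(reviews) {
  const buckets = {};
  for (const { rating, date } of reviews) {
    const key = monthKey(date);
    if (key === null || typeof rating !== 'number') continue;
    if (!buckets[key]) buckets[key] = { sum: 0, count: 0 };
    buckets[key].sum += rating;
    buckets[key].count += 1;
  }
  const averages = {};
  for (const [key, { sum, count }] of Object.entries(buckets)) {
    averages[key] = sum / count;
  }
  return averages;
}

const reviews = [
  { rating: 5, date: '3/14/2024', text: 'Fresh fish.' },
  { rating: 3, date: '2/2/2024', text: 'Long wait.' },
  { rating: 4, date: '3/20/2024', text: 'Solid lunch set.' },
];

console.log(ratingDistribution(reviews));   // one review each at 3, 4, and 5 stars
console.log(averageRatingByMonth(reviews)); // 2024-03 averages 4.5, 2024-02 averages 3
```

Rows with an unparseable date are skipped rather than guessed, so a change in Yelp's date format shows up as missing months instead of wrong ones.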
If you plan to feed the text into a model, normalize it first: strip boilerplate, collapse whitespace, and decide how to handle non-English reviews. The guide to structuring and cleaning scraped data for AI and ML walks through that step. For the broader pattern of pulling and comparing feedback across sources, the guide on scraping customer reviews covers the same aggregate-first approach for other review platforms.
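The whitespace part of that normalization can be sketched in a few lines. The helper name normalizeText is an assumption, and this deliberately covers only Unicode normalization and whitespace collapsing; what counts as boilerplate and how to treat non-English text depend on your pipeline.

```javascript
// Hypothetical helper: tidy review text before scoring or clustering.
function normalizeText(text) {
  return String(text || '')
    .normalize('NFC')     // unify composed and decomposed Unicode forms
    .replace(/\s+/g, ' ') // collapse runs of spaces, tabs, and newlines
    .trim();
}

console.log(normalizeText('  Good food,\n\n but the wait\twas long.  '));
// Good food, but the wait was long.
```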
Staying unblocked
Even with rendering handled, Yelp watches for scraper-shaped traffic. A few habits keep a run healthy, and they apply to any hard commercial target.
- Pace your requests. Keep the delay between paginated fetches, as the script does. Spreading requests out is the single biggest factor in staying under Yelp's rate limits.
- Lean on rotation. A pool of residential IPs spreads requests across many real-user addresses so no single one trips a limit or a CAPTCHA. The Crawling API handles this for you; if you roll your own stack, this is the part to get right.
- Read the status codes. A run that starts returning challenges or non-200 responses is telling you the current rate or IP tier is no longer enough. Treat that as signal to back off, not noise to ignore.
For the broader playbook, see how to scrape websites without getting blocked.
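Backing off can be as simple as doubling the wait after each failed attempt. The sketch below is one possible shape, not a Crawlbase feature: backoffDelay and withBackoff are hypothetical helpers, and the 2-second base and 60-second cap are arbitrary starting values.

```javascript
// Hypothetical helper: delay doubles with each failed attempt, capped at maxMs.
function backoffDelay(attempt, baseMs = 2000, maxMs = 60000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// fetchPage is any async function that resolves to HTML, or null on failure.
async function withBackoff(fetchPage, maxAttempts = 4, baseMs = 2000) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const html = await fetchPage();
    if (html) return html;

    // Wait before the next try, but not after the final attempt
    if (attempt < maxAttempts - 1) {
      const waitMs = backoffDelay(attempt, baseMs);
      console.error(`Attempt ${attempt + 1} failed; waiting ${waitMs} ms`);
      await new Promise((resolve) => setTimeout(resolve, waitMs));
    }
  }
  return null; // give up and let the caller decide what to do
}

console.log([0, 1, 2, 3].map((n) => backoffDelay(n))); // [ 2000, 4000, 8000, 16000 ]
```

Since crawlPage in this guide already returns null on a non-200 response, it fits that contract directly: withBackoff(() => crawlPage(url)).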
Is it legal to scrape Yelp reviews?
Whether scraping Yelp is allowed depends on Yelp's terms of service, your jurisdiction, and what you do with the data. Yelp's Terms of Service explicitly restrict copying or scraping content from the site, whether by hand or with bots, extensions, or software, so automated collection can run against those terms regardless of how careful your tooling is. None of the code here changes that; it just makes the technical part work. Read Yelp's Terms of Service and its robots.txt, and treat both as the boundary for what you collect. The cleanest path for any real volume or commercial use is the official Yelp Fusion API, which exposes business and review data under clear, sanctioned terms.
Review text and star ratings shown on a business page are public, but the people who wrote them are not anonymous in the eyes of privacy law. Reviewer names, profile links, photos, and locations are personal data. Under the GDPR and the CCPA, processing that data needs a lawful basis, and individuals retain rights over it. That is why this tutorial deliberately collects only rating, text, and date, and why we recommend you do not store, key on, or republish a person's review tied to their identity. Keep your analysis aggregate: sentiment, themes, and rating trends, never a profile of an individual reviewer.
A few lines worth holding to. Collect only public review content, scrape at a slow and respectful rate, and keep your request volume low enough that you are not straining Yelp's servers. Stay out of anything behind a login. Do not redistribute Yelp's copyrighted content, including review media, as if it were your own. If you operate in or collect data about people in a GDPR or CCPA jurisdiction and your use touches personal data, get legal advice before you proceed. When in doubt, the Yelp Fusion API or a data agreement is the correct path, not a cleverer scraper.
Key takeaways
- Yelp renders reviews client-side and blocks hard. A plain request returns an empty shell or a CAPTCHA, so you must render the page behind a trusted IP before you parse it.
- The Crawling API does both in one call. It renders the page with the JavaScript token, waits for the review list, rotates residential IPs, and handles CAPTCHAs, returning finished HTML to parse with cheerio.
- Parse rating, text, and date only. Keep the legacy .review.review--with-sidebar and .review-content p selectors, add the star-widget rating and the date, and walk the start offset for pagination.
- Stay aggregate and privacy-safe. Leave reviewer names off the record; analyze sentiment, themes, and rating trends, and never build a profile of an individual or republish a review tied to a person.
- Prefer the official API for volume. Respect Yelp's ToS and robots.txt, mind GDPR and CCPA when personal data is involved, and use the Yelp Fusion API or a data agreement for commercial or large-scale use.
Frequently Asked Questions (FAQs)
Can I scrape Yelp reviews legally?
It depends on Yelp's Terms of Service, your jurisdiction, and your use. Yelp's terms restrict scraping its content, so automated collection can run against them. Review text and ratings are public, but reviewer identity is personal data covered by laws like the GDPR and CCPA. Keep collection to public review content, stay aggregate, scrape slowly, and prefer the official Yelp Fusion API for any real volume or commercial use.
Why does a plain request return incomplete data from Yelp?
Because Yelp renders its review list client-side with JavaScript and challenges automated traffic with CAPTCHAs. A raw HTTP request from a datacenter IP usually returns an empty shell or a block page rather than the reviews. To get a complete page you have to render it behind a trusted IP and give the scripts time to populate the list, which is what the Crawling API handles for you.
How do I get all reviews instead of just the first page?
Yelp paginates reviews with a start query parameter that advances in steps of about ten, so ?start=10 is page two, ?start=20 page three, and so on. Loop over those offsets, fetch and parse each page, and stop when a page returns no review blocks or you hit your page cap. The crawlAllReviews function in this guide does exactly that, with a pause between fetches.
Should I store reviewer names?
No. Reviewer names, profile links, and photos are personal data, and this tutorial is built for aggregate analysis, not profiling. Store only the rating, text, and date. If your analysis genuinely needs to distinguish reviewers, pseudonymize or hash an identifier at collection time so the stored record never ties text back to a named person, and respect any GDPR or CCPA obligations that apply.
My rating or date comes back empty. What changed?
Almost certainly Yelp's markup. Its class names and the structure of the star widget change without notice, so selectors that worked before can break. Re-inspect a live review block in your browser's dev tools, confirm where the rating aria-label or title and the date now live, and update the selectors. Periodic selector maintenance is normal for any production scraper.
Is there an official way to get Yelp review data?
Yes. The Yelp Fusion API is Yelp's sanctioned program for accessing business information and a limited set of review data under clear terms. For commercial projects, large volumes, or anything where you need guaranteed structure and usage rights, the Fusion API or a direct data agreement is the right tool, not a scraper.
