Yelp is one of the densest sources of public local-business data on the web. Every search result carries a business name, a star rating, a review count, the categories it files under, its neighborhood or address, and a link to its own Yelp page. For local-business research, competitive mapping, or building a regional dataset of service providers, those public listing fields are exactly the structured signal you want, all visible without signing in.
This guide shows you how to scrape Yelp with Python the reliable way. You fetch rendered search-result pages through the Crawling API, parse each business card with BeautifulSoup to pull the name, rating, review count, category, address, and link, then walk the pagination to cover a full result set and export to JSON or CSV. Everything here stays scoped to public business data, and the legality section near the end is not boilerplate, so read it before you point this at real volume.
What you will build
A small Python scraper that takes a search query and a location, retrieves the rendered Yelp search-results page through the Crawling API, and extracts a structured record for every business on the page. The running example is "Italian Restaurants" in "San Francisco, CA", and for each listing we pull these fields:
- Business name: the primary identifier shown on the listing card.
- Rating: the aggregate star rating for the business.
- Review count: how many reviews back that rating.
- Category: the business categories the listing files under.
- Address: the public neighborhood or street area, used for any geographic analysis.
- Link: the URL to the business's own Yelp page.
Why a plain request fails on Yelp
You can point the requests library at a Yelp search URL and, on a good day, get some HTML back. Two problems show up fast. First, Yelp renders much of its search-results content with JavaScript, so the raw HTML a plain request receives is often a shell that does not yet contain the business cards you came for. Second, Yelp watches for scraper-shaped traffic: it rate-limits by IP, serves CAPTCHAs to requests that look automated, and blocks datacenter addresses that fetch pages in a tight, machine-shaped pattern. A single request from your laptop might succeed; a few hundred from the same IP will not.
So a scraper that actually finishes the job needs two things: the page rendered as a real browser would render it, and requests that read as a real visitor coming from a trusted IP. You can build that yourself with a headless browser fleet and a pool of rotating residential proxies, but maintaining that stack is most of the work. The Crawling API folds it into a single call: you send it the URL, it renders the JavaScript and routes the request through residential IPs server-side, handles the anti-bot layer, and returns the finished HTML for you to parse.
Crawlbase offers two token types. The normal token fetches static HTML; the JavaScript (JS) token renders the page in a real browser first. Because Yelp builds its search results client-side, use the JS token here so the business cards are present in the HTML you get back. The normal token is the right pick only for targets that ship their data in the initial response.
Prerequisites
A few things to have in place first. None take long.
Basic Python. You should be comfortable running a script and installing packages with pip. If selectors are new to you, the primer on how to use BeautifulSoup in Python covers the parsing side in depth.
Python 3.8 or later. Confirm with python --version. If you do not have it, install it from python.org or through a distribution like Anaconda.
A Crawlbase account and token. Sign up, open your dashboard, and copy your JavaScript token from the account docs page. The first 1,000 requests are free and no card is required. Treat the token like a password and keep it out of version control.
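One way to keep the token out of version control is to read it from an environment variable instead of pasting it into the script. The sketch below is a minimal illustration; the variable name CRAWLBASE_JS_TOKEN and the load_token helper are assumptions made for this example, not part of the Crawlbase client.

```python
import os

# Assumed variable name for this example; use whatever fits your setup.
TOKEN_ENV_VAR = "CRAWLBASE_JS_TOKEN"


def load_token(env=None):
    """Return the token from the environment, or raise a clear error if unset."""
    env = os.environ if env is None else env
    token = env.get(TOKEN_ENV_VAR, "").strip()
    if not token:
        raise RuntimeError(f"Set {TOKEN_ENV_VAR} before running the scraper")
    return token
```

With this in place, the client can be initialized as CrawlingAPI({"token": load_token()}), and the token itself never has to appear in a committed file.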
Set up the project
Create a virtual environment so dependencies stay isolated, then install the libraries the scraper needs.
    python --version
    python -m venv yelp_env
    source yelp_env/bin/activate
    pip install crawlbase beautifulsoup4 pandas
On Windows, activate the environment with yelp_env\Scripts\activate instead of the source line. Three dependencies do the work: crawlbase is the official client for the Crawling API, beautifulsoup4 parses the returned HTML so you can pull each field by CSS selector, and pandas writes the records out to CSV at the end.
Step 1: Fetch a rendered search page
Start by getting one results page back. Build the search URL from your query and location, import the CrawlingAPI class, initialize it with your token, and request the URL with JavaScript rendering turned on. A Yelp search is driven by two URL parameters: find_desc for the search term (here, a business category) and find_loc for the location; a third, start, sets the result offset and defaults to 0 for the first page. Checking the status before you parse keeps failures loud instead of silent.
    from urllib.parse import urlencode

    from crawlbase import CrawlingAPI

    # Use your JavaScript (JS) token so the page is rendered before it is returned.
    api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})


    def build_url(query, location, start=0):
        """Build a Yelp search URL from a query, a location, and a result offset."""
        base = "https://www.yelp.com/search?"
        params = {"find_desc": query, "find_loc": location, "start": start}
        return base + urlencode(params)


    def crawl(page_url):
        """Fetch the rendered page; return its HTML, or None on failure."""
        response = api.get(page_url, {"ajax_wait": "true", "page_wait": "3000"})
        if response["headers"]["pc_status"] == "200":
            # Decode as UTF-8 so non-ASCII business names survive intact.
            return response["body"].decode("utf-8", errors="replace")
        print(f"Request failed: {response['headers']['pc_status']}")
        return None


    if __name__ == "__main__":
        url = build_url("Italian Restaurants", "San Francisco, CA")
        html = crawl(url)
        print(html[:500] if html else "No HTML returned")
The ajax_wait option tells the Crawling API to wait for the page's AJAX requests to finish, and page_wait adds a fixed pause in milliseconds (3000 here, so three seconds), so the business cards finish loading before the HTML comes back. The status check reads pc_status from the response headers, which is the Crawlbase status for the request, distinct from the upstream HTTP code. Save the script as yelp_scraper.py, run it with python yelp_scraper.py, and you should see real results markup rather than a challenge page or an empty shell. That confirms the fetch path works before you write a single selector.
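Because the Crawlbase status and the upstream HTTP code are separate values, it can help to log both when a request fails. The helper below is a rough sketch that works on a plain headers dictionary; describe_status is a hypothetical name, and the original_status key is an assumption about how the upstream code is exposed, so confirm it against the headers you actually receive.

```python
def describe_status(headers):
    """Summarize the status values found in a Crawling API response's headers."""
    pc_status = headers.get("pc_status")  # Crawlbase's own status for the request
    original_status = headers.get("original_status")  # assumed key for the upstream code
    return {
        # str() tolerates the value arriving as either "200" or 200.
        "ok": str(pc_status) == "200",
        "pc_status": pc_status,
        "original_status": original_status,
    }
```

Calling describe_status(response["headers"]) before parsing gives one place to decide whether the body is worth handing to the parser, and using .get means a missing header shows up as None rather than as a KeyError.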
Step 2: Parse the listings with BeautifulSoup
With a rendered results page in hand, load it into BeautifulSoup and walk the result cards. Each business sits in a card under a predictable container, and within it the name, rating, review count, category, address, and link map to their own selectors. Reading each field defensively, returning None when an element is missing, keeps one absent value from crashing the run.
    from bs4 import BeautifulSoup

    BASE = "https://www.yelp.com"


    def text_of(node):
        # Return the element's text, or None when the element is missing.
        return node.get_text(strip=True) if node else None


    def extract_business(card):
        name = card.select_one('div[class*="businessName"] h3 > span > a')
        rating = card.select_one('div.css-volmcs + div.css-1jq1ouh > span:first-child')
        reviews = card.select_one('div.css-volmcs + div.css-1jq1ouh > span:last-child')
        category = card.select('div[class*="priceCategory"] div > p > span:first-child a')
        address = card.select_one('div[class*="priceCategory"] div > p > span:last-child')
        return {
            "name": text_of(name),
            "rating": text_of(rating),
            "review_count": text_of(reviews),
            "category": ", ".join(c.get_text(strip=True) for c in category) if category else None,
            "address": text_of(address),
            "link": BASE + name["href"] if name and name.get("href") else None,
        }


    def extract_businesses(html):
        soup = BeautifulSoup(html, "html.parser")
        # Organic result cards only; the :not(.ABP) filter skips ad placements.
        cards = soup.select('div[data-testid="serp-ia-card"]:not(.ABP)')
        return [extract_business(card) for card in cards]
The cards are selected with div[data-testid="serp-ia-card"]:not(.ABP), which picks the organic result cards while skipping the ad-placement variants. The businessName anchor carries both the display name and the relative href to that business's Yelp page, so the name and the link come from the same element. Rating and review count are the first and last span inside the div that immediately follows the rating block, and the category and address live inside the price-and-category row. The text_of helper returns None when an element is absent instead of throwing on a .get_text() call against nothing, which keeps extraction resilient when a card is missing a field.
Yelp's class names (the hashed css-* tokens especially) are generated and change without notice, so treat these selectors as a starting template, not a contract. When a field comes back as None across every listing, re-inspect a live results page in your browser's dev tools and update the selector. The data-testid and businessName attribute hooks tend to be more stable than the hashed class names, so prefer them where you can. Periodic selector maintenance is normal for any production scraper.
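A quick way to notice selector drift is to measure how often each field is filled across a page of records. The sketch below assumes records shaped like the dictionaries the parser returns; field_fill_rates and drifted_fields are hypothetical helper names introduced for illustration.

```python
def field_fill_rates(records):
    """Return, per field, the fraction of records where the value is not None."""
    if not records:
        return {}
    total = len(records)
    fields = records[0].keys()
    return {f: sum(r.get(f) is not None for r in records) / total for f in fields}


def drifted_fields(records, threshold=0.0):
    """List fields whose fill rate is at or below the threshold (0.0 means always None)."""
    rates = field_fill_rates(records)
    return sorted(f for f, rate in rates.items() if rate <= threshold)
```

Running drifted_fields on each page and logging a warning when the list is non-empty turns a silent selector break into something you see on the first affected run.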
Step 3: Put it together
Now wire the fetch and the parse into one runnable script for a single page. Build the URL, fetch the rendered HTML, hand it to the parser, and print the structured records as JSON.
    import json
    from urllib.parse import urlencode

    from bs4 import BeautifulSoup
    from crawlbase import CrawlingAPI

    api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})
    BASE = "https://www.yelp.com"


    def build_url(query, location, start=0):
        params = {"find_desc": query, "find_loc": location, "start": start}
        return BASE + "/search?" + urlencode(params)


    def crawl(page_url):
        response = api.get(page_url, {"ajax_wait": "true", "page_wait": "3000"})
        if response["headers"]["pc_status"] == "200":
            # Decode as UTF-8 so non-ASCII business names survive intact.
            return response["body"].decode("utf-8", errors="replace")
        print(f"Request failed: {response['headers']['pc_status']}")
        return None


    def text_of(node):
        return node.get_text(strip=True) if node else None


    def extract_business(card):
        name = card.select_one('div[class*="businessName"] h3 > span > a')
        rating = card.select_one('div.css-volmcs + div.css-1jq1ouh > span:first-child')
        reviews = card.select_one('div.css-volmcs + div.css-1jq1ouh > span:last-child')
        category = card.select('div[class*="priceCategory"] div > p > span:first-child a')
        address = card.select_one('div[class*="priceCategory"] div > p > span:last-child')
        return {
            "name": text_of(name),
            "rating": text_of(rating),
            "review_count": text_of(reviews),
            "category": ", ".join(c.get_text(strip=True) for c in category) if category else None,
            "address": text_of(address),
            "link": BASE + name["href"] if name and name.get("href") else None,
        }


    def extract_businesses(html):
        soup = BeautifulSoup(html, "html.parser")
        cards = soup.select('div[data-testid="serp-ia-card"]:not(.ABP)')
        return [extract_business(card) for card in cards]


    def main():
        url = build_url("Italian Restaurants", "San Francisco, CA")
        html = crawl(url)
        if not html:
            return
        data = extract_businesses(html)
        print(json.dumps(data, indent=2))


    if __name__ == "__main__":
        main()
What the output looks like
Run the full script with python yelp_scraper.py and you get a clean list of structured records, ready to write to JSON, CSV, or a database.
    [
      {
        "name": "Bella Trattoria",
        "rating": "4.3",
        "review_count": "(1.9k reviews)",
        "category": "Italian, Bars, Pasta Shops",
        "address": "Inner Richmond",
        "link": "https://www.yelp.com/biz/bella-trattoria-san-francisco"
      },
      {
        "name": "Sotto Mare",
        "rating": "4.3",
        "review_count": "(5.2k reviews)",
        "category": "Seafood, Italian, Bars",
        "address": "North Beach/Telegraph Hill",
        "link": "https://www.yelp.com/biz/sotto-mare-san-francisco"
      }
    ]
Any field a card does not carry comes back as null, which is expected and exactly why the parser reads each field defensively rather than assuming every key is present. The review count arrives as a display string like "(1.9k reviews)"; if you need a clean integer for analysis, strip the parentheses and expand the k suffix in a later cleanup pass.
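The cleanup pass for the review count could look something like the sketch below. It assumes the display string follows the "(1.9k reviews)" pattern shown above, possibly with plain digits or thousands separators; parse_review_count is a hypothetical helper name. Counts shown with a k suffix are already rounded on the page, so the integer is approximate for those listings.

```python
import re

# A number, an optional k suffix, and no letter directly after the suffix.
_COUNT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*([kK]?)(?![A-Za-z])")


def parse_review_count(text):
    """Turn a display string like "(1.9k reviews)" into an integer, or None."""
    if not text:
        return None
    match = _COUNT_RE.search(text.replace(",", ""))
    if not match:
        return None
    number, suffix = match.groups()
    multiplier = 1000 if suffix else 1
    return round(float(number) * multiplier)


print(parse_review_count("(1.9k reviews)"))  # 1900
print(parse_review_count("(87 reviews)"))    # 87
```

Returning None for a missing or unparseable value keeps the cleaned column consistent with how the parser treats absent fields.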
Step 4: Handle pagination across result pages
One page is a demo; a real job covers the full result set. Yelp paginates its search results through the start URL parameter, which sets the offset of the first result on the page and advances in steps of ten. So walking the pages is a loop over a range of offsets: 0, 10, 20, and so on. The same build_url and extract_businesses functions carry over without changes, so pagination is just an outer loop that paces itself between requests and writes the combined result to JSON and CSV.
    import json
    import time

    import pandas as pd

    # Add this to the Step 3 script: it reuses build_url, crawl, and
    # extract_businesses, and this __main__ block replaces the earlier one.


    def scrape_all_pages(query, location, max_pages=5):
        """Walk result pages by offset and return every business found."""
        all_rows = []
        for page in range(max_pages):
            start = page * 10  # Yelp offsets advance in steps of ten
            url = build_url(query, location, start)
            html = crawl(url)
            if not html:
                print(f"Stopping at offset {start}: no HTML")
                break
            rows = extract_businesses(html)
            if not rows:
                print(f"No results at offset {start}; reached the end")
                break
            all_rows.extend(rows)
            print(f"Offset {start}: {len(rows)} businesses")
            time.sleep(2)  # pace requests so the run is not one tight burst
        return all_rows


    if __name__ == "__main__":
        rows = scrape_all_pages("Italian Restaurants", "San Francisco, CA", max_pages=5)
        with open("yelp_businesses.json", "w", encoding="utf-8") as f:
            json.dump(rows, f, indent=2)
        pd.DataFrame(rows).to_csv("yelp_businesses.csv", index=False)
        print(f"Saved {len(rows)} businesses to JSON and CSV")
Two details make this loop production-friendly. It stops early when a page returns no businesses, so you do not waste requests past the last real page, and it sleeps for two seconds between requests so the run does not arrive as one tight burst. The export step writes both formats from the same list of dictionaries: json.dump for a structured file and pandas for a CSV that opens straight in a spreadsheet. Tune max_pages and the sleep to your volume; the slower you go, the less attention you draw.
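If you want the pause to be less regular than a fixed two seconds, one option is to add a small random extra to each delay. This is a sketch of that idea rather than anything the Crawling API requires; jittered_delay and paced are hypothetical helpers, and the default values are arbitrary starting points.

```python
import random
import time


def jittered_delay(base=2.0, jitter=1.0):
    """Return a pause of `base` seconds plus a random extra of up to `jitter` seconds."""
    return base + random.uniform(0, jitter)


def paced(items, base=2.0, jitter=1.0, sleep=time.sleep):
    """Yield items one by one, pausing before every item except the first."""
    for index, item in enumerate(items):
        if index:
            sleep(jittered_delay(base, jitter))
        yield item
```

In the pagination loop this would replace the fixed sleep: iterate with for start in paced(range(0, max_pages * 10, 10)) and drop the time.sleep(2) call. The sleep parameter exists so the pacing can be tested without actually waiting.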
Staying unblocked
Even with the Crawling API handling rendering, IP rotation, and the anti-bot layer, a few habits keep a run healthy, and they apply to any local-directory target.
- Pace your requests. The sleep above is not cosmetic. A tight loop is the fastest way to get throttled; spreading requests out reads far more like normal traffic.
- Lean on rotation. A pool of residential IPs spreads requests across many real-user addresses so no single one trips a rate limit. The Crawling API does this for you; if you build your own stack, this is the part to get right.
- Read the status codes. A run that starts returning challenges or errors is telling you the current rate is too aggressive. Treat that as a signal to back off, not noise to ignore.
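The back-off habit from the last bullet can be sketched as a small retry wrapper. It assumes a fetch function that returns the HTML on success and None on failure, as the crawl function in this guide does; fetch_with_backoff is a hypothetical helper, and the retry count and delays are illustrative defaults.

```python
import time


def fetch_with_backoff(fetch, url, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until it returns something truthy, doubling the wait after each failure."""
    for attempt in range(retries + 1):
        result = fetch(url)
        if result:
            return result
        if attempt < retries:
            # Waits grow as base_delay, 2x, 4x, and so on.
            sleep(base_delay * (2 ** attempt))
    return None
```

Used as html = fetch_with_backoff(crawl, url), a transient failure gets a few progressively slower retries, and a persistent one still returns None so the calling loop can stop.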
For the broader playbook, see how to scrape websites without getting blocked. If your goal is the wider category of local-business directories rather than Yelp specifically, the guide to scraping local business listings and the Yellow Pages walkthrough cover sibling sources with the same approach. And if you want the review text behind each business rather than the search-results summary, see how to crawl Yelp reviews, keeping the privacy notes below in mind.
Is it legal to scrape Yelp?
Whether scraping Yelp is allowed depends on the site's terms of service, your jurisdiction, and what you do with the data. None of the code here changes that; it only makes the technical part work. Read Yelp's Terms of Service and its robots.txt, and treat both as the boundary for what you collect and how fast. Yelp's terms restrict automated access, so for anything beyond small-scale research the right path is Yelp's own official channel: the Yelp Fusion API offers business and search data under terms Yelp supports, which is safer and more durable than scraping the front end.
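Checking a URL against robots.txt rules can be automated with Python's standard urllib.robotparser module. The sketch below parses a made-up rule set so it runs offline; SAMPLE_ROBOTS and the example.com URLs are placeholders for illustration and do not reflect Yelp's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Invented rules for demonstration only; not taken from any real site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt content that has already been downloaded as text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_allowed(parser, url, user_agent="*"):
    """Return True when the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


rules = build_parser(SAMPLE_ROBOTS)
print(is_allowed(rules, "https://example.com/search?q=pizza"))  # True
print(is_allowed(rules, "https://example.com/private/page"))    # False
```

For a live site, RobotFileParser also offers set_url and read to fetch the file directly. A robots.txt check covers only one of the boundaries named above; the Terms of Service still apply regardless of what it returns.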
If you do collect from public pages, scope it to public business data only: the business name, the aggregate rating, the review count, the category, the public neighborhood or address, and the link to the listing. The reviews themselves deserve a sharper line. Aggregate ratings and review counts are business-level facts, but the text of an individual review and the name of the person who wrote it are personal data. Treat reviewer identities as personal: do not build profiles of individuals, do not republish a person's review tied to their name, and apply GDPR or CCPA obligations wherever personal data is in scope.
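One way to enforce that scope in code is to pass every record through an explicit allowlist before it is stored, so a field added later by accident does not leak into the dataset. This is a minimal sketch; PUBLIC_FIELDS mirrors the fields this guide extracts, and keep_public_fields is a hypothetical helper name.

```python
# The business-level fields this guide collects; anything else is dropped.
PUBLIC_FIELDS = ("name", "rating", "review_count", "category", "address", "link")


def keep_public_fields(record):
    """Return a copy of the record containing only the allowlisted fields."""
    return {field: record.get(field) for field in PUBLIC_FIELDS}
```

Applying it as rows = [keep_public_fields(r) for r in rows] just before export makes the scoping decision visible in one place instead of relying on what the parser happens to return.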
What this approach does not cover is just as important. It does not touch anything behind a login, and it does not bypass authentication or any access control to reach gated content; that is out of scope here and runs against the site's terms. Respect Yelp's stated rate expectations, keep your request volume reasonable so you are not straining its servers, and if you plan to store, enrich, or commercially reuse Yelp data, prefer the Fusion API and check the rules that apply to you rather than assuming public means unrestricted.
Key takeaways
- Yelp is a structured directory. Each search result is a card with a name, rating, review count, category, public address, and a link, driven by the find_desc and find_loc URL parameters.
- A plain request fails twice over. Yelp renders results client-side and blocks scraper-shaped traffic; the Crawling API renders the JavaScript, routes through residential IPs, and returns ready-to-parse HTML in one call.
- BeautifulSoup does the extraction. Map name, rating, review count, category, address, and link to current selectors, read each field defensively, and expect the hashed css-* class names to drift.
- Pagination is a loop over the start offset. Step by ten, reuse the same parser, stop early on an empty page, sleep between requests, and export to JSON and CSV.
- Stay on public business data. Respect the ToS and robots.txt, treat reviewer identities as personal data, never touch login-gated content, and prefer the official Yelp Fusion API for anything beyond small research.
Frequently Asked Questions (FAQs)
Do I need the normal token or the JS token for Yelp?
The JavaScript token. Yelp builds its search-results cards client-side, so a normal-token fetch often returns an HTML shell without the businesses in it. The JS token renders the page in a real browser first, which is what puts the cards in the HTML you parse. Pair it with the ajax_wait and page_wait options so the content has time to settle before the response comes back.
How do I handle pagination on Yelp?
Yelp exposes the result offset through a start URL parameter that advances in steps of ten, so you loop over a range of offsets (0, 10, 20, and so on), build a URL per page, and run the same parser on each. Stop when a page returns zero businesses, which marks the end of the result set, and sleep a couple of seconds between requests so the run does not arrive as one burst.
My selectors return None. What changed?
Almost certainly Yelp's markup. The hashed class names such as css-volmcs and css-1jq1ouh are generated and change without notice, so selectors that worked last month can break. Re-inspect a live results page in your browser's dev tools and update them, and prefer the more stable data-testid and businessName hooks where you can. Periodic selector maintenance is normal for any production scraper.
Is it legal to scrape Yelp reviews?
Aggregate ratings and review counts are business-level facts you can use for analysis, but the text of an individual review and the reviewer's name are personal data. Do not build profiles of individuals or republish a person's review tied to their identity, and apply GDPR or CCPA where personal data is involved. For review data at any scale, the Yelp Fusion API is the path Yelp supports, and it is the safer choice than scraping the public pages.
How do I avoid getting blocked while scraping Yelp?
Keep your per-IP request rate low, pace requests with a delay, and route through rotating residential IPs so no single address trips a rate limit. The Crawling API manages rendering, rotation, and the anti-bot layer for you; if you build your own stack, that is the part to invest in. Watch the status codes and back off the moment you start seeing challenges.
Can I export the scraped data to Excel?
Yes. The scraper produces a list of dictionaries, which pandas turns into a spreadsheet in one line: pd.DataFrame(rows).to_excel("yelp_businesses.xlsx", index=False). Writing .xlsx files requires the openpyxl package, so install it with pip install openpyxl first. Because every record shares the same keys, the columns line up cleanly, and the same structure exports just as easily to the CSV the script already writes or to a database table.