Local business listings are one of the most useful public datasets on the web. A directory search like "plumbers in Austin" or "restaurants in Denver" returns a structured grid of businesses, each carrying a name, an address, a phone number, a category, and a star rating with a review count. Sales, marketing, and research teams pull that data to build prospect lists by city, enrich CRM records with verified contact details, and map competitor density across markets. Collecting it by hand does not scale past a handful of results, so the work belongs in a script.
This guide shows you how to scrape local business listings with Python the reliable way. You build a small, runnable scraper that fetches a rendered directory results page through the Crawling API, parses each listing card with BeautifulSoup, and pulls a clean record per business: name, address, phone, category, rating, review count, and website. We keep the whole walkthrough scoped to public business information, and the legality section near the end is not boilerplate, so read it before you point this at any real volume.
What you will build
A Python script that takes a category and a city, retrieves the rendered listings page through the Crawling API, and extracts a structured record per business. We use a Yellow Pages search as the running example and pull these fields from each result card:
- Name: the business name, for example "Austin Plumbing Co".
- Address: the street address shown on the card.
- Phone: the publicly listed business phone number.
- Category: the primary category the directory files the business under.
- Rating: the average star rating, when the business has one.
- Reviews: the number of reviews behind that rating.
- Website: the link to the business's own site, when listed.
Why a plain request fails on listing sites
Collecting listings at scale is not as simple as sending requests and parsing HTML, for two reasons that compound the moment you scale past a few queries.
First, results are geo-dependent. A bare query like "plumbers" returns completely different businesses depending on whether the request appears to come from Austin, Denver, or Phoenix. To get a consistent dataset you have to control both the query (include the city) and the request location (geo-targeting), or the results shift unpredictably from run to run.
Second, modern directories defend against automated traffic and increasingly render listings client-side. Many platforms return a thin HTML shell and inject the actual business cards with JavaScript afterward, so a standard HTTP request hands you a page with no listings in it. Once you scale past a few requests, the platform also starts applying IP blocks, CAPTCHA challenges, and request throttling. So a working scraper needs two things in one request: a browser that renders the page, and an IP the platform reads as a real visitor. You can assemble that yourself with a headless browser plus a pool of rotating residential proxies, but stitching those together and keeping them healthy is most of the work. The Crawling API folds both into a single call: you send it the URL, it renders the page behind a trusted residential IP, and it returns finished HTML for you to parse.
Crawlbase offers two token types. The normal token fetches static HTML; the JavaScript (JS) token renders the page in a real browser first. Static directory pages parse fine with the normal token, but platforms that inject listings client-side (Google Maps, Yelp) need the JS token. Match the token to the page: use the normal token for simple pages and the JS token for dynamic ones.
Prerequisites
You need a few things in place before writing any code. None of them take long.
Basic Python. You should be comfortable writing and running a Python script and installing packages with pip. If you are new to BeautifulSoup, the primer on how to use BeautifulSoup in Python covers the selector basics this tutorial leans on.
Python 3.8 or later. Confirm your version with python --version. If you do not have it, install it from python.org or through a distribution like Anaconda.
A Crawlbase account and token. Sign up, open your dashboard, and copy your token from the account docs page. Treat the token like a password: it authenticates your requests, so keep it out of version control.
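One way to keep the token out of version control is to read it from an environment variable at startup. The sketch below assumes a variable named CRAWLBASE_TOKEN; that name is chosen for illustration and is not something the Crawlbase client requires.

```python
import os


def load_token(env_var="CRAWLBASE_TOKEN"):
    """Read the API token from the environment instead of hardcoding it.

    The variable name is an assumption for this example; use whatever
    name fits your deployment.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the scraper."
        )
    return token
```

You would then pass load_token() to the client in place of the literal string, for example CrawlingAPI({"token": load_token()}).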
Set up the project
Create a virtual environment so project dependencies stay isolated, then install the two libraries the scraper needs.
    python --version
    python -m venv listings_env
    source listings_env/bin/activate
    pip install crawlbase beautifulsoup4
On Windows, activate the environment with listings_env\Scripts\activate instead of the source line. Two dependencies do the work: crawlbase is the official client for the Crawling API, and beautifulsoup4 parses the returned HTML so you can pull each field out of a listing card by CSS selector.
Understanding the listings page
A directory results page lays out a column of listing cards, one per business. Each card carries the same handful of fields: a business name, a street address, a phone number, the category it is filed under, and a rating with a review count. A "visit website" link sits on the card when the business has supplied one. Below the column sit pagination controls that let you walk through additional result pages for the same query.
Before writing selectors, open a results page in your browser, right-click a listing card, and choose Inspect. On Yellow Pages each result is wrapped in a div.result container, with the name in a.business-name, the address in div.street-address and div.locality, the phone in div.phones, the primary category in div.categories, the rating exposed through the class on div.result-rating, the review count in span.count, and the website in a.track-visit-website. Those are the selectors you target.
Step 1: Fetch the rendered listings page
Start by getting the finished page. Import the CrawlingAPI class, initialize it with your token, build the search URL from a category and a city, and request it. Checking the status code before you parse keeps failures loud instead of silent.
    from urllib.parse import quote_plus
    from crawlbase import CrawlingAPI

    api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})


    def build_url(category, city):
        terms = quote_plus(category)
        geo = quote_plus(city)
        return f"https://www.yellowpages.com/search?search_terms={terms}&geo_location_terms={geo}"


    def crawl(page_url):
        options = {"country": "US"}
        response = api.get(page_url, options)
        if response["status_code"] == 200:
            return response["body"].decode("latin1")
        print(f"Request failed: {response['status_code']}")
        return None


    if __name__ == "__main__":
        url = build_url("plumbers", "Austin, TX")
        html = crawl(url)
        print(html[:500] if html else "No HTML returned")
The build_url helper assembles the search URL from two query parameters: search_terms for the category and geo_location_terms for the city, both URL-encoded so spaces and commas survive the round trip. The country option pins the request to a US IP, which is the geo-targeting piece: a query for "plumbers in Austin" only returns sensible results when the request also appears to originate in the right market. The body is decoded as latin1 because directory pages mix in characters that strict UTF-8 decoding can choke on. Run the script and you should see real listing markup, not an empty shell or a block page. That confirms retrieval works before you write a single selector.
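Decoding as latin1 never raises, but it mis-renders any multi-byte UTF-8 characters in the page (an accented "é" comes back as two junk characters). If that matters for your data, one hedged alternative is to try UTF-8 first and fall back to latin1 only when strict decoding fails. The decode_body helper below is a hypothetical name and is not part of the script in this guide.

```python
def decode_body(body: bytes) -> str:
    """Decode a response body, preferring UTF-8 and falling back to latin1.

    latin1 maps every byte to a character, so the fallback cannot fail.
    """
    try:
        return body.decode("utf-8")
    except UnicodeDecodeError:
        return body.decode("latin1")


# UTF-8 bytes decode cleanly; a lone latin1 byte triggers the fallback.
utf8_name = decode_body("Café Plumbing".encode("utf-8"))
latin1_name = decode_body(b"Caf\xe9 Plumbing")
```

In crawl, this would replace the direct .decode("latin1") call on the response body.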
That single api.get call did the part that usually eats a week: it fetched the listings page behind a trusted residential IP pinned to the right country, so the directory returned real cards instead of a block page. The Crawling API handles the IP rotation and the geo-targeting for you, plus the rendering when you use the JS token, so you skip running a headless browser fleet and a proxy pool yourself. Point it at one city on the free tier first.
Step 2: Parse the listing cards with BeautifulSoup
With the HTML in hand, load it into BeautifulSoup, find every listing card, and pull each field by its selector. Each business is wrapped in a div.result container, with the name, address, phone, category, rating, review count, and website each exposed through its own class. Wrap each card in a try/except so one malformed listing does not crash the run.
    from bs4 import BeautifulSoup


    def text_of(card, selector):
        # Return the stripped text of the first match, or None if it is missing.
        el = card.select_one(selector)
        return el.get_text(strip=True) if el else None


    def parse_rating(card):
        el = card.select_one("div.result-rating")
        if not el:
            return None
        words = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
        classes = el.get("class", [])
        # The half star may be its own class ("four half") or fused ("four-half").
        has_half = "half" in classes or any(c.endswith("-half") for c in classes)
        for cls in classes:
            base = cls.replace("-half", "")
            if base in words:
                return words[base] + (0.5 if has_half else 0)
        return None


    def scrape_results(html):
        soup = BeautifulSoup(html, "html.parser")
        cards = soup.select("div.result")
        results = []
        for card in cards:
            try:
                website = card.select_one("a.track-visit-website")
                results.append({
                    "name": text_of(card, "a.business-name"),
                    "address": text_of(card, "div.street-address"),
                    "locality": text_of(card, "div.locality"),
                    "phone": text_of(card, "div.phones"),
                    "category": text_of(card, "div.categories"),
                    "rating": parse_rating(card),
                    "reviews": text_of(card, "span.count"),
                    "website": website["href"] if website else None,
                })
            except Exception as e:
                print(f"Skipped a card: {e}")
        return results
The text_of helper queries a single element inside one card and returns None when it is missing, instead of throwing on a .get_text() call against nothing. That keeps extraction resilient when a field is absent, which is common since not every listing carries a website or a rating. The parse_rating helper reads the star rating from the class list on div.result-rating, where the directory writes the score as words like four or four half, and converts it to a number. The review count comes from span.count, and the website is read from the anchor's href rather than its text. Names, addresses, phones, and categories each map to their own selector.
Directory class names change without notice. Treat the selectors above as a starting template, not a contract. When a field comes back as None for every card, re-inspect a live results page in your browser's dev tools and update the selector. Periodic selector maintenance is normal for any production scraper, not a sign something is broken.
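The "None for every card" signal can be checked automatically after each run. The helper below is a small sketch of that idea; empty_fields is a hypothetical name, and it assumes the list-of-dicts shape that scrape_results returns.

```python
def empty_fields(records):
    """Return the field names that are None in every record.

    A field that is missing on some cards is normal; a field that is
    missing on all of them usually means its selector has drifted.
    """
    if not records:
        return []
    return [
        field
        for field in records[0]
        if all(record.get(field) is None for record in records)
    ]


sample = [
    {"name": "Austin Plumbing Co", "phone": None, "rating": None},
    {"name": "Lone Star Drain & Sewer", "phone": "(512) 555-0188", "rating": None},
]
suspect = empty_fields(sample)  # only "rating" is None on every card
```

Logging the result after each run gives you an early hint about which selector to re-inspect, though a field can also be legitimately empty for a whole page.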
Step 3: Put it together and export
Now wire the fetch and the parse into one runnable script, and write the records to both JSON and CSV so they drop straight into a sheet or a database. Fetch the rendered page, hand it to the parser, then dump the structured records.
    import csv
    import json
    from urllib.parse import quote_plus

    from crawlbase import CrawlingAPI
    from bs4 import BeautifulSoup

    api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})


    def build_url(category, city):
        terms = quote_plus(category)
        geo = quote_plus(city)
        return f"https://www.yellowpages.com/search?search_terms={terms}&geo_location_terms={geo}"


    def crawl(page_url):
        response = api.get(page_url, {"country": "US"})
        if response["status_code"] == 200:
            return response["body"].decode("latin1")
        print(f"Request failed: {response['status_code']}")
        return None


    def text_of(card, selector):
        el = card.select_one(selector)
        return el.get_text(strip=True) if el else None


    def parse_rating(card):
        el = card.select_one("div.result-rating")
        if not el:
            return None
        words = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
        classes = el.get("class", [])
        # The half star may be its own class ("four half") or fused ("four-half").
        has_half = "half" in classes or any(c.endswith("-half") for c in classes)
        for cls in classes:
            base = cls.replace("-half", "")
            if base in words:
                return words[base] + (0.5 if has_half else 0)
        return None


    def scrape_results(html):
        soup = BeautifulSoup(html, "html.parser")
        results = []
        for card in soup.select("div.result"):
            try:
                website = card.select_one("a.track-visit-website")
                results.append({
                    "name": text_of(card, "a.business-name"),
                    "address": text_of(card, "div.street-address"),
                    "locality": text_of(card, "div.locality"),
                    "phone": text_of(card, "div.phones"),
                    "category": text_of(card, "div.categories"),
                    "rating": parse_rating(card),
                    "reviews": text_of(card, "span.count"),
                    "website": website["href"] if website else None,
                })
            except Exception as e:
                print(f"Skipped a card: {e}")
        return results


    def save(rows, name):
        # Write explicit UTF-8 so non-ASCII names survive on every platform.
        with open(f"{name}.json", "w", encoding="utf-8") as f:
            json.dump(rows, f, indent=2)
        if rows:
            with open(f"{name}.csv", "w", newline="", encoding="utf-8") as f:
                writer = csv.DictWriter(f, fieldnames=rows[0].keys())
                writer.writeheader()
                writer.writerows(rows)


    def main():
        url = build_url("plumbers", "Austin, TX")
        html = crawl(url)
        if not html:
            return
        data = scrape_results(html)
        save(data, "listings")
        print(json.dumps(data, indent=2))


    if __name__ == "__main__":
        main()
What the output looks like
Save the full script as scraper.py and run it with python scraper.py. You get a clean list of records, one per business, written to both listings.json and listings.csv and printed to the console.
    [
      {
        "name": "Austin Plumbing Co",
        "address": "1200 W 5th St",
        "locality": "Austin, TX 78703",
        "phone": "(512) 555-0142",
        "category": "Plumbers, Water Heaters",
        "rating": 4.5,
        "reviews": "(38)",
        "website": "https://www.austinplumbingco.example"
      },
      {
        "name": "Lone Star Drain & Sewer",
        "address": "904 E Cesar Chavez St",
        "locality": "Austin, TX 78702",
        "phone": "(512) 555-0188",
        "category": "Plumbers",
        "rating": null,
        "reviews": null,
        "website": null
      }
    ]
The second record shows the resilience at work: that business has no rating, no review count, and no website on file, so those fields come back as null rather than crashing the run. The CSV version carries the same columns in the same order, ready to open in a spreadsheet or load into a database. If you want a refresher on flattening nested records into rows, the guide on scraping tables from a website covers the same export shape.
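Note that the review count in the sample output is still a string such as "(38)", because it is read as raw card text. If you want to sort or filter on it, a small normalizer can turn it into an integer. The review_count helper below is a hypothetical addition, and it assumes the count appears as digits, optionally with thousands separators, somewhere in the text.

```python
import re


def review_count(raw):
    """Convert card text like "(38)" into the integer 38.

    Returns None when the text is missing or contains no digits.
    """
    if not raw:
        return None
    match = re.search(r"\d[\d,]*", raw)
    if not match:
        return None
    return int(match.group().replace(",", ""))


counts = [review_count(text) for text in ["(38)", "(1,204)", None]]
```

You could apply it when building each record, or as a cleanup pass over the rows before calling save.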
Scaling across cities and pages
One query in one city is a demo. The real value of listing data comes from running the same category across many markets, which is also where consistency matters most. Walk a list of cities, paginate each one with the &page= parameter until a page returns no cards, and collect everything into a single dataset.
    import time


    def scrape_city(category, city, max_pages=5):
        base = build_url(category, city)
        collected = []
        for page in range(1, max_pages + 1):
            html = crawl(f"{base}&page={page}")
            if not html:
                break
            rows = scrape_results(html)
            if not rows:
                break
            for row in rows:
                row["city"] = city
            collected.extend(rows)
            print(f"{city} page {page}: {len(rows)} listings")
            time.sleep(2)
        return collected


    def scrape_cities(category, cities):
        all_rows = []
        for city in cities:
            all_rows.extend(scrape_city(category, city))
        return all_rows


    data = scrape_cities("restaurants", ["Austin, TX", "Denver, CO", "Phoenix, AZ"])
    save(data, "multi_city")
The max_pages cap keeps each city bounded so a broad query does not spin forever, and the empty-results break stops you early when the directory runs out of pages. Stamping each row with its city keeps the markets separable once everything lands in one file. The time.sleep(2) between pages paces requests so you are not hammering the directory in a tight loop, which is the fastest way to get throttled. This multi-city loop is the same pattern behind a price comparison tool: one query, many sources, normalized into one dataset.
Staying unblocked
Even with retrieval handled, directories watch for scraper-shaped traffic. A few habits keep a run healthy, and they apply to any commercial target.
- Pace your requests. Spread requests out with a delay between pages and vary your queries instead of crawling one term at full speed. The time.sleep in the loop is the floor, not the ceiling.
- Lean on rotation. A pool of residential IPs spreads requests across many real-user addresses so no single one trips a rate limit. The Crawling API handles this for you; if you roll your own stack, this is the part to get right.
- Match the geo. Pin the request country to the market you are querying so results stay consistent and the traffic looks local rather than out of place.
- Read the status codes. A run that starts returning challenges or errors is telling you the current rate or IP tier is no longer enough. Treat that as a signal to back off, not noise to ignore.
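The pacing and back-off habits above can be combined in one retry wrapper. This is a minimal sketch, not part of the original script: fetch_with_backoff is a hypothetical helper that takes any function returning HTML or None (such as the crawl helper from this guide) and waits longer after each failure, with a little random jitter.

```python
import random
import time


def fetch_with_backoff(fetch, url, retries=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until it returns HTML or the retries run out.

    fetch is any callable that returns the page HTML, or None on failure.
    sleep is injectable so the wait can be stubbed out in tests.
    """
    for attempt in range(retries):
        html = fetch(url)
        if html:
            return html
        if attempt < retries - 1:
            # Double the wait each time and add jitter so retries do not align.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            sleep(delay)
    return None
```

In the multi-city loop you might call fetch_with_backoff(crawl, page_url) in place of crawl(page_url). The retry count and base delay here are illustrative defaults, not recommended values for any particular directory.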
For the broader playbook on keeping a scraper healthy, see how to scrape websites without getting blocked. When you outgrow on-demand requests and need to push thousands of city-and-category URLs at once, the asynchronous Crawler processes large batches in the background and delivers results to a webhook or to Cloud Storage, so you reuse this same parser without managing a request queue yourself.
Is it legal to scrape business listings?
Whether scraping a listing directory is allowed depends on the platform's terms of service, your jurisdiction, and what you do with the data. Most directories restrict automated access in their terms, so scraping can run against those terms regardless of how careful your tooling is. None of the code here changes that; it just makes the technical part work. Read the platform's Terms of Use and its robots.txt, and treat both as the boundary for what you collect and how fast you collect it.
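Checking robots.txt can be scripted with Python's standard urllib.robotparser module. The sketch below parses a made-up robots.txt (the rules and the directory.example host are invented for illustration, not taken from any real directory) and asks whether two paths may be fetched.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt):
    """Parse robots.txt text and return a parser you can query."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Invented rules for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rules = build_robots_checker(SAMPLE_ROBOTS)
search_allowed = rules.can_fetch("*", "https://directory.example/search?q=plumbers")
private_allowed = rules.can_fetch("*", "https://directory.example/private/report")
delay_seconds = rules.crawl_delay("*")
```

For a live site you would call set_url with the site's robots.txt address and then read() instead of parsing a string. A passing check covers robots.txt only; the Terms of Use still need a human read.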
A few lines are worth holding to. Collect only public business information: the names, addresses, phone numbers, categories, ratings, review counts, and website links that anyone can see on a results page without an account. A business's listed contact details are public business data, but do not harvest personal data about individuals, including private contact information, reviewer identities, or anything tied to a named person beyond what the business itself publishes. Keep your request volume low enough that you are not straining the platform's servers, and respect any rate expectations it states. If you plan to reuse the data commercially, get permission or an official agreement rather than assuming silence is consent.
This guide is deliberately scoped to public listing pages because that is the line that keeps the work defensible. It does not cover anything behind a login, account data, or scraping personal information about real people. Where a platform offers a sanctioned route, prefer it: maps and places providers publish official APIs that return the same listing fields under clear terms, and that is the right tool when you need large volumes, guaranteed structure, or commercial rights. If your project needs more than public listings, an official API or a data agreement is the correct path, not a cleverer scraper.
Key takeaways
- Listings are geo-dependent. Include the city in the query and pin the request country, or the same category returns inconsistent businesses from run to run.
- Retrieval is the hard part. The Crawling API fetches the page behind a trusted residential IP and renders it when needed, so you get real cards instead of a block page or an empty shell.
- BeautifulSoup does the extraction. Loop the div.result cards and map name, address, phone, category, rating, reviews, and website to current selectors, and expect those selectors to drift.
- Scale across cities and pages. Walk a city list, paginate with &page= until a page is empty, stamp each row with its city, and pace requests with a delay.
- Stay on public data. Respect the platform's ToS and robots.txt, prefer an official maps or places API for licensed or bulk data, and never collect personal information about individuals.
Frequently Asked Questions (FAQs)
Do I need the normal token or the JS token for listings?
It depends on the directory. Static results pages like Yellow Pages parse fine with the normal token. Platforms that inject listings client-side, such as Google Maps and Yelp, need the JS token so the page is rendered in a real browser before the HTML is returned. Match the token to the page: start with the normal token and switch to the JS token if the cards are missing from the body.
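The "cards are missing from the body" check can be done without a full parse. The sketch below uses only the standard library to count div elements carrying the card class; has_listing_cards and CardCounter are hypothetical names, and the default class "result" assumes the Yellow Pages markup described earlier.

```python
from html.parser import HTMLParser


class CardCounter(HTMLParser):
    """Count <div> elements whose class list contains a given class name."""

    def __init__(self, card_class="result"):
        super().__init__()
        self.card_class = card_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.card_class in classes:
            self.count += 1


def has_listing_cards(html, card_class="result"):
    """Return True when at least one listing card is present in the HTML."""
    counter = CardCounter(card_class)
    counter.feed(html)
    return counter.count > 0
```

If this returns False for a page fetched with the normal token, that is a hint, not proof, that the listings are injected client-side and the JS token is worth trying.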
How do I get accurate results for a specific city?
Two things together. Put the city in the query itself through the geo_location_terms parameter, and pin the request to the right market with the country option on the Crawling API. Local search is tied to location, so a query without both pieces returns results that shift unpredictably depending on where the request appears to originate.
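To confirm the city really ends up in the query, you can build the URL and parse it back. This sketch uses urlencode from the standard library, which applies the same plus-style encoding as the quote_plus calls in build_url; build_search_url is a hypothetical variant for illustration, not a replacement the guide requires.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(category, city):
    """Assemble the search URL with both parameters URL-encoded."""
    query = urlencode({"search_terms": category, "geo_location_terms": city})
    return f"https://www.yellowpages.com/search?{query}"


url = build_search_url("plumbers", "Austin, TX")

# Parse the query string back to check that the comma and space survived.
params = parse_qs(urlsplit(url).query)
city_round_trip = params["geo_location_terms"][0]
```

The country option on the Crawling API request is still needed alongside this; the URL only covers the query half of the location.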
Can I scrape multiple cities in one run?
Yes. Pass a list of cities and loop the same category across each one, combining the results into a single dataset. Stamp each record with its city before you merge so the markets stay separable, and add a short delay between requests so you pace the run.
My selectors return None. What changed?
Almost certainly the directory's markup. Selectors like a.business-name for the name or div.result-rating for the rating depend on class names that change without notice. Re-inspect a live results page in your browser's dev tools, update the selector to match, and rerun. Periodic selector maintenance is normal for any production scraper.
Can I scrape personal contact details from listings?
No, and this guide does not cover it. Stick to public business information: the business name, address, listed phone, category, rating, review count, and website. Personal data about individuals, private contact details, or reviewer identities are out of scope and run against most platforms' terms. For richer or licensed data, the correct route is an official maps or places API.
How do I handle very large jobs across hundreds of cities?
For on-demand work the Crawling API is enough, but when you are pushing thousands of city-and-category URLs at once, move to the asynchronous Crawler. You push the URLs and receive results through a webhook or Cloud Storage instead of waiting on each request, which improves throughput and avoids bottlenecks. The same parser in this guide handles the returned HTML unchanged.

