Yellow Pages is one of the oldest business directories on the web, and it is still a dense source of local-business data: name, public phone number, street address, the categories a business files under, and a link to its own website. For sales prospecting, market mapping, or building a regional dataset of service providers, those public listings are exactly the structured intelligence you want. The friction is operational, not conceptual: Yellow Pages watches for scraper-shaped traffic and will throttle or challenge a naive loop quickly.

This guide shows you how to scrape Yellow Pages business listings with Python the reliable way. You fetch search-result pages through the Crawling API, parse each result with BeautifulSoup to pull the name, phone, address, category, and website, then walk the pagination to cover a full result set. Everything here is scoped to public business-directory data, and the legality section near the end is not boilerplate, so read it before you point this at real volume.

What you will build

A small Python scraper that takes a search query and a location, retrieves the Yellow Pages search-results page through the Crawling API, and extracts a structured record for every business on the page. The running example is "Information Technology" businesses in "Los Angeles, CA", and for each listing we pull these fields:

  • Business name: the primary identifier for the listing.
  • Phone: the public contact number shown on the card.
  • Address: the public street address, used for any geographic analysis.
  • Category: the business categories the listing files under.
  • Website: the link to the business's own site, when one is listed.

How Yellow Pages search pages are structured

A Yellow Pages search is driven by two URL parameters: search_terms for the query and geo_location_terms for the location. Submit a search and you land on a results page where each business is a self-contained card. The card carries the name as a heading link, the phone and address in their own blocks, the categories as a list, and, for businesses that have paid for or claimed the listing, an outbound website link.

Results span multiple pages. Yellow Pages uses a page URL parameter to move between them, which makes pagination a matter of incrementing an integer rather than chasing dynamic "load more" behavior. That predictability is what lets the same parser run across every page without changes once you have it working on one.
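To make the URL shape concrete, here is a minimal sketch that encodes the three parameters described above using only the standard library. The helper names search_url and page_urls are hypothetical, and the parameter names are the ones this guide describes, so re-check them against a live search URL before relying on them.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "https://www.yellowpages.com/search"


def search_url(query, location, page=1):
    """Build one search URL from the query, the location, and the page number."""
    params = {
        "search_terms": query,
        "geo_location_terms": location,
        "page": page,
    }
    # urlencode escapes spaces as "+" and the comma as "%2C"
    return f"{BASE}?{urlencode(params)}"


def page_urls(query, location, pages):
    """Return the URLs for pages 1..pages of the same search."""
    return [search_url(query, location, p) for p in range(1, pages + 1)]


if __name__ == "__main__":
    for url in page_urls("Information Technology", "Los Angeles, CA", 3):
        print(url)
```

Only the page value differs between the generated URLs, which is why a parser that works on page 1 can be reused on every later page.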

Why a plain fetch struggles

You can hit a Yellow Pages search URL with the requests library and, on a good day, get HTML back. The problem shows up at volume. Yellow Pages deploys anti-scraping defenses: it rate-limits by IP, serves CAPTCHAs to traffic that looks automated, and blocks datacenter addresses that request pages in a tight, machine-shaped pattern. A single request from your laptop might succeed; a few hundred from the same IP will not.

So a scraper that actually finishes the job needs requests that read as a real visitor coming from a trusted IP. You can build that yourself with a pool of rotating residential proxies and the plumbing to keep them healthy, but maintaining that stack is most of the work. The Crawling API folds it into a single call: you send it the URL, it routes the request through residential IPs server-side and handles the anti-bot layer, and it returns the HTML for you to parse.

Which token to use

Crawlbase offers two token types. The normal token fetches static HTML; the JavaScript (JS) token renders the page in a real browser first. Yellow Pages serves its listing data in the initial HTML, so the normal token is the right choice here and keeps each request cheaper. Reach for the JS token only if a target starts rendering listings client-side.

Prerequisites

A few things to have in place first. None take long.

Basic Python. You should be comfortable running a script and installing packages with pip. If selectors are new to you, the primer on how to use BeautifulSoup in Python covers the parsing side in depth.

Python 3.8 or later. Confirm with python --version. If you do not have it, install it from python.org or through a distribution like Anaconda.

A Crawlbase account and token. Sign up, open your dashboard, and copy your normal token from the account docs page. The first 1,000 requests are free and no card is required. Treat the token like a password and keep it out of version control.
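One common way to keep the token out of version control is to read it from an environment variable instead of hardcoding it. The sketch below assumes a variable named CRAWLBASE_TOKEN and a helper called load_token; both names are illustrative choices, not something the Crawlbase client requires.

```python
import os


def load_token(env_var="CRAWLBASE_TOKEN", environ=None):
    """Return the token from the environment, or raise a clear error.

    env_var is a hypothetical variable name; environ can be swapped in for tests.
    """
    environ = os.environ if environ is None else environ
    token = environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Crawlbase token"
        )
    return token
```

With this in place, the client setup used later in the guide would become CrawlingAPI({"token": load_token()}), and the token never appears in the script itself.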

Set up the project

Create a virtual environment so dependencies stay isolated, then install the two libraries the scraper needs.

bash
python --version

python -m venv yellowpages_env
source yellowpages_env/bin/activate

pip install crawlbase beautifulsoup4

On Windows, activate the environment with yellowpages_env\Scripts\activate instead of the source line. Two dependencies do the work: crawlbase is the official client for the Crawling API, and beautifulsoup4 parses the returned HTML so you can pull each field by CSS selector.

Step 1: Fetch a search page

Start by getting one results page back. Build the search URL from your query and location, import the CrawlingAPI class, initialize it with your token, and request the URL. Checking the status before you parse keeps failures loud instead of silent.

python
from urllib.parse import urlencode
from crawlbase import CrawlingAPI

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

def build_url(query, location, page=1):
    # urlencode escapes the spaces and comma in the query and location
    base = "https://www.yellowpages.com/search?"
    params = {"search_terms": query, "geo_location_terms": location, "page": page}
    return base + urlencode(params)

def crawl(page_url):
    response = api.get(page_url)
    # pc_status is the Crawlbase status for the request, compared as a string
    if response["headers"]["pc_status"] == "200":
        # the body arrives as bytes, so decode it before parsing
        return response["body"].decode("utf-8")
    print(f"Request failed: {response['headers']['pc_status']}")
    return None

if __name__ == "__main__":
    url = build_url("Information Technology", "Los Angeles, CA")
    html = crawl(url)
    print(html[:500] if html else "No HTML returned")

Note that the status check reads pc_status from the response headers, which is the Crawlbase status for the request, distinct from the upstream HTTP code. Save the script as scraper.py, run it with python scraper.py, and you should see real results markup rather than a challenge page. That confirms the fetch path works before you write a single selector.

Step 2: Parse the listings with BeautifulSoup

With a results page in hand, load it into BeautifulSoup and walk the result cards. Each card sits under a predictable container, and within it the name, phone, address, categories, and website map to their own selectors. Reading each field defensively, returning None when an element is missing, keeps one absent value from crashing the run.

python
from bs4 import BeautifulSoup

def text_of(node):
    return node.get_text(strip=True) if node else None

def extract_listings(html):
    soup = BeautifulSoup(html, "html.parser")
    cards = soup.select("div.search-results.organic div.result")
    listings = []

    for card in cards:
        name = card.select_one("a.business-name")
        phone = card.select_one("div.phone")
        address = card.select_one("div.adr")
        category = card.select_one("div.categories")
        website = card.select_one("a.track-visit-website")

        listings.append({
            "name": text_of(name),
            "phone": text_of(phone),
            "address": text_of(address),
            "category": text_of(category),
            "website": website["href"] if website else None,
        })

    return listings

The text_of helper takes whatever select_one returned and gives back None when the node is absent, instead of throwing on a .get_text() call against nothing. That keeps extraction resilient: not every listing has a website link or a clean phone block, and a missing field should leave a None in the record rather than stop the loop. The website is read from the anchor's href rather than its text, so it is handled separately.

Selectors drift

The class names above (result, business-name, adr, categories, track-visit-website) reflect the current Yellow Pages markup, and that markup changes without notice. Treat the selectors as a starting template, not a contract. When a field comes back as None across every listing, re-inspect a live results page in your browser's dev tools and update the selector. Periodic selector maintenance is normal for any production scraper.
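The "None across every listing" signal can be checked automatically after each page. The following is a minimal sketch, assuming records shaped like the dictionaries built in extract_listings; the function name dead_fields is hypothetical.

```python
def dead_fields(listings):
    """Return the field names that are None in every record.

    A non-empty result suggests the selector for that field no longer matches.
    An empty input returns an empty list, since there is nothing to judge.
    """
    if not listings:
        return []
    fields = listings[0].keys()
    return sorted(
        field
        for field in fields
        if all(record.get(field) is None for record in listings)
    )


if __name__ == "__main__":
    sample = [
        {"name": "A", "phone": None, "website": None},
        {"name": "B", "phone": None, "website": "http://example.com"},
    ]
    print(dead_fields(sample))
```

Treat website with some care: it is optional on each card, so a short page where it is missing everywhere is possible even when the selector is still correct.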

Step 3: Put it together

Now wire the fetch and the parse into one runnable script for a single page. Build the URL, fetch the HTML, hand it to the parser, and print the structured records.

python
import json
from urllib.parse import urlencode
from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

def build_url(query, location, page=1):
    base = "https://www.yellowpages.com/search?"
    params = {"search_terms": query, "geo_location_terms": location, "page": page}
    return base + urlencode(params)

def crawl(page_url):
    response = api.get(page_url)
    if response["headers"]["pc_status"] == "200":
        return response["body"].decode("utf-8")
    print(f"Request failed: {response['headers']['pc_status']}")
    return None

def text_of(node):
    return node.get_text(strip=True) if node else None

def extract_listings(html):
    soup = BeautifulSoup(html, "html.parser")
    cards = soup.select("div.search-results.organic div.result")
    listings = []

    for card in cards:
        website = card.select_one("a.track-visit-website")
        listings.append({
            "name": text_of(card.select_one("a.business-name")),
            "phone": text_of(card.select_one("div.phone")),
            "address": text_of(card.select_one("div.adr")),
            "category": text_of(card.select_one("div.categories")),
            "website": website["href"] if website else None,
        })

    return listings

def main():
    url = build_url("Information Technology", "Los Angeles, CA")
    html = crawl(url)
    if not html:
        return
    data = extract_listings(html)
    print(json.dumps(data, indent=2))

if __name__ == "__main__":
    main()

What the output looks like

Run the full script with python scraper.py and you get a clean list of structured records, ready to write to JSON, CSV, or a database.

json
[
  {
    "name": "L. A. Computer Works",
    "phone": "(310) 277-9799",
    "address": "2355 Westwood Blvd, Los Angeles, CA 90064",
    "category": "Computer Technical Assistance and Support Services",
    "website": "http://lacomputerworks.com"
  },
  {
    "name": "Desktop Conquest",
    "phone": "(213) 321-1869",
    "address": "Los Angeles, CA 90057",
    "category": "Computer System Designers and Consultants",
    "website": null
  }
]

Listings with no claimed website come back with "website": null, which is expected and exactly why the parser reads each field defensively rather than assuming every element is present on every card.
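Since the records are plain dictionaries with the same keys, writing them to CSV needs only the standard library. This is a sketch under that assumption; the write_csv helper and the FIELDS column order are illustrative choices.

```python
import csv
import io

# Column order for the export; matches the keys produced by the parser
FIELDS = ["name", "phone", "address", "category", "website"]


def write_csv(rows, fileobj):
    """Write listing dictionaries to an open text file object as CSV."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        # Missing or None values become empty cells
        writer.writerow(
            {key: "" if row.get(key) is None else row[key] for key in FIELDS}
        )


if __name__ == "__main__":
    buffer = io.StringIO()
    write_csv(
        [{"name": "Desktop Conquest", "address": "Los Angeles, CA 90057"}],
        buffer,
    )
    print(buffer.getvalue())
```

To write a real file, open it with open("yellow_pages.csv", "w", newline="", encoding="utf-8") and pass the file object in; newline="" is what the csv module expects so that row endings are not doubled on Windows.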

Step 4: Handle pagination across result pages

One page is a demo; a real job covers the full result set. Because Yellow Pages exposes the result page through the page URL parameter, walking the pages is a loop over an integer range. The same build_url and extract_listings functions carry over without changes, so pagination is just an outer loop that paces itself between requests.

python
import json
import time

def scrape_all_pages(query, location, max_pages):
    all_listings = []
    for page in range(1, max_pages + 1):
        url = build_url(query, location, page)
        html = crawl(url)
        if not html:
            print(f"Stopping at page {page}: no HTML")
            break
        listings = extract_listings(html)
        if not listings:
            print(f"No results on page {page}; reached the end")
            break
        all_listings.extend(listings)
        print(f"Page {page}: {len(listings)} listings")
        time.sleep(2)
    return all_listings

if __name__ == "__main__":
    rows = scrape_all_pages("Information Technology", "Los Angeles, CA", max_pages=5)
    with open("yellow_pages.json", "w") as f:
        json.dump(rows, f, indent=2)
    print(f"Saved {len(rows)} listings")

Two details make this loop production-friendly. It stops early when a page returns no listings, so you do not waste requests past the last real page, and it sleeps for two seconds between requests so the run does not arrive as one tight burst. Tune max_pages and the sleep to your volume; the slower you go, the less attention you draw.
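If you want the pacing to look less mechanical than a fixed two-second sleep, one variation is to add a small random jitter around the base delay. This is an optional swap for the time.sleep(2) call, not part of the scraper above; the function names and the default values are assumptions to tune.

```python
import random
import time


def jittered_delay(base=2.0, jitter=0.5, rng=random):
    """Pick a delay between base - jitter and base + jitter, never below zero."""
    return max(0.0, rng.uniform(base - jitter, base + jitter))


def pause(base=2.0, jitter=0.5):
    """Sleep for a jittered delay; call this between page requests."""
    time.sleep(jittered_delay(base, jitter))
```

In scrape_all_pages, replacing time.sleep(2) with pause() would keep the average gap at two seconds while varying each individual wait.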

Staying unblocked

Even with the Crawling API handling IP rotation and the anti-bot layer, a few habits keep a run healthy, and they apply to any directory target.

  • Pace your requests. The sleep above is not cosmetic. A tight loop is the fastest way to get throttled; spreading requests out reads far more like normal traffic.
  • Lean on rotation. A pool of residential IPs spreads requests across many real-user addresses so no single one trips a rate limit. The Crawling API does this for you; if you build your own stack, this is the part to get right. The deeper background is in the guide to rotating IP addresses.
  • Read the status codes. A run that starts returning challenges or errors is telling you the current rate is too aggressive. Treat that as a signal to back off, not noise to ignore.
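The "back off" habit in the list above can be sketched as a small retry wrapper that waits longer after each failed attempt. It assumes a fetch callable that returns HTML on success and None on failure, which is how crawl behaves in this guide; fetch_with_backoff and its defaults are hypothetical.

```python
import time


def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until it returns HTML, doubling the wait after each failure.

    fetch is any callable that returns HTML or None.
    sleep is injectable so the wrapper can be exercised without real waiting.
    """
    for attempt in range(max_attempts):
        html = fetch(url)
        if html:
            return html
        # Wait 2s, 4s, 8s, ... but do not sleep after the final attempt
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return None
```

Inside the pagination loop, html = fetch_with_backoff(crawl, url) would give a struggling page a few spaced-out retries before the loop gives up on it.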

For the broader playbook, see how to scrape websites without getting blocked and the deeper dive on how to bypass CAPTCHAs while web scraping. When you scale this to many queries and locations, the patterns in large-scale web scraping cover queueing and storage. If you would rather route your own traffic through a rotating pool than use the managed API, the Smart AI Proxy gives you the same residential rotation as a drop-in proxy endpoint.

Legal and ethical considerations

Whether scraping Yellow Pages is allowed depends on the site's terms of service, your jurisdiction, and what you do with the data. None of the code here changes that; it only makes the technical part work. Read the Yellow Pages Terms of Service and its robots.txt, and treat both as the boundary for what you collect and how fast.
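Reading robots.txt can be partly automated with the standard library's urllib.robotparser. The sketch below checks a URL against robots.txt text you have already downloaded. The SAMPLE_ROBOTS rules are made up for illustration and are not the actual Yellow Pages file, and allowed_by_robots is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only; not the real Yellow Pages robots.txt
SAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed_by_robots(SAMPLE_ROBOTS, "https://example.com/search"))
    print(allowed_by_robots(SAMPLE_ROBOTS, "https://example.com/private/data"))
```

A robots.txt check covers only one of the two boundaries named above; the terms of service still have to be read by a person.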

A few lines are worth holding to. Collect only public business-directory data: the business name, the public phone number, the public address, and the category that anyone can see without signing in. Respect the site's stated rate expectations and keep your request volume reasonable so you are not straining its servers. This guide is deliberately scoped to that public surface because it is the line that keeps the work defensible.

What this approach does not cover is just as important. It does not touch anything behind a login, and it does not bypass authentication or any access control to reach gated content; that is out of scope here and runs against the site's terms. And note that aggregating business contact data can carry separate legal obligations depending on your jurisdiction, even when each field is individually public, so if you plan to store, enrich, or commercially reuse a contact dataset, check the rules that apply to you rather than assuming public means unrestricted.

Recap

Key takeaways

  • Yellow Pages is a structured directory. Each search result is a card with a name, public phone, public address, category, and an optional website, driven by the search_terms and geo_location_terms URL parameters.
  • A plain fetch struggles at volume. Rate limits, CAPTCHAs, and IP blocks stop a naive loop; the Crawling API routes through residential IPs and returns ready-to-parse HTML in one call.
  • BeautifulSoup does the extraction. Map name, phone, address, category, and website to current selectors, read each field defensively, and expect those selectors to drift.
  • Pagination is a loop over the page parameter. Reuse the same parser across pages, stop early on an empty page, and sleep between requests to pace the run.
  • Stay on public data. Respect the ToS and robots.txt, never touch login-gated content, and remember that aggregating contact data can carry its own obligations by jurisdiction.

Frequently Asked Questions (FAQs)

Do I need the normal token or the JS token for Yellow Pages?

The normal token. Yellow Pages serves its listing data in the initial HTML, so a normal-token fetch returns parseable markup and keeps each request cheaper. The JS token renders the page in a real browser first, which you only need when a target loads its listings client-side after the page arrives. Start with the normal token and switch only if fields come back empty across the board.

How do I handle pagination on Yellow Pages?

Yellow Pages exposes the result page through a page URL parameter, so you loop over an integer range, build a URL per page, and run the same parser on each. Stop when a page returns zero listings, which marks the end of the result set, and sleep a couple of seconds between requests so the run does not arrive as one burst.

My selectors return None. What changed?

Almost certainly the Yellow Pages markup. Class names like result, business-name, and track-visit-website change without notice, so selectors that worked last month can break. Re-inspect a live results page in your browser's dev tools and update the selectors. Periodic selector maintenance is normal for any production scraper, not a sign something is broken.

Why do some listings have no website?

Not every business claims or links its own site on Yellow Pages, so the website anchor is simply absent on those cards. The parser reads the field defensively and stores None rather than throwing, so a missing website leaves a clean null in the record and the loop continues to the next listing.

How do I avoid getting blocked while scraping Yellow Pages?

Keep your per-IP request rate low, pace requests with a delay, and route through rotating residential IPs so no single address trips a rate limit. The Crawling API manages rotation and the anti-bot layer for you; if you build your own stack, that is the part to invest in. Watch the status codes and back off the moment you start seeing challenges.

Can I export the scraped data to Excel?

Yes. The scraper produces a list of dictionaries, which pandas turns into a spreadsheet in two lines, the import and pd.DataFrame(rows).to_excel("yellow_pages.xlsx", index=False). Writing .xlsx files also needs an Excel engine such as openpyxl installed alongside pandas. Because every record shares the same keys, the columns line up cleanly, and the same structure exports just as easily to CSV or a database table.
