SuperPages is one of the largest online business directories in the US, with millions of companies indexed by industry and location. Each listing carries the kind of public, structured detail that sales and marketing teams want for a prospect list: a business name, the category it files under, a street address, a public phone number, and often a link to the company's own website. For building a regional dataset of service providers or seeding a B2B outreach campaign, that public directory data is exactly the raw material you need.
This guide shows you how to scrape SuperPages business listings with Python the reliable way. You fetch search-result pages through the Crawling API, parse each result with BeautifulSoup to pull the name, category, address, phone, website, and detail-page link, then walk the pagination to cover a full result set and export the records to JSON or CSV. Everything here stays scoped to public business-directory data, and the legality section near the end covers the obligations that attach to B2B lead data, so read it before you point this at real volume.
What you will build
A small Python scraper that takes a search query and a location, retrieves the SuperPages search-results page through the Crawling API, and extracts a structured record for every business on the page. The running example is "Home Services" businesses in "Los Angeles, CA", and for each listing we pull these fields:
- Business name: the primary identifier you group leads on.
- Category: the industry the listing files under, used to segment leads.
- Address: the public street address, including city, state, and zip.
- Phone: the public contact number shown on the listing card.
- Website: the link to the business's own site, when one is listed.
- Detail page link: the URL of the business's dedicated SuperPages page.
Why a plain request fails on SuperPages
You can hit a SuperPages search URL with the requests library and, on a good day, get HTML back. The problem shows up at volume. SuperPages deploys anti-scraping defenses: it rate-limits by IP, serves CAPTCHAs to traffic that looks automated, and blocks datacenter addresses that request pages in a tight, machine-shaped pattern. A single request from your laptop might succeed; a few hundred from the same IP will not.
So a scraper that actually finishes the job needs requests that read as a real visitor coming from a trusted IP. You can build that yourself with a pool of rotating residential proxies and the plumbing to keep them healthy, but maintaining that stack is most of the work. The Crawling API folds it into a single call: you send it the URL, it routes the request through residential IPs server-side and handles the anti-bot layer, and it returns the HTML for you to parse.
Crawlbase offers two token types. The normal token fetches static HTML; the JavaScript (JS) token renders the page in a real browser first. SuperPages serves its listing data in the initial HTML, so the normal token is the right choice here and keeps each request cheaper. Reach for the JS token only if a target starts rendering listings client-side.
Prerequisites
A few things to have in place first. None take long.
Basic Python. You should be comfortable running a script and installing packages with pip. If selectors are new to you, the primer on how to use BeautifulSoup in Python covers the parsing side in depth.
Python 3.8 or later. Confirm with python --version. If you do not have it, install it from python.org or through a distribution like Anaconda.
A Crawlbase account and token. Sign up, open your dashboard, and copy your normal token from the account docs page. The first 1,000 requests are free and no card is required. Treat the token like a password and keep it out of version control.
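One way to keep the token out of version control is to read it from an environment variable at startup. The sketch below assumes a variable named CRAWLBASE_TOKEN; that name is chosen for this example and is not something the Crawlbase client looks up on its own.

```python
import os


def load_token(env=None, var_name="CRAWLBASE_TOKEN"):
    """Return the token from the environment, or fail loudly if it is unset.

    CRAWLBASE_TOKEN is a hypothetical variable name chosen for this example.
    """
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set {var_name} before running the scraper")
    return token
```

Export the variable in your shell, then pass load_token() wherever the examples below use the "YOUR_CRAWLBASE_TOKEN" placeholder.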
Set up the project
Create a virtual environment so dependencies stay isolated, then install the two libraries the scraper needs.
python --version
python -m venv superpages_env
source superpages_env/bin/activate
pip install crawlbase beautifulsoup4
On Windows, activate the environment with superpages_env\Scripts\activate instead of the source line. Two dependencies do the work: crawlbase is the official client for the Crawling API, and beautifulsoup4 parses the returned HTML so you can pull each field by CSS selector.
Step 1: Fetch a search results page
Start by getting one results page back. Build the search URL from your query and location, import the CrawlingAPI class, initialize it with your token, and request the URL. Checking the status before you parse keeps failures loud instead of silent.
from urllib.parse import urlencode

from crawlbase import CrawlingAPI

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})


def build_url(query, location, page=1):
    # urlencode escapes spaces and commas in the query and location.
    base = "https://www.superpages.com/search?"
    params = {"search_terms": query, "geo_location_terms": location, "page": page}
    return base + urlencode(params)


def crawl(page_url):
    response = api.get(page_url)
    # pc_status is the Crawlbase status for the request, returned as a string.
    if response["headers"]["pc_status"] == "200":
        return response["body"].decode("utf-8")
    print(f"Request failed: {response['headers']['pc_status']}")
    return None


if __name__ == "__main__":
    url = build_url("Home Services", "Los Angeles, CA")
    html = crawl(url)
    # Print only the first 500 characters as a quick sanity check.
    print(html[:500] if html else "No HTML returned")
Note that the status check reads pc_status from the response headers, which is the Crawlbase status for the request, distinct from the upstream HTTP code. Save the script as scraper.py, run it with python scraper.py, and you should see real results markup rather than a challenge page. That confirms the fetch path works before you write a single selector.
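To see what the request URL looks like before spending a request, you can run the URL builder on its own. This sketch repeats build_url from the step above so it runs standalone; urlencode turns spaces into + and percent-encodes the comma.

```python
from urllib.parse import urlencode


def build_url(query, location, page=1):
    # Same builder as in the fetch script, repeated so this sketch is standalone.
    base = "https://www.superpages.com/search?"
    params = {"search_terms": query, "geo_location_terms": location, "page": page}
    return base + urlencode(params)


print(build_url("Home Services", "Los Angeles, CA"))
# https://www.superpages.com/search?search_terms=Home+Services&geo_location_terms=Los+Angeles%2C+CA&page=1
print(build_url("Plumbers", "San Diego, CA", page=3))
```

Pasting the printed URL into a browser is a quick way to confirm the query and location resolve to the results you expect.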
SuperPages rate-limits by IP and challenges scraper-shaped traffic, which is exactly the friction that motivated the fetch step above. The Crawling API routes each request through rotating residential IPs server-side, handles CAPTCHAs and blocks, and hands back ready-to-parse HTML, so you skip running a headless browser fleet and a proxy pool yourself. Point it at a public search page on the free tier first.
Step 2: Parse the listings with BeautifulSoup
With a results page in hand, load it into BeautifulSoup and walk the result cards. Each business is a self-contained card under a predictable container, and within it the name, category, address, phone, website, and detail-page link map to their own selectors. Reading each field defensively, returning an empty string when an element is missing, keeps one absent value from crashing the run.
from bs4 import BeautifulSoup

BASE = "https://www.superpages.com"


def extract_listings(html):
    soup = BeautifulSoup(html, "html.parser")
    listings = []
    # Each direct child div.result of the results container is one business card.
    for business in soup.select("div.search-results > div.result"):
        name_el = business.select_one("a.business-name span")
        category_el = business.select_one("div.categories")
        address_el = business.select_one("span.street-address")
        phone_el = business.select_one("a.phone.primary")
        website_el = business.select_one("a.weblink-button")
        link_el = business.select_one("a.business-name")
        # A missing element yields an empty string instead of raising.
        listings.append({
            "name": name_el.text.strip() if name_el else "",
            "category": category_el.text.strip() if category_el else "",
            "address": address_el.text.strip() if address_el else "",
            "phone": phone_el.text.strip() if phone_el else "",
            "website": website_el["href"] if website_el else "",
            # The detail link is a relative path, so prefix it with the host.
            "detail_page_link": (BASE + link_el["href"]) if link_el else "",
        })
    return listings
The selectors come straight from the SuperPages card markup: the business name sits in a span inside an a.business-name anchor, the category in div.categories, the address in span.street-address, the phone in a.phone.primary, and the outbound website in a.weblink-button. The detail-page link reuses the same a.business-name anchor and is a relative path, so it is prefixed with the BASE host to make a full URL. Each field is guarded with an if ... else "" so a missing element leaves an empty string in the record rather than throwing.
The class names above (result, business-name, street-address, phone primary, weblink-button) reflect the current SuperPages markup, and that markup changes without notice. Treat the selectors as a starting template, not a contract. When a field comes back empty across every listing, re-inspect a live results page in your browser's dev tools and update the selector. Periodic selector maintenance is normal for any production scraper.
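The "empty across every listing" signal can be checked automatically after each page. The helper below is a minimal sketch, not part of the original scraper: empty_fields is a hypothetical name, and it only assumes the records are dictionaries of strings like the ones extract_listings returns.

```python
def empty_fields(listings, fields):
    """Return the fields that are blank in every record of a non-empty batch."""
    if not listings:
        return []
    return [
        field
        for field in fields
        if all(not record.get(field, "").strip() for record in listings)
    ]


sample = [
    {"name": "A", "phone": "", "website": ""},
    {"name": "B", "phone": "", "website": "https://b.example"},
]
print(empty_fields(sample, ["name", "phone", "website"]))  # ['phone']
```

A field that is blank in some records, like website here, is expected. A field that is blank in all of them is the cue to re-inspect the selector.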
Step 3: Handle pagination across result pages
One page is a demo; a real lead list covers the full result set. SuperPages exposes the result page through the page URL parameter, so walking the pages is a loop over an integer range. The same build_url and extract_listings functions carry over without changes, so pagination is just an outer loop that paces itself between requests.
import time


def scrape_all_pages(query, location, max_pages):
    all_listings = []
    for page in range(1, max_pages + 1):
        print(f"Scraping page {page}...")
        url = build_url(query, location, page)
        html = crawl(url)
        if not html:
            print(f"Stopping at page {page}: no HTML")
            break
        listings = extract_listings(html)
        if not listings:
            # An empty page marks the end of the result set.
            print(f"No results on page {page}; reached the end")
            break
        all_listings.extend(listings)
        # Pause between requests so the run does not arrive as one burst.
        time.sleep(2)
    return all_listings
Two details make this loop production-friendly. It stops early when a page returns no listings, so you do not waste requests past the last real page, and it sleeps for two seconds between requests so the run does not arrive as one tight burst. Tune max_pages and the sleep to your volume; the slower you go, the less attention you draw.
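One way to tune the pause is to add a little randomness so requests do not land on an exact two-second beat. This is an optional variation, not something the loop above does; polite_delay and pause are hypothetical helper names, and the one-second jitter is an arbitrary starting value.

```python
import random
import time


def polite_delay(base=2.0, jitter=1.0, rng=random):
    """Return a delay between base and base + jitter seconds."""
    return base + rng.uniform(0.0, jitter)


def pause(base=2.0, jitter=1.0):
    """Sleep for a randomized delay; call this in place of time.sleep(2)."""
    time.sleep(polite_delay(base, jitter))
```

Raising base slows the whole run; raising jitter only widens the spread around it.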
Step 4: Put it together and export
Now wire the fetch, the parse, and the pagination into one runnable script, then write the records to both a JSON file and a CSV so the lead list drops straight into a spreadsheet or a CRM import.
import csv
import json
import time
from urllib.parse import urlencode

from bs4 import BeautifulSoup
from crawlbase import CrawlingAPI

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

BASE = "https://www.superpages.com"
FIELDS = ["name", "category", "address", "phone", "website", "detail_page_link"]


def build_url(query, location, page=1):
    base = "https://www.superpages.com/search?"
    params = {"search_terms": query, "geo_location_terms": location, "page": page}
    return base + urlencode(params)


def crawl(page_url):
    response = api.get(page_url)
    # pc_status is the Crawlbase status for the request, returned as a string.
    if response["headers"]["pc_status"] == "200":
        return response["body"].decode("utf-8")
    print(f"Request failed: {response['headers']['pc_status']}")
    return None


def extract_listings(html):
    soup = BeautifulSoup(html, "html.parser")
    listings = []
    for business in soup.select("div.search-results > div.result"):
        name_el = business.select_one("a.business-name span")
        category_el = business.select_one("div.categories")
        address_el = business.select_one("span.street-address")
        phone_el = business.select_one("a.phone.primary")
        website_el = business.select_one("a.weblink-button")
        link_el = business.select_one("a.business-name")
        # A missing element yields an empty string instead of raising.
        listings.append({
            "name": name_el.text.strip() if name_el else "",
            "category": category_el.text.strip() if category_el else "",
            "address": address_el.text.strip() if address_el else "",
            "phone": phone_el.text.strip() if phone_el else "",
            "website": website_el["href"] if website_el else "",
            "detail_page_link": (BASE + link_el["href"]) if link_el else "",
        })
    return listings


def scrape_all_pages(query, location, max_pages):
    all_listings = []
    for page in range(1, max_pages + 1):
        print(f"Scraping page {page}...")
        html = crawl(build_url(query, location, page))
        if not html:
            break
        listings = extract_listings(html)
        if not listings:
            break  # an empty page marks the end of the result set
        all_listings.extend(listings)
        time.sleep(2)
    return all_listings


def save_json(data, filename="superpages_listings.json"):
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)


def save_csv(data, filename="superpages_listings.csv"):
    # newline="" stops the csv module from writing blank rows on Windows.
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(data)


def main():
    rows = scrape_all_pages("Home Services", "Los Angeles, CA", max_pages=5)
    save_json(rows)
    save_csv(rows)
    print(f"Saved {len(rows)} listings")


if __name__ == "__main__":
    main()
Because every record shares the same six keys, the CSV columns line up cleanly and csv.DictWriter writes them without any extra mapping. Swap the query and location at the bottom to target a different industry or city, and raise max_pages when you want a deeper sweep of one search.
What the output looks like
Run the full script with python scraper.py and it writes the records to superpages_listings.json and superpages_listings.csv. The JSON file looks like this:
[
    {
        "name": "Evergreen Cleaning Systems",
        "category": "House Cleaning",
        "address": "3325 Wilshire Blvd Ste 622, Los Angeles, CA 90010",
        "phone": "213-375-1597",
        "website": "https://www.evergreencleaningsystems.com",
        "detail_page_link": "https://www.superpages.com/los-angeles-ca/bpp/evergreen-cleaning-systems-540709574"
    },
    {
        "name": "Any Day Anytime Cleaning Service",
        "category": "House Cleaning",
        "address": "27612 Cherry Creek Dr, Santa Clarita, CA 91354",
        "phone": "661-297-2702",
        "website": "",
        "detail_page_link": "https://www.superpages.com/santa-clarita-ca/bpp/any-day-anytime-cleaning-service-513720439"
    }
]
Listings with no claimed website come back with "website": "", which is expected and exactly why the parser reads each field defensively rather than assuming every key is present. From here the data is ready for de-duplication, enrichment, or import into your outreach tooling. For the broader workflow around turning these records into a campaign, see the guide on web crawling for lead generation.
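De-duplication can be sketched with a normalized key. The version below assumes two records describe the same business when their lowercased name and the digits of their phone number match; that matching rule is an assumption for illustration, and dedupe_key and dedupe are hypothetical helper names.

```python
import re


def dedupe_key(record):
    """Build a comparison key from the name and the digits of the phone number."""
    name = " ".join(record.get("name", "").lower().split())
    phone = re.sub(r"\D", "", record.get("phone", ""))
    return (name, phone)


def dedupe(listings):
    """Keep the first record seen for each (name, phone) key, preserving order."""
    seen = set()
    unique = []
    for record in listings:
        key = dedupe_key(record)
        if key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique
```

Run it on the combined list before exporting, for example rows = dedupe(rows), so a business that appears in overlapping searches is written once.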
Scaling to more queries and locations
A single search covers one industry in one city. A real prospecting dataset usually crosses many of both, so the natural next step is to drive the scraper from a list of query and location pairs rather than hard-coded strings.
searches = [
    ("Home Services", "Los Angeles, CA"),
    ("Plumbers", "San Diego, CA"),
    ("Electricians", "Phoenix, AZ"),
]

all_rows = []
for query, location in searches:
    all_rows.extend(scrape_all_pages(query, location, max_pages=3))

save_json(all_rows)
save_csv(all_rows)
The extend call appends every search into one flat list, so the export step stays unchanged. When the matrix of queries and locations grows large, move the work off a single synchronous loop and onto a queue. The async Crawler takes URLs in bulk and pushes results back as they finish, which is a better fit than blocking on each page once you are scraping thousands of searches.
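The async Crawler's own interface is not shown in this guide, so the sketch below does not use it. As a stand-in for an intermediate step, it fans the query and location pairs out over a small thread pool from the standard library. Here scrape_many is a hypothetical helper, scrape_fn is a placeholder for scrape_all_pages, and max_workers=3 is an arbitrary cap.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_many(searches, scrape_fn, max_pages=3, max_workers=3):
    """Run scrape_fn(query, location, max_pages) for each pair on a bounded pool."""
    all_rows = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(scrape_fn, query, location, max_pages)
            for query, location in searches
        ]
        # Collect in submission order so the output order is predictable.
        for future in futures:
            all_rows.extend(future.result())
    return all_rows
```

Calling scrape_many(searches, scrape_all_pages) returns the same flat list as the loop above. Keep the worker count low: every extra worker multiplies the request rate, which works against the pacing the rest of the guide recommends.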
Staying unblocked
Even with the Crawling API handling IP rotation and the anti-bot layer, a few habits keep a run healthy, and they apply to any directory target.
- Pace your requests. The two-second sleep is not cosmetic. A tight loop is the fastest way to get throttled; spreading requests out reads far more like normal traffic.
- Lean on rotation. A pool of residential IPs spreads requests across many real-user addresses so no single one trips a rate limit. The Crawling API does this for you; if you build your own stack, this is the part to get right.
- Read the status codes. A run that starts returning challenges or errors is telling you the current rate is too aggressive. Treat that as a signal to back off, not noise to ignore.
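Backing off on failures can be sketched as a retry wrapper with a doubling delay. The helper below assumes a fetch function that returns HTML on success and None on failure, which is how crawl behaves in this guide; fetch_with_backoff is a hypothetical name, and the four attempts and two-second base delay are arbitrary defaults.

```python
import time


def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until it returns HTML, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        html = fetch(url)
        if html:
            return html
        if attempt < max_attempts - 1:
            # Waits of 2s, 4s, 8s with the defaults.
            sleep(base_delay * (2 ** attempt))
    return None
```

In the pagination loop, html = fetch_with_backoff(crawl, url) would replace the direct crawl(url) call, so a transient failure slows the run down instead of ending it.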
For the broader playbook, see how to scrape websites without getting blocked. If you want the same approach applied to a neighbouring directory, the walkthroughs on scraping Yellow Pages and scraping local business listings follow the same fetch-parse-paginate shape with different selectors.
Is it legal to scrape SuperPages?
Whether scraping SuperPages is allowed depends on the site's terms of service, your jurisdiction, and what you do with the data. None of the code here changes that; it only makes the technical part work. Read the SuperPages Terms of Service and its robots.txt, and treat both as the boundary for what you collect and how fast. Everything in this guide is scoped to public B2B business-directory data: a company name, its category, a public street address, a public phone number, and a link to its own website. That is information any visitor can see without signing in, and it describes businesses rather than private individuals.
The legal weight shifts when you act on that data. Business contact details are still subject to privacy and anti-spam law in many regions. Under the GDPR, a named contact at a small business can count as personal data, so you need a lawful basis to store and process it, and people retain the right to object and be removed. In the US, the CAN-SPAM Act governs commercial email: you must identify yourself honestly, avoid deceptive subject lines, and honor opt-out requests promptly. Cold-calling rules and do-not-call registries apply to phone outreach in the same spirit. Collecting the data is one thing; using it for outreach is where these obligations bite, so build opt-out handling and suppression lists in from the start rather than bolting them on later.
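Suppression filtering can start as a simple lookup applied before any record reaches your outreach tooling. The sketch below assumes opt-outs are stored as a list of phone numbers; normalize_phone and apply_suppression are hypothetical helper names, and a real pipeline would likely also match on email address or domain.

```python
import re


def normalize_phone(phone):
    """Reduce a phone number to its digits so formatting differences do not matter."""
    return re.sub(r"\D", "", phone or "")


def apply_suppression(listings, suppressed_phones):
    """Drop every record whose phone number appears on the suppression list."""
    blocked = {normalize_phone(p) for p in suppressed_phones}
    blocked.discard("")  # an empty entry must not suppress records without a phone
    return [
        record
        for record in listings
        if normalize_phone(record.get("phone", "")) not in blocked
    ]
```

Applying the filter at export time, for example rows = apply_suppression(rows, opted_out), means a later re-scrape cannot quietly reintroduce a business that asked to be removed.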
What this approach does not cover is just as important. It does not touch anything behind a login, and it does not bypass authentication or any access control to reach gated content; that is out of scope here and runs against the site's terms. If SuperPages offers an official API or a licensed data feed for the volume you need, prefer it: a sanctioned source removes the ambiguity entirely. When in doubt about a commercial use of an aggregated contact dataset, check the rules that apply to you rather than assuming public means unrestricted.
Key takeaways
- SuperPages is a structured B2B directory. Each search result is a card with a business name, category, public address, public phone, an optional website, and a detail-page link, driven by the search_terms and geo_location_terms URL parameters.
- A plain fetch struggles at volume. Rate limits, CAPTCHAs, and IP blocks stop a naive loop; the Crawling API routes through residential IPs and returns ready-to-parse HTML in one call.
- BeautifulSoup does the extraction. Map name, category, address, phone, website, and link to the current selectors, read each field defensively, and expect those selectors to drift.
- Pagination is a loop over the page parameter. Reuse the same parser across pages, stop early on an empty page, sleep between requests, and export to JSON and CSV.
- Lawful outreach is on you. The data is public, but GDPR and CAN-SPAM still apply to how you contact these businesses, so you need a lawful basis and must honor opt-outs.
Frequently Asked Questions (FAQs)
Do I need the normal token or the JS token for SuperPages?
The normal token. SuperPages serves its listing data in the initial HTML, so a normal-token fetch returns parseable markup and keeps each request cheaper. The JS token renders the page in a real browser first, which you only need when a target loads its listings client-side after the page arrives. Start with the normal token and switch only if fields come back empty across the board.
How do I handle pagination on SuperPages?
SuperPages exposes the result page through a page URL parameter, so you loop over an integer range, build a URL per page, and run the same parser on each. Stop when a page returns zero listings, which marks the end of the result set, and sleep a couple of seconds between requests so the run does not arrive as one burst.
My selectors return empty values. What changed?
Almost certainly the SuperPages markup. Class names like result, business-name, street-address, and weblink-button change without notice, so selectors that worked last month can break. Re-inspect a live results page in your browser's dev tools and update the selectors. Periodic selector maintenance is normal for any production scraper, not a sign something is broken.
How do I export the leads to CSV or Excel?
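The scraper already writes a CSV with csv.DictWriter, since every record shares the same keys. For Excel, pandas turns the same list of dictionaries into a spreadsheet in two lines: import pandas as pd, then pd.DataFrame(rows).to_excel("superpages_listings.xlsx", index=False). Writing .xlsx files this way needs pandas plus an Excel engine such as openpyxl, neither of which is installed in the setup step above. The columns line up cleanly because the field set is fixed.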
Can I scrape the individual business detail pages too?
Yes. Each record carries a detail_page_link, so you can feed those URLs back through the same crawl function and parse the dedicated page for extra fields like operating hours or a longer contact block. Pace that second pass the same way, since it doubles your request count, and keep it scoped to the public business information on the page.
How do I keep my outreach compliant?
Treat the scraped list as a starting point, not a green light. Confirm you have a lawful basis to contact each business under the rules in your region, identify yourself honestly in every message, and wire opt-out handling and a suppression list into your sending pipeline from day one. CAN-SPAM governs the messages you send, while GDPR also covers storing and processing any personal data in the list, so the compliance work covers both how you hold the data and how you use it.
