GoodFirms is a B2B directory that connects buyers with IT service providers, software companies, and agencies. Each public listing carries the structured fields that drive competitor research, market sizing, and partner discovery: company name, rating, the services offered, location, hourly rate band, and a link to the full profile. For anyone mapping a vertical or building a vendor shortlist, that public directory data is the raw material, and collecting it by hand across dozens of agencies is slow and error-prone.
This guide shows you how to scrape GoodFirms with Python the reliable way. You build a small, runnable scraper that fetches rendered GoodFirms pages through the Crawling API, collects company records from a category listing, parses the fields with BeautifulSoup, handles pagination, and exports clean JSON and CSV. The whole walkthrough stays scoped to public business listing data, and the legality section near the end is not boilerplate, so read it before you point this at any real volume.
What you will build
A Python script that takes a public GoodFirms category URL, walks the paginated search listings, extracts a structured record per company, then drills into individual profiles for the deeper fields. The running example is web development agencies in London. We pull these fields:
- Company name: the listed business name on the directory card.
- Rating: the public review score on the listing.
- Services: the service category or tagline the company is listed under.
- Location: the city and country shown on the card.
- Hourly rate: the rate band from the company profile.
- Profile URL: the canonical link to the full profile page.
Why a plain request fails on GoodFirms
Request a GoodFirms category or profile URL with a bare HTTP client and you often get status 200 with only a fraction of the listing data in the body. Two things work against you. First, GoodFirms loads much of its directory grid and profile detail in the browser through JavaScript, so the initial HTML is a thin shell that fills in only after the page's scripts run. Pull the company cards out of that first response and you can capture a partial set or miss late-rendering fields. Second, a busy B2B directory watches for automated traffic: datacenter IPs and non-browser request patterns get rate-limited, blocked, or challenged before they reach the rendered content.
So a working GoodFirms scraper needs two things in one request: a browser that renders the page, and an IP the platform reads as a real visitor. You can assemble that yourself with a headless browser plus rotating residential proxies, but keeping that healthy is most of the work. The Crawling API folds both into a single call: send it the URL with a JavaScript token, it renders the page behind a trusted IP, and returns finished HTML to parse. For the background on why client-side rendering breaks naive scrapers, the guide to crawling JavaScript websites covers it in depth.
Crawlbase offers two token types. The normal token fetches static HTML; the JavaScript (JS) token renders the page in a real browser first. Because GoodFirms fills parts of its directory and profile pages client-side, the JS token is the safe default here: it returns the finished markup rather than the thin shell a plain fetch would, so there is something useful for BeautifulSoup to parse.
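To tell a thin shell from a rendered page, one option is to count company cards in the response. The sketch below uses only the standard library. The firm-wrapper class name comes from the selectors used later in this guide and is an assumption about the live markup, and the two HTML strings are made-up stand-ins, not real GoodFirms responses.

```python
from html.parser import HTMLParser


class CardCounter(HTMLParser):
    """Counts <li> elements whose class list includes a given class name."""

    def __init__(self, card_class="firm-wrapper"):
        super().__init__()
        self.card_class = card_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "li":
            return
        # An attribute without a value arrives as None, hence the "or".
        classes = (dict(attrs).get("class") or "").split()
        if self.card_class in classes:
            self.count += 1


def count_cards(html, card_class="firm-wrapper"):
    counter = CardCounter(card_class)
    counter.feed(html)
    return counter.count


# Hypothetical stand-ins for a thin shell and a rendered page.
shell_html = '<ul class="firm-directory-list"></ul><script src="app.js"></script>'
rendered_html = (
    '<ul class="firm-directory-list">'
    '<li class="firm-wrapper">A</li><li class="firm-wrapper featured">B</li>'
    "</ul>"
)

print(count_cards(shell_html))     # 0
print(count_cards(rendered_html))  # 2
```

A count of zero on a page that shows cards in your browser suggests you received the shell, or that the class name has changed.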
Prerequisites
You need a few things in place before writing any code. None of them take long.
Basic Python. You should be comfortable writing and running a Python script and installing packages with pip. If you are new to the parsing side, the BeautifulSoup guide is a good companion to this tutorial, and the broader scrape a website with Python walkthrough covers the fundamentals.
Python 3.8 or later. Confirm your version with python --version. If you do not have it, install it from python.org or through a distribution like Anaconda, and make sure Python is on your PATH.
A Crawlbase account and JS token. Sign up, open your dashboard, and copy your JavaScript (JS) token from the account docs page. Crawlbase includes 1,000 free requests to start, which is plenty for working through this guide. Treat the token like a password: it authenticates your requests, so keep it out of version control.
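One common way to keep the token out of version control is to read it from an environment variable instead of hardcoding it. This is a minimal sketch; the variable name CRAWLBASE_JS_TOKEN and the load_token helper are hypothetical choices for illustration, not something Crawlbase prescribes.

```python
import os


def load_token(var_name="CRAWLBASE_JS_TOKEN"):
    """Return the token from the environment, or raise a clear error."""
    token = os.environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {var_name} environment variable before running the scraper."
        )
    return token
```

With this in place, the client in the steps below could be initialized as CrawlingAPI({"token": load_token()}) rather than with a literal string.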
Set up the project
Create a virtual environment so project dependencies stay isolated, then install the libraries the scraper needs.
```bash
python --version
python -m venv goodfirms_env
source goodfirms_env/bin/activate
pip install crawlbase beautifulsoup4
```
On Windows, activate the environment with goodfirms_env\Scripts\activate instead of the source line. Two dependencies do the work: crawlbase is the official client for the Crawling API, and beautifulsoup4 parses the returned HTML so you can pull out individual fields by CSS selector. Both json and csv ship with the standard library, so there is nothing more to install for the export step.
Step 1: Fetch a rendered GoodFirms page
Start by getting a finished page. Import the CrawlingAPI class, initialize it with your JS token, and request a GoodFirms category URL. Pass ajax_wait and page_wait so the API holds for the dynamic content before the page is captured. Checking the Crawlbase pc_status before you parse keeps failures loud instead of silent.
```python
from crawlbase import CrawlingAPI

# Replace the placeholder with your Crawlbase JS token.
api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

OPTIONS = {
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/122.0",
    "ajax_wait": "true",  # wait for asynchronous content
    "page_wait": 5000,    # milliseconds to hold after load
}


def crawl(page_url):
    """Fetch a rendered page; return the HTML text, or None on failure."""
    response = api.get(page_url, OPTIONS)
    if response["headers"]["pc_status"] == "200":
        return response["body"].decode("utf-8")
    print(f"Request failed: {response['headers']['pc_status']}")
    return None


if __name__ == "__main__":
    listing_url = "https://www.goodfirms.co/companies/web-development-agency/london"
    html = crawl(listing_url)
    print(html[:500] if html else "No HTML returned")
```
The two wait options matter for a client-rendered target. ajax_wait tells the API to wait for asynchronous content, and page_wait holds a fixed number of milliseconds after load so late-rendering cards appear before capture. Five seconds is a reasonable start; raise it if results come back thin. Save the script as goodfirms_scraper.py, run it with python goodfirms_scraper.py, and you should see real GoodFirms directory markup, not the shell a plain request returns. That confirms rendering works before you write a single selector.
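The advice to raise page_wait when results come back thin can be automated. The sketch below tries progressively longer waits until enough cards appear. The fetch and count_cards callables are hypothetical and injected by the caller: fetch could wrap api.get with a copy of OPTIONS whose page_wait is overridden, and count_cards could count parsed cards. The wait values are arbitrary starting points, and each extra attempt is another API request.

```python
def fetch_with_escalating_wait(fetch, count_cards, url,
                               waits_ms=(5000, 8000, 12000), min_cards=1):
    """Try longer page_wait values until at least min_cards cards render.

    fetch(url, page_wait_ms) -> HTML string or None  (hypothetical callable)
    count_cards(html)        -> int                  (hypothetical callable)
    Returns (best_html, card_count); best_html is None if every fetch failed.
    """
    best_html, best_count = None, -1
    for wait in waits_ms:
        html = fetch(url, wait)
        if not html:
            continue
        found = count_cards(html)
        if found > best_count:
            best_html, best_count = html, found
        if found >= min_cards:
            break
    return best_html, max(best_count, 0)


# Fake callables so the sketch runs offline: cards only "render" at 8000 ms.
def fake_fetch(url, page_wait_ms):
    cards = 0 if page_wait_ms < 8000 else 3
    return "<li>" * cards or "<html></html>"


html, count = fetch_with_escalating_wait(
    fake_fetch, lambda h: h.count("<li>"), "https://example.com/listing"
)
print(count)  # 3
```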
GoodFirms needs a rendered page behind a trusted IP, in one call. The Crawling API takes a JS token, runs the page in a real browser, rotates through residential IPs server-side, and hands you finished HTML, while the ajax_wait and page_wait options above control how long it waits before capture. That way you skip running a headless fleet and a proxy pool yourself. Point it at a public category page on the free tier first.
Step 2: Identify the listing selectors
Before writing the parser, inspect the category page in your browser's dev tools (right-click and choose Inspect, or press Ctrl + Shift + I) to find the elements that wrap each company. On the GoodFirms search listings, each company card lives inside a list item, and the fields map to these selectors:
- Company name sits in an <h3> with the class firm-name.
- Location is a <div> with the class firm-location.
- Service category is a <div> nested under firm-content with the class tagline.
- Rating appears in a <span> with the class rating-number.
- Profile URL is the href of an <a> with the class visit-profile, inside firm-urls.
With those in hand, load the rendered HTML into BeautifulSoup and pull each field per card. Every lookup is guarded so a missing field returns a default instead of crashing the run.
```python
from bs4 import BeautifulSoup


def extract_company(card):
    """Read the five listing fields from one company card."""
    name = card.select_one("h3.firm-name")
    location = card.select_one("div.firm-location")
    category = card.select_one("div.firm-content > div.tagline")
    rating = card.select_one("span.rating-number")
    link = card.select_one("div.firm-urls > a.visit-profile")
    # Each lookup falls back to a default when the element is missing.
    return {
        "name": name.get_text(strip=True) if name else "",
        "location": location.get_text(strip=True) if location else "",
        "category": category.get_text(strip=True) if category else "",
        "rating": rating.get_text(strip=True) if rating else "No rating",
        "profile_url": link["href"] if link else "",
    }


def parse_listings(html):
    """Return one record per company card found in the listing HTML."""
    soup = BeautifulSoup(html, "html.parser")
    cards = soup.select("ul.firm-directory-list > li.firm-wrapper")
    return [extract_company(card) for card in cards]
```
The container selector ul.firm-directory-list > li.firm-wrapper walks from the directory list down to each company card, and extract_company reads the five fields from inside it. The inline guards keep a card that omits, say, a rating from breaking the loop: it falls back to "No rating" instead.
Directory sites revise their markup without notice, and generated class names can change between visits. Treat the selectors here as a starting template, not a contract. When a list comes back empty, re-inspect the live page in your browser's dev tools and update the selector. Periodic selector maintenance is normal for any production scraper, not a sign something is broken.
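A selector that silently stops matching shows up as a column of defaults rather than an error. One way to catch that early is to measure how often each field is actually filled. The sketch below is an illustration only: the helper names, the placeholder values it treats as empty, and the 0.5 threshold are all assumptions you would tune for your own run.

```python
def field_fill_rates(records, empty_values=("", "N/A", "No rating")):
    """Return the share of records with a non-placeholder value per field."""
    if not records:
        return {}
    total = len(records)
    return {
        field: sum(1 for r in records if r.get(field) not in empty_values) / total
        for field in records[0].keys()
    }


def stale_fields(records, threshold=0.5):
    """List fields whose fill rate fell below the threshold."""
    return [f for f, rate in field_fill_rates(records).items() if rate < threshold]


# Made-up records: location is empty everywhere, rating is missing once.
sample = [
    {"name": "A", "location": "", "rating": "4.9"},
    {"name": "B", "location": "", "rating": "No rating"},
]
print(field_fill_rates(sample))  # {'name': 1.0, 'location': 0.0, 'rating': 0.5}
print(stale_fields(sample))      # ['location']
```

A field that drops to a fill rate near zero across a whole page is a strong hint that its selector needs re-inspecting.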
Step 3: Handle pagination across listing pages
One category page is a slice of the result set. GoodFirms paginates with a page query parameter, so you walk each page and gather the records. A small retry wrapper around the fetch keeps a single slow page from ending the run.
```python
import time


def fetch_html(page_url, max_retries=2):
    """Fetch a page, retrying a failed request up to max_retries times."""
    for attempt in range(max_retries + 1):
        html = crawl(page_url)
        if html:
            return html
        if attempt < max_retries:
            print(f"Retrying ({attempt + 1}/{max_retries})...")
            time.sleep(1)
    print(f"Unable to fetch {page_url}")
    return None


def scrape_all_pages(base_url, num_pages=5):
    """Walk pages 1..num_pages and collect every parsed company record."""
    all_companies = []
    for page in range(1, num_pages + 1):
        url = f"{base_url}?page={page}"
        print(f"Scraping page {page}...")
        html = fetch_html(url)
        if html:
            all_companies.extend(parse_listings(html))
        time.sleep(2)  # pace the run between pages
    return all_companies
```
fetch_html retries a failed fetch up to twice with a short pause, returning the HTML on success and None once it gives up. scrape_all_pages appends the page parameter, parses the cards, and caps the crawl at the num_pages ceiling so a large category does not run away. The time.sleep(2) between pages paces the run.
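Two refinements are worth considering for the pagination loop. Appending ?page= by string formatting breaks if the base URL already carries a query string, and a fixed page count keeps requesting pages after the results run out. The sketch below addresses both with the standard library. build_page_url and scrape_until_empty are hypothetical helpers, fetch and parse are injected stand-ins for fetch_html and parse_listings, and the sort parameter in the example is made up.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_page_url(base_url, page):
    """Set (or overwrite) the page parameter, keeping other parameters."""
    parts = urlsplit(base_url)
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k != "page"
    ]
    query.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def scrape_until_empty(base_url, fetch, parse, max_pages=5):
    """Walk pages until one yields no records or max_pages is reached.

    Note: a page that fails to fetch also ends the walk.
    """
    records = []
    for page in range(1, max_pages + 1):
        html = fetch(build_page_url(base_url, page))
        page_records = parse(html) if html else []
        if not page_records:
            break
        records.extend(page_records)
    return records


print(build_page_url("https://example.com/list?sort=rating&page=1", 3))
# https://example.com/list?sort=rating&page=3
```

Stopping on the first empty page saves requests, at the cost of ending early if a single page fails; whether that trade is right depends on how reliable your fetches are.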
Step 4: Assemble the listings scraper
Now wire the pieces into one runnable script: walk the pages, collect the company records, and export them to both JSON and CSV.
```python
import csv
import json
import time

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup

# Replace the placeholder with your Crawlbase JS token.
api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

OPTIONS = {
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/122.0",
    "ajax_wait": "true",
    "page_wait": 5000,
}


def crawl(page_url):
    """Fetch a rendered page; return the HTML text, or None on failure."""
    response = api.get(page_url, OPTIONS)
    if response["headers"]["pc_status"] == "200":
        return response["body"].decode("utf-8")
    print(f"Request failed: {response['headers']['pc_status']}")
    return None


def fetch_html(page_url, max_retries=2):
    """Fetch a page, retrying a failed request up to max_retries times."""
    for attempt in range(max_retries + 1):
        html = crawl(page_url)
        if html:
            return html
        if attempt < max_retries:
            time.sleep(1)
    return None


def extract_company(card):
    """Read the five listing fields from one company card."""
    name = card.select_one("h3.firm-name")
    location = card.select_one("div.firm-location")
    category = card.select_one("div.firm-content > div.tagline")
    rating = card.select_one("span.rating-number")
    link = card.select_one("div.firm-urls > a.visit-profile")
    return {
        "name": name.get_text(strip=True) if name else "",
        "location": location.get_text(strip=True) if location else "",
        "category": category.get_text(strip=True) if category else "",
        "rating": rating.get_text(strip=True) if rating else "No rating",
        "profile_url": link["href"] if link else "",
    }


def parse_listings(html):
    """Return one record per company card found in the listing HTML."""
    soup = BeautifulSoup(html, "html.parser")
    cards = soup.select("ul.firm-directory-list > li.firm-wrapper")
    return [extract_company(card) for card in cards]


def scrape_all_pages(base_url, num_pages=5):
    """Walk pages 1..num_pages and collect every parsed company record."""
    all_companies = []
    for page in range(1, num_pages + 1):
        url = f"{base_url}?page={page}"
        print(f"Scraping page {page}...")
        html = fetch_html(url)
        if html:
            all_companies.extend(parse_listings(html))
        time.sleep(2)  # pace the run between pages
    return all_companies


def save_outputs(records):
    """Write the records to JSON, and to CSV when there is at least one."""
    with open("goodfirms_companies.json", "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
    if not records:
        return
    # The first record's keys become the CSV header.
    with open("goodfirms_companies.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)


def main():
    base_url = "https://www.goodfirms.co/companies/web-development-agency/london"
    companies = scrape_all_pages(base_url, num_pages=3)
    save_outputs(companies)
    print(f"Saved {len(companies)} companies")


if __name__ == "__main__":
    main()
```
The script walks up to three category pages, parses each into records, and paces the loop with a two-second sleep. save_outputs writes both JSON and CSV using the keys of the first record as the header, so you have the data in whichever shape your downstream tool wants. Adjust num_pages and the category URL to fit your target vertical and city.
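If the same company turns up on more than one page (an assumption about the listings, for instance a featured placement repeated across pages), the combined list will carry duplicates. A small hypothetical helper can drop them before export by keying on profile_url, which should be unique per company.

```python
def dedupe_by_key(records, key="profile_url"):
    """Keep the first record seen for each key value, preserving order."""
    seen = set()
    unique = []
    for record in records:
        value = record.get(key)
        if value and value in seen:
            continue  # already kept a record with this key
        if value:
            seen.add(value)
        # Records with an empty key cannot be matched, so they are all kept.
        unique.append(record)
    return unique


# Made-up records: the first and third share a profile URL.
rows = [
    {"name": "A", "profile_url": "https://example.com/a"},
    {"name": "B", "profile_url": "https://example.com/b"},
    {"name": "A again", "profile_url": "https://example.com/a"},
    {"name": "No link", "profile_url": ""},
]
print([r["name"] for r in dedupe_by_key(rows)])  # ['A', 'B', 'No link']
```

In the script above, this would sit between scrape_all_pages and save_outputs in main.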
What the output looks like
Run the full script with python goodfirms_scraper.py and you get a clean structured record per company, ready for analysis, a database, or a spreadsheet.
```json
[
  {
    "name": "Unified Infotech",
    "location": "London, United Kingdom",
    "category": "Driving Digital Transformation with Advanced Tech",
    "rating": "5.0",
    "profile_url": "https://www.goodfirms.co/company/unified-infotech"
  },
  {
    "name": "instinctools",
    "location": "London, United Kingdom",
    "category": "Building Custom Software Solutions",
    "rating": "4.9",
    "profile_url": "https://www.goodfirms.co/company/instinctools"
  }
]
```
The matching CSV carries the same columns, one row per company, which drops straight into pandas or any spreadsheet for filtering by rating, location, or service category.
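Ratings arrive as strings, and some rows carry the "No rating" placeholder, so filtering by score needs a conversion step first. Here is a minimal standard-library sketch of that step. The helper names and the 4.5 cutoff are illustrative, and the in-memory CSV uses made-up rows standing in for goodfirms_companies.csv.

```python
import csv
import io


def parse_rating(text):
    """Convert a rating string to float; return None for placeholders."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def filter_by_rating(rows, minimum=4.5):
    """Yield rows whose rating parses and meets the minimum."""
    for row in rows:
        rating = parse_rating(row.get("rating"))
        if rating is not None and rating >= minimum:
            yield row


# In-memory CSV with made-up rows, standing in for the exported file.
sample_csv = io.StringIO(
    "name,location,category,rating,profile_url\n"
    "Alpha,London,Web,4.9,https://example.com/a\n"
    "Beta,London,Web,No rating,https://example.com/b\n"
    "Gamma,London,Web,4.2,https://example.com/c\n"
)
top = list(filter_by_rating(csv.DictReader(sample_csv), minimum=4.5))
print([row["name"] for row in top])  # ['Alpha']
```

To run it against the real export, replace the StringIO object with the opened CSV file.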
Step 5: Scrape company profile pages
The listings give you breadth; the profile pages give you depth. Each profile URL leads to a page with the company's full description, hourly rate band, team size, founding year, and services. Inspect a profile page the same way, and the deeper fields map to these selectors:
- Company name is an <h1> with itemprop="name".
- Description is a <div> with the class profile-summary-text.
- Hourly rate is a <span> inside div.profile-pricing.
- Number of employees is a <span> inside div.profile-employees.
- Year founded is a <span> inside div.profile-founded.
- Services come from the data-name attribute of each <button> in ul.services-chart-list.
```python
import re
import json
import time
from bs4 import BeautifulSoup

# fetch_html comes from the listings scraper above; keep both in one file
# or import it before running this block.


def text_of(soup, selector, default="N/A"):
    """Return the stripped text of the first match, or a default."""
    el = soup.select_one(selector)
    return el.get_text(strip=True) if el else default


def extract_profile(html, url):
    """Parse one profile page into a structured record."""
    soup = BeautifulSoup(html, "html.parser")
    summary = soup.select_one("div.profile-summary-text")
    # Collapse runs of whitespace inside the description.
    description = re.sub(r"\s+", " ", summary.get_text(strip=True)) if summary else "N/A"
    services = [b["data-name"] for b in soup.select("ul.services-chart-list button[data-name]")]
    return {
        "name": text_of(soup, 'h1[itemprop="name"]'),
        "profile_url": url,
        "description": description,
        "hourly_rate": text_of(soup, "div.profile-pricing > span"),
        "no_of_employees": text_of(soup, "div.profile-employees > span"),
        "year_founded": text_of(soup, "div.profile-founded > span"),
        "services": services,
    }


def scrape_profiles(profile_urls):
    """Fetch and parse each profile URL, pausing between requests."""
    profiles = []
    for url in profile_urls:
        print(f"Scraping profile: {url}")
        html = fetch_html(url)
        if html:
            profiles.append(extract_profile(html, url))
        time.sleep(2)
    return profiles


if __name__ == "__main__":
    profile_urls = [
        "https://www.goodfirms.co/company/unified-infotech",
        "https://www.goodfirms.co/company/instinctools",
    ]
    data = scrape_profiles(profile_urls)
    with open("goodfirms_profiles.json", "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)
    print(f"Saved {len(data)} profiles")
```
This reuses the fetch_html wrapper from the listings scraper, so rendering, retries, and the JS token carry over. The re.sub(r"\s+", " ") call collapses the whitespace runs that company descriptions tend to carry, and the services list reads the data-name attribute off each chart button. A typical profile record looks like this:
```json
{
  "name": "Unified Infotech",
  "profile_url": "https://www.goodfirms.co/company/unified-infotech",
  "description": "Unified Infotech is a digital transformation partner serving enterprises with custom web, mobile, and software solutions...",
  "hourly_rate": "$50 - $99/hr",
  "no_of_employees": "50 - 249",
  "year_founded": "2010",
  "services": [
    "Web Development",
    "Software Development",
    "Web Designing (UI/UX)",
    "Mobile App Development",
    "E-commerce Development"
  ]
}
```
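The hourly_rate value is a display string, which is awkward to sort or filter on. A small regex can turn a band such as "$50 - $99/hr" into numbers. This is a sketch under the assumption that bands are written with dollar amounts as in the record above; parse_rate_band is a hypothetical helper, and the single-amount formats mentioned in its docstring are guesses at what other bands might look like, not observed values.

```python
import re


def parse_rate_band(text):
    """Return (low, high) in dollars from a band like "$50 - $99/hr".

    Returns (None, None) when no dollar amount is found. A band with a
    single amount (for example "< $25/hr" or "$300+/hr", both assumed
    formats) comes back as (amount, amount), so its open end is lost.
    """
    amounts = [
        int(a.replace(",", ""))
        for a in re.findall(r"\$\s*(\d[\d,]*)", text or "")
    ]
    if not amounts:
        return (None, None)
    return (min(amounts), max(amounts))


print(parse_rate_band("$50 - $99/hr"))  # (50, 99)
print(parse_rate_band("N/A"))           # (None, None)
```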
Feed the profile_url values from the listings step into scrape_profiles and you get the hourly rate band and services for each company in the same run, joining breadth and depth into one dataset.
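Joining the two layers comes down to matching records on profile_url. The following sketch shows one way to do it with plain dictionaries. merge_records is a hypothetical helper, the sample records are made up, and letting the listing value win when both layers carry the same field (such as name) is a design choice you may want to reverse.

```python
def merge_records(listings, profiles, key="profile_url"):
    """Left-join profile fields onto listing records by key."""
    by_key = {p[key]: p for p in profiles if p.get(key)}
    merged = []
    for listing in listings:
        profile = by_key.get(listing.get(key), {})
        combined = dict(listing)
        for field, value in profile.items():
            # setdefault keeps the listing value when both have the field.
            combined.setdefault(field, value)
        merged.append(combined)
    return merged


# Made-up records for illustration.
listings = [
    {"name": "A", "rating": "5.0", "profile_url": "https://example.com/a"},
    {"name": "B", "rating": "4.9", "profile_url": "https://example.com/b"},
]
profiles = [
    {"name": "A Ltd", "profile_url": "https://example.com/a", "hourly_rate": "$50 - $99/hr"},
]
for row in merge_records(listings, profiles):
    print(row["name"], row.get("hourly_rate", "(no profile)"))
```

Listings without a matching profile pass through unchanged, so a failed profile fetch does not drop the company from the dataset.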
Staying unblocked at scale
Even with rendering handled, a busy directory watches for scraper-shaped traffic. A few habits keep a longer run healthy on any commercial target.
- Pace your requests. Hammering listings in a tight loop is the fastest way to get throttled or challenged. The two-second sleeps above are the floor, not the ceiling; widen them for larger jobs and vary your targets instead of crawling one category at full speed.
- Lean on rotation. A pool of residential IPs spreads requests across many real-user addresses so no single one trips a rate limit. The Crawling API handles this for you; if you roll your own stack, this is the part to get right.
- Read the status codes. A run that starts returning non-200 pc_status values is telling you the current rate or IP tier is no longer enough. Treat that as a signal to back off, not noise to ignore.
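Backing off can be made systematic with exponential delays plus a little random jitter, so repeated failures slow the run down instead of retrying at a fixed one-second pace. The sketch below is one way to do that. The helper names and every default (base, factor, cap, jitter) are assumptions to tune, and fetch is an injected stand-in for a function like crawl that returns HTML or None.

```python
import random
import time


def backoff_delay(attempt, base=2.0, factor=2.0, cap=60.0, jitter=0.25, rng=random):
    """Delay in seconds for a zero-based attempt number, with +/- jitter."""
    delay = min(cap, base * (factor ** attempt))
    spread = delay * jitter
    return max(0.0, delay + rng.uniform(-spread, spread))


def fetch_with_backoff(fetch, url, max_attempts=4, sleep=time.sleep, **delay_kwargs):
    """Call fetch(url) until it returns something truthy or attempts run out."""
    for attempt in range(max_attempts):
        result = fetch(url)
        if result:
            return result
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt, **delay_kwargs))
    return None


# Without jitter the delays double each time, up to the cap.
print([backoff_delay(n, jitter=0) for n in range(4)])  # [2.0, 4.0, 8.0, 16.0]
```

Passing sleep as a parameter keeps the function testable: a test can record the delays instead of actually waiting.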
For larger crawls, the async Crawler queues requests and delivers results to a webhook, which suits running many category pages without holding open connections. For the broader playbook, see how to scrape websites without getting blocked. And if you are mapping the wider B2B vendor landscape, the same approach carries over to scraping Clutch, Superpages, and other local business listings.
Is it legal to scrape GoodFirms?
Whether scraping GoodFirms is allowed depends on its Terms of Service, your jurisdiction, and what you do with the data. GoodFirms restricts automated access and bulk collection in its terms, so scraping can run against them regardless of how careful your tooling is. None of the code here changes that; it only makes the technical part work. Read GoodFirms' Terms of Service and its robots.txt before you start, respect the rate limits and crawl directives they declare, and keep your request volume low enough that you are not straining their servers.
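Checking crawl directives can be done in code with Python's urllib.robotparser. The sketch below parses a robots.txt body and asks whether a path is allowed and whether a crawl delay is declared. The rules shown are hypothetical and are not GoodFirms' actual file, and the user agent name is made up; fetch the real robots.txt and read it yourself before relying on any answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only; this is NOT
# GoodFirms' actual file. Fetch and read the real one before crawling.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def load_rules(robots_text):
    """Build a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS)
agent = "my-research-bot"  # made-up user agent name
print(rules.can_fetch(agent, "https://example.com/companies/web"))   # True
print(rules.can_fetch(agent, "https://example.com/private/report"))  # False
print(rules.crawl_delay(agent))                                      # 5
```

A robots.txt check covers crawl directives only; it says nothing about the Terms of Service, which still need reading separately.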
Scope matters as much as politeness. Stick to public business listing data: the company name, rating, service category, location, hourly rate band, and profile link that any visitor can see without an account. A company profile may also surface contact details for the business or named individuals, and the moment you collect or store anything that identifies a person, data-protection law applies. Under the GDPR you need a lawful basis to process personal data, individuals can ask to be removed, and if you use any collected contact information for outreach, regimes like the GDPR and the US CAN-SPAM Act govern that too: you need consent or another lawful basis, accurate sender information, and a working opt-out. People and businesses can ask not to be contacted, and you must honor that. Avoid scraping anything behind a login, and do not redistribute GoodFirms' own editorial content or review text wholesale, which is copyrighted.
This guide is deliberately scoped to public listing and profile pages because that is the line that keeps the work defensible. It does not cover anything behind an account, the bulk harvesting of personal contact details, or any attempt to bypass authentication. Public business data only. If your project needs more than that, the right path is a permitted one: GoodFirms publishes data through its own channels and partner arrangements, so check whether an official API or licensed feed covers your use case. That is the correct route for commercial or bulk use, not a cleverer scraper.
Key takeaways
- GoodFirms renders parts of its pages client-side. A plain request can return a thin shell, so render the page with the JS token before you parse it.
- You need rendering and a trusted IP together. The Crawling API with a JS token does both in one call; ajax_wait and page_wait control how long it waits for content.
- Work in two layers. Parse the category listings for name, rating, services, location, and profile link, then follow each profile URL for the hourly rate band, team size, and services list.
- Paginate and export. Walk the page query parameter up to a ceiling, pace the run with short sleeps, and write the records to JSON and CSV.
- Stay on public business data. Respect GoodFirms' ToS and robots.txt, treat any personal contact info as regulated under the GDPR and CAN-SPAM with a lawful basis and a working opt-out, and never touch logins or copyrighted editorial content.
Frequently Asked Questions (FAQs)
Why does a plain request return only part of the GoodFirms data?
Because GoodFirms loads parts of its directory grid and profile details client-side with JavaScript. The initial HTML can be a shell that fills in only after the scripts run, so a raw request may return status 200 with cards or profile fields missing. Render the page first to get the full set, which is what the Crawling API's JS token handles.
Do I need the normal token or the JS token for GoodFirms?
Use the JS token. The normal token fetches static HTML, which can miss the parts of GoodFirms that render in the browser. The JS token runs the page in a real browser first, so the company cards and profile fields are present when BeautifulSoup parses them.
What data can I scrape from GoodFirms?
Public business listing fields: the company name, its rating, the service category or tagline, the location, the hourly rate band, team size, year founded, and the profile link. Stay on data that is visible to any visitor without an account, and treat any contact details for named individuals as personal data that falls outside the public-listing scope this guide covers.
My selectors return empty results. What changed?
Almost certainly GoodFirms' markup. Class names like firm-name, firm-location, and rating-number, and the profile containers such as profile-pricing and services-chart-list, can change without notice, so selectors that worked last month can break. Re-inspect a live page in your browser's dev tools and update the selectors. Periodic selector maintenance is normal for any production scraper.
How do I handle pagination across a category?
GoodFirms appends a page query parameter to the category URL. Walk the pages in a loop, parse the company cards on each, cap the crawl at a num_pages ceiling so a large category does not run away, and add a short sleep between pages. The scrape_all_pages function above shows the full loop.
Can I use scraped GoodFirms data for outreach or commercially?
Treat that as a legal question, not a technical one. Contact information that identifies a person is personal data, so outreach built on it is governed by regimes like the GDPR and the US CAN-SPAM Act: you need a lawful basis or consent, accurate sender details, and a working opt-out, and people can ask not to be contacted. GoodFirms' Terms of Service also restrict reuse of its content. Review the terms, check whether an official API or licensed feed covers your use case, and seek legal advice before building a product or an outreach list on top of the data.
