Most teams that watch a market, build a search index, or feed a dataset start the same way: they crawl data from a set of public web pages and turn it into clean records. The hard part is rarely a single page. It is doing it across hundreds of pages without your requests getting throttled, blocked, or silently returning half-empty HTML.

This guide shows you how to build a small, runnable web crawler in Python. It fetches a start page through the Crawling API, extracts the links on it, follows the ones that stay inside your target scope, parses the fields you want on each page, removes duplicates, and exports clean JSON and CSV. The walkthrough stays on a neutral example site so you can run it as-is and then point it at your own public source.

Crawling vs scraping in one paragraph

These two words get used interchangeably, but they name different jobs. Crawling is discovery: starting from one or more URLs, following links, and walking outward to find pages worth visiting. Scraping is extraction: taking one page's HTML and pulling out the specific fields you care about, such as a title, a price, or a date. A real pipeline does both. The crawler decides which pages to visit and the scraper decides what to keep from each one. The script in this guide is a crawler with a scraper bolted onto every page it visits.

What you will build

A single Python script that takes a start URL, discovers article links by following in-scope links, fetches each page through the Crawling API, and extracts a structured record per page. The running example uses https://example.com as a stand-in for a public listing or blog index. Each record carries these fields:

  • Title: the main heading of the page.
  • URL: the canonical link the record was scraped from.
  • Summary: the lead paragraph or meta description.
  • Date: the published or updated date when the page exposes one.
  • Links: the count of in-scope links discovered on the page.

Why a plain request often fails

The naive version of this is a loop around a bare HTTP client: fetch a URL, parse it, queue the links, repeat. It works on a toy site and falls apart on a real one for two reasons.

First, rendering. Many modern pages ship a thin HTML shell and load their real content in the browser through JavaScript and Ajax. Request that shell with a plain client and the links and fields you want are not in the body yet, so your crawler discovers nothing and parses nothing. Second, blocking. Sites watch for automated traffic. Requests that come from datacenter IP ranges, lack browser headers, or arrive faster than any human could click get rate-limited, IP-blocked, or served a CAPTCHA before they ever reach the content.

So a crawler that holds up at scale needs two things in every request: a browser that renders the page, and an IP the site reads as a real visitor. You can assemble that yourself from a headless browser plus a pool of rotating residential proxies, but keeping that stack healthy is most of the work. The Crawling API folds both into one call: you send it a URL, it renders the page behind a trusted IP, and it returns finished HTML for you to parse.
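To make the rendering problem concrete, the sketch below counts anchors in two hand-written HTML strings: a JavaScript shell of the kind a plain client might receive, and the markup a browser could end up with after the script runs. Both strings are invented for illustration, and the counter uses the standard library's html.parser so it runs with no installs.

```python
from html.parser import HTMLParser

class AnchorCounter(HTMLParser):
    """Counts <a> tags that carry an href attribute."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.count += 1

def count_links(html):
    parser = AnchorCounter()
    parser.feed(html)
    return parser.count

# Hypothetical shell: what a plain HTTP client might receive.
SHELL_HTML = (
    '<html><body><div id="app"></div>'
    '<script src="/app.js"></script></body></html>'
)

# Hypothetical markup after a browser has run the script.
RENDERED_HTML = (
    '<html><body><div id="app">'
    '<a href="/articles/web-data">Web data</a>'
    '<a href="/articles/crawling-basics">Crawling basics</a>'
    '</div></body></html>'
)

print(count_links(SHELL_HTML))     # 0
print(count_links(RENDERED_HTML))  # 2
```

A crawler that only ever sees the first string has nothing to queue, which is why rendering has to happen before link extraction.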

Prerequisites

A few things need to be in place first. None take long.

Basic Python. You should be comfortable writing and running a script and installing packages with pip. If the parsing side is new to you, the BeautifulSoup guide pairs well with this tutorial.

Python 3.8 or later. Confirm with python --version. If you do not have it, install it from python.org or through a distribution like Anaconda, and make sure Python is on your PATH.

A Crawlbase account and token. Sign up, open your dashboard, and copy your token from the account page. Crawlbase includes 1,000 free requests to start, which is plenty for working through this guide. There are two token types: the normal token fetches static HTML, and the JavaScript token renders the page in a real browser first. Use the normal token for static pages and the JavaScript token when the content loads client-side. Treat the token like a password and keep it out of version control.
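One common way to keep the token out of version control is to read it from an environment variable instead of hard-coding it. The sketch below assumes a variable named CRAWLBASE_TOKEN; that name and the load_token helper are illustrative choices, not something the Crawlbase client requires.

```python
import os

def load_token(env_var="CRAWLBASE_TOKEN"):
    """Return the token from the environment, or raise a clear error."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the crawler."
        )
    return token

# In the crawler you could then write something like:
#   api = CrawlingAPI({"token": load_token()})
```

Set the variable in your shell (for example, export CRAWLBASE_TOKEN=... on macOS or Linux) and the script picks it up without the value ever appearing in the file.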

Set up the project

Create a virtual environment so dependencies stay isolated, then install the two libraries the crawler needs.

bash
python --version

python -m venv crawler_env
source crawler_env/bin/activate

pip install crawlbase beautifulsoup4

On Windows, activate the environment with crawler_env\Scripts\activate instead of the source line. Two dependencies do the work: crawlbase is the official client for the Crawling API, and beautifulsoup4 parses the returned HTML so you can pull out fields and links by CSS selector. Both json and csv ship with the standard library, so nothing more is needed for the export step.

Step 1: Fetch a page through Crawlbase

Start by getting one page reliably. Import the CrawlingAPI class, initialize it with your token, and request the start URL. Checking the Crawlbase pc_status before you parse keeps failures loud instead of silent, and gives you a clean place to retry.

python
import time
from crawlbase import CrawlingAPI

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

def fetch_html(page_url, max_retries=2):
    for attempt in range(max_retries + 1):
        response = api.get(page_url)
        if response["headers"]["pc_status"] == "200":
            return response["body"].decode("utf-8")
        if attempt < max_retries:
            print(f"Retrying ({attempt + 1}/{max_retries})...")
            time.sleep(1)
    print(f"Failed: {page_url} ({response['headers']['pc_status']})")
    return None

if __name__ == "__main__":
    html = fetch_html("https://example.com")
    print(html[:500] if html else "No HTML returned")

The fetch_html helper is the backbone of the whole crawler. It sends the URL through Crawlbase, retries up to twice with a short pause when a fetch fails, and returns the decoded HTML on success or None once it gives up. Run it with python crawler.py and you should see real markup print, which confirms the request path works before you write a single selector. If your target loads content client-side, initialize with the JavaScript token and pass {"ajax_wait": "true", "page_wait": 5000} as a second argument to api.get so the API waits for the dynamic content before capturing the page.
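If you want to check the retry path without spending requests, one option is to pass the client in as a parameter and hand it a stand-in. The sketch below is a variant of fetch_html, not the version above: FakeAPI is a hypothetical stub that mimics the response shape this guide's code relies on (a dict with headers["pc_status"] and a bytes body), and the pause argument is added so the demo does not sleep.

```python
import time

def fetch_html(api, page_url, max_retries=2, pause=1.0):
    """Variant of fetch_html that takes the client as an argument."""
    for attempt in range(max_retries + 1):
        response = api.get(page_url)
        if response["headers"]["pc_status"] == "200":
            return response["body"].decode("utf-8")
        if attempt < max_retries:
            time.sleep(pause)
    return None

class FakeAPI:
    """Hypothetical stub: replays a fixed list of pc_status values."""

    def __init__(self, statuses):
        self.statuses = list(statuses)
        self.calls = 0

    def get(self, url):
        # Repeat the last status once the list is exhausted.
        status = self.statuses[min(self.calls, len(self.statuses) - 1)]
        self.calls += 1
        return {"headers": {"pc_status": status}, "body": b"<h1>ok</h1>"}

flaky = FakeAPI(["503", "200"])
print(fetch_html(flaky, "https://example.com", pause=0))  # <h1>ok</h1>
print(flaky.calls)                                        # 2

down = FakeAPI(["503"])
print(fetch_html(down, "https://example.com", pause=0))   # None
print(down.calls)                                         # 3
```

The second case shows the cost of a dead URL: with max_retries=2 the helper makes three attempts before it gives up.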

Step 2: Extract links from the page

Discovery is just link extraction done in a loop. Load the HTML into BeautifulSoup, pull every anchor's href, and resolve relative paths against the page they were found on so you always work with absolute URLs.

python
from urllib.parse import urljoin, urldefrag
from bs4 import BeautifulSoup

def extract_links(html, base_url):
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for a in soup.select("a[href]"):
        href = a["href"].strip()
        if not href or href.startswith(("mailto:", "tel:", "javascript:")):
            continue
        absolute = urljoin(base_url, href)
        absolute, _ = urldefrag(absolute)
        links.add(absolute)
    return links

Three small decisions make this robust. The function skips mailto:, tel:, and javascript: anchors that are not real pages. It uses urljoin so a relative href like /articles/web-data becomes a full URL against the page it came from. And it calls urldefrag to drop the #section fragment, because /page and /page#top are the same document and you do not want to visit both. Returning a set deduplicates the links found on this single page before they ever reach the queue.
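To see what the two urllib.parse helpers do on their own, here is a short standalone check. The URLs are made-up examples on the same placeholder host the guide uses.

```python
from urllib.parse import urljoin, urldefrag

page = "https://example.com/blog/post"

# Root-relative, document-relative, and already-absolute hrefs.
print(urljoin(page, "/articles/web-data"))       # https://example.com/articles/web-data
print(urljoin(page, "next"))                     # https://example.com/blog/next
print(urljoin(page, "https://other.example/x"))  # https://other.example/x

# urldefrag splits the fragment off and returns both parts.
clean, fragment = urldefrag("https://example.com/page#top")
print(clean)     # https://example.com/page
print(fragment)  # top
```

Note that an absolute href passes through urljoin unchanged, so off-site links still come out of extract_links; filtering them is the job of the scope check.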

Step 3: Keep the crawl in scope

Left unbounded, a crawler follows links off your target site and never stops. The fix is a scope rule: only follow links that share the start URL's host and, optionally, sit under a path prefix you care about. This is the crawler equivalent of staying on the product section instead of wandering into the help center.

python
from urllib.parse import urlparse

def in_scope(url, root):
    root_parts = urlparse(root)
    url_parts = urlparse(url)
    if url_parts.scheme not in ("http", "https"):
        return False
    if url_parts.netloc != root_parts.netloc:
        return False
    return url_parts.path.startswith(root_parts.path)

in_scope compares each candidate URL against the root you started from. It rejects anything that is not HTTP or HTTPS, anything on a different host (netloc), and anything whose path does not start with the root path. Set the root to https://example.com/ to crawl the whole host, or to https://example.com/blog/ to stay inside one section. Tightening scope here is the single biggest lever on how much you fetch.
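A few sample calls show how the three checks behave. The function is repeated so the snippet runs on its own, and the URLs are invented. The last call illustrates one edge worth knowing about: because the path test is a plain string prefix, a root without a trailing slash also matches sibling paths.

```python
from urllib.parse import urlparse

def in_scope(url, root):
    root_parts = urlparse(root)
    url_parts = urlparse(url)
    if url_parts.scheme not in ("http", "https"):
        return False
    if url_parts.netloc != root_parts.netloc:
        return False
    return url_parts.path.startswith(root_parts.path)

ROOT = "https://example.com/blog/"

print(in_scope("https://example.com/blog/post-1", ROOT))      # True
print(in_scope("https://example.com/help/faq", ROOT))         # False: outside the prefix
print(in_scope("https://www.example.com/blog/post-1", ROOT))  # False: different netloc
print(in_scope("ftp://example.com/blog/file", ROOT))          # False: not HTTP(S)

# Without the trailing slash the prefix also matches /blogging.
print(in_scope("https://example.com/blogging", "https://example.com/blog"))  # True
```

The netloc comparison is exact, so www.example.com and example.com count as different hosts. If your target uses both, pick the one the site links to internally as your root.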

Step 4: Parse the fields on each page

Discovery tells you which pages to visit; parsing decides what to keep. Pull a small, well-defined record from each page and guard every lookup so a missing field returns None instead of crashing the run.

python
def text_of(soup, selector):
    el = soup.select_one(selector)
    return el.get_text(strip=True) if el else None

def attr_of(soup, selector, attr):
    el = soup.select_one(selector)
    return el.get(attr) if el else None

def parse_page(html, url):
    soup = BeautifulSoup(html, "html.parser")
    summary = (
        attr_of(soup, 'meta[name="description"]', "content")
        or text_of(soup, "article p")
    )
    return {
        "url": url,
        "title": text_of(soup, "h1") or text_of(soup, "title"),
        "summary": summary,
        "date": attr_of(soup, "time[datetime]", "datetime"),
    }

The two helpers, text_of and attr_of, query a single element and return its text or one attribute, falling back to None when the element is absent. parse_page uses a chain of fallbacks: it prefers the meta[name="description"] tag for the summary and drops to the first article paragraph if there is none, and it takes the h1 for the title but uses the <title> tag when no h1 exists. These selectors are deliberately generic so the script runs on the example site. For a real target, open the page in your browser's dev tools and replace them with selectors that match its actual markup.
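The datetime attribute often holds a full timestamp rather than a bare date. If you want the date column in a single YYYY-MM-DD shape, a small normalizer can run on the parsed value. normalize_date below is a hypothetical helper, not part of the script above, and the inputs are invented examples.

```python
from datetime import datetime

def normalize_date(value):
    """Return YYYY-MM-DD for an ISO 8601 string, or None if it cannot be parsed."""
    if not value:
        return None
    text = value.strip()
    # Older Python versions reject a trailing "Z" in fromisoformat.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    try:
        return datetime.fromisoformat(text).date().isoformat()
    except ValueError:
        return None

print(normalize_date("2024-09-18"))                 # 2024-09-18
print(normalize_date("2024-09-18T10:30:00Z"))       # 2024-09-18
print(normalize_date("2024-09-18T10:30:00+02:00"))  # 2024-09-18
print(normalize_date("September 18, 2024"))         # None
print(normalize_date(None))                         # None
```

Like the selector helpers, it returns None rather than raising, so one oddly formatted page does not stop the run.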

Step 5: Assemble the crawl loop

Now wire the pieces into one breadth-first crawler. A queue holds URLs to visit, a visited set prevents fetching the same page twice, and a max_pages ceiling stops the run from going on forever. For each page it fetches, the crawler parses a record, counts the in-scope links, and queues the new ones.

python
import csv
import json
import time
from collections import deque
from urllib.parse import urljoin, urldefrag, urlparse
from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

def fetch_html(page_url, max_retries=2):
    for attempt in range(max_retries + 1):
        response = api.get(page_url)
        if response["headers"]["pc_status"] == "200":
            return response["body"].decode("utf-8")
        if attempt < max_retries:
            time.sleep(1)
    return None

def extract_links(html, base_url):
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for a in soup.select("a[href]"):
        href = a["href"].strip()
        if not href or href.startswith(("mailto:", "tel:", "javascript:")):
            continue
        absolute, _ = urldefrag(urljoin(base_url, href))
        links.add(absolute)
    return links

def in_scope(url, root):
    r, u = urlparse(root), urlparse(url)
    return (
        u.scheme in ("http", "https")
        and u.netloc == r.netloc
        and u.path.startswith(r.path)
    )

def text_of(soup, selector):
    el = soup.select_one(selector)
    return el.get_text(strip=True) if el else None

def attr_of(soup, selector, attr):
    el = soup.select_one(selector)
    return el.get(attr) if el else None

def parse_page(html, url, link_count):
    soup = BeautifulSoup(html, "html.parser")
    summary = (
        attr_of(soup, 'meta[name="description"]', "content")
        or text_of(soup, "article p")
    )
    return {
        "url": url,
        "title": text_of(soup, "h1") or text_of(soup, "title"),
        "summary": summary,
        "date": attr_of(soup, "time[datetime]", "datetime"),
        "links": link_count,
    }

def crawl(start_url, max_pages=25):
    queue = deque([start_url])
    visited = set()
    records = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = fetch_html(url)
        if not html:
            continue
        found = {l for l in extract_links(html, url) if in_scope(l, start_url)}
        records.append(parse_page(html, url, len(found)))
        for link in found:
            if link not in visited:
                queue.append(link)
        print(f"[{len(visited)}/{max_pages}] {url}")
        time.sleep(2)
    return records

This is a textbook breadth-first crawl. The visited set is the dedupe guard at the crawl level: a URL is added before it is fetched, so even if three pages all link to the same article, it is requested exactly once. max_pages caps the total work, the in-scope filter keeps the queue from filling with off-site links, and the two-second sleep paces the run so you are not hammering the server. The print line gives you a live progress trail while it works.
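You can watch the queue and visited set interact without any network calls by running the same loop over an in-memory link graph. SITE below is a hypothetical four-page site where each key maps to the in-scope links found on that page; the fetch and parse steps are left out so only the traversal remains.

```python
from collections import deque

# Hypothetical site: each URL maps to the in-scope links found on it.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/a", "/c"],
    "/c": ["/"],
}

def crawl_order(start, max_pages=25):
    """Return the order in which the breadth-first loop visits pages."""
    queue = deque([start])
    visited = set()
    order = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue  # queued more than once, but visited only once
        visited.add(url)
        order.append(url)
        for link in SITE.get(url, []):
            if link not in visited:
                queue.append(link)
    return order

print(crawl_order("/"))               # ['/', '/a', '/b', '/c']
print(crawl_order("/", max_pages=2))  # ['/', '/a']
```

Pages /b and /c are each linked from two places and can sit in the queue twice, yet each appears once in the result because the visited check runs when a URL is popped.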

Step 6: Dedupe and export to JSON and CSV

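The visited set already prevents fetching the same URL string twice, but it treats trailing-slash variants such as /article and /article/ as different URLs, so the crawl can still produce two records that describe the same page. A final pass keyed on the normalized URL collapses those before export. Two different URLs that redirect to the same page are not caught by this pass, because each record stores the URL that was requested.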

python
def dedupe(records):
    seen = {}
    for record in records:
        seen[record["url"].rstrip("/")] = record
    return list(seen.values())

def save_outputs(records):
    # Write UTF-8 explicitly so non-ASCII titles export the same way on every OS.
    with open("crawl_results.json", "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
    if not records:
        return
    with open("crawl_results.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)

def main():
    records = crawl("https://example.com", max_pages=25)
    records = dedupe(records)
    save_outputs(records)
    print(f"Saved {len(records)} pages")

if __name__ == "__main__":
    main()

dedupe keys each record on its URL with the trailing slash stripped, so /article and /article/ resolve to one entry, and the later record wins. save_outputs writes a JSON file and a CSV using the keys of the first record as the header, giving you the data in whichever shape your next tool wants. Add dedupe, save_outputs, and main below the crawl loop from Step 5 and the script runs end to end.
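Here is the dedupe pass on three invented records, two of which differ only by a trailing slash. The function is repeated so the snippet runs on its own.

```python
def dedupe(records):
    seen = {}
    for record in records:
        seen[record["url"].rstrip("/")] = record
    return list(seen.values())

records = [
    {"url": "https://example.com/article", "title": "First fetch"},
    {"url": "https://example.com/other", "title": "Other"},
    {"url": "https://example.com/article/", "title": "Second fetch"},
]

unique = dedupe(records)
print(len(unique))         # 2
print(unique[0]["title"])  # Second fetch
print(unique[0]["url"])    # https://example.com/article/
```

The surviving record keeps its own url value, trailing slash included; only the dictionary key is normalized. It also keeps the position of the first occurrence, because updating an existing dictionary key does not move it.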

What the output looks like

Run the full script with python crawler.py and you get one structured record per page, ready for analysis, a database, or a spreadsheet.

json
[
  {
    "url": "https://example.com/articles/web-data",
    "title": "A Practical Guide to Web Data",
    "summary": "How teams turn public pages into clean, structured records.",
    "date": "2024-09-18",
    "links": 12
  },
  {
    "url": "https://example.com/articles/crawling-basics",
    "title": "Crawling Basics",
    "summary": "Discovery, scope, and dedupe explained from first principles.",
    "date": "2024-08-02",
    "links": 9
  }
]

The matching CSV carries the same columns, one row per page, which drops straight into pandas or any spreadsheet for sorting, filtering, or joining against another dataset. If you want to take the storage step further, storing scraped data on cloud and loading it into SQL are natural next steps.
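One practical difference between the two files is worth seeing before you load them: JSON preserves numbers and nulls, while CSV hands everything back as text. The sketch below round-trips a single made-up record through both formats in memory, using only the standard library.

```python
import csv
import io
import json

record = {"url": "https://example.com/articles/web-data", "date": None, "links": 12}

# JSON round trip: the integer and the null survive.
from_json = json.loads(json.dumps([record]))[0]
print(from_json["links"], from_json["date"])  # 12 None

# CSV round trip: every value comes back as a string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=record.keys())
writer.writeheader()
writer.writerow(record)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))
print(repr(from_csv["links"]), repr(from_csv["date"]))  # '12' ''
```

If you read the CSV with csv.DictReader, convert the links column back to an integer and treat empty strings as missing values before you sort or aggregate.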

Scaling the crawl

The script above is deliberately single-threaded so it is easy to read and easy to keep polite. A few changes take it from a demo to a job you can leave running.

  • Raise the ceiling carefully. max_pages is your safety valve. Increase it in steps and watch how many in-scope links the crawl is discovering before you commit to a large run.
  • Persist the frontier. For long crawls, write the queue and visited set to disk so an interrupted run resumes instead of starting over and re-fetching everything.
  • Go asynchronous for volume. When you need thousands of pages, the async Crawler queues requests and pushes results to a webhook, so you are not holding open connections while pages render.
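As a rough sketch of persisting the frontier, the two helpers below save the queue and visited set to a JSON file and load them back. save_frontier, load_frontier, and the frontier.json file name are hypothetical and not part of the script above; in the crawl loop you would call save_frontier every few pages.

```python
import json
import os
import tempfile
from collections import deque

def save_frontier(path, queue, visited):
    """Write the crawl state to a JSON file, replacing any earlier copy."""
    state = {"queue": list(queue), "visited": sorted(visited)}
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # swap the finished file in as one step

def load_frontier(path, start_url):
    """Return (queue, visited), falling back to a fresh crawl state."""
    if not os.path.exists(path):
        return deque([start_url]), set()
    with open(path, encoding="utf-8") as f:
        state = json.load(f)
    return deque(state["queue"]), set(state["visited"])

# Demo in a temporary directory: first run, one step, save, resume.
state_file = os.path.join(tempfile.mkdtemp(), "frontier.json")
queue, visited = load_frontier(state_file, "https://example.com/")
visited.add(queue.popleft())
queue.append("https://example.com/blog/")
save_frontier(state_file, queue, visited)

queue2, visited2 = load_frontier(state_file, "https://example.com/")
print(list(queue2))      # ['https://example.com/blog/']
print(sorted(visited2))  # ['https://example.com/']
```

Writing to a temporary file and then renaming it means an interruption during the save leaves the previous state file intact rather than a half-written one.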

For JavaScript-heavy targets where the links themselves load client-side, the same loop works once you switch to the JavaScript token and the wait options; see crawling JavaScript websites for the details.

Staying unblocked

Even with rendering and trusted IPs handled, a few habits keep a longer crawl healthy.

  • Pace your requests. The two-second sleep in the loop is a floor, not a ceiling. Widen it for larger jobs, and avoid crawling one path as fast as the server will answer.
  • Lean on rotation. A pool of residential IPs spreads requests across many real-user addresses so no single one trips a rate limit. The Crawling API handles this for you; if you build your own stack, this is the part to get right.
  • Read the status codes. A run that starts returning non-200 pc_status values is telling you the current rate or IP tier is no longer enough. Treat that as a signal to back off, not noise to ignore.
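One way to act on the back-off advice is to grow the pause after each consecutive failure instead of sleeping a fixed second. The helper below is a minimal sketch of exponential backoff with optional jitter; backoff_delay and its default values are illustrative, not settings recommended by Crawlbase.

```python
import random

def backoff_delay(attempt, base=2.0, cap=60.0, jitter=0.0):
    """Seconds to wait before retry number `attempt` (0-based).

    The delay doubles each attempt up to `cap`; `jitter` adds up to that
    fraction of random extra time so retries do not line up.
    """
    delay = min(cap, base * (2 ** attempt))
    return delay + delay * jitter * random.random()

print([backoff_delay(n) for n in range(6)])  # [2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
```

In fetch_html you could swap time.sleep(1) for time.sleep(backoff_delay(attempt, jitter=0.5)), so a run that keeps hitting non-200 statuses slows itself down.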

For the fuller playbook, see how to scrape websites without getting blocked.

Scraping responsibly

Crawl public data only, and respect the rules of the sites you visit. Read each target's terms of service and its robots.txt before you start, keep your request rate reasonable so you are not straining anyone's servers, and stay clear of anything behind a login or paywall. When the pages you collect contain personal data, privacy laws such as GDPR and CCPA apply to how you store and use it, so scope your fields to what you actually need and avoid harvesting details tied to identifiable individuals. The code in this guide makes the technical part work; keeping the project on the right side of these lines is on you.
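Checking robots.txt can be automated with the standard library's urllib.robotparser. The sketch below parses an invented robots.txt from a string so it runs offline; the rules and the example-crawler agent name are placeholders. It covers only the robots.txt part of the checklist above, not terms of service or privacy obligations.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory for the demo.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

AGENT = "example-crawler"  # placeholder user agent name

print(robots.can_fetch(AGENT, "https://example.com/blog/post"))  # True
print(robots.can_fetch(AGENT, "https://example.com/private/x"))  # False
print(robots.crawl_delay(AGENT))                                  # 5
```

Against a live site you would call set_url with the site's robots.txt address and then read() instead of parse, and you could skip any URL for which can_fetch returns False before it reaches the queue.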

Recap

Key takeaways

  • Crawling and scraping are two jobs. The crawler discovers which pages to visit by following links; the scraper extracts the fields you keep from each one.
  • Render and route through a trusted IP. A plain client misses client-rendered content and gets blocked; the Crawling API returns finished HTML from a trusted IP in one call.
  • Scope and dedupe keep the crawl sane. An in_scope check stops the run wandering off-site, and a visited set plus a URL-keyed pass remove duplicate work and duplicate records.
  • Parse defensively. Guard every selector so a missing field returns None and one odd page does not end the run.
  • Export once, use anywhere. Writing both JSON and CSV lets the same dataset flow into pandas, a database, or a spreadsheet without rework.

Frequently Asked Questions (FAQs)

What is the difference between web crawling and web scraping?

Crawling is the discovery step: starting from one or more URLs and following links to find pages worth visiting. Scraping is the extraction step: taking a single page's HTML and pulling out specific fields like a title or a date. Most real pipelines do both at once, which is exactly what the script in this guide does, crawling to find pages and scraping a record from each.

Why does my crawler return empty or partial HTML?

Usually because the page renders its content in the browser with JavaScript, so the initial HTML is a thin shell and your links and fields are not in it yet. Fetch the page through the Crawling API with the JavaScript token and the ajax_wait and page_wait options, which render the page first and return the finished markup for you to parse.

How do I stop the crawler from leaving the site I am targeting?

Use a scope rule. The in_scope function compares each candidate link against the host and path of your start URL and rejects anything that does not match. Set the root path narrowly, for example https://example.com/blog/, to keep the crawl inside one section instead of the whole domain.

How does the crawler avoid visiting the same page twice?

Two layers. A visited set records every URL before it is fetched, so a page that is linked from many places is still requested only once. After the crawl, a dedupe pass keyed on the URL (with the trailing slash normalized) collapses any records that still describe the same page before they reach JSON and CSV.

Should I export to JSON or CSV?

Both, and let the downstream tool decide. JSON keeps the nested, typed shape that code and APIs prefer, while CSV drops straight into spreadsheets and pandas. The save_outputs function writes both from the same records, so you are not locked into one format. For more on the tradeoffs, see the difference between JSON and CSV.

How many pages can I crawl on the free tier?

Crawlbase includes 1,000 free requests to start, and you pay only for successful requests. Each page the crawler fetches is one request, so the max_pages ceiling in the script maps directly to your usage. For larger or recurring jobs, the async Crawler scales the same approach without holding open connections.
