Crawling a web page means writing software that walks a set of URLs, fetches each one, and pulls structured fields out of the HTML. It is how you turn pages built for human eyes into data you can query: price feeds for monitoring, article archives for research, listing grids for market analysis, or a training corpus for a model. Most of that data is public and sits in plain view, but reading it at any volume by hand is hopeless, so you reach for a crawler.

This guide shows you how to crawl a web page with Scrapy, Python's mature crawling framework, while routing every request through the Crawling API so the page comes back rendered and the request rides a rotating, trusted IP instead of your datacenter address. You will build a small, runnable Scrapy spider that fetches a search results page, parses a field from each listing, and prints clean records. The walkthrough stays scoped to public listing data, and there is a short note on crawling responsibly near the end that is worth reading before you point this at any real volume.

What you will build

A single-file Scrapy spider that fetches a search results page through the Crawling API and yields one structured record per result. We use an Amazon search as the running example, the same one the original version of this tutorial used, and pull two fields from each product card:

  • Title: the product title text shown on the result card.
  • URL: the link from the card into that product's own detail page.

Two fields keep the example readable, and the pattern extends to any selector you want to add. The same spider shape works for any site you have the right to crawl: swap the start URL and the selectors, and the fetch-and-parse loop stays the same.

Why a plain request gets blocked

Point a bare Scrapy request at a busy commercial site and two things tend to go wrong. First, many pages render their content in the browser: the initial HTML is a thin shell, and the listings only appear after the page's JavaScript runs. A raw fetch returns the shell, so there is nothing to parse. Second, large sites watch for automated traffic. A datacenter IP making fast, repetitive requests that do not look like a real browser gets challenged with a CAPTCHA or blocked outright, often before you see a single product.

So a crawler that actually works needs two things in the same request: a browser that renders the page, and an IP the site reads as a real visitor. You can build that yourself with a headless browser plus a pool of rotating residential proxies, but assembling those pieces and keeping them healthy is most of the work. The Crawling API folds both into one call. You hand it a URL, it fetches the page behind a trusted residential IP (and renders it in a real browser when you ask for that), and it returns finished HTML for Scrapy to parse. Your spider talks to one endpoint and never touches a proxy list.

How the routing works

Instead of requesting the target site directly, your spider requests https://api.crawlbase.com/?token=YOUR_CRAWLBASE_TOKEN&url=.... The API fetches the target on your behalf through its IP pool and streams the response body back to Scrapy. From Scrapy's point of view it is just an ordinary HTTP response, so every selector and pipeline you already know still works.
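As a rough illustration of what the wrapped URL contains, the sketch below assembles one by hand with the standard library and then recovers the target from it. The helper names build_api_url and target_from_api_url are made up for this example; in the spider itself, the crawlbase client's buildURL does the wrapping.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API_ENDPOINT = "https://api.crawlbase.com/"


def build_api_url(token: str, target_url: str) -> str:
    """Wrap target_url in a Crawling API request URL (hand-rolled sketch)."""
    # urlencode percent-escapes the target, so its own ?k=... query
    # cannot be confused with the API's query string.
    return API_ENDPOINT + "?" + urlencode({"token": token, "url": target_url})


def target_from_api_url(api_url: str) -> str:
    """Recover the original target URL from a wrapped API URL."""
    return parse_qs(urlsplit(api_url).query)["url"][0]


target = "https://www.amazon.com/s?k=cold+brew+coffee+maker"
wrapped = build_api_url("YOUR_CRAWLBASE_TOKEN", target)
print(wrapped)
print(target_from_api_url(wrapped))
```

The second helper matters later on: inside a Scrapy callback, response.url is the wrapped API URL, so the real page address has to be read back out of the url parameter or carried alongside the request.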

Prerequisites

A few things need to be in place before you write any code. None take long.

Basic Python. You should be comfortable writing and running a Python script and installing packages with pip. If you are new to crawling in general, the broader walkthrough on how to scrape a website with Python covers the fundamentals this tutorial assumes.

Python 3.8 or later. Confirm your version with python --version. If you do not have it, install it from python.org or through a distribution like Anaconda.

A Crawlbase account and token. Sign up, open your dashboard, and copy your token. Crawlbase gives you two: a normal token for static HTML and a JavaScript token for pages that need rendering. We use the placeholder YOUR_CRAWLBASE_TOKEN throughout. Treat it like a password: it authenticates your requests, so keep it out of version control.
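One common way to keep the token out of version control is to read it from an environment variable. A minimal sketch, assuming a variable named CRAWLBASE_TOKEN (the variable name and the load_token helper are illustrative choices, not anything Crawlbase requires):

```python
import os


def load_token(var_name: str = "CRAWLBASE_TOKEN") -> str:
    """Read the API token from the environment instead of hard-coding it."""
    token = os.environ.get(var_name, "").strip()
    if not token:
        # Fail early with a clear message rather than sending unauthenticated requests
        raise RuntimeError(f"Set the {var_name} environment variable to your Crawlbase token")
    return token
```

With this in place, the client could be created as CrawlingAPI({"token": load_token()}) rather than with a literal string in the source file.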

Set up the project

Create an isolated environment so the project's dependencies do not collide with anything else, then install the two libraries the spider needs.

bash
python --version

python -m venv crawler_env
source crawler_env/bin/activate

pip install scrapy crawlbase

On Windows, activate the environment with crawler_env\Scripts\activate instead of the source line. Two dependencies do the work. scrapy is the crawling framework: it manages the request queue, the downloader, and the parsing loop. crawlbase is the official Python client for the Crawling API, and its CrawlingAPI class has a buildURL helper that wraps any target URL into a proper API request, token and all, so you do not have to assemble that query string by hand.

Scrapy runs a single spider straight from a file with scrapy runspider, so you do not need a full project scaffold for this tutorial. Create one file to hold the spider:

bash
touch myspider.py

Step 1: Fetch a page through the Crawling API

Start with a spider that does nothing but prove the routing works. Subclass scrapy.Spider, give it a name, and set start_urls. The one trick here is that the start URL is not the target directly: you wrap it with api.buildURL so Scrapy requests the Crawling API endpoint, and the API fetches the target for you.

python
import scrapy
from crawlbase import CrawlingAPI

# Replace YOUR_CRAWLBASE_TOKEN with the token from your dashboard
api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

class AmazonSpider(scrapy.Spider):
    name = "amazonspider"

    # Target page to crawl, then route it through the Crawling API
    targets = ["https://www.amazon.com/s?k=cold+brew+coffee+maker"]
    start_urls = [api.buildURL(url, {}) for url in targets]

The spider has no parse method yet, so it will fetch the page and stop. That is on purpose: you want to confirm the request reaches the target through the API and comes back with status 200 before you write a single selector. Run it from the project directory:

bash
scrapy runspider myspider.py

In the log you should see a Crawled (200) line for a GET against api.crawlbase.com, with your target URL carried in the url query parameter. That 200 is the whole point of this step: the request went out through the Crawling API, the API fetched the Amazon search page behind a trusted IP, and the rendered HTML came back to Scrapy. Because there is no parser yet, Scrapy logs that the default parse callback is not defined and closes the spider. The plumbing is working; now you can extract data.

Crawlbase Crawling API

That Crawled (200) against a hard commercial target is the part most crawlers fail at. The Crawling API took the URL you passed to buildURL, fetched the page behind a rotating residential IP, rendered it in a real browser when needed, and handed Scrapy finished HTML, so you skip running a headless browser fleet and a proxy pool yourself. Point it at your own target on the free tier first.

Step 2: Parse fields with CSS and XPath selectors

Now add a parse method. Scrapy calls it automatically with the response for each fetched page, and the response exposes both CSS and XPath selectors over the HTML. For each product card on the search page, you pull the title and the link and yield a small dictionary. Scrapy collects whatever you yield as scraped items.

python
    # Needs `from urllib.parse import urljoin` with the other imports at the top of the file.
    def parse(self, response):
        # response.url is the Crawling API endpoint, so resolve relative
        # links against the original target rather than with response.urljoin
        target = self.targets[0]
        for card in response.css("div[data-component-type='s-search-result']"):
            title = card.css("h2 a span::text").get()
            href = card.css("h2 a::attr(href)").get()
            # Skip sponsored slots and layout rows missing either field
            if not title or not href:
                continue
            yield {
                "title": title.strip(),
                "url": urljoin(target, href),
            }

A few things are worth calling out. The card selector targets a stable data-component-type attribute rather than a brittle utility class, which is the kind of durable hook you should prefer on any site. card.css(...).get() returns the first match as text, or None when nothing matches, so the if not title or not href guard skips sponsored slots and layout rows that do not carry both fields. Because the request went through the Crawling API, response.url is the API endpoint rather than the Amazon page, so response.urljoin(href) would resolve the card's relative link against api.crawlbase.com; urljoin(target, href) resolves it against the target site instead. If you prefer XPath, the same two fields read as card.xpath(".//h2//a//span/text()").get() and card.xpath(".//h2//a/@href").get(). CSS and XPath are interchangeable here; pick whichever reads more clearly for a given field. The deeper comparison of the two lives in the guide on web scraping with XPath and CSS selectors.
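The guard and the link resolution can also be pulled into a small function so they can be checked without running a crawl. This is a sketch under a couple of assumptions: clean_record and SITE are hypothetical names, and links are resolved against the target site's root because the response Scrapy receives comes from the API endpoint, not from the site itself.

```python
from typing import Optional
from urllib.parse import urljoin

SITE = "https://www.amazon.com"  # assumed root of the site being crawled


def clean_record(title: Optional[str], href: Optional[str], site: str = SITE) -> Optional[dict]:
    """Return a cleaned record, or None when either field is missing."""
    title = (title or "").strip()
    if not title or not href:
        return None
    # urljoin leaves absolute links untouched and resolves relative ones against site
    return {"title": title, "url": urljoin(site, href)}


print(clean_record("  Cold Brew Maker \n", "/dp/B01CTIYU60"))
print(clean_record(None, "/dp/B01CTIYU60"))
```

Inside parse, the loop body would then reduce to calling clean_record(title, href) and yielding the result when it is not None.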

Selectors drift

Site markup changes without notice, and the selectors above are a starting template, not a contract. If title or url comes back None for every card, open the live page in your browser's dev tools, re-inspect a product card, and update the selector. Periodic selector maintenance is normal for any production crawler, not a sign something is broken.
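One way to notice drift early is to measure how many records in a run are missing a field. The helper below is a hypothetical sketch (missing_field_rate is not part of Scrapy); it treats an empty result set as fully broken, on the assumption that a page with no matching cards usually means the card selector itself has stopped matching.

```python
def missing_field_rate(records, fields=("title", "url")):
    """Fraction of records where at least one required field is empty or None."""
    if not records:
        return 1.0  # no cards matched at all: treat as fully broken
    bad = sum(1 for record in records if any(not record.get(f) for f in fields))
    return bad / len(records)


sample = [
    {"title": "Cold Brew Maker", "url": "https://www.amazon.com/dp/B01CTIYU60"},
    {"title": None, "url": "https://www.amazon.com/dp/B06XNVZDC7"},
]
print(missing_field_rate(sample))
```

A rate that jumps from near zero to near one between runs is a reasonable cue to re-inspect the page before trusting the data.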

Step 3: Assemble and run the full spider

Put the pieces together into one file. This is the complete, runnable spider: the import, the API client, the start URLs routed through buildURL, and the parse method.

python
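from urllib.parse import urljoin

import scrapy
from crawlbase import CrawlingAPI

# Replace YOUR_CRAWLBASE_TOKEN with the token from your dashboard
api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

class AmazonSpider(scrapy.Spider):
    name = "amazonspider"

    # Target pages, each wrapped so the request goes through the Crawling API
    targets = ["https://www.amazon.com/s?k=cold+brew+coffee+maker"]
    start_urls = [api.buildURL(url, {}) for url in targets]

    def parse(self, response):
        # response.url is the Crawling API endpoint, so resolve relative
        # links against the original target rather than with response.urljoin
        target = self.targets[0]
        for card in response.css("div[data-component-type='s-search-result']"):
            title = card.css("h2 a span::text").get()
            href = card.css("h2 a::attr(href)").get()
            # Skip sponsored slots and layout rows missing either field
            if not title or not href:
                continue
            yield {
                "title": title.strip(),
                "url": urljoin(target, href),
            }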

Run it and write the items straight to a file with Scrapy's built-in feed export, which serializes whatever the spider yields:

bash
scrapy runspider myspider.py -o products.json

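The -o products.json flag tells Scrapy to write every yielded item to a JSON file. Note that -o appends to an existing file, which leaves invalid JSON after a second run; use -O (capital O) to overwrite the file instead. Drop the flag and the items appear only in the console log. Either way, each Scraped from line in the log corresponds to one product, and the closing stats report how many items the run collected.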

What the output looks like

Each item is a small record with the two fields you yielded. The JSON file is a list of them, ready to load into a database, a notebook, or a downstream pipeline.

json
[
  {
    "title": "Airtight Cold Brew Iced Coffee Maker and Tea Infuser with Spout, 1.0L",
    "url": "https://www.amazon.com/Airtight-Coffee-Maker-Infuser-Spout/dp/B01CTIYU60"
  },
  {
    "title": "KitchenAid Cold Brew Coffee Maker, Brushed Stainless Steel",
    "url": "https://www.amazon.com/KitchenAid-KCM4212SX-Coffee-Brushed-Stainless/dp/B06XNVZDC7"
  }
]
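As one example of the "load into a database" path, the sketch below parses records of this shape and inserts them into an in-memory SQLite table. The table and column names are illustrative, and the JSON is inlined so the snippet runs on its own; against a real run you would read products.json with json.load instead.

```python
import json
import sqlite3

# Inlined sample with the same shape as products.json
sample = """
[
  {"title": "Airtight Cold Brew Iced Coffee Maker and Tea Infuser with Spout, 1.0L",
   "url": "https://www.amazon.com/Airtight-Coffee-Maker-Infuser-Spout/dp/B01CTIYU60"},
  {"title": "KitchenAid Cold Brew Coffee Maker, Brushed Stainless Steel",
   "url": "https://www.amazon.com/KitchenAid-KCM4212SX-Coffee-Brushed-Stainless/dp/B06XNVZDC7"}
]
"""
records = json.loads(sample)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (title TEXT NOT NULL, url TEXT PRIMARY KEY)")
# The URL is the primary key, so re-inserting the same product is ignored
conn.executemany(
    "INSERT OR IGNORE INTO products (title, url) VALUES (:title, :url)",
    records,
)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
print(count)
```

Keying on the URL is a design choice for this sketch: it makes repeated runs idempotent, at the cost of treating two URLs for the same product as distinct rows.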

Crawling more than one page

One search page is a demo. A real crawl follows the links you just collected, or walks the next pages of results, and Scrapy is built for exactly that. Instead of yielding a plain dictionary, yield a scrapy.Request for each URL you want to follow, route it through buildURL so it goes back over the Crawling API, and point it at a callback that parses the next page.

python
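    # Needs `from urllib.parse import urljoin` with the other imports at the top of the file.
    def parse(self, response):
        # response.url is the API endpoint, so resolve links against the target site
        target = self.targets[0]
        for card in response.css("div[data-component-type='s-search-result']"):
            href = card.css("h2 a::attr(href)").get()
            if href:
                product_url = urljoin(target, href)
                yield scrapy.Request(
                    api.buildURL(product_url, {}),
                    callback=self.parse_product,
                    # Hand the real product URL to the callback
                    cb_kwargs={"product_url": product_url},
                )

    def parse_product(self, response, product_url):
        yield {
            "title": response.css("#productTitle::text").get(default="").strip(),
            # response.url would be the wrapped API URL, token included
            "url": product_url,
        }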

Scrapy queues every request you yield, fetches them through its downloader, and calls the matching callback for each response, so a two-level crawl (search page, then each product page) is just two parse methods. Because each follow-up request is wrapped with buildURL, it rides the Crawling API too, which keeps the IP rotation and rendering consistent across the whole crawl. Keep crawls bounded with Scrapy settings like CLOSESPIDER_ITEMCOUNT while you are testing, and add a polite delay with DOWNLOAD_DELAY so you are not hammering the target. For sites that paint their listings with JavaScript, the same routing handles them once you request rendering; the guide on how to crawl JavaScript websites covers when you need that.
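A bounded test run might use settings along these lines. The setting names are standard Scrapy settings; the values and the TEST_RUN_SETTINGS name are only illustrative starting points, not recommendations from Crawlbase or Scrapy.

```python
# Conservative limits for a test crawl; tune the numbers for your own target.
TEST_RUN_SETTINGS = {
    "CLOSESPIDER_ITEMCOUNT": 20,          # stop after roughly 20 scraped items
    "CLOSESPIDER_PAGECOUNT": 10,          # or after 10 fetched responses, whichever comes first
    "DOWNLOAD_DELAY": 2.0,                # seconds to wait between requests
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,  # keep parallelism low while testing
}

for name, value in TEST_RUN_SETTINGS.items():
    print(f"{name} = {value}")
```

To apply them, assign the dictionary to custom_settings on the spider class (custom_settings = TEST_RUN_SETTINGS). Because every request in this setup goes to the same API host, Scrapy will most likely apply the delay and per-domain concurrency to the crawl as a whole rather than per target site.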

Staying unblocked

Routing through the Crawling API handles the two hardest parts, rendering and a trusted IP, but a few habits keep any longer crawl healthy.

  • Pace your requests. Set a DOWNLOAD_DELAY and let Scrapy's AutoThrottle adapt the rate instead of firing requests as fast as the framework can. Speed is what gets a crawler noticed.
  • Lean on rotation. A pool of residential IPs spreads requests across many real-user addresses so no single one trips a rate limit. The Crawling API does this for you; if you ever roll your own stack, this is the part to get right.
  • Read the status codes. A crawl that starts returning non-200 responses is telling you the current rate or IP tier is no longer enough. Treat that as a signal to back off, not noise to ignore.
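The pacing and status-code habits above can be sketched as follows. The AUTOTHROTTLE_* names are real Scrapy settings, but the values are placeholders, and StatusWindow is a hypothetical helper written for this example rather than anything Scrapy provides.

```python
from collections import deque

# AutoThrottle settings; the values here are illustrative, not tuned.
THROTTLE_SETTINGS = {
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_START_DELAY": 1.0,
    "AUTOTHROTTLE_MAX_DELAY": 30.0,
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,
}


class StatusWindow:
    """Track recent response statuses and flag when too many are not 200."""

    def __init__(self, size=50, max_error_rate=0.2):
        self.statuses = deque(maxlen=size)  # keeps only the most recent `size` statuses
        self.max_error_rate = max_error_rate

    def record(self, status):
        self.statuses.append(status)

    def should_back_off(self):
        if not self.statuses:
            return False
        errors = sum(1 for status in self.statuses if status != 200)
        return errors / len(self.statuses) > self.max_error_rate


window = StatusWindow(size=4, max_error_rate=0.25)
for status in (200, 200, 200, 429):
    window.record(status)
print(window.should_back_off())
window.record(503)
print(window.should_back_off())
```

Note that Scrapy filters most non-2xx responses out before they reach a callback by default, so feeding a tracker like this would likely mean recording statuses in an errback or listing the codes of interest in the spider's handle_httpstatus_list.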

The same patterns apply well beyond Python. If you want to compare the approach in another language, the walkthrough on how to build a web crawler in Java follows the same fetch-through-the-API-then-parse shape with a different toolchain.

Crawl responsibly

Stick to public data, the listing titles and links anyone can see without logging in, and stay off anything behind authentication, personal information, or copyrighted media you intend to redistribute. Honor each site's robots.txt and terms of service, which set the boundary for what you may collect and how, and keep your request rate reasonable so you are not straining someone else's servers. When a site offers an official API for the data you need, prefer it: it is the sanctioned path and usually the more stable one. None of the tooling here changes those obligations; it only makes the technical part work.
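Checking a URL against robots.txt rules can be done with Python's standard library. The sketch below parses an inlined, made-up rule set so it runs offline; for a real site you would point RobotFileParser at the target's own robots.txt with set_url and read. The user agent string my-crawler and the allowed helper are placeholders. Since requests in this setup go to the API host, it seems safest to check the target site's rules explicitly like this rather than rely on a setting that inspects the requested host.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real crawl would load the target's robots.txt
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())


def allowed(url: str, user_agent: str = "my-crawler") -> bool:
    """Return True when the parsed rules permit fetching url."""
    return parser.can_fetch(user_agent, url)


print(allowed("https://example.com/listings"))
print(allowed("https://example.com/private/report"))
```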

Recap

Key takeaways

  • Scrapy gives you the crawling framework. A spider subclass with start_urls and a parse method is the whole core, and scrapy runspider runs it from one file.
  • Route every request through the Crawling API. Wrap each target URL with api.buildURL so the request rides a rotating, trusted IP and comes back rendered, instead of hitting the site from your datacenter address.
  • Confirm a 200 before you parse. Run the spider with no parser first; a Crawled (200) against the API endpoint proves the routing works before you touch selectors.
  • Extract with CSS or XPath. The response exposes both; map each field to a durable selector, guard against missing matches, and expect selectors to drift over time.
  • Crawl responsibly. Stay on public data, honor robots.txt and terms of service, pace your requests, and prefer an official API when one exists.

Frequently Asked Questions (FAQs)

Why route Scrapy through the Crawling API instead of fetching directly?

Because a direct Scrapy request hits the target from your own IP and gets the raw, often unrendered HTML. On busy commercial sites that means CAPTCHAs, blocks, or an empty JavaScript shell. Routing through the Crawling API fetches the page behind a rotating residential IP and renders it when needed, so the HTML that reaches Scrapy is the finished page you can actually parse.

What does api.buildURL do?

It takes a target URL and returns the full Crawling API request URL for it, with your token and the target attached as query parameters. You point Scrapy at the URL buildURL returns, and the API fetches the target on your behalf. It saves you from assembling https://api.crawlbase.com/?token=...&url=... by hand and getting the escaping wrong.

Do I need the normal token or the JavaScript token?

It depends on the target. If the page serves its content in the initial HTML, the normal token is enough. If the listings only appear after the page's JavaScript runs, you need the JavaScript token so the API renders the page in a real browser before returning it. When fields you can see in your browser come back empty in Scrapy, that is the usual sign you should switch to the JavaScript token.

Can I use CSS and XPath selectors in the same spider?

Yes. Every Scrapy response exposes both response.css(...) and response.xpath(...), and you can mix them freely, even field by field. CSS is usually shorter for class and attribute matches, while XPath is handier for walking up the tree or matching on text. Use whichever reads more clearly for the field in front of you.

How do I crawl more than one page?

Yield a scrapy.Request for each URL you want to follow instead of a plain item, wrap that URL with api.buildURL so it goes back through the Crawling API, and give it a callback that parses the next page. Scrapy queues and fetches every request you yield, so a search-page-then-product-page crawl is just two parse methods. Cap the run with settings like CLOSESPIDER_ITEMCOUNT while testing.

My selectors return None. What changed?

Almost certainly the site's markup. Class names and container attributes change without notice, which breaks any selector tied to them. Open a live page in your browser's dev tools, re-inspect the element, prefer durable hooks like stable data- attributes where you can, and update the selector. Periodic selector maintenance is normal for any production crawler.
