Web scraping turns public web pages into structured data you can analyze, and the parsing step decides how clean that data comes out. Python has several parsing libraries, but Parsel stands out for being small, fast, and built around the two selector languages most scrapers already know: XPath and CSS. It is the same engine that powers Scrapy, and it works just as well on its own when you have raw HTML in hand and want to pull fields out of it in a few readable lines.

This guide is a runnable walkthrough. You install Parsel, fetch a rendered page through the Crawling API, load the HTML into a Selector, and extract data with both XPath and CSS using .get() and .getall(). From there you loop over a list of items, read text and attributes, clean the values, and export the result to JSON and CSV. The example target is books.toscrape.com, a public sandbox built specifically for practicing scraping, so you can run every snippet end to end without touching a real production site.

What you will build

A small Python script that fetches a catalog page, builds a Parsel Selector from the returned HTML, loops over the product cards, and extracts a structured record per item. From each book card we pull these fields:

  • Title: the book title, read from a link attribute.
  • Price: the listed price, cleaned into a number.
  • Availability: the in-stock text shown on the card.
  • Rating: the star rating, read from a CSS class.
  • Link: the absolute URL of the book's detail page.

Why Parsel for parsing in Python

Parsel is a standalone selector library. You hand it a string of HTML, it builds a tree, and you query that tree with XPath or CSS expressions. It occupies a useful middle ground: lighter than a full framework like Scrapy, and more selector-driven than BeautifulSoup, which leans on Python method chaining instead of selector strings. The reasons it earns a place in a scraping toolkit are straightforward:

  • Two selector languages. Use XPath when you need to navigate structure or match on text, and CSS when a short class or tag selector reads more clearly. Parsel supports both on the same object.
  • Small and fast. It is built on lxml, so parsing large documents stays quick, and there is almost no setup beyond importing one class.
  • Clean syntax. .get() returns the first match, .getall() returns every match, and chained selectors keep extraction code short and easy to maintain.

For a deeper reference on the selector languages themselves, the post on XPath and CSS selectors covers the syntax in detail. Here we focus on putting them to work with Parsel.

Why fetch through the Crawling API

Parsel parses HTML; it does not fetch pages. You still need something to retrieve the markup first, and that fetch step is where most scrapers run into trouble. A bare HTTP request works fine on a simple static page, but many modern sites render their content with JavaScript, so the raw response is a thin shell with the real data missing. Others watch for automated traffic and rate-limit or block requests that do not look like a real browser.

Fetching through the Crawling API solves both problems in one call. You send it a URL, it renders the page when needed, routes the request through a trusted rotating IP, and returns finished HTML, which you feed straight into a Parsel Selector. That keeps the fetch concerns (rendering, rotation, blocking) separate from the parse concerns (selectors, fields), which is the separation that keeps a scraper maintainable.

Prerequisites

Basic Python. You should be comfortable running a script and installing packages with pip. No prior Parsel experience is needed; this guide introduces the API as it goes.

Python 3.8 or later. Check your version with python --version. If you do not have Python, install it from python.org and make sure it is on your PATH.

A Crawlbase account and token. Sign up, open your dashboard, and copy your request token. Crawlbase includes 1,000 free requests to start, which is more than enough to work through this guide. Treat the token like a password and keep it out of version control.
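One common way to keep the token out of version control is to read it from an environment variable. The sketch below assumes a variable named CRAWLBASE_TOKEN; that name and the load_token helper are choices made for this example, not something the client library requires.

```python
import os


def load_token(env_var="CRAWLBASE_TOKEN"):
    """Return the token stored in an environment variable.

    The variable name is an assumption for this example; use whatever
    name fits your setup.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return token
```

With this in place you could pass load_token() wherever the snippets in this guide show "YOUR_CRAWLBASE_TOKEN".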

Set up the project

Create a virtual environment so the project's dependencies stay isolated, then install the two libraries the script needs.

bash
python --version

python -m venv parsel_env
source parsel_env/bin/activate

pip install parsel crawlbase

On Windows, activate the environment with parsel_env\Scripts\activate instead of the source line. parsel does the extraction, and crawlbase is the official client that fetches rendered pages for you. The json and csv modules ship with the standard library, so there is nothing else to install for the export step.

Step 1: Fetch a page and build a Selector

Start by fetching one catalog page through the Crawling API and loading its HTML into a Parsel Selector. Import CrawlingAPI, initialize it with your token, request the URL, and check the pc_status header before you parse so failures stay visible instead of silent.

python
from crawlbase import CrawlingAPI
from parsel import Selector

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

def fetch_html(page_url):
    response = api.get(page_url)
    if response["headers"]["pc_status"] == "200":
        return response["body"].decode("utf-8")
    print(f"Request failed: {response['headers']['pc_status']}")
    return None

if __name__ == "__main__":
    url = "https://books.toscrape.com/catalogue/page-1.html"
    html = fetch_html(url)
    if html:
        selector = Selector(text=html)
        print(selector.xpath("//title/text()").get())

Selector(text=html) is the entry point to everything that follows: it parses the string once and gives you an object you query with .xpath() and .css(). The final line reads the page title with an XPath expression, where /text() selects the text node and .get() returns the first match as a string. Run the file and you should see the catalog page's title printed, which confirms the fetch and the parse both work before you write a single field selector.
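fetch_html indexes straight into the response, which raises a KeyError if a header is ever absent. If you prefer a more defensive check, something like the following could work. It is a sketch that assumes only the response shape used above (a "headers" dict carrying "pc_status" and a bytes "body"); body_from_response is a hypothetical helper name.

```python
def body_from_response(response):
    """Return decoded HTML from a response dict, or None on failure.

    Assumes the shape used in fetch_html: a "headers" dict with a
    "pc_status" entry and a bytes "body".
    """
    headers = response.get("headers") or {}
    status = str(headers.get("pc_status", ""))
    body = response.get("body")
    if status != "200" or not body:
        return None
    # errors="replace" keeps one bad byte from aborting the whole page
    return body.decode("utf-8", errors="replace")
```

Because the function takes a plain dict, you can exercise it with hand-built responses before spending any real requests.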

Crawlbase Crawling API

The fetch_html step above is the part Parsel cannot do on its own, and on a real target it is where rendering and blocking get hard. The Crawling API takes your token, renders JavaScript pages when needed, rotates through residential IPs server-side, and hands back finished HTML, so you can feed it straight into a Selector without running a headless browser fleet or a proxy pool yourself. Start on the free tier with your 1,000 requests.

Step 2: Extract with XPath and CSS

Parsel lets you query the same Selector with either language. XPath stands for XML Path Language and navigates the document tree by structure, while CSS selectors target elements by tag, class, or id the same way a stylesheet does. The examples below read text with each language, then read the same link attribute both ways so you can compare the styles directly.

python
# XPath: select the text of the first h1
heading = selector.xpath("//h1/text()").get()

# CSS: select the text inside a known element
price = selector.css("p.price_color::text").get()

# Attributes: @attr in XPath, ::attr() in CSS
link_xpath = selector.xpath("//article//h3/a/@href").get()
link_css = selector.css("article h3 a::attr(href)").get()

Two patterns carry most of the work. To read text, use /text() in XPath or ::text in CSS. To read an attribute such as href or src, use @attribute in XPath or ::attr(attribute) in CSS. In every case .get() returns the first match, or None if nothing matches, so a missing element does not raise an error.

get vs getall

.get() returns the first matching value as a string. .getall() returns a list of every match. Reach for .get() when you expect a single value like a price, and .getall() when you want a whole column such as every title on the page.

Step 3: Loop over a list of items

Real pages hold many repeated items, not one. The pattern is to select the repeating container once, then iterate, running scoped selectors against each element to build one record per item. On the books sandbox, every product is an <article class="product_pod">, so that is the container we loop over.

python
def parse_books(selector):
    books = []
    for card in selector.css("article.product_pod"):
        title = card.css("h3 a::attr(title)").get()
        price = card.css("p.price_color::text").get()
        availability = card.css("p.instock.availability::text").getall()
        rating = card.css("p.star-rating::attr(class)").get()
        href = card.css("h3 a::attr(href)").get()

        books.append({
            "title": title,
            "price": price,
            "availability": availability,
            "rating": rating,
            "href": href,
        })
    return books

Calling .css("article.product_pod") returns a SelectorList you can iterate; each card is itself a Selector, so the inner .css() calls run against just that one card. The title lives in the link's title attribute, the price in a price_color paragraph, and the rating in a class such as star-rating Three, which is why we read the whole class attribute and clean it in the next step. The availability field uses .getall() because its text is split across several text nodes padded with whitespace, so at this stage the record holds a list of raw strings; the next step joins and strips them into a single clean string.

Step 4: Clean and normalize the values

Raw selector output usually needs a light pass before it is useful. Prices carry a currency symbol, the rating comes back as a two-word class, and the availability text arrives with surrounding whitespace. A few standard string operations turn each into a clean value.

python
BASE = "https://books.toscrape.com/catalogue/"
WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def clean_book(card):
    price_text = card.css("p.price_color::text").get(default="")
    price = float(price_text.replace("£", "").strip() or 0)

    rating_class = card.css("p.star-rating::attr(class)").get(default="")
    rating_word = rating_class.replace("star-rating", "").strip()
    rating = WORDS.get(rating_word)

    stock = " ".join(card.css("p.instock.availability::text").getall())
    href = card.css("h3 a::attr(href)").get(default="")

    return {
        "title": card.css("h3 a::attr(title)").get(),
        "price": price,
        "availability": stock.strip(),
        "rating": rating,
        "link": BASE + href,
    }

Two small habits make this code resilient. First, .get(default="") supplies a fallback so a missing element yields an empty string rather than None, which keeps the downstream .replace() and .strip() calls from raising. Second, the price parse strips the currency symbol (£, the pound sign) and converts the rest to a float, so the value sorts and filters as a number; a card with no price comes out as 0.0 because of the or 0 fallback. The rating maps the word in the class to an integer, and the relative href is joined onto the base URL to produce an absolute link.
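If you want the cleaning rules to be testable on their own, one option is to pull them into small functions that take plain strings. This is a sketch of an alternative, not the approach the script uses: parse_price, parse_rating, and absolute_link are hypothetical helpers, the regex swaps in for .replace() so stray characters around the number are ignored, and urljoin swaps in for string concatenation so hrefs containing ../ also resolve. The price pattern assumes no thousands separators.

```python
import re
from urllib.parse import urljoin

RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_price(text):
    """Return the first decimal number in text as a float, or None."""
    match = re.search(r"\d+(?:\.\d+)?", text or "")
    return float(match.group()) if match else None


def parse_rating(class_attr):
    """Map a class string such as "star-rating Three" to an integer."""
    for word in (class_attr or "").split():
        if word in RATING_WORDS:
            return RATING_WORDS[word]
    return None


def absolute_link(page_url, href):
    """Resolve href against the URL of the page it was found on."""
    return urljoin(page_url, href or "")
```

Returning None for an unparseable price, rather than 0, keeps "no price found" distinguishable from a genuinely free item; which behavior you want depends on how the data is used downstream.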

Step 5: Assemble the full script

Now wire the pieces into one runnable script: fetch the page, build the Selector, loop over the cards through clean_book, and export the records to both JSON and CSV.

python
import csv
import json
from crawlbase import CrawlingAPI
from parsel import Selector

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

BASE = "https://books.toscrape.com/catalogue/"
WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def fetch_html(page_url):
    response = api.get(page_url)
    if response["headers"]["pc_status"] == "200":
        return response["body"].decode("utf-8")
    print(f"Request failed: {response['headers']['pc_status']}")
    return None

def clean_book(card):
    price_text = card.css("p.price_color::text").get(default="")
    price = float(price_text.replace("£", "").strip() or 0)
    rating_class = card.css("p.star-rating::attr(class)").get(default="")
    rating = WORDS.get(rating_class.replace("star-rating", "").strip())
    stock = " ".join(card.css("p.instock.availability::text").getall())
    href = card.css("h3 a::attr(href)").get(default="")
    return {
        "title": card.css("h3 a::attr(title)").get(),
        "price": price,
        "availability": stock.strip(),
        "rating": rating,
        "link": BASE + href,
    }

def parse_books(html):
    selector = Selector(text=html)
    return [clean_book(card) for card in selector.css("article.product_pod")]

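def save_outputs(records):
    # Explicit UTF-8 keeps non-ASCII titles from failing on platforms
    # whose default encoding is not UTF-8
    with open("books.json", "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
    if not records:
        return
    with open("books.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)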

def main():
    url = "https://books.toscrape.com/catalogue/page-1.html"
    html = fetch_html(url)
    if not html:
        return
    records = parse_books(html)
    save_outputs(records)
    print(f"Saved {len(records)} books")

if __name__ == "__main__":
    main()

parse_books builds the Selector once and returns a list of cleaned records through a list comprehension over the cards. save_outputs writes a JSON file and a CSV that uses the keys of the first record as its header, so you get the data in whichever shape your downstream tool wants. To cover the whole catalog, loop over page-1.html through page-50.html inside main, extend one combined list with each page's records, and call save_outputs once at the end; the parse logic does not change.
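A multi-page run might be sketched like this. page_urls and scrape_pages are hypothetical helpers; the fetch and parse arguments stand for fetch_html and parse_books from the script, passed in as parameters so the loop can be tried with stand-ins first. The one-second default delay is an arbitrary choice.

```python
import time


def page_urls(first=1, last=50):
    """Build catalog page URLs following the page-N.html pattern."""
    template = "https://books.toscrape.com/catalogue/page-{}.html"
    return [template.format(n) for n in range(first, last + 1)]


def scrape_pages(urls, fetch, parse, delay=1.0):
    """Fetch and parse each URL in turn, returning one combined list.

    fetch(url) should return HTML or None, and parse(html) a list of
    records, like fetch_html and parse_books in the script above.
    """
    records = []
    for index, url in enumerate(urls):
        if index:
            time.sleep(delay)  # pause between requests, not before the first
        html = fetch(url)
        if not html:
            continue  # skip a failed page rather than abort the run
        records.extend(parse(html))
    return records
```

In main, the call would look like records = scrape_pages(page_urls(), fetch_html, parse_books), followed by a single save_outputs(records).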

What the output looks like

Save the script as books_scraper.py and run it with python books_scraper.py. You get a clean structured record per book, ready for analysis, a database, or a spreadsheet.

json
[
  {
    "title": "A Light in the Attic",
    "price": 51.77,
    "availability": "In stock",
    "rating": 3,
    "link": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
  },
  {
    "title": "Tipping the Velvet",
    "price": 53.74,
    "availability": "In stock",
    "rating": 1,
    "link": "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html"
  }
]

The matching CSV carries the same columns, one row per book, which drops straight into pandas or any spreadsheet for sorting by price or filtering on rating.
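If you would rather stay in the standard library, the CSV can be read back with csv.DictReader. One thing to keep in mind is that every CSV cell comes back as a string, so numeric columns need converting before you sort or filter. load_books and cheapest_with_rating below are hypothetical helpers written for this example, and the column names are the ones the script exports.

```python
import csv


def load_books(path, encoding="utf-8"):
    """Read the exported CSV, restoring numeric types.

    Pass the same encoding that was used when the file was written.
    """
    with open(path, newline="", encoding=encoding) as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        row["price"] = float(row["price"])
        # A rating of None is written to CSV as an empty cell
        row["rating"] = int(row["rating"]) if row["rating"] else None
    return rows


def cheapest_with_rating(rows, min_rating):
    """Return rows rated at least min_rating, cheapest first."""
    kept = [r for r in rows if r["rating"] is not None and r["rating"] >= min_rating]
    return sorted(kept, key=lambda r: r["price"])
```

For example, cheapest_with_rating(load_books("books.csv"), 4) would list the four- and five-star books from lowest price to highest.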

Common mistakes to avoid

A few habits separate a scraper that holds up from one that breaks on the next run.

  • Inspect the page before writing selectors. Open the page in your browser's dev tools and confirm the class names and structure. A selector aimed at an element that does not exist returns nothing, and that is the most common reason a scrape comes back empty.
  • Always handle missing data. Use .get(default="") or guard against None so a single absent field does not crash the whole loop. Pages are rarely as uniform as they look.
  • Strip and normalize text. Web text carries stray whitespace and currency symbols. Clean it with .strip() and .replace() at parse time so your stored values are consistent.
  • Pace your requests. Fetching pages in a tight loop is the quickest way to get throttled. Add a short delay between requests and keep your volume reasonable.
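The pacing advice pairs naturally with retries: when a fetch fails, wait a little longer before each new attempt instead of hammering the site. A minimal sketch, assuming a fetch function that returns HTML or None as fetch_html does; fetch_with_retries and its default values are illustrative choices, not part of any library.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns HTML or the attempts run out.

    Waits base_delay, then twice that, then four times that between
    tries. The sleep parameter exists so tests can swap in a no-op.
    """
    for attempt in range(attempts):
        html = fetch(url)
        if html:
            return html
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # exponential backoff
    return None
```

A call such as fetch_with_retries(fetch_html, url) then stands in for a direct fetch_html(url) wherever a transient failure is worth a second try.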

Scraping responsibly

Parsel only parses HTML you already hold, but how you obtain that HTML still matters. A few principles keep any scraping project on the right side of the line, whatever the target.

Check the site's terms of service and its robots.txt before you collect anything, and treat both as boundaries rather than suggestions. Stay on public data that any visitor can see without logging in, and keep your request rate reasonable so you are not straining the site's servers. When a project touches personal data, the obligations grow: regulations such as the GDPR and CCPA govern how personal information may be collected and used, so handle those cases with extra care or avoid them entirely. The example here uses a sandbox built for practice precisely so you can learn the mechanics without any of those concerns, and the same discipline carries over when you point your scraper at a real site. For more on operating within a site's limits, see how to scrape websites without getting blocked.
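The robots.txt check can be automated with the standard library's urllib.robotparser. The sketch below parses rules from a string so it runs offline; the sample rules and the build_robots_checker helper are invented for this example and do not describe any real site.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt, user_agent="*"):
    """Return a function that reports whether a URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed


# Sample rules invented for this example, not taken from any real site
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

allowed = build_robots_checker(SAMPLE_ROBOTS)
print(allowed("https://example.com/catalogue/page-1.html"))
print(allowed("https://example.com/private/data.html"))
```

Against a live site you would instead point the parser at the real file with set_url() and read(), then call can_fetch() before each request. A passing check covers robots.txt only; the terms of service still need a human read.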

Recap

Key takeaways

  • Parsel is selector-first. Build one Selector(text=html) and query it with XPath or CSS, whichever reads more clearly for the element at hand.
  • get and getall cover most extraction. .get() returns the first match as a string, .getall() returns every match as a list, and .get(default="") keeps missing fields from crashing the run.
  • Text and attributes have a fixed pattern. Read text with /text() or ::text, and attributes with @attr or ::attr(), in XPath and CSS respectively.
  • Loop over a container, not the whole page. Select the repeating element once, then run scoped selectors against each item to build one clean record apiece, and export to JSON and CSV.
  • Separate fetch from parse. Let the Crawling API handle rendering, rotation, and blocking, then hand the finished HTML to Parsel so your extraction code stays simple.

Frequently Asked Questions (FAQs)

What is Parsel and why use it for web scraping?

Parsel is a small, fast Python library for extracting data from HTML and XML using XPath and CSS selectors. It is the same selector engine Scrapy uses, and it works well as a standalone tool when you already have the HTML and want to pull fields out of it. People choose it for the clean syntax, the support for both selector languages on the same object, and how easily it slots into an existing pipeline.

What is the difference between Parsel and BeautifulSoup?

Both parse HTML, but they differ in style. Parsel is selector-driven: you write XPath or CSS expressions and call .get() or .getall(). BeautifulSoup leans on Python methods such as find and find_all, with CSS selectors available through select(). Parsel also supports XPath natively, which BeautifulSoup does not. Choose whichever fits how you prefer to express selections.

What is the difference between get and getall in Parsel?

.get() returns the first matching value as a string, or None if nothing matches. .getall() returns a list of every matching value. Use .get() for a single field like a price or a title, and .getall() when you want a whole set, such as every link on a page. Passing .get(default="value") supplies a fallback for missing elements.

How do I handle pages that load content with JavaScript?

Parsel parses whatever HTML you give it, so the question is how you fetch that HTML. If a page renders its content with JavaScript, a raw request returns a thin shell with the data missing. Fetching through the Crawling API renders the page first and returns finished HTML, which you then load into a Selector exactly as shown here. The parsing code does not change.

Can I export Parsel results to JSON or CSV?

Yes. Parsel hands you plain Python values, so once you have built a list of dictionaries you write JSON with the standard json module and CSV with csv.DictWriter, as the full script does. From there the data drops into pandas or a database without any extra conversion.

Why use Parsel with the Crawling API instead of a plain request?

A plain request often fails before Parsel ever runs: the page may render client-side, or the site may block traffic that does not look like a real browser. The Crawling API handles rendering, IP rotation, and CAPTCHA challenges, then returns clean HTML. That keeps the fetch concerns out of your parsing code, so Parsel can focus on the one thing it does well, which is turning HTML into structured fields.
