Web Scraping With XPath and CSS Selectors

Every scraper lives or dies on one decision: how it locates the element it wants inside a page full of markup. Get that wrong and your script breaks the next time the site nudges a class name or shifts a wrapper; get it right and the same parser runs for months. The two query languages you reach for here are XPath and CSS selectors, and most working scrapers use a mix of both rather than picking a side.

This guide is a practical selector reference. We walk through CSS selectors and XPath side by side, show the equivalent expression in each language for the same elements, run both against real Python libraries, and cover when one clearly beats the other. By the end you will know which to reach for on a given page, and how to write selectors that survive a redesign instead of shattering on it.

The two languages at a glance

CSS selectors are the patterns you already write in stylesheets: .price, #header, div > span. Browsers evaluate them constantly, every scraping library supports them, and they read cleanly for the common cases. They are the shortest path to "grab this element" when a page has sensible classes and IDs.

XPath ("XML Path Language") is a full query language for navigating a document tree. It treats the page as nodes you can walk in any direction: down to children, up to ancestors, sideways to siblings. It can match on text content, filter with boolean conditions, and combine predicates in ways CSS simply cannot express. That power costs some verbosity, but it pays off on messy or deeply nested pages.

Both target the same DOM. The difference is reach and ergonomics: CSS is concise and familiar, XPath is precise and expressive. Knowing where each one stops is the whole game.

CSS selectors, field by field

CSS selectors locate elements by tag, class, ID, attribute, relationship, and position. Here are the building blocks you will use daily, with the markup each one targets.

Tag, class, and ID. The three most common starting points. A bare tag name matches every element of that type, a leading dot matches a class, and a leading hash matches an ID.

css

a                  /* every anchor on the page */
.product-title     /* any element with class product-title */
#product-price     /* the element with id product-price */
span.price-label   /* span elements that also have class price-label */

Descendant vs child. A space means "anywhere inside," however deep. A > means "direct child only," one level down. The distinction matters when a layout nests the same tag at several depths and you want just the immediate one.

css

div.price-container span      /* any span inside, at any depth */
div.price-container > span    /* only spans that are direct children */

Attributes. Square brackets match on any attribute, not just class and ID. Exact match with [attr=val], substring with [attr*=val], prefix with [attr^=val], suffix with [attr$=val]. Attribute selectors are often your most stable hook, since data attributes change less than visual classes.

css

a[role='link']            /* anchors with role exactly "link" */
[data-testid='price']     /* any element with that test id */
a[href^='/product/']      /* anchors whose href starts with /product/ */

Position. Pseudo-classes select by place in the sibling order. :first-child, :last-child, and the workhorse :nth-of-type(n) let you grab the nth element of a given tag, which is how you pull "the second row" or "the fourth list item" out of a repeating block.

css

.product-list li:first-child       /* first item in the list */
ul.specs li:nth-of-type(3)         /* the third li */
table tr:nth-of-type(2) td         /* cells in the second row */

That set covers the large majority of real extraction work. Where CSS runs out of room is matching on the text an element contains, and walking up the tree from a known node. For those, you reach for XPath.

XPath, field by field

XPath expresses a path through the document. A leading // means "search anywhere in the tree," a single / means "direct child," and square brackets hold predicates that filter the matched nodes. Here are the same kinds of targets written in XPath.

Tag and descendant. The double slash is your everyday opener; it finds matching elements at any depth.

xpath

//div                         (: every div on the page :)
//div[@class='price-container']/span   (: direct span children :)

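Predicates on attributes. Inside the brackets you test attributes with @. Exact match is [@class='x'], which succeeds only when x is the entire attribute value. For classes that carry multiple space-separated values, contains(@class, 'x') is safer because it matches when x is one of several. Keep in mind that contains is a plain substring test, so it also matches longer names such as x-large.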

xpath

//*[@id='product-price']                  (: by id :)
//*[contains(@class, 'product-title')]    (: class among many :)
//a[@href]                                (: any anchor that has an href :)

Text matching. This is XPath's headline feature. You can select an element by the text it holds: exactly with text()='...', loosely with contains(text(), '...'), or with normalize-space() when the markup pads the text with whitespace. Standard CSS has nothing equivalent.

xpath

//button[text()='Add to Cart']
//span[contains(text(), 'In stock')]
//label[normalize-space()='Email address']

Position. XPath indexes are 1-based and live in predicates. You can also use functions like last() and position() to pick from the end or a range.

xpath

(//div[@class='product'])[1]      (: first matching product :)
//ul[@class='specs']/li[3]        (: the third li :)
//ul/li[last()]                   (: the final li :)

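Axes. The real power move. Axes let you move in directions CSS struggles with or cannot reach at all: following-sibling, preceding-sibling, parent, and ancestor. (CSS does have the + and ~ combinators for following siblings, but nothing in its classic syntax goes backward or upward.) The classic case is a label-value pair where you know the label text and want the value beside it.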

xpath

(: the value cell next to the "Founded" label :)
//th[text()='Founded']/following-sibling::td

(: walk up from a price to its product card :)
//span[@class='price']/ancestor::div[@class='card']

Those last two are the kind of query that has no clean CSS equivalent at all, which is exactly why XPath stays in the toolkit.

Side by side: the same element, both ways

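Seeing the languages line up makes the trade-offs concrete. For the common targets the two are near-equivalent, and CSS usually reads shorter. Treat the class rows as approximations: [@class='x'] matches only when x is the whole attribute value, and contains(@class,'x') is a substring test, while a CSS class selector matches x as one token among several.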

text

Goal                      CSS                          XPath
all anchors               a                            //a
class match               .product-title               //*[contains(@class,'product-title')]
id match                  #product-price               //*[@id='product-price']
tag + class               span.price-label             //span[@class='price-label']
descendant                .box span                    //*[@class='box']//span
direct child              .box > span                  //*[@class='box']/span
attribute exact           a[role='link']               //a[@role='link']
nth of type               li:nth-of-type(3)            //li[3]
text match                (not possible)               //button[text()='Buy']
walk up the tree          (not possible)               //span/ancestor::div[@class='card']

The pattern is clear: for tag, class, ID, attribute, and position the choice is mostly taste, and CSS wins on brevity. The last two rows are where XPath stands alone.

Running both in Python

Theory only goes so far; here is how each language looks in code. We use parsel (the selector library Scrapy is built on) because it speaks both CSS and XPath against the same parsed document, so you can compare them line for line. BeautifulSoup and lxml are the other common choices, noted after.

bash

python -m venv selectors_env
source selectors_env/bin/activate

pip install parsel

Load some markup once, then query it both ways. Note that parsel's .css() and .xpath() both return a selector list, so the access pattern is identical regardless of language.

python

from parsel import Selector

html = """
<div class="card">
  <h2 class="product-title">Wireless Mouse</h2>
  <span class="price">$24.99</span>
  <a role="link" href="/product/mouse">Details</a>
</div>
"""

sel = Selector(text=html)

# CSS: concise and familiar
title = sel.css("h2.product-title::text").get()
price = sel.css("span.price::text").get()
link  = sel.css("a[role='link']::attr(href)").get()

# XPath: the same three fields
title = sel.xpath("//h2[@class='product-title']/text()").get()
price = sel.xpath("//span[@class='price']/text()").get()
link  = sel.xpath("//a[@role='link']/@href").get()

print(title, price, link)

For BeautifulSoup the CSS path is soup.select_one("span.price") and soup.select(...) for many; it does not support XPath natively. When you need XPath specifically, lxml is the standard tool: tree.xpath("//span[@class='price']/text()") on a parsed lxml.html document. parsel is the convenient middle ground because it hands you both APIs on one object.
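
For comparison, this is roughly what the lxml route looks like on the same sample card. It is a sketch, not the only way to structure it; note that lxml's xpath() returns plain Python lists, so you index or guard them yourself instead of calling .get().

```python
import lxml.html

html = """
<div class="card">
  <h2 class="product-title">Wireless Mouse</h2>
  <span class="price">$24.99</span>
  <a role="link" href="/product/mouse">Details</a>
</div>
"""

# Parse the fragment into an element tree.
tree = lxml.html.fromstring(html)

# xpath() returns a list; an empty list means no match.
prices = tree.xpath("//span[@class='price']/text()")
price = prices[0] if prices else None

# Attribute values come back as strings in a list as well.
links = tree.xpath("//a[@role='link']/@href")
link = links[0] if links else None

print(price, link)
```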

CSS compiles to XPath under the hood

Libraries like parsel and lxml translate a CSS selector into XPath before running it (via the cssselect package). That is why anything you can express in CSS has an XPath equivalent, but not the reverse: text matching and upward axes have no CSS form to translate from. When a CSS selector cannot say what you mean, dropping to XPath is the natural next step, not a workaround.

When XPath wins

Reach for XPath when the page fights back. Three situations make it the clear choice.

You need to match on text. "The button that says Add to Cart" or "the row whose label is Founded" can only be expressed by content. //button[text()='Add to Cart'] and contains(text(), ...) have no CSS equivalent.
You need to walk up the tree. When you can reliably find a leaf, like a unique price, but the element you actually want is its container, ancestor::div[@class='card'] climbs back up. CSS only goes down and across, never up.
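You need compound conditions. XPath predicates combine with and, or, and not(): //div[@class='item' and @data-available='true'], or a filter on position, text, and attribute at once. CSS handles the simple cases (chained attribute selectors for and, a comma-separated list for or), but conditions that mix text, position, and attributes in one test are awkward or impossible there.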

The label-and-value pattern is the one you will hit most. On a spec table or profile sidebar, the field you want sits in a cell next to a stable label, while its own class is generic or absent. Anchoring on the label text and stepping sideways with following-sibling is far more durable than counting :nth-of-type positions that shift when a field is added or removed.

When CSS wins

For the everyday majority, CSS is the better default. It is shorter, more readable, and the syntax is one most developers already carry from front-end work, so a teammate can review your selectors without learning a second language. On a well-structured page with sane classes and IDs, .product-card .price says everything you need in less space than its XPath twin.

CSS also pairs naturally with browser-automation tools. When you are driving a headless browser and need to scrape dynamic content, the same CSS selectors you would write in document.querySelector carry straight over, which keeps your selector vocabulary consistent across the static-parse and live-DOM parts of a project. For simple, fast, repeated extraction on a tidy layout, CSS is the right tool, and you only escalate to XPath when CSS genuinely cannot express the target.

Crawlbase Crawling API

Selectors are unavoidable when you parse raw HTML yourself, but they are not always your job to write. The Scraper API auto-parses common page types, such as product pages, search results, and reviews, into structured JSON, so for supported targets you skip XPath and CSS entirely and just read fields off the response. Where you do need custom parsing, pair it with rendered HTML and parse with the selectors above. Start on the free tier.

Writing selectors that do not break

The hard part of scraping is not picking a language; it is writing selectors that survive the site's next deploy. The same robustness rules apply to both XPath and CSS.

Prefer stable attributes over visual classes. Hashed or utility classes like css-1x7a9q or mt-4 are generated and change often. A data-testid, an id, an itemprop, or an ARIA role is far more likely to outlive a restyle. Anchor on those when they exist.
Avoid long, deep chains. A selector like body > div > div > section > div:nth-child(2) > ul > li encodes the entire layout, so any wrapper added anywhere along that path breaks it. Match the nearest meaningful container and one stable hook instead of tracing the full ancestry.
Do not lean on brittle positions. :nth-of-type(4) assumes the count never changes. When a stable label or attribute is available, anchor on that and navigate relatively (which is where XPath axes shine), rather than hard-coding an index.
Use contains for multi-value classes. An element with class="btn btn-primary active" will not match [@class='btn-primary'] exactly. Use contains(@class, 'btn-primary') in XPath, or the plain .btn-primary class selector in CSS, which already matches one class among many.
Fail loud, not silent. Wrap extraction so a missing field returns None instead of crashing, then log which selector came back empty. That turns a site change from a mysterious blank record into a clear signal of which selector needs maintenance.

Treat selectors as code that needs upkeep. Markup drifts, and a scraper that ran clean last quarter will eventually return empty fields. The fix is almost always re-inspecting the live element in dev tools and tightening the selector, not rebuilding the scraper. For the wider setup, the guide on how to scrape a website with Python walks through fetching, parsing, and storing end to end, and these selector patterns slot directly into that flow.

Skipping selectors entirely

Sometimes the best selector is no selector. If your target is a common page type, the Crawling API returns parsed JSON directly, so there is nothing to select. For everything else, you still fetch and parse yourself, and the rendered HTML you parse can come from the Crawling API when a page is client-rendered or guarded. Either way the selector skills here are what turn raw markup into clean records, and knowing both languages means you are never stuck because one of them cannot reach an element.

Recap

Key takeaways

Both target the same DOM. CSS selectors are concise and familiar; XPath is verbose but more expressive. Most real scrapers use a mix.
CSS covers tag, class, ID, attribute, and position with short, readable patterns: .class, #id, div > span, [attr=val], :nth-of-type(n).
XPath does what CSS cannot: match on text with text() and contains(), walk up the tree with ancestor, step sideways with following-sibling, and combine conditions with and/or.
Run both in Python with parsel (.css() and .xpath() on one object); BeautifulSoup is CSS-only, lxml is the go-to for XPath.
Robustness beats cleverness. Prefer stable attributes, avoid deep chains and brittle indexes, use contains for multi-value classes, and fail loud when a field goes missing.
You can skip selectors on supported page types with the Crawling API's auto-parsing, and reserve hand-written selectors for custom targets.

Frequently Asked Questions (FAQs)

Which is better for beginners, XPath or CSS selectors?

CSS selectors, in most cases. The syntax overlaps with what you already know from styling pages, it reads cleanly for tag, class, ID, and attribute targets, and every scraping library supports it. Pick up XPath next, specifically for the things CSS cannot do: matching on text content and navigating upward or sideways through the tree.

Are XPath and CSS selectors supported by all scraping libraries?

Most support at least one and many support both. parsel and Scrapy handle CSS and XPath on the same object, lxml is built for XPath, and Selenium and Playwright accept both. BeautifulSoup is the notable exception: it supports CSS through .select() but has no native XPath. Check your library's docs before committing to a selector style.

Can a CSS selector match an element by its text?

Not in standard CSS. The selectors defined by the CSS specification match on tags, classes, IDs, attributes, and position only; some libraries add non-standard extensions for text, but those do not carry over between tools. When you need "the button that says Add to Cart" or "the cell next to the Founded label," that is exactly the case for XPath's text() and contains(text(), ...), which have no standard CSS equivalent.

Is XPath faster than CSS selectors?

In most scraping work the difference is negligible, since libraries often compile CSS down to XPath internally before running it. Choose based on expressiveness and readability rather than raw speed. If a CSS selector says what you need clearly, use it; reach for XPath when you need text matching, upward navigation, or compound conditions that CSS cannot express.

How do I write selectors that do not break when the site changes?

Anchor on stable hooks like id, data-testid, itemprop, or ARIA roles rather than generated visual classes. Keep selectors short by matching the nearest meaningful container instead of tracing the full ancestry, avoid hard-coded position indexes where a stable label exists, and use contains for multi-value classes. Then fail loud on missing fields so a markup change shows up as a clear signal, not a silent blank.

When should I skip selectors altogether?

When your target is a common page type that an auto-parsing service already understands. The Crawling API returns structured JSON for supported targets like product pages and search results, so there is no HTML to parse and no selector to maintain. Keep hand-written XPath and CSS for custom pages or fields the auto-parser does not cover.

Hassan Rehan

Software Engineer · Crawlbase

Software engineer at Crawlbase writing hands-on guides on rotating proxies, scraping, and the practical details of wiring proxies into real code.
