BeautifulSoup in Python is the library most people reach for when they need to pull structured data out of a messy HTML document. It turns a raw page into a navigable tree of Python objects, then gives you a small, readable API for finding the elements you care about and reading their text or attributes. You do not need to learn a query language or write a parser; you describe what you want with a tag name, an attribute, or a CSS selector, and BeautifulSoup hands it back.

This guide is a hands-on tour of that API. We install BeautifulSoup with a fast parser, build a soup from sample markup, then walk through find and find_all, the CSS-selector methods select and select_one, navigating the tree by parent and sibling, and reading text versus attributes. We finish with a realistic worked example that extracts a list of records and follows pagination. One thing to keep front of mind throughout: BeautifulSoup only parses. It never fetches a URL or runs JavaScript, so the HTML you give it has to already contain the data you want.

What BeautifulSoup does, and what it does not

BeautifulSoup is a parsing library. You hand it a string of HTML or XML and it builds a tree you can search and navigate. That is the entire job. It does not open network connections, it does not execute scripts, and it has no idea what a browser would render. Everything you extract has to be present in the markup you pass in.

That boundary matters because the two halves of a scrape are separate concerns. Fetching the page is one problem; parsing it is another. For static pages you can pair BeautifulSoup with the requests library to get the HTML. For pages that build their content client-side with JavaScript, a plain fetch returns a near-empty shell and there is nothing for BeautifulSoup to find. We come back to that case later. For now, treat BeautifulSoup as the parsing half of the pipeline and nothing more.

Install BeautifulSoup and a parser

BeautifulSoup itself ships in the beautifulsoup4 package. It also needs a parser to do the actual work of reading HTML. The standard library includes html.parser, which has zero extra dependencies and is fine for most jobs. For speed and for tolerance of broken markup, install lxml as well and use it as the parser.

bash
python -m venv bs_env
source bs_env/bin/activate

pip install beautifulsoup4 lxml requests

On Windows, activate the environment with bs_env\Scripts\activate instead of the source line. The requests install is optional; we use it only to fetch static pages in the worked example. Note that the package name and the import name differ: you install beautifulsoup4, but you import the BeautifulSoup class from bs4.

Create a soup

Building a soup takes two arguments: the markup and the name of the parser. To follow along without hitting a live site, start from an inline HTML string so the input is predictable.

python
from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <h1 id="title">Books</h1>
    <ul class="catalog">
      <li class="book"><a href="/b/1">Dune</a><span class="price">12.99</span></li>
      <li class="book"><a href="/b/2">Neuromancer</a><span class="price">9.50</span></li>
    </ul>
  </body>
</html>
"""

soup = BeautifulSoup(html, "lxml")
print(soup.title)  # None here; no <title> in the markup
print(soup.h1.get_text())  # Books

Swap "lxml" for "html.parser" if you did not install lxml; the rest of the API is identical. Accessing a tag by name, like soup.h1, returns the first matching element as a shortcut. It is handy for quick checks but limited, so the real searching happens through the methods below.
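Here is a short sketch of where the tag-name shortcut runs out. It reuses the sample markup with html.parser so it runs even without lxml. Chained names only reach the first match at each level, and a name that is not in the document gives None rather than raising.

```python
from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <h1 id="title">Books</h1>
    <ul class="catalog">
      <li class="book"><a href="/b/1">Dune</a><span class="price">12.99</span></li>
      <li class="book"><a href="/b/2">Neuromancer</a><span class="price">9.50</span></li>
    </ul>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

# Chained shortcuts walk to the first <ul>, its first <li>, then its first <a>.
first_link = soup.ul.li.a
print(first_link.get_text())

# There is no shortcut for "the second <li>"; that needs find_all or select.
# A tag name that does not occur in the document returns None.
missing = soup.table
print(missing)
```

Because a missing name returns None, a chain like soup.table.tr fails with an AttributeError on the second step, which is one more reason to keep the shortcut for quick checks only.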

Pick a parser on purpose

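The parser you choose changes how broken HTML is repaired. html.parser is built in and dependency-free. lxml is faster and more forgiving of malformed pages, which is most real-world pages. html5lib, a separate install, parses pages the way a web browser does but is slower. If you leave the parser out, BeautifulSoup picks the best one it finds installed, so the same script can build a different tree on another machine. When results differ between environments on a tricky page, that is usually the cause, so name the parser explicitly rather than letting BeautifulSoup guess.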
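To see the repair difference for yourself, the sketch below feeds one deliberately broken fragment to html.parser and, if it happens to be installed, to lxml. The fragment and the helper name repaired are illustrative assumptions, not taken from a real page.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Illustrative broken fragment: a closing </p> that was never opened.
broken = "<a></p>"

def repaired(markup, parser):
    """Return the markup as the given parser rebuilt it, or None if that parser is not installed."""
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        return None

builtin_result = repaired(broken, "html.parser")
lxml_result = repaired(broken, "lxml")

print("html.parser:", builtin_result)
print("lxml:       ", lxml_result)
```

If both parsers are available, the two printed trees should differ: html.parser leaves the fragment as a fragment, while lxml wraps it in html and body elements. Any selector that depends on those wrappers will behave differently between the two.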

find and find_all

The two workhorse methods are find and find_all. find returns the first element that matches, or None if nothing matches. find_all returns a list of every match, which is empty when nothing matches. Both take a tag name and optional filters.

python
first_book = soup.find("li")
print(first_book.a.get_text())  # Dune

all_books = soup.find_all("li")
print(len(all_books))  # 2

for book in all_books:
    print(book.a.get_text())

Filters narrow the search. You can match on a CSS class, an id, an arbitrary attribute, or a dictionary of attributes. Because class is a reserved word in Python, BeautifulSoup uses the keyword argument class_ with a trailing underscore.

python
# By class
prices = soup.find_all("span", class_="price")

# By id
heading = soup.find(id="title")

# By any attribute, via the attrs dict
links = soup.find_all("a", attrs={"href": True})

# Limit how many you get back
one_link = soup.find_all("a", limit=1)

You can also pass a list of tag names to match any of them, or a compiled regular expression to match tag names or attribute values by pattern. For most scraping, class and attribute filters cover the ground, and the CSS-selector methods below are often cleaner for nested conditions.
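As a quick sketch of those two less common filters, the example below runs a list of tag names and two compiled regular expressions against a trimmed copy of the sample markup. It uses html.parser so it runs without lxml; the patterns themselves are just illustrations.

```python
import re
from bs4 import BeautifulSoup

html = """
<h1 id="title">Books</h1>
<ul class="catalog">
  <li class="book"><a href="/b/1">Dune</a><span class="price">12.99</span></li>
  <li class="book"><a href="/b/2">Neuromancer</a><span class="price">9.50</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# A list of names matches any of them, returned in document order.
mixed = soup.find_all(["h1", "a"])
names = [el.name for el in mixed]

# A compiled regex as an attribute filter: hrefs that end in /2.
second = soup.find_all("a", href=re.compile(r"/2$"))

# A compiled regex as the tag name: any heading from h1 to h6.
headings = soup.find_all(re.compile(r"^h[1-6]$"))

print(names)
print([a.get_text() for a in second])
print([h.get_text() for h in headings])
```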

select and select_one with CSS selectors

If you already think in CSS selectors, select and select_one let you reuse that knowledge directly. select returns a list of every match; select_one returns the first match or None. They accept most of the selector syntax you would write in a stylesheet or pass to document.querySelectorAll, including descendant, child, attribute, and positional selectors.

python
# Descendant: every <a> inside a .book li
titles = soup.select("li.book a")

# First price under the catalog list
first_price = soup.select_one("ul.catalog .price")

# Attribute selector
internal = soup.select("a[href^='/b/']")

# By id
heading = soup.select_one("#title")

Selectors shine when the target is defined by its position in the tree, like "the link inside the second list item." A long chain of find calls reads worse than the equivalent one-line selector. Whether you prefer find_all or select is mostly taste; the two are interchangeable for most tasks, and a single script often mixes both. For a deeper comparison of selector styles, see web scraping with XPath and CSS selectors.
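The "link inside the second list item" case can be sketched both ways for comparison. This assumes the sample catalog markup and uses html.parser so it runs without lxml.

```python
from bs4 import BeautifulSoup

html = """
<ul class="catalog">
  <li class="book"><a href="/b/1">Dune</a><span class="price">12.99</span></li>
  <li class="book"><a href="/b/2">Neuromancer</a><span class="price">9.50</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# One selector: the <a> inside the second <li> among its siblings.
by_selector = soup.select_one("li.book:nth-of-type(2) a")

# The same target with find_all, indexing, and a length check.
items = soup.find_all("li", class_="book")
by_find = items[1].find("a") if len(items) > 1 else None

print(by_selector.get_text(), by_selector["href"])
print(by_find.get_text(), by_find["href"])
```

Both routes land on the same element here. The selector version also returns None on its own when there is no second item, whereas the find_all version needs the explicit length check.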

Navigate the tree

Once you have an element, you can move around the tree relative to it instead of searching from the top again. Every tag exposes its parent, its children, and its siblings, which is exactly what you need when the data you want is near an element you already found.

python
price = soup.select_one(".price")

# Up: the <li> that contains this price
row = price.parent

# Down: direct children, ignoring whitespace text nodes
children = [c for c in row.children if c.name]

# Sideways: the <a> just before the price in the same <li>
title_link = price.find_previous_sibling("a")
print(title_link.get_text())  # Dune

A few notes that save confusion. .children and .contents include text nodes such as the whitespace between tags, so filtering on c.name keeps only real elements. .find_next_sibling and .find_previous_sibling skip over those text nodes for you and accept a tag name to match. Use .find_parent to walk up to a specific ancestor rather than just the immediate parent. Relative navigation is the most reliable way to handle pages where the useful value sits next to a stable label.
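The label-and-value case might look like the sketch below. The spec table is hypothetical markup invented for illustration; the idea is to anchor on the label text, which tends to be stable, and step sideways to the value.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each row pairs a stable label with the value we want.
html = """
<table class="specs">
  <tr><th>Author</th><td>Frank Herbert</td></tr>
  <tr><th>Pages</th><td>412</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Find the label cell by its exact text.
label = soup.find("th", string="Pages")

# Sideways: the next <td> in the same row, skipping whitespace text nodes.
value = label.find_next_sibling("td") if label else None

# Up: the nearest enclosing <table>, however many levels away it is.
table = label.find_parent("table") if label else None

print(value.get_text(strip=True) if value else None)
print(table["class"] if table else None)
```

Because the lookup starts from the label rather than from a row index, it should keep working if the site reorders the rows or adds new ones.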

Get text and attributes

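Extraction comes down to two things: the text inside an element and the values of its attributes. For text, get_text returns all the string content of an element and its descendants joined together. Pass strip=True to strip whitespace from each piece of text and drop the pieces that end up empty, which you almost always want. When an element has several children, also pass a separator such as " " as the first argument so the pieces are not glued together.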

python
link = soup.select_one("li.book a")

# Text content
print(link.get_text(strip=True))  # Dune

# Attribute by key; raises KeyError if absent
print(link["href"])  # /b/1

# Safe attribute read with a default
print(link.get("title", ""))

Reading an attribute with square brackets, like link["href"], raises a KeyError when the attribute is missing, so prefer link.get("href") when an attribute may not exist. The difference between text and attributes trips up beginners: a link's visible label comes from get_text, but its destination URL comes from the href attribute, and the two have nothing to do with each other.
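Here is a small sketch of how get_text behaves on an element with more than one child. The extra spaces in the markup are deliberate, to make the whitespace handling visible.

```python
from bs4 import BeautifulSoup

html = '<li class="book"><a href="/b/1"> Dune </a> <span class="price">12.99</span></li>'
soup = BeautifulSoup(html, "html.parser")
row = soup.find("li")

# No arguments: every text node is concatenated exactly as written.
raw = row.get_text()

# strip=True: each piece is stripped and empty pieces are dropped,
# but the survivors are joined with nothing between them.
glued = row.get_text(strip=True)

# A separator plus strip=True keeps the pieces apart.
spaced = row.get_text(" ", strip=True)

print(repr(raw))
print(repr(glued))
print(repr(spaced))
```

For a multi-field element like this row, it is usually cleaner to select the link and the price separately than to split the combined string afterwards.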

Guard against missing elements

When a selector finds nothing, find and select_one return None, and calling .get_text() on None raises an AttributeError. Real pages are inconsistent: not every row has a price, not every card has a rating. Check that an element exists before reading from it, or write a small helper that returns None when the lookup fails, so one absent field does not crash a whole run.
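One possible shape for such a helper is sketched below. The name text_or_none and the second, price-less row are assumptions made up for this example.

```python
from bs4 import BeautifulSoup

def text_or_none(parent, selector):
    """Return the stripped text of the first match under parent, or None if nothing matches."""
    el = parent.select_one(selector)
    return el.get_text(strip=True) if el is not None else None

# The second row deliberately has no price element.
html = """
<ul>
  <li class="book"><a href="/b/1">Dune</a><span class="price">12.99</span></li>
  <li class="book"><a href="/b/3">Untitled draft</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

rows = [
    {"title": text_or_none(li, "a"), "price": text_or_none(li, ".price")}
    for li in soup.select("li.book")
]
print(rows)
```

The incomplete row comes back with price set to None instead of stopping the loop, and you can decide later whether to drop or repair such rows.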

A worked example: extract a list of records

Now put the pieces together on a static page that is built for practice scraping. The site quotes.toscrape.com serves plain server-rendered HTML, so requests can fetch it and BeautifulSoup can parse it directly. Each quote sits in a div.quote block with the text, the author, and a list of tags, which is a clean stand-in for the kind of repeated record you scrape in real jobs.

python
import requests
from bs4 import BeautifulSoup

def parse_quotes(html):
    soup = BeautifulSoup(html, "lxml")
    records = []
    for block in soup.select("div.quote"):
        text_el = block.select_one("span.text")
        author_el = block.select_one("small.author")
        tags = [t.get_text(strip=True) for t in block.select("a.tag")]
        records.append({
            "quote": text_el.get_text(strip=True) if text_el else None,
            "author": author_el.get_text(strip=True) if author_el else None,
            "tags": tags,
        })
    return records

url = "https://quotes.toscrape.com/"
resp = requests.get(url, timeout=15)
if resp.status_code == 200:
    for row in parse_quotes(resp.text):
        print(row)

The pattern here is the one you reuse everywhere: select the repeating container with select, then run a second, scoped query inside each container to pull individual fields. Scoping the per-field lookups to block rather than the whole document is what keeps row two's author from leaking into row one. Checking each element before calling get_text means a quote missing an author yields None instead of crashing the loop.

Follow pagination

One page is a demo; a full dataset usually spans many. The practice site links the next page through a li.next > a element, so the loop is straightforward: parse the current page, look for the next-page link, resolve it against the base URL, and stop when the link is gone.

python
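import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Reuses parse_quotes() from the previous example.
base = "https://quotes.toscrape.com/"
next_url = base
all_rows = []

while next_url:
    resp = requests.get(next_url, timeout=15)
    if resp.status_code != 200:
        break

    all_rows.extend(parse_quotes(resp.text))

    # Parse the page here as well, only to locate the next-page link.
    soup = BeautifulSoup(resp.text, "lxml")
    next_link = soup.select_one("li.next a")
    if next_link is None:
        break  # last page: stop without a pointless final delay

    # Resolve the href against the URL it was found on (the base for relative links).
    next_url = urljoin(resp.url, next_link["href"])
    time.sleep(1)

print(f"Collected {len(all_rows)} quotes")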

Two details make this robust. urljoin turns a relative href like /page/2/ into a full URL without string gymnastics, so it keeps working if the path shape changes. The time.sleep(1) spaces requests out so you are not hammering the server, which is both polite and the simplest way to stay under a rate limit. For a fuller treatment of fetching and structuring data end to end, see how to scrape a website with Python.
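If you want to exercise the pagination logic without touching the network, one option is to pass the fetch step in as a function. The sketch below is an assumption-heavy variant rather than the loop above: crawl_pages, the max_pages cap, the visited-URL set, and the two fake pages are all hypothetical additions for illustration.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def crawl_pages(start_url, fetch, max_pages=50):
    """Yield (url, soup) per page, following li.next a links.

    fetch is any callable that takes a URL and returns HTML text, or None on failure.
    """
    url, seen = start_url, set()
    # Stop on a missing link, a repeated URL (loop guard), or the page cap.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        html = fetch(url)
        if html is None:
            break
        soup = BeautifulSoup(html, "html.parser")
        yield url, soup
        next_link = soup.select_one("li.next a")
        href = next_link.get("href") if next_link else None
        url = urljoin(url, href) if href else None

# Stand-in for a real HTTP client: two fake pages keyed by URL.
FAKE_SITE = {
    "https://example.test/": (
        '<div class="quote">A</div>'
        '<ul><li class="next"><a href="/page/2/">Next</a></li></ul>'
    ),
    "https://example.test/page/2/": '<div class="quote">B</div>',
}

visited = [url for url, _ in crawl_pages("https://example.test/", FAKE_SITE.get)]
print(visited)
```

In a real run, fetch would wrap your HTTP client and include the status check and the delay between requests. Keeping it separate means the link-following code can be tested against canned HTML.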

When BeautifulSoup is not enough: JavaScript pages

Everything above assumes the data is in the HTML you fetched. Plenty of modern sites do not work that way. They send a minimal HTML shell and build the real content in the browser with JavaScript, pulling data from background API calls after the page loads. Fetch one of those with requests and the body you hand to BeautifulSoup has empty containers where the records should be. BeautifulSoup is doing its job correctly; the data was never in the string.
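The sketch below shows what that looks like from the parser's side. The shell markup is a hypothetical stand-in for a client-rendered response, and the final check is only a rough heuristic, not a reliable detector.

```python
from bs4 import BeautifulSoup

# Hypothetical body of a client-rendered page as a plain HTTP fetch would see it.
shell = """
<html>
  <head><title>Catalog</title></head>
  <body>
    <div id="root"></div>
    <script src="/static/app.js"></script>
  </body>
</html>
"""
soup = BeautifulSoup(shell, "html.parser")

# The selector is fine; there is simply nothing in the markup to match.
records = soup.select("li.book")

# Rough signs of a JavaScript shell: an empty mount point plus external scripts.
root = soup.select_one("#root")
root_is_empty = root is not None and not root.get_text(strip=True)
has_scripts = bool(soup.find_all("script", src=True))

looks_client_rendered = not records and root_is_empty and has_scripts
print(len(records), looks_client_rendered)
```

When a scrape unexpectedly returns nothing, printing or saving the raw HTML you fetched and searching it for a value you can see in the browser is usually the quickest way to confirm this is what happened.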

You have two ways out. You can run a real browser yourself with a tool like Selenium or Playwright, wait for the content to render, and pass the rendered HTML (driver.page_source in Selenium, page.content() in Playwright) to BeautifulSoup. That works but means running and maintaining a browser fleet, and on protected sites you also have to manage proxies and challenges. The other way is to offload the fetch-and-render step to a service that returns finished HTML, then parse that HTML with the same BeautifulSoup code you already wrote. Either way, the parsing layer does not change; only how you obtain the HTML does. For more on this split, see how to scrape JavaScript pages with Python.

Crawlbase Crawling API

BeautifulSoup only parses; it cannot render a JavaScript page or get you past an aggressive block. The Crawling API does the fetch-and-render half for you: send it a URL with a JS token, it runs the page in a real browser behind rotating residential IPs, and it returns finished HTML. You then parse that HTML with the exact same BeautifulSoup code in this guide. Try it on the free tier first.

Here is the shape of that pairing. The fetch goes through the Crawling API with a JavaScript token, and the returned body flows straight into your existing parser.

python
from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_JS_TOKEN"})

response = api.get("https://example.com/spa-page", {"ajax_wait": "true", "page_wait": 4000})

if response["status_code"] == 200:
    html = response["body"].decode("utf-8")
    soup = BeautifulSoup(html, "lxml")
    # same find/select calls as before
    print(soup.select_one("h1").get_text(strip=True))

If you would rather route your own client through rotating IPs instead of calling a managed endpoint, the Smart AI Proxy gives you residential rotation as a drop-in proxy, and for pre-parsed JSON the Crawling API returns structured fields for supported sites without any BeautifulSoup at all.

Recap

Key takeaways

  • BeautifulSoup only parses. It builds a searchable tree from HTML you already have; it never fetches a URL or runs JavaScript.
  • Install beautifulsoup4 plus a parser. Use html.parser for zero dependencies or lxml for speed and tolerance of broken markup, and name the parser explicitly.
  • Learn four methods. find and find_all search by tag and filters; select and select_one search by CSS selector. They are interchangeable for most tasks.
  • Read text and attributes separately. get_text(strip=True) gives the visible content; element["href"] or element.get("href") gives an attribute value.
  • Scope, guard, and paginate. Select the repeating container, query each field inside it, check for None, and follow next-page links with urljoin and a small delay.
  • For JavaScript pages, fix the fetch. Pair the Crawling API or a headless browser to get rendered HTML, then parse it with the same BeautifulSoup code.

Frequently Asked Questions (FAQs)

How do I install BeautifulSoup in Python?

Install it with pip install beautifulsoup4. The import name differs from the package name: you write from bs4 import BeautifulSoup in your code. BeautifulSoup also needs a parser to do the work. The built-in html.parser needs nothing extra, but installing lxml with pip install lxml gives you a faster, more forgiving parser, which is worth it for real pages.

What is the difference between find and find_all?

find returns the single first element that matches your criteria, or None if nothing matches. find_all returns a list of every matching element, which is empty when there are no matches. Use find when you expect exactly one element, like a page's main heading, and find_all when you are collecting many, like every row in a list. The CSS-selector equivalents are select_one and select.

How do I get the text inside an element versus an attribute?

Use element.get_text(strip=True) for the visible text content, including text from nested tags, with whitespace stripped. Use element["href"] to read an attribute value, or element.get("href") to read it safely: it returns None, or a default you supply as the second argument, when the attribute is missing. A link's label and its destination URL are separate: the label is text, the URL is the href attribute.

Why does BeautifulSoup return an empty result on some pages?

Almost always because the data is not in the HTML you parsed. Many sites render content in the browser with JavaScript, so a plain fetch returns an empty shell and BeautifulSoup correctly finds nothing. BeautifulSoup does not run JavaScript. To handle those pages, get rendered HTML first, either with a headless browser like Selenium or Playwright or with the Crawling API, then parse that rendered HTML with the same code.

Can BeautifulSoup handle pagination on its own?

Not by itself, because BeautifulSoup does not fetch pages. You handle pagination with a loop: parse the current page, use BeautifulSoup to find the next-page link, fetch that URL with your HTTP client, and repeat until there is no next link. Resolve relative links with urllib.parse.urljoin and add a short delay between requests so you do not overload the server.

Should I use lxml or html.parser as the parser?

Use lxml when you can: it is faster and handles malformed HTML more gracefully, which covers most real-world pages. Use the built-in html.parser when you want zero extra dependencies and the pages are well-formed. For markup that must be parsed exactly the way a browser would, html5lib is the most accurate, at the cost of speed. Always pass the parser name explicitly so behavior stays consistent across machines.
