Most of the data on the internet lives in unstructured HTML, scattered across pages that were built for human eyes rather than for your scripts. Web scraping is how you turn that into structured records you can save, query, and analyze, without copy-pasting by hand or waiting for a website to ship an API it may never build. Price tracking, market research, lead lists, and training-set collection all start the same way: fetch a page, parse it, and store the fields you care about.

This guide walks the full Python workflow end to end. You will set up the toolkit, send a request, parse HTML with CSS selectors, handle JavaScript-rendered pages, follow pagination, store the results as CSV or JSON, and deal with the blocks that show up the moment you scale. Every snippet is real and copy-pasteable, and the whole walkthrough stays scoped to public data on a practice site built for learning.

What you will build

A small, complete Python scraper that reads a paginated list of book listings from a public practice site, pulls a clean record from each one, walks every page until there are no more, and writes everything to disk. The same shape, fetch then parse then loop then store, is the backbone of nearly every scraper you will ever write.

  • Title. The product name from each listing card.
  • Price. The displayed price string, ready to clean into a number.
  • Availability. Whether the item is in stock.
  • Rating. The star rating attached to each card.
  • URL. The absolute link to the detail page.

We target books.toscrape.com, a sandbox built specifically for practicing scraping. It is static, well structured, and fair game, so you can focus on technique without fighting blocks on your first attempt.

How web scraping works

A scraper is just an HTTP client plus a parser. The client requests a URL and the server returns HTML; the parser loads that HTML into a tree you can query by tag, class, or CSS selector, and you copy the values you want into a list of records. Search engines have worked this way since the early crawlers of 1993, and the mechanics have barely changed: discover URLs, fetch each one, extract structured fields, and move on.

What has changed is the modern web. Many sites now ship a near-empty HTML shell and render the visible content in the browser with JavaScript, and most serious targets defend themselves against automated traffic. Those two realities, client-side rendering and bot defense, are the reason a "comprehensive" guide cannot stop at requests and BeautifulSoup. We will start with the simple stack because it teaches the fundamentals, then show where it breaks and what replaces it.

Set up the Python toolkit

The Python scraping ecosystem is deep, but you only need a handful of tools to cover almost every job. Here is the modern toolkit and when each piece earns its place.

  • requests sends HTTP requests and returns the response. It is the right default for static pages.
  • BeautifulSoup parses HTML into a navigable tree and is forgiving of the messy markup real pages always have.
  • lxml is a fast parser backend that BeautifulSoup can use, and it brings full XPath support when you need it.
  • Selenium or Playwright drive a real browser so they can render JavaScript and interact with a page by clicking and typing.
  • Scrapy is a full crawling framework with built-in concurrency, retries, and pipelines, for when one script grows into a real project.

If you want a wider survey of what is available, see the best Python web scraping libraries. For this tutorial, start with a clean virtual environment and the two libraries that do the core work.

bash
python --version

python -m venv scraper_env
source scraper_env/bin/activate

pip install requests beautifulsoup4 lxml

On Windows, activate the environment with scraper_env\Scripts\activate instead of the source line. You need Python 3.8 or later; check with python --version and install from python.org if it is missing. With the environment active, you are ready to send your first request.

Step 1: Send a request and read the response

Every scrape begins with one HTTP request. Send a GET to the URL, confirm the status code is 200 before you do anything else, and the page's HTML is in hand.

python
import requests

url = "https://books.toscrape.com/catalogue/page-1.html"
headers = {"User-Agent": "Mozilla/5.0 (scraper tutorial)"}

response = requests.get(url, headers=headers, timeout=10)
if response.status_code == 200:
    print(response.text[:500])
else:
    print(f"Request failed: {response.status_code}")

Two small habits pay off right away. A User-Agent header replaces the default python-requests identifier, which some sites reject outright, with a browser-style string. A timeout stops the scraper from hanging forever when a server stalls. Run this and you should see the first 500 characters of real HTML printed to your terminal, which confirms the fetch works before you write a single selector.

Crawlbase Crawling API

That bare requests.get works on a static practice page, but on a real target it breaks on JavaScript and gets blocked at scale. The Crawling API takes the same URL, renders the page in a real browser behind a rotating residential IP, and returns finished HTML, so the parsing code in the next steps stays identical and you skip running a headless browser fleet and proxy pool yourself.

Step 2: Parse HTML with selectors

Raw HTML is just a string. To select elements you load it into BeautifulSoup, which turns the markup into a tree you can query by tag name and CSS class. Open the page in your browser, right-click a book card, and choose Inspect to read the structure: on this site each book sits in an article.product_pod, with the title in the h3 a title attribute, the price in p.price_color, availability in p.instock, and the rating encoded as a class on p.star-rating.

python
from bs4 import BeautifulSoup

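# Pass bytes so BeautifulSoup detects the encoding from the page itself;
# response.text can mis-decode "£" when the server omits a charset header.
soup = BeautifulSoup(response.content, "lxml")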
books = soup.select("article.product_pod")

print(f"Found {len(books)} books on this page")

The "lxml" argument tells BeautifulSoup to parse with the fast lxml backend you installed; if you skip the install, pass "html.parser" instead, which ships with Python. The select method takes a CSS selector and returns every match as a list, so article.product_pod hands you all twenty book cards on the page. If you prefer find and find_all, they do the same job with a method-call style. For a deeper tour of both styles, see how to use BeautifulSoup in Python, and for the difference between CSS selectors and XPath, see web scraping with XPath and CSS selectors.

Step 3: Extract clean fields

Now pull the data out of each card. Loop over the elements, read the value from each child, and collect one tidy dictionary per book. Wrapping the selectors in a small helper keeps a missing field from crashing the whole run.

python
from urllib.parse import urljoin

BASE = "https://books.toscrape.com/catalogue/"

def text_of(element, selector):
    el = element.select_one(selector)
    return el.get_text(strip=True) if el else None

def parse_books(soup):
    rows = []
    for card in soup.select("article.product_pod"):
        link = card.select_one("h3 a")
        rating = card.select_one("p.star-rating")
        rows.append({
            "title": link["title"] if link else None,
            "price": text_of(card, "p.price_color"),
            "availability": text_of(card, "p.instock"),
            "rating": rating["class"][1] if rating else None,
            "url": urljoin(BASE, link["href"]) if link else None,
        })
    return rows

The text_of helper queries a single element and returns None when it is missing, instead of throwing on a .get_text() call against nothing. The title and URL come from attributes rather than text, so we read them off the <a> tag directly. The rating is stored as a second class on p.star-rating (for example class="star-rating Three"), so we take the second class name. urljoin turns the relative href into an absolute URL. Call parse_books(soup) and you get a clean list of dictionaries, one per book.
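To make the urljoin step concrete, here is a small standard-library sketch of how relative links resolve. The first two hrefs mirror the ones this tutorial deals with; the third path is hypothetical and only there to show how "../" behaves.

```python
from urllib.parse import urljoin

# Base ends with a slash: the href is appended under /catalogue/.
detail = urljoin(
    "https://books.toscrape.com/catalogue/",
    "a-light-in-the-attic_1000/index.html",
)

# Base names a file: the file part is dropped before joining.
next_page = urljoin(
    "https://books.toscrape.com/catalogue/page-1.html",
    "page-2.html",
)

# "../" climbs one directory, the same way a browser resolves it.
image = urljoin(
    "https://books.toscrape.com/catalogue/page-1.html",
    "../media/cover.jpg",  # hypothetical path, for illustration only
)

print(detail)
print(next_page)
print(image)
```

The trailing slash on the base matters: without it, urljoin treats the last path segment as a file and replaces it.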

Step 4: Handle JavaScript-rendered pages

The practice site is static, which is exactly why it is a good first target. Many real sites are not: they send a near-empty shell and build the content in the browser with JavaScript. requests only retrieves that initial shell and never runs scripts, so when you parse the response the fields you saw in your browser are simply not there.
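Before reaching for heavier tooling, it can help to confirm that missing markup really is the problem. The sketch below is a rough heuristic, not a definitive test: the helper name looks_unrendered and both sample HTML strings are made up for illustration, and it only checks whether a marker you expect (such as a class name you saw in dev tools) appears in the raw response.

```python
def looks_unrendered(html, marker):
    """Return True when the expected marker is absent from the raw HTML."""
    return marker not in html


# Hypothetical responses: one server-rendered, one empty JavaScript shell.
static_html = '<article class="product_pod"><h3><a title="A Book">A Book</a></h3></article>'
shell_html = '<div id="root"></div><script src="/app.js"></script>'

print(looks_unrendered(static_html, "product_pod"))  # marker is present
print(looks_unrendered(shell_html, "product_pod"))   # marker is missing
```

If the marker is visible in the browser's inspector but absent from response.text, the content is most likely being built client-side.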

The classic fix is a real browser. Playwright (or Selenium) launches Chromium, lets the page's JavaScript run, and then hands you the fully rendered HTML, which flows into the same BeautifulSoup parser you already wrote.

python
# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

def render(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html

soup = BeautifulSoup(render(url), "lxml")

The wait_until="networkidle" option holds until the page stops making network requests, which is usually enough for client-rendered content to appear; on pages that poll or stream continuously, waiting for a specific element with page.wait_for_selector is a more reliable signal. This works, but a headless browser is heavy: it is slow at volume, hungry for memory, and brittle when a site detects automation. For the full treatment of this problem, see how to scrape JavaScript pages with Python and the dedicated Playwright web scraping guide.

Step 5: Follow pagination

One page is a demo; the real catalog runs across many pages. This site links the next page with a li.next a element, and when it is gone you have reached the end. So the loop is simple: fetch the current page, parse it, find the next link, and repeat until there is no next link.

python
import time

def scrape_all():
    all_rows = []
    next_url = BASE + "page-1.html"
    while next_url:
        response = requests.get(next_url, headers=headers, timeout=10)
        if response.status_code != 200:
            print(f"Stopped at {next_url}: {response.status_code}")
            break
        # Bytes in, so BeautifulSoup reads the page's own declared encoding.
        soup = BeautifulSoup(response.content, "lxml")
        all_rows.extend(parse_books(soup))

        next_link = soup.select_one("li.next a")
        next_url = urljoin(next_url, next_link["href"]) if next_link else None
        time.sleep(1)
    return all_rows

The while next_url loop runs until the next-link selector returns nothing, at which point next_url becomes None and the loop ends naturally. The site's href is relative, so urljoin resolves it against the current page. On a real target the time.sleep(1) between pages is more than optional politeness: pacing your requests is the single easiest way to stay under a site's rate limits.
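The loop above gives up on the first non-200 response. On a real target, a temporary 429 or 503 is often worth retrying after a growing pause. The following is a minimal sketch of exponential backoff, not part of the original scraper: the helper name get_with_backoff, the RETRYABLE set, and the default delays are all assumptions you would tune per site.

```python
import time

# Assumed set of status codes worth retrying; adjust for your target.
RETRYABLE = {429, 500, 502, 503, 504}


def get_with_backoff(get, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call get(url); on a retryable status, wait and try again.

    The wait doubles each time: base_delay, then 2x, then 4x.
    `get` is any callable that returns an object with a .status_code
    attribute, so the helper can be exercised without a network.
    """
    response = get(url)
    for attempt in range(retries):
        if response.status_code not in RETRYABLE:
            break
        sleep(base_delay * 2 ** attempt)
        response = get(url)
    return response
```

With requests, one way to supply the callable is functools.partial(requests.get, headers=headers, timeout=10). Taking the fetch function and the sleep function as parameters keeps the retry logic easy to test in isolation.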

Step 6: Store the data as CSV or JSON

Data that lives only in memory disappears when the script ends. Write it to disk so you can open it in a spreadsheet, load it into pandas, or feed it to whatever comes next. Python's built-in csv and json modules handle both formats with no extra dependencies. CSV is ideal for flat, tabular records; JSON preserves nested structure and is friendlier to other programs. If you are unsure which to pick, see JSON vs CSV main differences.

python
import csv
import json

def save_csv(rows, filename="books.csv"):
    if not rows:
        return
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

def save_json(rows, filename="books.json"):
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(rows, f, indent=2, ensure_ascii=False)

if __name__ == "__main__":
    data = scrape_all()
    save_csv(data)
    save_json(data)
    print(f"Saved {len(data)} books")

DictWriter matches each dictionary's keys to CSV columns, so the header row writes itself from the field names you already chose. newline="" prevents blank lines between rows on Windows, and encoding="utf-8" keeps accented characters intact. For larger projects you would write to a database instead of a file, but the records are identical: a list of dictionaries maps cleanly onto SQL rows or a document store. Run the script and you have a full export of every book across every page. That is a complete, working scraper.
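To illustrate how the same list of dictionaries maps onto SQL rows, here is a sketch using the standard library's sqlite3 module. The function name save_sqlite, the table layout, and the choice of url as the primary key are assumptions for this example rather than part of the scraper above.

```python
import sqlite3


def save_sqlite(rows, path="books.db"):
    """Write scraped records to a SQLite table, one row per book."""
    conn = sqlite3.connect(path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                """CREATE TABLE IF NOT EXISTS books (
                       url TEXT PRIMARY KEY,
                       title TEXT,
                       price TEXT,
                       availability TEXT,
                       rating TEXT
                   )"""
            )
            # Named placeholders pull values straight from each dictionary.
            conn.executemany(
                """INSERT OR REPLACE INTO books
                       (url, title, price, availability, rating)
                   VALUES (:url, :title, :price, :availability, :rating)""",
                rows,
            )
    finally:
        conn.close()
```

Using the URL as the key with INSERT OR REPLACE means re-running the scraper updates existing rows instead of duplicating them.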

What the output looks like

Each record is a flat dictionary, which serializes neatly to JSON. A single entry from books.json looks like this.

json
{
  "title": "A Light in the Attic",
  "price": "£51.77",
  "availability": "In stock",
  "rating": "Three",
  "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
}

The price still carries its currency symbol and the rating is a word rather than a number, which is normal: scrapers capture what the page shows, and a separate cleaning pass converts "£51.77" to 51.77 and "Three" to 3 before analysis. Keeping extraction and cleaning as distinct steps makes both easier to debug.
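A cleaning pass along those lines might look like the sketch below. The helper names clean_price and clean_record and the RATING_WORDS table are illustrative assumptions. The regular expression simply grabs the first decimal number, so it ignores whatever currency prefix precedes it but does not handle thousands separators.

```python
import re

# Assumed mapping from the site's rating class names to integers.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def clean_price(raw):
    """Extract the numeric part of a price string such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", raw or "")
    return float(match.group()) if match else None


def clean_record(row):
    """Return a copy of a scraped row with numeric price and rating."""
    cleaned = dict(row)
    cleaned["price"] = clean_price(row.get("price"))
    cleaned["rating"] = RATING_WORDS.get(row.get("rating"))
    return cleaned


record = {"title": "A Light in the Attic", "price": "£51.77", "rating": "Three"}
print(clean_record(record))
```

Missing or unrecognized values come back as None rather than raising, which keeps one bad record from stopping the whole pass.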

Why scrapers get blocked, and how to stay unblocked

The practice site never fights back, but real targets do. Two walls show up the moment you scale, and neither is solvable by tweaking selectors.

The first is anti-bot defense. Datacenter IPs, repetitive request patterns, and traffic that does not look like a real browser get challenged with CAPTCHAs or blocked outright. Your scraper might work for ten requests and then start returning 403s or empty pages. The second is client-side rendering, covered in Step 4: a bare fetch cannot see content the browser builds with JavaScript. You can fight both yourself by maintaining a pool of rotating residential proxies and running a headless browser fleet, but stitching those together and keeping them healthy is most of the engineering effort, and none of it is the data you actually want.

A managed crawling API folds both into a single request. You send it the URL, it renders the page in a real browser behind a trusted rotating IP, and it returns finished HTML for the exact same parser you already wrote. Install the official client alongside your existing libraries.

bash
pip install crawlbase

Keep your Crawlbase token handy; it is the authentication key for every call. The swap is one line: where you called requests.get, you call the API instead, and the returned HTML flows into the same parse_books function.

python
from crawlbase import CrawlingAPI

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

def fetch(url):
    options = {"ajax_wait": "true", "page_wait": 2000}
    result = api.get(url, options)
    if result["status_code"] == 200:
        return result["body"].decode("utf-8")
    return None

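html = fetch(url)
if html is None:
    # fetch returns None on a non-200 status; stop before parsing nothing.
    raise SystemExit(f"Fetch failed for {url}")
soup = BeautifulSoup(html, "lxml")  # same parser, unchanged
rows = parse_books(soup)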

The ajax_wait and page_wait options matter on a client-rendered target: ajax_wait waits for asynchronous content to finish, and page_wait holds for a fixed number of milliseconds so late elements appear before capture. Because the API returns HTML, dropping it in is a one-line change rather than a rewrite. For the full anti-blocking playbook, including header strategy and proxy rotation, see how to scrape websites without getting blocked.

Two token types

Crawlbase offers a normal token for fetching static HTML and a JavaScript token that renders the page in a real browser first. Use the normal token for static pages like the practice catalog; switch to the JavaScript token for any site that builds its content client-side. If your parsed fields come back empty on a real target, the JavaScript token is usually the fix.

Scaling beyond a single script

The fetch-parse-loop-store pattern carries you a long way, but two needs eventually push you past a single script. The first is concurrency: scraping pages one at a time is slow once you have thousands of URLs. The second is structure: retries, deduplication, and data pipelines do not belong in an ad-hoc loop. This is where Scrapy earns its place. It gives you parallel requests, automatic retries, request scheduling, and item pipelines out of the box, so you describe what to extract and the framework handles the orchestration.
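To give a feel for what concurrency and deduplication involve before adopting a framework, here is a standard-library sketch. It is not Scrapy and not a substitute for it: fetch_all and dedupe are hypothetical helpers, and the pool size is an assumption you should keep small.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(fetch, urls, max_workers=5):
    """Run fetch(url) for every URL on a small thread pool.

    Results come back in the same order as `urls`. Keep max_workers low:
    concurrency multiplies the load you put on the target.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def dedupe(rows, key="url"):
    """Drop rows whose key was already seen, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique
```

Here fetch is whatever function downloads and parses one page. Retries, scheduling, and per-domain throttling are the parts this sketch leaves out and a framework provides.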

Even with Scrapy, the two walls from the previous section do not disappear: you still need rendering for JavaScript pages and trusted IPs to avoid blocks at volume. The clean separation is to let the framework manage concurrency and pipelines while a managed API handles rendering and rotation, so a Scrapy spider's downloader simply routes each request through the Crawling API. That keeps your code about the data and offloads the infrastructure that has nothing to do with it.

Scrape responsibly and legally

Scraping public data is generally permissible, but how you do it and what you collect matter more than the act itself. Before you point a scraper at any site, read its robots.txt and its terms of service: the first signals which paths the site asks automated clients to avoid, and the second sets the rules you agree to by using it. Pace your requests so you never strain the server, identify your client honestly, and prefer a site's official API when one exists, since an API is the access path the owner actually built for programmatic use and it spares you the fragility of parsing HTML.
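Checking robots.txt can be automated with the standard library's urllib.robotparser. The sketch below parses an inlined, hypothetical robots.txt so it runs offline; the rules, the example.com URLs, and the user-agent name are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="scraper-tutorial"):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(allowed("https://example.com/catalogue/page-1.html"))
print(allowed("https://example.com/private/report.html"))
```

Against a live site you would instead call parser.set_url() with the site's robots.txt address and then parser.read() to download the rules before asking can_fetch.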

Stay on the right side of the line by scoping collection to public, non-personal data. Avoid anything behind a login, anything that requires accepting terms you would be circumventing, and personal data covered by privacy regimes like GDPR and the CCPA, where collection can require consent and a lawful basis. Do not redistribute copyrighted media you scrape, and when a project is commercial or touches regulated data, get the legal sign-off you would get for any other data source. Responsible scraping is mostly common sense: take only what is public, take it gently, and respect the wishes the site has already published.

Recap

Key takeaways

  • The core loop is fetch, parse, loop, store. requests gets the HTML, BeautifulSoup extracts fields, pagination walks the pages, and the csv or json module saves the result.
  • Match the tool to the page. requests and BeautifulSoup cover static sites; Playwright or Selenium render JavaScript; Scrapy adds concurrency and pipelines at scale.
  • Inspect before you select. Open the page's dev tools to find the tags and classes that hold your data, then map each field to a CSS selector.
  • Plain requests has two limits. It cannot run JavaScript and it gets blocked at scale, neither of which selectors can fix.
  • A managed API solves both in one call. The Crawling API renders the page behind a trusted rotating IP and returns finished HTML, so your existing parser keeps working unchanged.

Frequently Asked Questions (FAQs)

What is web scraping?

Web scraping is the automated extraction of data from web pages. A script requests a URL, the server returns HTML, and a parser pulls out the specific fields you want and saves them in a structured format like CSV, JSON, or a database. It is how you turn pages built for human reading into data you can query and analyze at scale.

Which Python libraries do I need to start?

For a typical static site, requests and BeautifulSoup are enough: requests downloads the page and BeautifulSoup extracts fields by tag and CSS class. Add lxml for faster parsing and XPath support, Playwright or Selenium when a site renders with JavaScript, and Scrapy when you need concurrency and pipelines for a larger project.

Why is my scraped data empty when the page clearly has content?

Almost always because the site renders its content with JavaScript. requests only retrieves the initial HTML shell and does not run scripts, so the data you see in your browser is not present in what you parse. Render the page first, either with a headless browser or with the Crawling API's JavaScript token, before BeautifulSoup can find the fields.

How do I scrape multiple pages?

Find the link or pattern the site uses for its next page, then loop. If there is a "next" button, follow its href until it disappears, as shown in Step 5. If the URLs follow a number pattern like page-2.html, you can build them in a range loop instead. Either way, add a short delay between pages to stay polite and unblocked.
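The range-loop variant can be sketched as follows. The helper name page_urls is hypothetical, and the approach assumes you already know the last page number, which the next-link loop does not require.

```python
BASE = "https://books.toscrape.com/catalogue/"


def page_urls(first, last):
    """Build page-N.html URLs for N from first to last, inclusive."""
    return [f"{BASE}page-{n}.html" for n in range(first, last + 1)]


for url in page_urls(1, 3):
    print(url)
```

If the page count is unknown, following the next link is the safer choice, since a guessed range either stops early or requests pages that do not exist.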

How do I avoid getting blocked while scraping?

Pace your requests with a delay, send a realistic User-Agent header, and avoid hammering a single path. At scale you also need IPs that look like real visitors, which one machine cannot provide. Routing through rotating residential IPs, whether via the Crawling API or the Smart AI Proxy, is what keeps high-volume runs from tripping rate limits and CAPTCHAs.

Is web scraping legal?

Scraping public data is generally permissible, but it depends on the site's terms of service, your jurisdiction, and what you do with the data. Check the robots.txt and terms before you start, avoid personal data covered by privacy laws like GDPR and the CCPA, never scrape content behind a login, and prefer an official API when one exists. When in doubt, collect only public data and keep your volume low enough that you are not straining the server.
