Python is the default language for web scraping, and for good reason: a few lines of requests and BeautifulSoup turn a live web page into structured data you can save, query, and analyze. If you can read HTML and write a loop, you can build a working scraper today.
This guide shows you how to scrape a website with Python end to end. You install the standard stack, fetch a page, parse it, select the elements you want, extract clean fields, loop through pagination, and write the results to CSV. We use a public practice site so every snippet actually runs. Then comes the honest part: plain requests falls apart on JavaScript-rendered pages and gets blocked at scale, so you will see how the Crawling API fixes both problems in a single call.
What you will build
A small Python scraper that reads a paginated list of quotes from a public practice site, extracts each quote's text, author, and tags, follows the "next" link until there are no more pages, and saves everything to a CSV file. The same pattern, fetch then parse then loop then store, is the backbone of almost every scraper you will ever write.
We target quotes.toscrape.com, a site built specifically for learning to scrape. It is static, well structured, and fair game, so it lets you focus on the technique without fighting blocks on your first try.
Prerequisites
You do not need much to get started.
- Basic Python. You should be comfortable running a script and installing packages with pip. Loops, functions, and dictionaries are enough.
- Python 3.8 or later. Check your version with python --version. If you do not have it, install it from python.org.
That is it for the first half of the tutorial. The two libraries you need install in one command, which we cover next.
Set up the project
Create a virtual environment so the project's dependencies stay isolated from the rest of your system, then install the two libraries that do the work.
```bash
python --version
python -m venv scraper_env
source scraper_env/bin/activate
pip install requests beautifulsoup4
```
On Windows, activate the environment with scraper_env\Scripts\activate instead of the source line. Two dependencies carry the tutorial: requests fetches the page over HTTP, and beautifulsoup4 parses the returned HTML so you can pull out elements by tag and CSS class.
Step 1: Fetch a page
Every scrape starts with one HTTP request. Send a GET to the URL, check that the status code is 200 before you do anything else, and you have the page's HTML in hand.
```python
import requests

url = "https://quotes.toscrape.com/page/1/"
headers = {"User-Agent": "Mozilla/5.0 (scraper tutorial)"}

response = requests.get(url, headers=headers, timeout=10)

if response.status_code == 200:
    print(response.text[:500])
else:
    print(f"Request failed: {response.status_code}")
```
Two small habits pay off immediately. A User-Agent header replaces the default python-requests identifier, which some sites reject on sight, with a browser-style string. A timeout stops your scraper from hanging forever when a server stalls. Run this and you should see the first 500 characters of real HTML printed to your terminal. That confirms the fetch works before you write a single selector.
Step 2: Parse the HTML with BeautifulSoup
Raw HTML is just a string. To select elements you load it into BeautifulSoup, which turns the markup into a tree you can query by tag name and CSS class. Open the page in your browser, right-click a quote, and choose Inspect to see the structure: on this site each quote sits in a div.quote, with the text in span.text, the author in small.author, and tags in a.tag.
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")
quotes = soup.select("div.quote")
print(f"Found {len(quotes)} quotes on this page")
```
The html.parser argument tells BeautifulSoup which engine to use; it ships with Python and needs no extra install. The select method takes a CSS selector and returns every matching element as a list, so div.quote hands you all ten quote blocks on the page. If you prefer find and find_all, they do the same job with a method-call style instead of selectors. For a deeper tour of both, see how to use BeautifulSoup in Python.
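To make the comparison between the two query styles concrete, here is a minimal sketch that runs both against a small inline HTML string. The markup is a made-up stand-in that mimics the practice site's quote blocks, so it runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML that mimics the practice site's quote markup.
SAMPLE_HTML = """
<div class="quote"><span class="text">First quote</span></div>
<div class="quote"><span class="text">Second quote</span></div>
<div class="other">Not a quote</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# CSS-selector style: one string describes both tag and class.
by_select = soup.select("div.quote")

# Method-call style: tag name plus the class_ keyword
# (class is a reserved word in Python, hence the underscore).
by_find_all = soup.find_all("div", class_="quote")

# Both approaches return the same elements in document order.
texts_select = [el.get_text(strip=True) for el in by_select]
texts_find_all = [el.get_text(strip=True) for el in by_find_all]
print(texts_select == texts_find_all)  # True
```

Selectors tend to read better once you need nesting or several conditions at once; find_all is fine for a single tag-and-class lookup.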
Step 3: Extract the fields
Now pull the data out of each quote block. Loop over the elements, read the text from each child, and collect a clean dictionary per quote. Wrapping the selectors in a small helper keeps a missing field from crashing the whole run.
```python
def text_of(element, selector):
    el = element.select_one(selector)
    return el.get_text(strip=True) if el else None


def parse_quotes(soup):
    rows = []
    for quote in soup.select("div.quote"):
        tags = [t.get_text(strip=True) for t in quote.select("a.tag")]
        rows.append({
            "text": text_of(quote, "span.text"),
            "author": text_of(quote, "small.author"),
            "tags": ", ".join(tags),
        })
    return rows
```
The text_of helper does two useful things at once: it queries a single element and returns None when the element is missing, instead of throwing on a .get_text() call against nothing. Tags need a list comprehension because there are several per quote, and joining them into one string keeps each row flat and CSV-friendly. Call parse_quotes(soup) and you get a tidy list of dictionaries, one per quote.
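To see the None path in action, the sketch below runs the same helper against a quote block that has no author element. The HTML is a hypothetical sample written for this illustration, not markup copied from the site.

```python
from bs4 import BeautifulSoup


def text_of(element, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    el = element.select_one(selector)
    return el.get_text(strip=True) if el else None


# Hypothetical quote block with the author element deliberately left out.
SAMPLE_HTML = """
<div class="quote">
  <span class="text">A quote with no author</span>
  <a class="tag">life</a><a class="tag">humor</a>
</div>
"""

quote = BeautifulSoup(SAMPLE_HTML, "html.parser").select_one("div.quote")

row = {
    "text": text_of(quote, "span.text"),
    "author": text_of(quote, "small.author"),  # absent, so this becomes None
    "tags": ", ".join(t.get_text(strip=True) for t in quote.select("a.tag")),
}
print(row)
```

The row still gets built, with author set to None, so a single incomplete block does not abort the loop.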
Step 4: Follow pagination
One page is a demo; the real list runs across many pages. This site links the next page with a li.next a element, and when it is gone you have reached the end. So the loop is simple: fetch the current page, parse it, find the next link, and repeat until there is no next link.
```python
import time

BASE = "https://quotes.toscrape.com"


def scrape_all():
    all_rows = []
    next_url = f"{BASE}/page/1/"

    while next_url:
        response = requests.get(next_url, headers=headers, timeout=10)
        if response.status_code != 200:
            print(f"Stopped at {next_url}: {response.status_code}")
            break

        soup = BeautifulSoup(response.text, "html.parser")
        all_rows.extend(parse_quotes(soup))

        next_link = soup.select_one("li.next a")
        next_url = BASE + next_link["href"] if next_link else None
        time.sleep(1)

    return all_rows
```
The while next_url loop runs until the selector for the next link returns nothing, at which point next_url becomes None and the loop ends naturally. The href on the site is relative, so prepend the base URL to make it absolute. The time.sleep(1) between pages is not optional politeness on a real target: pacing your requests is the single easiest way to stay under a site's rate limits.
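Prepending the base URL works here because the site's href starts at the root. On other sites the link may be relative to the current page or already absolute, and the standard library's urljoin handles all three. A minimal sketch follows; next_page_url is a hypothetical helper name, and the second and third hrefs are invented variations rather than anything the practice site serves.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Resolve a next-page href against the URL of the page it was found on."""
    return urljoin(current_url, href)


current = "https://quotes.toscrape.com/page/1/"

# Root-relative href, the form the practice site uses.
root_relative = next_page_url(current, "/page/2/")

# Hypothetical variations another site might use.
page_relative = next_page_url(current, "../2/")
absolute = next_page_url(current, "https://example.com/page/2/")

print(root_relative)
print(page_relative)
print(absolute)
```

Resolving against the current page URL rather than a fixed base means the loop keeps working if a site changes how it writes its links.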
Step 5: Save to CSV
Data that lives only in memory disappears when the script ends. Write it to a CSV file so you can open it in a spreadsheet, load it into pandas, or feed it to whatever comes next. Python's built-in csv module handles this without extra dependencies.
```python
import csv


def save_csv(rows, filename="quotes.csv"):
    if not rows:
        return
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    data = scrape_all()
    save_csv(data)
    print(f"Saved {len(data)} quotes to quotes.csv")
```
DictWriter matches each dictionary's keys to CSV columns, so the header row writes itself from the field names you already chose. The newline="" argument prevents blank lines between rows on Windows, and encoding="utf-8" keeps quotation marks and accented author names intact. Run the script and you have a full CSV of every quote across every page. That is a complete, working scraper.
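A quick way to check that the rows survive the trip through CSV is to write them and read them straight back. The sketch below does that in memory with io.StringIO instead of a file; the two rows are invented sample data, chosen to include a comma and an accented character.

```python
import csv
import io

# Invented sample rows: one field contains a comma, one an accented letter.
rows = [
    {"text": "Sample quote, with a comma", "author": "Zoë Example", "tags": "life, love"},
    {"text": "Second sample quote", "author": "Sam Example", "tags": ""},
]

# Write to an in-memory buffer using the same DictWriter pattern as save_csv.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)

# Read the same buffer back; DictReader takes the header row as the keys.
buffer.seek(0)
restored = list(csv.DictReader(buffer))

print(restored == rows)  # True
```

The csv module quotes the fields that contain commas automatically, which is why the joined tags string comes back as a single column.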
Where plain requests stops working
The practice site above is static, which is exactly why it is a good first target. Real-world sites are rarely that kind. Two problems show up the moment you point this code at a serious target, and neither is solvable by tweaking selectors.
JavaScript-rendered pages
Many modern sites send a near-empty HTML shell and build the visible content in the browser with JavaScript. requests only retrieves that initial shell; it does not run any scripts. So when you parse the response you find none of the data you saw in your browser, because that data only appears after the page's JavaScript executes. A bare fetch simply cannot see it. For the full picture of this problem, see how to scrape JavaScript pages with Python.
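One rough way to tell whether you have hit this problem is to check if the elements you expect exist in the raw HTML at all. The sketch below is a heuristic only: looks_client_rendered is a hypothetical helper, and both HTML strings are invented examples of a shell page and a static page.

```python
from bs4 import BeautifulSoup


def looks_client_rendered(html, selector):
    """Heuristic: True when the expected elements are missing from the raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.select(selector)) == 0


# Invented shell page: an empty mount point plus a script tag.
SHELL = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'

# Invented static page: the data is already in the markup.
STATIC = '<html><body><div class="quote"><span class="text">Hi</span></div></body></html>'

shell_result = looks_client_rendered(SHELL, "div.quote")
static_result = looks_client_rendered(STATIC, "div.quote")
print(shell_result, static_result)  # True False
```

A True result can also mean the selector is simply wrong, so compare it with what the browser's view-source shows before concluding the page needs rendering.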
Blocks at scale
The second wall is anti-bot defense. Datacenter IPs, repetitive request patterns, and traffic that does not look like a real browser get challenged with CAPTCHAs or blocked outright. Your scraper might work for ten requests and then start returning 403s or empty pages. Adding headers and sleeps helps a little, but at any real volume you need IPs that read as genuine visitors, which a single machine cannot provide. The deeper playbook lives in how to scrape websites without getting blocked.
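Before reaching for new IPs, it is worth seeing what "adding sleeps" looks like when done carefully: retry with a pause that doubles each time. The sketch below assumes a fetch callable that returns a (status_code, body) pair. The helper name, the set of retryable statuses, and the fake fetcher are all illustrative choices, not part of requests.

```python
import time

# Status codes treated as "slow down and retry"; this set is an assumption.
RETRY_STATUSES = {403, 429, 503}


def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) -> (status_code, body), pausing longer after each
    throttled response and giving up after max_attempts tries."""
    status, body = None, None
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status not in RETRY_STATUSES:
            break
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return status, body


# Demo with a fake fetcher so the sketch runs without network access.
responses = iter([(429, ""), (429, ""), (200, "<html>ok</html>")])
delays = []  # records the pauses instead of actually sleeping
result = fetch_with_backoff(
    lambda url: next(responses), "https://example.com/", sleep=delays.append
)
print(result, delays)
```

Backoff helps with rate limits, but it does nothing when a site has flagged your IP outright, and that is the case the rest of this section addresses.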
The fix: render and rotate in one call
You can solve both problems yourself by running a headless browser to render JavaScript and maintaining a pool of rotating residential proxies for the IPs. That works, but stitching those pieces together and keeping them healthy is most of the engineering effort, and it has nothing to do with the data you actually want.
The Crawling API folds both into a single request. You send it the URL, it renders the page in a real browser behind a trusted rotating IP, and it returns finished HTML for you to parse with the exact same BeautifulSoup code you already wrote. Install the official client alongside the libraries you have.
pip install crawlbase
Here is the before and after. The plain fetch on a JavaScript-heavy page returns a shell; the Crawling API call returns the rendered page. The parsing layer below does not change at all.
```python
# Before: plain requests, breaks on JS pages and blocks
response = requests.get(url, headers=headers, timeout=10)
html = response.text

# After: Crawling API renders the page behind a trusted IP
from crawlbase import CrawlingAPI

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_JS_TOKEN"})
options = {"ajax_wait": "true", "page_wait": 5000}
result = api.get(url, options)
html = result["body"].decode("utf-8") if result["status_code"] == 200 else None

# Same parser as before, unchanged; skip parsing if the call failed
if html is None:
    rows = []
else:
    soup = BeautifulSoup(html, "html.parser")
    rows = parse_quotes(soup)
```
The two wait options matter on a client-rendered target. ajax_wait tells the API to wait for asynchronous content to finish loading, and page_wait holds for a fixed number of milliseconds after load so late elements appear before capture. Use the JavaScript token for sites that render in the browser; for static pages the normal token is faster. The important point is that html flows into the same parse_quotes function you wrote in Step 3, so adopting the API means swapping only the fetch step, not rewriting the parser.
Crawlbase offers two token types. The normal token fetches static HTML, which is all you need for a site like the quotes practice page. The JavaScript (JS) token renders the page in a real browser first, which you need for any site that builds its content client-side. If your parsed fields come back empty on a real target, switching to the JS token is usually the fix.
Plain requests breaks on JavaScript pages and gets blocked at scale. The Crawling API renders the page in a real browser, rotates through residential IPs server-side, and hands you finished HTML, so the BeautifulSoup code you already wrote keeps working on targets a bare fetch cannot touch. Try it on the free tier before you wire up a headless fleet of your own.
Useful Python scraping libraries
The two-library stack handles most static jobs, but a few others are worth knowing as your needs grow.
- requests is the workhorse HTTP client for fetching pages. Simple, reliable, and the right default for static targets.
- BeautifulSoup parses HTML and XML into a navigable tree. It is forgiving of messy markup, which real pages always have.
- Selenium drives a real browser, so it can render JavaScript and interact with pages by clicking and typing. Powerful, but heavy to run and slow at volume.
- Scrapy is a full crawling framework with built-in concurrency, retries, and pipelines. Reach for it when one script grows into a real project.
- pandas is not a scraper, but it is where scraped data often lands for cleaning, analysis, and export to other formats.
Habits that keep a scraper healthy
A scraper that works once is easy; one that keeps working takes a few disciplines. These apply whether you use plain requests or a managed API.
- Pace your requests. A small delay between requests, like the time.sleep(1) above, keeps you under rate limits and off block lists. Hammering a site in a tight loop is the fastest way to get throttled.
- Handle errors. Pages change, fields go missing, and servers hiccup. Check status codes, guard selectors against None, and wrap risky parsing so one bad page does not kill the whole run.
- Expect markup to drift. Class names and structures change without notice. When a field starts coming back empty, re-inspect the live page and update the selector. Periodic maintenance is normal, not a sign of a broken scraper.
- Respect the target. Read the site's robots.txt and terms, keep your volume reasonable, and collect only public data.
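Reading robots.txt does not have to be manual. Python's standard library ships urllib.robotparser, which answers "may I fetch this path?" for a given user agent. The sketch below parses a made-up robots.txt string so it runs offline; the rules, the "my-scraper" agent name, and the example.com URLs are all placeholders.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real scraper would load the target site's own file.
ROBOTS_TXT = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.strip().splitlines())

allowed = parser.can_fetch("my-scraper", "https://example.com/page/1/")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/data")
delay = parser.crawl_delay("my-scraper")

print(allowed, blocked, delay)  # True False 2
```

Against a live site you would call set_url with the robots.txt address and then read() instead of parse, and you could feed the crawl delay, when one is declared, into the pause between requests.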
Key takeaways
- The core loop is fetch, parse, loop, store. requests gets the HTML, BeautifulSoup extracts fields, pagination walks the pages, and the csv module saves the result.
- Inspect before you select. Open the page's dev tools to find the tags and classes that hold your data, then map each field to a CSS selector.
- Pace and guard your code. Add a delay between requests and return None on missing elements so one bad page does not crash the run.
- Plain requests has two limits. It cannot run JavaScript and it gets blocked at scale, neither of which selectors can fix.
- The Crawling API solves both in one call. It renders the page behind a trusted rotating IP and returns finished HTML, so your existing BeautifulSoup parser keeps working unchanged.
Frequently Asked Questions (FAQs)
Do I need both requests and BeautifulSoup?
For a typical static site, yes, and they pair naturally. requests fetches the page over HTTP and gives you the raw HTML as a string; BeautifulSoup turns that string into a tree you can query by tag and CSS class to pull out individual fields. requests does the downloading, BeautifulSoup does the extracting.
Why is my scraped data empty when the page clearly has content?
Almost always because the site renders its content with JavaScript. requests only retrieves the initial HTML shell and does not run scripts, so the data you see in your browser is not present in what you parse. You need to render the page first, either with a headless browser or with the Crawling API's JavaScript token, before BeautifulSoup can find the fields.
How do I scrape multiple pages?
Find the link or pattern the site uses for its next page, then loop. If there is a "next" button, follow its href until it disappears, as shown in Step 4. If the URLs follow a number pattern like /page/2/, you can build them in a range loop instead. Either way, add a short delay between pages to stay polite and unblocked.
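For the numbered-URL case, a minimal sketch looks like this. The page_urls helper is a hypothetical name, and the last page number is something you would need to know or discover in advance, which is the main drawback compared with following the "next" link.

```python
BASE = "https://quotes.toscrape.com"


def page_urls(first, last):
    """Build the list of page URLs from first to last, inclusive."""
    return [f"{BASE}/page/{n}/" for n in range(first, last + 1)]


urls = page_urls(1, 3)
for u in urls:
    print(u)
```

Each URL then goes through the same fetch-and-parse steps, with a pause between requests.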
How do I avoid getting blocked while scraping?
Pace your requests with a delay, send a realistic User-Agent header, and vary your targets instead of hammering one path. At scale you also need IPs that look like real visitors, which a single machine cannot provide. Routing through rotating residential IPs, whether via the Crawling API or the Smart AI Proxy, is what keeps high-volume runs from tripping rate limits.
When should I use the Crawling API instead of plain requests?
Use plain requests for static, low-volume targets where a bare fetch returns the data, like the practice site in this guide. Switch to the Crawling API when the page is JavaScript-rendered, when you are getting blocked or challenged, or when you need to scrape at a volume that a single IP cannot sustain. Because the API returns HTML, your existing parser does not change.
Is web scraping with Python legal?
Scraping public data is generally permissible, but it depends on the site's terms of service, your jurisdiction, and what you do with the data. Check the site's robots.txt and terms before you start, avoid personal data covered by privacy laws like GDPR, and never scrape content behind a login. When in doubt, collect only public data and keep your volume low enough that you are not straining the server.

