Wikipedia holds a lot of its most useful data inside tables: lists of countries by area, population figures, sports standings, historical timelines, and thousands of other sortable grids. That tabular data is public and well structured, which makes it ideal raw material for research and analysis, but copying it cell by cell into a spreadsheet does not scale past one or two tables.

This guide shows you how to scrape Wikipedia tables with Python the clean way. You build a small, runnable script that fetches an article through the Crawling API, locates the table you want, parses it straight into a pandas DataFrame, cleans it, and exports a tidy CSV. The walkthrough stays scoped to public table data, and the legality section near the end covers Wikipedia's CC-BY-SA license and the sanctioned bulk paths, so read it before you point this at many articles. This post is specifically about tables; for page titles, images, and infobox fields, see the companion guide on how to scrape Wikipedia.

What you will build

A Python script that takes a public Wikipedia article URL, fetches the rendered HTML through the Crawling API, selects a specific table by its CSS class, reads it into a pandas DataFrame, cleans the columns, and writes the result to CSV. The running example is the article "List of countries and dependencies by area". The output record carries these fields per row:

  • Rank: the row position in the source table, where present.
  • Country / dependency: the name as it appears in the table cell.
  • Total area: the area figure in square kilometres and square miles.
  • Land area: the land-only area, where the table breaks it out.
  • Water area: the water area and its percentage of the total.
  • Notes: any footnote or annotation column the table carries.

Understanding the structure of Wikipedia tables

Wikipedia tables are written in a mix of HTML and wikitext, but by the time the page reaches your scraper they have been rendered into ordinary <table>, <tr>, <th>, and <td> elements with header rows and data cells.

The detail that matters most for scraping is the CSS class. Most data tables carry the wikitable class, which applies the standard bordered style you see across articles. That shared class is your anchor: it lets you select data tables and skip layout or navigation tables on the same page, and sortable tables add a sortable class as a further hint of clean columnar data. An article often contains several tables, so you will usually narrow down by class, by caption text, or by index once you know which one you want.
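
Before choosing a selector, it can help to list what tables a page actually contains. The sketch below is one way to do that with BeautifulSoup; describe_tables and SAMPLE_HTML are hypothetical names used for illustration, and the sample markup is a hand-written stand-in rather than real article HTML.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for an article: one layout table and two data tables.
SAMPLE_HTML = """
<table class="infobox"><tr><td>Sidebar</td></tr></table>
<table class="wikitable sortable">
  <caption>Countries by area</caption>
  <tr><th>Rank</th><th>Country</th></tr>
  <tr><td>1</td><td>Russia</td></tr>
</table>
<table class="wikitable">
  <tr><th>Year</th><th>Event</th></tr>
  <tr><td>1991</td><td>Example</td></tr>
</table>
"""

def describe_tables(html):
    """Return one summary dict per <table>, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    summaries = []
    for position, table in enumerate(soup.find_all("table")):
        classes = table.get("class") or []  # list of class names, or empty
        caption = table.find("caption")
        summaries.append({
            "position": position,
            "classes": classes,
            "is_wikitable": "wikitable" in classes,
            "caption": caption.get_text(strip=True) if caption else None,
        })
    return summaries

if __name__ == "__main__":
    for summary in describe_tables(SAMPLE_HTML):
        print(summary)
```

Running this against a real article once tells you which class, caption, or position identifies the table you are after.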

Why a plain request can fall short

For one article a bare HTTP request returns usable HTML, because Wikipedia serves article content server-side. The trouble shows up at volume: fetch dozens or hundreds of articles in a tight loop from a single datacenter IP and you look nothing like a normal reader, which is exactly the traffic pattern rate limits and bot defenses are built to slow down. You can also hit transient blocks or partial responses that quietly break an unattended run.

Routing requests through the Crawling API smooths both out. It fetches each page behind a rotating pool of trusted IPs, so a larger crawl spreads across many addresses instead of hammering one, and returns finished HTML you can hand straight to a parser. For the bigger picture, see how to scrape websites without getting blocked.

Prerequisites

You need a few things in place before writing any code. None of them take long.

Basic Python. You should be comfortable writing and running a Python script and installing packages with pip. If you are new to the parsing side, the BeautifulSoup guide pairs well with this tutorial.

Python 3.8 or later. Confirm your version with python --version. If you do not have it, install it from python.org or through a distribution like Anaconda, and make sure Python is on your PATH.

A Crawlbase account and token. Sign up, open your dashboard, and copy your normal token from the account docs page. Crawlbase includes 1,000 free requests to start, which is plenty for working through this guide. Treat the token like a password: it authenticates your requests, so keep it out of version control.
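
One common way to keep the token out of version control is to read it from an environment variable instead of hard-coding it. This is a minimal sketch, not a Crawlbase requirement: the load_token helper and the CRAWLBASE_TOKEN variable name are assumptions chosen for this example.

```python
import os

def load_token(env_var="CRAWLBASE_TOKEN"):
    """Read the API token from the environment, failing loudly if it is unset."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return token

# Usage in the scripts later in this guide (assumed, not shown in the original):
#   api = CrawlingAPI({"token": load_token()})
```

Set the variable in your shell before running the script, and the token never needs to appear in a file you commit.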

Set up the project

Create a virtual environment so project dependencies stay isolated, then install the libraries the scraper needs.

bash
python --version

python -m venv wiki_env
source wiki_env/bin/activate

pip install crawlbase beautifulsoup4 pandas lxml

On Windows, activate the environment with wiki_env\Scripts\activate instead of the source line. Four dependencies do the work: crawlbase is the official client for the Crawling API, beautifulsoup4 isolates the exact table you want, pandas reads that table into a DataFrame and exports CSV, and lxml is the fast HTML parser pandas uses under the hood for read_html. The CSV export uses the DataFrame's own to_csv method, so there is nothing more to install for the export step.

Step 1: Fetch a rendered Wikipedia article

Start by getting the article HTML. Import the CrawlingAPI class, initialize it with your token, request the article URL, and check the response status before you parse. Checking the Crawlbase status_code first keeps failures loud instead of silent.

python
from crawlbase import CrawlingAPI

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

def crawl(page_url):
    response = api.get(page_url)
    if response["status_code"] == 200:
        return response["body"].decode("utf-8")
    print(f"Request failed: {response['status_code']}")
    return None

if __name__ == "__main__":
    article_url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area"
    html = crawl(article_url)
    print(html[:500] if html else "No HTML returned")

Save the script as wikipedia_scraper.py, run it with python wikipedia_scraper.py, and you should see real article markup in the first 500 characters, which confirms the fetch works before you write a selector. The crawl function returns decoded HTML on a 200 status and None otherwise, so the rest of the script can guard on a falsy return instead of crashing on a bad response.
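
Because crawl signals failure by returning None, a transient error can be retried without touching the fetch code. The wrapper below is a sketch of that idea with exponential backoff; crawl_with_retry and its parameters are hypothetical, and it accepts any callable that behaves like crawl.

```python
import time

def crawl_with_retry(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns HTML or the attempts run out.

    fetch is any callable that returns a string on success and None on
    failure, like the crawl function above.
    """
    for attempt in range(1, attempts + 1):
        html = fetch(url)
        if html:
            return html
        if attempt < attempts:
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    return None
```

With the crawl function defined earlier, the call would be crawl_with_retry(crawl, article_url). The sleep parameter exists only so the delay can be swapped out when testing.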

Crawlbase Crawling API

The crawl helper above keeps your code simple: one call returns finished HTML for any article. Behind that call the Crawling API rotates through trusted IPs and handles blocks for you, so when you scale from one article to hundreds of tables you skip building and babysitting a proxy pool yourself. Point it at a public article on the free tier first.

Step 2: Locate the table you want

An article usually holds several tables, so the next step is selecting the right one. Load the HTML into BeautifulSoup and use select_one with the wikitable class to grab the first data table. When a page has many wikitables, switch to select and index the one you want, or filter by the table's caption text.

python
from bs4 import BeautifulSoup

def find_table(html, index=0):
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.select("table.wikitable")
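    if not tables:
        print("No wikitable found on this page.")
        return None
    # Guard the index so a wrong guess fails with a message, not an IndexError.
    if not (-len(tables) <= index < len(tables)):
        print(f"Index {index} is out of range; found {len(tables)} wikitable(s).")
        return None
    print(f"Found {len(tables)} wikitable(s); using index {index}.")
    return tables[index]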

The find_table helper returns one rendered <table> element, or None with a clear message if the page has no wikitable. The index argument lets you target the second or third data table on busier articles without rewriting the selector. Printing the table count is a small touch that tells you, on the very first run, whether you are aiming at the table you expect.
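
Filtering by caption text, mentioned above as an alternative to indexing, could look like the following sketch. The find_table_by_caption name is hypothetical, and the match is a simple case-insensitive substring test, which is an assumption about what "filter by caption" should mean here.

```python
from bs4 import BeautifulSoup

def find_table_by_caption(html, text):
    """Return the first wikitable whose caption contains text, else None."""
    soup = BeautifulSoup(html, "html.parser")
    needle = text.casefold()
    for table in soup.select("table.wikitable"):
        caption = table.find("caption")
        # Join nested caption text with spaces so inline tags do not glue words.
        if caption and needle in caption.get_text(" ", strip=True).casefold():
            return table
    return None

if __name__ == "__main__":
    sample = (
        '<table class="wikitable"><caption>Medal table</caption>'
        "<tr><td>1</td></tr></table>"
        '<table class="wikitable sortable"><caption>Countries by area</caption>'
        "<tr><td>2</td></tr></table>"
    )
    match = find_table_by_caption(sample, "by area")
    print(match["class"] if match else "No match")
```

Captions tend to survive page edits better than table order does, so this variant can be more stable than a fixed index, though not every wikitable has a caption.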

Why isolate the table first

You could hand the whole page to pandas.read_html and get back a list of every table, but that mixes data tables with sidebars and navigation boxes and leaves you guessing at indexes. Selecting the one wikitable you want with BeautifulSoup first, then parsing only that element, gives you a predictable single DataFrame every time.

Step 3: Parse the table into a DataFrame

With the table element in hand, pandas does the heavy lifting. Pass the table's HTML to pandas.read_html, which returns a list of DataFrames; because you isolated a single table, you take the first element. This is the core of table scraping: one function turns rendered HTML rows and cells into a typed, columnar DataFrame.

python
import pandas as pd
from io import StringIO

def table_to_df(table):
    try:
        frames = pd.read_html(StringIO(str(table)))
    except ValueError:
        # read_html raises ValueError when it finds no parseable table.
        return None
    df = frames[0]
    print(f"Parsed table with {df.shape[0]} rows and {df.shape[1]} columns.")
    return df

Wrapping str(table) in StringIO matches how current pandas expects HTML to be passed to read_html: recent pandas releases deprecate passing a literal HTML string, so the wrapper avoids that warning. The printed shape is your sanity check: if the row and column counts match what you see in the article, the parse landed on the right table. If you would rather walk the rows yourself, you can loop over table.select("tr") with BeautifulSoup and read each th and td cell by hand, but read_html is faster and handles header detection for you.
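
For comparison, the manual row walk described above might be sketched like this. The table_to_rows name is hypothetical, and the sketch deliberately ignores rowspan and colspan, which read_html expands for you.

```python
from bs4 import BeautifulSoup

def table_to_rows(table):
    """Return the table as a list of rows, each a list of cell strings."""
    rows = []
    for tr in table.select("tr"):
        # Direct children only, so cells of nested tables are not mixed in.
        cells = tr.find_all(["th", "td"], recursive=False)
        if cells:
            rows.append([cell.get_text(" ", strip=True) for cell in cells])
    return rows

if __name__ == "__main__":
    html = (
        "<table><tr><th>Rank</th><th>Country</th></tr>"
        "<tr><td>1</td><td> Russia </td></tr></table>"
    )
    soup = BeautifulSoup(html, "html.parser")
    print(table_to_rows(soup.table))
```

The first row holds the header cells and the rest hold data, so you would still need to build column names and types yourself, which is the work read_html saves.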

Step 4: Clean the DataFrame

Raw Wikipedia tables come with quirks: multi-level column headers, footnote markers in cells, blank columns, and merged-cell artifacts. A short cleaning pass turns the parsed frame into something analysis-ready before you export it.

python
def clean_df(df):
    # Flatten multi-level headers into single strings.
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = [" ".join(str(p) for p in col if "Unnamed" not in str(p)).strip()
                      for col in df.columns]

    # Strip footnote markers like [1] and [a] from string cells.
    df = df.replace(r"\[.*?\]", "", regex=True)

    # Drop fully empty rows and columns.
    df = df.dropna(axis=0, how="all").dropna(axis=1, how="all")

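    # Trim surrounding whitespace on string cells; leave NaN and numbers as they are
    # (astype(str) would turn missing values into the literal text "nan").
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].map(lambda v: v.strip() if isinstance(v, str) else v)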

    return df.reset_index(drop=True)

Each step targets a real-world mess. Flattening a MultiIndex turns nested header rows into single readable column names and drops the Unnamed placeholders pandas inserts for blank header cells. The regex replace removes the bracketed footnote markers Wikipedia sprinkles through cells, such as [1] or [a], which otherwise pollute numeric columns. Dropping all-empty rows and columns clears merged-cell artifacts, and the final strip trims stray whitespace. Adapt the steps to the table in front of you; not every table needs every pass.
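
Once footnote markers are gone, a column that should be numeric may still hold text such as "17,098,246". A possible follow-up pass, sketched here with a hypothetical to_number helper, strips thousands separators and converts with pandas.to_numeric. It assumes commas are only ever thousands separators in the column.

```python
import pandas as pd

def to_number(series):
    """Convert strings like '17,098,246' to numbers; unparseable cells become NaN."""
    cleaned = series.astype(str).str.replace(",", "", regex=False).str.strip()
    return pd.to_numeric(cleaned, errors="coerce")

if __name__ == "__main__":
    areas = pd.Series(["17,098,246", " 9,984,670 ", "n/a"])
    print(to_number(areas))
```

Apply it column by column, for example df["Total area"] = to_number(df["Total area"]), only to columns you expect to be numeric, since errors="coerce" silently replaces anything else with NaN.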

Step 5: Assemble the full script

Now wire the pieces into one runnable script: fetch the article, isolate the table, parse it, clean it, and export to CSV.

python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup
from crawlbase import CrawlingAPI

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

def crawl(page_url):
    response = api.get(page_url)
    if response["status_code"] == 200:
        return response["body"].decode("utf-8")
    print(f"Request failed: {response['status_code']}")
    return None

def find_table(html, index=0):
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.select("table.wikitable")
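    # Return None for an empty page or an out-of-range index.
    if not tables or not (-len(tables) <= index < len(tables)):
        return None
    return tables[index]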

def table_to_df(table):
    try:
        frames = pd.read_html(StringIO(str(table)))
    except ValueError:  # raised when no table can be parsed
        return None
    return frames[0] if frames else None

def clean_df(df):
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = [" ".join(str(p) for p in col if "Unnamed" not in str(p)).strip()
                      for col in df.columns]
    df = df.replace(r"\[.*?\]", "", regex=True)
    df = df.dropna(axis=0, how="all").dropna(axis=1, how="all")
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].map(lambda v: v.strip() if isinstance(v, str) else v)
    return df.reset_index(drop=True)

def main():
    article_url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area"
    html = crawl(article_url)
    if not html:
        return

    table = find_table(html, index=0)
    if table is None:
        print("No table found.")
        return

    df = table_to_df(table)
    if df is None:
        print("Table could not be parsed.")
        return

    df = clean_df(df)
    df.to_csv("wikipedia_table.csv", index=False)
    print(f"Saved {df.shape[0]} rows to wikipedia_table.csv")
    print(df.head())

if __name__ == "__main__":
    main()

The script fetches the article, isolates the first wikitable, parses it into a DataFrame, runs the cleaning pass, and writes wikipedia_table.csv with a header row and no index column. Printing df.head() at the end gives you an instant preview in the terminal. To target a different article or a different table on the same page, change article_url and the index argument to find_table.
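
If you switch articles often, editing the source each time gets tedious. One option is to take the URL, table index, and output path from the command line. The sketch below uses the standard argparse module; the flag names --url, --index, and --out are assumptions for this example, not part of the original script.

```python
import argparse

DEFAULT_URL = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area"

def parse_args(argv=None):
    """Parse command-line options; argv=None means read sys.argv."""
    parser = argparse.ArgumentParser(description="Export one Wikipedia table to CSV.")
    parser.add_argument("--url", default=DEFAULT_URL, help="Public article URL")
    parser.add_argument("--index", type=int, default=0, help="Zero-based wikitable index")
    parser.add_argument("--out", default="wikipedia_table.csv", help="Output CSV path")
    return parser.parse_args(argv)

if __name__ == "__main__":
    # Demo with explicit arguments; call parse_args() to read the real command line.
    args = parse_args(["--index", "1"])
    print(args.url, args.index, args.out)
```

Inside main you would then pass args.url to crawl, args.index to find_table, and args.out to to_csv.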

What the output looks like

Run the full script with python wikipedia_scraper.py and you get a clean CSV, one row per table row, ready for pandas, a database, or a spreadsheet.

csv
Rank,Country / dependency,Total area (km2),Total area (mi2),Land,Water,Notes
,World,510072000,196940000,148940000,361132000,
1,Russia,17098246,6601665,16376870,721391,
2,Antarctica,14200000,5500000,14200000,0,
3,Canada,9984670,3855100,9093507,891163,
4,China,9596961,3705407,9326410,270550,
5,United States,9525067,3677649,9147593,377424,

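The columns line up with the source table, the footnote markers are gone, and the figures are ready to load into pandas for filtering, sorting, or joining against other datasets. Treat the sample as illustrative: exact headers and cell formats follow the article's current markup, so some columns may need an extra conversion pass before they behave as numbers. Swap in any article with a wikitable and the same five functions apply unchanged.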
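
As a quick check that the export is usable, you can load it back and filter it. The sketch below reads a few rows copied from the sample output instead of the real file, so it runs on its own; the largest helper is a hypothetical name.

```python
from io import StringIO

import pandas as pd

# A few rows copied from the sample output above, standing in for the CSV file.
SAMPLE_CSV = """Rank,Country / dependency,Total area (km2)
1,Russia,17098246
2,Antarctica,14200000
3,Canada,9984670
4,China,9596961
"""

def largest(df, column, minimum):
    """Rows where column is at least minimum, largest first."""
    return df[df[column] >= minimum].sort_values(column, ascending=False)

if __name__ == "__main__":
    df = pd.read_csv(StringIO(SAMPLE_CSV))
    print(largest(df, "Total area (km2)", 10_000_000))
```

Against the real export you would replace StringIO(SAMPLE_CSV) with "wikipedia_table.csv" and use whatever column name your table produced.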

Scaling to many articles

Pulling one table is the building block. To collect the same table across many articles, wrap the five functions in a loop over a list of URLs, pace it with a short sleep, and concatenate the cleaned frames. A few habits keep a longer run healthy.

  • Pace your requests. Add a short sleep between requests and keep volume reasonable, especially against a community-run resource like Wikipedia.
  • Lean on rotation. Routing each fetch through the Crawling API spreads requests across many IPs, so a larger crawl does not concentrate on one address.
  • Prefer the sanctioned bulk paths. For large volumes, the Wikipedia API and the official database dumps are the intended route and load the live site far less than scraping, as covered in the next section.
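
The loop described above could be sketched as follows. The scrape_many name and its parameters are hypothetical; scrape_one stands for any callable that chains crawl, find_table, table_to_df, and clean_df for a single URL and returns a DataFrame or None.

```python
import time

import pandas as pd

def scrape_many(urls, scrape_one, delay=1.0, sleep=time.sleep):
    """Run scrape_one(url) for each URL, pausing between requests.

    scrape_one is any callable that returns a DataFrame or None.
    """
    frames = []
    for position, url in enumerate(urls):
        if position:
            sleep(delay)  # pause before every request after the first
        df = scrape_one(url)
        if df is None or df.empty:
            print(f"Skipped {url}")
            continue
        # Record where each row came from, which also helps with attribution.
        frames.append(df.assign(source_url=url))
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

Concatenating works cleanly only when the tables share column names, so this pattern suits articles that follow the same template. A failed article is skipped instead of stopping the whole run.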

For larger crawls, the async Crawler queues requests and delivers results to a webhook, which suits running many articles without holding connections open. To push the cleaned tables into a warehouse, the approach in web scraping to SQL carries over from the DataFrames you build here.

Is it legal to scrape Wikipedia tables?

Scraping public Wikipedia tables for research or analysis is generally acceptable, but the important part is the license, not just the access. Wikipedia's text content is published under the Creative Commons Attribution-ShareAlike license (CC-BY-SA): you are free to reuse and adapt it provided you attribute the source and share any derivative work under the same license. If you publish or redistribute a table you pulled, credit Wikipedia and link back to the article you took it from, and keep the same-license condition in mind for anything you build on top of it. Scraping does not remove these terms; it only moves the data.
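
To make attribution routine, you could generate a credit line alongside each export. This is only a sketch: attribution_line is a hypothetical helper, and its wording is an assumption, not an official attribution format, so check the license terms for what your use requires.

```python
def attribution_line(title, url, license_name="CC BY-SA"):
    """Build a one-line credit for a table taken from a Wikipedia article."""
    return f'Source: "{title}", Wikipedia, {url} (licensed under {license_name}).'

if __name__ == "__main__":
    print(attribution_line(
        "List of countries and dependencies by area",
        "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area",
    ))
```

Storing the line in a text file next to the CSV, or in the dataset's README, keeps the credit attached to the data wherever it travels.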

Stay inside a few sensible lines. Respect Wikipedia's Terms of Use and its robots.txt, keep your request rate low so you are not straining a nonprofit's servers, and limit yourself to tables and other public content rather than republishing whole articles. This guide is scoped to tabular data for exactly that reason: it is the slice most useful to reuse and the easiest to attribute cleanly.

When you need data at volume, scraping is usually the wrong tool. The Wikimedia Foundation provides sanctioned paths built for this: the official MediaWiki API for structured queries, and the periodic database dumps for bulk download of entire wikis. These load the live site far less, give you cleaner data, and are the route the project explicitly invites for heavy use. Reach for those first, and treat live scraping as the option for one-off or small jobs where a dump would be overkill.
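
As a starting point for the API route, the sketch below builds a MediaWiki action=parse request URL, which returns a page's rendered HTML wrapped in JSON. The build_parse_url helper is a hypothetical name, and the sketch only constructs the URL; it makes no request.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def build_parse_url(page_title):
    """Build an action=parse request URL that returns the page HTML as JSON."""
    params = {
        "action": "parse",
        "page": page_title,
        "prop": "text",
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"

if __name__ == "__main__":
    print(build_parse_url("List of countries and dependencies by area"))
```

The HTML in the response can go through the same find_table and table_to_df steps as a scraped page. Check the current API documentation for response structure and usage etiquette before relying on it.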

Recap

Key takeaways

  • The wikitable class is your anchor. Most Wikipedia data tables carry the wikitable CSS class, so you select the right grid by class and skip layout and navigation tables on the same page.
  • pandas read_html does the parse. Isolate one table with BeautifulSoup, then pass its HTML to read_html to get a typed DataFrame in a single call, rather than walking rows and cells by hand.
  • Clean before you export. Flatten multi-level headers, strip bracketed footnote markers, and drop empty rows and columns so the CSV is analysis-ready.
  • Route fetches through the Crawling API to scale. One crawl call returns finished HTML and rotates IPs for you, so collecting tables across many articles does not hammer one address.
  • Attribute and use the sanctioned paths. Wikipedia content is CC-BY-SA, so credit the source on reuse, respect the Terms of Use and robots.txt, and prefer the Wikipedia API and database dumps for bulk work.

Frequently Asked Questions (FAQs)

How do I extract a specific table from a Wikipedia page?

Select it by CSS class with BeautifulSoup. Most data tables carry the wikitable class, so soup.select("table.wikitable") returns every data table on the page; index the one you want and pass its HTML to pandas.read_html. When several tables share the class, narrow down by index or by the table's caption text, as the find_table helper in this guide shows.

Should I use pandas read_html or BeautifulSoup to parse the table?

Both, in sequence. Use BeautifulSoup to isolate the exact <table> you want by class, then hand that single element to pandas.read_html to turn it into a DataFrame. Passing the whole page to read_html works too, but it returns a list with every table on the page, including sidebars, which leaves you guessing at indexes.

Why route Wikipedia requests through the Crawling API at all?

For one article you may not need to, since Wikipedia serves content server-side. The value shows up at volume: fetching many articles from a single datacenter IP looks like a bot and invites rate limiting and transient blocks. The Crawling API rotates trusted IPs and returns finished HTML, so a larger crawl spreads across many addresses.

Is it legal to scrape Wikipedia tables?

Scraping public tables for research is generally fine, but Wikipedia content is licensed under CC-BY-SA, so if you republish a table you must attribute the source and share derivative work under the same license. Respect the Terms of Use and robots.txt, keep your rate low, and stick to tables rather than redistributing whole articles. For bulk needs, the Wikipedia API and database dumps are the sanctioned path.

How do I save the scraped table to CSV?

Once the table is a DataFrame, call df.to_csv("wikipedia_table.csv", index=False). The index=False argument keeps pandas from writing its row index as an extra column, so you get a clean header row and one row per table row. From there the CSV loads straight back into pandas or any spreadsheet.

How is this different from the general scrape Wikipedia guide?

This tutorial is scoped to tables: selecting a wikitable, parsing it with pandas, cleaning the columns, and exporting CSV. The companion guide on how to scrape Wikipedia covers the broader page, such as the title, images, and infobox fields. Use this one when your target is columnar data and that one when you need other parts of an article.
