Wikipedia is the largest reference work ever assembled, and almost all of it is structured the same way: a title, a lead summary, a stack of section headings, an infobox of key facts, and any number of data tables. That regularity makes it one of the most useful public sources on the web for research, content enrichment, knowledge graphs, and training datasets. This guide shows you how to scrape Wikipedia with Python in a clean, repeatable way.

Everything here stays scoped to public encyclopedia content: article titles, lead summaries, section headings, infobox fields, and tables that anyone can read without logging in. We will build a small scraper that fetches a rendered article page through the Crawling API, parses those fields with BeautifulSoup, and exports them to JSON and CSV. Wikipedia text is freely licensed under CC-BY-SA, which comes with attribution and share-alike conditions, so before you point this at anything real, read the attribution and legality section near the end. Also consider the official MediaWiki API, which is the sanctioned path for most projects.

What you will build

A Python script that takes a Wikipedia article URL, fetches the page through the Crawling API, and parses a handful of structured fields:

  • Article title: the page heading from the firstHeading element.
  • Summary: the lead intro paragraphs that open the article.
  • Section headings: the h2 and h3 headings that outline the article.
  • Infobox fields: the label-value pairs from the side infobox (born, citizenship, occupation, and so on).
  • Tables: the wikitable data tables, parsed into rows you can write to CSV.

Each field maps to a stable Wikipedia CSS class or element id, so the scraper works across most article pages, not just one.

Why scrape through the Crawling API

A Wikipedia article page is mostly server-rendered HTML, so a plain request will often return usable markup. The friction shows up at scale. Wikimedia asks automated clients to identify themselves, enforces rate limits strictly, and will throttle or block traffic that hammers its servers from a single address. If you run a tight loop from one datacenter IP, you will hit those limits quickly, and a blocked IP stops the whole job.

Routing requests through the Crawling API solves the operational side. You send it the article URL, it fetches the page behind a rotating pool of trusted IPs, handles retries, and hands back finished HTML you can parse. That keeps any single address from tripping a rate limit, and it means you do not maintain a proxy pool yourself. The parsing approach below is the same one you would use against any HTML source, so if you are new to it, our primer on how to use BeautifulSoup in Python covers the extraction side in depth.

Prefer the official API where you can

Wikipedia runs the MediaWiki API and publishes full database dumps. For large, structured pulls those are the preferred path, and we recommend them in the legality section. This tutorial covers HTML scraping because it is the most direct way to read exactly what a reader sees on a single page, and the techniques transfer to any site.
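
For comparison, here is a minimal sketch of what the API route can look like. It only builds a request URL and reads a decoded JSON response, so it runs offline. The helper names are ours, the parameter set assumes the TextExtracts options (prop=extracts, exintro, explaintext) that Wikipedia's Action API exposes, and the sample response is hand-written to mirror the usual query/pages shape rather than captured from a live call.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def build_extract_url(title):
    """Build an Action API URL asking for the plain-text lead of one article."""
    params = {
        "action": "query",
        "prop": "extracts",   # assumes the TextExtracts module is available
        "exintro": 1,         # lead section only
        "explaintext": 1,     # plain text instead of HTML
        "redirects": 1,       # follow redirects to the canonical title
        "format": "json",
        "titles": title,
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"

def extract_from_response(payload):
    """Return (title, extract) for the first page in a decoded JSON response."""
    pages = payload.get("query", {}).get("pages", {})
    for page in pages.values():
        return page.get("title"), page.get("extract", "")
    return None, ""

# Hand-written sample in the expected shape; the page id and text are illustrative.
sample = {
    "query": {
        "pages": {
            "1": {"pageid": 1, "title": "Ada Lovelace", "extract": "Augusta Ada King..."}
        }
    }
}

print(build_extract_url("Ada Lovelace"))
print(extract_from_response(sample))
```

If you fetch that URL yourself, send a descriptive User-Agent so Wikimedia can identify your client.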

Prerequisites

A few things to have in place first. None take long.

Basic Python. You should be comfortable running a script and installing packages with pip. Familiarity with HTML structure helps, since the whole job is targeting elements by class and id.

Python 3.8 or later. Confirm with python --version. If you do not have it, install it from python.org.

A Crawlbase account and token. Sign up, open your dashboard, and copy your token from the account docs page. Wikipedia is static HTML, so the normal request token is the right one here; you do not need JavaScript rendering. Keep the token out of version control.

Set up the project

Create an isolated virtual environment, then install the libraries the scraper needs.

bash
python --version

python -m venv wikipedia_env
source wikipedia_env/bin/activate

pip install crawlbase beautifulsoup4 pandas lxml

On Windows, activate with wikipedia_env\Scripts\activate instead of the source line. Four packages do the work: crawlbase is the official client for the Crawling API, beautifulsoup4 parses the returned HTML, pandas turns the tables into frames you can write to CSV, and lxml is the HTML parser that pandas.read_html relies on by default.

Step 1: Fetch the article HTML

Start by pulling the raw page. Import CrawlingAPI, initialize it with your token, and request an article URL. Check the status code before parsing so failures stay loud instead of silent. We will use the Wikipedia article on Ada Lovelace as the running example, since it has a rich infobox and several tables.

python
from crawlbase import CrawlingAPI

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

def crawl(page_url):
    response = api.get(page_url)
    if response["status_code"] == 200:
        return response["body"].decode("utf-8")
    print(f"Request failed: {response['status_code']}")
    return None

if __name__ == "__main__":
    page_url = "https://en.wikipedia.org/wiki/Ada_Lovelace"
    html = crawl(page_url)
    print(html[:500] if html else "No HTML returned")

Run this and you should see the opening of the article's HTML printed to your terminal. That confirms the request worked and gives you finished markup to parse. The CrawlingAPI object is created once and reused, which is the pattern you want when you start fetching many pages in a loop later.


Step 2: Parse the title and summary

With HTML in hand, load it into BeautifulSoup. Wikipedia gives the page heading the id firstHeading, and the lead summary is the run of paragraphs inside the main content before the first section heading. We grab the title from that id, then collect the opening paragraphs as the summary.

python
from bs4 import BeautifulSoup

def parse_title_and_summary(html):
    soup = BeautifulSoup(html, "html.parser")

    title = soup.find("h1", id="firstHeading").get_text(strip=True)

    content = soup.find("div", class_="mw-parser-output")
    summary = []
    for p in content.find_all("p", recursive=False):
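        # get_text(strip=True) would glue words together around links, so keep
        # the original spacing and trim only the ends of the paragraph.
        text = p.get_text().strip()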
        if text:
            summary.append(text)
        if len(summary) >= 3:
            break

    return title, " ".join(summary)

The find('h1', id='firstHeading') call targets the page title exactly, the same selector you would find by inspecting the page in your browser's dev tools. For the summary, we walk the direct child paragraphs of mw-parser-output, the wrapper around an article's body, and take the first few non-empty ones. Limiting to three keeps the lead intro without pulling the entire article into one string.
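
Lead paragraphs are full of inline links and footnote markers, which affects how the text comes out. The sketch below is one way to clean a paragraph, not the only one: it assumes citation markers are sup elements with the class reference, removes them, and collapses whitespace so words on either side of a link stay separated. The helper name paragraph_text and the sample HTML are ours.

```python
from bs4 import BeautifulSoup

def paragraph_text(p):
    """Return a paragraph's text without citation markers or stray whitespace."""
    # Assumption: footnote markers like [1] are <sup class="reference"> elements.
    # decompose() removes them from the parsed tree, so call this on a soup you
    # do not need to keep intact.
    for sup in p.find_all("sup", class_="reference"):
        sup.decompose()
    # Splitting and re-joining collapses newlines and repeated spaces.
    return " ".join(p.get_text().split())

sample_html = """
<div class="mw-parser-output">
  <p class="mw-empty-elt"></p>
  <p><b>Ada Lovelace</b> was an <a href="/wiki/English">English</a>
  <a href="/wiki/Mathematician">mathematician</a>.<sup class="reference">[1]</sup></p>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
content = soup.find("div", class_="mw-parser-output")
lead = [paragraph_text(p) for p in content.find_all("p", recursive=False)]
lead = [text for text in lead if text]  # drop empty placeholder paragraphs
print(lead)
```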

Step 3: Collect the section headings

Wikipedia outlines every article with a consistent heading structure. In the rendered page the section title text sits inside elements with the class mw-headline, nested under h2 and h3 tags. Collecting those gives you the table of contents in document order.

python
def parse_sections(soup):
    sections = []
    for tag in soup.find_all(["h2", "h3"]):
        headline = tag.find("span", class_="mw-headline")
        if headline:
            sections.append({
                "level": tag.name,
                "heading": headline.get_text(strip=True),
            })
    return sections

We only record headings that contain an mw-headline span, which filters out unrelated h2 and h3 tags in the page chrome and keeps the real article sections. Recording the level lets you rebuild the hierarchy later: h2 for top-level sections, h3 for subsections.
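
Wikipedia's heading markup has changed over time. As far as we know, newer page output wraps each heading in a div with the class mw-heading and puts the title text directly in the h2 or h3, with no mw-headline span. If parse_sections comes back empty on a page that clearly has sections, a tolerant variant like the sketch below may help. It is an assumption-laden fallback rather than part of the original scraper: parse_sections_compat is our name, and the sample HTML is hand-written to show both shapes.

```python
from bs4 import BeautifulSoup

def parse_sections_compat(soup):
    """Collect h2/h3 section headings from either heading markup."""
    sections = []
    for tag in soup.find_all(["h2", "h3"]):
        headline = tag.find("span", class_="mw-headline")
        if headline is not None:
            # Older shape: <h2><span class="mw-headline">Title</span>...</h2>
            text = headline.get_text().strip()
        elif tag.find_parent("div", class_="mw-heading") is not None:
            # Newer shape (assumed): <div class="mw-heading"><h2>Title</h2>...</div>
            text = tag.get_text().strip()
        else:
            # Anything else is page chrome, such as a "Contents" heading.
            continue
        sections.append({"level": tag.name, "heading": text})
    return sections

sample_html = """
<h2>Contents</h2>
<h2><span class="mw-headline" id="Biography">Biography</span><span class="mw-editsection">[edit]</span></h2>
<div class="mw-heading mw-heading3"><h3 id="Childhood">Childhood</h3><span class="mw-editsection">[edit]</span></div>
"""

sections = parse_sections_compat(BeautifulSoup(sample_html, "html.parser"))
print(sections)
```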

Step 4: Extract the infobox fields

The infobox is the boxed summary of key facts on the right side of many articles. Wikipedia marks it with the infobox class, and each fact is a table row pairing a label cell (infobox-label) with a data cell (infobox-data). Walking those rows turns the infobox into a clean dictionary of fields.

python
def parse_infobox(soup):
    infobox = soup.find("table", class_="infobox")
    if not infobox:
        return {}

    fields = {}
    for row in infobox.find_all("tr"):
        label = row.find("th", class_="infobox-label")
        data = row.find("td", class_="infobox-data")
        if label and data:
            key = label.get_text(strip=True)
            value = data.get_text(" ", strip=True)
            fields[key] = value

    image = infobox.select_one(".infobox-image img")
    if image and image.has_attr("src"):
        fields["image"] = "https:" + image["src"]

    return fields

Each infobox-label and infobox-data pair becomes one key-value entry, so a person's infobox yields fields like Born, Died, Resting place, Citizenship, and Occupation without you naming them in advance. The get_text(" ", strip=True) call joins multi-line values with a space, which keeps lists like multiple nationalities readable on one line. We also pull the lead image from .infobox-image img and prepend https: because Wikipedia serves image URLs protocol-relative.
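
Prepending https: works when the src really is protocol-relative. If you would rather not rely on that, the standard library's urljoin resolves protocol-relative, root-relative, and already-absolute values against the page URL. This is a small sketch of that alternative; absolute_url is a hypothetical helper name and the image paths are made up for illustration.

```python
from urllib.parse import urljoin

PAGE_URL = "https://en.wikipedia.org/wiki/Ada_Lovelace"

def absolute_url(src, page_url=PAGE_URL):
    """Resolve an image src against the article URL it was found on."""
    return urljoin(page_url, src)

# Illustrative paths only; they do not point at real files.
print(absolute_url("//upload.wikimedia.org/example/portrait.jpg"))
print(absolute_url("/static/images/example.png"))
print(absolute_url("https://upload.wikimedia.org/example/portrait.jpg"))
```

The protocol-relative value picks up the scheme of the page, the root-relative one picks up the scheme and host, and the absolute one passes through unchanged.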

Step 5: Parse the data tables

Article tables use the wikitable class. Pandas reads HTML tables directly, so the cleanest approach is to hand each table's HTML to pandas.read_html and let it return a DataFrame. That gives you rows and columns ready for CSV with almost no manual cell walking. If you want the general technique beyond Wikipedia, see our guide on how to scrape tables from a website.

python
import pandas as pd
from io import StringIO

def parse_tables(soup):
    tables = []
    for node in soup.find_all("table", class_="wikitable"):
        try:
            df = pd.read_html(StringIO(str(node)))[0]
            tables.append(df)
        except ValueError:
            continue
    return tables

We loop over every wikitable, pass its HTML to read_html, and keep the first DataFrame it returns for each. Wrapping the call in a try means a malformed table is skipped rather than crashing the run. The result is a list of DataFrames, one per table, which you can write to separate CSV files or concatenate as needed. For cleaning those frames before downstream use, our guide on how to structure and clean web-scraped data for AI and ML covers the common steps.
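
Tables with two header rows come back from read_html with MultiIndex columns, which write to CSV as awkward multi-line headers. One possible cleanup, sketched here with a hand-built DataFrame so it runs without any HTML, is to join the header levels into a single label. flatten_columns is our own helper, and the " / " separator is an arbitrary choice.

```python
import pandas as pd

def flatten_columns(df):
    """Return a copy of df with MultiIndex column labels joined into strings."""
    if not isinstance(df.columns, pd.MultiIndex):
        return df
    flat = []
    for levels in df.columns:
        parts = []
        for part in levels:
            part = str(part).strip()
            # Skip pandas placeholder labels and repeated levels such as
            # ("Year", "Year"), which appear when a header cell spans rows.
            if part and not part.startswith("Unnamed:") and part not in parts:
                parts.append(part)
        flat.append(" / ".join(parts))
    out = df.copy()
    out.columns = flat
    return out

# Hand-built stand-in for a table with a two-row header; values are made up.
columns = pd.MultiIndex.from_tuples(
    [("Year", "Year"), ("Population", "Total"), ("Population", "Urban")]
)
table = pd.DataFrame([[1900, 10, 4], [1950, 20, 9]], columns=columns)

flat = flatten_columns(table)
print(flat.columns.tolist())
```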

Step 6: Put it together and export

Now wire the pieces into one runnable script. It fetches the article, parses every field, writes the structured data to JSON, and writes each table to its own CSV file.

python
import json
from io import StringIO
import pandas as pd
from bs4 import BeautifulSoup
from crawlbase import CrawlingAPI

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

def crawl(page_url):
    response = api.get(page_url)
    if response["status_code"] == 200:
        return response["body"].decode("utf-8")
    print(f"Request failed: {response['status_code']}")
    return None

def scrape_article(html):
    soup = BeautifulSoup(html, "html.parser")

    title = soup.find("h1", id="firstHeading").get_text(strip=True)

    content = soup.find("div", class_="mw-parser-output")
    summary = []
    for p in content.find_all("p", recursive=False):
        # Keep the original spacing between inline elements; trim only the ends.
        text = p.get_text().strip()
        if text:
            summary.append(text)
        if len(summary) >= 3:
            break

    sections = []
    for tag in soup.find_all(["h2", "h3"]):
        headline = tag.find("span", class_="mw-headline")
        if headline:
            sections.append(headline.get_text(strip=True))

    infobox = {}
    box = soup.find("table", class_="infobox")
    if box:
        for row in box.find_all("tr"):
            label = row.find("th", class_="infobox-label")
            data = row.find("td", class_="infobox-data")
            if label and data:
                infobox[label.get_text(strip=True)] = data.get_text(" ", strip=True)

    tables = []
    for node in soup.find_all("table", class_="wikitable"):
        try:
            tables.append(pd.read_html(StringIO(str(node)))[0])
        except ValueError:
            continue

    return {
        "title": title,
        "summary": " ".join(summary),
        "sections": sections,
        "infobox": infobox,
    }, tables

def main():
    page_url = "https://en.wikipedia.org/wiki/Ada_Lovelace"
    html = crawl(page_url)
    if not html:
        return

    article, tables = scrape_article(html)
    article["source_url"] = page_url
    article["license"] = "CC BY-SA 4.0"

    with open("wikipedia.json", "w", encoding="utf-8") as f:
        json.dump(article, f, indent=2, ensure_ascii=False)

    for i, df in enumerate(tables):
        df.to_csv(f"wikipedia_table_{i}.csv", index=False)

    print(json.dumps(article, indent=2, ensure_ascii=False))
    print(f"Saved {len(tables)} table(s) to CSV")

if __name__ == "__main__":
    main()

The script writes the title, summary, sections, and infobox to wikipedia.json, and each wikitable to its own numbered CSV. Notice the two extra fields we attach before saving: source_url and license. Recording where the data came from and that it is CC-BY-SA licensed is not optional bookkeeping; it is how you stay compliant with the attribution requirement, which the legality section covers below.

What the output looks like

Run the full script and the JSON file holds a clean, structured record of the article.

json
{
  "title": "Ada Lovelace",
  "summary": "Augusta Ada King, Countess of Lovelace, was an English mathematician and writer...",
  "sections": [
    "Biography",
    "Work",
    "Commemoration",
    "In popular culture"
  ],
  "infobox": {
    "Born": "Augusta Ada Byron 10 December 1815 London, England",
    "Died": "27 November 1852 (aged 36) Marylebone, London, England",
    "Occupation": "Mathematician"
  },
  "source_url": "https://en.wikipedia.org/wiki/Ada_Lovelace",
  "license": "CC BY-SA 4.0"
}

Alongside it you get one CSV per table, named wikipedia_table_0.csv, wikipedia_table_1.csv, and so on, each with the header row and data rows that pandas parsed. From here the JSON feeds a knowledge base or content pipeline, and the CSVs drop straight into a spreadsheet or a database.
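
If you want the article records themselves in tabular form, pandas.json_normalize can flatten the nested infobox dictionary into dotted column names. The sketch below shows the idea on two hand-written records in the same shape as the JSON above; articles_to_frame is a hypothetical helper, the second record is invented, and joining sections with "; " is just one way to fit a list into a CSV cell.

```python
import pandas as pd

def articles_to_frame(records):
    """Flatten article records into one row per article."""
    frame = pd.json_normalize(records)  # infobox keys become "infobox.<label>"
    if "sections" in frame.columns:
        frame["sections"] = frame["sections"].apply(
            lambda items: "; ".join(items) if isinstance(items, list) else ""
        )
    return frame

# Hand-written records; the second one is made up for illustration.
records = [
    {
        "title": "Ada Lovelace",
        "sections": ["Biography", "Work"],
        "infobox": {"Born": "10 December 1815", "Occupation": "Mathematician"},
    },
    {
        "title": "Sample Person",
        "sections": [],
        "infobox": {"Born": "1 January 1900"},
    },
]

frame = articles_to_frame(records)
print(frame.columns.tolist())
```

Fields missing from an article's infobox show up as empty cells, so articles with different infoboxes can share one file.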

Scaling to many articles

To build a dataset, run the fetch-and-parse steps in a loop over a list of article URLs and pace the requests. Wikipedia is sensitive to bursty traffic, so a small delay between pages keeps you a polite client.

python
import time

urls = [
    "https://en.wikipedia.org/wiki/Ada_Lovelace",
    "https://en.wikipedia.org/wiki/Alan_Turing",
    "https://en.wikipedia.org/wiki/Grace_Hopper",
]

dataset = []
for url in urls:
    html = crawl(url)
    if html:
        article, _ = scrape_article(html)
        article["source_url"] = url
        dataset.append(article)
    time.sleep(2)

print(f"Collected {len(dataset)} articles")

The time.sleep(2) between requests is the single most important habit when collecting from Wikipedia at any volume. Pace your run, keep batches modest, and stop when you have the sample you need. For large structured pulls, though, the MediaWiki API and the database dumps are genuinely faster and lighter on Wikimedia's servers, and we recommend them next.
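
In a longer run, one bad page should not end the whole job. The sketch below is one way to structure the loop so that failures are recorded and the run continues. It takes the fetch and parse steps as arguments, which is our design choice so the loop can be tried offline; collect and the stand-in functions are hypothetical, and the parse callable here is assumed to return a single dictionary rather than the (article, tables) pair used earlier.

```python
import time

def collect(urls, fetch, parse, delay=2.0, sleep=time.sleep):
    """Fetch and parse each URL in turn, pausing between requests."""
    dataset, failures = [], []
    for index, url in enumerate(urls):
        if index:
            sleep(delay)  # pause before every request except the first
        try:
            html = fetch(url)
            if not html:
                failures.append((url, "empty response"))
                continue
            record = parse(html)
            record["source_url"] = url
            dataset.append(record)
        except Exception as exc:  # keep going, but remember what failed
            failures.append((url, repr(exc)))
    return dataset, failures

# Offline stand-ins so the loop can be exercised without any network calls.
def fake_fetch(url):
    return None if url == "b" else f"<html>{url}</html>"

def fake_parse(html):
    if html == "<html>c</html>":
        raise ValueError("bad markup")
    return {"title": html}

pauses = []
dataset, failures = collect(["a", "b", "c"], fake_fetch, fake_parse, sleep=pauses.append)
print(dataset)
print(failures)
```

In real use you would pass the crawl function and a small wrapper around scrape_article, and leave sleep at its default.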

Attribution and legality

Wikipedia is one of the more permissive large sites to work with, because its text content is freely licensed under Creative Commons Attribution-ShareAlike (CC-BY-SA). That means you are allowed to reuse and even redistribute article text, including in commercial projects, on two conditions: you attribute the source (credit Wikipedia and the contributors, typically with a link back to the article), and you share any derivative work under the same license. Some embedded media, such as certain images, carries its own separate license, so check the file page before reusing an image. The license and source_url fields the script records exist precisely so you can meet the attribution requirement downstream.
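
As a small illustration of putting those recorded fields to use, the sketch below turns a saved record into a credit line. The wording is our own suggestion, not an official or legally reviewed template, and attribution_line is a hypothetical helper; adapt the text to whatever your project's attribution policy requires.

```python
def attribution_line(record, site="Wikipedia"):
    """Build a human-readable credit line from a scraped article record."""
    title = record.get("title", "Untitled")
    url = record.get("source_url", "")
    license_name = record.get("license", "CC BY-SA 4.0")
    return (
        f'"{title}" from {site} ({url}), '
        f"by {site} contributors, licensed under {license_name}."
    )

record = {
    "title": "Ada Lovelace",
    "source_url": "https://en.wikipedia.org/wiki/Ada_Lovelace",
    "license": "CC BY-SA 4.0",
}

print(attribution_line(record))
```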

Permissive licensing is not a license to ignore the servers. Wikimedia's terms ask automated clients to respect rate limits, identify themselves, and avoid degrading the service for human readers. Hammering the site from a tight loop is both impolite and likely to get your IP blocked. Keep volume modest, pace your requests as shown above, and read the site's robots.txt. This tutorial stays on public encyclopedia content only; it does not touch user account pages, edit histories tied to individuals, or anything behind authentication.

For anything beyond a handful of pages, the sanctioned path is the official tooling. The MediaWiki API returns article content in structured form, such as the section list, parsed HTML, and raw wikitext, without you scraping rendered pages at all, and Wikimedia also publishes complete database dumps of the entire encyclopedia for bulk use. Both are faster, more stable, and far lighter on Wikimedia's infrastructure than HTML scraping at scale. Reach for HTML scraping when you need exactly what a reader sees on a single page; reach for the API or the dumps when you need volume.

Recap

Key takeaways

  • Wikipedia is highly structured. Title (firstHeading), lead summary, mw-headline sections, the infobox, and wikitable tables map to stable selectors that work across most articles.
  • Route through the Crawling API to stay unblocked. Rotating trusted IPs and built-in retries keep any single address from tripping Wikimedia's rate limits, with no proxy pool to maintain.
  • Let pandas do the tables. Passing each wikitable to pandas.read_html turns it into a DataFrame that exports straight to CSV.
  • Attribute the source. Wikipedia text is CC-BY-SA, so record the source URL and license and credit Wikipedia in any reuse; check media files for their own licenses.
  • Prefer the official path at scale. The MediaWiki API and database dumps are the sanctioned, lighter way to pull structured Wikipedia data in volume.

Frequently Asked Questions (FAQs)

Do I need the JavaScript token to scrape Wikipedia?

No. Wikipedia article pages are server-rendered static HTML, so the normal request token is enough. The JavaScript token exists for client-rendered sites that build their content in the browser, which Wikipedia does not. Using the normal token keeps each request lighter and avoids the extra rendering step.

How do I extract the infobox fields reliably?

Find the table.infobox element, then walk its rows pairing each th.infobox-label with its td.infobox-data cell. That label-value structure is consistent across articles, so the same loop turns a person's infobox, a country's infobox, or a film's infobox into a clean dictionary without hardcoding field names.

Can I scrape Wikipedia tables into CSV?

Yes. Select every table.wikitable, pass each one's HTML to pandas.read_html, and you get a DataFrame per table that you write out with to_csv. Wrapping the read in a try means a malformed table is skipped instead of stopping the run.

Can I reuse Wikipedia content in a commercial project?

Wikipedia text is licensed under CC-BY-SA, which permits commercial reuse as long as you attribute Wikipedia and the contributors and share derivative work under the same license. Embedded media can carry its own license, so check each file before reusing images. Record the source URL and license with your data so attribution is straightforward later.

Should I use the MediaWiki API instead of scraping?

For anything at volume, yes. The MediaWiki API returns structured article content directly, and Wikimedia publishes full database dumps for bulk use. Both are faster and far lighter on Wikipedia's servers than HTML scraping. HTML scraping is the right tool when you need exactly what a reader sees on a specific page; the API and dumps are better for building large datasets.

How do I avoid getting blocked while scraping Wikipedia?

Keep your request rate low, add a real delay between pages as shown in the scaling section, and route through rotating IPs so no single address gets throttled. The Crawling API handles rotation and retries for you. Read the site's robots.txt, keep batches modest, and prefer the official API or dumps for large jobs so you stay a polite client.
