HTML tables are how the web stores its most analysis-ready data: stock listings, sports standings, population figures, product comparisons, historical records. A table on a public page is already rows and columns, so the hard part is rarely the analysis. It is getting the data off the page and into a tool you can work with, cleanly and repeatably, without copying cells by hand.

This guide shows you three ways to scrape tables from a website, each suited to a different skill level and workflow: a no-code Google Sheets formula, a few lines of Python with pandas, and an R script using rvest. The running example is a public Wikipedia table, which is a good fit because the data is open and the markup is clean. Every method here stays scoped to public data, and the short responsible-scraping section near the end is worth reading before you point any of this at a real source.

What you will build

By the end you will have three working ways to pull a public table into a usable format. Each one targets the same kind of source, an HTML <table> on a public page, and produces the same kind of output, a tidy grid you can sort, filter, or export. The pieces you will end up with are:

  • A Google Sheets formula that imports a live table with one IMPORTHTML call, no code required.
  • A Python script that reads tables with pandas.read_html in one line, with a BeautifulSoup fallback for messy markup.
  • An R script that extracts tables with rvest's html_table into a data frame.
  • A clean tabular output from each method, ready to export to CSV or feed into analysis.

Why some tables need more than a plain request

For a clean, server-rendered page like most of Wikipedia, all three methods below work directly: the table markup is already in the HTML, so a formula or a one-line read pulls it straight out. That is the easy and common case, and it is where you should start.

The trouble starts on sources that do not hand you the table so freely. Two problems show up most often. First, some sites build their tables with JavaScript after the page loads, so the raw HTML you fetch is missing the rows you came for. Second, many sites watch for scraper-shaped traffic and rate limit or challenge requests from datacenter IPs after the first handful of hits. When either happens, IMPORTHTML returns an error, and a plain requests.get in Python returns a block page or empty markup instead of the table.

The fix for both is the same: fetch the page through a browser that runs its scripts, from an IP the site reads as a real visitor. You can assemble that yourself with a rotating IP pool and a headless browser, but that is most of the work. In the Python section we use the Crawling API for exactly these blocked or JavaScript-rendered sources, while keeping the plain path for everything that does not need it.

Prerequisites

What you need depends on which method you pick. None of them take long to set up.

For Google Sheets: just a Google account and a browser. There is nothing to install.

For Python: Python 3.8 or later. Confirm your version with python --version. You should be comfortable running a script and installing packages with pip. If parsing HTML is new to you, our guide to using BeautifulSoup in Python covers the basics this section assumes.

For R: a recent R install (4.0 or later is fine) and the ability to install packages from CRAN.

For blocked or JavaScript-rendered sources: a Crawlbase account and token. Sign up, open your dashboard, and copy your normal token. Treat the token like a password and keep it out of version control. You only need this for the Python path when a plain request fails.

Method 1: Google Sheets with IMPORTHTML

The fastest way to get a table off a public page needs no code at all. Google Sheets has a built-in IMPORTHTML function that fetches a page, finds a table or list by index, and drops it into your sheet as live cells. It even refreshes on its own, so the data stays current.

Open a new sheet, click an empty cell, and enter the formula. It takes three arguments: the page URL, whether you want a "table" or a "list", and the index of which one to grab (the first table on the page is 1, the second is 2, and so on).

html
=IMPORTHTML("https://en.wikipedia.org/wiki/List_of_largest_cities", "table", 1)

Press Enter and Sheets fetches the page, pulls out the first table, and spills it across the cells starting from where you typed the formula. If the first table is not the one you want, increase the index until you land on it; a page often has navigation or infobox tables ahead of the main data. Once the data lands, you may want to resize columns or reformat cells, and you can leave the formula in place so the sheet refreshes when the source changes.

When IMPORTHTML returns an error

IMPORTHTML fetches raw HTML and does not run JavaScript, so it only sees tables that are present in the page source. If it returns #N/A or an empty result, the table is probably built by JavaScript after load, or the site blocked the request. Those are exactly the cases the Python method with the Crawling API handles, since it can render the page and fetch from a trusted IP first.

Crawlbase Crawling API

That IMPORTHTML formula works because Wikipedia serves its tables as plain HTML from an open source. The moment a table is built with JavaScript or sits behind bot protection, the formula returns #N/A and so does a plain Python request. The Crawling API rotates through residential IPs server-side, renders the page in a real browser when you ask it to, and hands back finished HTML, so you skip running a headless browser fleet and a proxy pool yourself. Point it at a public page on the free tier first.

Method 2: Python with pandas and the Crawling API

Python is the workhorse for this. For a clean public table, pandas.read_html does almost everything in one line: give it a URL or a chunk of HTML and it returns a list of every table on the page as data frames. Start by installing the libraries.

bash
python -m venv tables_env
source tables_env/bin/activate

pip install pandas lxml beautifulsoup4 crawlbase

On Windows, activate the environment with tables_env\Scripts\activate instead of the source line. pandas does the table reading, lxml is the parser it uses under the hood, beautifulsoup4 is the fallback for messy markup, and crawlbase is the official client for the Crawling API.

For a public, server-rendered page, the simplest version is genuinely one call. Pass the URL to read_html and index into the list it returns.

python
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_largest_cities"

# read_html returns a list of every table it finds on the page
tables = pd.read_html(url)
print(f"Found {len(tables)} tables")

# pick the one you want by index, then work with it as a DataFrame
df = tables[0]
print(df.head())
df.to_csv("cities.csv", index=False)

That is the whole job when the page cooperates: read_html scans the HTML, builds a data frame for each <table>, and you keep the one you want and write it to CSV. Just like the Sheets index, you may need to try a couple of indices to find the main table rather than a sidebar or navigation block.

When the source blocks plain requests or builds its table with JavaScript, fetch the HTML through the Crawling API first, then hand that HTML to read_html. The only change is where the markup comes from.

python
import pandas as pd
from crawlbase import CrawlingAPI

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

def fetch_html(url, render=False):
    options = {"country": "US"}
    if render:
        # use the JS token for JavaScript-built tables
        options["ajax_wait"] = "true"
    response = api.get(url, options)
    if response["status_code"] == 200:
        return response["body"].decode("utf-8")
    print(f"Request failed: {response['status_code']}")
    return None

url = "https://en.wikipedia.org/wiki/List_of_largest_cities"
html = fetch_html(url)

if html:
    tables = pd.read_html(html)
    df = tables[0]
    print(df.head())
    df.to_csv("cities.csv", index=False)

The country option pins the request to a US exit IP, and passing render=True tells the call to use a JavaScript token (here via ajax_wait) so the page renders before you get the HTML. With the markup in hand, read_html behaves exactly as before. This is the version to reach for the moment the plain path returns a block page or an empty table list.

Falling back to BeautifulSoup for messy tables

Sometimes read_html trips up: a table with merged cells, no real <table> tag, or rows scattered across nested markup. When that happens, drop down to BeautifulSoup and build the rows yourself by reading each cell. This is the same field-by-field approach as our guide to CSS selectors, applied to table cells.

python
import pandas as pd
from bs4 import BeautifulSoup

def table_to_rows(html, index=0):
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find_all("table")[index]
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all(["th", "td"])
        row = [cell.get_text(strip=True) for cell in cells]
        if row:
            rows.append(row)
    return rows

rows = table_to_rows(html)
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df.head())

The helper walks every <tr>, reads each <th> and <td> as stripped text, and skips empty rows. Treating the first row as the header and the rest as data gives you a clean data frame even when read_html could not parse the table on its own. Pass the same Crawling API HTML to this helper when the messy table also happens to be blocked or rendered.

Method 3: R with rvest

If your analysis already lives in R, you do not need to leave it. The rvest package reads HTML and its html_table function turns every table on a page into a data frame, much like pandas does. Install the package once from your R console.

r
install.packages("rvest")

With rvest loaded, read the page, extract its tables, and pick the one you want. The pattern mirrors the Python one: read the HTML, get a list of tables, index into it.

r
library(rvest)

url <- "https://en.wikipedia.org/wiki/List_of_largest_cities"

# read the page, then pull every table into a list of data frames
page <- read_html(url)
tables <- page %>% html_elements("table") %>% html_table()

# keep the first table and inspect it
df <- tables[[1]]
head(df)

# write it out for analysis elsewhere
write.csv(df, "cities.csv", row.names = FALSE)

R uses double brackets, tables[[1]], to pull a single element out of the list, where Python uses single brackets. Aside from that, the flow is identical: read the page, collect the tables, keep the one you need, and export it. From here the data frame behaves like any other in R, ready for dplyr, ggplot2, or a CSV export. If the source blocks the direct read, fetch the rendered HTML through the Crawling API as in the Python section and pass that string to read_html instead of the URL.

What the output looks like

Whichever method you use, the result is the same shape: a clean grid with a header row and one row per record. Exported to CSV, the largest-cities table looks like this.

csv
Rank,City,Country,Population,Year
1,Tokyo,Japan,37468000,2018
2,Delhi,India,28514000,2018
3,Shanghai,China,25582000,2018
4,Sao Paulo,Brazil,21650000,2018
5,Mexico City,Mexico,21581000,2018

The exact columns depend on the source table, but the structure is consistent: the header becomes your column names and each subsequent row is a record. From a CSV like this, you can load straight into a spreadsheet, a notebook, or a database without any further cleanup beyond the occasional type conversion.

Scraping tables responsibly

The methods above are easy to run, which makes it worth being deliberate about where you point them. Check each source's terms of service and its robots.txt before you collect anything at volume, and prefer genuinely public data, the kind a visitor sees without logging in, over anything behind an account or paywall. Pace your requests so you are not hammering a server in a tight loop, and lean on an official API or a published dataset when the source offers one, since that is almost always the lighter and more reliable route.

If you republish a table you scraped, credit the source and respect its license. Open datasets like Wikipedia carry clear reuse terms (attribution under a Creative Commons license, in that case), and many other sites do not permit redistribution at all. Collecting public figures for your own analysis is a far lighter use than reposting someone else's compiled data as your own, so when in doubt, link back to the source rather than copying it wholesale.

Recap

Key takeaways

  • Pick the tool that fits your workflow. Google Sheets IMPORTHTML for no-code, pandas read_html for Python, and rvest html_table for R all pull a public table in a line or two.
  • Index into the list of tables. All three methods return every table on the page, so try a couple of indices to land on the main data rather than a sidebar or navigation block.
  • Drop to BeautifulSoup when markup is messy. Merged cells or non-standard tables that break read_html can be rebuilt row by row from <tr>, <th>, and <td> tags.
  • Use the Crawling API for blocked or JavaScript tables. When a formula or plain request returns an empty result, fetch the rendered HTML behind a trusted IP, then parse it with the same code.
  • Scrape responsibly. Respect each source's ToS and robots.txt, prefer public data and official APIs, and credit or license any data you republish.

Frequently Asked Questions (FAQs)

What is the easiest way to scrape a table from a website?

For a public, server-rendered page, Google Sheets is the simplest. Type =IMPORTHTML(url, "table", 1) into a cell and Sheets fetches the page and drops the first table into your spreadsheet as live cells, no code required. It works for any page whose table is present in the raw HTML, and it refreshes on its own.

Why does IMPORTHTML return #N/A or an empty result?

IMPORTHTML reads raw HTML and does not run JavaScript, so it cannot see tables that a page builds with scripts after it loads. It also fails when the site blocks the request. In both cases, switch to the Python method and fetch the page through the Crawling API, which can render the page in a real browser and request it from a trusted IP before parsing.

How do I scrape a table with Python?

The shortest path is pandas.read_html(url), which returns a list of data frames, one per table on the page. Index into the list to keep the one you want and export it with to_csv. For sources that block plain requests or render tables with JavaScript, fetch the HTML through the Crawling API first and pass that markup to read_html instead of the URL.

How do I scrape a table in R?

Install rvest, read the page with read_html, then run html_elements("table") %>% html_table() to get a list of tables as data frames. Pick one with double-bracket indexing, such as tables[[1]], and export it with write.csv. The flow mirrors the pandas approach almost exactly.

What if pandas read_html cannot parse the table?

Some tables have merged cells, missing <table> tags, or rows nested in unusual markup that read_html cannot handle. Fall back to BeautifulSoup: loop over each <tr>, read every <th> and <td> as text, and build the rows yourself into a data frame. That gives you full control over how each cell is interpreted.

How do I scrape tables from a site that blocks me?

Sites that rate limit or challenge automated traffic need a request that looks like a real visitor. Route the fetch through rotating residential IPs so no single address trips a limit, and render JavaScript when the table is built client-side. The Crawling API handles both in one call; once you have the returned HTML, parse it with pandas, BeautifulSoup, or rvest exactly as you would any other page.

Start Building

Crawl any site at scale, without fighting infrastructure.

Crawlbase handles proxies, fingerprints, and CAPTCHAs so your team ships data pipelines instead of maintaining crawl plumbing. 1,000 requests free, no card required.

Self-serve · No sales call required · Enterprise crawl volumes available