Excel remains the workspace where a lot of analysis actually happens. Once a dataset lands in a spreadsheet, you can sort it, filter it, chart it, and hand it to a colleague who never has to touch a terminal. The catch is getting clean, structured data into the sheet in the first place, and copying rows from a web page by hand is slow, error-prone, and impossible to repeat on a schedule.
This guide shows you how to scrape data from a website into Excel with Python. You build a small, runnable script that fetches a rendered page through the Crawling API, parses the records you care about with BeautifulSoup, and writes a clean .xlsx file with a header row, one record per line, and basic formatting using openpyxl. The walkthrough stays on a neutral example page and public data, so you can run every step end to end before pointing it at anything of your own.
What you will build
A Python script that fetches a public listing page through the Crawling API, extracts a structured record per item, and saves the whole set to a formatted Excel workbook. The running example uses a generic public catalog page at a placeholder URL, so swap in your own target once the flow works. We pull these fields:
- Title the name or heading of each item.
- Price the listed price as shown on the page.
- Availability the stock or status label.
- Rating the review score where the page exposes one.
- Link the canonical URL of the item's own page.
The output is a single spreadsheet with those columns, a bold frozen header row, and auto-sized columns, ready to sort, filter, or chart inside Excel or Google Sheets.
Why a plain request often fails
The old way to get web data into Excel was the built-in From Web query: paste a URL, let Excel pull the tables, and refresh on a timer. It still exists, but it breaks on most modern sites for two reasons, and the same two reasons trip up a naive Python request.
First, many pages render their content in the browser with JavaScript. The initial HTML that a bare HTTP client receives is a thin shell, and the prices, titles, and stock labels only appear after the page's scripts run. Pull the data out of that first response and you capture an empty template instead of the real rows. Second, sites watch for automated traffic. Datacenter IP addresses and request patterns that do not look like a real browser get rate-limited, IP-blocked, or shown a CAPTCHA before they ever reach the content, which is exactly what happens when an Excel auto-refresh hits a site too often from one address.
So a reliable scraper needs two things in a single request: a browser that actually renders the page, and an IP the site reads as a real visitor. You can build that yourself with a headless browser plus a pool of rotating residential proxies, but keeping that stack healthy is most of the work. The Crawling API folds both into one call: you send it the URL, it renders the page behind a trusted IP, and it returns finished HTML for you to parse.
Crawlbase gives you two token types. The normal token fetches static HTML and is the right choice when the data is already in the page source. The JavaScript (JS) token renders the page in a real browser first, which you need when content loads client-side. Start with the normal token; switch to the JS token only if the fields come back empty.
Prerequisites
You need a few things in place before writing any code. None of them take long.
Basic Python. You should be comfortable writing and running a script and installing packages with pip. If the parsing side is new to you, the BeautifulSoup guide is a good companion, and the broader scraping with Python walkthrough covers the fundamentals.
Python 3.8 or later. Confirm your version with python --version. If you do not have it, install it from python.org or through a distribution like Anaconda, and make sure Python is on your PATH.
A Crawlbase account and token. Sign up, open your dashboard, and copy your token from the account docs page. Crawlbase includes 1,000 free requests to start, which is plenty for working through this guide, and you pay only for successful requests. Treat the token like a password and keep it out of version control.
Set up the project
Create a virtual environment so project dependencies stay isolated, then install the three libraries the script needs.
python --version python -m venv excel_env source excel_env/bin/activate pip install crawlbase beautifulsoup4 openpyxl
On Windows, activate the environment with excel_env\Scripts\activate instead of the source line. Three dependencies do the work: crawlbase is the official client for the Crawling API, beautifulsoup4 parses the returned HTML so you can pull individual fields by CSS selector, and openpyxl writes the .xlsx file with formatting. If you would rather work in DataFrames, pandas can write Excel too, and there is a short note on that near the end.
Step 1: Fetch a page through Crawlbase
Start by getting a finished page. Import the CrawlingAPI class, initialize it with your token, and request the target URL. Checking the Crawlbase pc_status before you parse keeps failures loud instead of silent, so you never try to scrape an error page.
from crawlbase import CrawlingAPI api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"}) def crawl(page_url): response = api.get(page_url) if response["headers"]["pc_status"] == "200": return response["body"].decode("utf-8") print(f"Request failed: {response['headers']['pc_status']}") return None if __name__ == "__main__": catalog_url = "https://example.com/products" html = crawl(catalog_url) print(html[:500] if html else "No HTML returned")
Run this with python excel_scraper.py and you should see the first 500 characters of real page markup, which confirms the request works before you write a single selector. If the page renders its data client-side, the body will look like an empty template; in that case add the JS token by passing {"token": "YOUR_CRAWLBASE_TOKEN", "javascript": "true"} when you create the client, and the API will render it in a browser first. For the neutral catalog example here, the normal token is enough.
The fetch step above is the part that quietly breaks Excel's own From Web query at any volume: a single IP refreshing a site gets throttled, and client-side pages return an empty shell. The Crawling API takes your token, renders the page when needed, and rotates through residential IPs server-side, then hands you finished HTML, so you skip running a headless browser fleet and a proxy pool yourself. Point it at a public page on the free tier first.
Step 2: Parse the records you want
A catalog page is a grid of item cards, each with a title, a price, a stock label, and a link. Load the rendered HTML into BeautifulSoup, select every card, and read the fields out of each one. Guarding each lookup means a card that is missing a field returns None instead of crashing the loop.
from bs4 import BeautifulSoup def text_of(card, selector): el = card.select_one(selector) return el.get_text(strip=True) if el else None def parse_records(html, base_url): soup = BeautifulSoup(html, "html.parser") records = [] for card in soup.select("article.product"): link_el = card.select_one("a.product-link") href = link_el["href"] if link_el else None records.append({ "title": text_of(card, "h2.product-title"), "price": text_of(card, "span.price"), "availability": text_of(card, "span.stock"), "rating": text_of(card, "span.rating"), "link": urljoin(base_url, href) if href else None, }) return records
The text_of helper queries one element inside a card and returns its stripped text, or None when the element is absent. The select("article.product") call returns every item card, and the loop reads the title, price, availability, and rating from each. urljoin turns a relative href into a full URL so the link column is usable on its own. The selectors here (article.product, h2.product-title, span.price) are illustrative; on a real target you inspect the page in your browser's dev tools and replace them with the actual class names.
Generated class names and layout markup change without notice on production sites. Treat any selector as a starting template, not a contract. When records come back empty, re-inspect the live page in dev tools and update the selector. Periodic selector maintenance is normal for any scraper, not a sign something is broken.
Step 3: Write the records to Excel with openpyxl
Now turn the list of records into a spreadsheet. openpyxl builds a workbook in memory: you create a sheet, write a bold header row, append one row per record, freeze the header, size the columns to their content, then save. The function below takes the records and a filename and does all of it.
from openpyxl import Workbook from openpyxl.styles import Font from openpyxl.utils import get_column_letter HEADERS = ["Title", "Price", "Availability", "Rating", "Link"] KEYS = ["title", "price", "availability", "rating", "link"] def save_to_excel(records, filename="products.xlsx"): wb = Workbook() ws = wb.active ws.title = "Products" ws.append(HEADERS) for cell in ws[1]: cell.font = Font(bold=True) for record in records: ws.append([record.get(key) for key in KEYS]) ws.freeze_panes = "A2" for i, header in enumerate(HEADERS, start=1): column = ws[get_column_letter(i)] width = max(len(str(c.value)) for c in column if c.value) ws.column_dimensions[get_column_letter(i)].width = width + 2 wb.save(filename) print(f"Saved {len(records)} rows to {filename}")
A few details make the sheet pleasant to use. ws.append(HEADERS) writes the column titles, and the loop over ws[1] makes that first row bold. freeze_panes = "A2" pins the header so it stays visible while you scroll. The column-width loop measures the longest value in each column and sets the width to fit it, with a little padding, so nothing is cut off. Reading values through record.get(key) means a record missing a field writes a blank cell instead of raising. The result is a tidy workbook you can open in Excel, Numbers, or Google Sheets.
Step 4: Assemble the full script
Now wire the three pieces into one runnable file: fetch the page, parse the records, and save the spreadsheet.
from urllib.parse import urljoin from crawlbase import CrawlingAPI from bs4 import BeautifulSoup from openpyxl import Workbook from openpyxl.styles import Font from openpyxl.utils import get_column_letter api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"}) HEADERS = ["Title", "Price", "Availability", "Rating", "Link"] KEYS = ["title", "price", "availability", "rating", "link"] def crawl(page_url): response = api.get(page_url) if response["headers"]["pc_status"] == "200": return response["body"].decode("utf-8") print(f"Request failed: {response['headers']['pc_status']}") return None def text_of(card, selector): el = card.select_one(selector) return el.get_text(strip=True) if el else None def parse_records(html, base_url): soup = BeautifulSoup(html, "html.parser") records = [] for card in soup.select("article.product"): link_el = card.select_one("a.product-link") href = link_el["href"] if link_el else None records.append({ "title": text_of(card, "h2.product-title"), "price": text_of(card, "span.price"), "availability": text_of(card, "span.stock"), "rating": text_of(card, "span.rating"), "link": urljoin(base_url, href) if href else None, }) return records def save_to_excel(records, filename="products.xlsx"): wb = Workbook() ws = wb.active ws.title = "Products" ws.append(HEADERS) for cell in ws[1]: cell.font = Font(bold=True) for record in records: ws.append([record.get(key) for key in KEYS]) ws.freeze_panes = "A2" for i, header in enumerate(HEADERS, start=1): column = ws[get_column_letter(i)] width = max(len(str(c.value)) for c in column if c.value) ws.column_dimensions[get_column_letter(i)].width = width + 2 wb.save(filename) print(f"Saved {len(records)} rows to {filename}") def main(): catalog_url = "https://example.com/products" html = crawl(catalog_url) if not html: return records = parse_records(html, catalog_url) save_to_excel(records) if __name__ == "__main__": main()
The main function fetches the catalog page, parses it into a list of records, and saves the spreadsheet. Returning early when the fetch fails keeps the run from writing an empty file on a bad response. Adjust the catalog_url and the selectors in parse_records to fit your own target, and the rest of the script stays the same.
What the output looks like
Run the full script with python excel_scraper.py and you get a products.xlsx file with a bold header row, one row per item, and columns sized to their content. Conceptually, the data lands like this:
Title,Price,Availability,Rating,Link Wireless Mouse,$24.99,In stock,4.5,https://example.com/products/wireless-mouse Mechanical Keyboard,$79.00,In stock,4.7,https://example.com/products/mechanical-keyboard USB-C Hub,$39.50,Out of stock,4.2,https://example.com/products/usb-c-hub Laptop Stand,$32.00,In stock,4.6,https://example.com/products/laptop-stand
The spreadsheet carries those same five columns with the header frozen, so you can sort by price, filter on availability, or chart the ratings without any further cleanup. From here the data is plain tabular content, which means it also moves cleanly into other formats; for the tradeoffs between spreadsheet-friendly and code-friendly shapes, see JSON vs CSV.
Scaling across pages
One catalog page is usually a slice of a larger set. Most listings paginate with a query parameter or a path segment, so you fetch each page, parse it, and collect the records into one list before writing the workbook. A short pause between requests keeps the run polite.
import time def collect_all(base_url, max_pages): all_records = [] for page in range(1, max_pages + 1): page_url = f"{base_url}?page={page}" html = crawl(page_url) if not html: break records = parse_records(html, page_url) if not records: break all_records.extend(records) time.sleep(2) return all_records
Swap collect_all(catalog_url, max_pages=5) in for the single crawl call in main, then pass the combined list to save_to_excel. The loop stops early when a page returns nothing or no records, so you do not have to know the exact page count in advance. The time.sleep(2) between requests paces the run. For much larger jobs, the async Crawler queues requests and delivers results to a webhook, which suits crawling many pages without holding open connections.
If you already work in DataFrames, you can skip openpyxl's manual sheet building: pandas.DataFrame(records).to_excel("products.xlsx", index=False) writes the same columns in one line (it uses openpyxl under the hood). You lose the per-column formatting from Step 3 but gain everything pandas offers for cleaning and analysis. See the pandas analysis guide for what comes next.
Scraping responsibly
Getting data into a spreadsheet is the easy part; collecting it responsibly is the part that keeps your project defensible. Stick to a few rules whatever the target. Scrape only public data that any visitor can see without logging in, and stay away from anything behind authentication. Read the site's terms of service and its robots.txt, and treat both as the boundary for what you collect. Keep your request rate reasonable so you are not straining the site's servers; the short sleeps in the pagination loop above are a sensible floor, not a ceiling.
When the data involves people, the bar is higher. Personal data is regulated under frameworks like the GDPR and the CCPA, so avoid collecting names, contact details, or anything tied to identifiable individuals unless you have a clear legal basis and a genuine need. For most analytics work you do not, and a catalog of public product fields like the one in this guide stays well clear of that line. Used this way, scraping into Excel is a routine, low-risk way to keep a dataset current.
Key takeaways
- Excel's own From Web query breaks at scale. A single IP refreshing a site gets throttled and client-side pages return an empty shell, which is why a code-driven scraper is more reliable.
-
Fetch through the Crawling API. One call renders the page when needed and rotates IPs server-side, returning finished HTML; check
pc_statusbefore you parse so failures stay loud. -
Parse defensively. Select each card, read fields with a guarded helper that returns
Noneon a miss, and expect to update selectors when a site's markup drifts. -
Write a clean workbook with openpyxl. A bold frozen header, one row per record, and auto-sized columns give you a sheet that is ready to sort, filter, or chart, or use pandas
to_excelfor a one-liner. - Scrape responsibly. Stay on public data, respect ToS and robots.txt, pace your requests, and apply GDPR or CCPA care whenever personal data is involved.
Frequently Asked Questions (FAQs)
Can I scrape a website directly into Excel without writing code?
Excel's built-in From Web query (under the Data ribbon) can pull tables from a URL with no code, and it works for simple, static, table-based pages. It struggles on modern sites that render content with JavaScript, and a scheduled auto-refresh from a single IP tends to get rate-limited or blocked. For anything beyond simple static tables, a small Python script that fetches through the Crawling API and writes the workbook is far more reliable.
Should I use openpyxl or pandas to write the Excel file?
Both work, and pandas uses openpyxl under the hood. Use openpyxl directly when you want control over formatting like bold headers, frozen panes, and column widths, as shown in Step 3. Use pandas.to_excel when you are already working in DataFrames and want a one-line export; you give up the manual formatting but gain everything pandas offers for cleaning and analysis.
Do I need the normal token or the JS token?
Start with the normal token. It fetches static HTML and is enough whenever the data is already in the page source, which covers many catalog and listing pages. Switch to the JS token only when the fields come back empty, which means the page renders its content client-side and needs a real browser. The JS token renders the page first, then hands back the finished HTML.
My parsed records are empty. What went wrong?
Usually one of two things. Either the selectors do not match the live page, in which case you re-inspect it in your browser's dev tools and update them, or the page renders client-side and the normal token returned an empty shell, in which case you switch to the JS token. Print the first few hundred characters of the HTML to see which case you are in before changing anything else.
How do I scrape more than one page of results into the same spreadsheet?
Loop over the pages, parse each one, and collect the records into a single list before writing the workbook. The collect_all function in the scaling section shows the pattern: it walks pages until one returns no records, with a short sleep between requests, then you pass the combined list to save_to_excel once.
Is it legal to scrape data into Excel?
Scraping public data for your own analysis is generally fine, but it depends on the site's terms of service, your jurisdiction, and what you do with the data. Stay on public pages, respect the site's terms and robots.txt, keep your request rate reasonable, and apply GDPR or CCPA care whenever the data concerns identifiable people. For bulk or commercial use, check whether the site offers an official API or data license.
Crawl any site at scale, without fighting infrastructure.
Crawlbase handles proxies, fingerprints, and CAPTCHAs so your team ships data pipelines instead of maintaining crawl plumbing. 1,000 requests free, no card required.
