Web Scraping to SQL: store and analyze with Python

Q: How do I avoid inserting the same record twice?

Put a UNIQUE constraint on a column that identifies the record, such as the product URL, then insert with INSERT OR IGNORE (or ON CONFLICT on PostgreSQL). The database skips any row whose key already exists, so re-running the scraper adds only new listings instead of duplicating the old ones.

Q: Why convert the price to a number before storing it?

Because a price kept as text like "$1,299.00" cannot be compared or averaged correctly: it sorts alphabetically and breaks aggregate functions. Stripping the symbol and commas to store a numeric price means WHERE price < 500, AVG(price), and ORDER BY price all behave as you expect.

Q: Do I need the JavaScript token for this?

Only if your target renders its content client-side. The normal token returns static HTML, which is enough for server-rendered pages. If the listing fills in after the page's scripts run, use the JS token and add the ajax_wait and page_wait options so the dynamic content loads before the HTML is captured.

Q: How do I scale this to many pages without overloading the database?

Paginate the scrape and pass the combined records to a batched executemany insert, committing every few thousand rows rather than after each one. For large crawls, the async Crawler queues requests and posts results to a webhook so you are not holding connections open, and indexing the columns you query keeps reads fast as the table grows.

Web scraping gets you the data, but storing and analyzing it is just as important. A folder of loose JSON or CSV files is fine for a one-off pull, yet the moment you want to track prices over time, compare records across sources, or answer a real question, that flat pile becomes the bottleneck. A SQL database gives scraped data a structured home you can query, filter, and aggregate without writing throwaway parsing code every time.

This guide builds the full path end to end with Python and SQL. You scrape a small set of product-style records from a neutral example page through the Crawling API, design a SQL schema, insert the parsed rows, and run analytical queries to pull insights back out. It stays scoped to public, illustrative data and a placeholder URL, so you can follow it safely and swap in your own target later.

What you will build

A runnable Python flow that fetches a rendered listing page through the Crawling API, parses each product card with BeautifulSoup, writes the records into a SQLite table, and then queries that table for price bands, averages, and category counts. Each scraped record carries these fields:

Name the product title shown on the listing card.
Price the listed price, stored as a number so you can compare and aggregate it.
Category the section or product type the item belongs to.
URL the link to the product's own page.
Scraped at a timestamp the database fills in automatically, so you can track when each row was collected.

Why store scraped data in SQL

Where the data lands decides how useful it is later. Files work until you need to ask a question that spans rows: which items are under a price threshold, what the average price is per category, how this week's pull compares to last week's. SQL was built for exactly that, and a relational database gives you a few things a folder of files does not.

Structured storage. Tables and columns enforce a consistent shape, so every record has the same fields in the same types.
Efficient querying. A single SELECT filters, sorts, and aggregates thousands of rows without any manual looping.
Data integrity. Primary keys, types, and constraints keep the data consistent and catch malformed rows on the way in.
Scalability. Whether you store thousands or millions of records, a SQL engine handles the growth and stays fast with the right indexes.
Repeatable analysis. Queries are reusable. Once you write a useful one, you run it again on tomorrow's data without rebuilding anything.

If you are weighing storage formats more broadly, the comparison in JSON vs CSV covers when a flat file is still the right call, and the data modeling guide goes deeper on schema design than we have room for here.

Why a plain request fails on a modern listing page

Before the storage work matters, you need clean records to store, and that is where a naive HTTP request tends to fall down. Many modern listing pages render their content in the browser through JavaScript: the first HTML response is a thin shell, and the product cards appear only after the page's scripts run. Pull data from that initial response and you capture a fragment of the list, or nothing at all.

The second problem is access. Sites that publish commercially valuable listings watch for automated traffic and block requests from datacenter IPs or anything that does not look like a real browser. So a reliable scraper needs two things in one request: a browser that renders the page, and an IP the site reads as a genuine visitor. You can build that with a headless browser and a pool of rotating residential proxies, but keeping that stack healthy is most of the work. The Crawling API folds both into a single call and returns finished HTML you can parse. For the parsing side, the Python web scraping guide and the guide to structuring scraped data are useful companions.

Prerequisites

A few things should be in place before you write any code. None take long.

Basic Python and SQL. You should be comfortable running a Python script and reading a SELECT statement. You do not need to be a database administrator; this guide uses SQLite, which ships with Python, so there is no server to install.

Python 3.8 or later. Confirm your version with python --version. If you do not have it, install it from python.org and make sure Python is on your PATH.

A Crawlbase account and token. Sign up, open your dashboard, and copy your token. Crawlbase includes up to 20,000 free requests to start, which is plenty for working through this guide. Treat the token like a password and keep it out of version control. If your target renders content client-side, use the JavaScript (JS) token so the page is rendered before you receive it.

Set up the project

Create a virtual environment so dependencies stay isolated, then install the two libraries the scraper needs.

bash

python --version

python -m venv scraper_env
source scraper_env/bin/activate

pip install crawlbase beautifulsoup4

On Windows, activate the environment with scraper_env\Scripts\activate instead of the source line. The crawlbase package is the official client for the Crawling API, and beautifulsoup4 parses the returned HTML so you can pull out individual fields. Both sqlite3 and json ship with the Python standard library, so there is nothing more to install for the storage step.

Step 1: Fetch a rendered page

Start by getting a finished page. Import the CrawlingAPI class, initialize it with your token, and request the listing URL. We use a placeholder, https://example.com/products; point it at any public listing page you have the right to collect. Checking the Crawlbase cb_status (legacy pc_status) before you parse keeps failures loud instead of silent.

python

from crawlbase import CrawlingAPI

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

LISTING_URL = "https://example.com/products"

def crawl(page_url):
    response = api.get(page_url)
    if response["headers"]["cb_status"] == "200":
        return response["body"].decode("utf-8")
    print(f"Request failed: {response['headers']['cb_status']}")
    return None

if __name__ == "__main__":
    html = crawl(LISTING_URL)
    print(html[:500] if html else "No HTML returned")

Run this and you should see real page markup printed back, which confirms the request worked before you write a single selector. If your target renders its cards client-side, pass the JS token instead and add {"ajax_wait": "true", "page_wait": 5000} as a second argument to api.get so the dynamic content has time to load.

Crawlbase Crawling API

The crawl function above hides the hard part: getting finished HTML back from a page that renders in the browser and blocks datacenter traffic. The Crawling API runs the page in a real browser, rotates through residential IPs server-side, and hands you the rendered HTML in one call, so you skip running a headless fleet and a proxy pool yourself. Start on the free tier and point it at a public listing page first.

Start free

Step 2: Parse the records you want to store

With finished HTML in hand, load it into BeautifulSoup and pull a record out of each product card. The selectors below assume a card layout with a title, a price, a category tag, and a link, a common shape for listing pages. Each lookup is guarded so a missing field returns a safe default instead of crashing the run. The detail that matters for storage is the price: scrape it as text, then strip the currency symbol and commas so it lands in the database as a number you can compare and average.

python

import re
from bs4 import BeautifulSoup

def parse_price(text):
    if not text:
        return None
    cleaned = re.sub(r"[^0-9.]", "", text)
    return float(cleaned) if cleaned else None

def parse_products(html):
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for card in soup.select("div.product-card"):
        name = card.select_one("h2.product-title")
        price = card.select_one("span.price")
        category = card.select_one("span.category")
        link = card.select_one("a.product-link")
        records.append({
            "name": name.get_text(strip=True) if name else None,
            "price": parse_price(price.get_text() if price else None),
            "category": category.get_text(strip=True) if category else "Uncategorized",
            "url": link["href"] if link and link.get("href") else None,
        })
    return records

parse_price strips everything that is not a digit or a decimal point, so "$1,299.00" becomes the float 1299.0. That single step makes the later price queries possible: text sorts alphabetically and cannot be averaged, but a number does both correctly. Adjust the selectors (product-card, product-title, price, category, product-link) to match your real target's markup; the rest of the flow stays the same.

Step 3: Design the SQL schema

Now design the table that holds these records. A good schema mirrors your record shape, picks the right type per column, and adds a primary key plus a timestamp the database fills in for you. Note the choices below: price is REAL (a number, not text) so it can be compared and aggregated, and scraped_at defaults to the current time so every row records when it was collected.

sql

CREATE TABLE IF NOT EXISTS products (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    name      TEXT NOT NULL,
    price     REAL,
    category  TEXT,
    url       TEXT UNIQUE,
    scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Speed up the price filters and category rollups you will run later
CREATE INDEX IF NOT EXISTS idx_price ON products(price);
CREATE INDEX IF NOT EXISTS idx_category ON products(category);

Two design choices are worth calling out. The UNIQUE constraint on url stops the same product being inserted twice, which matters because scrapers re-run and you do not want yesterday's rows duplicated today. The indexes keep the price and category queries fast as the table grows. On MySQL or PostgreSQL the schema reads almost identically: write AUTO_INCREMENT or SERIAL in place of AUTOINCREMENT, and DECIMAL(10,2) is a stricter choice than REAL for money. The data modeling guide covers when each type is the right call.

Step 4: Insert the parsed records

With the table designed, connect from Python, create it if needed, and insert the parsed records. Always use parameterized queries: the ? placeholders let the driver handle escaping, which keeps malformed text from breaking the insert and closes off injection. executemany writes every record in one batched call, far faster than a loop of single inserts when you have hundreds of rows.

python

import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS products (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    price REAL,
    category TEXT,
    url TEXT UNIQUE,
    scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
"""

def store_records(records, db_path="scraped_data.db"):
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute(SCHEMA)

    rows = [
        (r["name"], r["price"], r["category"], r["url"])
        for r in records if r["name"]
    ]
    cursor.executemany(
        "INSERT OR IGNORE INTO products (name, price, category, url) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )

    conn.commit()
    inserted = cursor.rowcount
    conn.close()
    return inserted

INSERT OR IGNORE pairs with the UNIQUE constraint on url from Step 3: a row whose URL is already present is skipped silently rather than raising an error, so re-running the scraper tops up new listings without duplicating old ones. Committing once after the batch keeps the write fast. For MySQL, the same code shape applies with mysql.connector in place of sqlite3 and %s placeholders instead of ?; everything downstream is identical.

Why parameterized queries

Never build an INSERT by pasting scraped text straight into the SQL string. Scraped values are untrusted input and can contain quotes or characters that break the statement or, worse, alter it. The ? placeholders hand that text to the driver as data, never as code, which is both safer and less error-prone.

Step 5: Analyze the data with SQL queries

The data is now in a queryable shape, which is the whole point of using SQL. Here are three kinds of analysis you will reach for constantly: filtering and sorting, aggregating for summaries, and grouping for per-category insight. Each is a plain SELECT you can run from the SQLite shell, a database GUI, or from Python.

First, filter and sort. This pulls every product under a price threshold, cheapest first:

sql

SELECT name, price, category
FROM products
WHERE price < 500
ORDER BY price ASC;

Next, aggregate the whole table into a one-line summary. Functions like COUNT, AVG, MIN, and MAX collapse many rows into the numbers you actually want to report:

sql

SELECT
    COUNT(*) AS total_products,
    ROUND(AVG(price), 2) AS average_price,
    MIN(price) AS cheapest,
    MAX(price) AS most_expensive
FROM products;

Finally, group by category to see how the catalogue breaks down. GROUP BY runs the aggregates once per category, and HAVING filters those groups after the fact:

sql

SELECT
    category,
    COUNT(*) AS items,
    ROUND(AVG(price), 2) AS avg_price
FROM products
GROUP BY category
HAVING COUNT(*) > 1
ORDER BY avg_price DESC;

To run these from Python instead of a shell, open the same connection and read the rows back. This snippet runs the per-category rollup and prints each group:

python

def category_summary(db_path="scraped_data.db"):
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute("""
        SELECT category, COUNT(*), ROUND(AVG(price), 2)
        FROM products
        GROUP BY category
        ORDER BY 3 DESC
    """)
    for category, count, avg_price in cursor.fetchall():
        print(f"{category}: {count} items, avg {avg_price}")
    conn.close()

If your analysis grows past what plain SQL is comfortable with, pull the table into a DataFrame and continue there; the pandas analysis guide picks up from exactly this point.

Wire it into one flow

The four functions chain into a single runnable script. Scrape, parse, store, then analyze:

python

def main():
    html = crawl(LISTING_URL)
    if not html:
        return
    records = parse_products(html)
    inserted = store_records(records)
    print(f"Parsed {len(records)} records, inserted {inserted} new rows")
    category_summary()

if __name__ == "__main__":
    main()

Each run fetches the page, writes only new rows thanks to INSERT OR IGNORE, and prints the per-category summary. Schedule it on a cron job and the table becomes a growing history you can query over time, which is where SQL earns its keep over a folder of dated CSV files.

What the output looks like

A stored record carries the structured fields plus the database-assigned id and timestamp. Selecting a couple of rows back returns something like this:

json

[
  {
    "id": 1,
    "name": "Wireless Headphones",
    "price": 129.99,
    "category": "Audio",
    "url": "https://example.com/products/wireless-headphones",
    "scraped_at": "2026-06-11 09:42:18"
  },
  {
    "id": 2,
    "name": "Mechanical Keyboard",
    "price": 89.50,
    "category": "Accessories",
    "url": "https://example.com/products/mechanical-keyboard",
    "scraped_at": "2026-06-11 09:42:18"
  }
]

The category summary query against that same table prints a compact rollup, one line per category with a count and an average price, which is the kind of insight a pile of raw files cannot give you without extra code.

Scaling beyond a single page

The flow above handles one page. Real datasets span many, and a few adjustments take this from a demo to something production-shaped.

Paginate. Most listing pages expose a page parameter or a next link. Loop over the pages, call crawl and parse_products for each, and pass the combined records to store_records in one batch.
Batch your inserts. executemany and a single commit per batch already minimize round trips. For very large jobs, commit every few thousand rows rather than all at once so a failure does not lose the whole run.
Index what you query. The two indexes from Step 3 cover price and category filters. Add an index on any other column you filter or sort on heavily.
Go async for volume. For thousands of pages, the async Crawler queues requests and delivers results to a webhook, so you are not holding open connections while you crawl. Pair it with batched inserts and the database keeps up.

If you are designing the larger collect-parse-store-analyze system rather than a single script, the data pipeline architecture guide covers how these stages fit together at scale.

Scraping responsibly

Keep collection within bounds. Scrape only public data that does not sit behind a login, read and respect the site's terms of service and its robots.txt, and keep your request rate reasonable so you are not straining the target's servers. When the data touches identifiable individuals, privacy rules such as GDPR and CCPA apply, so avoid personal data unless you have a lawful basis and a clear need for it. The example here uses a placeholder URL and illustrative product fields precisely to stay clear of those concerns; apply the same judgment when you point the scraper at a real site.

Recap

Key takeaways

SQL beats flat files for analysis. A structured table lets you filter, aggregate, and group thousands of records with a single query instead of throwaway parsing code.
Clean the data on the way in. Strip currency symbols and commas so price lands as a number; a numeric column sorts and averages correctly where text does not.
Design the schema deliberately. Pick real types, add a primary key and a UNIQUE URL to block duplicates, and index the columns you filter on.
Insert safely and in batches. Parameterized ? placeholders and executemany with one commit per batch keep inserts both secure and fast.
Analyze with plain SELECT. WHERE, ORDER BY, the aggregate functions, and GROUP BY turn stored rows into price bands, averages, and per-category insight.

Frequently Asked Questions (FAQs)

Why store scraped data in SQL instead of a CSV or JSON file?

Files are fine for a single pull, but they make cross-row questions painful. A SQL database enforces a consistent structure, lets you filter, sort, and aggregate with one query, and scales to millions of rows with indexes. It also handles repeated runs cleanly: a UNIQUE constraint stops duplicates so you can append to the same table over time rather than juggling dated files.

Which database should I use, SQLite, MySQL, or PostgreSQL?

SQLite is ideal for small to medium projects and local work because it ships with Python and needs no server, which is why this guide uses it. MySQL suits web applications and larger shared datasets, and PostgreSQL is strong for complex queries and analytics. The Python flow here changes only the connection and the placeholder syntax when you move between them; the schema and queries stay nearly identical.

How do I avoid inserting the same record twice?

Put a UNIQUE constraint on a column that identifies the record, such as the product URL, then insert with INSERT OR IGNORE (or ON CONFLICT on PostgreSQL). The database skips any row whose key already exists, so re-running the scraper adds only new listings instead of duplicating the old ones.

Why convert the price to a number before storing it?

Because a price kept as text like "$1,299.00" cannot be compared or averaged correctly: it sorts alphabetically and breaks aggregate functions. Stripping the symbol and commas to store a numeric price means WHERE price < 500, AVG(price), and ORDER BY price all behave as you expect.

Do I need the JavaScript token for this?

Only if your target renders its content client-side. The normal token returns static HTML, which is enough for server-rendered pages. If the listing fills in after the page's scripts run, use the JS token and add the ajax_wait and page_wait options so the dynamic content loads before the HTML is captured.

How do I scale this to many pages without overloading the database?

Paginate the scrape and pass the combined records to a batched executemany insert, committing every few thousand rows rather than after each one. For large crawls, the async Crawler queues requests and posts results to a webhook so you are not holding connections open, and indexing the columns you query keeps reads fast as the table grows.

Hassan Rehan

Software Engineer · Crawlbase

Software engineer at Crawlbase writing hands-on guides on rotating proxies, scraping, and the practical details of wiring proxies into real code.

Start Building

Crawl any site at scale, without fighting infrastructure.

Crawlbase handles proxies, fingerprints, and CAPTCHAs so your team ships data pipelines instead of maintaining crawl plumbing. Up to 20,000 requests free, no card required.

Get a free API key →Read the docs

Self-serve · No sales call required · Enterprise crawl volumes available

What you will build

Why store scraped data in SQL

Why a plain request fails on a modern listing page

Prerequisites

Set up the project

Step 1: Fetch a rendered page

Step 2: Parse the records you want to store

Step 3: Design the SQL schema

Step 4: Insert the parsed records

Step 5: Analyze the data with SQL queries

Wire it into one flow

What the output looks like

Scaling beyond a single page

Scraping responsibly

Key takeaways

Frequently Asked Questions (FAQs)

Why store scraped data in SQL instead of a CSV or JSON file?

Which database should I use, SQLite, MySQL, or PostgreSQL?

How do I avoid inserting the same record twice?

Why convert the price to a number before storing it?

Do I need the JavaScript token for this?

How do I scale this to many pages without overloading the database?

Crawl any site at scale, without fighting infrastructure.

Continue Reading

How to Scrape Google People Also Ask: full PAA extraction guide

Introducing the New Crawlbase Dashboard: a cleaner control center

13 Tips to Master Data Crawling: crawls that do not break

The infrastructure brief, in your inbox.

We use cookies

Customize cookies