How to Store Scraped Data on the Cloud

Scraping a page is only half the job. The records you pull have to live somewhere durable, somewhere your teammates and your analysis tools can reach them, and somewhere a laptop crash cannot wipe out. A local CSV is fine for a one-off, but the moment a scrape becomes a recurring pipeline, a single hard drive turns into a liability: capacity runs out, transfers between machines get clumsy, and one disk failure costs you work you cannot get back.

This guide shows you how to store scraped data on the cloud with Python end to end. You build a small, runnable flow that fetches a page through the Crawling API, structures the results into clean records, and then writes them to durable cloud destinations: an object store using S3-style buckets and a managed relational database. Everything here uses a neutral example URL and environment-variable placeholders for credentials, so you can adapt it to your own target and provider without changing the shape of the code.

What you will build

A Python script that scrapes a small dataset from a public example listing page through the Crawling API, normalizes each row into a typed record, and ships those records to the cloud two ways. You can keep one path or both. The pieces are:

Scrape: a rendered page fetched through the Crawling API, returning finished HTML.
Transform: the raw HTML turned into a list of structured records with consistent field names and types.
Object storage: a JSON Lines file uploaded to an S3-style bucket for cheap, durable archival.
Managed database: the same records inserted into a Postgres table for querying and joins.
Crawlbase Cloud Storage: an optional one-parameter path that saves the raw crawl response server-side.

Why store scraped data in the cloud

Local storage is convenient until it is not. As a scrape grows from a few hundred rows to a recurring job feeding a dashboard, three problems show up at once. Capacity becomes a recurring cost: you buy disks to keep backups safe and spend time managing them. Access gets awkward: data trapped on one machine is hard to share with a team or feed into a tool running elsewhere. And durability is fragile: power issues, firmware corruption, and plain human error can take a single disk down, and with it any work that was not copied somewhere else.

Cloud storage answers all three. Object stores and managed databases are built for redundancy, so your data is replicated across locations rather than sitting on one drive. They scale without you provisioning hardware, they are reachable from anywhere with credentials, and they hand off backup and durability to the provider. For a scraping pipeline that means you can treat collected data as a durable asset from the moment it lands, not something you have to babysit on a local disk.

Two destinations, and when to use each

This tutorial writes to two kinds of cloud store because scraped data usually wants both. Object storage (S3-compatible buckets) is the right home for raw and archival data: cheap per gigabyte, indifferent to file shape, and ideal for keeping the untouched scrape output so you can reprocess it later. A managed relational database (Postgres here) is the right home for the structured, queryable copy, where consistent columns and types let you filter, aggregate, and join with SQL. The common pattern is to write to both, and the code below does exactly that. For a deeper comparison, see cloud storage versus local storage and the advantages of cloud storage.

Prerequisites

A few things should be in place before you write any code. None take long.

Python 3.8 or later. Confirm your version with python --version. If you do not have it, install it from python.org or through a distribution like Anaconda, and make sure Python is on your PATH.

A Crawlbase account and token. Sign up, open your dashboard, and copy your token. Crawlbase includes up to 20,000 free requests to start, which is plenty for working through this guide. Treat the token like a password and keep it out of version control. If your target renders content client-side, use the JavaScript token; for a static page the normal token is fine.

Cloud credentials. For the object-storage path you need an S3-style bucket and an access key pair. For the database path you need a connection string to a managed Postgres instance. Both are supplied through environment variables in the code below, never hardcoded.

Comfort with Python and basic scraping. If the parsing side is new to you, the BeautifulSoup guide and the scraping with Python walkthrough are good companions.

Set up the project

Create a virtual environment so dependencies stay isolated, then install the libraries the flow needs.

bash

python --version

python -m venv cloud_env
source cloud_env/bin/activate

pip install crawlbase beautifulsoup4 boto3 psycopg2-binary

On Windows, activate the environment with cloud_env\Scripts\activate instead of the source line. Four dependencies do the work: crawlbase is the official client for the Crawling API, beautifulsoup4 parses the returned HTML, boto3 talks to S3-style object storage, and psycopg2-binary connects to Postgres. The json module ships with the standard library, so the archival format needs nothing extra.

Step 1: Scrape a page through the Crawling API

Start by fetching a finished page. Import the CrawlingAPI class, initialize it with your token, and request the target URL. Checking the Crawlbase cb_status (legacy pc_status) before you parse keeps failures loud instead of silent. We use a neutral example listing page here; swap in your own URL when you adapt the flow.

python

from crawlbase import CrawlingAPI

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

def crawl(page_url):
    """Fetch a page through the Crawling API; return its HTML, or None on failure."""
    response = api.get(page_url)
    headers = response["headers"]
    # Prefer cb_status and fall back to the legacy pc_status header name.
    status = headers.get("cb_status") or headers.get("pc_status")
    if str(status) == "200":
        return response["body"].decode("utf-8")
    print(f"Request failed: {status}")
    return None

if __name__ == "__main__":
    page_url = "https://example.com/products"
    html = crawl(page_url)
    print(html[:500] if html else "No HTML returned")

Save the script as cloud_pipeline.py, run it with python cloud_pipeline.py, and you should see real page markup printed, confirming the fetch works before you write a single selector. If your target fills content client-side, initialize the client with your JavaScript token and pass {"ajax_wait": "true", "page_wait": 5000} as the second argument to api.get so the API renders the page first. For JS-heavy targets, the scraping JavaScript pages with Python guide covers the details.

Crawlbase Crawling API

That single api.get call above is doing more than a plain request would. The Crawling API renders the page when you pass a JavaScript token, rotates through residential IPs server-side, and handles CAPTCHAs, so you get finished HTML back without running a headless browser fleet or a proxy pool yourself. Point it at a public page on the free tier first, then scale the same code up.

Step 2: Transform the HTML into structured records

Raw HTML is not something you want to store directly in a database. The transform step turns it into a list of dictionaries with consistent field names and types, so every record has the same shape. Load the HTML into BeautifulSoup, walk each item on the page, and pull the fields you care about. The selectors here are illustrative; replace them with the ones that match your target.

python

from bs4 import BeautifulSoup

def text_of(node, selector):
    el = node.select_one(selector)
    return el.get_text(strip=True) if el else None

def to_price(raw):
    if not raw:
        return None
    digits = raw.replace("$", "").replace(",", "").strip()
    # Allow at most one decimal point, so "1.2.3" or "N/A" yields None instead of raising.
    return float(digits) if digits.replace(".", "", 1).isdecimal() else None

def transform(html):
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for card in soup.select("div.product"):
        records.append({
            "name": text_of(card, "h2.title"),
            "price": to_price(text_of(card, "span.price")),
            "sku": text_of(card, "span.sku"),
            "in_stock": text_of(card, "span.stock") == "In stock",
        })
    return [r for r in records if r["name"]]

Two helpers and a final filter keep the records clean. text_of returns the stripped text of an element or None when it is missing, so a gap in one card does not crash the loop. to_price strips the currency symbol and thousands separators and casts to a float, so the database column can be numeric rather than text; anything that is not a plain number becomes None. The in_stock field is a strict comparison against the literal text "In stock", so adjust it to the wording your target uses. The final filter drops rows with no name, which are usually layout artifacts rather than real items. The result is a list of typed records ready to store. For more on shaping scraped data well, see structuring and cleaning web-scraped data.
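Price strings are where the cast most often goes wrong, so it can help to exercise the parsing against awkward inputs before trusting it. The sketch below is a standalone variant, not the guide's function: the name parse_price and the sample strings are illustrative, and it assumes US-style formatting (dollar sign, comma thousands separators, dot decimal).

```python
import math

def parse_price(raw):
    """Return a float for strings like '$1,299.50', or None if unparseable."""
    if not raw:
        return None
    cleaned = raw.replace("$", "").replace(",", "").strip()
    try:
        value = float(cleaned)
    except ValueError:
        # Covers text such as "Call for price" and malformed numbers like "1.2.3".
        return None
    # float() also accepts "nan" and "inf"; neither is a usable price.
    return value if math.isfinite(value) else None

if __name__ == "__main__":
    for sample in ["$1,299.50", "39.5", "Call for price", "1.2.3", "", None]:
        print(repr(sample), "->", parse_price(sample))
```

Running a handful of real strings from your target through a check like this shows quickly whether the site uses a different currency symbol or decimal convention than the parser expects.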

Step 3: Upload to S3-style object storage

The first cloud destination is an object store. Object storage is the natural home for raw or archival data: it is cheap, durable, and indifferent to the shape of what you put in it. We write the records as JSON Lines (one JSON object per line), which is easy to append to and to stream back later. Credentials come from environment variables so nothing sensitive lands in the source.

python

import os
import json
import boto3

def upload_to_s3(records, key):
    s3 = boto3.client(
        "s3",
        endpoint_url=os.environ.get("S3_ENDPOINT_URL"),
        aws_access_key_id=os.environ["S3_ACCESS_KEY"],
        aws_secret_access_key=os.environ["S3_SECRET_KEY"],
    )
    # End every line, including the last, with a newline so the file stays appendable.
    body = "\n".join(json.dumps(r) for r in records) + "\n"
    s3.put_object(
        Bucket=os.environ["S3_BUCKET"],
        Key=key,
        Body=body.encode("utf-8"),
        ContentType="application/x-ndjson",
    )
    print(f"Uploaded {len(records)} records to s3://{os.environ['S3_BUCKET']}/{key}")

The endpoint_url parameter is what makes this S3-style rather than AWS-only: leave it unset for AWS S3, or point it at any S3-compatible provider (for example a self-hosted MinIO instance or another cloud's object store). Set the environment variables before running: S3_ACCESS_KEY, S3_SECRET_KEY, and S3_BUCKET are required (for example export S3_BUCKET=my-scrape-archive), and S3_ENDPOINT_URL is optional. The Key is the object path inside the bucket; a date-stamped key like scrapes/2026-06-11/products.jsonl keeps each day's run separate and easy to find.
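The JSON Lines format and the dated key can both be checked locally, without a bucket. The following is a minimal sketch using only the standard library; the helper names to_jsonl, from_jsonl, and dated_key are illustrative and do not appear in the pipeline code.

```python
import json
from datetime import date

def to_jsonl(records):
    """Serialize records as JSON Lines: one object per line, newline-terminated."""
    return "".join(json.dumps(r) + "\n" for r in records)

def from_jsonl(text):
    """Parse JSON Lines back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]

def dated_key(day, name="products.jsonl", prefix="scrapes"):
    """Build an object key such as scrapes/2026-06-11/products.jsonl."""
    return f"{prefix}/{day.isoformat()}/{name}"

if __name__ == "__main__":
    sample = [
        {"name": "USB-C Hub", "price": 39.5, "sku": "HUB-203", "in_stock": False},
        {"name": "Wireless Mouse", "price": 24.0, "sku": "MSE-088", "in_stock": True},
    ]
    text = to_jsonl(sample)
    print(text, end="")
    print(from_jsonl(text) == sample)
    print(dated_key(date.today()))
```

Because every record sits on its own line, a later job can stream the file line by line instead of loading the whole archive into memory.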

Keep credentials out of code

Never hardcode access keys or connection strings in a script you commit. Read them from environment variables or a secrets manager, as the code here does. A key checked into version control is a key you have to rotate.
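One way to make missing credentials fail fast, rather than partway through a run, is to check the environment up front. This is a sketch under the assumption that you use the variable names from this guide (S3_ACCESS_KEY, S3_SECRET_KEY, S3_BUCKET, DATABASE_URL); the helper names missing_env and require_env are hypothetical.

```python
import os

# Variable names used by the storage functions in this guide.
REQUIRED = ("S3_ACCESS_KEY", "S3_SECRET_KEY", "S3_BUCKET", "DATABASE_URL")

def missing_env(names=REQUIRED, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

def require_env(names=REQUIRED, env=None):
    """Raise a single clear error listing every missing variable."""
    missing = missing_env(names, env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))

if __name__ == "__main__":
    # Report only; values are never printed, so nothing sensitive reaches the logs.
    print("Missing:", missing_env() or "none")
```

Calling require_env() at the top of the script means a forgotten export surfaces before any request is spent or any partial upload happens.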

Step 4: Insert into a managed database

The second destination is a managed Postgres database, which is where the structured copy lives for querying. The function below opens a connection from a single environment variable, ensures the target table exists, and inserts the records. Using parameterized queries (the %s placeholders) keeps the values properly escaped instead of being concatenated into the SQL.

python

import os
import psycopg2
from psycopg2.extras import execute_values

CREATE = """
CREATE TABLE IF NOT EXISTS products (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    price NUMERIC,
    sku TEXT,
    in_stock BOOLEAN,
    scraped_at TIMESTAMPTZ DEFAULT now()
)
"""

def save_to_db(records):
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    try:
        # "with conn" commits on success and rolls back on error; it does not close the connection.
        with conn, conn.cursor() as cur:
            cur.execute(CREATE)
            rows = [(r["name"], r["price"], r["sku"], r["in_stock"]) for r in records]
            execute_values(
                cur,
                "INSERT INTO products (name, price, sku, in_stock) VALUES %s",
                rows,
            )
    finally:
        # Close the connection even if the insert fails.
        conn.close()
    print(f"Inserted {len(records)} rows into products")

Set DATABASE_URL to your managed Postgres connection string, for example postgresql://user:pass@host:5432/dbname, and keep it in the environment rather than the file. The CREATE TABLE IF NOT EXISTS makes the function safe to run repeatedly, with each run appending new rows. The scraped_at column timestamps each load so you can track changes over time, and execute_values sends the rows as multi-row INSERT statements (100 rows per statement by default, adjustable with page_size) instead of one query per row. Once the rows are in, you can filter and aggregate with plain SQL, then pull them into pandas for analysis.
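If you want to dry-run the table shape and the batch insert before pointing at a managed Postgres instance, the standard library's sqlite3 module can stand in locally. This is a swapped-in technique, not the guide's Postgres path: SQLite uses ? placeholders and different column types, and the names CREATE_SQLITE, save_to_sqlite, and in_stock_by_price are illustrative.

```python
import sqlite3

# SQLite approximation of the products table; types differ from the Postgres schema.
CREATE_SQLITE = """
CREATE TABLE IF NOT EXISTS products (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    price REAL,
    sku TEXT,
    in_stock INTEGER,
    scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
)
"""

def save_to_sqlite(conn, records):
    """Insert records with a parameterized batch insert; return the row count."""
    rows = [(r["name"], r["price"], r["sku"], r["in_stock"]) for r in records]
    with conn:  # commits on success, rolls back on error
        conn.execute(CREATE_SQLITE)
        conn.executemany(
            "INSERT INTO products (name, price, sku, in_stock) VALUES (?, ?, ?, ?)",
            rows,
        )
    return len(rows)

def in_stock_by_price(conn):
    """Return (name, price) for in-stock rows, cheapest first."""
    cur = conn.execute(
        "SELECT name, price FROM products WHERE in_stock = 1 ORDER BY price"
    )
    return cur.fetchall()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    sample = [
        {"name": "Aluminium Tripod", "price": 129.99, "sku": "TRP-014", "in_stock": True},
        {"name": "USB-C Hub", "price": 39.5, "sku": "HUB-203", "in_stock": False},
    ]
    print(save_to_sqlite(conn, sample), "rows inserted")
    print(in_stock_by_price(conn))
    conn.close()
```

The in-memory database disappears when the connection closes, which makes this suitable for checking record shapes in a test, not for storage.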

Step 5: Assemble the full pipeline

Now wire the steps into one runnable script: scrape, transform, then send the records to both destinations. Keep whichever storage call fits your workflow; both are shown here.

python

import os
import json
from datetime import date
from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup

# crawl, transform, upload_to_s3 and save_to_db are defined above

def main():
    page_url = "https://example.com/products"
    html = crawl(page_url)
    if not html:
        print("Nothing scraped, stopping.")
        return

    records = transform(html)
    print(f"Parsed {len(records)} records")
    if not records:
        return

    key = f"scrapes/{date.today().isoformat()}/products.jsonl"
    upload_to_s3(records, key)
    save_to_db(records)

if __name__ == "__main__":
    main()

The flow is linear and easy to reason about: fetch the page, bail early if the scrape failed, transform the HTML into records, bail again if there is nothing to store, then archive the records to the bucket and load them into the database. The date-stamped object key keeps each run's archive separate, while the database accumulates every load with a timestamp. Run it with python cloud_pipeline.py once your environment variables are set.
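In the linear flow, an exception in the first storage call stops the second from running. If you would rather attempt every destination and report failures at the end, one option is a small dispatcher like the sketch below. The name run_sinks is hypothetical, and the usage comment assumes the upload_to_s3 and save_to_db functions from the earlier steps.

```python
def run_sinks(records, sinks):
    """Call every sink with the records; return {name: exception} for those that failed."""
    failures = {}
    for name, sink in sinks.items():
        try:
            sink(records)
        except Exception as exc:
            # Deliberately broad: one failing destination should not block the others.
            failures[name] = exc
    return failures

# Intended use inside main(), assuming the functions defined in Steps 3 and 4:
#
#     failures = run_sinks(records, {
#         "s3": lambda recs: upload_to_s3(recs, key),
#         "db": save_to_db,
#     })
#     for name, exc in failures.items():
#         print(f"{name} failed: {exc}")

if __name__ == "__main__":
    def broken(recs):
        raise ValueError("simulated outage")

    print(run_sinks([{"name": "demo"}], {"ok": lambda recs: None, "broken": broken}))
```

Whether this is the right trade-off depends on your pipeline: if the database load must never run without a matching archive, the original fail-fast order is the safer choice.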

What the output looks like

The object-storage path writes a JSON Lines file, one record per line, which is what lands in the bucket:

json

{"name": "Aluminium Tripod", "price": 129.99, "sku": "TRP-014", "in_stock": true}
{"name": "USB-C Hub", "price": 39.5, "sku": "HUB-203", "in_stock": false}
{"name": "Wireless Mouse", "price": 24.0, "sku": "MSE-088", "in_stock": true}

The database path stores the same records as typed columns, so a quick query confirms the load and shows the shape you can analyze:

sql

SELECT name, price, in_stock FROM products WHERE in_stock = true ORDER BY price;

--        name       | price  | in_stock
-- ------------------+--------+----------
--  Wireless Mouse   |  24.00 | t
--  Aluminium Tripod | 129.99 | t

With both copies in place you have a cheap, durable archive of the raw records and a queryable structured table, written by the same run.

A one-parameter shortcut: Crawlbase Cloud Storage

If your goal is simply to keep a server-side copy of each crawl response without standing up your own bucket or database first, Crawlbase Cloud Storage offers a one-parameter path. Add &store=true to a Crawling API request and a copy of the response is saved on the cloud automatically, where you can search it, retrieve it, or delete it later through the API or your dashboard.

python

from crawlbase import CrawlingAPI

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

response = api.get("https://example.com/products", {"store": "true"})
# when the response was stored, the headers should carry a storage URL for later retrieval; .get returns None if it is absent
print(response["headers"].get("storage_url"))

Each saved request gets a unique identifier (an RID) you can use to view or delete it. This path is the quickest way to retain raw responses, and it pairs well with the async Crawler when you are running many requests and want the storage handled server-side. For larger structured datasets you will still want your own database, but for raw response archival the store parameter is hard to beat on simplicity.
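To make "add &store=true" concrete at the HTTP level, the sketch below builds the request URL by hand with the standard library. It makes no network call. The base endpoint shown is an assumption to confirm against the Crawlbase documentation, and build_request_url is a hypothetical helper; in practice the crawlbase client builds this URL for you.

```python
from urllib.parse import urlencode

# Assumed Crawling API endpoint; verify against the Crawlbase docs before relying on it.
API_BASE = "https://api.crawlbase.com/"

def build_request_url(token, page_url, store=False):
    """Build a Crawling API request URL, optionally with the store parameter."""
    params = {"token": token, "url": page_url}
    if store:
        params["store"] = "true"
    # urlencode percent-encodes the target URL so it survives as a single parameter.
    return API_BASE + "?" + urlencode(params)

if __name__ == "__main__":
    print(build_request_url("YOUR_CRAWLBASE_TOKEN", "https://example.com/products", store=True))
```

The only difference between a stored and an unstored request is the trailing store=true parameter, which is why the client call needs just one extra option.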

Scaling the pipeline

The flow above scrapes one page. Turning it into a recurring job is mostly about pacing and resilience. A few habits keep a larger run healthy:

Batch your writes. Accumulate records and upload or insert them in batches rather than one row at a time. execute_values already batches the database inserts; do the same for object uploads by writing one file per run rather than per record.
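Date-stamp your keys. Use a dated object key like scrapes/2026-06-11/products.jsonl so each day's run is isolated. A second run on the same date replaces that day's object, so add a time component to the key if you run more often than daily. The database's scraped_at column plays the same role on the query side.
Run on a schedule. Wrap the script in a cron job or a scheduled task so the cloud copy stays current. Repeat runs are safe: CREATE TABLE IF NOT EXISTS does nothing when the table already exists, and the dated key separates one day's archive from the next. Each run appends its rows to the table, so expect one copy of each item per run, distinguished by scraped_at.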
Offload the fetch at scale. For many pages, the async Crawler queues requests and delivers results to a webhook, which suits high volume without holding open connections.
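For resilience on a recurring job, a retry with exponential backoff around the fetch is a common pattern. The sketch below assumes a fetch function that returns None on failure, as crawl does in Step 1; with_retries, its parameters, and the delay values are illustrative choices, not Crawlbase recommendations.

```python
import time

def with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-None value or the attempts run out.

    Waits base_delay, then 2x, then 4x, and so on between attempts.
    The sleep argument is injectable so the helper can be tested without waiting.
    """
    for attempt in range(attempts):
        result = fetch()
        if result is not None:
            return result
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return None

# Intended use with the crawl function from Step 1:
#
#     html = with_retries(lambda: crawl(page_url))

if __name__ == "__main__":
    outcomes = iter([None, "<html>ok</html>"])
    print(with_retries(lambda: next(outcomes), base_delay=0.01))
```

Doubling the wait between attempts gives a struggling target time to recover and keeps your own request rate modest when something is going wrong.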

Scraping responsibly

Collect responsibly. Scrape only public data, respect each site's terms of service and its robots.txt, and keep your request rate reasonable so you are not straining the servers you depend on. When the data you collect includes anything tied to identifiable individuals, privacy laws such as GDPR and CCPA apply, so avoid personal data unless you have a lawful basis and a clear purpose for holding it. Storing data in the cloud does not change any of this: the same care you take collecting it carries over to how you retain it, and keeping personal data longer than you need only adds risk.
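Checking robots.txt can be automated with Python's standard urllib.robotparser module. The sketch below parses a robots.txt body passed in as a string, so it runs without network access; the sample rules, the user agent name my-scraper, and the helper allowed_by_robots are all illustrative.

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Made-up rules for demonstration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for url in ("https://example.com/products", "https://example.com/private/data"):
        print(url, "->", allowed_by_robots(SAMPLE_ROBOTS, "my-scraper", url))
```

Against a live site you would load the real file with the parser's set_url and read methods and run the check before each new path you plan to request. A passing check covers robots.txt only; the site's terms of service still apply.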

Recap

Key takeaways

Storage is part of the pipeline. Treat scraped records as a durable asset from the moment they land; a local CSV is fine for a one-off but not for a recurring job.
Transform before you store. Normalize the raw HTML into typed records with consistent field names so the database column can be numeric and the archive stays consistent.
Use the right store for the job. Object storage (S3-style buckets) is cheap and durable for raw or archival data; a managed database is for the structured, queryable copy, and writing to both is a common pattern.
Keep credentials out of code. Read access keys and connection strings from environment variables or a secrets manager, never hardcoded in a committed script.
The store parameter is the shortcut. Adding &store=true saves a raw crawl response on Crawlbase Cloud Storage in one parameter, which is the fastest way to retain responses without standing up your own infrastructure first.

Frequently Asked Questions (FAQs)

Should I store scraped data in object storage or a database?

It depends on what you do with it. Object storage (S3-compatible buckets) is cheap, durable, and ideal for raw or archival data of any shape, so it is the right home for the untouched scrape output. A managed relational database is for the structured copy you query, filter, and join with SQL. Many pipelines write to both: archive the raw records in a bucket and load the cleaned records into a database.

How do I keep my cloud credentials out of the code?

Read them from environment variables or a secrets manager rather than hardcoding them. The code in this guide pulls the S3 keys and the Postgres connection string from os.environ, so nothing sensitive lives in the committed file. A key checked into version control is a key you have to rotate, so keep them in the environment.

What is the difference between Crawlbase Cloud Storage and uploading to my own bucket?

Crawlbase Cloud Storage is a one-parameter path: add &store=true to a Crawling API request and the raw response is saved server-side, retrievable by an RID, with no infrastructure of your own to set up. Uploading to your own bucket or database gives you full control over format, schema, retention, and location, which you want for structured datasets. The two are complementary: the store parameter for quick raw-response archival, your own stores for the processed data.

Will the S3 code work with providers other than AWS?

Yes. The boto3 client takes an endpoint_url parameter; leave it unset for AWS S3, or point it at any S3-compatible provider such as a self-hosted MinIO instance or another cloud's object store. The rest of the code is unchanged, which is why the example reads the endpoint from an environment variable.

How do I run this on a schedule so the cloud copy stays current?

Wrap the script in a cron job or a scheduled task that runs at whatever cadence your data changes. The pipeline is safe to repeat: the database table uses CREATE TABLE IF NOT EXISTS, the object key is date-stamped so runs on different days never overwrite each other, and each database row carries a scraped_at timestamp so you can track changes over time. If you run more than once a day, add a time component to the key, because a second run on the same date replaces that day's object. For many pages, hand the fetch off to the async Crawler so the job is not bottlenecked on one connection.

Is it safe to store scraped personal data in the cloud?

Treat that as a legal and privacy question first. Avoid collecting data tied to identifiable individuals unless you have a lawful basis and a clear purpose, since privacy laws like GDPR and CCPA apply regardless of where the data is stored. If you do hold personal data, store only what you need, retain it no longer than necessary, and secure access to it. Keeping personal data around longer than required only adds risk without adding value.

Bilal Ahmed

Software Engineer · Crawlbase

Software engineer who wrote some of the most-read pieces on the Crawlbase blog, covering web scraping, proxies, and data tooling.