Scraping a site gives you rows of structured web data, but raw rows are not insight. The moment you want to ask questions across that data, which products dropped in price, which categories carry the most reviews, what the median rating looks like by brand, you want SQL. GoogleSQL is the SQL dialect behind Google Cloud SQL and BigQuery, and it pairs cleanly with a Crawlbase scrape: collect the data once, load it into a managed table, then query it with plain SQL instead of writing throwaway parsing scripts every time.
This guide is a runnable walkthrough of that pipeline. You fetch structured product data through the Crawling API, save it as CSV, load it into a Google Cloud SQL table, and then write GoogleSQL queries against it: filters, aggregates, ranking, and a few patterns you will reuse on your own sources. The scraping example uses public Amazon search results, but the SQL half applies to any structured dataset you collect.
What you will build
A small data pipeline that ends in a queryable SQL table. The running example pulls Amazon search results for a product, but every step downstream of the scrape is source-agnostic. You produce:
- A CSV of scraped records one row per product, with name, price, currency, rating, and review count.
- A Cloud SQL table a typed PostgreSQL table that holds the imported rows.
- A query set reusable GoogleSQL queries for filtering, aggregating, ranking, and summarizing the data.
- A repeatable loop a pattern you can point at any other scraped source by swapping the scraper and the table schema.
What GoogleSQL is, in plain terms
GoogleSQL is the name for the SQL language flavor used across Google Cloud's managed database services. In Cloud SQL it is the standard SQL you write against a managed MySQL, PostgreSQL, or SQL Server instance, where Google runs the hardware and most of the operational setup for you. In BigQuery it is the dialect you use to slice very large datasets. Either way, the idea is the same: you do not manage your own servers, you connect to a managed engine, write queries, and let Google handle the backend.
For this tutorial we use Cloud SQL with PostgreSQL because it is the most direct path from a CSV of scraped data to a table you can query. The GoogleSQL you write here is ordinary SQL: SELECT, WHERE, GROUP BY, ORDER BY, and window functions. The value is not an exotic syntax, it is having your scraped web data sitting in a managed table where these queries run in milliseconds.
Prerequisites
A few things need to be in place before you write any code. None take long.
A Crawlbase account and token. Sign up, open your dashboard, and copy your normal requests token from the account docs page. Crawlbase includes 1,000 free requests to start, which is plenty for this guide. Treat the token like a password and keep it out of version control.
Python 3 and the requests library. Confirm your version with python --version, then install the one dependency the scraper needs. Both json and csv ship with the standard library.
A Google Cloud account. You need a project with billing enabled to create a Cloud SQL instance. Google offers free credits for new accounts, so you can work through this without committing to a paid plan up front. Basic comfort with SQL helps but is not required, the queries here are explained line by line.
Set up the project
Create a working directory and install the one library the scraper uses.
python --version mkdir googlesql-crawlbase && cd googlesql-crawlbase pip install requests
That is the whole local setup. The scrape step runs in Python, and everything after it happens inside Google Cloud SQL, where you write GoogleSQL directly in the console or through any SQL client you prefer.
Step 1: Scrape structured data to CSV
The first job is to collect data worth querying. The script below calls the Crawlbase Crawling API with the amazon-serp scraper, which returns parsed JSON for an Amazon search results page, then flattens the products into a CSV. Save it as crawl_data.py and replace the token placeholder with your own.
import csv import json import requests API_TOKEN = "YOUR_CRAWLBASE_TOKEN" API_ENDPOINT = "https://api.crawlbase.com/" def crawl_data(search_url): params = { "token": API_TOKEN, "url": search_url, "scraper": "amazon-serp", } response = requests.get(API_ENDPOINT, params=params) response.raise_for_status() return response.text FIELDNAMES = ["name", "price", "currency", "rating", "reviews"] def save_to_csv(payload, filename="data.csv"): products = payload.get("body", {}).get("products", []) if not products: print("No products found") return with open(filename, "w", newline="", encoding="utf-8") as f: writer = csv.DictWriter(f, fieldnames=FIELDNAMES) writer.writeheader() for p in products: writer.writerow({ "name": p.get("name", ""), "price": p.get("rawPrice", ""), "currency": p.get("currency", ""), "rating": p.get("customerReview", ""), "reviews": p.get("customerReviewCount", ""), }) print(f"Saved {len(products)} products to {filename}") if __name__ == "__main__": raw = crawl_data("https://www.amazon.com/s?k=wireless+headphones") save_to_csv(json.loads(raw), "data.csv")
Run it with python crawl_data.py. The Crawling API does the parsing for you, so instead of writing selectors you receive a clean products array and pick the fields you care about. The script keeps five: name, price (taken from the numeric rawPrice so it imports as a number), currency, rating, and reviews. The result is a data.csv ready to load into Cloud SQL.
The amazon-serp scraper above returns parsed JSON instead of raw HTML, so you skip writing and maintaining CSS selectors entirely. The Crawling API renders the page, rotates residential IPs, handles CAPTCHAs server-side, and hands back structured fields, which is exactly the clean input a SQL table wants. Point it at a public search page on the free tier and you have query-ready data in one call.
Step 2: Create a Cloud SQL instance and database
Before you can query the data, it needs somewhere to live. In the Google Cloud console, set up a managed PostgreSQL instance and a database inside it.
- Open the Google Cloud Console and create or select a project with billing enabled.
- Enable the Cloud SQL Admin API from the APIs and Services screen.
- Go to the Cloud SQL Instances page and click Create instance, then choose PostgreSQL.
- Set an instance ID, a region, and a password for the default user, then create the instance.
- Open the new instance, go to the Databases tab, and create a database named
analytics_example_db.
Google provisions and manages the engine for you. Once the instance is running and the database exists, you can open Cloud SQL Studio, the built-in query editor, and run GoogleSQL against it directly.
Step 3: Define the table
Open Cloud SQL Studio, log in to analytics_example_db, start a new query tab, and create a table that matches the CSV columns. Types matter here: price and reviews are declared as numeric so you can sum, average, and compare them later, while text fields stay as TEXT or VARCHAR.
CREATE TABLE "public"."amazon_products" ( "name" TEXT, "price" DECIMAL(10, 2), "currency" VARCHAR(10), "rating" DECIMAL(3, 2), "reviews" INTEGER );
This creates an empty amazon_products table with five typed columns. The numeric types are the point of the whole exercise: a price stored as DECIMAL can be averaged and ranked, while a price stored as text can only be displayed. Getting the schema right here is what makes the queries in Step 5 possible.
Step 4: Import the CSV into the table
With the table ready, load the scraped rows. Cloud SQL imports CSV files from a Cloud Storage bucket.
- Go to the Cloud SQL Instances page and open your instance.
- Choose Import, and select CSV as the file format.
- Upload
data.csvto a Cloud Storage bucket and select it as the source file. - Set
analytics_example_dbas the destination database andamazon_productsas the table. - Click Import.
The script writes a header row to the CSV. When you import, tell Cloud SQL to skip the first row so the column names are not loaded as a data row. If your client does not offer that toggle, drop the header line from the file before uploading.
Once the import finishes, the rows are queryable. Everything from here is GoogleSQL.
Step 5: Query the data with GoogleSQL
This is where the structured data pays off. Each query below runs in Cloud SQL Studio or any PostgreSQL client connected to the instance. Start with the basics and build up.
Read everything. The simplest query confirms the import worked and shows you the shape of the data.
SELECT * FROM amazon_products LIMIT 10;
Filter and sort. Find the well-reviewed products under a price ceiling, cheapest first. This is the kind of question that is tedious in a spreadsheet and trivial in SQL.
SELECT name, price, rating, reviews FROM amazon_products WHERE price < 100 AND reviews > 500 ORDER BY price ASC;
Aggregate. Summarize the whole result set in one row: how many products, what they cost on average, and the price spread.
SELECT COUNT(*) AS total_products, ROUND(AVG(price), 2) AS avg_price, MIN(price) AS cheapest, MAX(price) AS most_expensive FROM amazon_products;
Group and rank. When your data has a category column, GROUP BY turns a flat table into a comparison. Here the example groups by currency, but the same shape works for brand, category, or any column you scrape.
SELECT currency, COUNT(*) AS listings, ROUND(AVG(rating), 2) AS avg_rating FROM amazon_products GROUP BY currency ORDER BY listings DESC;
Window function. Rank products by review count without collapsing the rows, so you keep every product alongside its standing in the set. RANK() is standard GoogleSQL and works the same in Cloud SQL and BigQuery.
SELECT name, reviews, RANK() OVER (ORDER BY reviews DESC) AS popularity_rank FROM amazon_products ORDER BY popularity_rank LIMIT 10;
What the output looks like
The ranking query returns a row per product with its place in the order, so the most-reviewed items rise to the top while every product keeps its own line.
name,reviews,popularity_rank Wireless Over-Ear Headphones,12840,1 Noise Cancelling Earbuds,9325,2 Bluetooth Sport Headphones,7710,3 Studio Monitor Headphones,4188,4 Budget Wireless Earbuds,3902,5
The aggregate query collapses to a single summary row instead, the count, average price, and price range across the whole table. Because the data is typed and indexed in a managed engine, these queries return immediately and stay just as fast as the table grows from a few rows to a few hundred thousand.
Adapt this to your own sources
Nothing in the SQL half is specific to Amazon. To run the same pipeline on a different source, three things change and the rest stays put.
- Swap the scraper. Point the Scraper API at a different supported target, or fetch raw HTML through the Crawling API and parse it yourself. The output you want is still a flat CSV with consistent columns.
-
Match the schema. Edit the
CREATE TABLEstatement so the column names and types line up with your new fields. Keep anything you will aggregate or compare as a numeric type. - Reuse the queries. The filter, aggregate, group, and rank patterns above work on any table. Change the column names and they fit a new dataset unchanged.
For larger jobs, collect data with the async Crawler, which queues requests and delivers results to a webhook, then batch the imports into Cloud SQL on a schedule. If you want a deeper treatment of the load-and-analyze half, the guide on storing and analyzing scraped data in SQL covers schema design and indexing, and building a scalable web data pipeline covers running this on a recurring basis.
Scraping responsibly
The SQL is yours to run freely, but the scrape that feeds it should follow a few rules. Collect only public data, the product names, prices, ratings, and review counts that any visitor can see without logging in, and stay off anything behind authentication. Respect each source's terms of service and its robots.txt, and keep your request rate reasonable so you are not straining the site. When a source publishes an official API, prefer it for bulk or commercial use. And if your dataset ever includes personal data, treat it under the relevant privacy rules such as GDPR or CCPA: aggregate it, do not profile individuals, and store only what you need.
Key takeaways
-
GoogleSQL is ordinary SQL on a managed engine. It is the dialect Cloud SQL and BigQuery use, so the
SELECT,GROUP BY, and window queries you already know all apply, with Google running the backend. - Scrape to a clean CSV first. The Crawling API returns parsed JSON, so you pick fields by name and write a flat CSV instead of maintaining selectors.
-
Types make or break the queries. Declaring
priceandreviewsas numeric in theCREATE TABLEstep is what lets you average, compare, and rank them later. - The query patterns are reusable. Filter, aggregate, group, and rank queries carry over to any scraped dataset by changing only the column names.
- Adapt by swapping three things. Change the scraper, match the table schema, and reuse the queries to point the whole pipeline at a new source.
Frequently Asked Questions (FAQs)
What is the difference between GoogleSQL in Cloud SQL and in BigQuery?
GoogleSQL is the shared SQL dialect, but the engine underneath differs. Cloud SQL runs managed instances of MySQL, PostgreSQL, or SQL Server and suits transactional, moderate-sized datasets like a scraped product table. BigQuery is a serverless warehouse built for very large analytical queries. The query syntax in this guide, including the RANK() window function, works in both. Start with Cloud SQL for a project this size and move to BigQuery only when your data outgrows a single instance.
Do I have to pay to use Google Cloud SQL?
You need billing enabled on your Google Cloud project to create a Cloud SQL instance, but new accounts receive free credits, so you can complete this walkthrough without paying out of pocket. Stop or delete the instance when you are done to avoid ongoing charges. The Crawlbase side includes 1,000 free requests, which is more than enough for the scrape in this tutorial.
Can I use this method for sources other than Amazon?
Yes. The pipeline is source-agnostic after the scrape. Point the Scraper API at another supported target, or fetch and parse a page through the Crawling API, then produce a CSV with consistent columns. Update the CREATE TABLE schema to match your new fields and the same import and query steps apply unchanged.
What happens if my data format changes later?
If a source adds fields or you decide to capture more, you do not start over. Use an ALTER TABLE statement to add columns to the existing table, then import the new rows. Your existing data stays intact, and any query that does not reference the new columns keeps working exactly as before.
Why scrape to CSV instead of inserting rows directly?
A CSV is the simplest interchange format and Cloud SQL imports it natively, which keeps the scrape and the load decoupled. You can inspect the file, re-import it, or load it into a different system without re-running the scrape. For high-volume or continuous pipelines you would insert rows programmatically instead, but for a periodic batch the CSV-then-import path is the least moving parts.
Do these queries work in standard SQL clients?
Yes. Once the Cloud SQL instance is running, you can connect any PostgreSQL-compatible client, the built-in Cloud SQL Studio, psql, or a desktop tool, and run the same GoogleSQL queries. The queries in this guide use standard SQL features, so they are portable across PostgreSQL clients rather than tied to one editor.
Crawl any site at scale, without fighting infrastructure.
Crawlbase handles proxies, fingerprints, and CAPTCHAs so your team ships data pipelines instead of maintaining crawl plumbing. 1,000 requests free, no card required.

