Gemini AI Web Scraping in Python

Web scraping has always been two jobs glued together: get the page, then pull the fields you want out of it. The second job is where most scrapers rot. You write CSS selectors or XPath against a layout, the site ships a redesign, and your extraction silently returns empty strings. Large language models change the economics here. Instead of describing where a value lives in the DOM, you describe what you want in plain English and let the model read the content the way a person would.

This guide shows you how to do Gemini AI web scraping in Python the reliable way: use the Crawling API to fetch and render the target page into clean HTML or markdown, then hand that content to Google Gemini to extract structured JSON. The division of labor matters and it is the whole point of this article. Crawlbase does the fetching and rendering behind a real browser and a trusted IP; Gemini does the reading and structuring. Each tool does the part it is actually good at.

Why pair Gemini with a fetching layer at all

Gemini is a large language model from Google. It understands natural language, reads messy content, and returns structured data when you ask for it. What it does not do is fetch web pages. It has no HTTP client, no browser, no proxy pool, and no way to get past the anti-bot defenses that guard most commercial sites. Feed it a URL and it cannot open it; feed it raw HTML you scraped yourself and it will happily extract from whatever you managed to retrieve, including an empty shell.

That is the gap the fetching layer fills. Modern sites render content client-side and challenge automated traffic aggressively, so a plain requests.get often returns a 200 with none of the data you came for. You need a browser that actually runs the page's JavaScript and an IP the site reads as a real visitor. You can assemble that yourself with a headless browser plus rotating residential proxies, but keeping that stack healthy is most of the work. The Crawling API folds both into a single call: send it a URL with a JavaScript token, it renders the page and returns finished HTML, ready for Gemini.

Who does what

Keep the boundary clear in your head. Crawlbase fetches and renders the page into clean HTML or markdown. Gemini extracts structured fields from that content. Gemini never touches the network in this design, and Crawlbase never tries to understand the data. Mixing those responsibilities up is the most common reason these pipelines feel flaky.

What you will build

A small, runnable Python script that takes a product URL, retrieves the rendered page through the Crawling API as clean markdown, sends that markdown to Gemini with an extraction prompt, and writes the structured result to a JSON file. We will use a public test page so you can run every snippet as-is before pointing it at a real target.

Set up the environment

You need Python 3.9 or later, the minimum the google-generativeai package supports. Confirm your version, create a virtual environment so project dependencies stay isolated, then install the libraries.

bash

python --version

python -m venv gemini_env
source gemini_env/bin/activate

pip install google-generativeai crawlbase python-dotenv

On Windows, activate the environment with gemini_env\Scripts\activate instead of the source line. Three dependencies do the work: crawlbase is the official client for the Crawling API, google-generativeai is Google's Gemini client, and python-dotenv loads your keys from a local file so they never end up hard-coded in the script.

You need two credentials. Get a Gemini API key from Google AI Studio, and get a Crawlbase JavaScript (JS) token from your Crawlbase dashboard after signing up. Store both in a .env file in your project folder.

bash

GEMINI_API_KEY=your_gemini_key_here
CRAWLBASE_JS_TOKEN=your_crawlbase_js_token_here

Why the JS token

Crawlbase offers two token types. The normal token fetches static HTML; the JavaScript (JS) token renders the page in a real browser first. Most pages worth scraping load their content client-side, so the JS token is the safe default here. Using the normal token on a client-rendered page returns the same empty shell a plain fetch would, and Gemini cannot extract data that was never there.

Step 1: Fetch the rendered page with the Crawling API

The Crawling API can return the page already converted to markdown, which is exactly what you want before sending it to an LLM. Markdown strips the navigation, scripts, and styling noise, leaving the readable content. That cuts the token count you send to Gemini, which makes the call cheaper and the extraction more accurate. Pass format: 'markdown' and the API hands you clean text instead of raw HTML.

python

import os
from dotenv import load_dotenv
from crawlbase import CrawlingAPI

load_dotenv()

api = CrawlingAPI({"token": os.environ["CRAWLBASE_JS_TOKEN"]})

url = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"

def fetch_markdown(target_url):
    options = {"format": "markdown", "ajax_wait": "true", "page_wait": 3000}
    response = api.get(target_url, options)
    return response["body"].decode("utf-8")

page_markdown = fetch_markdown(url)
print(page_markdown[:500])

The two wait options matter for client-rendered targets. ajax_wait tells the API to wait for asynchronous content to finish loading, and page_wait holds for a fixed number of milliseconds after load so late-rendering elements appear before the page is captured. Three seconds is a reasonable start; raise it if content comes back thin. For a static page like the test book above you could even use the normal token, but keeping the JS token and these options means the same code works when you point it at a harder, client-rendered site.
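The snippet above assumes every fetch succeeds. Here is a minimal sketch of a more defensive wrapper, assuming the response is a dict with a status_code key alongside body (check this against the client version you have installed). The names fetch_with_retry, attempts, and base_delay are hypothetical, and the fetch callable is injected so you can try the logic without touching the network.

```python
import time


def fetch_with_retry(fetch, target_url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(target_url) until it returns a 200 with a non-empty body.

    `fetch` is any callable returning a dict with "status_code" and "body"
    (bytes). That shape is an assumption about api.get(url, options).
    """
    last_status = None
    for attempt in range(attempts):
        response = fetch(target_url)
        last_status = response.get("status_code")
        body = response.get("body") or b""
        if last_status == 200 and body.strip():
            return body.decode("utf-8")
        # Exponential backoff between attempts: base_delay, then 2x, 4x, ...
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"Fetch failed for {target_url} (last status: {last_status})")
```

With the objects from the snippet above, you would call it as fetch_with_retry(lambda u: api.get(u, options), url). Raising after the last attempt keeps an empty page from reaching Gemini, where it would produce a confident but empty extraction.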

Crawlbase Crawling API

Gemini reads pages, it does not fetch them. The Crawling API closes that gap in one call: pass a JS token, it renders the page in a real browser, rotates through residential IPs server-side, and returns clean HTML or LLM-ready markdown, so you skip running a headless fleet and a proxy pool yourself. Point it at a public page on the free tier first.

Step 2: Send the content to Gemini and ask for JSON

Now the interesting part. With clean markdown in hand, you describe the fields you want in a prompt and let Gemini do the extraction. The key trick for a reliable pipeline is forcing JSON output. Gemini's client supports a response MIME type, so set it to application/json and the model returns parseable JSON instead of prose with code fences around it. That single setting removes most of the brittleness people complain about with LLM extraction.

python

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

model = genai.GenerativeModel("gemini-2.0-flash")

def extract_fields(content):
    prompt = f"""You are a data extraction tool. From the page content below,
extract the book title, price, availability, and star rating.
Return only JSON with keys: title, price, availability, rating.

CONTENT:
{content}
"""
    response = model.generate_content(
        prompt,
        generation_config={"response_mime_type": "application/json"},
    )
    return response.text

raw_json = extract_fields(page_markdown)
print(raw_json)

A few things make this prompt work. It states the role ("data extraction tool") so Gemini stays terse, it names the exact keys you want so the schema is stable across runs, and it passes the markdown rather than raw HTML so the model spends its attention on content, not boilerplate. If you need a richer schema, list more keys and describe any that are ambiguous; the model handles nested objects and arrays without extra ceremony.
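If the schema grows, it can help to build the prompt from a mapping instead of editing the string by hand. The sketch below is one way to do that; build_prompt and BOOK_FIELDS are hypothetical names, and the field descriptions are illustrative rather than required wording.

```python
def build_prompt(fields, content):
    """Build an extraction prompt from a {key: description} mapping."""
    field_lines = "\n".join(f"- {key}: {desc}" for key, desc in fields.items())
    keys = ", ".join(fields)
    return (
        "You are a data extraction tool. From the page content below, "
        "extract the following fields:\n"
        f"{field_lines}\n"
        f"Return only JSON with keys: {keys}. "
        "Use null for any field that is not present.\n\n"
        f"CONTENT:\n{content}\n"
    )


# Illustrative schema for the test book page; describe ambiguous keys explicitly.
BOOK_FIELDS = {
    "title": "the book title",
    "price": "the price as shown, including the currency symbol",
    "availability": "the stock status text",
    "rating": "the star rating as a word, e.g. Three",
}

prompt = build_prompt(BOOK_FIELDS, "# A Light in the Attic")
```

Keeping the keys in one dict means the prompt and any later validation can share the same source of truth for the schema.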

Step 3: Parse and save the structured result

Because you asked for a JSON MIME type, the response text should already be valid JSON. Parse it into a Python dict and write it to disk. Still wrap the parse in a try/except so a rare malformed response logs the raw text instead of crashing the run.

python

import json

def save_json(raw, path="book_data.json"):
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        print("Gemini did not return valid JSON:")
        print(raw)
        return
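    # Write UTF-8 and keep characters like "£" readable instead of \u-escaped.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)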
    print(f"Saved {path}")

save_json(raw_json)

The full script

Here is everything wired together into one runnable file. Fill in your two credentials in .env, change the URL, and adjust the prompt keys for whatever target you are extracting.

python

import os
import json
from dotenv import load_dotenv
from crawlbase import CrawlingAPI
import google.generativeai as genai

load_dotenv()
api = CrawlingAPI({"token": os.environ["CRAWLBASE_JS_TOKEN"]})
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")

url = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"

def fetch_markdown(target_url):
    options = {"format": "markdown", "ajax_wait": "true", "page_wait": 3000}
    response = api.get(target_url, options)
    return response["body"].decode("utf-8")

def extract_fields(content):
    prompt = f"""You are a data extraction tool. From the page content below,
extract the book title, price, availability, and star rating.
Return only JSON with keys: title, price, availability, rating.

CONTENT:
{content}
"""
    response = model.generate_content(
        prompt,
        generation_config={"response_mime_type": "application/json"},
    )
    return response.text

def main():
    markdown = fetch_markdown(url)
    raw = extract_fields(markdown)
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        print("Gemini did not return valid JSON:", raw)
        return
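    # ensure_ascii=False keeps characters like "£" readable in the file and console.
    with open("book_data.json", "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    print(json.dumps(data, indent=2, ensure_ascii=False))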

if __name__ == "__main__":
    main()

What the output looks like

Save the script as scraper.py, run it with python scraper.py, and you get clean structured data written to book_data.json and echoed to the console.

json

{
  "title": "A Light in the Attic",
  "price": "£51.77",
  "availability": "In stock (22 available)",
  "rating": "Three"
}

Notice what you did not write: no CSS selectors, no XPath, no per-field parsing logic. You described the fields and the model found them. Point the same script at a different book URL, or a product page on another site, and it adapts without code changes, which is the real advantage of the AI data extraction approach over hand-tuned selectors.
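The values in the sample output are strings exactly as they appear on the page. If you need numbers downstream, a small post-processing step can add them. This is a sketch under the assumption that the output looks like the sample above; normalize_book and the added key names are hypothetical, and the price pattern does not handle thousands separators.

```python
import re

RATING_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def normalize_book(row):
    """Return a copy of row with numeric price, stock count, and rating added."""
    out = dict(row)
    # First run of digits with an optional decimal part, e.g. "51.77".
    price_match = re.search(r"\d+(?:\.\d+)?", str(row.get("price", "")))
    out["price_value"] = float(price_match.group()) if price_match else None
    # Matches the "(22 available)" pattern used on the test site.
    stock_match = re.search(r"\((\d+) available\)", str(row.get("availability", "")))
    out["stock_count"] = int(stock_match.group(1)) if stock_match else None
    # Unknown rating words map to None rather than raising.
    out["rating_value"] = RATING_WORDS.get(str(row.get("rating", "")).strip().lower())
    return out
```

For the sample row, this adds price_value 51.77, stock_count 22, and rating_value 3 while leaving the original strings untouched. You could also ask Gemini for numeric values directly in the prompt; doing it in code keeps the conversion deterministic.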

Scaling to many pages

One page is a demo; a real job runs over a list of URLs. The shape stays the same: loop the URLs, fetch each through the Crawling API, extract with Gemini, and collect the rows. Two things to keep in mind as you scale. Gemini bills per token, so sending markdown rather than full HTML keeps cost down on every call, and the Crawling API has its own throughput so you do not have to manage proxies or browser instances yourself.

python

urls = [
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
]

results = []
for u in urls:
    markdown = fetch_markdown(u)
    raw = extract_fields(markdown)
    try:
        results.append(json.loads(raw))
    except json.JSONDecodeError:
        print(f"Skipped {u}: invalid JSON")

with open("books.json", "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)

If you find yourself extracting the same well-known site over and over (Amazon, a major retailer, a job board), it is worth comparing this against the Scraper API, which returns pre-parsed JSON for supported sites without an LLM in the loop. For odd or one-off layouts where no parser exists, the Gemini approach in this guide is the flexible fallback. For background on why markdown is the right input shape for an LLM, see LLM-ready markdown for web scraping.
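Coming back to the loop above: on a longer URL list you will probably want a pause between pages, a record of which URLs failed, and the source URL attached to each row. The sketch below shows one way to structure that. scrape_all, delay, and source_url are hypothetical names, and fetch and extract stand in for fetch_markdown and extract_fields.

```python
import json
import time


def scrape_all(urls, fetch, extract, delay=1.0, sleep=time.sleep):
    """Run fetch -> extract over urls, pausing between pages.

    Returns (rows, failures). Each row carries its source URL so it can be
    traced later; failures is a list of (url, error message) pairs.
    """
    rows, failures = [], []
    for index, u in enumerate(urls):
        if index:
            sleep(delay)  # fixed pause between pages; tune it for your limits
        try:
            data = json.loads(extract(fetch(u)))
        except Exception as exc:  # one bad page should not end the whole run
            failures.append((u, str(exc)))
            continue
        if not isinstance(data, dict):
            failures.append((u, "expected a JSON object"))
            continue
        rows.append({"source_url": u, **data})
    return rows, failures
```

You would call it as rows, failures = scrape_all(urls, fetch_markdown, extract_fields) and write rows to disk as before. Returning the failures instead of only printing them makes it easy to retry just those URLs in a second pass.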

Limits worth knowing before you ship

The Gemini-plus-Crawlbase pipeline is flexible, but it is not the right hammer for every nail. Keep these in mind.

Token cost adds up. Gemini charges per token sent and received. Sending full HTML instead of markdown can multiply your bill for no benefit, so always trim the input. For very large pages, extract only the relevant section before the LLM call.
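One simple way to trim the input is to drop everything before a known marker and cap the length. The sketch below uses character count as a rough stand-in for tokens, which is an approximation rather than Gemini's actual tokenization. trim_markdown, start_marker, and the 12000 default are all hypothetical choices.

```python
def trim_markdown(markdown, start_marker=None, max_chars=12000):
    """Return at most max_chars of markdown, optionally starting at a marker.

    If start_marker is given and found, everything before the line that
    contains it is dropped. The cut is always made at a line boundary.
    """
    lines = markdown.splitlines()
    if start_marker is not None:
        for i, line in enumerate(lines):
            if start_marker in line:
                lines = lines[i:]
                break
    kept, total = [], 0
    for line in lines:
        cost = len(line) + 1  # +1 accounts for the newline
        if total + cost > max_chars:
            break
        kept.append(line)
        total += cost
    return "\n".join(kept)
```

You would call it between the fetch and the Gemini call, for example extract_fields(trim_markdown(page_markdown, start_marker="# ")). If the marker is not found, the function falls back to capping the whole page.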

It is slower than rule-based parsing. An LLM round-trip takes longer than a Cheerio or BeautifulSoup selector pass. For high-frequency, low-latency jobs like second-by-second price monitoring, a dedicated parser wins. The LLM approach shines when layouts vary or change often.

Models can be wrong. On dense or repetitive pages a model can occasionally mislabel or miss a field. Forcing JSON output and naming exact keys reduces this a lot, but for anything mission-critical, validate the parsed dict against an expected schema before trusting it.
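A lightweight version of that validation needs nothing beyond the standard library. The sketch below checks that the response is a JSON object, that every expected key is present, and that none is empty. parse_and_validate and REQUIRED_KEYS are hypothetical names; for stricter type checks you might reach for a schema library instead.

```python
import json

REQUIRED_KEYS = {"title", "price", "availability", "rating"}


def parse_and_validate(raw, required=REQUIRED_KEYS):
    """Parse raw JSON text and return (data, problems).

    problems is a list of human-readable strings; empty means the row passed.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["expected a JSON object at the top level"]
    present = set(data)
    problems = [f"missing key: {k}" for k in sorted(required - present)]
    problems += [
        f"empty value: {k}"
        for k in sorted(required & present)
        if data[k] in (None, "")
    ]
    return data, problems
```

A row with an empty problems list can be trusted as structurally complete; anything else is worth logging with its URL and re-running, since a second extraction pass often fills a field the first one missed.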

For staying unblocked at volume, the Crawling API handles IP rotation and rendering for you. If you would rather route your own traffic through a rotating pool, the Smart AI Proxy (also called the AI Proxy) gives you the same residential IP rotation as a drop-in proxy endpoint. Either way, the broader playbook lives in how to scrape websites without getting blocked.

Recap

Key takeaways

Split the job. Crawlbase fetches and renders the page; Gemini extracts the fields. Neither tool does the other's job, and that separation is what makes the pipeline reliable.
Use the JS token and markdown format. The JS token renders client-side pages; format: 'markdown' returns clean, low-token content that is ideal input for an LLM.
Force JSON output. Set Gemini's response_mime_type to application/json and name your exact keys so the result is parseable every run.
No selectors needed. You describe the fields in plain English, so the same script adapts across layouts without rewriting extraction code.
Know the trade-offs. LLM extraction is flexible but slower and billed per token, so trim input, validate output, and reach for a dedicated parser when speed matters.

Frequently Asked Questions (FAQs)

Can Gemini do web scraping on its own?

Not the fetching part. Gemini reads and structures content you give it, but it has no HTTP client, browser, or proxy pool, so it cannot open a URL or get past anti-bot defenses. You pair it with a fetching layer like the Crawling API, which renders the page and returns clean HTML or markdown; Gemini then extracts the structured fields from that content.

Why convert the page to markdown before sending it to Gemini?

Markdown strips navigation, scripts, and styling noise, leaving the readable content. That lowers the token count you send to Gemini, which cuts cost and improves accuracy because the model spends its attention on real content instead of boilerplate. The Crawling API can return markdown directly with format: 'markdown', so you do not need a separate conversion step.

Do I need the normal token or the JS token from Crawlbase?

Use the JS token for any page that renders content client-side, which is most modern sites. The normal token fetches static HTML, so on a client-rendered page it returns an empty shell and Gemini has nothing to extract. The JS token renders the page in a real browser first, so the content is present when it reaches the model.

How do I make Gemini return reliable JSON instead of prose?

Set the generation config's response_mime_type to application/json and name the exact keys you want in the prompt. That combination makes Gemini return parseable JSON without code fences or commentary. Still wrap the json.loads call in a try/except so a rare malformed response logs the raw text instead of crashing your run.

Is the Gemini approach better than the Scraper API for everything?

No, they serve different needs. For well-known sites with existing parsers, the Scraper API returns pre-parsed JSON faster and without LLM token cost. The Gemini pipeline is the flexible fallback for odd, one-off, or frequently changing layouts where no dedicated parser exists and you would rather describe fields than maintain selectors.

Will this get me blocked?

The Crawling API renders pages behind rotating residential IPs server-side, which handles most blocking for you. If you build your own fetch stack, that rotation is the part to invest in, and you can use the Smart AI Proxy as a drop-in rotating endpoint. Pace your requests, vary your targets, and watch the status codes so you can back off when a site starts challenging traffic.

Hassan Rehan

Software Engineer · Crawlbase

Software engineer at Crawlbase writing hands-on guides on rotating proxies, scraping, and the practical details of wiring proxies into real code.
