ChatGPT web scraping has quietly flipped how a lot of extraction code gets written. The old way was brittle: inspect a page, hand-write CSS selectors, and rewrite them every time the site shipped new markup. The new way leans on the model. You fetch a clean copy of the page, hand it to an OpenAI model, and ask for the fields you want as structured JSON. The selectors live in your prompt instead of your code, and they survive layout changes that would have broken a hard-coded parser.
There is one honest caveat the marketing tends to skip: ChatGPT cannot fetch the page for you. It has no reliable browser, it gets blocked like any other client, and asking it for a live URL often returns a confident hallucination instead of real data. So this guide splits the job cleanly. Crawlbase does the fetching and rendering behind a trusted IP, the OpenAI model does the understanding, and Python glues the two together. Everything below is runnable, scoped to public pages, and built to handle pages too large to send to the model in one shot.
How ChatGPT web scraping actually works
It helps to be precise about which tool does what, because conflating the two is where most tutorials go wrong. A language model is a reasoning layer, not a network client. Give it clean text and it is excellent at pulling structured fields out of messy content. Ask it to go get that content itself and it falls down: no rendering engine, no proxy pool, no CAPTCHA handling, and a strong tendency to invent plausible-looking values when it cannot actually load the page.
So a working pipeline has three stages. First, fetch and render the target with the Crawling API, which runs the page in a real browser behind rotating residential IPs and hands back finished HTML. Second, reduce that HTML to something compact, either stripped text or markdown, so you are not paying to send navigation and script tags to the model. Third, prompt an OpenAI model to read that content and return JSON matching a schema you define. The model never touches the network; it only reads what you give it.
Keep these responsibilities separate. Crawlbase is what loads the page, renders JavaScript, and gets past blocks. The OpenAI model never fetches anything: it only reads the HTML or markdown you pass it and returns structured data. If you ask ChatGPT for a live URL directly, it cannot reliably load it and may fabricate the answer.
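The three stages can be sketched as a single function that takes each stage as a plain callable. This is only an illustration of the shape: `run_pipeline` and the `fake_*` stand-ins are hypothetical names, not part of Crawlbase or the OpenAI SDK, and the stand-ins exist so the sketch runs without network access.

```python
from typing import Callable, Dict, List


def run_pipeline(
    url: str,
    fetch: Callable[[str], str],
    reduce: Callable[[str], str],
    extract: Callable[[str, List[str]], Dict],
    fields: List[str],
) -> Dict:
    """Fetch, reduce, then extract. Only `fetch` is meant to touch the network."""
    html = fetch(url)                 # stage 1: rendered HTML from the fetch layer
    content = reduce(html)            # stage 2: compact text or markdown
    return extract(content, fields)   # stage 3: the model sees content, never the URL


# Hypothetical offline stand-ins; swap in the real stages built in the steps below.
def fake_fetch(url: str) -> str:
    return "<html><body><h1>Acme Widget Pro</h1></body></html>"


def fake_reduce(html: str) -> str:
    return "Acme Widget Pro"


def fake_extract(content: str, fields: List[str]) -> Dict:
    # A real extractor would call the model; this one returns null for every field.
    return {name: None for name in fields}
```

Passing the stages in as arguments keeps the boundary explicit: the extraction stage receives content and a field list, and has no URL to be tempted by.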
What you need before you start
This is a beginner-to-intermediate build. You need Python 3.9 or newer, a Crawlbase account (which gives you both a normal token and a JavaScript token), and an OpenAI API key. The free Crawlbase tier covers more than enough requests to follow along, and the model calls here use a small, inexpensive model. Set all three secrets as environment variables rather than pasting them into the script.
    python --version
    mkdir chatgpt-scraper && cd chatgpt-scraper
    pip install crawlbase openai beautifulsoup4 html2text
    export CRAWLBASE_TOKEN="your_normal_token"
    export CRAWLBASE_JS_TOKEN="your_javascript_token"
    export OPENAI_API_KEY="your_openai_key"
Four libraries do the work. crawlbase is the client for the Crawling API, openai is the official SDK for the model calls, beautifulsoup4 strips a rendered page down to readable text, and html2text turns HTML into markdown when you want the model to see structure like headings and tables. You will not always need both BeautifulSoup and html2text; pick whichever representation suits the page.
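Because the scripts read their secrets from the environment, a missing variable otherwise surfaces as a bare `KeyError` partway through a run. A small check at startup, sketched here, reports every missing name at once. `missing_env` and `REQUIRED_VARS` are hypothetical helpers introduced for illustration; the variable names are the ones exported above.

```python
import os

# Names match the export commands used in this guide.
REQUIRED_VARS = ("CRAWLBASE_TOKEN", "CRAWLBASE_JS_TOKEN", "OPENAI_API_KEY")


def missing_env(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]
```

Calling `missing_env()` at the top of a script and exiting with the returned list in the message is usually enough; the `environ` parameter only exists so the function can be exercised with a plain dictionary.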
Step 1: Fetch the rendered page with the Crawling API
Start by getting a clean copy of the page. Use the JavaScript token for any site that renders content client-side, which is most modern pages, and pass ajax_wait plus page_wait so late-loading content has time to appear before the HTML comes back. The example below points at a public product page; swap in whatever public URL you are working with.
    import os

    from crawlbase import CrawlingAPI

    # The JavaScript token tells the Crawling API to render the page in a browser.
    api = CrawlingAPI({"token": os.environ["CRAWLBASE_JS_TOKEN"]})
    target_url = "https://www.example.com/products/widget"


    def fetch_html(url):
        # ajax_wait and page_wait give late-loading content time to appear.
        response = api.get(url, {"ajax_wait": "true", "page_wait": 4000})
        if response["status_code"] != 200:
            raise RuntimeError(f"Fetch failed: {response['status_code']}")
        return response["body"].decode("utf-8", "ignore")


    if __name__ == "__main__":
        html = fetch_html(target_url)
        print(len(html), "bytes of rendered HTML")
Run it and you should see a substantial byte count and, if you print a slice, real content rather than an empty shell. This is the step ChatGPT cannot do on its own: the Crawling API rendered JavaScript and routed the request through a trusted IP so the page came back whole. With clean HTML in hand, the model can take over.
Step 2: Reduce the page to clean text or markdown
Raw HTML is mostly noise to a language model: scripts, styles, SVG paths, and tracking tags that cost tokens and dilute the signal. Strip them first. For plain field extraction, BeautifulSoup text is enough. When the page has meaningful structure, like a spec table or a nested list, convert to markdown so the model can see the hierarchy.
    from bs4 import BeautifulSoup
    import html2text


    def to_text(html):
        soup = BeautifulSoup(html, "html.parser")
        # Drop elements that carry no readable content for the model.
        for tag in soup(["script", "style", "noscript", "svg"]):
            tag.decompose()
        return soup.get_text(separator=" ", strip=True)


    def to_markdown(html):
        converter = html2text.HTML2Text()
        # Links and images rarely matter for field extraction and cost tokens.
        converter.ignore_links = True
        converter.ignore_images = True
        return converter.handle(html)
This one move often cuts the token count by an order of magnitude, which means cheaper, faster, and more accurate extractions because the model is not wading through markup. If you would rather skip this entirely, Crawlbase can return clean LLM-ready markdown directly from the fetch, so the page arrives in the shape a model wants without a local conversion step.
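To see what the reduction buys on your own pages, you can compare rough size estimates before and after. The sketch below is an approximation only: `estimate_savings` is a hypothetical helper, and the four-characters-per-token figure is a rule-of-thumb assumption, not a real tokenizer, so treat the numbers as a ballpark.

```python
def estimate_savings(raw_html: str, reduced: str, chars_per_token: float = 4.0) -> dict:
    """Compare rough token estimates before and after reduction.

    chars_per_token is an assumed rule of thumb, not an exact tokenizer count.
    """
    raw_tokens = len(raw_html) / chars_per_token
    reduced_tokens = len(reduced) / chars_per_token
    # Guard against an empty reduction so the ratio never divides by zero.
    factor = raw_tokens / reduced_tokens if reduced_tokens else float("inf")
    return {
        "raw_tokens": round(raw_tokens),
        "reduced_tokens": round(reduced_tokens),
        "reduction_factor": round(factor, 1),
    }
```

Running it on the raw HTML and on the output of either reduction function for a handful of representative pages shows which representation is cheaper for your targets.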
Step 3: Prompt the OpenAI model to extract structured JSON
Now the actual ChatGPT web scraping. Send the cleaned content to an OpenAI model with a prompt that names every field you want and forces the output into JSON. The single most important setting is asking for a JSON object response, so you get parseable data instead of prose. A low temperature keeps the model from getting creative with values that should be copied verbatim.
    import json

    from openai import OpenAI

    # Reads OPENAI_API_KEY from the environment.
    client = OpenAI()

    SYSTEM = (
        "You extract structured data from web page content. "
        "Return only valid JSON. Copy values verbatim from the text. "
        "If a field is not present, use null. Never invent data."
    )


    def extract(content, fields):
        prompt = (
            f"Extract these fields as JSON: {', '.join(fields)}.\n\n"
            f"Page content:\n{content}"
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=0,
            # Forces a parseable JSON object instead of prose.
            response_format={"type": "json_object"},
            messages=[
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": prompt},
            ],
        )
        return json.loads(response.choices[0].message.content)
Wire the three stages together and you have a complete scraper that never hard-codes a selector.
    if __name__ == "__main__":
        html = fetch_html(target_url)
        text = to_text(html)
        data = extract(text, ["product_name", "price", "rating", "in_stock"])
        print(json.dumps(data, indent=2))
The result is structured and ready to store.
    {
      "product_name": "Acme Widget Pro",
      "price": "$49.99",
      "rating": "4.6",
      "in_stock": true
    }
The model can only extract from a page it can actually see. The Crawling API renders JavaScript in a real browser, rotates through residential IPs server-side, and returns finished HTML or LLM-ready markdown in one call, so ChatGPT gets clean content instead of a blocked shell. Start on the free tier and point it at a public page.
Designing prompts that extract reliably
The prompt is now where your scraping logic lives, so it pays to write it deliberately. A few patterns make the difference between flaky output and data you can trust.
Define an explicit schema, not a vague request
"Get the important info" gives the model room to guess. Name every field, its type, and what to do when it is missing. Passing a JSON skeleton in the prompt is even stronger, because the model fills in a shape it can see rather than one it has to infer.
    import json

    schema = {
        "product_name": "string",
        "price": "string, include currency symbol",
        "rating": "number or null",
        "specs": "object of key-value pairs",
    }

    # `content` is the cleaned page text or markdown produced in Step 2.
    prompt = (
        "Fill this schema from the page content. "
        "Use null for anything absent.\n\n"
        f"Schema:\n{json.dumps(schema, indent=2)}\n\n"
        f"Content:\n{content}"
    )
Pin down formats and forbid invention
State exactly how each value should look: keep the currency symbol on prices, normalize dates to YYYY-MM-DD, return ratings as numbers. Just as important, tell the model never to guess. The instruction "use null for missing fields, never fabricate" in the system prompt is the main guard against hallucinated values, the single biggest risk in model-based extraction. It lowers that risk rather than eliminating it, which is why the output still needs checking in code.
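The "never fabricate" instruction can be backed up with a check in code. One possible approach, sketched here under the assumption that the fields in question were meant to be copied verbatim, is to flag any string value that does not appear in the content you sent. `find_ungrounded` is a hypothetical helper, not a library function.

```python
def find_ungrounded(data: dict, source_text: str) -> list:
    """Return field names whose string values do not appear in the source text.

    Comparison ignores case and collapses whitespace. Non-string values
    (null, numbers, booleans, nested objects) are not checked here.
    """
    haystack = " ".join(source_text.split()).lower()
    suspicious = []
    for field, value in data.items():
        if not isinstance(value, str):
            continue
        needle = " ".join(value.split()).lower()
        if needle and needle not in haystack:
            suspicious.append(field)
    return suspicious
```

Fields you asked the model to normalize, such as dates rewritten to YYYY-MM-DD, will legitimately fail this check, so apply it only to the fields that should match the page character for character.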
Lower the temperature and validate the output
Set temperature=0 so the same page yields the same JSON as consistently as the API allows; output at zero temperature is highly repeatable, though not strictly guaranteed to be identical on every run. Then validate what comes back: confirm it parses, check that required keys exist, and verify types. The model returns text, so treat its output as untrusted input until your code has checked it, exactly as you would with any external source.
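Those three checks (parses, has the required keys, has the right types) fit in one short function. The version below is a minimal sketch using only the standard library; `validate_record` and the shape of its `required` argument are assumptions made for illustration, and a schema library would be a reasonable alternative for anything larger.

```python
import json


def validate_record(raw: str, required: dict) -> dict:
    """Parse model output and check required keys and types.

    `required` maps field name -> allowed type, or a tuple of allowed types.
    null is accepted for any field. Raises ValueError on the first problem.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    for field, allowed in required.items():
        allowed = allowed if isinstance(allowed, tuple) else (allowed,)
        if field not in data:
            raise ValueError(f"missing required field: {field}")
        value = data[field]
        if value is None:
            continue
        # bool is a subclass of int in Python, so reject it unless asked for.
        if isinstance(value, bool) and bool not in allowed:
            raise ValueError(f"wrong type for {field}: bool")
        if not isinstance(value, allowed):
            raise ValueError(f"wrong type for {field}: {type(value).__name__}")
    return data
```

Here `raw` is the string the model returned, before any `json.loads` call, so a parse failure and a schema failure both surface as the same exception type and can be handled in one place.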
Handling large pages that exceed the context window
The pattern above works until a page is too big to fit in one model call. Long category listings, reviews with hundreds of entries, and dense documentation can blow past the context window or simply cost too much per request. The fix is to split the content into chunks, extract from each, and merge the results.
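    def chunk_text(text, size=12000, overlap=500):
        # An overlap as large as the chunk would never advance through the text.
        if overlap >= size:
            raise ValueError("overlap must be smaller than size")
        chunks = []
        start = 0
        while start < len(text):
            end = start + size
            chunks.append(text[start:end])
            if end >= len(text):
                break  # last chunk reached; skip a trailing overlap-only chunk
            start = end - overlap
        return chunks


    def extract_large(text, fields):
        results = []
        for chunk in chunk_text(text):
            part = extract(chunk, fields)
            if part:
                results.append(part)
        return results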
A few rules keep chunking honest. Overlap the chunks by a few hundred characters so a record straddling a boundary is not cut in half. Split on natural breaks like paragraphs or list items rather than mid-word when you can. For list pages, ask each chunk for an array of items and concatenate the arrays, then de-duplicate on a stable key like a product ID or URL, since the overlap will produce a few repeats. The pattern is the same one production systems use, and the broader mechanics are covered in how AI data extraction works.
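Two of those rules, splitting on natural breaks and de-duplicating on a stable key, can be sketched as follows. Both function names are hypothetical. The paragraph splitter assumes the content still contains blank lines between paragraphs, as markdown output typically does; text that was flattened with a single-space separator has no such breaks left to split on.

```python
def chunk_paragraphs(text: str, size: int = 12000) -> list:
    """Pack whole paragraphs into chunks of at most `size` characters.

    A single paragraph longer than `size` is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def merge_unique(per_chunk_items: list, key: str) -> list:
    """Concatenate the item lists from each chunk, keeping the first item per key."""
    seen, merged = set(), []
    for items in per_chunk_items:
        for item in items:
            ident = item.get(key)
            if ident is None:
                merged.append(item)  # nothing to compare on, so keep it
                continue
            if ident in seen:
                continue  # repeat caused by overlapping chunks
            seen.add(ident)
            merged.append(item)
    return merged
```

Keeping the first occurrence is an arbitrary choice; if later chunks tend to contain the more complete version of a record, merging field by field may serve better.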
When the page fights back
Everything so far assumes the fetch succeeds. On well-defended sites it will not, at least not for long. The Crawling API handles rendering and IP rotation in the call you already wrote, which clears most blocks. When you are running at higher volume or hitting unusually aggressive targets, route through the Smart AI Proxy, which adapts its strategy per target to keep success rates up. And when Crawlbase already maintains a parser for the site, have the Crawling API return the parsed fields so you get clean data without an LLM call at all.
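For the occasional transient failure, a retry wrapper on your side is a cheap addition. The sketch below assumes a fetch function that raises `RuntimeError` on a failed request, as the `fetch_html` function in Step 1 does; `fetch_with_retry` itself and its default delays are hypothetical choices, not Crawlbase recommendations.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on RuntimeError with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except RuntimeError as exc:
            last_error = exc
            # Wait 1x, 2x, 4x ... the base delay, but not after the final try.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_error
```

The `sleep` parameter is injectable only so the backoff can be exercised without real waiting. Retrying helps with intermittent errors; a target that blocks every request needs a different fetch strategy, not more attempts.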
The division of labor is the thing to hold onto: Crawlbase is responsible for getting past defenses and delivering a real page, and the OpenAI model is responsible for reading it. Conflating the two, by asking ChatGPT to fetch a URL, is what produces blocked requests and hallucinated answers. Keep them separate and each does the part it is good at. If you want to compare model families for the extraction half, our walkthrough on using Gemini AI for web scraping follows the same fetch-then-extract shape.
Key takeaways
- Split the job in two. Crawlbase fetches and renders the page; the OpenAI model extracts data from the clean content. The model never touches the network.
- Reduce before you prompt. Strip HTML to text or markdown so you spend tokens on content, not script tags, and get cheaper, more accurate extractions.
- Make the prompt your schema. Name every field and type, force JSON output, set temperature=0, and tell the model to use null instead of inventing values.
- Chunk large pages. Split with overlap, extract per chunk, then merge and de-duplicate on a stable key when content exceeds the context window.
- Validate the output. The model returns text; confirm it parses and has the expected keys and types before you store it.
- Stay on public data. Respect each site's terms and robots.txt; no accounts, no personal data, no actions behind a login.
Frequently Asked Questions (FAQs)
Can ChatGPT scrape a website directly?
No. ChatGPT has no reliable way to fetch a live page: it gets blocked like any other client and often fabricates an answer when it cannot load the URL. What it does well is read content you give it and return structured data. So you fetch the page with a tool like the Crawling API, then pass the clean HTML or markdown to the model for extraction.
Why fetch with Crawlbase instead of Python requests?
Because a plain request returns an empty shell on JavaScript-heavy sites and gets blocked on defended ones. The Crawling API renders the page in a real browser and routes through rotating residential IPs, so the content the model sees is the content a human would see. Without that step, the model is extracting from a blank or a bot-detection page.
Which OpenAI model should I use for extraction?
A small, fast model like gpt-4o-mini handles most field extraction well and keeps costs low at scale. Reach for a larger model only when the page demands real reasoning, such as inferring fields that are implied rather than stated, or reconciling conflicting values. Start small, measure accuracy on your pages, and size up only if the small model misses.
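"Measure accuracy on your pages" can be as simple as hand-labeling a few pages and comparing. The helper below is one possible way to score that comparison per field; `field_accuracy` is a hypothetical name, and exact-match scoring is an assumption that suits verbatim fields better than free text.

```python
def field_accuracy(predictions: list, expected: list) -> dict:
    """Per-field share of exact matches across paired records.

    predictions[i] is the model output for the page whose hand-labeled
    answer is expected[i]. Only fields present in the labels are scored.
    """
    if len(predictions) != len(expected):
        raise ValueError("predictions and expected must be the same length")
    totals, hits = {}, {}
    for got, want in zip(predictions, expected):
        for field, value in want.items():
            totals[field] = totals.get(field, 0) + 1
            if got.get(field) == value:
                hits[field] = hits.get(field, 0) + 1
    return {field: hits.get(field, 0) / totals[field] for field in totals}
```

Running the same labeled pages through two models and comparing the resulting dictionaries shows which fields, if any, justify the larger model.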
How do I stop the model from hallucinating values?
Three things together. Set temperature=0 for near-deterministic output, instruct the model in the system prompt to use null for missing fields and never invent data, and validate the returned JSON in code. Asking it to copy values verbatim from the text, rather than summarize, also cuts down on fabricated answers.
How do I handle a page that is too large for one request?
Split the cleaned content into overlapping chunks, extract from each chunk separately, and merge the results. For list pages, return an array per chunk and de-duplicate on a stable key like a product ID, since the overlap produces a few repeats. Stripping the HTML to text first also shrinks the page enough that many "too large" pages fit in a single call.
Is ChatGPT web scraping legal?
It depends on the target site's terms of service, your jurisdiction, and what you do with the data. Keep strictly to public content, respect robots.txt and rate expectations, and never touch accounts, personal data, or anything behind a login. For commercial reuse, get permission or an official data agreement rather than relying on a scraper.
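The robots.txt part of that advice can be checked in code with Python's standard library. The sketch below takes the robots.txt content as a string, on the assumption that you have already retrieved it through your fetch layer; `is_allowed` is a hypothetical wrapper around `urllib.robotparser`, and passing this check says nothing about a site's terms of service.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against robots.txt content that was fetched separately."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative rules only; real sites publish their own.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /account/
"""
```

With the example rules, a product page is permitted and anything under `/account/` is not, which lines up with the rule of staying away from content behind a login.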
