Pulling a price off an Amazon product page sounds trivial until you try to automate it at scale. The page renders client-side, the markup shifts between layouts, and the same product can show a deal price, a list price, a coupon, and a subscribe-and-save price all at once. The old answer was a pile of brittle CSS selectors or XPath that broke the moment Amazon shipped a tweak. AI changes the calculation: instead of describing where the price sits in the DOM, you hand a model the page content and tell it what you want in plain English.
This guide shows you how to scrape prices from Amazon with AI the reliable way. Use the Crawling API to fetch and render the public product page into clean markdown, then pass that content to a large language model that returns the price and related fields as structured JSON. The split is the whole point: Crawlbase handles fetching and rendering behind a real browser and a trusted IP, the model handles reading and structuring. Each tool does the part it is actually good at.
Why pair AI with a fetching layer
A language model is good at reading messy content and returning structured data when you ask for it. What it cannot do is fetch a web page. It has no HTTP client, no browser, no proxy pool, and no way past the anti-bot defenses that guard a site like Amazon. Give it a URL and it cannot open it; give it raw HTML you scraped yourself and it will extract from whatever you actually retrieved, which on Amazon is usually a near-empty shell or a CAPTCHA page.
That gap is what the fetching layer fills. Amazon renders prices and availability in the browser, and it challenges automated traffic fast, so a plain requests.get typically returns a 200 with none of the data you came for. You need a browser that runs the page's JavaScript and an IP the site reads as a real visitor. You can build that yourself with a headless browser plus rotating residential proxies, but keeping that stack healthy is most of the job. The Crawling API folds both into one call: send a URL with a JavaScript token, it renders the page and returns finished content, ready for the model.
Keep the boundary clear. Crawlbase fetches and renders the Amazon page into clean markdown. The model extracts the price and related fields from that content. The model never touches the network in this design, and Crawlbase never tries to understand the data. Blurring those two jobs is the most common reason these pipelines feel flaky.
Is it legal to scrape Amazon prices?
It depends on Amazon's terms of service, your jurisdiction, and what you do with the data. Amazon's terms restrict automated access, so scraping can run against those terms regardless of how careful your tooling is. None of the code here changes that; it just makes the technical part work. Treat the legal question as a real one, not a formality.
A few lines worth holding to. Collect only public data: the prices, titles, ratings, and availability that anyone can see on a product page without logging in. Respect Amazon's robots.txt and its stated rate expectations, and keep your request volume low enough that you are not straining anyone's servers. If you plan to reuse the data commercially, get permission or an official data agreement rather than assuming silence is consent. And never collect personal data, including anything tied to individual customer accounts or reviews attributable to identifiable people.
This walkthrough is deliberately scoped to public product data because that is the line that keeps the work defensible. It does not cover anything behind a login, account or order data, payment flows, or any attempt to bypass authentication. If your project needs more than public product pages, the right move is an official API or a data agreement with Amazon, not a cleverer scraper.
What you will build
A small, runnable Python script that takes an Amazon product URL, retrieves the rendered page through the Crawling API as clean markdown, sends that markdown to a language model with an extraction prompt, and writes the structured result to a JSON file. We will point it at a real product page, but the same script works on any public Amazon URL you swap in.
Set up the environment
You need Python 3.8 or later. Confirm your version, create a virtual environment so project dependencies stay isolated, then install the libraries.
python --version
python -m venv amazon_env
source amazon_env/bin/activate
pip install openai crawlbase python-dotenv
On Windows, activate the environment with amazon_env\Scripts\activate instead of the source line. Three dependencies do the work: crawlbase is the official client for the Crawling API, openai is the client for the language model, and python-dotenv loads your keys from a local file so they never end up hard-coded in the script.
You need two credentials. Get an API key for your model provider from their dashboard, and get a Crawlbase JavaScript (JS) token from your Crawlbase dashboard after signing up. Store both in a .env file in your project folder.
OPENAI_API_KEY=your_model_api_key_here
CRAWLBASE_JS_TOKEN=your_crawlbase_js_token_here
Crawlbase offers two token types. The normal token fetches static HTML; the JavaScript (JS) token renders the page in a real browser first. Amazon loads its prices client-side, so the JS token is the right choice here. Using the normal token on a client-rendered page returns the same empty shell a plain fetch would, and the model cannot extract a price that was never there.
Step 1: Fetch the rendered Amazon page
The Crawling API can return the page already converted to markdown, which is exactly what you want before sending it to a language model. Markdown strips the navigation, scripts, and styling noise, leaving the readable content. That cuts the token count you send to the model, which makes the call cheaper and the extraction more accurate. Pass format: 'markdown' and the API hands you clean text instead of raw HTML.
import os

from dotenv import load_dotenv
from crawlbase import CrawlingAPI

# Load CRAWLBASE_JS_TOKEN (and OPENAI_API_KEY) from the local .env file.
load_dotenv()

api = CrawlingAPI({"token": os.environ["CRAWLBASE_JS_TOKEN"]})
url = "https://www.amazon.com/dp/B0CHX1W1XY"


def fetch_markdown(target_url):
    # Render the page in a browser, wait for late content, return markdown.
    options = {"format": "markdown", "ajax_wait": "true", "page_wait": 3000}
    response = api.get(target_url, options)
    return response["body"].decode("utf-8")


page_markdown = fetch_markdown(url)
print(page_markdown[:500])  # Preview the first 500 characters.
The two wait options matter for a client-rendered target like Amazon. ajax_wait tells the API to wait for asynchronous content to finish loading, and page_wait holds for a fixed number of milliseconds after load so late-rendering price elements appear before the page is captured. Three seconds is a reasonable start; raise it if the price comes back missing. Run this and you should see real product markup in the output, not an empty shell, which confirms rendering works before you write a single line of extraction logic.
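One way to act on the advice to raise the wait when the price comes back missing is to retry with progressively longer waits. The sketch below is a minimal illustration, not part of the Crawlbase client: fetch_until_price, PRICE_PATTERN, and the wait values are hypothetical, and the fetch argument stands in for your own wrapper around api.get that accepts a page_wait value.

```python
import re
from typing import Callable, Optional

# Hypothetical heuristic: a currency symbol followed by digits, e.g. "$34.99".
PRICE_PATTERN = re.compile(r"[$€£]\s?\d[\d,]*(?:\.\d{2})?")


def fetch_until_price(
    fetch: Callable[[str, int], str],
    url: str,
    waits_ms=(3000, 5000, 8000),
) -> Optional[str]:
    """Call fetch(url, page_wait_ms) with longer waits until a price shows up.

    `fetch` is any callable that returns the page markdown as a string.
    """
    for wait_ms in waits_ms:
        markdown = fetch(url, wait_ms)
        if PRICE_PATTERN.search(markdown):
            return markdown
    return None  # No price-like text even at the longest wait.
```

Each retry is another request, so keep the list of waits short. A None result is a signal to inspect the returned content by hand rather than to keep raising the wait.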
Step 2: Send the content to the model and ask for JSON
With clean markdown in hand, you describe the fields you want in a prompt and let the model do the extraction. The trick for a reliable pipeline is forcing JSON output. The OpenAI client supports a JSON response format, so set it and the model returns parseable JSON instead of prose wrapped in code fences. That single setting removes most of the brittleness people complain about with LLM extraction.
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])


def extract_fields(content):
    # Name the exact keys so the schema stays stable across runs.
    prompt = f"""You are a data extraction tool. From the Amazon product page content below,
extract the product title, current price, list price, currency, rating, review count, and availability.
Return only JSON with keys: title, price, list_price, currency, rating, reviews, availability.
Use null for any field not present.

CONTENT:
{content}
"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        # JSON mode: the model returns a single JSON object, not prose.
        response_format={"type": "json_object"},
    )
    return response.choices[0].message.content


raw_json = extract_fields(page_markdown)
print(raw_json)
A few things make this prompt work. It states the role ("data extraction tool") so the model stays terse, it names the exact keys you want so the schema is stable across runs, and it tells the model to use null for missing fields so a product without a list price does not derail the output. Passing markdown rather than raw HTML means the model spends its attention on content, not boilerplate. If you need a richer schema, list more keys and describe any that are ambiguous; the model handles nested objects and arrays without extra ceremony.
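If the schema grows, it can help to keep the keys and their descriptions in one place and generate the prompt from them. This is a minimal sketch under that assumption: build_prompt and FIELDS are hypothetical names, and the descriptions (including the coupon field) are examples to adapt, not fields the guide's script already returns.

```python
def build_prompt(fields: dict, content: str) -> str:
    """Build an extraction prompt from a {key: description} mapping."""
    field_lines = "\n".join(f"- {key}: {desc}" for key, desc in fields.items())
    keys = ", ".join(fields)
    return (
        "You are a data extraction tool. From the Amazon product page "
        "content below, extract these fields:\n"
        f"{field_lines}\n"
        f"Return only JSON with keys: {keys}. "
        "Use null for any field not present.\n\n"
        f"CONTENT:\n{content}\n"
    )


# Hypothetical field descriptions; adjust to the schema you need.
FIELDS = {
    "title": "product title",
    "price": "current price the buyer pays now, not the struck-through one",
    "list_price": "original or struck-through price, if shown",
    "coupon": "coupon text such as a percentage or amount off, if shown",
}
```

Describing the ambiguous keys, such as which of several prices counts as the current one, is where this pays off. The returned string can be passed as the user message in place of the hard-coded prompt.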
Step 3: Parse and save the structured result
Because you asked for a JSON response format, the response text should already be valid JSON. Parse it into a Python dict and write it to disk. Still wrap the parse in a try/except, so that a rare malformed or truncated response logs the raw text instead of crashing the run.
import json


def save_json(raw, path="amazon_price.json"):
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Log the raw text so a bad response can be inspected later.
        print("Model did not return valid JSON:")
        print(raw)
        return
    with open(path, "w") as f:
        json.dump(data, f, indent=2)
    print(f"Saved {path}")


save_json(raw_json)
The full script
Here is everything wired together into one runnable file. Fill in your two credentials in .env, change the URL to the product you care about, and adjust the prompt keys for whatever fields you need.
import os
import json

from dotenv import load_dotenv
from crawlbase import CrawlingAPI
from openai import OpenAI

# Read both credentials from the local .env file.
load_dotenv()

api = CrawlingAPI({"token": os.environ["CRAWLBASE_JS_TOKEN"]})
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

url = "https://www.amazon.com/dp/B0CHX1W1XY"


def fetch_markdown(target_url):
    # Step 1: render the page and return it as markdown.
    options = {"format": "markdown", "ajax_wait": "true", "page_wait": 3000}
    response = api.get(target_url, options)
    return response["body"].decode("utf-8")


def extract_fields(content):
    # Step 2: ask the model for a fixed set of keys as a JSON object.
    prompt = f"""You are a data extraction tool. From the Amazon product page content below,
extract the product title, current price, list price, currency, rating, review count, and availability.
Return only JSON with keys: title, price, list_price, currency, rating, reviews, availability.
Use null for any field not present.

CONTENT:
{content}
"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return response.choices[0].message.content


def main():
    markdown = fetch_markdown(url)
    raw = extract_fields(markdown)
    # Step 3: parse, save, and echo the structured result.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        print("Model did not return valid JSON:", raw)
        return
    with open("amazon_price.json", "w") as f:
        json.dump(data, f, indent=2)
    print(json.dumps(data, indent=2))


if __name__ == "__main__":
    main()
What the output looks like
Run it with python amazon_scraper.py and you get clean structured data written to amazon_price.json and echoed to the console.
{
  "title": "Echo Dot (5th Gen) Smart speaker with Alexa",
  "price": "$34.99",
  "list_price": "$49.99",
  "currency": "USD",
  "rating": "4.7",
  "reviews": "128,540",
  "availability": "In Stock"
}
Notice what you did not write: no CSS selectors, no XPath, no per-field parsing logic to untangle the deal price from the list price. You described the fields and the model found them. Point the same script at a different product and it adapts without code changes; point it at a search results page and the only change is a prompt that asks for a list of items instead of a single product. That is the real advantage of the AI data extraction approach over hand-tuned selectors. For the same pattern with a different model, see how to leverage Gemini AI for web scraping.
Scaling to many products
One page is a demo; a real price-monitoring job runs over a list of products. The shape stays the same: loop the URLs, fetch each through the Crawling API, extract with the model, and collect the rows. Keep two things in mind as you scale. The model bills per token, so sending markdown rather than full HTML keeps the cost down on every call. And the Crawling API manages its own throughput, so you do not have to run proxies or browser instances yourself.
# Reuses fetch_markdown, extract_fields, and json from the full script above.
urls = [
    "https://www.amazon.com/dp/B0CHX1W1XY",
    "https://www.amazon.com/dp/B09B8V1LZ3",
]

results = []
for u in urls:
    markdown = fetch_markdown(u)
    raw = extract_fields(markdown)
    try:
        row = json.loads(raw)
        row["url"] = u  # Keep the source URL next to the extracted fields.
        results.append(row)
    except json.JSONDecodeError:
        print(f"Skipped {u}: invalid JSON")

with open("prices.json", "w") as f:
    json.dump(results, f, indent=2)
If you are scraping Amazon specifically and repeatedly, it is worth comparing this against the Scraper API, which returns pre-parsed JSON for supported sites, Amazon included, without an LLM in the loop. It is faster and cheaper for that well-trodden case. The AI approach in this guide is the flexible fallback for odd layouts, one-off products, or fields no dedicated parser exposes. For why markdown is the right input shape for a model, see LLM-ready markdown for web scraping.
Limits worth knowing before you ship
The AI-plus-Crawlbase pipeline is flexible, but it is not the right hammer for every nail. Keep these in mind.
Token cost adds up. The model charges per token sent and received. Sending full HTML instead of markdown can multiply your bill for no benefit, so always trim the input. For very large pages, extract only the relevant section before the model call.
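Extracting only the relevant section can be as simple as keeping the lines near price-related keywords. The sketch below is one rough way to do that, assuming the fields you want sit close to such keywords in the markdown; relevant_section, KEYWORDS, and the default window sizes are hypothetical and will need tuning against real pages.

```python
import re

# Hypothetical keyword list; tune it to the fields you extract.
KEYWORDS = re.compile(r"price|\$|stars|ratings|in stock|unavailable", re.IGNORECASE)


def relevant_section(markdown: str, context: int = 2, keep_head: int = 5) -> str:
    """Keep the first few lines plus lines near any keyword match."""
    lines = markdown.splitlines()
    # The opening lines usually carry the title, so always keep them.
    keep = set(range(min(keep_head, len(lines))))
    for i, line in enumerate(lines):
        if KEYWORDS.search(line):
            lo = max(0, i - context)
            hi = min(len(lines), i + context + 1)
            keep.update(range(lo, hi))
    return "\n".join(lines[i] for i in sorted(keep))
```

Compare the trimmed and untrimmed extractions on a few products before relying on this, because a filter that is too aggressive can drop the very line that holds the price.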
It is slower than rule-based parsing. An LLM round-trip takes longer than a BeautifulSoup or Cheerio selector pass. For high-frequency, low-latency jobs like second-by-second price monitoring, a dedicated parser or the Scraper API wins. The AI approach shines when layouts vary or change often.
Models can be wrong. Amazon pages are dense and repetitive, so a model can occasionally grab a related product's price or confuse the deal price with the list price. Forcing JSON output and naming exact keys reduces this a lot, but for anything mission-critical, validate the parsed dict against an expected schema before trusting it.
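A schema check does not need a framework. The following is a minimal sketch using only the standard library, assuming the seven keys from this guide's prompt and US-style price strings such as "$1,299.00"; validate_row and parse_amount are hypothetical helper names, and the swapped-price rule is one example of a sanity check, not an exhaustive one.

```python
import re
from decimal import Decimal, InvalidOperation

# The keys requested in this guide's extraction prompt.
EXPECTED_KEYS = {
    "title", "price", "list_price", "currency",
    "rating", "reviews", "availability",
}


def parse_amount(text):
    """Turn a string like "$1,299.00" into a Decimal; None if absent or unparseable."""
    if text is None:
        return None
    cleaned = re.sub(r"[^\d.]", "", str(text))  # Assumes "." is the decimal mark.
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def validate_row(row: dict) -> list:
    """Return a list of problems; an empty list means the row passed."""
    problems = []
    missing = EXPECTED_KEYS - row.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    price = parse_amount(row.get("price"))
    list_price = parse_amount(row.get("list_price"))
    if price is None or price <= 0:
        problems.append("price is absent or not a positive number")
    if price is not None and list_price is not None and price > list_price:
        problems.append("price exceeds list_price; fields may be swapped")
    return problems
```

Call validate_row on the dict returned by json.loads and skip or flag any row whose problem list is not empty, rather than writing it straight to disk.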
For staying unblocked at volume, the Crawling API handles IP rotation and rendering for you. If you would rather route your own traffic through a rotating pool, the Smart AI Proxy gives you the same residential IP rotation as a drop-in proxy endpoint. Either way, the broader playbook lives in how to scrape websites without getting blocked, and the wider e-commerce context is in ecommerce web scraping.
Key takeaways
- Split the job. Crawlbase fetches and renders the Amazon page; the model extracts the price and related fields. Neither tool does the other's job, and that separation is what makes the pipeline reliable.
- Use the JS token and markdown format. The JS token renders Amazon's client-side prices; format: 'markdown' returns clean, low-token content that is ideal input for a model.
- Force JSON output. Set the response format to a JSON object and name your exact keys so the result is parseable every run.
- No selectors needed. You describe the fields in plain English, so the same script adapts across product layouts without rewriting extraction code.
- Stay on public data. Respect Amazon's ToS and robots.txt; no accounts, no order data, no auth bypass, and reach for the Scraper API when speed matters.
Frequently Asked Questions (FAQs)
Can an AI model scrape Amazon prices on its own?
Not the fetching part. A language model reads and structures content you give it, but it has no HTTP client, browser, or proxy pool, so it cannot open an Amazon URL or get past anti-bot defenses. You pair it with a fetching layer like the Crawling API, which renders the page and returns clean markdown; the model then extracts the price and other fields from that content.
Why convert the Amazon page to markdown before sending it to the model?
Markdown strips navigation, scripts, and styling noise, leaving the readable content. That lowers the token count you send to the model, which cuts cost and improves accuracy because the model spends its attention on the actual product details instead of boilerplate. The Crawling API can return markdown directly with format: 'markdown', so you do not need a separate conversion step.
Do I need the normal token or the JS token for Amazon?
Use the JS token. Amazon renders its prices client-side, so the normal token returns an empty shell and the model has nothing to extract. The JS token renders the page in a real browser first, so the price and availability are present when the content reaches the model.
Should I use the Scraper API instead for Amazon?
Often, yes. The Scraper API returns pre-parsed JSON for Amazon without an LLM in the loop, which is faster and cheaper for routine price monitoring. The AI approach in this guide is the flexible fallback for unusual layouts, one-off products, or fields the parser does not expose. Many teams use the Scraper API for the bulk and the AI pipeline for the edge cases.
Is it legal to scrape prices from Amazon?
It depends on Amazon's terms of service, your jurisdiction, and your purpose, and their terms restrict automated access. Keep strictly to public product data such as prices, titles, and ratings, respect robots.txt and rate expectations, and never touch accounts, order data, payment flows, or authentication. For commercial reuse, get permission or an official data agreement rather than relying on a scraper.
How do I avoid getting blocked while scraping Amazon prices?
Keep your per-IP request rate low, vary the products you fetch instead of hammering one URL, and route through rotating residential IPs so no single address trips a rate limit. The Crawling API manages rotation and a trusted IP pool for you; if you build your own stack, that is the part to invest in. Watch the status codes and back off when you start seeing challenges.
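Backing off on bad status codes can be sketched as a small retry wrapper with exponential delays. This is an illustration under assumptions, not a Crawlbase feature: fetch_with_backoff is a hypothetical helper, the delays are arbitrary starting points, and it assumes the fetch callable returns a dict with a "status_code" key.

```python
import random
import time
from typing import Callable, Optional


def fetch_with_backoff(
    fetch: Callable[[str], dict],
    url: str,
    max_attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Retry a fetch with exponential backoff while the status is not 200.

    `fetch` is any callable returning a dict with a "status_code" key,
    which is an assumption about the response shape.
    """
    for attempt in range(max_attempts):
        response = fetch(url)
        if response.get("status_code") == 200:
            return response
        if attempt < max_attempts - 1:
            # Roughly 2s, 4s, 8s, plus jitter so parallel workers do not retry in step.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
    return None  # Give up after max_attempts and let the caller decide.
```

The sleep parameter is injectable so the wrapper can be tested without waiting; in normal use, leave it at the default.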
