Turning a web page into clean, structured data is two jobs, not one. First you have to get the page, which sounds trivial until the target serves you a CAPTCHA or an empty shell. Then you have to read the markup and pull out the fields you actually want, which is the part that traditionally means hand-writing brittle parsers for every site. This guide pairs two tools that each handle one half: the Crawlbase Crawling API collects the page, and Perplexity AI interprets it into JSON.

The build is a small, runnable Python script for perplexity ai web scraping python: fetch the HTML through the Crawling API, trim it down to the part that matters, convert it to Markdown to save tokens, then hand it to Perplexity's API with a prompt that says exactly what to extract. The important thing to keep straight is the division of labor. Perplexity does not crawl the site in this flow. It reads the text you give it. The fetching, rendering, and block-avoidance all happen in the Crawling API step.

Why use Perplexity AI for web scraping

Classic Python scraping leans on requests and BeautifulSoup: you fetch the HTML, then write selectors that walk the DOM to the fields you want. That works fine on tidy, stable pages. It falls apart when the markup is deeply nested, inconsistent across listings, or rewritten every few weeks, because each change means rewriting selectors.

An LLM like Perplexity changes the second half of that equation. Instead of telling it where the price lives in the DOM, you tell it what you want ("the product title, the price, and a one-line summary") and it reads the content the way a person would. It is good at pulling structure out of messy text and returning it as JSON, which is exactly the shape you want for a pipeline. This is the same idea behind AI data extraction generally, and Perplexity's Sonar models add web grounding on top.

What Perplexity is and is not doing here

Perplexity's API is OpenAI-compatible, so you talk to it with the same openai client and a different base URL. In this scraper it plays exactly one role: read the page text we collected and return structured fields. It is not your crawler, it is not avoiding blocks, and it is not your proxy. Keep that boundary clear and the architecture stays simple: Crawlbase gets the bytes, Perplexity makes sense of them.

Set up your Python environment

You need Python 3.8 or newer. Create a virtual environment so this project's dependencies stay isolated, then activate it.

bash
python -m venv perplexity_env

# Windows
perplexity_env\Scripts\activate

# macOS / Linux
source perplexity_env/bin/activate

Now install the four libraries the script uses.

bash
pip install crawlbase beautifulsoup4 markdownify openai
  • crawlbase the client for the Crawling API, which fetches and renders the page.
  • beautifulsoup4 trims the HTML down to the relevant section before you spend tokens on it.
  • markdownify converts that section to Markdown so the model gets clean text, not tag soup.
  • openai the OpenAI-compatible client Perplexity's API speaks.

You also need two keys. Get a Crawlbase token from the dashboard after signing up, and a Perplexity API key from your Perplexity account settings. Keep both out of source control, in environment variables or a secrets file, and never paste them into shared code.

Normal token vs JS token

Crawlbase issues two tokens. The normal token returns static HTML; the JavaScript (JS) token renders the page in a real browser first. If your target builds its content client-side (most modern stores and dashboards do), use the JS token, or the page comes back as an empty shell. For a server-rendered page the normal token is faster and cheaper.

Step 1: Fetch the page with the Crawling API

This is the collection step, and it is where blocks get handled. You send the Crawling API a URL; it routes the request through rotating residential IPs, optionally renders the JavaScript, deals with the CAPTCHAs and challenges that would stop a plain requests.get, and returns the finished HTML. You never touch a proxy pool or a headless browser yourself.

Save this as crawl.py. We use an Amazon product page as the example target.

python
from crawlbase import CrawlingAPI

api = CrawlingAPI({'token': 'YOUR_CRAWLBASE_JS_TOKEN'})

def crawl(url: str) -> str:
    response = api.get(url, {'ajax_wait': 'true', 'page_wait': 3000})
    if response['status_code'] != 200:
        raise RuntimeError(f'Crawl failed: {response["status_code"]}')
    return response['body'].decode('utf-8')

if __name__ == '__main__':
    url = 'https://www.amazon.com/Art-War-DELUXE-Sun-Tzu/dp/9388369696'
    html = crawl(url)
    with open('output.html', 'w', encoding='utf-8') as f:
        f.write(html)
    print('Saved output.html')

Run it with python crawl.py. You get an output.html with the real product markup in it, not the blocked or empty page a direct request often returns. The ajax_wait and page_wait options tell the renderer to wait for asynchronous content; bump page_wait up if results come back thin. That is the entire point of using the Crawling API here: it is the layer that gets you a usable page in the first place, which is also what makes the AI step possible.

Crawlbase Crawling API

The AI step only works if you can get the page at all. The Crawling API takes a token, renders the page in a real browser when you need it, rotates through residential IPs server-side, and returns finished HTML, so you skip running a proxy pool and a headless fleet yourself. Point it at a public page on the free tier and feed the result straight into Perplexity.

Step 2: Trim the HTML and convert it to Markdown

A full product page is hundreds of kilobytes of nav bars, scripts, and footers. Sending all of that to Perplexity is slow and wasteful, since LLM pricing is per token and most of those tokens are noise. Two cheap steps fix it: use BeautifulSoup to grab only the section you care about, then convert that section to Markdown so the model reads clean prose instead of tags.

Save this as parse.py.

python
from bs4 import BeautifulSoup
from markdownify import markdownify as md

def html_to_markdown(html: str) -> str:
    soup = BeautifulSoup(html, 'html.parser')
    element = soup.find(id='centerCol') or soup.body
    if element is None:
        raise ValueError('Could not find content section in HTML')
    return md(str(element))

The centerCol id is the main product column on this Amazon page; for a different site, inspect the live page in your browser's dev tools and target whatever container holds the fields you want. The or soup.body fallback keeps the script from crashing if that id is absent. Selectors drift over time, so treat the target container as something you will revisit, not a permanent contract.

Step 3: Write the extraction prompt

The prompt is where you tell Perplexity what to pull and how to shape it. Be specific about the fields, and ask for JSON only so the response is easy to parse downstream. A system message sets the role; the user message carries the instructions plus the Markdown.

python
def build_prompt(markdown: str) -> list:
    return [
        {
            'role': 'system',
            'content': 'You extract structured data from product pages. Reply with JSON only, no prose.',
        },
        {
            'role': 'user',
            'content': (
                'Extract these fields from the Markdown:\n'
                '- title\n'
                '- price\n'
                '- rating\n'
                '- one_sentence_summary\n\n'
                f'Markdown:\n{markdown}\n\n'
                'Respond with a single JSON object.'
            ),
        },
    ]

Clear field names and an explicit "JSON only" instruction do most of the work. If you need a strict shape, name every key you expect and the model will follow it closely.

Step 4: Call Perplexity and assemble the scraper

Now wire it together. The openai client points at https://api.perplexity.ai, you send the prompt to a Sonar model, and you parse the JSON it returns. Save this as scraper.py.

python
import json
from openai import OpenAI
from crawl import crawl
from parse import html_to_markdown
from build_prompt import build_prompt

URL = 'https://www.amazon.com/Art-War-DELUXE-Sun-Tzu/dp/9388369696'

client = OpenAI(
    api_key='YOUR_PERPLEXITY_API_KEY',
    base_url='https://api.perplexity.ai',
)

def scrape(url: str) -> dict:
    html = crawl(url)
    markdown = html_to_markdown(html)
    messages = build_prompt(markdown)

    response = client.chat.completions.create(
        model='sonar-pro',
        messages=messages,
    )
    content = response.choices[0].message.content
    return json.loads(content)

if __name__ == '__main__':
    data = scrape(URL)
    print(json.dumps(data, indent=2))

Run it with python scraper.py after dropping in your two keys. The sonar-pro model is Perplexity's stronger Sonar tier; sonar is cheaper and fine for simple extraction. Both speak the OpenAI chat-completions format, so swapping models is a one-line change.

What the output looks like

The result is a clean JSON object you can write to a database, a CSV, or the next stage of a pipeline.

json
{
  "title": "The Art of War (Deluxe Hardbound Edition)",
  "price": "$15.80",
  "rating": "4.7 out of 5",
  "one_sentence_summary": "An ancient Chinese treatise by Sun Tzu on strategy, planning, and adapting tactics to win conflicts."
}

Note one quirk worth handling in production: Perplexity's Sonar models are web-grounded and sometimes append citation markers like [1] to text, or wrap JSON in a code fence. If json.loads raises, strip a leading and trailing fence before parsing, or instruct the model more firmly to return raw JSON. A short cleanup step keeps the parser happy.

Challenges and limits to keep in mind

Pairing an LLM with a crawler is powerful, but it is not free of trade-offs. Cost scales with tokens, so the trim-to-Markdown step matters more as you scale; do not send whole pages. Latency is higher than a hand-written parser, since you are waiting on a model round-trip per page, which is fine for hundreds of pages and worth rethinking for millions. And LLMs can occasionally mislabel a field or hallucinate a value, so validate critical fields (cast prices to numbers, check required keys) rather than trusting the output blind.

For very high-volume, fixed-schema jobs where you already know the exact fields, a deterministic parser, or Crawlbase's own Crawling API with its ready-made parsers, can be cheaper and faster than an LLM. The AI approach earns its keep when pages are varied, messy, or change often. If you want to compare AI providers for this kind of work, leveraging Gemini AI for web scraping walks the same pattern with a different model.

Keep the collection step unblocked

Everything above assumes Step 1 actually returns a real page. On hard commercial targets, that assumption is where scrapers fail, so a few habits keep the collection layer healthy.

  • Use the JS token when the page is client-rendered. The normal token returns the pre-render shell on those sites, and then there is nothing for Perplexity to read.
  • Lean on rotation. The Crawling API and the Smart AI Proxy route requests through rotating residential IPs so no single address trips a rate limit. If you build your own stack, this is the part to get right.
  • Pace your requests and read the status codes. Spread requests out, vary parameters, and treat a rising rate of challenges as a signal to back off, not push harder.

For the full playbook on this, see how to scrape websites without getting blocked. The short version: let the Crawling API own collection and block-avoidance, and let Perplexity own interpretation. Keep those two responsibilities separate and the system is easy to reason about.

Recap

Key takeaways

  • Two tools, two jobs. The Crawling API collects and renders the page behind a trusted IP; Perplexity reads the result and returns structured JSON. Perplexity is not your crawler.
  • Trim before you prompt. Use BeautifulSoup to grab the relevant section and Markdown to clean it, so you spend tokens on content, not nav bars and scripts.
  • Perplexity speaks OpenAI. Use the openai client with base URL https://api.perplexity.ai and a Sonar model like sonar-pro.
  • Prompt for JSON only. Name the fields explicitly and validate critical ones; Sonar can append citation markers or fences, so add a small cleanup step.
  • Block-avoidance lives in Step 1. Use the JS token for client-rendered pages, lean on rotation, and pace requests so the page comes back at all.

Frequently Asked Questions (FAQs)

Does Perplexity AI scrape the website itself?

No, not in this workflow. Perplexity reads text you give it and returns structured data; it does not fetch the target page, render JavaScript, or handle blocks. The Crawling API does all of the collection, including rotating residential IPs and rendering, and then you pass its HTML or Markdown to Perplexity for interpretation. Keeping that boundary clear is the key to understanding the architecture.

Why convert the HTML to Markdown before sending it to Perplexity?

Two reasons: cost and quality. A full HTML page is mostly nav, scripts, and styling that waste tokens, and LLM pricing is per token. Trimming to the relevant section with BeautifulSoup and converting to Markdown gives the model clean prose to read, which both lowers cost and improves extraction accuracy because there is less noise to wade through.

Which Perplexity model should I use, sonar or sonar-pro?

Start with sonar for simple, well-structured pages; it is cheaper and usually accurate enough. Move to sonar-pro when extraction quality matters or the content is dense and varied. Both use the OpenAI-compatible chat-completions format, so switching is a one-line change to the model argument.

Do I need the normal Crawlbase token or the JS token?

It depends on the target. Use the normal token for server-rendered pages where the HTML already contains the data. Use the JS token when the site builds its content client-side, which is most modern stores and apps, because the normal token would return the empty pre-render shell and leave Perplexity with nothing to extract.

The JSON parse keeps failing. What is wrong?

Perplexity's Sonar models are web-grounded and can wrap output in a code fence or append citation markers like [1], which breaks json.loads. Strip any leading and trailing fence before parsing, tighten the prompt to insist on raw JSON with no extra text, and validate the parsed object before using it. A small cleanup step makes the pipeline reliable.

Can I use Perplexity AI and Crawlbase together at scale?

Yes, and they complement each other well. Crawlbase handles collection and block-avoidance so your requests keep landing, and Perplexity turns each page into structured data. For very high volume with a fixed schema, weigh the per-page LLM cost against a deterministic parser or the Scraper API; the AI approach shines when pages are messy or change often, while fixed-schema jobs may be cheaper without an LLM.

Start Building

Crawl any site at scale, without fighting infrastructure.

Crawlbase handles proxies, fingerprints, and CAPTCHAs so your team ships data pipelines instead of maintaining crawl plumbing. 1,000 requests free, no card required.

Self-serve · No sales call required · Enterprise crawl volumes available