Web scraping is a powerful method for extracting data from websites, but transforming messy HTML into clean, structured information presents a significant challenge. That’s where Perplexity AI comes in. With AI, you can extract data faster and more accurately.

In this blog, we’ll show you how to use Perplexity AI for web scraping in Python. You’ll learn how to fetch HTML content, convert it to Markdown for better readability, and use AI to extract the data you need. We’ll also show you how Crawlbase Smart Proxy helps you avoid blocks and captchas while scraping protected websites. You can sign up now and get 5,000 free credits.

This blog is for developers, analysts, or anyone who wants to scrape the web smarter.

📚 Table of Contents

  1. Why Use Perplexity AI for Web Scraping?
  2. Setting Up Your Python Environment
  3. Step-by-Step Guide to Using Perplexity AI for Web Scraping
  4. Challenges and Limitations of Perplexity AI in Web Scraping
  5. Avoid Getting Blocked: Use Crawlbase Smart Proxy
  6. Final Thoughts
  7. Frequently Asked Questions

Why Use Perplexity AI for Web Scraping?

Traditional web scraping uses Python libraries such as requests and BeautifulSoup to extract data from a website’s HTML. This works well for simple sites but becomes challenging when the HTML is messy or complex.

That’s where Perplexity AI comes in.

Perplexity AI is a smart tool that understands natural language and can find structured data inside raw HTML content. When you combine it with web scraping, it’s easier to extract clean, helpful, and organized data.

Benefits of Perplexity AI for Scraping:

  • Extracts data from complex web pages
  • Reduces time spent on writing custom parsing logic
  • Works with HTML converted to Markdown, which makes data extraction more accurate
  • Returns structured output like JSON

By using Perplexity AI for web scraping in Python, you’ll scrape faster, smarter, and more efficiently.

Setting Up Your Python Environment

Before using Perplexity AI for web scraping, we need to prepare our Python environment. This setup ensures that everything runs smoothly and helps avoid errors later on.

Install Python

If you haven’t already, install Python from the official website. Python is the primary language we’ll use to send requests, process web data, and talk to the Perplexity API.

Create a Virtual Environment

A virtual environment keeps your project dependencies organized and avoids conflicts with other Python projects.

Open your terminal or command prompt and run:

python -m venv perplexity_env

Then activate the environment:

  • Windows:
    perplexity_env\Scripts\activate
  • macOS/Linux:
    source perplexity_env/bin/activate
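
If you want to confirm that the environment is active before installing anything, a quick check along these lines can help. This is a minimal sketch: the function name in_virtualenv is our own illustrative choice, and it relies on the standard behavior that sys.prefix differs from sys.base_prefix inside a venv.

```python
import sys


def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Virtual environment active:", in_virtualenv())
```

Run it with the same python command you plan to use for the scraper; if it prints False, activate the environment first.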

Install Required Libraries

Now, let’s install the Python packages we need:

pip install beautifulsoup4 markdownify openai requests

  • beautifulsoup4: to parse HTML
  • markdownify: to convert HTML to markdown
  • openai: to connect with the Perplexity API (uses OpenAI-compatible format)
  • requests: to send HTTP requests
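
To double-check the installation, you could run a small script like the one below. It is only a sketch: missing_modules and REQUIRED_MODULES are hypothetical names introduced here, and the check simply asks Python whether each module can be located without importing it.

```python
from importlib.util import find_spec

# Import names, which can differ from the pip package names
# (the beautifulsoup4 package is imported as bs4).
REQUIRED_MODULES = ["bs4", "markdownify", "openai", "requests"]


def missing_modules(modules=REQUIRED_MODULES) -> list[str]:
    # find_spec returns None when a top-level module cannot be found.
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print(f"Missing modules: {', '.join(missing)}")
    else:
        print("All required modules are installed.")
```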

Setup Perplexity API Access

To use Perplexity for web scraping, you need an API key. Perplexity offers an OpenAI-compatible API, which means you can use the same openai client library and request format you would use with OpenAI’s GPT models.

Here’s how to set it up:

  1. Get your API key from your Perplexity account dashboard.
  2. Set your API key in your code like this:
from openai import OpenAI

client = OpenAI(
    api_key="<perplexity.ai API KEY>",
    base_url="https://api.perplexity.ai"
)

Ensure that you keep your API key secure and never share it publicly in code.
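
One common way to keep the key out of your source files is to read it from an environment variable. The sketch below assumes a variable named PERPLEXITY_API_KEY and a helper called load_api_key; both names are our own choices for illustration, not something Perplexity requires.

```python
import os


def load_api_key(env_var: str = "PERPLEXITY_API_KEY") -> str:
    # The variable name is an assumption; use whatever name you export.
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the scraper."
        )
    return api_key
```

You could then create the client with OpenAI(api_key=load_api_key(), base_url="https://api.perplexity.ai") instead of pasting the key into the script.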

Step-by-Step Guide to Using Perplexity AI for Web Scraping

In this section, we’ll show you how to build a Python web scraper using Perplexity AI. You’ll learn how to scrape a web page, clean the content, convert it to Markdown, and use Perplexity AI to extract the data. We’ll use BeautifulSoup to select only the necessary part of the page, avoiding extra HTML that could increase costs by using more tokens.

We’ll use the following URL as an example:

https://www.amazon.com/Art-War-DELUXE-Sun-Tzu/dp/9388369696/ref=sr_1_1

Send Requests and Parse HTML

To begin, we’ll send an HTTP request to the target website and retrieve its HTML content. Save the following Python code in a file named crawl.py:

from requests.exceptions import RequestException
from urllib3.exceptions import InsecureRequestWarning
import requests

# Suppress only the single warning from urllib3 needed.
requests.packages.urllib3.disable_warnings(category=InsecureRequestWarning)

# Browser-like headers so the request looks less like an automated client.
HEADERS = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.0.0 Safari/537.36'
}

def crawl(url) -> str:
    try:
        response = requests.get(
            url,
            headers=HEADERS,
            verify=False,  # skips TLS certificate verification (hence the suppressed warning above)
            timeout=30
        )
        # Raise an HTTPError for 4xx/5xx responses.
        response.raise_for_status()

        return response.text

    except RequestException as error:
        print(f"\nFailed to crawl url '{url}': {error}\n")
        raise

if __name__ == "__main__":

    html_content = crawl("https://www.amazon.com/Art-War-DELUXE-Sun-Tzu/dp/9388369696/ref=sr_1_1")
    with open('output.html', 'w', encoding='utf-8') as file:
        file.write(html_content)

Run the script using the following command:

python crawl.py

Upon execution, it will generate an output file named output.html.

Amazon book page browser output

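Note:
At times, Amazon blocks automated requests. When that happens, the returned page lacks the element our parser expects, and the parsing step in the next section fails with the following error:

Failed to parse html to markdown: 'NoneType' object has no attribute 'text'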

If this occurs, opening output.html in the browser may show an unexpected or empty result, as illustrated below:

Amazon book page browser output

This is a common issue with websites that employ bot protection. To address it, you can use HTTP headers that mimic a real browser, or adopt more advanced solutions such as Crawlbase Smart Proxy, which will be discussed later.
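
Before handing a page to the parser, it can help to check whether the response looks like a block page. The sketch below is a deliberately crude string check: looks_blocked is a hypothetical helper, and the default marker assumes the product page contains an element with id="centerCol" written exactly that way, which is the container this tutorial parses.

```python
def looks_blocked(html: str, required_marker: str = 'id="centerCol"') -> bool:
    """Return True when the page lacks the container the parser relies on."""
    if not html.strip():
        # An empty body is never a usable product page.
        return True
    # Plain substring check; it does not parse the HTML.
    return required_marker not in html


if __name__ == "__main__":
    print(looks_blocked("<html><body>Enter the characters you see below</body></html>"))
    print(looks_blocked('<div id="centerCol">The Art of War</div>'))
```

A check like this lets the script stop early with a clear message instead of failing later inside the parsing step.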

Convert HTML to Markdown for AI Processing

Perplexity AI performs better when we send clean, simplified text instead of raw HTML. To achieve this, we’ll use the markdownify library to convert HTML into Markdown. Sending only the relevant section reduces token usage and improves the quality of AI responses.

We will be parsing the HTML content using BeautifulSoup. Save the following code in a file named parse.py:

from crawl import crawl
from bs4 import BeautifulSoup
from markdownify import markdownify as md

def parse_html_into_markdown_format(html_content) -> str:
    try:
        soup = BeautifulSoup(html_content, "html.parser")
        # Keep only the main product column to reduce token usage.
        # find() returns None when the page has no element with this id
        # (for example, a block page), which raises the AttributeError caught below.
        element = soup.find(id='centerCol')

        # .text drops the HTML tags, so markdownify receives plain text;
        # pass str(element) instead to keep headings, links, and lists as Markdown.
        return md(str(element.text))
    except Exception as error:
        print(f"\nFailed to parse html to markdown: {error}\n")
        raise

if __name__ == "__main__":

    html_content = crawl("https://www.amazon.com/Art-War-DELUXE-Sun-Tzu/dp/9388369696/ref=sr_1_1")

    markdown_content = parse_html_into_markdown_format(html_content)
    with open('output.md', 'w', encoding='utf-8') as file:
        file.write(markdown_content)

Now, run the script using the command below:

python parse.py

This will generate an output file named output.md. When viewed with a Markdown previewer, it will appear as follows:

Amazon book page browser output

The clean Markdown format makes it easier for tools like Perplexity AI to process the content effectively, allowing them to focus on the most relevant information.
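
To get a feel for how much smaller the cleaned content is, you can compare rough size estimates before and after stripping the markup. The sketch below uses only the standard library's html.parser as a stand-in for the BeautifulSoup step, and the four-characters-per-token figure is a crude assumption, not a real tokenizer. The names VisibleText, visible_text, and rough_token_count are hypothetical.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skipping += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skipping:
            self._skipping -= 1

    def handle_data(self, data):
        if not self._skipping and data.strip():
            self.parts.append(data.strip())


def visible_text(html: str) -> str:
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def rough_token_count(text: str, chars_per_token: int = 4) -> int:
    # Crude heuristic (assumption): roughly four characters per token.
    return len(text) // chars_per_token


sample = (
    '<div id="centerCol"><script>var x = 1;</script>'
    '<h1 class="title">The Art of War</h1>'
    '<span class="price">$15.80</span></div>'
)

if __name__ == "__main__":
    cleaned = visible_text(sample)
    print("raw estimate:", rough_token_count(sample))
    print("cleaned estimate:", rough_token_count(cleaned))
```

Even on this tiny sample the cleaned text is a fraction of the raw markup; on a full product page the gap is usually much larger.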

Formatting AI Prompts

To achieve the best results with Perplexity AI, provide clear instructions (prompts). These prompts help the AI understand what you want to extract.

Here’s an example prompt:

prompt = [
    {
        "role": "system",
        "content": "You are a helpful assistant that summarizes an Amazon product book page."
    },
    {
        "role": "user",
        "content": (
            "Extract the following details from the Markdown:\n"
            "- 1 sentence summary\n"
            "- Search the web for recommended reading\n"
            "- Prices\n\n"
            f"Markdown:\n{markdown_content}\n\n"
            "Respond only with extracted data in JSON format."
        ),
    },
]

This prompt instructs the AI on exactly what to extract from the content.
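
If you plan to reuse the prompt for different pages or fields, it may be convenient to build it with a small function. The following is a sketch only; build_prompt and its fields parameter are hypothetical names, and the wording of the messages is copied from the example prompt.

```python
def build_prompt(markdown_content: str, fields: list[str]) -> list[dict]:
    # Turn each requested field into a bullet line for the instructions.
    field_lines = "\n".join(f"- {field}" for field in fields)
    return [
        {
            "role": "system",
            "content": "You are a helpful assistant that summarizes an Amazon product book page.",
        },
        {
            "role": "user",
            "content": (
                "Extract the following details from the Markdown:\n"
                f"{field_lines}\n\n"
                f"Markdown:\n{markdown_content}\n\n"
                "Respond only with extracted data in JSON format."
            ),
        },
    ]


if __name__ == "__main__":
    messages = build_prompt("# The Art of War", ["1 sentence summary", "Prices"])
    print(messages[1]["content"])
```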

Feed to AI for Analysis

Now, let’s send this prompt to Perplexity AI using their OpenAI-compatible API:

Save the following code in a file named perplexity_ai_powered_scraper.py:

from crawl import crawl
from parse import parse_html_into_markdown_format
from openai import OpenAI
import json

URL = "https://www.amazon.com/Art-War-DELUXE-Sun-Tzu/dp/9388369696/ref=sr_1_1"

# Fetch the page and reduce it to the text we want the model to read.
html_content = crawl(URL)
markdown_content = parse_html_into_markdown_format(html_content)

prompt = [
    {
        "role": "system",
        "content": "You are a helpful assistant that summarizes an Amazon product book page."
    },
    {
        "role": "user",
        "content": (
            "Extract the following details from the Markdown:\n"
            "- 1 sentence summary\n"
            "- Search the web for recommended reading\n"
            "- Prices\n\n"
            f"Markdown:\n{markdown_content}\n\n"
            "Respond only with extracted data in JSON format."
        ),
    },
]

client = OpenAI(api_key="<perplexity.ai API KEY>", base_url="https://api.perplexity.ai")

# Send chat completion request
response = client.chat.completions.create(
    model="sonar-pro",
    messages=prompt,
)

# Parse the model's reply as JSON
scraped_data = json.loads(response.choices[0].message.content)

print(json.dumps(scraped_data, indent=2))

Make sure to replace <perplexity.ai API KEY> with the API key you obtained earlier, then run the code using the command below:

python perplexity_ai_powered_scraper.py

This prints JSON output similar to the following:

{
  "1_sentence_summary": "The Art of War is an ancient Chinese military treatise by Sun Tzu that emphasizes strategic planning, understanding both oneself and the enemy, and using adaptable tactics to achieve victory in conflict and beyond[1][3][4].",
  "recommended_reading": [
    "On War by Carl von Clausewitz",
    "The Book of Five Rings by Miyamoto Musashi",
    "Leadership and Strategy from Sun Tzu and Other Masters by William A. Cohen",
    "The 33 Strategies of War by Robert Greene",
    "The Prince by Niccol\u00f2 Machiavelli"
  ],
  "prices": [
    {
      "format": "Hardcover (Deluxe Hardbound Edition)",
      "price": "$15.80"
    }
  ]
}
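
Model replies are not guaranteed to be bare JSON. A reply may arrive wrapped in a Markdown code fence, and, as the summary above shows, text fields can carry citation markers such as [1][3][4]. The sketch below shows one way to tolerate both; parse_json_response and strip_citations are hypothetical helpers, and the fence handling is an assumption about how a reply might be wrapped.

```python
import json
import re

# Matches a reply wrapped in a Markdown code fence (three backticks),
# optionally followed by a language tag such as "json".
FENCE_RE = re.compile(r"^\s*`{3}[a-zA-Z]*\s*\n(.*?)\n?\s*`{3}\s*$", re.DOTALL)
# Matches numeric citation markers such as [1] or [12].
CITATION_RE = re.compile(r"\[\d+\]")


def parse_json_response(content: str) -> dict:
    match = FENCE_RE.match(content)
    if match:
        # Keep only the text between the fences.
        content = match.group(1)
    return json.loads(content.strip())


def strip_citations(value):
    # Walk strings, lists, and dicts; leave other values untouched.
    if isinstance(value, str):
        return CITATION_RE.sub("", value)
    if isinstance(value, list):
        return [strip_citations(item) for item in value]
    if isinstance(value, dict):
        return {key: strip_citations(item) for key, item in value.items()}
    return value
```

In the scraper, you could swap the json.loads call for strip_citations(parse_json_response(response.choices[0].message.content)).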

You can view the complete source code on GitHub.

Challenges and Limitations of Perplexity AI in Web Scraping

While Perplexity AI offers powerful features for web scraping, it does come with some challenges:

Images showing the challenges and limitations of Perplexity AI in web scraping

Understanding these limitations helps you maximize the benefits of Perplexity AI for web scraping while minimizing potential issues.

Avoid Getting Blocked: Use Crawlbase Smart Proxy

Websites often block automated requests, which makes it harder to collect the HTML you want Perplexity AI to process. Crawlbase Smart Proxy solves this by rotating IP addresses and bypassing CAPTCHAs, allowing you to scrape websites without being blocked.

Why Use Crawlbase Smart Proxy with Perplexity AI?

  1. Bypass IP Blocks: Rotates IP addresses to avoid detection.
  2. Solve CAPTCHAs: Automatically handles CAPTCHAs, so you don’t have to.
  3. Save Time: No need to manage proxy servers—Crawlbase does it all.
  4. Clean HTML: Returns ready-to-use HTML for Perplexity AI.

Example Code:

from requests.exceptions import RequestException
from urllib3.exceptions import InsecureRequestWarning
import requests

# Suppress only the single warning from urllib3 needed.
requests.packages.urllib3.disable_warnings(category=InsecureRequestWarning)

HEADERS = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.0.0 Safari/537.36'
}

def crawl_with_smart_proxy(url) -> str:
    # Replace <Private token> with your Crawlbase token.
    proxy_url = "http://<Private token>:@smartproxy.crawlbase.com:8012" # Use https:// and port 8013 for HTTPS
    # Route both HTTP and HTTPS traffic through the proxy.
    proxies = {
        "http": proxy_url,
        "https": proxy_url
    }

    try:
        response = requests.get(
            url,
            headers=HEADERS,
            proxies=proxies,
            verify=False,
            timeout=30
        )
        response.raise_for_status()

        return response.text

    except RequestException as error:
        print(f"\nFailed to crawl url '{url}': {error}\n")
        raise

With Crawlbase Smart Proxy, you can scrape websites safely, bypass blocks, and get clean data for processing with Perplexity AI.
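
If you would rather use the proxy only when a direct request fails, you can try several fetch functions in order. The sketch below is one possible arrangement, not part of Crawlbase's tooling: fetch_with_fallback is a hypothetical helper that takes any callables with the same shape as crawl and crawl_with_smart_proxy, plus a function that decides whether a page looks blocked.

```python
from typing import Callable, Iterable


def fetch_with_fallback(
    url: str,
    fetchers: Iterable[Callable[[str], str]],
    is_blocked: Callable[[str], bool],
) -> str:
    last_error = None
    for fetch in fetchers:
        try:
            html = fetch(url)
        except Exception as error:
            # Remember the failure and move on to the next fetcher.
            last_error = error
            continue
        if not is_blocked(html):
            return html
    raise RuntimeError(
        f"All fetchers failed or were blocked for '{url}'"
    ) from last_error
```

For this tutorial's page, a call could look like fetch_with_fallback(URL, [crawl, crawl_with_smart_proxy], lambda html: 'id="centerCol"' not in html), so the proxy is used only when the direct request errors out or returns a page without the product column.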

Final Thoughts

Using Perplexity AI for web scraping in Python can make your scraping tasks faster, smarter, and more accurate. By converting raw HTML to Markdown and using AI to extract structured data, you can streamline your process and save time.

However, scraping websites can be challenging, especially when encountering blocks and CAPTCHAs. That’s where Crawlbase Smart Proxy comes in. It helps you avoid IP blocks and solves CAPTCHAs, allowing you to scrape websites without interruptions. This combination of Perplexity AI and Crawlbase Smart Proxy makes web scraping more efficient and scalable, allowing you to obtain the data you need without being blocked.

Frequently Asked Questions

Q. What is Perplexity AI, and how does it help with web scraping?

Perplexity AI is a tool that uses natural language processing to help you extract structured data from web content. In this workflow, you convert messy HTML into readable Markdown and then let the AI extract the key details. This saves you time and improves data extraction accuracy.

Q. How does Crawlbase Smart Proxy prevent my scraper from getting blocked?

Crawlbase Smart Proxy rotates IP addresses and solves CAPTCHAs, making it appear as if a real user is browsing the site. It helps avoid IP blocks and lets you scrape websites without being detected as a bot. It is a reliable way to keep your scraping tasks running.

Q. Can I use Perplexity AI and Crawlbase Smart Proxy together?

Yes! Using Perplexity AI for data extraction and Crawlbase Smart Proxy for bypassing blocks and CAPTCHAs is a killer combo. Crawlbase enables seamless access to the website, and Perplexity AI facilitates the cleaning and processing of data.