Baidu, often called the “Chinese Google”, is the most widely used search engine in China. If you’re interested in market trends, SEO research, or simply what’s popular in China, getting data from Baidu is almost a must.

So in this blog, we’ll show you an easy way to scrape Baidu using Python and Crawlbase. With these tools, you no longer need to build and manage proxies manually; a simple script can do the job.

Sign up for Crawlbase now and get free credits to try it out.


Scraping Baidu Search Results: Overview

We’ll combine two main tools to extract data from Baidu, China’s leading search engine: Crawlbase and BeautifulSoup.

Crawlbase’s Crawling API helps you avoid blocks when crawling websites. It employs all the necessary techniques, such as rotating IPs, handling CAPTCHAs, and simulating real browsers. This means you don’t need to build a complex scraper that mimics human behavior: you just send the URL to Crawlbase and receive the complete HTML response.

Once we’ve got the HTML in hand, we’ll turn to BeautifulSoup for the scraping. It’s a Python library that takes the mess out of parsing web pages, so finding and pulling out exactly the info you want is quick and easy. You can pick out whatever tags or classes you need using familiar CSS selectors, which keeps the script simple and easy to read.
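
To get a feel for it before we build the real scraper, here’s a tiny standalone example (the HTML string is made up for illustration):

from bs4 import BeautifulSoup

# A tiny illustration: CSS-style selectors pull elements out of raw HTML.
html = '<div class="result"><h3><a href="https://example.com">Hello</a></h3></div>'
soup = BeautifulSoup(html, 'html.parser')

link = soup.select_one('div.result h3 a')
print(link.text, link['href'])  # -> Hello https://example.com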

We will focus on three key areas of the Baidu search results page with the scraper we’re going to build. Let’s look at a sample Baidu URL, such as https://www.baidu.com/s?ie=utf-8&wd=苹果%20iPhone (the one we’ll use later in this tutorial).

Open the page in your browser and inspect the following elements (right-click → Inspect):

  • Search Results: You can find this inside a div with the class title-box_4YBsj, followed by an h3 tag with the class t.
An image of the Chrome Browser Developer Tools with the search result element highlighted.
  • Related Searches: These are found inside a table with the class rs-table_3RiQc, navigating through its tr and td tags to get the links.
An image of the Chrome Browser Developer Tools with the related searches element highlighted.
  • Pagination: To load more results, we increase the pn query parameter in the URL in multiples of 10 (pn=10, pn=20, and so on); see the sketch after this list.
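
For instance, here’s a minimal sketch of building paginated URLs this way; baidu_page_url is just an illustrative helper, not part of the scraper we build below:

from urllib.parse import quote

def baidu_page_url(query: str, page: int) -> str:
    # page 0 -> pn=0, page 1 -> pn=10, page 2 -> pn=20, ...
    return f"https://www.baidu.com/s?ie=utf-8&wd={quote(query)}&pn={page * 10}"

for page in range(3):
    print(baidu_page_url("苹果 iPhone", page))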

Setting Up Crawlbase and Coding Environment

  • Go to Crawlbase to create an account and log in.
  • Your first 1,000 requests will be free of charge. If you want to get an extra 9,000 requests for free, simply add your billing details before consuming any of the free initial credits.
  • Get your API key or the Normal request token. We’ll use it to crawl Baidu.

Next, set up your Python environment:

  • Make sure Python 3 is installed.
  • Open a terminal and install the required libraries:
pip install requests beautifulsoup4

That’s it. You’re now ready to start writing the scraper.

How to Crawl and Scrape Baidu Search Results

Now that your main tools are ready, let’s create a simple script to fetch the HTML using Crawlbase. It takes just a few steps:

Step 1: Handle blocks and CAPTCHAs with Crawlbase

Create a new file called crawling.py, and add the following code:

import json
import requests

def crawl(url: str) -> str:
    API_TOKEN = "<Crawlbase Normal requests token>"  # <-- replace with your actual token
    API_ENDPOINT = "https://api.crawlbase.com/"

    params = {
        "token": API_TOKEN,
        "url": url,
        "format": "json",  # return a JSON response (original_status, pc_status, body)
    }

    response = requests.get(API_ENDPOINT, params=params)
    response.raise_for_status()

    json_data = json.loads(response.text)

    # original_status is the HTTP status Baidu itself returned
    if json_data["original_status"] != 200:
        raise Exception(f"Unable to crawl '{url}'")

    return json_data["body"]

What This Script Does:

  • Defines a crawl() function that accepts a target URL.
  • Uses requests to send a GET request to the Crawlbase API, asking for the JSON response format.
  • Authenticates with your Normal request token, so don’t forget to replace the placeholder with yours.
  • Checks that the original page returned a 200 OK status.
  • Returns the full HTML content from the body of the response.

This script lets you fetch any public web page easily, while Crawlbase handles issues like blocking, CAPTCHAs, and rate limits for you.
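
Once you’ve pasted in your token, you can sanity-check the function with a quick one-off call, for example:

from crawling import crawl

# Quick sanity check (assumes your token is set in crawling.py):
html = crawl("https://www.baidu.com/s?ie=utf-8&wd=苹果%20iPhone")
print(html[:500])  # peek at the start of the returned HTML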

Step 2: Scrape the HTML with BeautifulSoup

In this step, we’ll use BeautifulSoup to extract the data we need from the raw HTML of the Baidu search results page that Crawlbase returned.

So create a file called scraping.py and add the following code:

from bs4 import BeautifulSoup

def scrape_html(html: str) -> dict:
    soup = BeautifulSoup(html, 'html.parser')

    # The page's <title> tag
    page_title = soup.title.string if soup.title else None

    # The search query from the input field named "wd"
    search_query = ''
    search_input = soup.find('input', {'name': 'wd'})
    if search_input:
        search_query = search_input.get('value', '')

    # Search result titles and URLs
    search_results = []
    for result in soup.select('div.title-box_4YBsj h3.t'):
        link = result.find('a')
        if not link:
            continue
        search_results.append({
            'title': result.text.strip(),
            'url': link.get('href', '')
        })

    # Related search links
    related_searches = []
    for result in soup.select('table.rs-table_3RiQc tr td a'):
        related_searches.append({
            'title': result.text.strip(),
            'url': result.get('href', '')
        })

    page_data = {
        'title': page_title,
        'searchQuery': search_query,
        'searchResults': search_results,
        'relatedSearches': related_searches
    }

    return page_data

What This Script Does:

  • Parses the HTML using BeautifulSoup.
  • Extracts the following data:
    • The page’s <title> tag.
    • The search query from the input field named “wd”.
    • Search result titles and URLs from div.title-box_4YBsj h3.t.
    • Related search links from table.rs-table_3RiQc tr td a.
  • Returns the data in a clean, structured format you can use or save.
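
A handy way to iterate on these selectors without spending API credits is to save a crawled page to disk once and re-parse it offline; here’s a small sketch assuming a local baidu.html file:

from scraping import scrape_html

# Hypothetical workflow: save a crawled page once, then tweak the
# parser offline against the saved copy.
with open("baidu.html", encoding="utf-8") as f:  # baidu.html is an assumed local file
    html = f.read()

print(scrape_html(html)["searchResults"][:3])  # first three results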

Step 3: Combine Your Scripts

Now that we have both the crawling and scraping scripts ready, it’s time to put everything together and see the results.

Save the following to a file named main.py:

from crawling import crawl
from scraping import scrape_html
import json

html = crawl("https://www.baidu.com/s?ie=utf-8&wd=苹果%20iPhone")
data = scrape_html(html)

json_string = json.dumps(data, ensure_ascii=False, indent=2)
print(json_string)

What This Script Does:

  • Uses your crawl() function to fetch the HTML content of Baidu search results for the query “苹果 iPhone”.
  • Passes the HTML to scrape_html() to extract structured data.
  • Converts the result into pretty-printed JSON and displays it in the console.

Step 4: Run the Main Script

From your terminal, run:

python main.py

You should see something like this:

An image displaying the structured JSON output of a Baidu SERP (Search Engine Results Page).

This output confirms your crawler and scraper are working together successfully to turn raw HTML into clean, usable data.
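
The exact values depend on your query and when you run it, but the output follows the shape of the page_data dictionary, roughly (placeholders, not real results):

{
  "title": "…",
  "searchQuery": "苹果 iPhone",
  "searchResults": [
    { "title": "…", "url": "…" }
  ],
  "relatedSearches": [
    { "title": "…", "url": "…" }
  ]
}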

Quick Tips for Scraping Baidu Search Results

Now that your Baidu search scraper is up and running with Python, Crawlbase, and BeautifulSoup, here are some quick tips to keep in mind as you continue building:

  • Crawlbase Rate Limits: Be aware of the default rate limit of 20 requests per second to avoid 429 errors. If you need a higher limit, contact Crawlbase customer support.
  • 5XX pc_status codes: Requests that return a 5XX pc_status code are free of charge. This type of error typically means the URL is blocked, not responding, or currently unavailable, so you can safely retry at no extra cost.
  • Inspect HTML with DevTools: Always return to the HTML page and inspect the elements to check if there are any CSS class changes or if something breaks in your scraper. That’s a common problem, but a quick fix.
  • Try different queries: Modify the Baidu search URL to test with other keywords and explore how the structure changes.
  • Export your data: Write the output to a .json or .csv file using json.dump() or csv.DictWriter() for later use, as shown in the sketch after this list.
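
As an example of that last tip, here’s a minimal export sketch; the file names are just examples:

import csv
import json

from crawling import crawl
from scraping import scrape_html

data = scrape_html(crawl("https://www.baidu.com/s?ie=utf-8&wd=苹果%20iPhone"))

# JSON: dump the whole structure as-is.
with open("baidu_results.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

# CSV: flatten just the search results.
with open("baidu_results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(data["searchResults"])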

If you haven’t already, sign up for Crawlbase to get your free API token and start exploring the web at scale, without getting blocked.

Frequently Asked Questions (FAQs)

Q: What is Crawlbase, and how does it help with web scraping Baidu?

Crawlbase is a platform mainly focused on web data collection. It handles the technical aspects of scraping, such as changing IP addresses, bypassing anti-bot protections, and returning the raw web page as clean HTML that you can easily use in your code. You don’t need to worry about running into common scraping roadblocks anymore.

Q: Can I scrape Baidu using Python?

Yes, you can. With requests and BeautifulSoup, you can fetch and work with search result pages. Crawlbase acts as a bridge, ensuring your requests are processed smoothly and preventing you from getting blocked.

Q: Why use BeautifulSoup for scraping Baidu search engine?

BeautifulSoup is designed for extracting data from web pages. Even if you’re just starting out, it’s easy to work with, and it’s a popular choice for handling both tidy and messy HTML, like search results pages.

Q: Do I need to use JavaScript rendering to scrape Baidu?

Usually, Baidu’s main content loads without extra scripting. However, if you encounter a page that does require it, Crawlbase offers a JavaScript option to handle sites that necessitate a browser-style fetch.
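
If you hit such a page, the change on your side is small: authenticate with your JavaScript token instead of the Normal one. A hedged sketch:

import requests

# Same Crawling API call as before, but authenticated with your
# JavaScript token (also on the Crawlbase dashboard) so the page is
# fetched through a real browser before the HTML is returned.
response = requests.get(
    "https://api.crawlbase.com/",
    params={
        "token": "<Crawlbase JavaScript token>",  # JS token instead of the Normal token
        "url": "https://www.baidu.com/s?ie=utf-8&wd=苹果%20iPhone",
        "format": "json",
    },
)
print(response.json()["body"][:200])  # peek at the rendered HTML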