Baidu is the dominant search engine in China, the place most Chinese users go first when they look something up. That makes its public search results a useful signal for anyone doing keyword research, SEO tracking, market analysis, or simply trying to understand what ranks in a market that Google does not lead. The results page carries the same structured data a SERP tool wants anywhere else: titles, links, snippets, and the order they appear in.

This guide shows you how to scrape Baidu search results with Python the reliable way. You build a small, runnable scraper that fetches a rendered results page through the Crawling API, parses each result with BeautifulSoup, and prints clean structured output. The whole walkthrough stays scoped to public search-results data that anyone can see without an account, and the legality section near the end is not boilerplate, so read it before you point this at any real volume.

What you will build

A Python script that takes a public Baidu search URL, retrieves the HTML through the Crawling API, and extracts a structured record for every organic result on the page. We will use a sample query as the running example and pull these fields from each result:

  • Title: the headline text of the result, as shown in the listing.
  • Link: the destination URL the result points to.
  • Snippet: the displayed description or summary under the title.
  • Position: the rank of the result on the page, counted from the top.

Why a plain request fails on Baidu

If you fire a bare HTTP request at a Baidu results URL from a script, you rarely get the clean page you see in your own browser. Two things work against you. First, Baidu serves from inside China and tailors what it returns based on the requesting IP, so a foreign datacenter address can come back with a region gate or partial content. Second, Baidu watches for automated traffic: requests that do not look like a real browser get challenged, fed a verification page, or blocked before they reach the listings.

So a working Baidu scraper needs two things in one request: an IP the platform reads as a real visitor, and, when the page leans on scripts, a browser that renders it. You can assemble that yourself with a headless browser plus a pool of rotating residential proxies, but keeping those healthy is most of the work. The Crawling API folds both into a single call: you send it the URL, it fetches from a trusted residential IP and renders when needed, and it returns finished HTML for you to parse.

Why residential rotation matters here

Baidu is geo-sensitive in a way most Western targets are not. A request from a residential IP looks like an ordinary visitor, while a foreign datacenter address is an immediate tell. The Crawling API rotates through residential addresses server-side, so you do not have to source and maintain that pool yourself. You can start with 1,000 free requests, no credit card needed.

Prerequisites

You need a few things in place before writing any code. None of them take long.

Basic Python. You should be comfortable writing and running a Python script and installing packages with pip. If BeautifulSoup is new to you, our guide to using BeautifulSoup in Python covers the parsing basics this tutorial assumes.

Python 3.8 or later. Confirm your version with python --version. If you do not have it, install it from python.org or through a distribution like Anaconda.

A Crawlbase account and token. Sign up, open your dashboard, and copy your request token from the account docs page. Your first 1,000 requests are free, and adding billing details before you spend them unlocks an extra 9,000 free requests. Treat the token like a password: it authenticates your requests, so keep it out of version control.
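
One common way to keep the token out of version control is to read it from an environment variable at startup. The sketch below assumes a variable named CRAWLBASE_TOKEN; that name is chosen for this example and is not something Crawlbase requires.

```python
import os

def load_token(env=None, var_name="CRAWLBASE_TOKEN"):
    """Return the API token from the environment, or fail with a clear message.

    CRAWLBASE_TOKEN is a hypothetical variable name used for illustration.
    """
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {var_name} environment variable before running the scraper"
        )
    return token
```

With the variable exported in your shell, the scripts in this guide could set API_TOKEN = load_token() instead of hard-coding the value.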

Set up the project

Create a virtual environment so project dependencies stay isolated, then install the two libraries the scraper needs.

bash
python --version

python -m venv baidu_env
source baidu_env/bin/activate

pip install requests beautifulsoup4

On Windows, activate the environment with baidu_env\Scripts\activate instead of the source line. Two dependencies do the work: requests sends the HTTP call to the Crawling API, and beautifulsoup4 parses the returned HTML so you can pull out individual fields by CSS selector.

Step 1: Fetch the page through the Crawling API

Start by getting the HTML. Write a small crawl() function that sends your target URL to the Crawling API with your token, checks that the underlying page came back with a 200 status, and returns the HTML body. Checking the status before you parse keeps failures loud instead of silent.

python
import json
import requests

API_TOKEN = "YOUR_CRAWLBASE_TOKEN"  # replace with your token
API_ENDPOINT = "https://api.crawlbase.com/"

def crawl(url):
    # format=json asks the API for a JSON envelope instead of raw HTML
    params = {"token": API_TOKEN, "url": url, "format": "json"}
    # a generous timeout keeps a stalled call from hanging the script
    response = requests.get(API_ENDPOINT, params=params, timeout=90)
    response.raise_for_status()

    data = json.loads(response.text)
    # original_status may arrive as a string or an integer, so compare as text
    status = data.get("original_status")
    if str(status) != "200":
        raise RuntimeError(f"Unable to crawl '{url}' (original_status={status})")

    return data["body"]

if __name__ == "__main__":
    url = "https://www.baidu.com/s?ie=utf-8&wd=%E8%8B%B9%E6%9E%9C%20iPhone"
    html = crawl(url)
    print(html[:500])

Because the request sets format=json, the API returns a JSON envelope, so you load the response with json.loads and read two fields: original_status is the status Baidu itself returned, and body is the page HTML. Guarding on original_status means a region gate or a block surfaces as an exception instead of feeding garbage into the parser. The sample query is "苹果 iPhone" (apple iPhone), URL-encoded in the wd parameter, which is how Baidu carries the search term. Save the script as crawling.py and run it with python crawling.py. You should see real results markup in the first 500 characters, which confirms the fetch works before you write a single selector.
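
Rather than pasting a pre-encoded URL, you can build it from the raw query. This is a minimal sketch using only the standard library; build_search_url is a helper name invented for this example, and its optional offset maps to the pn pagination parameter covered later in this guide.

```python
from urllib.parse import quote, urlencode

BAIDU_SEARCH = "https://www.baidu.com/s"

def build_search_url(query, offset=0):
    """Build a Baidu results URL for `query`; `offset` becomes the pn parameter."""
    params = {"ie": "utf-8", "wd": query}
    if offset:
        params["pn"] = offset
    # quote_via=quote encodes spaces as %20 (the default, quote_plus, would use "+")
    return f"{BAIDU_SEARCH}?{urlencode(params, quote_via=quote)}"

print(build_search_url("苹果 iPhone"))
# https://www.baidu.com/s?ie=utf-8&wd=%E8%8B%B9%E6%9E%9C%20iPhone
```

The printed URL matches the hard-coded one in the script above, so the helper can replace it without changing what gets fetched.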

Crawlbase Crawling API

That original_status check comes back 200 only when the request reaches Baidu looking like a real visitor. The Crawling API fetches the page from a rotating residential IP inside the right region, renders it when the page needs a browser, and hands you finished HTML, so you skip running a headless fleet and sourcing a residential proxy pool yourself. Point it at a public results URL on the free tier first.

Step 2: Parse the results with BeautifulSoup

With HTML in hand, load it into BeautifulSoup and pull each result by its selector. Baidu wraps each organic result's headline in a title block, and the destination link sits in the anchor inside it. Inspect the live page in your browser's dev tools (right-click, then Inspect) to confirm the current class names; the selectors below match the layout at the time of writing.

python
from bs4 import BeautifulSoup

def scrape_html(html):
    soup = BeautifulSoup(html, "html.parser")

    page_title = soup.title.string if soup.title else None
    search_input = soup.find("input", {"name": "wd"})
    search_query = search_input.get("value", "") if search_input else ""

    results = []
    for position, block in enumerate(soup.select("div.title-box_4YBsj"), start=1):
        heading = block.select_one("h3.t")
        link = block.select_one("a[href]")
        snippet = block.find_next("div", class_="content-right_2s-H4")
        if not heading or not link:
            continue
        results.append({
            "position": position,
            "title": heading.get_text(strip=True),
            "url": link["href"],
            "snippet": snippet.get_text(strip=True) if snippet else None,
        })

    return {
        "pageTitle": page_title,
        "searchQuery": search_query,
        "results": results,
    }

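The selector div.title-box_4YBsj is the wrapper Baidu uses for each result's title block, with the headline in an h3.t tag and the destination in the anchor inside it. Reading the link from the anchor's href keeps the URL separate from the title. enumerate(..., start=1) gives you the position as you loop, so rank comes from page order instead of a fragile attribute. The if not heading or not link: continue guard skips any block that lacks a heading or a link, which filters out most stray markup. It does not identify ads by itself, so check the output if sponsored listings share the same wrapper. Because enumerate counts skipped blocks too, a skipped block leaves a gap in the position numbers. The snippet is read with find_next from the first description container after each title, falling back to None when none is found. If a result has no description of its own, find_next can pick up the next result's, so spot-check snippets against the live page.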

Selectors drift

Baidu's class names, like title-box_4YBsj and content-right_2s-H4, carry a generated suffix that changes when Baidu redeploys its front end. Treat the selectors above as a starting template, not a contract. When a field comes back empty for every result, re-inspect a live page in your browser's dev tools and update the selector. Periodic selector maintenance is normal for any production scraper, not a sign something is broken.
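
A small check can flag drift automatically instead of waiting for someone to notice empty output. The sketch below is one possible approach, not part of any library: fields_empty_everywhere is a hypothetical helper that takes the results list produced by scrape_html and reports which fields are blank in every record.

```python
def fields_empty_everywhere(results, fields=("title", "url", "snippet")):
    """Return the fields that are missing or blank in every result.

    An empty `results` list reports all fields, since nothing parsed at all.
    """
    return [
        field
        for field in fields
        if not any((r.get(field) or "").strip() for r in results)
    ]

sample = [
    {"position": 1, "title": "A", "url": "http://example.com/a", "snippet": None},
    {"position": 2, "title": "B", "url": "http://example.com/b", "snippet": ""},
]
print(fields_empty_everywhere(sample))  # ['snippet']
```

If the returned list is non-empty after a crawl, that is a reasonable trigger to log a warning and re-inspect the selector for the named field.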

Step 3: Put it together

Now wire the fetch and the parse into one runnable script. Crawl the rendered results page, hand the HTML to the parser, and write the structured output to JSON. Setting ensure_ascii=False keeps Chinese characters readable in the file instead of escaping them into \u sequences.

python
import json
import requests
from bs4 import BeautifulSoup

API_TOKEN = "YOUR_CRAWLBASE_TOKEN"
API_ENDPOINT = "https://api.crawlbase.com/"

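def crawl(url):
    # format=json asks the API for a JSON envelope instead of raw HTML
    params = {"token": API_TOKEN, "url": url, "format": "json"}
    response = requests.get(API_ENDPOINT, params=params, timeout=90)
    response.raise_for_status()
    data = json.loads(response.text)
    # original_status may arrive as a string or an integer, so compare as text
    status = data.get("original_status")
    if str(status) != "200":
        raise RuntimeError(f"Unable to crawl '{url}' (original_status={status})")
    return data["body"]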

def scrape_html(html):
    soup = BeautifulSoup(html, "html.parser")
    page_title = soup.title.string if soup.title else None
    search_input = soup.find("input", {"name": "wd"})
    search_query = search_input.get("value", "") if search_input else ""

    results = []
    for position, block in enumerate(soup.select("div.title-box_4YBsj"), start=1):
        heading = block.select_one("h3.t")
        link = block.select_one("a[href]")
        snippet = block.find_next("div", class_="content-right_2s-H4")
        if not heading or not link:
            continue
        results.append({
            "position": position,
            "title": heading.get_text(strip=True),
            "url": link["href"],
            "snippet": snippet.get_text(strip=True) if snippet else None,
        })

    return {"pageTitle": page_title, "searchQuery": search_query, "results": results}

def main():
    url = "https://www.baidu.com/s?ie=utf-8&wd=%E8%8B%B9%E6%9E%9C%20iPhone"
    html = crawl(url)
    data = scrape_html(html)
    with open("baidu_results.json", "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    print(f"Saved {len(data['results'])} results")

if __name__ == "__main__":
    main()

Save the script as main.py and run it with python main.py. It fetches the results page for "苹果 iPhone", extracts a record for each organic listing, and writes everything to baidu_results.json. To search for something else, swap the query in the URL; the same two functions apply as long as Baidu's markup still matches the selectors.

What the output looks like

You get a clean structured object with the page title, the echoed search query, and an ordered list of results, ready to write to JSON, CSV, or a database.

json
{
  "pageTitle": "苹果 iPhone_百度搜索",
  "searchQuery": "苹果 iPhone",
  "results": [
    {
      "position": 1,
      "title": "Apple (中国大陆) - 官方网站",
      "url": "http://www.baidu.com/link?url=abc123",
      "snippet": "探索 iPhone、iPad、Mac 等 Apple 产品的全新阵容。"
    },
    {
      "position": 2,
      "title": "iPhone - 维基百科",
      "url": "http://www.baidu.com/link?url=def456",
      "snippet": "iPhone 是苹果公司设计和销售的智能手机系列。"
    }
  ]
}
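
The results list also flattens naturally to CSV. This is a minimal sketch using the standard csv module; results_to_csv and the FIELDS column order are choices made for this example, and the sample row reuses the first record above.

```python
import csv
import io

FIELDS = ["position", "title", "url", "snippet"]

def results_to_csv(results, out):
    """Write result dicts to a file-like object as CSV with a header row."""
    # extrasaction="ignore" drops any keys that are not in FIELDS
    writer = csv.DictWriter(out, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for row in results:
        writer.writerow(row)  # None values are written as empty cells

buffer = io.StringIO()
results_to_csv(
    [{
        "position": 1,
        "title": "Apple (中国大陆) - 官方网站",
        "url": "http://www.baidu.com/link?url=abc123",
        "snippet": None,
    }],
    buffer,
)
print(buffer.getvalue())
```

To write a real file, pass open("baidu_results.csv", "w", newline="", encoding="utf-8") as the out argument so the csv module controls line endings and Chinese text is stored as UTF-8.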

Note that result URLs come back as baidu.com/link?url=... redirect links rather than the final destination. That is how Baidu serves outbound clicks. If you need the real target, follow each redirect with a separate request, but do it sparingly and at low volume so you are not multiplying your traffic against Baidu.
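
Before following anything, it helps to know which URLs are redirect links at all. The sketch below only classifies URLs and makes no network requests; is_baidu_redirect is a hypothetical helper, and it assumes redirect links keep the /link?url=... shape shown in the sample output.

```python
from urllib.parse import parse_qs, urlparse

def is_baidu_redirect(url):
    """True when `url` looks like a Baidu outbound redirect (baidu.com/link?url=...)."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    on_baidu = host == "baidu.com" or host.endswith(".baidu.com")
    return on_baidu and parts.path == "/link" and "url" in parse_qs(parts.query)

print(is_baidu_redirect("http://www.baidu.com/link?url=abc123"))  # True
print(is_baidu_redirect("https://www.apple.com.cn/iphone/"))      # False
```

Filtering with this first keeps the follow-up requests limited to links that actually need resolving.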

Scaling across pages and queries

One query on one page is a demo; a real job runs over several searches and deeper into the results. Baidu paginates with the pn query parameter, which is an offset in multiples of 10: pn=10 is the second page, pn=20 the third, and so on. The shape stays the same: build each URL, fetch it through the Crawling API, and parse it with the same function. The one habit that keeps a long run healthy is pacing, so pause between requests rather than firing them in a tight loop.

python
import time
from urllib.parse import quote

query = "苹果 iPhone"
encoded = quote(query)

all_results = []
for page in range(3):
    offset = page * 10
    url = f"https://www.baidu.com/s?ie=utf-8&wd={encoded}&pn={offset}"
    html = crawl(url)
    all_results.extend(scrape_html(html)["results"])
    time.sleep(3)

print(f"Collected {len(all_results)} results across 3 pages")
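
One detail the loop above glosses over: scrape_html numbers positions from 1 on every page, so the combined list has repeated positions. A possible fix is sketched below. merge_pages is a hypothetical helper, it assumes each page holds up to 10 results (mirroring the pn step), and it drops a listing whose URL already appeared on an earlier page.

```python
def merge_pages(pages, page_size=10):
    """Combine per-page result lists into one list with an absolute rank.

    `pages` is a list of result lists in page order (first page first).
    page_size=10 is an assumption that mirrors the pn offset step.
    """
    merged, seen = [], set()
    for page_index, results in enumerate(pages):
        for result in results:
            if result["url"] in seen:
                continue  # same listing repeated on a later page
            seen.add(result["url"])
            rank = page_index * page_size + result["position"]
            merged.append({**result, "rank": rank})
    return merged

pages = [
    [{"position": 1, "url": "a"}, {"position": 2, "url": "b"}],
    [{"position": 1, "url": "b"}, {"position": 2, "url": "c"}],
]
print([(r["url"], r["rank"]) for r in merge_pages(pages)])
# [('a', 1), ('b', 2), ('c', 12)]
```

To use it with the loop, collect scrape_html(html)["results"] per page into a list of lists instead of extending one flat list.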

Crawlbase serves up to 20 requests per second by default, which is plenty of headroom for a scraper that paces itself; if you genuinely need more, support can raise it. Any 5XX response from the API is free of charge, so retrying a blocked or unavailable URL costs you nothing. If you would rather route your own traffic through a rotating pool instead of using the managed API, the Smart AI Proxy (also called the AI Proxy) gives you the same residential IP rotation as a drop-in proxy endpoint.
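
Since failed calls can be retried, a small wrapper with exponential backoff is a reasonable addition. This is a sketch under stated assumptions: with_retries is a name invented here, it retries on any exception raised by the fetch function, and the delays (2, 4, 8 seconds) are illustrative defaults rather than recommended values.

```python
import time

def with_retries(fetch, url, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url), retrying on any exception with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # 2s, 4s, 8s, ...
```

In the pagination loop this would read html = with_retries(crawl, url). The sleep parameter exists so the wrapper can be tested without real waiting.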

Staying unblocked

Even with a trusted IP handled, Baidu watches for scraper-shaped traffic, and its checks are stricter than most because of where it operates. A few habits keep a run healthy.

  • Pace your requests. Hammering results pages in a tight loop is the fastest way to get challenged. Spread requests out and vary your queries instead of paging one term at full speed.
  • Lean on rotation. A pool of residential IPs spreads requests across many real-user addresses so no single one trips a limit. The Crawling API handles this for you; if you roll your own stack, this is the part to get right.
  • Read the status codes. A run that starts returning challenges or verification pages is telling you the current rate or IP tier is no longer enough. Treat that as signal to back off, not noise to ignore.
  • Re-inspect when fields go empty. Baidu changes its markup periodically. If results stop parsing, open a live page in dev tools and update the selectors.
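
The pacing habit can be wrapped in a small generator so every loop gets it for free. The sketch below is illustrative: paced is a hypothetical helper, and the 2 to 5 second window is an arbitrary starting point, not a threshold published by Baidu or Crawlbase.

```python
import random
import time

def paced(items, min_delay=2.0, max_delay=5.0, sleep=time.sleep, rng=random.uniform):
    """Yield items one by one, pausing a random interval between them."""
    for index, item in enumerate(items):
        if index:  # no pause before the first item
            sleep(rng(min_delay, max_delay))
        yield item
```

A crawl loop then becomes for url in paced(urls): html = crawl(url), and the random interval keeps requests from landing on a fixed rhythm.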

For the broader playbook, see how to scrape websites without getting blocked and the deeper dive on how to bypass captchas while web scraping. If a Baidu page you need leans on scripts to render, our guide on crawling JavaScript websites explains why rendering matters and how to turn it on.

Is it legal to scrape Baidu?

Whether scraping Baidu is allowed depends on Baidu's terms of service, your jurisdiction, and what you do with the data. Baidu's terms place limits on automated access, so scraping can run against those terms regardless of how careful your tooling is. None of the code here changes that; it just makes the technical part work. Read Baidu's terms and its robots.txt, and treat both as the boundary for what you collect.

A few lines worth holding to. Collect only public search-results data: the titles, links, snippets, and positions that anyone can see on a results page without an account. Keep your request volume low enough that you are not straining Baidu's servers, and pace your crawl rather than running it flat out. Baidu does not publish a broadly available official SERP API for this kind of access, so there is no sanctioned high-volume endpoint to fall back on, which is all the more reason to stay modest in scale and respectful of the site's stated rules.

This guide is deliberately scoped to public search-results pages because that is the line that keeps the work defensible. It does not cover anything behind a login, account or personal data, or copyrighted media pulled from the linked destinations. Public SERP data only. If your project needs more than that, an official data agreement is the correct path, not a cleverer scraper.

Recap

Key takeaways

  • Baidu is geo-sensitive. A foreign datacenter IP gets a different page or a block, so you need a trusted residential address to see the real results.
  • The Crawling API fetches behind a real IP. Send it the URL, it rotates residential IPs server-side and renders when needed, and returns finished HTML for you to parse.
  • BeautifulSoup does the extraction. Select each div.title-box_4YBsj, then read title, link, snippet, and position from it, and expect the suffixed class names to drift.
  • Paginate with the pn offset. Increase pn in multiples of 10 to walk deeper into results, and pace your requests with a sleep between pages.
  • Stay on public data. Respect Baidu's ToS and robots.txt, keep volume low since there is no open official SERP API, and never touch accounts or personal data.

Frequently Asked Questions (FAQs)

Why does a plain request fail or return the wrong page on Baidu?

Baidu serves from inside China and adjusts what it returns based on the requesting IP, so a call from a foreign datacenter address can come back with a region gate, partial content, or a verification page instead of the results you see in your own browser. It also flags traffic that does not look like a real browser. Fetching through the Crawling API, which uses rotating residential IPs, makes the request look like an ordinary visitor so you get the real results page.

Can I scrape Baidu search results with Python?

Yes. With requests and BeautifulSoup you can fetch a results page and pull out titles, links, snippets, and positions. The Crawling API sends your request to Baidu from a trusted residential IP, so it is far less likely to be challenged or blocked than a direct call from your own machine. For a broader Python primer, see our guide on scraping websites with Python.

What fields can I extract from a Baidu results page?

This tutorial pulls four fields from each organic result: the title, the destination link, the displayed snippet, and the position on the page. You also capture the page title and the echoed search query from the wd input. Stay within public search-results data and avoid anything behind a login.

Do I need JavaScript rendering to scrape Baidu?

Usually the main results load without it, so the basic fetch in this guide is enough. If you hit a page that needs a browser to fill in, the Crawling API offers a JavaScript rendering option that fetches the page the way a real browser would. Our guide to scraping JavaScript pages with Python covers when that is necessary.

How do I paginate through more Baidu results?

Use the pn query parameter, which is an offset in multiples of 10: pn=10 is the second page, pn=20 the third, and so on. Build each page URL with the offset, fetch it through the Crawling API, parse it with the same function, and pause a few seconds between requests so you are pacing the crawl rather than hammering it.

My selectors return nothing. What changed?

Almost certainly Baidu's markup. Class names like title-box_4YBsj carry a generated suffix that changes when Baidu redeploys its front end, so selectors that worked last month can break. Re-inspect a live results page in your browser's dev tools and update the selectors. Periodic selector maintenance is normal for any production scraper.
