How to Scrape Google People Also Ask

Google's People Also Ask (PAA) box shows up in a large share of searches, sitting just below or between the organic results as a stack of expandable question-and-answer pairs. For anyone doing content or SEO research it is one of the most direct signals of real user intent you can read off a results page: the exact phrasing people use, the follow-up questions they ask, and the gaps competitors have not covered yet. Click one open and Google loads more related questions beneath it, so a single query can branch into a whole tree of intent.

This guide shows you how to scrape Google People Also Ask with Python in a reliable, repeatable way. You build a small runnable scraper that fetches a rendered Google SERP through the Crawling API, parses the PAA questions and answers, expands the nested items so you capture the deeper layers most scrapers miss, and exports clean Q and A pairs to JSON. The whole walkthrough stays scoped to public search-results data that anyone can see without an account, and the legality section near the end is not boilerplate, so read it before you point this at any real volume.

What you will build

A Python script that takes a public Google search query, retrieves the rendered HTML through the Crawling API, and extracts a structured record for every PAA item on the page, including the questions revealed by expanding the first layer. We will use a sample query as the running example and pull these fields from each PAA entry:

Question: the PAA question text exactly as Google phrases it, which is what expands your keyword and topic coverage.
Answer: the snippet answer Google shows when the item is expanded, useful for featured-snippet research.
Source URL: the page Google cites for the answer, which supports competitor analysis.
Children: the nested questions revealed when the item is expanded, capturing deeper levels of the expansion tree.

Why a plain request fails on Google

If you fire a bare HTTP request at a Google search URL from a script, the PAA section usually is not there. Two things work against you. First, the PAA box is rendered by JavaScript after the page initializes and updates again whenever a question is clicked, so the raw HTML a plain request returns either omits the box entirely or carries only an empty shell. Second, Google watches for automated traffic: requests that do not look like a real browser get challenged, fed a consent or verification page, or blocked before they reach the results.

So a working PAA scraper needs two things in one request: an IP the platform reads as a real visitor, and a browser that actually renders the page and runs its scripts. You can assemble that yourself with a headless browser plus a pool of rotating residential proxies, but keeping those healthy is most of the work. The Crawling API folds both into a single call: you send it the search URL, it fetches from a trusted residential IP and renders the page in a real browser, waits for the dynamic content to load, and returns finished HTML for you to parse.

Rendering is the whole game here

PAA content loads after the page initializes and changes on interaction, so a non-rendered fetch returns an incomplete or empty box. Render the page, give it a short wait for the scripts to settle, and the PAA section is in the HTML. The Crawling API does the rendering and the IP rotation server-side, and you start with up to 20,000 free requests, no credit card needed.
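
A quick way to see the difference between a rendered and a non-rendered fetch is to count PAA markers in the returned HTML. The sketch below treats the data-q attribute (the one this guide's parser reads questions from) as that marker. That choice is an assumption, and the function name is hypothetical.

```python
import re

# data-q is the attribute this guide's parser reads questions from;
# treating it as the PAA marker is an assumption of this sketch.
PAA_MARKER = re.compile(r"\sdata-q=")

def count_paa_markers(html):
    """Return how many data-q attributes appear in the HTML text."""
    return len(PAA_MARKER.findall(html or ""))
```

A count of zero on a page you expected to contain a PAA box suggests the fetch was not rendered, or that the wait was too short.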

Prerequisites

You need a few things in place before writing any code. None of them take long.

Basic Python. You should be comfortable writing and running a Python script and installing packages with pip. If BeautifulSoup is new to you, our guide to using BeautifulSoup in Python covers the parsing basics this tutorial assumes.

Python 3.8 or later. Confirm your version with python --version. If you do not have it, install it from python.org or through a distribution like Anaconda.

A Crawlbase account and a JavaScript token. Sign up, open your dashboard, and copy your token from the account docs page. Google SERPs need rendering, so use your JavaScript token (also called the Browser-enabled key) rather than the plain one. You get up to 20,000 free requests: 1,000 on signup, and more as you complete onboarding steps. Treat the token like a password: it authenticates your requests, so keep it out of version control.
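
One way to keep the token out of version control is to read it from an environment variable instead of hard-coding it. The sketch below assumes a variable named CRAWLBASE_JS_TOKEN. That name is chosen for illustration and is not something the API requires.

```python
import os

# CRAWLBASE_JS_TOKEN is a hypothetical variable name chosen for this sketch.
TOKEN_ENV_VAR = "CRAWLBASE_JS_TOKEN"

def load_token(env=None):
    """Return the JavaScript token from the environment, or raise if unset."""
    env = os.environ if env is None else env
    token = env.get(TOKEN_ENV_VAR, "").strip()
    if not token:
        raise RuntimeError(f"Set {TOKEN_ENV_VAR} before running the scraper")
    return token
```

Export the variable in your shell, then assign JS_TOKEN = load_token() in place of the hard-coded string.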

Set up the project

Create a virtual environment so project dependencies stay isolated, then install the two libraries the scraper needs.

bash

python --version

python -m venv paa_env
source paa_env/bin/activate

pip install requests beautifulsoup4

On Windows, activate the environment with paa_env\Scripts\activate instead of the source line. Two dependencies do the work: requests sends the HTTP call to the Crawling API, and beautifulsoup4 parses the returned HTML so you can pull out individual fields by CSS selector.

Step 1: Fetch the rendered SERP through the Crawling API

Start by getting the rendered HTML. Write a small crawl() function that sends your Google search URL to the Crawling API with your JavaScript token, asks it to render and wait, checks that the underlying page came back with a 200 status, and returns the HTML body. The gl and hl parameters in the URL set the country and language, and page_wait gives the PAA scripts time to finish before the HTML is captured.

python

import json
import requests
from urllib.parse import urlencode

JS_TOKEN = "YOUR_CRAWLBASE_TOKEN"  # use your JavaScript token
API_ENDPOINT = "https://api.crawlbase.com/"

def build_serp_url(query, gl="us", hl="en"):
    base = "https://www.google.com/search?"
    return base + urlencode({"q": query, "gl": gl, "hl": hl})

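def crawl(url, page_wait=2000):
    params = {
        "token": JS_TOKEN,
        "url": url,
        "page_wait": page_wait,
        # Ask for the JSON envelope; the API's default response is raw HTML.
        "format": "json",
    }
    response = requests.get(API_ENDPOINT, params=params, timeout=90)
    response.raise_for_status()

    data = json.loads(response.text)
    # original_status may arrive as a string or an int, so compare as text.
    if str(data.get("original_status")) != "200":
        raise RuntimeError(f"Unable to crawl '{url}'")

    return data["body"]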

if __name__ == "__main__":
    url = build_serp_url("how to scrape google", gl="us", hl="en")
    html = crawl(url)
    print(html[:500])

The API returns a JSON envelope, so you load the response with json.loads and read two fields: original_status is the status Google itself returned, and body is the rendered page HTML. Guarding on original_status means a consent gate or a block surfaces as an exception instead of feeding garbage into the parser. A page_wait of around 2000 milliseconds is usually enough for the PAA box to load, and the 90-second request timeout gives the render room to finish without letting the call hang forever. Save the script as crawling.py and run it with python crawling.py. You should see real SERP markup in the first 500 characters, which confirms that the fetch and render work before you write a single selector. If the PAA section looks thin, increasing page_wait is almost always the first fix.
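
The "raise page_wait first" advice can be automated with a small retry loop. The sketch below is one possible shape, not part of the Crawling API: fetch_until_enough is a hypothetical helper, and the wait values are illustrative. It takes the fetch and counting functions as arguments so it does not depend on any particular client or parser.

```python
def fetch_until_enough(fetch, count_items, waits=(2000, 4000, 6000), minimum=1):
    """Call fetch(page_wait) with growing waits until enough items appear.

    fetch and count_items are injected callables. Returns a tuple of
    (html, page_wait_used); if no attempt reaches the minimum, the last
    attempt is returned.
    """
    html, used = None, None
    for wait in waits:
        html, used = fetch(wait), wait
        if count_items(html) >= minimum:
            break
    return html, used
```

With the functions from this guide you could call it as fetch_until_enough(lambda w: crawl(url, page_wait=w), lambda h: h.count("data-q=")). Keep in mind that every attempt consumes an API request.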

Crawlbase Google Scraper

That page_wait only does its job because the request reached Google as a real browser on a trusted IP in the first place. The Crawling API fetches the SERP from a rotating residential address, renders it in a real headless browser, and waits for the PAA scripts to settle before handing you finished HTML, so you skip running a headless fleet and sourcing a residential proxy pool yourself. Point it at a public search URL on the free tier first.


Step 2: Parse the PAA questions and answers

With rendered HTML in hand, load it into BeautifulSoup and pull each PAA item. Google does not give the PAA box a single stable class name, so the reliable approach is a list of fallback selectors tried in order. Each PAA item carries a data-q attribute holding the question, and the visible answer lives in the expandable block beside it. The parser below reads the question from data-q, the answer from the rendered text, and the cited source from the first outbound anchor in the item.

python

from bs4 import BeautifulSoup

# Layered fallbacks: Google rotates these class names, so try
# the stable data-q attribute first, then known container classes.
PAA_ITEM_SELECTORS = [
    "div[data-q]",
    "div.related-question-pair",
    "div[jsname='Cpkphb']",
]

def find_paa_items(soup):
    for selector in PAA_ITEM_SELECTORS:
        items = soup.select(selector)
        if items:
            return items
    return []

def parse_paa(html):
    soup = BeautifulSoup(html, "html.parser")
    questions = []

    for item in find_paa_items(soup):
        question = item.get("data-q")
        if not question:
            heading = item.select_one("div[role='heading'], span")
            question = heading.get_text(strip=True) if heading else None
        if not question:
            continue

        answer_el = item.select_one("div[data-attrid], div.wDYxhc, span.hgKElc")
        answer = answer_el.get_text(" ", strip=True) if answer_el else None

        link = item.select_one("a[href^='http']")
        source_url = link["href"] if link else None

        questions.append({
            "question": question,
            "answer": answer,
            "source_url": source_url,
            "children": [],
        })

    return questions

The record shape matches the fields listed under "What you will build": question expands keyword coverage, answer helps with featured-snippet work, source_url supports competitor analysis, and children is reserved for the nested expansions you capture in the next step. Reading the question from the stable data-q attribute first, then falling back to the heading text, is what keeps the parser working when Google reshuffles its class names. The if not question: continue guard skips empty shells so only real PAA items reach your output.

Selectors drift

Google rotates the obfuscated class names in its SERP markup, so a selector that fires today can return nothing next month. That is why the parser tries the stable data-q attribute before any class name and keeps a list of fallbacks. When every PAA field comes back empty, re-inspect a live results page in your browser's dev tools and update the list. Log which selector fired on each run so a sudden drop in matches is easy to spot.
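
Logging which selector fired can be sketched as a thin wrapper around the fallback loop. The helper name select_with_fallbacks and the logger name "paa" are assumptions made for this example. The function accepts any callable that maps a CSS selector to a list of matches, so BeautifulSoup's soup.select fits.

```python
import logging

logger = logging.getLogger("paa")  # logger name is arbitrary

def select_with_fallbacks(select, selectors):
    """Try each selector in order; return (matches, selector_that_fired).

    `select` is any callable that maps a CSS selector to a list of
    matches, for example the `select` method of a BeautifulSoup object.
    """
    for selector in selectors:
        matches = select(selector)
        if matches:
            logger.info("PAA selector fired: %s (%d items)", selector, len(matches))
            return matches, selector
    logger.warning("No PAA selector matched; the markup may have changed")
    return [], None
```

Calling select_with_fallbacks(soup.select, PAA_ITEM_SELECTORS) returns both the items and the selector that matched, so a shift from the first selector to a later fallback shows up in the logs before the count drops to zero.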

Step 3: Expand the nested PAA items

Up to this point you are extracting only the initial set of PAA questions. That alone is useful, but it is incomplete: the real value lives deeper in the expansion tree. When a user clicks a PAA question, Google dynamically loads two to four more related questions beneath it, and each of those can trigger further expansions. To capture that, you tell the Crawling API to simulate the clicks before it captures the HTML, using the css_click_selector parameter so the additional questions load into the DOM you then parse.

python

def crawl_expanded(url, page_wait=3000):
    # css_click_selector clicks each PAA question so Google loads
    # the nested questions before the HTML is captured.
    params = {
        "token": JS_TOKEN,
        "url": url,
        "page_wait": page_wait,
        "css_click_selector": "div[data-q]",
        # Ask for the JSON envelope; the API's default response is raw HTML.
        "format": "json",
    }
    response = requests.get(API_ENDPOINT, params=params, timeout=90)
    response.raise_for_status()
    data = json.loads(response.text)
    # original_status may arrive as a string or an int, so compare as text.
    if str(data.get("original_status")) != "200":
        raise RuntimeError(f"Unable to crawl '{url}'")
    return data["body"]

def scrape_with_expansions(query, gl="us", hl="en"):
    url = build_serp_url(query, gl, hl)

    # First pass: the visible PAA questions.
    base_items = parse_paa(crawl(url))

    # Second pass: click to load the nested questions, then diff.
    expanded_items = parse_paa(crawl_expanded(url))
    if not base_items:
        # Nothing to nest under, so return whatever the expanded pass found.
        return expanded_items

    seen = {item["question"] for item in base_items}
    for item in expanded_items:
        if item["question"] not in seen:
            # A flat diff cannot tell which question revealed a new one,
            # so new questions are grouped under the first item.
            base_items[0]["children"].append(item)
            seen.add(item["question"])

    return base_items

The flow is: build the SERP URL with your query and geo parameters, fetch the visible PAA set once, then fetch again with css_click_selector set to the PAA item selector so the API clicks each question and loads the new ones into the DOM. Parsing both passes and keeping only the questions you have not seen gives you the deeper layer without duplicating the originals. Note that a flat diff cannot tell which question revealed which follow-up, so this function groups every new question under the first item's children. In practice a single query can grow from three or four visible questions to twelve to twenty total after a couple of expansion rounds. This step is optional from an implementation standpoint, but it is where most of the missing value lives.

Step 4: Put it together and export the Q and A pairs

Now wire the build, fetch, expand, and parse into one runnable script that writes the structured PAA data to JSON. Setting ensure_ascii=False keeps non-ASCII characters readable in the file instead of escaping them into \u sequences, which matters once you run queries in other languages.

python

import sys

def main():
    query = sys.argv[1] if len(sys.argv) > 1 else "how to scrape google"
    country = sys.argv[2] if len(sys.argv) > 2 else "us"

    paa = scrape_with_expansions(query, gl=country)

    outfile = f"paa_{country}.json"
    with open(outfile, "w", encoding="utf-8") as f:
        json.dump(paa, f, ensure_ascii=False, indent=2)

    total = len(paa) + sum(len(q["children"]) for q in paa)
    print(f"Saved {total} PAA questions to {outfile}")

if __name__ == "__main__":
    main()

Save the combined code as main.py, leaving out the test block at the end of Step 1, and run it with python main.py "content gap analysis" uk. It builds the Google SERP URL for that query in the chosen country, fetches the rendered HTML, expands the PAA items, and writes the question-and-answer pairs to paa_uk.json. The same handful of functions is all you need: swap the query or the country code and the parser handles whatever comes back. If results look incomplete, raise page_wait before anything else, since a slow render is the most common cause of a short PAA list.

What the output looks like

You get a clean list of question objects, each with its answer, the cited source, and any nested questions captured during expansion, ready to write to JSON, feed into a content brief, or load into a database for clustering.

json

[
  {
    "question": "Is it legal to scrape Google?",
    "answer": "Scraping public search results is generally permitted, but it can conflict with Google's terms of service.",
    "source_url": "https://example.com/is-scraping-google-legal",
    "children": [
      {
        "question": "Can Google detect scraping?",
        "answer": "Yes, Google uses rate limits and behavioral signals to flag automated traffic.",
        "source_url": "https://example.com/google-bot-detection",
        "children": []
      }
    ]
  },
  {
    "question": "What is the best tool to scrape Google?",
    "answer": "A rendering API that handles proxies and JavaScript is the most reliable approach.",
    "source_url": "https://example.com/google-scraping-tools",
    "children": []
  }
]

Each question becomes a node, and each expansion adds more nodes beneath it under children. From here, exporting to CSV for a spreadsheet, or flattening the tree into a content brief, is a few lines away. Because every record carries its source URL, you can also group the questions by the domains Google cites to see who already owns the answers.
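
The flattening and domain grouping mentioned above might look like the following sketch. It assumes the record shape shown in the sample output (question, answer, source_url, children). The function names are hypothetical.

```python
import csv
import io
from collections import defaultdict
from urllib.parse import urlparse

def flatten(items, parent=None, depth=0):
    """Yield one flat row per question, walking children recursively."""
    for item in items:
        yield {
            "depth": depth,
            "parent": parent or "",
            "question": item["question"],
            "answer": item.get("answer") or "",
            "source_url": item.get("source_url") or "",
        }
        yield from flatten(item.get("children", []), item["question"], depth + 1)

def to_csv(items):
    """Return the flattened tree as CSV text."""
    buffer = io.StringIO()
    fields = ["depth", "parent", "question", "answer", "source_url"]
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    writer.writerows(flatten(items))
    return buffer.getvalue()

def group_by_domain(items):
    """Map each cited domain to the questions it answers."""
    groups = defaultdict(list)
    for row in flatten(items):
        domain = urlparse(row["source_url"]).netloc
        if domain:
            groups[domain].append(row["question"])
    return dict(groups)
```

The parent and depth columns preserve the tree position of each question, so the spreadsheet can still be filtered by expansion level.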

Comparing PAA across countries

PAA results are not universal: they vary by location and language because Google personalizes them to the searcher's market. To compare, run the same query with different gl values and diff the results.

python

import time

query = "best running shoes"
markets = ["us", "uk", "de"]

by_market = {}
for gl in markets:
    items = scrape_with_expansions(query, gl=gl)
    by_market[gl] = {q["question"] for q in items}
    time.sleep(3)

# Questions unique to the UK market.
uk_only = by_market["uk"] - by_market["us"]
print(f"UK-only PAA questions: {len(uk_only)}")

Comparing unique questions, overlapping topics, and differences in answers across markets is particularly useful when you are expanding into a new region or localizing content. The time.sleep between requests paces the run so you are not firing back to back. For scaling well beyond a handful of queries, the asynchronous Crawler lets you push URLs in bulk and receive results via webhook instead of waiting on each call.
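
To go beyond a single pairwise difference, the per-market sets can be compared all at once. This is a minimal sketch that assumes a by_market dictionary of question sets like the one built above. The compare_markets name is hypothetical, and questions are compared as exact strings with no normalization.

```python
def compare_markets(by_market):
    """Given {market: set_of_questions}, return shared and unique questions."""
    all_sets = list(by_market.values())
    # Questions that appear in every market.
    shared = set.intersection(*all_sets) if all_sets else set()
    unique = {}
    for market, questions in by_market.items():
        # Everything seen in any other market.
        others = set().union(*(q for m, q in by_market.items() if m != market))
        unique[market] = questions - others
    return {"shared": shared, "unique": unique}
```

The "shared" set shows intent that holds across every market, while each "unique" entry lists the questions worth addressing only in localized content.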

Staying unblocked

Even with rendering and a trusted IP handled for you, Google watches for scraper-shaped traffic. A few habits keep a run healthy.

Pace your requests. Hammering the SERP in a tight loop is the fastest way to get challenged. Spread requests out and vary your queries instead of paging one term at full speed.
Lean on rotation. A pool of residential IPs spreads requests across many real-user addresses so no single one trips a limit. The Crawling API handles this for you; if you roll your own stack, this is the part to get right.
Watch the status and the counts. A run that starts returning challenges, or a PAA count that drops to zero, is telling you the rate, the IP tier, or the selectors need attention. Treat that as signal, not noise.
Re-inspect when fields go empty. Google reshuffles its markup periodically. If the PAA parse stops returning items, open a live page in dev tools and update the selector list.
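
The pacing habit can be sketched as a generator that inserts a randomized pause between items. The delay bounds below are illustrative assumptions, not recommended limits, and paced is a hypothetical helper name.

```python
import random
import time

def paced(items, min_delay=2.0, max_delay=6.0, sleep=time.sleep, rng=random.uniform):
    """Yield items one at a time, pausing a random interval between them.

    The delay bounds are illustrative defaults, not recommended limits.
    No pause is added before the first item.
    """
    for index, item in enumerate(items):
        if index:
            sleep(rng(min_delay, max_delay))
        yield item
```

Wrapping a query list as for query in paced(queries): keeps the loop body unchanged while spreading requests out unevenly, which avoids the fixed rhythm of a constant time.sleep.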

For the broader playbook, see how to scrape websites without getting blocked. If you want the wider view of every SERP feature beyond PAA, our guide to scraping Google search pages covers organic results, ads, and the knowledge panel, and the focused Python walkthrough for Google search results is the companion how-to to this one. Once you have PAA data in hand, extracting and analyzing Google SEO data and using scraped data to improve SEO show what to do with it.

Is it legal to scrape Google PAA?

Whether scraping Google's People Also Ask data is allowed depends on Google's terms of service, your jurisdiction, and what you do with the data. Scraping publicly visible search results sits in a legal grey area: the questions and answers in the PAA box are shown to anyone without an account, but Google's terms place limits on automated access, so a scraper can run against those terms regardless of how careful the tooling is. None of the code here changes that; it just makes the technical part work. Read Google's terms and its robots.txt, and treat both as the boundary for what you collect.

A few lines worth holding to. Collect only public PAA data: the questions, answers, and cited source URLs that appear on a results page without a login. Keep your request volume low enough that you are not straining Google's servers, and pace your crawl rather than running it flat out. Do not gather personal data, do not redistribute copyrighted answer text wholesale, and do not touch anything behind a login. The cited source URLs point to other people's pages, so treat the content there under its own license, not as yours to republish.

Where an official path exists, prefer it. Google offers Programmable Search and other official APIs for sanctioned access to search data, and for production-scale needs an official data agreement is the correct route, not a cleverer scraper. This guide is deliberately scoped to public PAA pages because that is the line that keeps the work defensible: question-and-answer pairs anyone can see, used for research and content planning, nothing more.

Recap

Key takeaways

PAA needs rendering. The box loads via JavaScript after the page initializes, so a plain request returns an empty or missing section; render the page and give it a short wait.
The Crawling API renders behind a real IP. Send it the search URL with a JavaScript token and page_wait, and it rotates residential IPs, runs a real browser, and returns finished HTML.
Parse with layered fallbacks. Read the question from the stable data-q attribute first, then fall back to class names, because Google rotates its obfuscated markup.
Expand the tree with css_click_selector. Click the visible questions so Google loads the nested ones, then diff the passes to grow from three or four questions to twelve to twenty.
Stay on public data. Respect Google's ToS and robots.txt, pace your requests, prefer official APIs at scale, and never touch logins or personal data.

Frequently Asked Questions (FAQs)

What is a People Also Ask box?

A PAA box is a Google SERP feature showing a stack of expandable question-and-answer pairs related to the search query. It appears in a large share of searches and expands dynamically when clicked, loading two to four more related questions each time, which is what makes it such a rich source of user-intent data for SEO and content research.

Why does a plain request miss the PAA section?

The PAA box is rendered by JavaScript after the initial HTML loads and updates again on interaction, so a bare HTTP request returns an empty shell or nothing at all. Fetching through the Crawling API with a JavaScript token renders the page in a real browser and waits for the scripts, so the PAA content is present in the HTML you parse.

How do I capture the nested PAA questions?

Use the css_click_selector parameter to have the API click each visible PAA question before it captures the HTML, which makes Google load the related questions into the DOM. Parse that expanded HTML and diff it against the first pass to collect the new questions under each item's children. A 3-level expansion typically yields twelve to twenty total questions per query.

Can I scrape Google PAA with Python?

Yes. With requests and BeautifulSoup you fetch the rendered SERP and pull the question, answer, source URL, and nested children from each PAA item. The Crawling API is the bridge that gets your request to Google from a trusted IP with rendering on. For a broader primer, see our guide on scraping websites with Python.

Why does PAA vary by country?

Google personalizes PAA results by the searcher's country and language, so the same query in the US and the UK often returns different questions because user behavior, language patterns, and available content differ by market. Run the same query with different gl values and diff the question sets to see the differences, which is useful when localizing content.

My selectors return nothing. What changed?

Almost certainly Google's markup. Google rotates the obfuscated class names in its SERP, so selectors that worked last month can break and the parser will silently return an empty list. Read the question from the stable data-q attribute first, keep a list of fallback selectors, log which one fires on each run, and re-inspect a live page in dev tools when the count drops.

Ian Kalvin

Technical Support Engineer · Crawlbase

Technical support engineer at Crawlbase, writing from the front line of what actually breaks in production scraping and proxy setups.

Neil Zamora

Senior Architect · Crawlbase

Senior architect at Crawlbase, focused on the systems behind large-scale crawling: proxy rotation, anti-bot resilience, and the APIs that hide that complexity.
