Google's search results are the scoreboard for SEO. Where a page ranks for a keyword, which titles and snippets win the click, what questions show up in the "People also ask" box, and which related searches Google suggests: every one is a public signal you can read straight off the results page. Pulled across your target keywords, they tell you who you compete with, where you sit today, and which topics you have not covered yet.
This guide shows you how to extract and analyze SEO data from Google with Python. You build a small, runnable workflow that fetches a rendered SERP through the Crawling API, parses the fields that matter for SEO, runs the same query across a list of keywords, and loads the result into pandas to find ranking positions, competitor overlap, and content gaps. The whole walkthrough stays scoped to public search-results data, and the legality section near the end is not boilerplate, so read it before you point this at any real volume.
What you will build
A Python workflow that takes a list of target keywords, retrieves each SERP's HTML through the Crawling API, and extracts a structured record for every organic result plus the page's SEO features. We pull these fields from each search:
- Position: the organic rank of each result, counted from the top of the page.
- Title: the meta title shown for each result, which drives click-through rate.
- Link: the destination URL, from which we also derive the domain for competitor analysis.
- Snippet: the meta description Google displays under each result.
- People also ask: the PAA questions Google attaches to the query, useful for content ideas.
- Related searches: the related-query suggestions at the foot of the page, a source of long-tail keywords.
Those fields cover the SEO data points worth tracking from a SERP: organic rankings, SERP features like PAA, meta titles and descriptions, and the URLs and domains of the pages that outrank you. Search volume and CPC are not on the page itself, so those still come from a keyword tool such as Google Keyword Planner, Ahrefs, or SEMrush; this workflow gives you everything the results page does carry.
Why a plain request fails on Google
If you fire a bare HTTP request at a Google search URL from a script, you rarely get the clean SERP you see in your own browser. Google runs strong anti-scraping measures. Requests that do not look like a real browser get challenged with a CAPTCHA, redirected to a "sorry" interstitial, or rate-limited after a handful of calls. A datacenter IP making rapid, identical requests is the clearest possible tell, and Google blocks it quickly.
On top of that, much of a modern SERP is assembled with JavaScript. Features like the PAA box fill in after the initial HTML loads, so even a request that gets through can come back missing the very data you came for. A working Google SEO scraper therefore needs two things in one request: an IP the platform reads as a real visitor, and a browser that renders the page. You can build that yourself with a headless browser plus rotating residential proxies, but keeping that stack healthy is most of the work. The Crawling API folds both into a single call: you send it the URL, it fetches from a trusted IP and renders when needed, and returns finished HTML for you to parse.
Google's defenses key heavily on request patterns and IP reputation. A single datacenter address paging through keywords trips a limit fast. The Crawling API rotates requests across many real-user addresses server-side and solves the CAPTCHAs Google throws, so you do not have to source and maintain that pool yourself. You can start with 1,000 free requests, no credit card needed.
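To make the failure modes above concrete, here is a minimal sketch of a check that guesses whether a response is a block page instead of a real SERP. The function name looks_blocked, the status codes, and the text markers are all assumptions for illustration, not a list Google publishes, so verify them against the responses you actually receive.

```python
def looks_blocked(status_code, html):
    """Heuristic guess at whether a response is a block page, not a SERP.

    The status codes and markers are illustrative assumptions; adjust them
    to match what you observe in practice.
    """
    # Rate limiting and outright refusals often surface as these codes.
    if status_code in (403, 429, 503):
        return True
    lowered = html.lower()
    # "/sorry/" is the path of the interstitial; "unusual traffic" is text
    # that the interstitial page typically displays.
    markers = ("unusual traffic", "/sorry/")
    return any(marker in lowered for marker in markers)


if __name__ == "__main__":
    print(looks_blocked(429, ""))
    print(looks_blocked(200, "<html><h3>An ordinary result</h3></html>"))
```

A check like this is only a diagnostic aid. It tells you that a request was challenged, not how to get past the challenge.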
Prerequisites
You need a few things in place before writing any code. None of them take long.
Basic Python. You should be comfortable writing and running a Python script and installing packages with pip. If BeautifulSoup is new to you, our guide to using BeautifulSoup in Python covers the parsing basics this tutorial assumes, and our guide to analyzing data with pandas covers the DataFrame side.
Python 3.8 or later. Confirm your version with python --version. If you do not have it, install it from python.org or through a distribution like Anaconda.
A Crawlbase account and token. Sign up, open your dashboard, and copy your request token from the account docs page. Your first 1,000 requests are free, and you pay only for successful requests. Treat the token like a password: it authenticates your requests, so keep it out of version control.
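One common way to keep the token out of version control is to read it from an environment variable instead of hardcoding it. The sketch below assumes a variable named CRAWLBASE_TOKEN and a helper named load_token; both names are choices made for this example, not something the API requires.

```python
import os


def load_token(env_var="CRAWLBASE_TOKEN"):
    """Read the API token from the environment (variable name is an assumption)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        # Fail early with a clear message instead of sending an empty token.
        raise RuntimeError(f"Set the {env_var} environment variable before running")
    return token


if __name__ == "__main__":
    os.environ.setdefault("CRAWLBASE_TOKEN", "demo-token")  # demo value only
    print(len(load_token()))
```

With this in place, the scripts in the following steps could set API_TOKEN = load_token() in place of the literal string.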
Set up the project
Create a virtual environment so project dependencies stay isolated, then install the three libraries the workflow needs.
    python --version
    python -m venv seo_env
    source seo_env/bin/activate
    pip install requests beautifulsoup4 pandas
On Windows, activate the environment with seo_env\Scripts\activate instead of the source line. Three dependencies do the work: requests sends the HTTP call to the Crawling API, beautifulsoup4 parses the returned HTML so you can pull out individual fields by CSS selector, and pandas holds the scraped data in a DataFrame for the analysis at the end.
Step 1: Fetch a SERP through the Crawling API
Start by getting the HTML. Write a small crawl() function that sends your target Google search URL to the Crawling API with your token, checks that the underlying page came back with a 200 status, and returns the HTML body. Pass "javascript": "true" so the API renders the page with a real browser before returning it, which is what fills in script-loaded features like the PAA box. Checking the status before you parse keeps failures loud instead of silent.
    import json
    import requests
    from urllib.parse import quote_plus

    API_TOKEN = "YOUR_CRAWLBASE_TOKEN"  # replace with your token
    API_ENDPOINT = "https://api.crawlbase.com/"


    def search_url(keyword):
        return f"https://www.google.com/search?q={quote_plus(keyword)}&hl=en&gl=us"


    def crawl(url):
        params = {"token": API_TOKEN, "url": url, "javascript": "true"}
        response = requests.get(API_ENDPOINT, params=params)
        response.raise_for_status()
        data = json.loads(response.text)
        if data["original_status"] != 200:
            raise Exception(f"Unable to crawl '{url}'")
        return data["body"]


    if __name__ == "__main__":
        html = crawl(search_url("web scraping api"))
        print(html[:500])
The API returns a JSON envelope, so you load the response with json.loads and read two fields: original_status is the status Google itself returned, and body is the rendered page HTML. Guarding on original_status means a CAPTCHA page or a block surfaces as an exception instead of feeding garbage into the parser. The search_url() helper URL-encodes the keyword and pins the language and region with hl=en and gl=us, so the results are consistent run to run. Save the script as crawling.py, run it with python crawling.py, and you should see real SERP markup in the first 500 characters, which confirms the fetch works before you write a single selector.
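The envelope handling can also be pulled into a small function that is easy to test without a network call. The sketch below uses the two fields named above, original_status and body. It additionally tolerates a status that arrives as a string such as "200", which is an assumption made for robustness here, not documented behavior, and parse_envelope is a hypothetical helper name.

```python
import json


def parse_envelope(text):
    """Return the page HTML from a JSON envelope, or raise if the fetch failed.

    Accepting both 200 and "200" is a defensive assumption for this sketch.
    """
    data = json.loads(text)
    status = int(data.get("original_status") or 0)
    if status != 200:
        raise RuntimeError(f"Target page returned status {status}")
    return data["body"]


if __name__ == "__main__":
    sample = json.dumps({"original_status": 200, "body": "<html>ok</html>"})
    print(parse_envelope(sample))
```

Keeping the parsing separate from the HTTP call means you can feed it saved responses when debugging a run.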
When that original_status check reads 200, it is because the request reached Google as a real visitor in the first place, with any CAPTCHA already handled. The Crawling API fetches each SERP from a rotating IP, renders the JavaScript that fills in the PAA box and related searches, and hands you finished HTML, so you skip running a headless browser fleet and sourcing a residential proxy pool yourself. Point it at a public search URL on the free tier first.
Step 2: Parse the SEO fields with BeautifulSoup
With HTML in hand, load it into BeautifulSoup and pull each result by its selector. Google wraps each organic result in a container with the title in an h3 tag, the destination in the anchor around it, and the snippet in a description block beside it. The PAA questions and related searches sit lower in their own blocks. Inspect a live SERP in your browser's dev tools (right-click, then Inspect) to confirm the current class names; the selectors below match the layout at the time of writing.
    from urllib.parse import urlparse
    from bs4 import BeautifulSoup


    def scrape_serp(html, keyword):
        soup = BeautifulSoup(html, "html.parser")
        results = []
        for position, block in enumerate(soup.select("div.g"), start=1):
            heading = block.select_one("h3")
            link = block.select_one("a[href]")
            snippet = block.select_one("div.VwiC3b")
            if not heading or not link:
                continue
            url = link["href"]
            results.append({
                "keyword": keyword,
                "position": position,
                "title": heading.get_text(strip=True),
                "url": url,
                "domain": urlparse(url).netloc,
                "snippet": snippet.get_text(strip=True) if snippet else None,
            })
        paa = [
            q.get_text(strip=True)
            for q in soup.select("div.related-question-pair")
            if q.get_text(strip=True)
        ]
        related = [
            r.get_text(strip=True)
            for r in soup.select("div.y6Uyqe div.b2Rnsc, a.k8XOCe")
            if r.get_text(strip=True)
        ]
        return {
            "keyword": keyword,
            "results": results,
            "people_also_ask": paa,
            "related_searches": related,
        }
The selector div.g is the wrapper Google uses for each organic result, with the headline in an h3 tag, the destination in the anchor around it, and the snippet in div.VwiC3b. enumerate(..., start=1) gives you the rank for free, so position comes from page order instead of a fragile attribute. urlparse(url).netloc derives the domain from each link, the key field for competitor analysis later. The if not heading or not link: continue guard keeps ads and stray markup out of your output. PAA questions come from div.related-question-pair, and related searches from the suggestion block at the foot of the page.
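One subtlety in the loop: enumerate numbers every div.g block, including any block the guard then skips, so the recorded positions can contain gaps (for example 1, 3, 4). If you want contiguous ranks, one option is to renumber the kept results afterwards. This is a minimal sketch, and renumber_positions is a hypothetical helper, not part of the workflow above.

```python
def renumber_positions(results):
    """Return copies of the result rows with contiguous 1-based positions.

    Assumes the rows belong to a single SERP and are already in page order.
    """
    return [
        {**row, "position": rank}
        for rank, row in enumerate(results, start=1)
    ]


if __name__ == "__main__":
    rows = [
        {"title": "first", "position": 1},
        {"title": "second", "position": 3},  # position 2 was a skipped block
    ]
    print(renumber_positions(rows))
```

Whether a gap matters depends on the use: for week-over-week comparison, consistent numbering matters more than which scheme you choose.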
Google rotates its result-container class names, such as VwiC3b and related-question-pair, when it redeploys its front end, sometimes A/B testing several layouts at once. Treat the selectors above as a starting template, not a contract. When a field comes back empty for every result, re-inspect a live SERP in your browser's dev tools and update the selector. Periodic selector maintenance is normal for any production scraper, not a sign something is broken.
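The "empty for every result" signal can be checked automatically after each run. The sketch below reports which fields are missing or empty across all rows of a SERP. The helper name empty_fields and its default field list are assumptions based on the record shape used in this guide.

```python
def empty_fields(results, fields=("title", "url", "domain", "snippet")):
    """List the fields that are missing or empty in every result row.

    An empty result list reports every field, since nothing was extracted.
    """
    if not results:
        return list(fields)
    return [
        field for field in fields
        if all(not row.get(field) for row in results)
    ]


if __name__ == "__main__":
    rows = [
        {"title": "A", "url": "https://a.example/", "domain": "a.example", "snippet": None},
        {"title": "B", "url": "https://b.example/", "domain": "b.example", "snippet": None},
    ]
    print(empty_fields(rows))
```

A non-empty return value is a hint that a selector has drifted and is worth re-inspecting, not proof of it; some queries legitimately return results without snippets.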
Step 3: Run it across your keywords
Now wire the fetch and the parse into one runnable script that loops over a list of target keywords, collects the structured SEO data for each, and writes everything to JSON. Pacing the loop with a short sleep keeps a multi-keyword run healthy rather than firing requests back to back.
    import json
    import time
    import requests
    from urllib.parse import quote_plus, urlparse
    from bs4 import BeautifulSoup

    API_TOKEN = "YOUR_CRAWLBASE_TOKEN"
    API_ENDPOINT = "https://api.crawlbase.com/"

    KEYWORDS = [
        "web scraping api",
        "how to scrape google",
        "rotating proxy service",
    ]


    def search_url(keyword):
        return f"https://www.google.com/search?q={quote_plus(keyword)}&hl=en&gl=us"


    def crawl(url):
        params = {"token": API_TOKEN, "url": url, "javascript": "true"}
        response = requests.get(API_ENDPOINT, params=params)
        response.raise_for_status()
        data = json.loads(response.text)
        if data["original_status"] != 200:
            raise Exception(f"Unable to crawl '{url}'")
        return data["body"]


    def scrape_serp(html, keyword):
        soup = BeautifulSoup(html, "html.parser")
        results = []
        for position, block in enumerate(soup.select("div.g"), start=1):
            heading = block.select_one("h3")
            link = block.select_one("a[href]")
            snippet = block.select_one("div.VwiC3b")
            if not heading or not link:
                continue
            url = link["href"]
            results.append({
                "keyword": keyword,
                "position": position,
                "title": heading.get_text(strip=True),
                "url": url,
                "domain": urlparse(url).netloc,
                "snippet": snippet.get_text(strip=True) if snippet else None,
            })
        paa = [q.get_text(strip=True)
               for q in soup.select("div.related-question-pair")
               if q.get_text(strip=True)]
        related = [r.get_text(strip=True)
                   for r in soup.select("div.y6Uyqe div.b2Rnsc, a.k8XOCe")
                   if r.get_text(strip=True)]
        return {"keyword": keyword, "results": results,
                "people_also_ask": paa, "related_searches": related}


    def main():
        serps = []
        for keyword in KEYWORDS:
            html = crawl(search_url(keyword))
            serps.append(scrape_serp(html, keyword))
            print(f"Scraped '{keyword}': {len(serps[-1]['results'])} results")
            time.sleep(2)
        with open("google_seo_data.json", "w", encoding="utf-8") as f:
            json.dump(serps, f, ensure_ascii=False, indent=2)


    if __name__ == "__main__":
        main()
Save the full script as main.py and run it with python main.py. It loops over the three sample keywords, fetches and renders each SERP, extracts the organic results plus the PAA and related-search blocks, and writes everything to google_seo_data.json. Swap in your own target keywords and the same two functions handle whatever comes back. For a deeper walkthrough of the scrape itself, including pagination across result pages, see our focused guide on scraping Google search results with Python.
What the output looks like
You get one structured object per keyword: the organic results in rank order, each with its domain derived, plus the PAA questions and related searches Google attached to the query.
    {
      "keyword": "web scraping api",
      "results": [
        {
          "keyword": "web scraping api",
          "position": 1,
          "title": "Crawling API - Crawlbase",
          "url": "https://crawlbase.com/crawling-api-avoid-captchas-blocks",
          "domain": "crawlbase.com",
          "snippet": "Crawl any website with a single API call and skip blocks and CAPTCHAs."
        },
        {
          "keyword": "web scraping api",
          "position": 2,
          "title": "Best Web Scraping APIs Compared",
          "url": "https://example-blog.com/web-scraping-apis",
          "domain": "example-blog.com",
          "snippet": "A side-by-side look at popular scraping APIs and how they handle blocks."
        }
      ],
      "people_also_ask": [
        "What is a web scraping API?",
        "Is using a scraping API legal?"
      ],
      "related_searches": [
        "free web scraping api",
        "web scraping api python"
      ]
    }
This is the raw material for every SEO question you want to ask: rankings sit in position, the competitive landscape in domain, click-through signals in title and snippet, and content ideas in people_also_ask and related_searches.
Analyze the SEO data with pandas
Scraping gives you data; pandas turns it into insight. Load the flat list of organic results into a DataFrame and you can answer the three questions that drive most SEO work: where do I rank, who am I competing with, and what have I not covered. The legacy approach of reading a CSV and counting top domains still works; here we do it straight off the scraped objects.
    import json
    import pandas as pd

    with open("google_seo_data.json", encoding="utf-8") as f:
        serps = json.load(f)

    # Flatten every organic result across all keywords into one table
    rows = [r for serp in serps for r in serp["results"]]
    df = pd.DataFrame(rows)

    # 1. Where do I rank? Find your own domain's position per keyword
    MY_DOMAIN = "crawlbase.com"
    mine = df[df["domain"] == MY_DOMAIN][["keyword", "position", "title"]]
    print("Your rankings:")
    print(mine.to_string(index=False))

    # 2. Who am I competing with? Domains that appear most across SERPs
    top_domains = df["domain"].value_counts().head(10)
    print("\nTop competing domains:")
    print(top_domains)

    # 3. Content gaps: PAA and related searches you have not covered yet
    ideas = []
    for serp in serps:
        for q in serp["people_also_ask"] + serp["related_searches"]:
            ideas.append({"keyword": serp["keyword"], "idea": q})
    gaps = pd.DataFrame(ideas).drop_duplicates()
    print(f"\n{len(gaps)} content ideas from PAA and related searches")
    print(gaps.head(10).to_string(index=False))

    df.to_csv("google_seo_results.csv", index=False)
Three analyses come out of this. The first filters the table to your own domain and prints your position per keyword, which is rank tracking. The second runs value_counts() on the domain column to surface the sites showing up most across your keyword set, the competitors dominating your niche. The third pools every PAA question and related search into a deduplicated list of content ideas: real user queries you can answer to win featured snippets and long-tail traffic. The final line writes the flat results to google_seo_results.csv so you can diff rankings week over week.
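The pooled list becomes a true gap list only once it is compared against the topics your site already covers. The sketch below does that with an exact match after normalizing case and whitespace. The find_gaps helper and the idea of a covered_topics list you maintain yourself are assumptions for illustration, and a real comparison would likely need fuzzier matching than this.

```python
def find_gaps(ideas, covered_topics):
    """Return ideas not present in covered_topics, deduplicated, in input order.

    Matching is exact after lowercasing and trimming, which is a deliberate
    simplification for this sketch.
    """
    covered = {topic.strip().lower() for topic in covered_topics}
    seen = set()
    gaps = []
    for idea in ideas:
        key = idea.strip().lower()
        if key and key not in covered and key not in seen:
            seen.add(key)
            gaps.append(idea.strip())
    return gaps


if __name__ == "__main__":
    pooled = ["What is a web scraping API?", "free web scraping api",
              "What is a web scraping API?"]
    already_covered = ["what is a web scraping api?"]
    print(find_gaps(pooled, already_covered))
```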
Search volume, keyword difficulty, and CPC are not printed on the SERP, so this workflow does not invent them. Pull those from a keyword tool such as Google Keyword Planner, Ahrefs, or SEMrush, then merge them into the DataFrame on the keyword column to combine ranking position with opportunity size. For more on turning scraped signals into strategy, see our guide on using data to improve SEO.
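The merge described above might look like the following minimal sketch. Every value in both tables is a made-up placeholder, and the column names search_volume and cpc are assumptions about what a keyword-tool export could contain, so rename them to match your actual file.

```python
import pandas as pd

# Placeholder ranking rows in the same shape as the scraped results.
rankings = pd.DataFrame([
    {"keyword": "web scraping api", "domain": "crawlbase.com", "position": 1},
    {"keyword": "how to scrape google", "domain": "crawlbase.com", "position": 4},
])

# Hypothetical keyword-tool export; figures are placeholders, not real data.
metrics = pd.DataFrame([
    {"keyword": "web scraping api", "search_volume": 1000, "cpc": 2.5},
])

# A left join keeps every ranking row, even when the tool has no data for it.
merged = rankings.merge(metrics, on="keyword", how="left")
print(merged.to_string(index=False))
```

Using how="left" means keywords missing from the export show up with empty metric cells instead of silently dropping out of the table.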
Scaling across keywords and competitors
Three keywords is a demo; a real job tracks dozens or hundreds, sometimes against the ad results too. The shape stays the same: extend the KEYWORDS list, fetch each SERP through the Crawling API, and parse it with the same function. A few habits keep a larger run healthy.
- Pace your requests. Keep the time.sleep() between keywords and avoid running hundreds of searches in a tight loop. Spreading the work out is the single biggest factor in staying unblocked.
- Retry blocked URLs cheaply. Any 5XX response from the API is free of charge, so retrying a keyword that came back blocked or unavailable costs you nothing.
- Re-inspect when fields go empty. Google changes its markup often. If a column comes back all null, open a live SERP in dev tools and update the selector.
- Schedule re-runs. Rankings move, so run the same keyword set on a weekly cadence and store each run with a date column to chart position over time.
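The pacing and retry habits above can be combined in a small wrapper. This is a sketch under stated assumptions: fetch_with_retry is a hypothetical helper, it accepts any fetch callable (such as the crawl() function from this guide) that raises on failure, and the attempt count and delays are illustrative defaults, not recommended values.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url), retrying on any exception with a doubling delay.

    The sleep parameter is injectable so the function can be tested
    without actually waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as error:  # crawl() raises a plain Exception on failure
            last_error = error
            if attempt < attempts - 1:
                # Wait base_delay, then 2x, then 4x, ... between attempts.
                sleep(base_delay * (2 ** attempt))
    raise last_error


if __name__ == "__main__":
    print(fetch_with_retry(lambda url: f"fetched {url}", "https://example.com"))
```

In the main loop, html = fetch_with_retry(crawl, search_url(keyword)) would replace the direct crawl() call, with the existing time.sleep(2) between keywords left in place.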
If you also want to study the paid side of the SERP, the same fetch-and-parse pattern applies to sponsored results; our guide on analyzing competitor Google Ads walks through that. For the broader anti-block playbook, see how to scrape websites without getting blocked.
Is it legal to scrape Google SEO data?
Whether scraping Google is allowed depends on Google's terms of service, your jurisdiction, and what you do with the data. Google's terms place limits on automated access to its search service, so scraping can run against those terms regardless of how careful your tooling is. None of the code here changes that; it just makes the technical part work. Read Google's terms and its robots.txt, and treat both as the boundary for what you collect.
A few lines worth holding to. Collect only public search-results data: the rankings, titles, snippets, PAA questions, and related searches that anyone can see on a results page without an account. Keep your request volume low enough that you are not straining Google's servers, and pace your crawl rather than running it flat out. The titles and snippets on the page are Google's rendering of other sites' content, so use them as analytical signals, not as media to republish wholesale.
This workflow is deliberately scoped to public SERP data because that is the line that keeps the work defensible. It does not cover anything behind a login, account or personal data, or copyrighted media pulled from the linked destinations. For sanctioned, high-volume access to Google data, Google offers official products such as the Custom Search JSON API and, for ad data, the Google Ads API; those are the correct path when a project outgrows public-SERP analysis. Use a scraper to read what is on the page, not to reach past it.
Key takeaways
- SEO lives on the SERP. Rankings, titles, snippets, PAA questions, and related searches are all public on the results page, and together they map your position, your competitors, and your content gaps.
- A plain request gets blocked. Google challenges datacenter IPs with CAPTCHAs and renders features in JavaScript, so you need a trusted IP plus rendering, which the Crawling API handles in one call.
- BeautifulSoup does the extraction. Select each div.g for organic results, derive the domain from each URL, and pull PAA and related searches from their own blocks; expect the class names to drift.
- pandas turns data into insight. Flatten the results into a DataFrame, filter to your domain for rank tracking, run value_counts() on domains for competitors, and pool PAA plus related searches for content ideas.
- Stay on public data. Respect Google's ToS and robots.txt, keep request volume low, pull search volume and CPC from a keyword tool, and use Google's official APIs for sanctioned high-volume access.
Frequently Asked Questions (FAQs)
What SEO data can I extract from Google search results?
You can pull organic search rankings, meta titles, meta descriptions (snippets), result URLs and their domains, "People also ask" questions, and related searches. Those cover rank tracking, competitor analysis, and content ideas. Search volume and CPC are not on the page, so pull those from a keyword tool such as Google Keyword Planner, Ahrefs, or SEMrush and merge them in.
How do I scrape Google search results without getting blocked?
Google challenges automated traffic with CAPTCHAs and rate limits, especially from datacenter IPs. Rotate IPs, pace your requests, render JavaScript so script-loaded features appear, and let a managed service handle the CAPTCHA side. Fetching through the Crawling API, which rotates IPs and solves CAPTCHAs server-side, makes each request look like an ordinary visitor so you get the real SERP back.
Which tools do I need to extract and analyze Google SEO data?
Three tools cover it: the Crawling API to fetch and render each SERP without getting blocked, BeautifulSoup to parse the organic results and SERP features out of the HTML, and pandas to load everything into a DataFrame for ranking, competitor, and content-gap analysis. A keyword tool such as Ahrefs or SEMrush fills in volume and difficulty.
How do I find content gaps from a SERP?
Collect the "People also ask" questions and related searches for each target keyword, pool them into one deduplicated list, and compare them against the topics your site already covers. The questions that are left are real user queries you have not answered yet, which is exactly where new content and featured-snippet opportunities come from.
Can I track ranking changes over time?
Yes. Run the same keyword set on a schedule, weekly is common, and write each run to a CSV with a date column. Filtering the combined table to your own domain gives you a position history per keyword that you can chart with pandas or a spreadsheet to see whether your SEO work is moving the needle.
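Writing each run with a date column could be sketched as below, using only the standard library. The append_run helper, the rank_history.csv file name, and the run_date column name are assumptions for this example. The remaining columns mirror the result records produced earlier in this guide.

```python
import csv
import os
from datetime import date

FIELDS = ["run_date", "keyword", "position", "title", "url", "domain", "snippet"]


def append_run(rows, path="rank_history.csv", run_date=None):
    """Append result rows to a CSV, stamping each with the run date."""
    run_date = run_date or date.today().isoformat()
    # Write the header only when the file is new or empty.
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        if write_header:
            writer.writeheader()
        for row in rows:
            writer.writerow({**row, "run_date": run_date})
    return run_date
```

Because every run appends to the same file, pd.read_csv on it yields the combined table, which you can then filter to your own domain and chart by run_date.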
Does Google offer an official API for this data?
Google provides official products for some of it, such as the Custom Search JSON API for programmatic search and the Google Ads API for advertising data. They are the sanctioned path for high-volume or commercial use. For the public SERP signals this guide covers, scraping the rendered results page through the Crawling API is a practical option as long as you stay within Google's terms and keep to public data.
