Google Scholar is where researchers, students, and academics go to find scholarly articles, conference papers, theses, and citations across nearly every discipline. Its results page is a compact bibliographic record: each listing carries a paper title, the authors, where and when it was published, how many times it has been cited, and a link to the source. That makes Google Scholar a useful starting point for literature reviews, citation analysis, and tracking how a topic evolves over time.
This guide shows you how to scrape Google Scholar results with Python the reliable way. You build a small, runnable scraper that fetches a rendered results page through the Crawling API, parses each result with BeautifulSoup, and exports clean records to JSON and CSV. The whole walkthrough stays scoped to public bibliographic data that anyone can see on a results page, and the legality section near the end is not boilerplate, so read it before you point this at any real volume.
What you will build
A Python script that takes a Google Scholar search URL, retrieves the HTML through the Crawling API, and extracts a structured record for every result on the page. We will use the query "Data Science" as the running example and pull these fields from each result:
- Title: the paper or book title as shown in the listing.
- Authors: the author names parsed from the byline under the title.
- Publication: the journal, conference, publisher, or source the byline names.
- Year: the publication year, where the byline includes one.
- Citations: the citation count from the "Cited by" link under the result.
- Link: the destination URL the result title points to.
Why a plain request fails on Google Scholar
If you fire a bare HTTP request at a Google Scholar results URL from a script, you rarely get the clean page you see in your own browser. Scholar watches closely for automated traffic. Requests that do not look like a real browser get challenged with a CAPTCHA, fed a verification page, or rate-limited after a handful of calls, and a single datacenter IP making repeated queries is an immediate tell. The page also leans on scripting in places, so static fetches can come back missing content that a rendered browser would show.
So a working Scholar scraper needs two things in one request: an IP the platform reads as a real visitor, and, when the page needs it, a browser that renders. You can assemble that yourself with a headless browser plus a pool of rotating residential proxies, but keeping those healthy is most of the work. The Crawling API folds both into a single call: you send it the URL, it fetches from a trusted residential IP and renders when needed, and it returns finished HTML for you to parse.
Google Scholar is one of the quicker targets to start serving CAPTCHAs once it sees a burst of automated queries from one address. The Crawling API rotates through residential IPs and absorbs those challenges server-side, so you do not have to source proxies or solve CAPTCHAs yourself. You can start with 1,000 free requests, no credit card needed.
Prerequisites
You need a few things in place before writing any code. None of them take long.
Basic Python. You should be comfortable writing and running a Python script and installing packages with pip. If BeautifulSoup is new to you, our guide to using BeautifulSoup in Python covers the parsing basics this tutorial assumes.
Python 3.8 or later. Confirm your version with python --version. If you do not have it, install it from python.org or through a distribution like Anaconda.
A Crawlbase account and token. Sign up, open your dashboard, and copy your request token. Crawlbase offers two token types, Normal and JavaScript; the Normal token is the right one for Google Scholar. Your first 1,000 requests are free. Treat the token like a password: it authenticates your requests, so keep it out of version control.
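One way to keep the token out of version control is to read it from an environment variable when the script starts. The sketch below assumes a variable named CRAWLBASE_TOKEN; that name and the load_token helper are choices made for this example, not something Crawlbase requires.

```python
import os


def load_token(env_var="CRAWLBASE_TOKEN"):
    """Return the API token from the environment, or fail with a clear message.

    CRAWLBASE_TOKEN is a hypothetical variable name chosen for this sketch.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before running")
    return token
```

Set the variable once per shell session, then use API_TOKEN = load_token() in place of a hard-coded string.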
Set up the project
Create a virtual environment so project dependencies stay isolated, then install the two libraries the scraper needs.
```bash
python --version
python -m venv scholar_env
source scholar_env/bin/activate
pip install requests beautifulsoup4
```
On Windows, activate the environment with scholar_env\Scripts\activate instead of the source line. Two dependencies do the work: requests sends the HTTP call to the Crawling API, and beautifulsoup4 parses the returned HTML so you can pull out individual fields by CSS selector.
Step 1: Fetch the page through the Crawling API
Start by getting the HTML. Write a small crawl() function that sends your target URL to the Crawling API with your token, checks that the underlying Scholar page came back with a 200 status, and returns the HTML body. Checking the status before you parse keeps failures loud instead of silent.
```python
import json

import requests

API_TOKEN = "YOUR_CRAWLBASE_TOKEN"  # replace with your token
API_ENDPOINT = "https://api.crawlbase.com/"


def crawl(url):
    """Fetch a URL through the Crawling API and return the page HTML."""
    # format=json asks the API for a JSON envelope instead of raw HTML
    params = {"token": API_TOKEN, "url": url, "format": "json"}
    # A generous timeout keeps a slow fetch from hanging the script
    response = requests.get(API_ENDPOINT, params=params, timeout=90)
    response.raise_for_status()
    data = json.loads(response.text)
    # original_status is the status Google Scholar itself returned
    if int(data["original_status"]) != 200:
        raise Exception(f"Unable to crawl '{url}'")
    return data["body"]


if __name__ == "__main__":
    url = "https://scholar.google.com/scholar?q=Data+Science"
    html = crawl(url)
    print(html[:500])
```

With format=json in the request, the API returns a JSON envelope, so you load the response with json.loads and read two fields: original_status is the status Google Scholar itself returned, and body is the page HTML. Guarding on original_status means a CAPTCHA page or a block surfaces as an exception instead of feeding garbage into the parser. The search term rides in the q parameter, which is how Scholar carries the query. Save the code as crawling.py, run it with python crawling.py, and you should see real results markup in the first 500 characters, which confirms the fetch works before you write a single selector.
That original_status check reads 200 only when the request reaches Google Scholar looking like a real visitor. The Crawling API fetches the page from a rotating residential IP, absorbs the CAPTCHA challenges Scholar throws at automated traffic, and renders when the page needs a browser, then hands you finished HTML. You skip running a headless fleet and sourcing a residential proxy pool yourself. Point it at a public results URL on the free tier first.
Step 2: Inspect the result structure
Before parsing, open a Google Scholar results page in your browser, right-click a result, and choose Inspect to see how each listing is built. Every result is wrapped in a div.gs_r element that carries a data-rp attribute holding its position. Inside that wrapper, the title sits in an h3.gs_rt heading with the destination link as the anchor inside it, the byline (authors, publication, year) sits in a div.gs_a element, the snippet lives in div.gs_rs, and the "Cited by" count appears in the footer links under div.gs_fl. These class names are the selectors the parser targets.
The one field that needs a little work is the byline. Google Scholar packs authors, publication, and year into a single gs_a string separated by dashes, for example H Wickham, M Çetinkaya-Rundel - 2023 - books.google.com. We split that string to pull the three pieces apart in the next step.
Step 3: Parse the results with BeautifulSoup
With HTML in hand, load it into BeautifulSoup and pull each result by the selectors from the previous step. The byline parser splits the gs_a text on its dash separators, then scans the middle and end segments for a four-digit year so authors, publication, and year land in their own fields.
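```python
import re

from bs4 import BeautifulSoup

YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def parse_byline(text):
    """Split a gs_a byline into (authors, publication, year)."""
    # gs_a packs "authors - publication, year - host". The separators may use
    # non-breaking spaces, so normalize all whitespace before splitting.
    text = " ".join(text.replace("\xa0", " ").split())
    parts = text.split(" - ")
    authors = parts[0]
    publication = parts[1] if len(parts) > 1 else ""
    year = None
    # Look for the year only in the segments after the authors
    match = YEAR_RE.search(" - ".join(parts[1:]))
    if match:
        year = match.group()
        publication = re.sub(r",?\s*" + year, "", publication).strip()
    # A byline like "authors - 2023 - host" names no venue: fall back to the host
    if not publication and len(parts) > 2:
        publication = parts[-1]
    return authors, publication, year


def parse_citations(result_item):
    """Return the "Cited by" count from the result footer, or 0 if absent."""
    for a in result_item.select("div.gs_fl a"):
        match = re.match(r"Cited by\s+(\d+)", a.get_text(strip=True))
        if match:
            return int(match.group(1))
    return 0


def parse_google_scholar(html):
    """Extract one record per result from a Scholar results page."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for item in soup.select("div.gs_r[data-rp]"):
        heading = item.find("h3", class_="gs_rt")
        if not heading:
            continue  # skip stray blocks without a title
        link = item.select_one("h3.gs_rt > a")
        byline = item.find("div", class_="gs_a")
        authors, publication, year = "", "", None
        if byline:
            # Plain get_text() keeps the spaces around the dash separators
            authors, publication, year = parse_byline(byline.get_text())
        results.append({
            "position": int(item["data-rp"]),
            # Collapse whitespace without gluing adjacent tags together
            "title": " ".join(heading.get_text().split()),
            "authors": authors,
            "publication": publication,
            "year": year,
            "citations": parse_citations(item),
            "link": link.get("href") if link else None,
        })
    return results
```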
The selector div.gs_r[data-rp] matches each result wrapper and skips layout blocks that lack a position. For each one, h3.gs_rt gives the title and the anchor inside it gives the link, div.gs_a feeds the byline parser, and parse_citations walks the footer links in div.gs_fl for the "Cited by" entry, returning 0 when a paper has none. Reading the position straight from data-rp matches the rank Scholar itself assigns. The if not heading: continue guard keeps stray markup out of your output.
Google occasionally changes Scholar's markup. Class names like gs_rt, gs_a, and gs_fl have been stable for a long time, but treat them as a starting template, not a contract. If a field comes back empty for every result, re-inspect a live page in your browser's dev tools and update the selector. Periodic selector maintenance is normal for any production scraper.
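The "empty for every result" check can be automated so a stale selector shows up right after a run. The sketch below is a hypothetical helper (empty_fields is not part of any library) that assumes records shaped like the parser output in this guide. It leaves out citations on purpose, since 0 is a legitimate value there.

```python
def empty_fields(results, fields=("title", "authors", "publication", "year", "link")):
    """Return the fields that are empty in every record (a likely stale selector)."""
    if not results:
        # No records at all: every field is suspect
        return list(fields)
    return [
        name for name in fields
        if all(record.get(name) in (None, "") for record in results)
    ]
```

Call it on the parsed list after each run; any name it returns points at the selector to re-inspect first.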
Step 4: Put it together and export JSON and CSV
Now wire the fetch and the parse into one runnable script, then write the structured output to both JSON and CSV. JSON keeps the nested shape for programmatic use, and CSV drops straight into a spreadsheet or a pandas dataframe for a literature review. Setting ensure_ascii=False keeps author names with accented characters readable in the file.
```python
import csv
import json
from urllib.parse import quote_plus

import requests

from scholar_parser import parse_google_scholar  # the parser from Step 3

API_TOKEN = "YOUR_CRAWLBASE_TOKEN"
API_ENDPOINT = "https://api.crawlbase.com/"
FIELDS = ["position", "title", "authors", "publication", "year", "citations", "link"]


def crawl(url):
    """Fetch a URL through the Crawling API and return the page HTML."""
    # format=json asks the API for a JSON envelope instead of raw HTML
    params = {"token": API_TOKEN, "url": url, "format": "json"}
    response = requests.get(API_ENDPOINT, params=params, timeout=90)
    response.raise_for_status()
    data = json.loads(response.text)
    if int(data["original_status"]) != 200:
        raise Exception(f"Unable to crawl '{url}'")
    return data["body"]


def save_json(results, path="scholar_results.json"):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)


def save_csv(results, path="scholar_results.csv"):
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(results)


def main():
    query = "Data Science"
    # quote_plus turns spaces into "+" and escapes characters such as "&"
    url = f"https://scholar.google.com/scholar?q={quote_plus(query)}"
    html = crawl(url)
    results = parse_google_scholar(html)
    save_json(results)
    save_csv(results)
    print(f"Saved {len(results)} results to JSON and CSV")


if __name__ == "__main__":
    main()
```
Save the Step 3 parser as scholar_parser.py in the same folder so the import resolves, then run the full script with python main.py. It fetches the results page for "Data Science", extracts a record for each listing, and writes both scholar_results.json and scholar_results.csv. The same functions are all you need: swap the query and the parser handles whatever comes back.
What the output looks like
You get an ordered list of result objects, each with the parsed title, authors, publication, year, citation count, and link, ready to write to JSON, CSV, or a database.
```json
[
  {
    "position": 0,
    "title": "[BOOK][B] R for data science",
    "authors": "H Wickham, M Çetinkaya-Rundel, G Grolemund",
    "publication": "books.google.com",
    "year": "2023",
    "citations": 8421,
    "link": "https://books.google.com/books?id=TiLEEAAAQBAJ"
  },
  {
    "position": 1,
    "title": "Data science and its relationship to big data and data-driven decision making",
    "authors": "F Provost, T Fawcett",
    "publication": "Big data",
    "year": "2013",
    "citations": 2510,
    "link": "https://www.liebertpub.com/doi/abs/10.1089/big.2013.1508"
  }
]
```
The CSV mirror carries the same columns, one row per result, with a header line of position,title,authors,publication,year,citations,link. That format is the one most literature-review workflows want, since you can sort by citation count or filter by year right in a spreadsheet.
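If you would rather sort and filter in code than in a spreadsheet, the standard library is enough. The sketch below assumes the scholar_results.csv layout described above; load_results and top_cited are hypothetical helper names chosen for this example.

```python
import csv


def load_results(path="scholar_results.csv"):
    """Read the exported CSV back into dicts with numeric fields restored."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        row["citations"] = int(row["citations"] or 0)
        # An empty year cell means the byline had no year
        row["year"] = int(row["year"]) if row["year"] else None
    return rows


def top_cited(rows, since_year=None, limit=10):
    """Return the most-cited rows, optionally restricted to a minimum year."""
    if since_year is not None:
        rows = [r for r in rows if r["year"] is not None and r["year"] >= since_year]
    return sorted(rows, key=lambda r: r["citations"], reverse=True)[:limit]
```

For example, top_cited(load_results(), since_year=2020) returns up to ten of the most-cited results published in 2020 or later.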
Handling pagination
One query on one page is a demo; a real job runs deeper into the results. Google Scholar paginates with the start query parameter, which is an offset in multiples of 10: start=10 is the second page, start=20 the third, and so on, with ten results per page. The shape stays the same: build each URL, fetch it through the Crawling API, and parse it with the same function. Pace the run with a pause between requests rather than firing them in a tight loop.
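```python
import time

# crawl() and parse_google_scholar() are the functions from the earlier steps


def fetch_paginated_results(base_url, max_pages=5, delay=3):
    """Fetch up to max_pages result pages, pausing between requests."""
    all_results = []
    for page in range(max_pages):
        start = page * 10  # 10 results per page
        url = f"{base_url}&start={start}"
        html = crawl(url)
        page_results = parse_google_scholar(html)
        if not page_results:
            break  # no results on this page, so stop early
        all_results.extend(page_results)
        if page < max_pages - 1:
            time.sleep(delay)  # pace the run between pages
    return all_results


base_url = "https://scholar.google.com/scholar?q=Data+Science"
results = fetch_paginated_results(base_url, max_pages=5)
print(f"Collected {len(results)} results")
```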
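Query terms can contain characters such as & or + that need escaping, so it can help to build each page URL with the standard library's urlencode, which handles the query and the start offset together. This is a minimal sketch; build_search_url and its page_size default are assumptions made for illustration.

```python
from urllib.parse import urlencode

SCHOLAR_SEARCH = "https://scholar.google.com/scholar"


def build_search_url(query, page=0, page_size=10):
    """Build a results URL for a zero-based page number."""
    params = {"q": query}
    if page > 0:
        # start is an offset: page 1 -> 10, page 2 -> 20, ...
        params["start"] = page * page_size
    return f"{SCHOLAR_SEARCH}?{urlencode(params)}"
```

build_search_url("Data Science", page=2) yields the third page of results for the running example.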
Any 5XX response from the Crawling API is free of charge, so retrying a blocked or unavailable URL costs you nothing. If you would rather route your own traffic through a rotating pool instead of using the managed API, the Smart AI Proxy gives you the same residential IP rotation as a drop-in proxy endpoint. To store more than a single run's worth of results, write each page's records into a database such as SQLite as you go, keyed on the title and link, rather than holding everything in memory.
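The database suggestion above can be sketched with the standard library's sqlite3 module. The file name, table name, and column layout here are illustrative assumptions. The composite primary key on title and link makes a repeated result overwrite its earlier row rather than duplicate it.

```python
import sqlite3


def open_store(path="scholar.db"):
    """Open (or create) the results database. Names here are illustrative."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS results (
               title TEXT NOT NULL,
               link TEXT NOT NULL DEFAULT '',
               authors TEXT,
               publication TEXT,
               year TEXT,
               citations INTEGER,
               PRIMARY KEY (title, link)
           )"""
    )
    return conn


def save_page(conn, records):
    """Upsert one page of records keyed on (title, link)."""
    rows = [
        # A missing link is stored as '' so the key still deduplicates
        (r["title"], r.get("link") or "", r.get("authors"),
         r.get("publication"), r.get("year"), r.get("citations", 0))
        for r in records
    ]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO results "
            "(title, link, authors, publication, year, citations) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            rows,
        )
```

Inside a pagination loop, call save_page(conn, page_results) after each parse so the run only ever holds one page in memory.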
Staying unblocked
Even with a trusted IP handled, Google Scholar watches for scraper-shaped traffic, and it reaches for CAPTCHAs faster than most search targets. A few habits keep a run healthy.
- Pace your requests. Hammering result pages in a tight loop is the fastest way to get challenged. Spread requests out and vary your queries instead of paging one term at full speed.
- Lean on rotation. A pool of residential IPs spreads requests across many real-user addresses so no single one trips a limit. The Crawling API handles this for you; if you roll your own stack, this is the part to get right.
- Read the status codes. A run that starts returning CAPTCHA or verification pages is telling you the current rate or IP tier is no longer enough. Treat that as signal to back off, not noise to ignore.
- Re-inspect when fields go empty. Scholar changes its markup occasionally. If results stop parsing, open a live page in dev tools and update the selectors.
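The back-off habit from the list above can be sketched as a small retry wrapper that waits longer after each failure. The function name, attempt count, and delays are illustrative choices, and fetch stands for any callable that raises on a failed or challenged response, such as the crawl() function from Step 1.

```python
import random
import time


def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=5.0, sleep=time.sleep):
    """Call fetch(url), waiting longer after each failure before retrying."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the failure
            # 5s, 10s, 20s, ... plus up to 1s of jitter so retries do not align
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 1)
            sleep(delay)
```

The sleep parameter exists so the wait can be swapped out in tests; in a real run the default time.sleep applies.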
For the broader playbook, see how to scrape websites without getting blocked. If you are scraping the main Google index too, our guide on how to scrape Google search pages covers the regular SERP structure, and the general Python scraping guide walks through the fundamentals this tutorial builds on.
Is it legal to scrape Google Scholar?
Whether scraping Google Scholar is allowed depends on Google's terms of service, your jurisdiction, and what you do with the data. Google Scholar's terms place limits on automated access, and they explicitly discourage scraping for commercial purposes, so an automated collection can run against those terms regardless of how careful your tooling is. None of the code here changes that; it just makes the technical part work. Read Scholar's terms and its robots.txt, and treat both as the boundary for what you collect.
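Reading robots.txt can be done programmatically with Python's urllib.robotparser. The sketch below checks a URL against the text of a robots.txt file you have already downloaded. The sample rules and the example.com URLs are purely illustrative and are not Google Scholar's actual rules, and a robots.txt check does not replace reading the terms of service.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Check a URL against the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative rules only; these are NOT Google Scholar's actual rules.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""
```

With these sample rules, a path under /private/ is refused and any other path is allowed.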
A few lines worth holding to. Collect only public bibliographic data: the titles, authors, publication details, years, citation counts, and links that anyone can see on a results page without an account. Keep your request volume low enough that you are not straining Scholar's servers, and pace your crawl rather than running it flat out. Critically, the listings are metadata about papers, not the papers themselves. Do not use a scraper to pull or redistribute the full text of articles that sit behind a paywall or a publisher's license; that is a separate matter from reading public citation metadata, and it is not covered here.
This guide is deliberately scoped to public results pages because that is the line that keeps the work defensible. Google does not publish a broadly available official Scholar API for this kind of access, so there is no sanctioned high-volume endpoint to fall back on, which is all the more reason to stay modest in scale and respectful of the site's stated rules. If your project needs more than public metadata at low volume, a licensed bibliographic dataset or a publisher's official API is the correct path, not a cleverer scraper.
Key takeaways
- Scholar blocks scraper-shaped traffic fast. It serves CAPTCHAs quickly once it sees automated queries from one IP, so you need a trusted residential address to see real results.
- The Crawling API fetches behind a real IP. Send it the URL, it rotates residential IPs and absorbs CAPTCHAs server-side, then returns finished HTML for you to parse.
- BeautifulSoup does the extraction. Select each div.gs_r[data-rp], then read title, link, and the gs_a byline, and split that byline into authors, publication, and year.
- Citations and pagination are simple. Read the "Cited by" count from div.gs_fl, and walk deeper with the start offset in multiples of 10, pausing between pages.
- Stay on public metadata. Respect Scholar's ToS and robots.txt, keep volume low, and never pull paywalled full text or personal data.
Frequently Asked Questions (FAQs)
Why does a plain request fail or return a CAPTCHA on Google Scholar?
Scholar flags traffic that does not look like a real browser and quickly rate-limits repeated queries from a single datacenter IP, so a bare script tends to hit a CAPTCHA or a verification page instead of the results you see in your own browser. Fetching through the Crawling API, which uses rotating residential IPs and handles CAPTCHA challenges server-side, makes the request look like an ordinary visitor so you get the real results page.
How can I scrape Google Scholar data using Python?
Use the requests library to send your search URL to the Crawling API, then parse the returned HTML with BeautifulSoup. Select each div.gs_r[data-rp] result and read the title from h3.gs_rt, the byline from div.gs_a, and the citation count from div.gs_fl. For the parsing fundamentals, see our BeautifulSoup guide.
What fields can I extract from a Google Scholar result?
This tutorial pulls the title, authors, publication, year, citation count, and link from each result, plus the position from the data-rp attribute. Authors, publication, and year all come from the single gs_a byline, which the parser splits apart. Stay within public bibliographic metadata and avoid pulling paywalled full text.
How do I get the citation count for each paper?
Each result's footer, the div.gs_fl block, contains a "Cited by N" link when the paper has citations. The parser scans those links, finds the one that starts with "Cited by", and reads the number off the end, returning 0 for papers with no citations yet.
How do I paginate through more Google Scholar results?
Use the start query parameter, which is an offset in multiples of 10: start=10 is the second page, start=20 the third, and so on, with ten results per page. Build each page URL with the offset, fetch it through the Crawling API, parse it with the same function, and pause a few seconds between requests so you are pacing the crawl rather than hammering it.
Can I analyze the scraped Google Scholar data afterward?
Yes. Export the results to CSV or JSON, then load them into a tool like pandas for analysis. Because the records carry citation counts and years, you can sort by impact, filter by recency, or chart how a topic's output grows over time, which is exactly what a literature review needs.
