GitHub is one of the richest public datasets in software. Public repository pages carry a project's name, description, star and fork counts, primary language, and topics, while public profile pages summarize a developer's public name, bio, repository count, and follower count. That data drives a lot of legitimate work: tracking the popularity of open-source projects, surveying which languages and frameworks are gaining traction, and building dashboards of the libraries a team depends on.
This guide shows you how to scrape public GitHub repositories and profiles with Python through the Crawlbase Crawling API, parse the fields that matter, and export them to JSON and CSV. Everything here is scoped to public pages that anyone can open without logging in. It does not touch private repositories, organization member lists, email addresses, or anything behind authentication. Read the legality section near the end before you point this at anything real, and note up front that GitHub offers an official REST API that is the better tool for most of these jobs.
What you will build
A small Python script that takes a public GitHub repository or profile URL, fetches the page through the Crawling API, parses it with BeautifulSoup, and writes structured records to JSON and CSV. The fields it pulls:
- Repository name: the project name shown in the repo header.
- Description: the one-line summary in the sidebar.
- Stars: the public star count.
- Forks: the public fork count.
- Watchers: the count of users tracking the repo.
- Language and topics: the primary language and the repository's topic tags.
- Profile fields for a user URL: public name, bio, public repository count, and follower count.
Notice what is deliberately absent: no email addresses, no private repositories, no member lists of private organizations, and no attempt to build a dossier on any individual. Profile data describes real people, so the script treats it as personal data and keeps to coarse public fields.
Why a plain request can fail on GitHub
GitHub serves most of its repository and profile content as server-rendered HTML, so a plain request often returns usable markup. The friction shows up at volume. GitHub rate-limits unauthenticated traffic aggressively, and a tight loop from a single datacenter IP gets throttled or challenged quickly. Anonymous browsing also gives you a thinner page than a signed-in session, and the markup shifts between logged-in and logged-out views, which breaks brittle selectors.
So a reliable GitHub scraper needs requests that look like ordinary visits and are spread across many IP addresses so no single one trips a limit. You can build that yourself with a pool of rotating proxies and your own retry logic, but keeping that stack healthy is most of the work. The Crawling API folds it into one call: you send a URL, it fetches the page behind a trusted, rotating IP, and it returns finished HTML you can parse.
Crawlbase offers two token types. The normal token fetches static HTML; the JavaScript (JS) token renders the page in a real browser first. GitHub repository and profile pages are server-rendered, so the normal token is enough and costs less. Reach for the JS token only if a specific page you need depends on client-side rendering.
Prerequisites
A few things to have in place first. None take long.
Basic Python. You should be comfortable running a script and installing packages with pip. If parsing HTML is new to you, our primer on how to use BeautifulSoup in Python covers the extraction side, and scraping a website with Python walks the end-to-end flow.
Python 3.8 or later. Confirm with python --version. If you do not have it, install it from python.org.
A Crawlbase account and token. Sign up, open your dashboard, and copy your normal token from the account docs page. Crawlbase includes 1,000 free requests to start, and you pay only for successful requests. Treat the token like a password: keep it out of version control.
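One common way to keep the token out of version control is to read it from an environment variable instead of pasting it into the script. The sketch below assumes a variable named CRAWLBASE_TOKEN; that name and the load_token helper are choices made for this example, not something the Crawlbase client looks for on its own.

```python
import os


def load_token(env_var="CRAWLBASE_TOKEN"):
    """Return the token stored in the given environment variable.

    CRAWLBASE_TOKEN is a name picked for this example; the Crawlbase
    client does not read it automatically.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return token
```

With this in place, the client can be created as CrawlingAPI({"token": load_token()}) and the script fails early with a clear message when the variable is missing.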
Set up the project
Create an isolated virtual environment, then install the three libraries the scraper needs.
    python --version
    python -m venv github_env
    source github_env/bin/activate
    pip install crawlbase beautifulsoup4 pandas
On Windows, activate with github_env\Scripts\activate instead of the source line. Three dependencies do the work: crawlbase is the official client for the Crawling API, beautifulsoup4 parses the returned HTML so you can pull fields by selector, and pandas turns the records into a CSV at the end.
Step 1: Fetch a public repository page
Start by getting the finished page. Import CrawlingAPI, initialize it with your token, and request a public repository URL. Check the status code before parsing so failures stay loud instead of silent.
    from crawlbase import CrawlingAPI

    api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})


    def crawl(page_url):
        response = api.get(page_url)
        if response["status_code"] == 200:
            return response["body"].decode("latin1")
        print(f"Request failed: {response['status_code']}")
        return None


    if __name__ == "__main__":
        page_url = "https://github.com/TheAlgorithms/Java"
        html = crawl(page_url)
        print(html[:500] if html else "No HTML returned")
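The body is decoded as latin1 because that codec maps every byte to a character, so it never raises on the occasional non-UTF-8 byte in a repository's rendered HTML. The trade-off is that multi-byte UTF-8 characters such as accented letters or emoji come out garbled; if you need them intact, decode as UTF-8 with errors="replace" instead. The example points at a well-known public repository so you can confirm the fetch works before you write a single selector. Run it and you should see real GitHub markup in the first 500 characters, which tells you the request reached the page behind a trusted IP.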
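To see what the latin1 choice does to non-ASCII text, here is a small sketch that decodes the same bytes both ways. The decode_body helper is hypothetical and not part of the script above; it shows one alternative that also never raises but keeps valid UTF-8 intact.

```python
def decode_body(raw: bytes) -> str:
    """Decode as UTF-8, replacing undecodable bytes with U+FFFD instead of raising."""
    return raw.decode("utf-8", errors="replace")


sample = "café".encode("utf-8")      # the é is stored as two bytes
as_latin1 = sample.decode("latin1")  # never fails, but turns those two bytes into two characters
as_utf8 = decode_body(sample)        # keeps the é as a single character
```

If the descriptions or bios you collect contain accented characters or emoji, swapping decode_body into crawl() is one way to keep them readable.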
The api.get call above is doing more than an HTTP request. GitHub throttles unauthenticated traffic and a single datacenter IP gets limited fast, so the Crawling API fetches each page behind a rotating residential IP and handles retries and CAPTCHAs for you. You skip running a proxy pool and the back-off logic that goes with it. Point it at a public repo on the free tier first.
Step 2: Parse the repository fields
With rendered HTML in hand, load it into BeautifulSoup and pull the repository fields. GitHub's repo header exposes the name through an itemprop attribute, the description sits in the sidebar, and the star, fork, and watcher counts sit next to their Octicon SVG icons, which makes those icons reliable anchors for the numbers beside them. Topics are tagged links, and the primary language shows up in the languages list.
    from bs4 import BeautifulSoup


    def text_of(soup, selector):
        el = soup.select_one(selector)
        return el.text.strip() if el else None


    def scrape_repository(html):
        soup = BeautifulSoup(html, "html.parser")
        topics = [t.text.strip() for t in soup.select('a[data-octo-click="topic_click"]')]
        return {
            "name": text_of(soup, 'strong[itemprop="name"] a'),
            "description": text_of(soup, "div.Layout-sidebar div.BorderGrid-row p.f4.my-3"),
            "stars": text_of(soup, "svg.octicon-star ~ strong"),
            "forks": text_of(soup, "svg.octicon-repo-forked ~ strong"),
            "watchers": text_of(soup, "svg.octicon-eye ~ strong"),
            "language": text_of(soup, 'span[itemprop="programmingLanguage"]'),
            "topics": topics,
        }
The text_of helper returns None when a selector misses, so one absent field never crashes the whole parse. The star, fork, and watcher selectors use the Octicon icon class as an anchor and a sibling combinator (~ strong) to grab the count rendered next to it, which is sturdier than depending on a deeply nested class chain. Topics are collected from every topic_click link into a list.
GitHub revises its markup periodically, so a selector that works today can return None later. When a field comes back empty, open the live page in your browser's dev tools and update the selector. Anchoring on stable hooks like itemprop and the Octicon icon classes, rather than auto-generated utility classes, keeps maintenance to a minimum.
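Because a stale selector fails quietly by returning None, it can help to check each parsed record and report which fields came back empty. The following is a minimal sketch of that check; missing_fields and the REPO_FIELDS tuple are names introduced for this example.

```python
# Fields the repository parser is expected to fill (illustrative list).
REPO_FIELDS = ("name", "stars", "forks", "watchers", "language")


def missing_fields(record, required=REPO_FIELDS):
    """Return the required keys whose value is None, an empty string, or an empty list."""
    return [key for key in required if record.get(key) in (None, "", [])]
```

Calling missing_fields(scrape_repository(html)) after each parse and logging a non-empty result gives an early signal that a selector needs updating. Description and topics are left out of the default list on the assumption that a repository can legitimately have neither.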
Step 3: Parse a public profile page
A public profile page carries a different set of fields. From it you can pull the user's public display name, their nickname (the handle), the bio, the public repository count, and the follower count. GitHub marks the display name and nickname with stable vcard classes, and the repository count and follower count sit next to their own Octicon icons, the same pattern as the repo page.
    def scrape_profile(html):
        soup = BeautifulSoup(html, "html.parser")
        return {
            "name": text_of(soup, "span.p-name.vcard-fullname"),
            "username": text_of(soup, "span.p-nickname.vcard-username"),
            "bio": text_of(soup, "div.p-note.user-profile-bio div"),
            "repositories": text_of(soup, "svg.octicon-repo ~ span"),
            "followers": text_of(soup, "svg.octicon-people ~ span.color-fg-default"),
        }
These are the coarse public fields a profile shows to any logged-out visitor. The script stops there on purpose. It does not read a user's email, their organization memberships, or the contents of their repositories, and it does not stitch profiles together into a record about a person. Public name, bio, repo count, and follower count are aggregate signals about a developer's public footprint; the individual behind them is not yours to profile.
Step 4: Put it together and export
Now wire fetch and parse into one runnable script that reads a repository and a profile, then writes both JSON and CSV with pandas.
    import json
    import time

    import pandas as pd
    from crawlbase import CrawlingAPI
    from bs4 import BeautifulSoup

    api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

    # Paste text_of, scrape_repository, and scrape_profile from Steps 2 and 3 here.


    def crawl(page_url):
        response = api.get(page_url)
        if response["status_code"] == 200:
            return response["body"].decode("latin1")
        print(f"Request failed: {response['status_code']}")
        return None


    def main():
        repo_url = "https://github.com/TheAlgorithms/Java"
        profile_url = "https://github.com/torvalds"
        records = []

        repo_html = crawl(repo_url)
        if repo_html:
            repo = scrape_repository(repo_html)
            repo["url"] = repo_url
            records.append(repo)

        # Pause between requests to stay under rate limits.
        time.sleep(3)

        profile_html = crawl(profile_url)
        if profile_html:
            profile = scrape_profile(profile_html)
            profile["url"] = profile_url
            records.append(profile)

        # ensure_ascii=False writes non-ASCII text as-is, so open the file as UTF-8.
        with open("github_data.json", "w", encoding="utf-8") as f:
            json.dump(records, f, indent=2, ensure_ascii=False)

        pd.DataFrame(records).to_csv("github_data.csv", index=False)
        print(f"Wrote {len(records)} records to JSON and CSV")


    if __name__ == "__main__":
        main()
The time.sleep(3) between requests is not decoration. Pacing is the single biggest factor in whether a run stays healthy on a rate-limited target like GitHub. The script collects a repository record and a profile record into one list, writes the structured result to github_data.json, and lets pandas flatten the same records into github_data.csv for a spreadsheet. The topics list serializes cleanly to JSON and lands in the CSV as the list's string representation. Because repository and profile records have different keys, the CSV carries the union of the columns and leaves empty the cells a record does not have.
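If you would rather have topics appear in the CSV as a plain delimited string, you can flatten list values before export. The sketch below uses the standard-library csv module instead of pandas so that it stands alone; flatten_record, records_to_csv, and the semicolon separator are choices made for this example.

```python
import csv
import io


def flatten_record(record, sep=";"):
    """Copy a record, joining any list of strings into one delimited string."""
    return {
        key: sep.join(value) if isinstance(value, list) else value
        for key, value in record.items()
    }


def records_to_csv(records):
    """Render records with differing keys as CSV text, using the union of their keys."""
    rows = [flatten_record(r) for r in records]
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)  # keep first-seen column order
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

The same flatten_record step also works with the pandas export: build the DataFrame from the flattened records instead of the raw ones.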
What the output looks like
Run the full script and you get a clean record of public fields, ready to load into a notebook, a database, or a spreadsheet.
    [
      {
        "name": "Java",
        "description": "All Algorithms implemented in Java",
        "stars": "59.1k",
        "forks": "19.5k",
        "watchers": "1.3k",
        "language": "Java",
        "topics": ["algorithms", "java", "data-structures"],
        "url": "https://github.com/TheAlgorithms/Java"
      },
      {
        "name": "Linus Torvalds",
        "username": "torvalds",
        "bio": null,
        "repositories": "8",
        "followers": "219k",
        "url": "https://github.com/torvalds"
      }
    ]
The exact star and follower formatting (59.1k, 219k) comes straight from GitHub's rendered counts. If you want raw integers, the precise value is usually in the element's title attribute; read that instead of the visible text when you need to do math on the numbers.
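When the title attribute is not available, or an approximate number is good enough, the abbreviated text can be converted directly. This is a rough sketch under the assumption that counts use only "k" and "m" suffixes and optional thousands commas; parse_count is a hypothetical helper, and the result is only as precise as the rounded text (59.1k becomes 59100, not the exact star count).

```python
from decimal import Decimal, InvalidOperation

# Suffixes assumed for this example.
_SUFFIXES = {"k": 1_000, "m": 1_000_000}


def parse_count(text):
    """Convert '59.1k', '1,234' or '8' to an int; return None if it does not parse."""
    if not text:
        return None
    cleaned = text.strip().lower().replace(",", "")
    multiplier = 1
    if cleaned and cleaned[-1] in _SUFFIXES:
        multiplier = _SUFFIXES[cleaned[-1]]
        cleaned = cleaned[:-1]
    try:
        # Decimal avoids binary floating-point rounding on values like 59.1.
        return int(Decimal(cleaned) * multiplier)
    except (InvalidOperation, ValueError, OverflowError):
        return None
```

Applying parse_count to the stars, forks, watchers, and followers fields before export gives numeric columns you can sort and sum.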
Scaling to many repositories
The single-page script generalizes cleanly. To survey a set of projects, keep a list of repository URLs and loop the same scrape_repository call over them, accumulating records before you export once at the end.
    repo_urls = [
        "https://github.com/TheAlgorithms/Java",
        "https://github.com/pallets/flask",
        "https://github.com/psf/requests",
    ]

    records = []
    for url in repo_urls:
        html = crawl(url)
        if html:
            record = scrape_repository(html)
            record["url"] = url
            records.append(record)
        time.sleep(3)
Keep the delay between requests, watch the status codes, and stop when you have what you need rather than crawling exhaustively. For the broader playbook on staying healthy against rate limits, see how to scrape websites without getting blocked. If you would rather route your own traffic through a rotating pool instead of using the managed API, the Smart AI Proxy gives you the same residential rotation as a drop-in proxy endpoint, and our roundup of the top open-source scraping libraries covers parser and crawler choices if you want to assemble your own stack.
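The advice to keep the delay and watch status codes can be combined into one loop that pauses between URLs and waits longer after each failure. Here is a minimal sketch, assuming a fetch function shaped like the crawl() helper in this guide (it returns HTML on success and None on failure); crawl_all and its parameters are names invented for this example.

```python
import time


def crawl_all(urls, fetch, delay=3.0, max_retries=2, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests and backing off on failure.

    fetch(url) should return the page HTML, or None when the request failed.
    """
    pages = {}
    for index, url in enumerate(urls):
        if index:
            sleep(delay)  # fixed pause between different URLs
        for attempt in range(max_retries + 1):
            html = fetch(url)
            if html is not None:
                pages[url] = html
                break
            if attempt < max_retries:
                # Wait 2x, then 4x, ... the base delay before retrying.
                sleep(delay * 2 ** (attempt + 1))
    return pages
```

Called as crawl_all(repo_urls, crawl), it returns a dictionary of URL to HTML for the pages that succeeded, and URLs that kept failing are simply absent. The sleep parameter exists so the pacing can be replaced in tests.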
Is it legal to scrape GitHub?
This is the section to read before you write production code. Scraping public GitHub pages for personal or educational use is generally defensible, because the data is published for anyone to read without logging in. That does not make it unconditional. GitHub's Acceptable Use Policies govern automated access, and its robots.txt tells crawlers which paths are off limits. Read both and treat them as the boundary. Never touch private repositories, login-walled content, or anything you would need credentials to reach, and do not hammer the site at a rate that degrades it for others.
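Checking a URL against robots.txt can be automated with Python's standard urllib.robotparser module. The sketch below parses rules from a string so it runs offline; the SAMPLE_ROBOTS rules are made up for illustration and are not GitHub's actual rules, and build_robots_checker is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt, user_agent="*"):
    """Return a function that reports whether a URL is allowed by the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed


# Made-up rules for illustration only; these are NOT GitHub's real rules.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /search
Allow: /
"""

allowed = build_robots_checker(SAMPLE_ROBOTS)
```

For a real run, RobotFileParser can download the live file itself through set_url() and read(); skip any URL for which the check returns False.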
Profile data deserves extra care, because it describes real people. A public name, bio, and follower count are personal data, and in many jurisdictions privacy laws such as the GDPR and CCPA apply the moment you collect and store information about identifiable individuals, even when that information is public. That means having a lawful basis for what you collect, keeping only what you need, and honoring deletion requests. Aggregate where you can (counts and trends across many repos) rather than building dossiers on named developers, and never republish an individual's details or stitch their footprint together into a profile of the person.
For most jobs, the better tool is the official GitHub REST API. It is generous, free for normal use, and gives you clean, structured JSON for repositories, users, stars, forks, languages, and topics without parsing any HTML. It is the sanctioned path, it survives markup changes, and it comes with documented rate limits you can plan around. Reach for scraping only when a specific public page carries something the API does not expose, and keep that work small, paced, and scoped to public, non-sensitive fields. If your project needs GitHub data at any real scale, start with the REST API, not a scraper.
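For comparison, here is a rough sketch of collecting the same repository fields through the REST API with only the standard library. It assumes unauthenticated access to the public repository endpoint, which is subject to GitHub's documented rate limits; the helper names and the User-Agent string are placeholders chosen for this example.

```python
import json
import urllib.request
from urllib.parse import urlparse


def repo_api_url(repo_url):
    """Map https://github.com/OWNER/REPO to its REST API endpoint."""
    parts = [p for p in urlparse(repo_url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"Not a repository URL: {repo_url}")
    owner, repo = parts[0], parts[1]
    return f"https://api.github.com/repos/{owner}/{repo}"


def summarize_repo(payload):
    """Pick the fields this guide collects out of a repository API response."""
    return {
        "name": payload.get("name"),
        "description": payload.get("description"),
        "stars": payload.get("stargazers_count"),
        "forks": payload.get("forks_count"),
        "watchers": payload.get("subscribers_count"),
        "language": payload.get("language"),
        "topics": payload.get("topics", []),
    }


def fetch_repo(repo_url, timeout=10):
    """Call the API without authentication and return the summarized record."""
    request = urllib.request.Request(
        repo_api_url(repo_url),
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "repo-survey-example",  # placeholder identifier
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return summarize_repo(json.load(response))
```

The counts arrive as integers, so there is no "59.1k" text to convert, and no selectors to maintain when the page markup changes.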
Key takeaways
- GitHub is server-rendered but rate-limited. A plain request returns markup, but unauthenticated traffic from one IP gets throttled fast, so route requests through rotating IPs.
- The normal token is enough. Repository and profile pages do not need JavaScript rendering, so the cheaper normal token fetches everything you need.
- Anchor on stable hooks. Parse repo fields off itemprop attributes and Octicon icon classes, and profile fields off vcard classes, not auto-generated utility classes.
- Treat profile data as personal data. Pull coarse public fields, aggregate rather than profile individuals, and respect GDPR and CCPA when you store them.
- Prefer the GitHub REST API. It is free, generous, and structured; scrape only the public pages it does not cover, paced and small.
Frequently Asked Questions (FAQs)
Do I need the normal token or the JS token for GitHub?
The normal token. GitHub renders repository and profile pages on the server, so the static HTML already contains the name, description, star and fork counts, language, topics, and the public profile fields. The JS token renders pages in a browser first and costs more, which you only need for the rare GitHub view that depends on client-side rendering.
What GitHub data is safe to scrape?
Public data that any logged-out visitor can see: a public repository's name, description, stars, forks, watchers, primary language, and topics, plus a public profile's name, bio, public repository count, and follower count. Private repositories, organization member lists, email addresses, and anything behind authentication are off limits, both under GitHub's terms and, for personal data, under privacy law.
Should I use the GitHub REST API instead of scraping?
For most jobs, yes. The official GitHub REST API is free for normal use, generous with its rate limits, and returns clean JSON for repositories, users, stars, forks, languages, and topics with no HTML parsing. It is the sanctioned route and it survives markup changes. Reach for scraping only when a specific public page exposes something the API does not, and keep that work small and paced.
How do I avoid getting rate-limited while scraping GitHub?
Keep your per-IP request rate low, add real delays between requests as in the time.sleep(3) above, and route through rotating residential IPs so no single address trips a limit. The Crawling API manages rotation and retries for you. Watch the status codes and back off the moment you start seeing challenges or errors rather than pushing harder.
Why are the star and follower counts strings like "59.1k"?
Because that is the abbreviated text GitHub renders on the page, and the script reads the visible text. When you need exact integers, look at the element's title attribute, which usually holds the precise number, and read that instead of the displayed text before doing any arithmetic.
Can I scrape private repositories or user email addresses?
No, and this guide deliberately does not show how. Private repositories sit behind authentication, and email addresses are personal data that GitHub does not surface to anonymous visitors. Reaching either would mean bypassing access controls or collecting personal data without a lawful basis, both of which run against GitHub's terms and privacy law. For access to accounts or organizations you control, authenticate through the official GitHub REST API.
