Reddit is one of the largest public discussion archives on the web, and the public listings on a subreddit are a useful signal for research: what topics are trending, how a community ranks links, which external sources get shared, and how scores and comment counts move over time. This guide shows you how to scrape public Reddit data with Python using the Crawlbase Crawling API, with the whole walkthrough scoped to public listing pages only.

To be clear up front: everything here stays on public, aggregate data from public subreddits and topic listings. That means post titles, scores and upvote counts, comment counts, the subreddit a post belongs to, and the link each post points at. It does not cover anything behind a login, private subreddits, direct messages, or the personal data of individual users. Reddit's terms restrict automated access, so read the legality section near the end before you point this at anything real, and prefer the official Reddit API for any production use.

What you will build

A small Python script that takes a public subreddit or topic listing URL, fetches the fully rendered page through the Crawling API with a JavaScript token, and parses a handful of public fields from each post in the listing:

  • Title the public headline text of each post.
  • Score / upvotes the aggregate vote count a post displays.
  • Comment count how many comments the post has, as a number.
  • Subreddit the community the post belongs to (for example r/technology).
  • Link the permalink or outbound URL the post points at.

Notice what is deliberately absent: no usernames, no author profiles, no comment text, no vote-by-vote breakdowns. Those are personal data of individuals, and collecting them is out of scope here on purpose. We aggregate at the level of the post and the community, never the person.

Why a plain request fails on Reddit

Request a Reddit listing URL with a bare HTTP client and you will usually get something close to useless: a JavaScript shell, a cookie consent interstitial, or a challenge page. Reddit's current front end renders post listings client-side, so the titles, scores, and links only appear after the page's scripts run in a browser. On top of that, Reddit flags automated traffic quickly. Datacenter IP ranges, missing browser behavior, and repetitive request patterns get rate-limited or blocked well before the listing ever loads.

So a working Reddit scraper needs two things in the same request: a real browser that renders the page, and an IP address Reddit reads as an ordinary visitor. You can build that yourself with a headless browser and a pool of rotating residential proxies, but keeping that stack healthy is most of the work. The Crawling API folds both into one call. You send it a URL with a JavaScript token, it renders the page behind a trusted residential IP, and it returns finished HTML you can parse. If you want the deeper background, see our guide on how to crawl JavaScript websites.

Why the JS token

Crawlbase offers two token types. The normal token fetches static HTML; the JavaScript (JS) token renders the page in a real browser first. Reddit listings are client-side rendered, so you need the JS token here. The normal token returns the same shell a plain fetch would, with nothing useful to parse out of it.

Prerequisites

A few things to have in place first. None take long.

Basic Python. You should be comfortable running a script and installing packages with pip. If you are new to parsing HTML, our primer on how to use BeautifulSoup in Python covers the extraction side.

Python 3.8 or later. Confirm with python --version. If you do not have it, install it from python.org.

A Crawlbase account and JS token. Sign up, open your dashboard, and copy your JavaScript (JS) token from the account docs page. Treat it like a password: it authenticates your requests, so keep it out of version control. The free tier gives you 1,000 requests to test with.

Set up the project

Create an isolated virtual environment, then install the two libraries the scraper needs.

bash
python --version

python -m venv reddit_env
source reddit_env/bin/activate

pip install crawlbase beautifulsoup4

On Windows, activate with reddit_env\Scripts\activate instead of the source line. Two dependencies do the work: crawlbase is the official client for the Crawling API, and beautifulsoup4 parses the returned HTML so you can pull out individual fields by selector.

Step 1: Fetch the rendered listing

Start by getting the finished page. Import CrawlingAPI, initialize it with your JS token, and request a public listing URL. The legacy version of this tutorial pointed at a public topic listing, https://www.reddit.com/t/technology/, which is a good impersonal target to start with. Check the status code before parsing so failures stay loud instead of silent.

python
from crawlbase import CrawlingAPI

crawlbase_token = "YOUR_CRAWLBASE_TOKEN"
api = CrawlingAPI({"token": crawlbase_token})

def crawl(page_url):
    options = {"ajax_wait": "true", "page_wait": 5000}
    response = api.get(page_url, options)
    if response["status_code"] == 200:
        return response["body"].decode("utf-8")
    print(f"Request failed: {response['status_code']}")
    return None

if __name__ == "__main__":
    listing_url = "https://www.reddit.com/t/technology/"
    html = crawl(listing_url)
    print(html[:500] if html else "No HTML returned")

The two wait options matter for a client-rendered target. ajax_wait tells the API to wait for asynchronous content to finish loading, and page_wait holds for a fixed number of milliseconds after load so late-rendering posts appear before the page is captured. Five seconds is a reasonable starting point; raise it if the listing comes back empty. Run the script and you should see real listing markup, which confirms rendering works before you write a single selector.

Crawlbase Reddit Scraper

That listing only filled in because the page was rendered behind a trusted IP in one call. The Crawling API takes a JS token, runs the page in a real browser, rotates through residential IPs server-side, and hands you finished HTML, so you skip running a headless fleet and a proxy pool yourself. Point it at a public subreddit on the free tier first.

Step 2: Parse the public post fields

With rendered HTML in hand, load it into BeautifulSoup and pull the public fields from each post. Reddit's listing markup groups each post inside a shreddit-post custom element, and the useful values live in that element's attributes rather than in deeply nested, frequently renamed CSS classes. The attributes you want are post-title, score, comment-count, subreddit-prefixed-name, and permalink. Reading attributes is far more durable than chasing rendered widgets.

python
from bs4 import BeautifulSoup

BASE = "https://www.reddit.com"

def to_int(value):
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

def scrape_listing(html):
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for post in soup.select("shreddit-post"):
        permalink = post.get("permalink", "")
        link = f"{BASE}{permalink}" if permalink.startswith("/") else permalink
        posts.append({
            "title": post.get("post-title"),
            "score": to_int(post.get("score")),
            "comment_count": to_int(post.get("comment-count")),
            "subreddit": post.get("subreddit-prefixed-name"),
            "link": link,
        })
    return posts

Each post becomes a flat record of public, aggregate fields. The score and comment-count attributes come back as strings, so to_int coerces them to numbers and returns None when an attribute is missing rather than crashing the run. The permalink attribute is a site-relative path like /r/technology/comments/<id>/<slug>/, so we join it onto the base host to get a full link. Note there is no author field anywhere in this record, by design.

Selectors drift

Reddit changes its markup without notice, which is why this code reads attributes on the shreddit-post element rather than brittle nested classes. If a field comes back as None, re-inspect the live page in your browser's dev tools and update the attribute name. Periodic maintenance is normal for any production scraper, not a sign something is broken. For a refresher on choosing resilient selectors, see scrape websites without getting blocked.

Step 3: Put it together with pagination

A single listing page only shows the first batch of posts. Reddit's older, render-friendly listing endpoints accept an after token for the next page, but the most reliable way to page a JS-rendered listing through the Crawling API is to request the listing's .json companion endpoint, which is public and returns the same posts plus an after cursor. Here we keep it simple and page the public subreddit listing, collecting a fixed number of pages and pausing between requests.

python
import json
import time
from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup

crawlbase_token = "YOUR_CRAWLBASE_TOKEN"
api = CrawlingAPI({"token": crawlbase_token})
BASE = "https://www.reddit.com"

def crawl(page_url):
    options = {"ajax_wait": "true", "page_wait": 5000}
    response = api.get(page_url, options)
    if response["status_code"] == 200:
        return response["body"].decode("utf-8")
    print(f"Request failed: {response['status_code']}")
    return None

def scrape_subreddit(subreddit, max_pages=3):
    records = []
    after = None
    for _ in range(max_pages):
        url = f"{BASE}/r/{subreddit}/.json?limit=25"
        if after:
            url += f"&after={after}"
        body = crawl(url)
        if not body:
            break
        data = json.loads(body)["data"]
        for child in data["children"]:
            post = child["data"]
            records.append({
                "title": post.get("title"),
                "score": post.get("score"),
                "comment_count": post.get("num_comments"),
                "subreddit": f"r/{post.get('subreddit')}",
                "link": f"{BASE}{post.get('permalink', '')}",
            })
        after = data.get("after")
        if not after:
            break
        time.sleep(3)
    return records

if __name__ == "__main__":
    posts = scrape_subreddit("technology", max_pages=3)
    print(json.dumps(posts, indent=2, ensure_ascii=False))

The public .json endpoint hands back the same aggregate fields with cleaner keys: title, score, num_comments, subreddit, and permalink. The after cursor on each response is the only state pagination needs, so the loop requests the next page until Reddit stops returning a cursor or you hit max_pages. The time.sleep(3) between pages is not decoration: pacing is the single biggest factor in whether a run stays healthy. If you prefer parsing the rendered HTML instead, swap crawl(url) for the listing URL from Step 1 and feed the body to scrape_listing.

What the output looks like

Run the script and you get a clean list of public, aggregate records, ready to write to JSON or CSV.

json
[
  {
    "title": "Researchers demo a swallowable device that tracks vital signs",
    "score": 8421,
    "comment_count": 312,
    "subreddit": "r/technology",
    "link": "https://www.reddit.com/r/technology/comments/17xmvmg/swallowable_device_tracking_vital_signs_inside/"
  }
]

To persist these records, a couple of standard-library lines turn the list into a CSV or JSON file. The fields are flat and uniform, so no special handling is needed.

python
import csv
import json

def save_json(records, path="reddit_posts.json"):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)

def save_csv(records, path="reddit_posts.csv"):
    fields = ["title", "score", "comment_count", "subreddit", "link"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(records)

From here the data is ready for analysis: ranking subreddits by median score, tracking which external domains a community links to, or charting comment activity over a window of pages. If you plan to feed it into a model later, our guide on how to structure and clean web-scraped data for AI and ML covers the normalization step, and web scraping for machine learning covers what to do with it next.

Handling rate limits and errors

Two layers can throttle a Reddit run, and a resilient script accounts for both. Reddit rate-limits automated traffic and returns 429 Too Many Requests or 403 Forbidden when you push too hard; the Crawling API also has per-plan limits. The habits below keep a run inside both.

  • Pace your requests. The time.sleep(3) between pages is the floor, not a maximum. Hammering the listing in a tight loop is the fastest way to get throttled, so add real delays and resist parallelizing aggressively.
  • Read the status codes. A run that starts returning 429 or 403 is telling you the current rate is no longer enough. Back off rather than pushing harder, and consider exponential backoff with a few retries.
  • Lean on rotation. A pool of residential IPs spreads requests across many real-user addresses so no single one trips a limit. The Crawling API handles this for you; if you build your own stack, this is the part to get right.
  • Keep volume low and targets varied. Public-data research does not require crawling a subreddit's entire history. Sample the pages you need and stop.

This is the section to read before you write production code. Reddit's User Agreement and its Public Content Policy restrict automated access and the bulk collection of content, and Reddit's robots.txt sets out what crawlers may touch. Automated scraping can run against those terms regardless of how careful your tooling is, and none of the code above changes that. It only makes the technical part work. Read Reddit's terms and robots.txt first, and treat both as the boundary for what you collect.

The honest, restrictive rules to hold to. Collect only public, aggregate data: post titles, scores, comment counts, the subreddit, and the link, all of which anyone can see without logging in. Treat usernames, author handles, profile details, and the text of individual comments as personal data, and do not harvest them, build profiles of identifiable people, or tie content back to a person. Never scrape private subreddits, login-walled content, direct messages, or anything that requires authentication, and never bypass a login or a challenge to reach it. When personal data is involved, privacy laws such as the GDPR and the CCPA apply: you need a lawful basis to process it and you must honor deletion requests. The scraper in this guide stays on the aggregate, non-personal side of all of these lines on purpose.

For any real or commercial use, the right tool is the official Reddit API. It is built for sanctioned access, gives you guaranteed structure, exposes clear rate limits, and keeps you inside Reddit's terms. This article is a technical walkthrough scoped narrowly to public, aggregate listing data. It is not an endorsement of mass personal-data collection, and it does not cover anything behind a login. If your project needs more than a small sample of public fields, the Reddit API or a formal data agreement is the correct path, not a cleverer scraper.

Recap

Key takeaways

  • Reddit listings are client-side rendered and bot-defended. A plain fetch returns a shell or a challenge, so you must render the page before you parse it.
  • Rendering and a trusted IP belong in one call. The Crawling API with a JS token does both; ajax_wait and page_wait control how long it waits for content.
  • Parse stable signals. The shreddit-post attributes (or the public .json endpoint keys) are more durable than brittle nested classes.
  • Aggregate, public fields only. Pull title, score, comment count, subreddit, and link; never usernames, author profiles, or comment text.
  • Pace, rotate, and prefer the official API. Keep volume low, lean on residential rotation, and use the Reddit API for anything real or commercial.

Frequently Asked Questions (FAQs)

Why does a plain request return no data from Reddit?

Because Reddit's current front end renders post listings client-side with JavaScript, and it challenges automated traffic. A raw HTTP request returns a near-empty shell, a cookie interstitial, or a block page. To get real public data you have to render the page first, which is what the Crawling API's JS token handles for you.

Do I need the normal token or the JS token for Reddit?

The JS token for the rendered listing pages, because the normal token returns the same shell a plain fetch would. If you page the public .json endpoint instead, that response is plain JSON, but routing it through the Crawling API still benefits from the trusted IP and rotation that keep the run from getting blocked.

What Reddit data is safe to scrape?

Only public, aggregate data: post titles, scores and upvote counts, comment counts, the subreddit, and the link each post points at. Usernames, author profiles, and the text of individual comments are personal data and are off limits here. Private subreddits, login-walled content, and direct messages are out of scope entirely.

Should I use the official Reddit API or scrape the site?

For any real, ongoing, or commercial use, use the official Reddit API. It is the sanctioned route, gives guaranteed structure, and publishes clear rate limits. Scraping a small sample of public listing fields with the approach here fits lightweight public-data research where no API access is in place, as long as you respect Reddit's terms, robots.txt, and rate limits.

How do I handle pagination across many pages?

Request the public listing's .json endpoint with a limit, read the after cursor from each response, and pass it back as &after=<cursor> on the next request. Loop until Reddit stops returning a cursor or you hit your page cap, and sleep a few seconds between pages so you stay inside the rate limits.

How do I avoid getting blocked while scraping Reddit?

Keep your request rate low, add real delays between pages, vary your targets instead of crawling one subreddit's full history, and route through rotating residential IPs so no single address trips a limit. The Crawling API manages rotation and a trusted IP pool for you. Watch for 429 and 403 responses and back off the moment you see them.

Start Building

Crawl any site at scale, without fighting infrastructure.

Crawlbase handles proxies, fingerprints, and CAPTCHAs so your team ships data pipelines instead of maintaining crawl plumbing. 1,000 requests free, no card required.

Self-serve · No sales call required · Enterprise crawl volumes available