Public Facebook Pages, the brand and business profiles that anyone can view without logging in, hold a lot of useful public signal: the page name, the visible text of public posts, and the aggregate counts a post displays. For competitor research, brand monitoring, or content benchmarking, that public surface is worth reading programmatically. This guide shows you how to scrape a public Facebook Page with Python, built and run inside PyCharm, in a way that actually works.

To be unambiguous up front: everything here is scoped to public Facebook Pages only. That means the public page name, the visible text of public posts, and the public aggregate counts that a page or post shows to any visitor. It does not cover private groups, member lists, personal profiles, login-walled content, comments tied to named individuals, or any personal data of people. Facebook and its parent company Meta restrict automated access in their terms, so read the legality section near the end before you point this at anything real. For production or commercial use, the official Facebook Graph API is the correct tool, not a scraper.

What you will build

A small Python script, written and run in PyCharm, that takes a public Facebook Page URL, fetches the fully rendered page through the Crawling API with a JavaScript token, and parses out a handful of public fields:

  • Public page name: the brand or business name shown on the page.
  • Public post text: the visible body text of public posts on the page.
  • Public counts: the aggregate numbers a public page displays, such as a post's reaction or share count, treated as numbers only.
  • Public post URLs: the permalink to each public post.

Notice what is deliberately absent: no group member lists, no commenter identities, no personal profiles, no contact details of individuals. Those are personal data, and collecting them is out of scope here on purpose. The legacy version of this tutorial targeted Facebook Groups; this rewrite stays strictly on public Pages, which is the defensible surface.

Why a plain request fails on Facebook

Request a public Facebook Page URL with a bare HTTP client and you will get a response that is technically successful and practically useless. The body is a JavaScript shell: the real content only appears after the page's scripts run in a browser and fetch data from internal endpoints. Facebook is one of the harder targets on the public web for exactly this reason, and it has been for years.

On top of rendering, Facebook flags automated traffic quickly. Datacenter IP ranges, missing browser behavior, and repetitive request patterns get challenged or rate-limited well before the interesting content ever loads. So a working scraper needs two things in the same request: a real browser that renders the page, and an IP address the platform reads as an ordinary visitor. You can build that yourself with a headless browser and a pool of rotating residential proxies, but keeping that stack healthy is most of the work. The Crawling API folds both into one call: you send it a URL with a JavaScript token, it renders the page behind a trusted residential IP, and it returns finished HTML you can parse. If you want the deeper background, see our guide on how to crawl JavaScript websites.

Why the JS token

Crawlbase offers two token types. The normal token fetches static HTML; the JavaScript (JS) token renders the page in a real browser first. Facebook is client-side rendered, so you need the JS token here. The normal token returns the same shell a plain request would, with nothing useful to parse out of it.

Prerequisites

A few things to have in place first. None take long.

Basic Python. You should be comfortable running a script and installing packages with pip. If you are new to parsing HTML, our primer on how to scrape a website with Python covers the extraction side.

Python 3.8 or later. Confirm with python --version. If you do not have it, install it from python.org. The legacy tutorial used Python 2 and urllib2; Python 2 is end-of-life and urllib2 does not exist in Python 3, so this rewrite targets modern Python 3.

PyCharm. Download the free Community edition from JetBrains and install it. PyCharm is the IDE we use to create the project, install packages, and run the script.

A Crawlbase account and JS token. Sign up, open your dashboard, and copy your JavaScript (JS) token from the account docs page. Treat it like a password: it authenticates your requests, so keep it out of version control.
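One way to keep the token out of version control is to read it from an environment variable instead of pasting it into the script. The sketch below assumes a variable named CRAWLBASE_JS_TOKEN, which is a hypothetical name chosen for this example and not something the Crawlbase client looks up on its own.

```python
import os

# Hypothetical variable name; any name works as long as your script and shell agree.
TOKEN_ENV_VAR = "CRAWLBASE_JS_TOKEN"

def load_token(env=None):
    """Return the JS token from the environment, or raise a clear error if it is missing."""
    env = os.environ if env is None else env
    token = env.get(TOKEN_ENV_VAR, "").strip()
    if not token:
        raise RuntimeError(f"Set {TOKEN_ENV_VAR} before running the scraper")
    return token
```

You could then pass load_token() wherever the script expects the token string. In PyCharm, environment variables can be set per run configuration, so the value never has to appear in a committed file.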

Set up the project in PyCharm

This is the PyCharm part of the walkthrough. Open PyCharm and choose New Project. Name it something like facebook-page-scraper, let PyCharm create a virtual environment for you (the default), and confirm the interpreter is Python 3.8 or later. Click Create. Then right-click the project in the Project panel, choose New, Python File, and name it scraper.py.

PyCharm bundles a terminal at the bottom of the window, already activated to your project's virtual environment. Open it and install the two libraries the scraper needs.

bash
python --version

pip install crawlbase beautifulsoup4

Two dependencies do the work: crawlbase is the official client for the Crawling API, and beautifulsoup4 parses the returned HTML so you can pull out individual fields by selector. If you prefer the GUI, PyCharm's Python Packages tool window installs the same packages without touching the terminal.

Step 1: Fetch the rendered page

Start by getting the finished page. In scraper.py, import CrawlingAPI, initialize it with your JS token, and request a public Facebook Page URL. Check the status code before parsing so failures stay loud instead of silent.

python
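from crawlbase import CrawlingAPI

# Use your JavaScript (JS) token here; the normal token returns the unrendered shell.
api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

def crawl(page_url):
    # ajax_wait waits for asynchronous content; page_wait holds 5000 ms after load.
    options = {"ajax_wait": "true", "page_wait": 5000}
    response = api.get(page_url, options)
    # Only return a body on success so failures stay visible.
    if response["status_code"] == 200:
        return response["body"].decode("utf-8")
    print(f"Request failed: {response['status_code']}")
    return None

if __name__ == "__main__":
    page_url = "https://www.facebook.com/MetaForBusiness"
    html = crawl(page_url)
    # Print the first 500 characters as a quick rendering check.
    print(html[:500] if html else "No HTML returned")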

Run it from inside PyCharm: right-click the editor and choose Run 'scraper', or press the green run arrow in the gutter. The two wait options matter for a client-rendered target. ajax_wait tells the API to wait for asynchronous content to finish loading, and page_wait holds for a fixed number of milliseconds after load so late-rendering elements appear before the page is captured. Five seconds is a reasonable starting point; raise it if fields come back empty. The example uses a public business page precisely because it is public and impersonal. You should see real page markup in the Run panel, which confirms rendering works before you write a single selector.
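The advice to raise page_wait when fields come back empty can be automated. The following is a minimal sketch, not part of the Crawlbase client: it assumes a caller-supplied fetch(page_url, options) function (hypothetical, for example a variant of crawl() that accepts the options dict) and treats the presence of an og:title meta tag as a rough sign that the page rendered. Both the wait schedule and the marker string are assumptions to adjust.

```python
def fetch_until_rendered(fetch, page_url, waits=(5000, 8000, 12000),
                         marker='property="og:title"'):
    """Try progressively longer page_wait values until the HTML contains `marker`.

    `fetch(page_url, options)` is supplied by the caller and must return the
    HTML string, or None on failure. Returns (html, wait_ms) on success and
    (None, None) if no attempt produced the marker.
    """
    for wait_ms in waits:
        options = {"ajax_wait": "true", "page_wait": wait_ms}
        html = fetch(page_url, options)
        if html and marker in html:
            return html, wait_ms
    return None, None
```

Each attempt is a separate request, so keep the schedule short and settle on the smallest wait that works for your target.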


Step 2: Parse the public fields with BeautifulSoup

With rendered HTML in hand, load it into BeautifulSoup and pull the public fields. Facebook exposes a lot of stable metadata in the page's <meta> tags, which is far more reliable than chasing deeply nested, frequently renamed CSS classes. The public page name and description live in standard Open Graph meta tags; the visible public post text and post permalinks live in the rendered DOM.

python
from bs4 import BeautifulSoup

def meta(soup, prop):
    el = soup.find("meta", attrs={"property": prop})
    return el["content"] if el and el.has_attr("content") else None

def scrape_page(html):
    soup = BeautifulSoup(html, "html.parser")

    page_name = meta(soup, "og:title")
    summary = meta(soup, "og:description")

    post_urls = []
    for a in soup.select("a[href*='/posts/']"):
        href = a["href"].split("?")[0]
        if href.startswith("/"):
            href = f"https://www.facebook.com{href}"
        if href not in post_urls:
            post_urls.append(href)

    return {
        "page_name": page_name,
        "summary": summary,
        "post_urls": post_urls,
    }

The og:title tag carries the public page name, and og:description often carries a short public summary string. The post links are collected from anchors whose href contains /posts/, stripped of query strings, normalized to absolute URLs, and de-duplicated, since the same permalink can appear more than once in the rendered DOM. Everything here is a public, page-level signal, not a profile of any individual.
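The strip, normalize, and de-duplicate steps can also be written with the standard library's URL helpers, which additionally drop fragments and trailing slashes. This is a sketch of one alternative, assuming every link belongs on www.facebook.com; the function name normalize_post_urls is made up for this example.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

BASE = "https://www.facebook.com"

def normalize_post_urls(hrefs):
    """Return absolute, query-free, de-duplicated post URLs in first-seen order."""
    seen = {}
    for href in hrefs:
        # urljoin leaves absolute URLs alone and resolves relative ones against BASE.
        parts = urlsplit(urljoin(BASE, href))
        if "/posts/" not in parts.path:
            continue
        # Rebuild the URL without query string or fragment.
        clean = urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), "", ""))
        seen.setdefault(clean, None)  # dicts preserve insertion order
    return list(seen)
```

Passing it the href values collected from the anchors gives the same kind of list scrape_page builds, with slightly stricter cleanup.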

Selectors drift

Facebook changes its markup and class names without notice, which is why this code leans on Open Graph meta tags and the stable /posts/ URL shape rather than brittle nested classes. When a field comes back as None, re-inspect the live page in your browser's dev tools and update the selector. Periodic maintenance is normal for any production scraper, not a sign something is broken.
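A small check can make selector drift visible as soon as it happens, rather than when someone notices empty output. The sketch below is an illustration only: missing_fields is a hypothetical helper, and the default field names are the keys returned by scrape_page in this guide.

```python
def missing_fields(record, required=("page_name", "summary", "post_urls")):
    """Return the names of required fields that are None or empty."""
    return [name for name in required if not record.get(name)]

# Example: a record where the summary selector stopped resolving.
record = {"page_name": "Meta for Business", "summary": None, "post_urls": ["u1"]}
problems = missing_fields(record)
if problems:
    print(f"Selectors may have drifted; empty fields: {', '.join(problems)}")
```

Running a check like this after every scrape turns a silent None into a message that tells you which selector to re-inspect.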

Step 3: Extract public text and counts from a post

A public post page carries the same kind of Open Graph metadata, plus the visible public post text in the rendered DOM. From it you can pull the public post text and any public count the page displays, such as a reaction or share number. Read those numbers as plain aggregates, never as a way to enumerate the people behind them.

python
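import re
from bs4 import BeautifulSoup

def scrape_post(html):
    soup = BeautifulSoup(html, "html.parser")

    # The Open Graph description is used here as the public post text.
    desc = soup.find("meta", attrs={"property": "og:description"})
    post_text = desc["content"] if desc and desc.has_attr("content") else None

    counts = {}
    for label in ("reaction", "share"):
        # Match the first div whose aria-label mentions the count type.
        node = soup.find("div", attrs={"aria-label": re.compile(label, re.I)})
        if node:
            # Keeps digits only, so this assumes a plain integer such as "1,820";
            # an abbreviated count like "1.8K" would be misread as 18.
            digits = re.sub(r"[^\d]", "", node.get_text())
            counts[label] = int(digits) if digits else None

    return {
        "post_text": post_text,
        "counts": counts,
    }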

This extracts only public, non-personal fields: the visible post text and the aggregate counts the page displays. It does not read individual comments, commenter handles, or who reacted to the post. That restraint is intentional, and it is also what keeps the work defensible. Counts are numbers; the people behind them are not yours to harvest. The aria-label values and selector shapes shift over time, so re-inspect them when a field stops resolving.
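The digit-stripping in scrape_post assumes each count is rendered as a plain integer. Displayed counts are often abbreviated, so a more tolerant parser may be useful. The sketch below is an assumption about common English-language formats ("1,820", "1.8K", "2.3M"), not a description of what Facebook emits; parse_count is a hypothetical helper, and locales that use a comma as the decimal separator would need different handling.

```python
import re

_SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
# A number with optional thousands separators and decimals, then an optional
# K/M/B suffix that must end at a word boundary (so "3 Minutes" is not "3M").
_COUNT_RE = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*([KMB])?\b", re.I)

def parse_count(text):
    """Parse the first count found in `text` into an int, or return None."""
    match = _COUNT_RE.search(text or "")
    if not match:
        return None
    number = float(match.group(1).replace(",", ""))
    suffix = (match.group(2) or "").upper()
    return int(round(number * _SUFFIXES.get(suffix, 1)))
```

To use it, call parse_count(node.get_text()) in place of the digit-only substitution. The result is still an aggregate number and nothing more.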

Step 4: Put it together

Now wire fetch and parse into one runnable script that reads a public page, then visits its first few public posts. Replace the Step 1 fetch code in scraper.py with this and run it from PyCharm.

python
import re
import json
import time
from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

def crawl(page_url):
    options = {"ajax_wait": "true", "page_wait": 5000}
    response = api.get(page_url, options)
    if response["status_code"] == 200:
        return response["body"].decode("utf-8")
    print(f"Request failed: {response['status_code']}")
    return None

def main():
    page_url = "https://www.facebook.com/MetaForBusiness"
    html = crawl(page_url)
    if not html:
        return

    page = scrape_page(html)
    records = []
    for post_url in page["post_urls"][:5]:
        post_html = crawl(post_url)
        if post_html:
            record = scrape_post(post_html)
            record["url"] = post_url
            records.append(record)
        time.sleep(3)

    output = {"page_name": page["page_name"], "posts": records}
    print(json.dumps(output, indent=2, ensure_ascii=False))

if __name__ == "__main__":
    main()

The time.sleep(3) between requests is not decoration. Pacing is the single biggest factor in whether a run stays healthy, and we will come back to it. The slice [:5] keeps the demo small; raise it only when your pacing and volume are responsible. Keep the scrape_page and scrape_post functions from the earlier steps in the same file so this script runs end to end.

What the output looks like

Run the full script in PyCharm and you get a clean record of public fields, ready to write to JSON, CSV, or a database.

json
{
  "page_name": "Meta for Business",
  "posts": [
    {
      "post_text": "New tools to help businesses reach customers.",
      "counts": { "reaction": 1820, "share": 214 },
      "url": "https://www.facebook.com/MetaForBusiness/posts/example123"
    }
  ]
}
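For the CSV route, the nested record needs flattening to one row per post. The following is a minimal sketch using only the standard library; posts_to_csv and the column order are choices made for this example, and the input is assumed to have the shape shown above.

```python
import csv
import io

def posts_to_csv(output, stream):
    """Write one CSV row per post from a dict shaped like the script's output."""
    fields = ["page_name", "url", "post_text", "reaction", "share"]
    writer = csv.DictWriter(stream, fieldnames=fields)
    writer.writeheader()
    for post in output.get("posts", []):
        counts = post.get("counts") or {}
        writer.writerow({
            "page_name": output.get("page_name"),
            "url": post.get("url"),
            "post_text": post.get("post_text"),
            "reaction": counts.get("reaction"),  # missing counts become empty cells
            "share": counts.get("share"),
        })

# Demo with an in-memory buffer and made-up values.
sample = {
    "page_name": "Meta for Business",
    "posts": [{"post_text": "Hi", "counts": {"reaction": 1820, "share": 214}, "url": "u"}],
}
buffer = io.StringIO()
posts_to_csv(sample, buffer)
print(buffer.getvalue())
```

To write a real file, open it with open("posts.csv", "w", newline="", encoding="utf-8") and pass the file object as stream.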

Scaling and staying unblocked

To read more than one page, run the crawl loop over a list of public page URLs and write each result to disk as you go, so a failure partway through does not lose everything. Even with rendering handled by the Crawling API, Facebook watches for scraper-shaped traffic. A few habits keep a run healthy, and they apply to any hard, heavily defended target.

  • Pace your requests. Hammering pages in a tight loop is the fastest way to get throttled. Add real delays, as in the time.sleep above, and resist the urge to parallelize aggressively.
  • Lean on rotation. A pool of residential IPs spreads requests across many real-user addresses so no single one trips a rate limit. The Crawling API handles this for you; if you build your own stack, this is the part to get right.
  • Read the status codes. A run that starts returning challenges or errors is telling you the current rate or IP tier is no longer enough. Back off rather than pushing harder.
  • Keep volume low and targets varied. Public-data research does not require crawling a page's entire history. Sample what you need and stop.
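The habits above can be combined into one small loop: write each result as soon as it arrives, pause between requests, wait longer after a failure, and stop after repeated failures. This is a sketch under stated assumptions, not a tuned implementation. scrape_one(url) is a hypothetical caller-supplied function returning a dict of public fields or None on failure, and the delay and failure threshold are placeholder values.

```python
import json
import time

def run_pages(page_urls, scrape_one, out_path, delay=3.0, max_failures=3,
              sleep=time.sleep):
    """Scrape each URL in turn, appending one JSON line per success.

    Returns the number of records written. Stops early after `max_failures`
    consecutive failures instead of pushing harder.
    """
    written = 0
    failures = 0
    with open(out_path, "a", encoding="utf-8") as out:
        for url in page_urls:
            record = scrape_one(url)
            if record is None:
                failures += 1
                if failures >= max_failures:
                    break
                sleep(delay * 2 ** failures)  # back off: wait longer after each failure
                continue
            failures = 0
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            out.flush()  # persist immediately so a later crash loses nothing
            written += 1
            sleep(delay)
    return written
```

The JSON Lines format (one object per line) lets a partially completed run be read back without repair. The sleep parameter exists only so the pacing can be replaced in tests.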

For the broader playbook, see our guide on how to scrape websites without getting blocked. If a managed render-plus-rotation call is not enough for your volume and you want auto-parsed page output, our deeper writeup on mastering Facebook data extraction with the Crawling API covers the structured-output path.

Legality and responsible use

This is the section to read before you write production code. Facebook is owned by Meta, and Meta's Terms of Service strongly restrict automated access and data collection. Automated scraping can run against those terms regardless of how careful your tooling is, and none of the code above changes that. It only makes the technical part work. Read Meta's and Facebook's Terms of Service and Facebook's robots.txt, respect the platform's rate limits, and treat all of them as the boundary for what you collect.

Hold to a few restrictive rules. Collect only public data from public Pages: the public page name, the visible text of public posts, public aggregate counts, and public post URLs that anyone can see without logging in. Never scrape private groups, group member lists, personal profiles, login-walled content, direct messages, comments tied to named individuals, or any personal data of people. The legacy version of this tutorial pointed at Facebook Groups; that is exactly the surface this rewrite avoids, because group content and member data are personal and frequently behind a privacy gate. Never bypass authentication, solve a login challenge programmatically, or use someone's credentials to reach content. Those are bright lines, and this guide stays on the public side of all of them by design.

When any personal data is involved, privacy law applies. The GDPR in the EU requires a lawful basis to process personal data, and both the GDPR and the CCPA in California give people rights over their data, including deletion and opt-out requests, that you must honor. The safest posture is to avoid personal data entirely and keep to page-level public aggregates, which is what the script above does.

For any real or commercial use, the right tool is the official Facebook Graph API. It is built for sanctioned access to Pages you own or manage, gives you guaranteed structure, and keeps you inside Meta's terms. This article is a technical walkthrough scoped narrowly to public Pages. It is not an endorsement of mass personal-data collection, and it does not cover anything behind a login. If your project needs more than a small sample of public fields, the Graph API or a formal data agreement is the correct path, not a cleverer scraper.

Recap

Key takeaways

  • Public Pages only. This walkthrough targets public Facebook Pages, never private groups, member data, or personal profiles.
  • Facebook is client-side rendered and bot-defended. A plain request returns an empty shell, so you must render the page before you parse it.
  • PyCharm handles the setup. Create the project and its virtual environment, install crawlbase and beautifulsoup4, and run the script from the IDE.
  • Rendering and a trusted IP belong in one call. The Crawling API with a JS token does both; ajax_wait and page_wait control how long it waits for content.
  • Pace, rotate, and prefer the Graph API. Keep volume low, lean on residential rotation, mind GDPR and CCPA, and use the official Facebook Graph API for anything real or commercial.

Frequently Asked Questions (FAQs)

Why does a plain request return no data from Facebook?

Because Facebook renders its page and post content client-side with JavaScript. The initial HTML is a shell that only fills in after the page's scripts run in a browser, so a raw HTTP request returns a near-empty body. To get real public data you have to render the page first, which is what the Crawling API's JS token handles for you.

Do I need the normal token or the JS token for Facebook?

The JS token. The normal token fetches static HTML, which on Facebook is the same empty shell a plain request returns. The JS token renders the page in a real browser before handing back the HTML, so the public fields are present when BeautifulSoup parses them.

What Facebook data is safe to scrape?

Only public data from public Pages: the public page name, the visible text of public posts, public aggregate counts as numbers, and public post URLs. Private groups, group member lists, personal profiles, login-walled content, direct messages, and the identities of individual commenters are off limits. Those are personal data, and collecting them runs against Meta's terms and, in many places, privacy law.

Why does this tutorial cover Pages instead of Groups?

Because public Pages are the defensible surface. Group content, member lists, and the posts of individual members are personal data and are frequently behind a privacy gate, even for groups labeled public. Brand and business Pages are designed to be seen by anyone without logging in, so reading their public page name, post text, and aggregate counts stays inside the public boundary.

Should I use the official Facebook Graph API or scrape the site?

For any real, ongoing, or commercial use, use the official Facebook Graph API. It is the sanctioned route, gives guaranteed structure, and keeps you inside Meta's terms. Scraping a small sample of public Page fields with the approach here fits lightweight public-data research where no API access is in place, as long as you respect the terms, robots.txt, and rate limits.

How do I avoid getting blocked while scraping Facebook?

Keep your per-IP request rate low, add real delays between requests, vary your targets instead of crawling one page's full history, and route through rotating residential IPs so no single address trips a rate limit. The Crawling API manages rotation and a trusted IP pool for you. Watch the status codes and back off the moment you start seeing challenges.
