Instagram is one of the most data-rich surfaces on the public web, and it is also one of the hardest to read programmatically. Profiles and posts render client-side with JavaScript, the platform challenges automated traffic aggressively, and a plain HTTP request to a profile URL usually returns a near-empty shell. This guide shows you how to scrape Instagram data with Python in a way that actually works, while staying strictly inside what is public.

To be unambiguous up front: everything here is scoped to public data from public accounts only. That means public profile fields, public post captions, public like and comment counts, and public post URLs. It does not cover private accounts, login-walled content, direct messages, follower lists, or any personal data of individual people. Instagram and its parent company Meta restrict automated access in their terms, so read the legality section near the end before you point this at anything real. For any production or commercial use, the official Instagram Graph API is the correct tool, not a scraper.

What you will build

A small Python script that takes a public Instagram profile URL, fetches the fully rendered page through the Crawling API with a JavaScript token, and parses out a handful of public fields:

  • Public username: the account handle shown on the profile.
  • Public post captions: the visible text on public posts.
  • Public like and comment counts: the aggregate numbers a post displays, not the people behind them.
  • Public post URLs: the permalink to each public post.

Notice what is deliberately absent: no follower lists, no commenter identities, no private-account content, no contact details. Those are personal data of individuals, and collecting them is out of scope here on purpose.

Why a plain fetch fails on Instagram

Request a public Instagram profile URL with a bare HTTP client and you will get a response that is technically successful and practically useless. The body is a JavaScript shell: the real content only appears after the page's scripts run in a browser and fetch data from internal endpoints. On top of that, Instagram flags automated traffic fast. Datacenter IP ranges, missing browser behavior, and repetitive request patterns get challenged or rate-limited well before the interesting content ever loads.

So a working Instagram scraper needs two things in the same request: a real browser that renders the page, and an IP address the platform reads as an ordinary visitor. You can build that yourself with a headless browser and a pool of rotating residential proxies, but keeping that stack healthy is most of the work. The Crawling API folds both into one call. You send it a URL with a JavaScript token, it renders the page behind a trusted residential IP, and it returns finished HTML you can parse. If you want the deeper background, see our guide on how to crawl JavaScript websites.

Why the JS token

Crawlbase offers two token types. The normal token fetches static HTML; the JavaScript (JS) token renders the page in a real browser first. Instagram is client-side rendered, so you need the JS token here. The normal token returns the same shell a plain fetch would, with nothing useful to parse out of it.

Prerequisites

A few things to have in place first. None take long.

Basic Python. You should be comfortable running a script and installing packages with pip. If you are new to parsing HTML, our primer on how to use BeautifulSoup in Python covers the extraction side.

Python 3.8 or later. Confirm with python --version. If you do not have it, install it from python.org.

A Crawlbase account and JS token. Sign up, open your dashboard, and copy your JavaScript (JS) token from the account docs page. Treat it like a password: it authenticates your requests, so keep it out of version control.
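One way to keep the token out of version control is to read it from an environment variable instead of hard-coding it. The sketch below is a minimal illustration; the variable name CRAWLBASE_JS_TOKEN and the load_token helper are assumptions made for this example, not names the Crawlbase client requires.

```python
import os

# Hypothetical variable name chosen for this sketch; use whatever fits your setup.
TOKEN_ENV_VAR = "CRAWLBASE_JS_TOKEN"

def load_token(env_var=TOKEN_ENV_VAR):
    """Return the token from the environment, or raise a clear error if it is unset."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return token
```

With this in place, the client could be initialized as CrawlingAPI({"token": load_token()}) so the secret never appears in the source file.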

Set up the project

Create an isolated virtual environment, then install the two libraries the scraper needs.

bash
python --version

python -m venv instagram_env
source instagram_env/bin/activate

pip install crawlbase beautifulsoup4

On Windows, activate with instagram_env\Scripts\activate instead of the source line. Two dependencies do the work: crawlbase is the official client for the Crawling API, and beautifulsoup4 parses the returned HTML so you can pull out individual fields by selector.

Step 1: Fetch the rendered profile

Start by getting the finished page. Import CrawlingAPI, initialize it with your JS token, and request a public profile URL. Check the status code before parsing so failures stay loud instead of silent.

python
from crawlbase import CrawlingAPI

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_JS_TOKEN"})

def crawl(page_url):
    options = {"ajax_wait": "true", "page_wait": 5000}
    response = api.get(page_url, options)
    if response["status_code"] == 200:
        return response["body"].decode("utf-8")
    print(f"Request failed: {response['status_code']}")
    return None

if __name__ == "__main__":
    page_url = "https://www.instagram.com/nasa/"
    html = crawl(page_url)
    print(html[:500] if html else "No HTML returned")

The two wait options matter for a client-rendered target. ajax_wait tells the API to wait for asynchronous content to finish loading, and page_wait holds for a fixed number of milliseconds after load so late-rendering elements appear before the page is captured. Five seconds is a reasonable starting point; raise it if fields come back empty. The example uses a public organization account (NASA) precisely because it is public and impersonal. Run the script and you should see real profile markup, which confirms rendering works before you write a single selector.
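The advice to raise page_wait when fields come back empty can be automated roughly as follows. This is a sketch under assumptions: fetch is a hypothetical callable taking (page_url, options) and returning HTML or None, for example a variant of crawl() that accepts an options dict, and the og:title substring check is only a crude heuristic for "the page rendered".

```python
def looks_rendered(html):
    """Heuristic (an assumption of this sketch): rendered pages carry an og:title tag."""
    return bool(html) and 'property="og:title"' in html

def fetch_with_longer_waits(fetch, page_url, waits=(5000, 8000, 12000)):
    """Try each page_wait value in turn and return (html, wait_ms) for the first
    response that looks rendered, or (None, None) if none does.

    `fetch(page_url, options)` is any callable that returns HTML or None.
    """
    for wait_ms in waits:
        options = {"ajax_wait": "true", "page_wait": wait_ms}
        html = fetch(page_url, options)
        if looks_rendered(html):
            return html, wait_ms
    return None, None
```

The wait values are illustrative. Each retry is a separate request, so keep the list short.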

Step 2: Parse the public fields with BeautifulSoup

With rendered HTML in hand, load it into BeautifulSoup and pull the public fields. Instagram exposes a lot of useful metadata in the page's <meta> tags and in an embedded JSON-LD block, which is more stable than chasing deeply nested, frequently renamed CSS classes. The username and the profile description live in standard meta tags; the post permalinks live in anchor hrefs that follow the /p/<shortcode>/ pattern.

python
from bs4 import BeautifulSoup

def meta(soup, prop):
    # Return the content of <meta property="..."> or None if the tag is missing.
    el = soup.find("meta", attrs={"property": prop})
    return el["content"] if el and el.has_attr("content") else None

def scrape_profile(html):
    soup = BeautifulSoup(html, "html.parser")

    # og:title holds the display name and handle together, e.g. "NASA (@nasa)".
    username = meta(soup, "og:title")
    summary = meta(soup, "og:description")

    # Collect post permalinks in first-seen order, skipping duplicates.
    post_urls = []
    for a in soup.select("a[href^='/p/']"):
        href = a["href"]
        url = f"https://www.instagram.com{href}"
        if url not in post_urls:
            post_urls.append(url)

    return {
        "username": username,
        "summary": summary,
        "post_urls": post_urls,
    }

The og:description tag on a public profile typically carries the aggregate public counts (followers, following, posts) as a single summary string. Treat those as public aggregates only, never as a doorway to enumerate the individuals behind them. The post URLs are collected from anchors and de-duplicated, since the same permalink can appear more than once in the rendered DOM.
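If you want the aggregate counts as numbers rather than one string, a small parser can split the summary. The sketch below assumes a summary shaped like "97M Followers, 85 Following, 4,321 Posts - ..." and a title shaped like "NASA (@nasa)". Both shapes, the sample numbers in the comments, and the helper names are assumptions for illustration, so check them against the strings you actually receive.

```python
import re

_MULTIPLIERS = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
_COUNT_RE = re.compile(
    r"(\d[\d.,]*)\s*([KMB]?)\s+(Followers|Following|Posts)", re.IGNORECASE
)
_TITLE_RE = re.compile(r"^(?P<name>.*?)\s*\(@(?P<handle>[A-Za-z0-9._]+)\)")

def parse_summary(summary):
    """Turn '97M Followers, 85 Following, ...' into {'followers': 97000000, ...}."""
    counts = {}
    for number, suffix, label in _COUNT_RE.findall(summary or ""):
        value = float(number.replace(",", "")) * _MULTIPLIERS[suffix.upper()]
        counts[label.lower()] = int(round(value))
    return counts

def split_title(title):
    """Split 'NASA (@nasa)' into ('NASA', 'nasa'); return (None, None) if no handle."""
    match = _TITLE_RE.match(title or "")
    if not match:
        return None, None
    return match.group("name") or None, match.group("handle")
```

Abbreviated values such as "97M" are rounded by the page itself, so treat the parsed numbers as approximate aggregates.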

Selectors drift

Instagram changes its markup and class names without notice, which is why this code leans on meta tags and the stable /p/<shortcode>/ URL shape rather than brittle nested classes. When a field comes back as None, re-inspect the live page in your browser's dev tools and update the selector. Periodic maintenance is normal for any production scraper, not a sign something is broken.
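Selector drift is easier to catch when the script reports which fields came back empty instead of silently writing None. A minimal sketch, assuming the record layout returned by scrape_profile; the missing_fields helper is a hypothetical name introduced here.

```python
def is_missing(value):
    """True for None and for empty strings or collections, but not for the number 0."""
    return value is None or (hasattr(value, "__len__") and len(value) == 0)

def missing_fields(record, required=("username", "summary", "post_urls")):
    """Return the names of required fields that are absent or empty in record."""
    return [name for name in required if is_missing(record.get(name))]
```

Calling missing_fields(profile) right after parsing and logging a non-empty result gives an early signal that a selector needs another look.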

Step 3: Extract public fields from a single post

A public post page carries the same kind of meta and JSON-LD data. From it you can pull the public caption and the public like and comment counts. Instagram embeds an application/ld+json script on post pages that often contains an interactionStatistic block with those aggregate numbers, plus the caption text. Parsing the JSON-LD is more durable than scraping rendered widgets.

python
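import json
from bs4 import BeautifulSoup

def scrape_post(html):
    soup = BeautifulSoup(html, "html.parser")
    block = soup.find("script", attrs={"type": "application/ld+json"})
    if not block or not block.string:
        return {}

    try:
        data = json.loads(block.string)
    except json.JSONDecodeError:
        return {}
    # JSON-LD may be a single object or a list of objects; use the first one.
    if isinstance(data, list):
        data = data[0] if data else {}
    if not isinstance(data, dict):
        return {}

    # interactionStatistic may be one object or a list of them.
    stats = data.get("interactionStatistic") or []
    if isinstance(stats, dict):
        stats = [stats]

    likes = comments = None
    for stat in stats:
        # interactionType may be a plain string or an object with an @type key.
        kind = str(stat.get("interactionType", ""))
        count = stat.get("userInteractionCount")
        if "LikeAction" in kind:
            likes = count
        elif "CommentAction" in kind:
            comments = count

    return {
        "caption": data.get("caption") or data.get("articleBody"),
        "like_count": likes,
        "comment_count": comments,
    }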

This extracts only aggregate, non-personal fields: the caption text, the public like count, and the public comment count. It does not read individual comments, commenter handles, or who liked the post. That restraint is intentional, and it is also what keeps the work defensible. Like and comment counts are numbers; the people behind them are not yours to harvest.

Step 4: Put it together

Now wire fetch and parse into one runnable script that reads a public profile, then visits its first few public posts.

python
import json
import time
from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_JS_TOKEN"})

def crawl(page_url):
    options = {"ajax_wait": "true", "page_wait": 5000}
    response = api.get(page_url, options)
    if response["status_code"] == 200:
        return response["body"].decode("utf-8")
    print(f"Request failed: {response['status_code']}")
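    return None

# scrape_profile() and scrape_post() are the functions defined in Steps 2 and 3.
# Paste them here, along with the meta() helper, so the script runs as one file.

def main():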
    profile_url = "https://www.instagram.com/nasa/"
    html = crawl(profile_url)
    if not html:
        return

    profile = scrape_profile(html)
    records = []
    for post_url in profile["post_urls"][:5]:
        post_html = crawl(post_url)
        if post_html:
            record = scrape_post(post_html)
            record["url"] = post_url
            records.append(record)
        time.sleep(3)

    output = {"username": profile["username"], "posts": records}
    print(json.dumps(output, indent=2, ensure_ascii=False))

if __name__ == "__main__":
    main()

The time.sleep(3) between requests is not decoration. Pacing is the single biggest factor in whether a run stays healthy, and we will come back to it. The slice [:5] keeps the demo small; raise it only when your pacing and volume are responsible.
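A fixed three-second pause produces a perfectly regular rhythm, and repetitive request patterns are one of the things that get challenged. One option is to add a small random extra to each pause. The helper below is a sketch; polite_pause and its default values are assumptions for illustration, not settings recommended by the Crawling API.

```python
import random
import time

def polite_pause(base_seconds=3.0, jitter_seconds=1.5, sleep=time.sleep):
    """Sleep for base_seconds plus a random extra of up to jitter_seconds.

    Returns the delay actually used so callers can log it.
    The sleep argument is injectable so the function can be tested without waiting.
    """
    delay = base_seconds + random.uniform(0.0, jitter_seconds)
    sleep(delay)
    return delay
```

In the loop above, polite_pause() could stand in for time.sleep(3) without any other change.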

What the output looks like

Run the full script and you get a clean record of public fields, ready to write to JSON, CSV, or a database.

json
{
  "username": "NASA (@nasa)",
  "posts": [
    {
      "caption": "A new view of a distant galaxy cluster.",
      "like_count": 412338,
      "comment_count": 1894,
      "url": "https://www.instagram.com/p/Cxample123/"
    }
  ]
}
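To send records shaped like the ones above to CSV, the standard library's csv module is enough. This is a minimal sketch: the column order and the posts_to_csv name are choices made for this example.

```python
import csv

FIELDS = ["url", "caption", "like_count", "comment_count"]

def posts_to_csv(posts, stream):
    """Write post records to an open text stream as CSV, one row per post.

    Missing or None values become empty cells; keys outside FIELDS are ignored.
    """
    writer = csv.DictWriter(stream, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for post in posts:
        writer.writerow({name: post.get(name) for name in FIELDS})
```

When writing to disk, open the file with newline="" and encoding="utf-8" so captions with line breaks or non-ASCII characters survive, for example open("posts.csv", "w", newline="", encoding="utf-8").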

Staying unblocked

Even with rendering handled by the Crawling API, Instagram watches for scraper-shaped traffic. A few habits keep a run healthy, and they apply to any hard, heavily defended target.

  • Pace your requests. Hammering pages in a tight loop is the fastest way to get throttled. Add real delays, as in the time.sleep above, and resist the urge to parallelize aggressively.
  • Lean on rotation. A pool of residential IPs spreads requests across many real-user addresses so no single one trips a rate limit. The Crawling API handles this for you; if you build your own stack, this is the part to get right. Our guide on how to use rotating proxies goes deeper.
  • Read the status codes. A run that starts returning challenges or errors is telling you the current rate or IP tier is no longer enough. Back off rather than pushing harder.
  • Keep volume low and targets varied. Public-data research does not require crawling an account's entire history. Sample what you need and stop.
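The "back off rather than pushing harder" habit can be sketched as a retry loop whose pause doubles after each failure. The assumptions here: fetch is any callable that returns HTML on success and None on failure, which is how crawl() behaves, and the attempt count and base delay are illustrative values rather than tuned ones.

```python
import time

def fetch_with_backoff(fetch, page_url, max_attempts=4, base_delay=5.0, sleep=time.sleep):
    """Call fetch(page_url) until it returns HTML, doubling the pause after each failure.

    Gives up and returns None after max_attempts tries.
    The sleep argument is injectable so the function can be tested without waiting.
    """
    for attempt in range(max_attempts):
        html = fetch(page_url)
        if html:
            return html
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return None
```

If every attempt fails, stopping the run is usually the better response than raising max_attempts.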

For the broader playbook, see how to bypass captchas while web scraping and our deep dive on how to scrape JavaScript pages with Python. If you would rather route your own traffic through a rotating pool instead of using the managed API, the Smart Proxy gives you the same residential rotation as a drop-in proxy endpoint.

Legality and terms of use

This is the section to read before you write production code. Instagram is owned by Meta, and Meta's Terms of Use strongly restrict automated access and data collection. Automated scraping can run against those terms regardless of how careful your tooling is, and none of the code above changes that. It only makes the technical part work. Read Meta's and Instagram's Terms of Use and Instagram's robots.txt, and treat both as the boundary for what you collect.

Hold to a short set of restrictive rules. Collect only public data from public accounts: public profile fields, public post captions, public like and comment counts, and public post URLs that anyone can see without logging in. Never scrape private accounts, login-walled content, direct messages, follower or following lists, individual commenter or liker identities, or any personal data of individual people. Never bypass authentication, solve a login challenge programmatically, or use someone's credentials to reach content. Those are bright lines, and this guide stays on the public side of all of them by design.

For any real or commercial use, the right tool is the official Instagram Graph API. It is built for sanctioned access to accounts you own or manage, gives you guaranteed structure, and keeps you inside Meta's terms. This article is a technical walkthrough scoped narrowly to public data from public accounts. It is not an endorsement of mass personal-data collection, and it does not cover anything behind a login. If your project needs more than a small sample of public fields, the Graph API or a formal data agreement is the correct path, not a cleverer scraper.

Recap

Key takeaways

  • Instagram is client-side rendered and bot-defended. A plain fetch returns an empty shell, so you must render the page before you parse it.
  • Rendering and a trusted IP belong in one call. The Crawling API with a JS token does both; ajax_wait and page_wait control how long it waits for content.
  • Parse stable signals. Meta tags, the /p/<shortcode>/ URL shape, and JSON-LD are more durable than brittle nested classes.
  • Public aggregates only. Pull username, captions, like and comment counts, and post URLs; never follower lists, commenter identities, or private content.
  • Pace, rotate, and prefer the Graph API. Keep volume low, lean on residential rotation, and use the official Instagram Graph API for anything real or commercial.

Frequently Asked Questions (FAQs)

Why does a plain fetch return no data from Instagram?

Because Instagram renders its profile and post content client-side with JavaScript. The initial HTML is a shell that only fills in after the page's scripts run in a browser, so a raw HTTP request returns a near-empty body. To get real public data you have to render the page first, which is what the Crawling API's JS token handles for you.

Do I need the normal token or the JS token for Instagram?

The JS token. The normal token fetches static HTML, which on Instagram is the same empty shell a plain fetch returns. The JS token renders the page in a real browser before handing back the HTML, so the public fields are present when BeautifulSoup parses them.

What Instagram data is safe to scrape?

Only public data from public accounts: the public username, public post captions, public like and comment counts as aggregate numbers, and public post URLs. Private accounts, login-walled content, direct messages, follower and following lists, and the identities of individual commenters or likers are off limits. Those are personal data, and collecting them runs against Meta's terms and, in many places, privacy law.

Should I use the official Instagram Graph API or scrape the site?

For any real, ongoing, or commercial use, use the official Instagram Graph API. It is the sanctioned route, gives guaranteed structure, and keeps you inside Meta's terms. Scraping a small sample of public fields with the approach here fits lightweight public-data research where no API access is in place, as long as you respect the terms, robots.txt, and rate limits.

How do I avoid getting blocked while scraping Instagram?

Keep your per-IP request rate low, add real delays between requests, vary your targets instead of crawling one account's full history, and route through rotating residential IPs so no single address trips a rate limit. The Crawling API manages rotation and a trusted IP pool for you. Watch the status codes and back off the moment you start seeing challenges.

Can I scrape private accounts or follower lists?

No, and this guide deliberately does not show how. Private-account content sits behind authentication, and follower lists and individual user identities are personal data. Scraping login-walled content, enumerating followers, or bypassing authentication to reach any of it is out of scope here and runs against Meta's terms. For sanctioned access to accounts you own or manage, the official Instagram Graph API is the correct route.
