How to Scrape Data Behind Login Pages

Plenty of the data you actually want to work with sits behind a login: your own analytics dashboard, an internal reporting tool, a SaaS account whose export button stops at last quarter, a members area you administer. A plain HTTP request to those pages gets you a redirect to the sign-in form, because the server has no idea who you are. To reach the content you have to do what a browser does: log in, hold on to the session, and send that session with every later request.

This guide shows you how to scrape data behind login pages using Python. You will build a small, runnable scraper that inspects a login form, posts credentials through a requests.Session, carries the session cookies (and a CSRF token) into authenticated requests, and then reads protected content. We use the public practice site quotes.toscrape.com/login as a safe, login-shaped target throughout. The legality section near the end is not boilerplate: it sets the one hard rule that makes any of this defensible, so read it before you point this code at a real account.

What you will build

A Python script that authenticates against a login form and then fetches a page that only renders for a logged-in user. Using the practice target as the running example, the script handles each part of a real authentication flow:

Form inspection: reading the login form's field names and action URL from its HTML.
CSRF token: pulling the hidden token out of the form and replaying it on submit.
Session login: posting the credentials through a persistent requests.Session.
Cookie carry-over: reusing the session so its cookies ride along on every later request.
Authenticated fetch: requesting a protected page and confirming you are logged in.

Why a plain request fails behind a login

Send a bare requests.get() to a page that needs a login and you get one of two non-answers: a redirect to the sign-in form, or the login HTML itself with a 200 status. Either way the protected content is not there. The server gates the page on a session it does not see, because your script never authenticated and is not sending the cookie that proves it did.
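Because a 200 status can still be the sign-in form, it helps to check what actually came back. The sketch below is one minimal way to do that; the looks_like_login_wall helper, the /login path, and the password-input marker are illustrative assumptions to adapt to your own target, not a universal test.

```python
from urllib.parse import urlparse


def looks_like_login_wall(final_url, html, login_path="/login"):
    """Return True when a response appears to be the sign-in form.

    Assumptions (adjust per site): the sign-in form lives at `login_path`,
    and a page containing a double-quoted password input is a login page.
    """
    # Case 1: the server redirected us, so the final URL is the login page.
    path = urlparse(final_url).path.rstrip("/")
    redirected_to_login = path == login_path.rstrip("/")
    # Case 2: the server answered 200 but served the login HTML in place.
    has_password_field = 'type="password"' in html.lower()
    return redirected_to_login or has_password_field


# Sample values standing in for response.url and response.text.
print(looks_like_login_wall("https://example.com/login", "<form></form>"))
print(looks_like_login_wall("https://example.com/reports", "<h1>Quarterly report</h1>"))
```

With requests you would pass response.url and response.text; requests follows redirects by default, so response.url is the final address after any bounce to the form.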

Authentication is the first wall. The second is everything sites do to keep automated traffic out even after you hold a valid session: hidden CSRF tokens that change per request, rate limits, IP reputation checks, and pages whose content is rendered by JavaScript after load rather than shipped in the initial HTML. A static client cannot run that JavaScript, so even a logged-in fetch can come back looking empty. When your target combines a login wall with client-side rendering or bot blocking, the heavy lifting belongs to a service built for it, and that is where the Crawling API comes in later.

Scope

This walkthrough uses a public practice login on purpose. The mechanics are identical for a real account, but the legality only holds when the account and the data are yours, or you have written authorisation. Treat the practice target as a stand-in for your own dashboard, never for someone else's.

Prerequisites

A few things in place before any code. None take long.

Basic Python. You should be comfortable writing and running a script and installing packages with pip. If parsing HTML is new to you, our guide to using BeautifulSoup in Python covers what this tutorial assumes.

Python 3.8 or later. Confirm with python --version. If you do not have it, install it from python.org or through a distribution like Anaconda.

Credentials you are allowed to use. For the practice site, any username and password work. For real work, use only an account you own or are explicitly authorised to access. Never reuse stolen, shared, or guessed credentials.

A Crawlbase account and JS token (for the last step). When your real target renders content with JavaScript or blocks plain clients, you will route the authenticated request through the Crawling API. Sign up, open your dashboard, and copy your JavaScript (JS) token. Treat it like a password and keep it out of version control.

Set up the project

Create a virtual environment so dependencies stay isolated, then install the two libraries the scraper needs.

bash

python --version

python -m venv login_env
source login_env/bin/activate

pip install requests beautifulsoup4

On Windows, activate the environment with login_env\Scripts\activate instead of the source line. Two dependencies do the work: requests drives the HTTP session, and beautifulsoup4 parses the login form so you can read its field names and pull the CSRF token.

Step 1: Inspect the login form

Before you can post credentials you need to know exactly what the form expects: the URL it submits to, the names of its input fields, and any hidden values it carries. Open the login page in your browser, right-click the form, and choose Inspect. On the practice target the form posts to /login and contains a username field, a password field, and a hidden csrf_token field. Real sites vary, so always confirm these names against the live HTML rather than assuming.

You can read the same structure programmatically. Fetch the login page, load it into BeautifulSoup, and print the form's fields so you know what to send.

python

import requests
from bs4 import BeautifulSoup

LOGIN_URL = "https://quotes.toscrape.com/login"

page = requests.get(LOGIN_URL)
soup = BeautifulSoup(page.text, "html.parser")

for field in soup.select("form input"):
    print(field.get("name"), "->", field.get("type"))

Run this and you will see the three field names printed, including the hidden csrf_token. That hidden value is the piece most first-time login scrapers miss: the server issues it on the login page and rejects any POST that does not echo it back, which is exactly what a Cross-Site Request Forgery defence is meant to do.
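If you also want the action URL and each hidden field's preset value in one pass, a small parser can collect them. This is a sketch that uses only the standard library's html.parser so it runs without extra packages; BeautifulSoup can read the same attributes. The FormInspector class, the inspect_form helper, and the sample HTML are made up for illustration and are not copied from the practice site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormInspector(HTMLParser):
    """Collect the first form's action, method, and named input fields."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = "get"
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action", "")
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Hidden inputs (such as a CSRF token) carry a preset value.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def inspect_form(html, page_url):
    parser = FormInspector()
    parser.feed(html)
    # A relative action such as "/login" resolves against the page URL.
    return urljoin(page_url, parser.action or ""), parser.method, parser.fields


SAMPLE = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="submit" value="Login">
</form>
"""

action, method, fields = inspect_form(SAMPLE, "https://example.com/login")
print(action, method, fields)
```

Resolving the action with urljoin matters on real sites, where the form often posts to a different path than the page that displays it.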

Step 2: Log in with a session and the CSRF token

Now post the credentials. The key is to use a requests.Session object rather than a one-off requests.post. A session persists cookies across requests, so once the server sets a session cookie on a successful login, every later request through that same session sends the cookie automatically and the server keeps treating you as logged in.

The flow is: GET the login page to receive a fresh CSRF token (and the initial cookies), scrape the token out of the hidden input, then POST the username, password, and that same token back to the form's action URL through the session.

python

import requests
from bs4 import BeautifulSoup

LOGIN_URL = "https://quotes.toscrape.com/login"
USERNAME = "your-username"
PASSWORD = "your-password"

session = requests.Session()

# GET the form first to receive a fresh CSRF token and cookies.
login_page = session.get(LOGIN_URL)
soup = BeautifulSoup(login_page.text, "html.parser")
token = soup.find("input", {"name": "csrf_token"})["value"]

payload = {
    "csrf_token": token,
    "username": USERNAME,
    "password": PASSWORD,
}

response = session.post(LOGIN_URL, data=payload)
response.raise_for_status()

# The site shows a "Logout" link only when authenticated.
if "Logout" in response.text:
    print("Login succeeded; session cookies:", session.cookies.get_dict())
else:
    print("Login failed; still on the sign-in page.")

Run the script and, on a successful login, you will see Login succeeded followed by the session cookie the server set. That cookie is your proof of identity for everything that follows. Checking for the Logout link is a simple success test: that text only appears for an authenticated user, so its presence confirms the login took, which a 200 status code alone cannot tell you.
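The script above keeps the username and password as constants, which is fine for the practice site but risky for a real account. One common alternative, sketched here, is to read them from environment variables. The variable names SCRAPER_USERNAME and SCRAPER_PASSWORD and the load_credentials helper are hypothetical; use whatever naming your environment already follows.

```python
import os


def load_credentials(prefix="SCRAPER"):
    """Read the username and password from environment variables.

    The names <prefix>_USERNAME and <prefix>_PASSWORD are an assumption
    made for this sketch, not a convention of requests or of any site.
    """
    username = os.environ.get(f"{prefix}_USERNAME")
    password = os.environ.get(f"{prefix}_PASSWORD")
    if not username or not password:
        # Fail early with a clear message instead of posting empty fields.
        raise RuntimeError(
            f"Set {prefix}_USERNAME and {prefix}_PASSWORD before running the scraper."
        )
    return username, password
```

In the login script you would then write USERNAME, PASSWORD = load_credentials() in place of the two hard-coded constants, which keeps the credentials out of version control.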

Crawlbase Crawling API

The login above works because the practice target is plain HTML. The moment your real dashboard renders its data with JavaScript or challenges automated clients, a requests.Session alone falls short. The Crawling API renders the page in a real browser and rotates requests through trusted residential IPs server-side, and it accepts your session cookies, so you can hand it an authenticated request and get back finished content without running a headless browser fleet and a proxy pool yourself.

Step 3: Fetch a protected page and parse it

With the session authenticated, every request through that same session object carries the login cookie automatically, so fetching a protected page is just another session.get() with no extra headers. Here we reuse the session from Step 2 to request a page and parse content from it, exactly as you would parse your own exported data. The practice site's home page is also visible to anonymous visitors, so treat it as a stand-in for a page that your real target would gate.

python

PROTECTED_URL = "https://quotes.toscrape.com/"

# The same session sends the login cookie automatically.
page = session.get(PROTECTED_URL)
page.raise_for_status()

soup = BeautifulSoup(page.text, "html.parser")
records = []

for card in soup.select(".quote"):
    records.append({
        "text": card.select_one(".text").text.strip(),
        "author": card.select_one(".author").text.strip(),
    })

print(len(records), "records read while authenticated")

Because the session holds the cookie, the server returns the logged-in version of the page instead of bouncing you to the form. If you swap in your own authorised dashboard URL and its real selectors, this is the whole pattern: log in once, then read as many protected pages as you need through the same session.

Step 4: Carry the session into the Crawling API

The plain-session approach stops working when the protected page is rendered by JavaScript, or when the site challenges automated clients before your cookie is even checked. In that case, you keep the same login you built above and hand the authenticated request to the Crawling API, passing the cookies the server gave you. The API renders the page behind a trusted IP and returns finished content.

python

import requests

JS_TOKEN = "YOUR_CRAWLBASE_JS_TOKEN"
TARGET_URL = "https://quotes.toscrape.com/"

# Reuse the cookies from the logged-in session in Step 2.
cookie_pairs = [f"{k}={v}" for k, v in session.cookies.get_dict().items()]
cookie_header = "; ".join(cookie_pairs)

params = {
    "token": JS_TOKEN,
    "url": TARGET_URL,
    "cookies": cookie_header,
    "country": "US",
}

# Rendered requests can be slow; a generous timeout stops the call hanging forever.
api = requests.get("https://api.crawlbase.com/", params=params, timeout=90)
api.raise_for_status()
print(api.text[:500])

The cookies parameter takes the same key1=value1; key2=value2 format a browser sends, which is why we join the session's cookie dict into one header string. Crawlbase forwards those cookies with the request it renders, so the site treats the call as logged in, then returns the rendered HTML for you to parse with the same BeautifulSoup code from Step 3. If you make several authenticated calls in a row and want the session to persist across them, see the note on the cookies_session parameter under "Keep the session warm" below.
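To see what the final request looks like on the wire, you can build the URL by hand. The sketch below uses urllib.parse.urlencode, which percent-encodes the target URL and the cookie string much as requests does when you pass params. The helper names and the session=abc123 cookie are placeholders for illustration only.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://api.crawlbase.com/"


def cookie_header(cookies):
    """Join a {name: value} dict into 'name1=value1; name2=value2'."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


def build_api_url(token, target_url, cookies):
    """Return the full Crawling API URL with an encoded query string."""
    params = {
        "token": token,
        "url": target_url,
        "cookies": cookie_header(cookies),
    }
    # Characters such as ':', '/', '=' and ';' are percent-encoded here, so
    # the cookie string cannot be mistaken for extra query parameters.
    return f"{API_ENDPOINT}?{urlencode(params)}"


url = build_api_url(
    "YOUR_CRAWLBASE_JS_TOKEN",
    "https://quotes.toscrape.com/",
    {"session": "abc123"},  # placeholder cookie, not a real session
)
print(url)
```

Printing the URL is handy when a call fails: you can confirm the cookies arrived as a single encoded value. Avoid logging it in production, since it contains both your token and your session cookie.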

What the output looks like

The plain-session run in Step 3 produces structured records you can serialise to JSON. With the practice target the shape is small and predictable:

json

[
  {
    "text": "The world as we have created it is a process of our thinking.",
    "author": "Albert Einstein"
  },
  {
    "text": "It is our choices that show what we truly are.",
    "author": "J.K. Rowling"
  }
]

Swap in your authorised dashboard and the fields change, but the principle does not: you logged in, the session carried your identity, and you parsed content that an anonymous request could never reach.
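Turning the records list into that JSON takes a few lines with the standard library. This is a minimal sketch; the save_records helper is a name chosen here, and the two records simply repeat the sample output above.

```python
import json


def save_records(records, path):
    """Write the scraped records to `path` as pretty-printed UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps curly quotes and accented names readable.
        json.dump(records, handle, indent=2, ensure_ascii=False)


records = [
    {
        "text": "The world as we have created it is a process of our thinking.",
        "author": "Albert Einstein",
    },
    {
        "text": "It is our choices that show what we truly are.",
        "author": "J.K. Rowling",
    },
]

print(json.dumps(records, indent=2, ensure_ascii=False))
```

Calling save_records(records, "quotes.json") writes the same structure to disk; the filename is only an example.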

Handling "remember me" and expired sessions

Two practical wrinkles come up once you move past a single run. The first is the "remember me" checkbox. When a form offers it, it is just another form field, often a checkbox named something like remember. Inspect the form, and if the box maps to a value, add it to your payload (for example "remember": "on"). Sites that honour it return a longer-lived cookie. A requests.Session keeps cookies in memory only, so to reuse that cookie across script runs you also need to save the cookie jar when the script exits and load it again on the next run. Only set the field when the form actually has it; inventing fields the server does not expect can cause the login to fail.
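One way to follow the "only when the form has it" rule is to build the payload from the fields you actually found on the form. The sketch below assumes a dict of field names to preset values, such as you might collect while inspecting the form; build_payload and the field names username, password, and remember are illustrative and must be matched to the real form.

```python
def build_payload(form_fields, username, password, remember=True):
    """Fill in a login payload from the fields found on the form.

    `form_fields` maps each input name to its preset value. The names
    "username", "password" and "remember" are assumptions for this sketch.
    """
    payload = dict(form_fields)  # keeps hidden values such as the CSRF token
    payload["username"] = username
    payload["password"] = password
    if "remember" in form_fields:
        if remember:
            # A checked checkbox with no value attribute submits "on".
            payload["remember"] = form_fields["remember"] or "on"
        else:
            # An unchecked checkbox is simply absent from the POST body.
            payload.pop("remember")
    return payload


with_box = build_payload(
    {"csrf_token": "abc123", "username": "", "password": "", "remember": ""},
    "alice",
    "s3cret",
)
without_box = build_payload(
    {"csrf_token": "abc123", "username": "", "password": ""},
    "alice",
    "s3cret",
)
print(with_box)
print(without_box)
```

Because the payload starts from the form's own fields, a form without the checkbox never receives a remember key, whatever the caller asks for.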

The second wrinkle is expiry. Login cookies are not permanent. They lapse on a timer, on logout elsewhere, or when the site rotates sessions. The tell is your scraper suddenly pulling the sign-in page instead of the content. Handle it by detecting the failure (the Logout link is gone, or you were redirected to /login) and re-running the login flow from Step 2 to mint a fresh session before retrying. Building that check in from the start saves you from silently scraping login pages for an hour.
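The detect-and-retry loop can be written once and reused. This is a sketch under stated assumptions: fetch, login, and is_logged_in are callables you supply, and fetch_with_reauth is a name chosen for this example. The fake functions at the bottom stand in for a real session so the flow can be followed without a network.

```python
def fetch_with_reauth(fetch, login, is_logged_in, url, max_logins=1):
    """Fetch `url`, logging in again if the response looks logged out.

    fetch(url) returns page text, login() refreshes the session, and
    is_logged_in(text) returns a bool. All three are placeholders.
    """
    text = fetch(url)
    logins = 0
    while not is_logged_in(text):
        if logins >= max_logins:
            raise RuntimeError(
                f"Still logged out after {logins} re-login attempt(s): {url}"
            )
        login()  # re-run the Step 2 flow to mint a fresh session
        logins += 1
        text = fetch(url)
    return text


# Stand-ins that simulate an expired session recovering after one login.
state = {"logged_in": False, "logins": 0}


def fake_fetch(url):
    return "<a>Logout</a>" if state["logged_in"] else "<form>Login</form>"


def fake_login():
    state["logged_in"] = True
    state["logins"] += 1


page = fetch_with_reauth(
    fake_fetch, fake_login, lambda text: "Logout" in text, "https://example.com/reports"
)
print(page, state["logins"])
```

With the objects from Step 2 you might pass fetch=lambda u: session.get(u).text and the same "Logout" check. The max_logins cap matters: without it, a wrong password would loop forever.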

Keep the session warm

If you make many authenticated requests through the Crawling API and want the same login to persist across them, assign the cookies_session parameter any value up to 32 characters. The API links the session cookies from one request to the next so you do not re-send the full cookie string each time.
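As a rough illustration, the parameters for a run of calls could be built like this. The session_params helper is hypothetical, and generating the identifier with uuid4 is just one convenient way to get a value that fits the 32-character limit.

```python
import uuid


def session_params(token, target_url, cookies_session=None):
    """Build Crawling API query parameters that share one cookies_session."""
    if cookies_session is None:
        # uuid4().hex is exactly 32 hexadecimal characters.
        cookies_session = uuid.uuid4().hex
    if not cookies_session or len(cookies_session) > 32:
        raise ValueError("cookies_session must be 1 to 32 characters long")
    return {
        "token": token,
        "url": target_url,
        "cookies_session": cookies_session,
    }


run_id = uuid.uuid4().hex  # reuse this one value for every call in the run
first = session_params("YOUR_CRAWLBASE_JS_TOKEN", "https://quotes.toscrape.com/", run_id)
second = session_params("YOUR_CRAWLBASE_JS_TOKEN", "https://quotes.toscrape.com/page/2/", run_id)
print(first["cookies_session"] == second["cookies_session"])
```

Each dict can be passed as params to requests.get against the API endpoint. The important part is that every call in the run carries the same cookies_session value.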

Staying unblocked

Even with a valid session, sites watch for traffic that does not look human. A few habits keep an authorised run healthy.

Pace your requests. Hammering protected pages in a tight loop is the fastest way to get a session flagged. Space requests out and add a short sleep between them.
Send the same CSRF token the form gave you. Reusing a stale token, or skipping it, is a common reason a login POST is rejected. Always GET the form first and replay its current token.
Watch the status codes. A run that starts returning redirects or challenges is telling you the session lapsed or the IP tier is no longer enough. Back off and re-authenticate rather than retrying blindly.
Lean on rotation for hard targets. When a single IP keeps tripping checks, the Crawling API rotates through residential addresses for you; if you build your own stack, the Smart AI Proxy gives you the same rotation as a drop-in endpoint.
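The pacing and back-off habits above can be sketched with two small helpers. The names paced and backoff_delay and the default timings (a 1 to 3 second gap, doubling retries capped at 60 seconds) are illustrative assumptions; tune them to what your target tolerates.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0, jitter=0.5, rng=random):
    """Seconds to wait before retry number `attempt` (0-based).

    Doubles the delay each attempt up to `cap`, then adds up to `jitter`
    (a fraction of the delay) so retries do not land on a fixed rhythm.
    """
    delay = min(cap, base * (2 ** attempt))
    return delay + rng.uniform(0, jitter * delay)


def paced(items, min_gap=1.0, max_gap=3.0, sleep=time.sleep, rng=random):
    """Yield items one at a time, sleeping a random gap between them."""
    for index, item in enumerate(items):
        if index:  # no wait before the first item
            sleep(rng.uniform(min_gap, max_gap))
        yield item
```

A loop such as for url in paced(urls): page = session.get(url) then spaces requests out without extra bookkeeping, and time.sleep(backoff_delay(attempt)) fits between retries after a redirect or challenge. The sleep argument is injectable so the pacing can be tested without actually waiting.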

For the broader playbook, see how to scrape websites without getting blocked and, when the protected page is client-rendered, scraping JavaScript pages with Python.

Is it legal to scrape data behind a login?

This is the question that decides whether anything above is appropriate to run, so be honest about it before you write a line of production code. The short answer: only access accounts and data that you own or are explicitly authorised to access. The moment you log in to a site, you accept its terms of service, and those terms almost always restrict automated access. So logging in does not grant you the right to scrape; if anything it adds a contract you are now bound by. If the data is not yours, get written permission before you automate against it.

What is firmly out of bounds is the part this guide does not teach. Never use stolen, shared, or brute-forced credentials, and never log in to an account that is not yours. Never collect other users' personal data, private messages, profiles, or anything a real person would consider theirs. Bypassing authentication, scraping a login wall you were not invited through, or harvesting personal information is not a grey area; it can breach computer-misuse and data-protection laws regardless of how clean your code is. The techniques here exist for one purpose: reaching your own authorised data, such as exporting figures from a dashboard you administer, when the site offers no easier route.

That easier route is usually the right first stop. Before you script a login, check whether the service has an official API, a data export or download feature, or an OAuth integration. Those are the sanctioned paths the provider built for exactly this, and they keep you on the right side of the terms you agreed to. Reach for session scraping only when no official mechanism exists and the data is genuinely yours, then keep the scope to that data and nothing else. If a project needs information that belongs to other people or other organisations, a formal data agreement is the correct path, not a cleverer login script.

Recap

Key takeaways

Authorisation comes first. Only scrape behind a login for accounts and data you own or are explicitly permitted to access, and prefer an official API or export when one exists.
Inspect the form before you post. Read the field names, the action URL, and any hidden CSRF token from the login HTML rather than guessing.
Use a session, not one-off requests. A requests.Session persists cookies, so a single login keeps every later request authenticated.
Replay the CSRF token. GET the form to receive a fresh token, then send it back on the POST, or the server rejects the login.
Hand JS rendering and blocks to the Crawling API. When a session alone falls short, pass your cookies to the Crawling API so it renders behind a trusted IP and returns finished content.

Frequently Asked Questions (FAQs)

Why does a plain request return the login page instead of my data?

Because the server gates the page on a session your script never established. A bare requests.get() sends no login cookie, so the server treats you as anonymous and returns a redirect to the sign-in form or the form's HTML with a 200 status. To reach the content you have to authenticate first and then send the session cookie with each request, which a requests.Session does automatically.

How do I handle a CSRF token in a login form?

Send a GET request to the login URL first, parse the returned HTML, and read the hidden CSRF input (often named csrf_token) out of the form. Include that exact value in the payload you POST back to the login URL. Some sites rotate the token per request or use more than one, so always GET the form fresh and inspect it carefully rather than hard-coding a token.

What does "remember me" change in the request?

It is an extra form field, usually a checkbox. When you include it in your POST payload (for example "remember": "on"), sites that honour it issue a longer-lived cookie. A requests.Session holds cookies in memory only, so the login carries over to later script runs only if you also save the cookie jar and reload it. Only add the field if the form actually has it; sending fields the server does not expect can break the login.

My scraper started returning login pages mid-run. What happened?

Your session cookie almost certainly expired or was invalidated, by a timer, a logout elsewhere, or the site rotating sessions. Detect it (the Logout link is gone, or you were redirected to /login) and re-run the login flow to mint a fresh session before retrying. Building that check in from the start keeps you from silently scraping sign-in pages.

Can I scrape another person's account this way?

No. This guide is scoped to data you own or are explicitly authorised to access. Using stolen, shared, or guessed credentials, logging in to an account that is not yours, or collecting other users' personal data is out of bounds and can breach computer-misuse and data-protection laws. If you need data that belongs to someone else, get written permission or use an official data agreement.

When should I use the Crawling API instead of plain requests?

Use plain requests when the protected page is static HTML, as in the practice target here. Reach for the Crawling API when your authorised target renders its content with JavaScript or challenges automated clients. You keep the same login you built, then pass the session cookies to the API through its cookies parameter so it renders behind a trusted IP and returns finished content.

Ian Kalvin

Technical Support Engineer · Crawlbase

Technical support engineer at Crawlbase, writing from the front line of what actually breaks in production scraping and proxy setups.

Neil Zamora

Senior Architect · Crawlbase

Senior architect at Crawlbase, focused on the systems behind large-scale crawling: proxy rotation, anti-bot resilience, and the APIs that hide that complexity.
