Indeed is one of the largest job boards on the web, aggregating millions of public postings across industries, employers, and regions. Each listing carries structured signal that powers labor-market research, recruiter competitive analysis, and custom job-search tools: a job title, the hiring company, a location, a salary range when it is public, and a link to the full posting. The catch is that Indeed builds its search results with JavaScript and tucks the listing data deep inside the page, so a plain HTTP request hands you a near-empty shell instead of the jobs you came for.

This guide shows you how to scrape Indeed job posts with Python the reliable way. You build a small, runnable scraper that fetches a rendered search page through the Crawling API, pulls the embedded job data out of the page, parses each field, handles pagination, and exports the results to JSON and CSV. The whole walkthrough stays scoped to public job-listing data, and the legality section near the end is not boilerplate, so read it before you point this at any real volume.

What you will build

A Python script that takes a public Indeed search URL, retrieves the rendered HTML through the Crawling API, and extracts a structured record for every posting on the page. We will use a developer-jobs search as the running example and pull these fields per posting:

  • Job title: the role being advertised, read from each card's title field.
  • Company: the employer behind the listing.
  • Location: where the job is based, from formattedLocation.
  • Salary range: the minimum and maximum pay when Indeed exposes it, from extractedSalary.
  • Job key and link: the jobkey identifier and posting URL, so you can follow up per role.

Why a plain request fails on Indeed

If you request an Indeed search URL with a bare HTTP client, you get a response with status 200 and almost none of the listing data in the body. Two things work against you. First, Indeed renders its results with JavaScript, so the initial HTML is a shell that only fills in after the page's scripts run. Second, Indeed actively defends against automated access with CAPTCHA challenges and per-IP rate limits, so even a rendered request from a flagged address gets a challenge page rather than jobs.

So a working Indeed scraper needs two things in one request: a browser that actually renders the page, and an IP the platform reads as a real visitor. You can assemble that yourself with a headless browser and a pool of rotating residential proxies, but stitching those together and keeping them healthy is most of the work. The Crawling API folds both into a single call: you send it the URL with a JavaScript token, it renders the page behind a trusted IP, and it returns finished HTML for you to parse.

Why the JS token

Crawlbase offers two token types. The normal token fetches static HTML; the JavaScript (JS) token renders the page in a real browser first. Indeed is client-side rendered, so you need the JS token here. Using the normal token returns the same empty shell a plain fetch would, and there is nothing to parse out of it.
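
A quick way to tell a rendered page from the empty shell is to look for the embedded data marker before you try to parse anything. The sketch below assumes the variable name this guide parses later (mosaic-provider-jobcards); the helper name has_embedded_jobs and both sample strings are hypothetical.

```python
# Marker taken from the variable name this guide parses; it may change if Indeed redesigns.
JOBCARDS_MARKER = 'window.mosaic.providerData["mosaic-provider-jobcards"]'

def has_embedded_jobs(html):
    """Return True when the HTML appears to contain the embedded job-card JSON."""
    return bool(html) and JOBCARDS_MARKER in html

# Invented samples: an unrendered shell and a page that carries the marker.
shell = "<html><body><div id='app'></div></body></html>"
rendered = '<script>window.mosaic.providerData["mosaic-provider-jobcards"]={"metaData":{}};</script>'

print(has_embedded_jobs(shell))
print(has_embedded_jobs(rendered))
```

If this check fails on a response fetched with the JS token, the more likely causes are a challenge page or a changed page layout rather than the token type.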

Prerequisites

You need a few things in place before writing any code. None of them take long.

Basic Python. You should be comfortable writing and running a Python script and installing packages with pip. The parsing here leans on regular expressions and JSON rather than a DOM library, but if you also want a selector-based primer, our guide on how to use BeautifulSoup in Python covers the basics.

Python 3.8 or later. Confirm your version with python --version. If you do not have it, install it from python.org or through a distribution like Anaconda.

A Crawlbase account and JS token. Sign up, open your dashboard, and copy your JavaScript (JS) token from the account docs page. Treat the token like a password: it authenticates your requests, so keep it out of version control. The free tier includes 1,000 requests, enough to follow this guide end to end, and you are only charged for successful requests.
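
One common way to keep the token out of version control is to read it from an environment variable instead of hardcoding it. This is a minimal sketch, not part of the Crawlbase client: the variable name CRAWLBASE_JS_TOKEN and the helper load_js_token are assumptions chosen for illustration.

```python
import os

def load_js_token(env=os.environ, name="CRAWLBASE_JS_TOKEN"):
    """Read the JS token from an environment mapping; fail loudly if it is missing."""
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {name} environment variable before running the scraper")
    return token

# Usage with an explicit mapping so the example runs without touching your real environment.
print(load_js_token({"CRAWLBASE_JS_TOKEN": "example-token"}))
```

In a real script you would call load_js_token() with no arguments and pass the result wherever the code samples show "YOUR_CRAWLBASE_JS_TOKEN".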

Set up the project

Create a virtual environment so project dependencies stay isolated, then install the one library the scraper needs.

bash
python --version

python -m venv indeed_env
source indeed_env/bin/activate

pip install crawlbase

On Windows, activate the environment with indeed_env\Scripts\activate instead of the source line. The crawlbase package is the official client for the Crawling API. Because Indeed embeds its listing data as JSON inside the page, the standard-library re and json modules do the parsing, so there is no HTML library to install for the search pages.

Step 1: Fetch the rendered search page

Start by getting the finished page. When you run a search on Indeed's homepage, it redirects you to a URL like https://www.indeed.com/jobs?q=Web+Developer&l=Virginia, where q is the query and l is the location. Import the CrawlingAPI class, initialize it with your JS token, and request that URL. Pass country so the request comes from a US IP, since Indeed serves localized results. Checking the status code before you parse keeps failures loud instead of silent.

python
from crawlbase import CrawlingAPI

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_JS_TOKEN"})

def crawl(page_url):
    response = api.get(page_url, {"country": "US"})
    if response["status_code"] == 200:
        return response["body"].decode("latin1")
    print(f"Request failed: {response['status_code']}")
    return None

if __name__ == "__main__":
    page_url = "https://www.indeed.com/jobs?q=Web+Developer&l=Virginia"
    html = crawl(page_url)
    print(html[:500] if html else "No HTML returned")

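Save the script as scraper.py, run it with python scraper.py, and you should see real page markup, not the empty shell a plain fetch returns. That confirms rendering works before you write a single parser. We decode with latin1 rather than utf-8 because Indeed's pages carry mixed byte sequences, and latin1 maps every possible byte to a character, so the decode never raises and the ASCII structure of the embedded JSON survives intact. The trade-off is that any non-ASCII character encoded as UTF-8 comes out garbled in the decoded text.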
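
To see what the latin1 decode does to text in practice, here is a small standalone illustration. The sample bytes are invented, not real Indeed output, and the utf-8 alternative at the end is an option to consider rather than what the scraper above does.

```python
raw = "Café Développeur".encode("utf-8")  # sample bytes, not real Indeed output

# latin1 maps each byte to exactly one character, so this never raises...
as_latin1 = raw.decode("latin1")
# ...but every multi-byte UTF-8 character comes out as two characters.
print(as_latin1)

# If the bytes really were UTF-8, encoding back to latin1 and decoding as UTF-8 recovers the text.
repaired = as_latin1.encode("latin1").decode("utf-8")
print(repaired)

# An alternative that keeps valid UTF-8 intact and substitutes only undecodable bytes.
print(raw.decode("utf-8", errors="replace"))
```

The JSON punctuation is plain ASCII and is unaffected either way; only non-ASCII characters inside string values, such as accented company names, are at risk.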

Step 2: Locate the embedded job data

You could parse the rendered job cards with CSS or XPath selectors, but Indeed gives you an easier and far more stable hook. Every search page embeds the full set of listings as a JSON document inside a script tag, assigned to a JavaScript variable named window.mosaic.providerData["mosaic-provider-jobcards"]. Reading that single blob gives you every field the cards display, already structured, with none of the markup churn that breaks class-name selectors.

A short regular expression pulls that JSON out of the HTML. Once parsed, the listings live under a predictable path, and a sibling block holds the result counts you will need for pagination:

python
import re
import json

def parse_search_page_html(html):
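    # Escape the dots so they match literally, and capture the JSON object up to the first "};".
    matches = re.findall(r'window\.mosaic\.providerData\["mosaic-provider-jobcards"\]=(\{.+?\});', html)
    if not matches:
        # No embedded data: likely an unrendered shell, a challenge page, or a changed layout.
        raise ValueError("mosaic-provider-jobcards data not found in page")
    data = json.loads(matches[0])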
    model = data["metaData"]["mosaicProviderJobCardsModel"]
    return {
        "results": model["results"],
        "meta": model["tierSummaries"],
    }

The function locates the mosaic-provider-jobcards variable, parses the JSON, and returns two pieces: results, the list of job listings, and meta, the tier summaries that report how many jobs match across categories. Each entry in results is a rich object, but you only need a handful of its fields for a clean record.
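
To try the extraction without spending a request, you can run the same idea against a tiny hand-made page. This sketch uses the pattern from parse_search_page_html with the literal dots escaped; the sample HTML is invented and far smaller than a real Indeed response, which nests many more keys.

```python
import json
import re

# Pattern from parse_search_page_html, with the literal dots escaped.
PATTERN = re.compile(r'window\.mosaic\.providerData\["mosaic-provider-jobcards"\]=(\{.+?\});')

# Invented sample page that mimics the shape this guide relies on.
sample_html = (
    '<script>window.mosaic.providerData["mosaic-provider-jobcards"]='
    '{"metaData":{"mosaicProviderJobCardsModel":{'
    '"results":[{"title":"Web Developer","jobkey":"abc123"}],'
    '"tierSummaries":[{"tier":0,"jobCount":1}]}}};</script>'
)

match = PATTERN.search(sample_html)
model = json.loads(match.group(1))["metaData"]["mosaicProviderJobCardsModel"]

print(model["results"][0]["title"])
print(model["tierSummaries"][0]["jobCount"])
```

Note that the non-greedy capture stops at the first "};" it meets, so a string value containing that sequence would cut the match short; json.loads would then raise, which is a useful signal that the pattern needs attention.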

Step 3: Extract the fields you want

A raw listing object carries dozens of internal keys. Map each card down to the fields that matter, guarding the optional ones so a missing salary never crashes the run. Salary lives under extractedSalary as a min, max, and type when Indeed has it, and the posting link is built from the jobkey.

python
def format_salary(card):
    salary = card.get("extractedSalary")
    if not salary:
        return ""
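    low, high = salary.get("min"), salary.get("max")
    unit = salary.get("type", "")
    if low is None and high is None:
        return ""  # salary block present but without usable bounds
    if low is None or high is None or low == high:
        # Only one distinct bound: print it alone rather than "None-95000".
        return f"{low if low is not None else high} {unit}".strip()
    return f"{low}-{high} {unit}".strip()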

def extract_jobs(results):
    jobs = []
    for card in results:
        job_key = card.get("jobkey", "")
        jobs.append({
            "title": card.get("title", ""),
            "company": card.get("company", ""),
            "location": card.get("formattedLocation", ""),
            "salary": format_salary(card),
            "posted": card.get("formattedRelativeTime", ""),
            "job_key": job_key,
            "link": f"https://www.indeed.com/viewjob?jk={job_key}",
        })
    return jobs

Every field read goes through .get() with a default, so a listing that omits a salary or a company name yields an empty string instead of a KeyError. The jobkey is worth keeping on its own: it uniquely identifies a posting and lets you build a direct viewjob link, which is handy if you later want to fetch the full job description page for any single role.
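
Because the jobkey identifies a posting, it can also serve as a deduplication key if the same listing shows up on more than one results page. The helper below is a hypothetical addition, not part of the scraper in this guide, and the sample records are invented.

```python
def dedupe_jobs(jobs):
    """Keep the first record for each job_key; records without a key are kept as-is."""
    seen = set()
    unique = []
    for job in jobs:
        key = job.get("job_key")
        if key and key in seen:
            continue  # already have this posting
        if key:
            seen.add(key)
        unique.append(job)
    return unique

sample = [
    {"title": "Web Developer", "job_key": "a1"},
    {"title": "Web Developer (repost)", "job_key": "a1"},
    {"title": "Front End Engineer", "job_key": "b2"},
]
print([job["title"] for job in dedupe_jobs(sample)])
```

Records with an empty job_key are deliberately never merged, since there is no reliable way to tell whether two of them describe the same posting.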

Selectors drift

The mosaic-provider-jobcards variable name and the field keys above reflect Indeed's current page structure and can change in a redesign. Treat them as a starting template, not a contract. If the regex stops matching or a field comes back empty across every card, re-inspect a live search page in your browser's dev tools, find the embedded JSON again, and update the pattern. Periodic maintenance is normal for any production scraper.
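
One way to notice drift early is to flag any field that comes back empty on every record in a run. This is a rough sketch under the assumption that records use the field names from Step 3; the helper name is hypothetical, and salary is left out of the default check because an empty salary is normal.

```python
def fields_empty_everywhere(jobs, fields=("title", "company", "location", "job_key")):
    """Return the fields that are empty in every record, a hint that a key was renamed."""
    if not jobs:
        return list(fields)  # nothing extracted at all: treat every field as suspect
    return [name for name in fields if not any(job.get(name) for job in jobs)]

# Invented records where "company" is blank on every card.
sample = [
    {"title": "Web Developer", "company": "", "location": "Richmond, VA", "job_key": "a1"},
    {"title": "Designer", "company": "", "location": "", "job_key": "b2"},
]
print(fields_empty_everywhere(sample))
```

A non-empty result is a prompt to re-inspect a live page, not proof of a layout change; a small result set can legitimately lack a field.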

Step 4: Handle pagination

One search page shows only the first batch of jobs. Indeed paginates with a start query parameter that offsets results in steps of 10, so you walk further into the results by incrementing it. The tier summaries from Step 2 tell you how many jobs match in total, which lets you cap the crawl at a sensible max_results instead of fetching every page. Build each page URL with urlencode, fetch it, and accumulate the extracted records.

python
import time
from urllib.parse import urlencode

def make_search_url(query, location, offset):
    params = {"q": query, "l": location, "filter": 0, "start": offset}
    return f"https://www.indeed.com/jobs?{urlencode(params)}"

def scrape_indeed_search(query, location, max_results=50):
    print(f"Scraping first page: query={query}, location={location}")
    html = crawl(make_search_url(query, location, 0))
    if not html:
        return []

    first = parse_search_page_html(html)
    jobs = extract_jobs(first["results"])
    total = sum(c["jobCount"] for c in first["meta"])
    total = min(total, max_results)

    for offset in range(10, total, 10):
        print(f"Scraping page at offset {offset}")
        page_html = crawl(make_search_url(query, location, offset))
        if page_html:
            page = parse_search_page_html(page_html)
            jobs.extend(extract_jobs(page["results"]))
        time.sleep(2)
    return jobs

The first request does double duty: it gives you the opening page of jobs and the tier summaries that reveal the total match count. From there the loop generates page URLs at offsets of 10 up to your cap and fetches each one, reusing the same parse-and-extract pair. The time.sleep(2) between pages is deliberate. Firing requests back to back is the fastest way to get throttled, even with rendering and rotation handled for you.
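
The offset arithmetic is easy to check in isolation. The helper below is a hypothetical restatement of what the loop does (the first page at offset 0, then steps of 10 up to the capped total), written so you can see which pages a given match count would trigger without making any requests.

```python
def page_offsets(total_matches, max_results=50, step=10):
    """Offsets to request: the first page at 0, then steps up to the capped total."""
    capped = min(total_matches, max_results)
    return [0] + list(range(step, capped, step))

print(page_offsets(37))    # 37 matches, under the default cap
print(page_offsets(1200))  # many matches, capped at max_results
print(page_offsets(8))     # fewer matches than one step
```

With the default cap of 50, a search never triggers more than five requests, which is a reasonable ceiling while you are still testing.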

Step 5: Put it together and export JSON and CSV

Now wire the fetch, the parser, the field extractor, and the pagination loop into one runnable script, then write the records to both JSON and CSV. JSON keeps the structure intact for downstream code; CSV opens straight in a spreadsheet.

python
import re
import json
import csv
import time
from urllib.parse import urlencode
from crawlbase import CrawlingAPI

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_JS_TOKEN"})

def crawl(page_url):
    response = api.get(page_url, {"country": "US"})
    if response["status_code"] == 200:
        return response["body"].decode("latin1")
    print(f"Request failed: {response['status_code']}")
    return None

def parse_search_page_html(html):
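    matches = re.findall(r'window\.mosaic\.providerData\["mosaic-provider-jobcards"\]=(\{.+?\});', html)
    if not matches:
        # Unrendered shell, challenge page, or a changed layout.
        raise ValueError("mosaic-provider-jobcards data not found in page")
    data = json.loads(matches[0])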
    model = data["metaData"]["mosaicProviderJobCardsModel"]
    return {"results": model["results"], "meta": model["tierSummaries"]}

def format_salary(card):
    salary = card.get("extractedSalary")
    if not salary:
        return ""
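    low, high = salary.get("min"), salary.get("max")
    unit = salary.get("type", "")
    if low is None and high is None:
        return ""
    if low is None or high is None or low == high:
        return f"{low if low is not None else high} {unit}".strip()
    return f"{low}-{high} {unit}".strip()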

def extract_jobs(results):
    jobs = []
    for card in results:
        job_key = card.get("jobkey", "")
        jobs.append({
            "title": card.get("title", ""),
            "company": card.get("company", ""),
            "location": card.get("formattedLocation", ""),
            "salary": format_salary(card),
            "posted": card.get("formattedRelativeTime", ""),
            "job_key": job_key,
            "link": f"https://www.indeed.com/viewjob?jk={job_key}",
        })
    return jobs

def make_search_url(query, location, offset):
    params = {"q": query, "l": location, "filter": 0, "start": offset}
    return f"https://www.indeed.com/jobs?{urlencode(params)}"

def scrape_indeed_search(query, location, max_results=50):
    html = crawl(make_search_url(query, location, 0))
    if not html:
        return []
    first = parse_search_page_html(html)
    jobs = extract_jobs(first["results"])
    total = min(sum(c["jobCount"] for c in first["meta"]), max_results)
    for offset in range(10, total, 10):
        page_html = crawl(make_search_url(query, location, offset))
        if page_html:
            jobs.extend(extract_jobs(parse_search_page_html(page_html)["results"]))
        time.sleep(2)
    return jobs

def save_results(jobs):
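    # Write both files as UTF-8 so the output does not depend on the platform's default encoding.
    with open("indeed_jobs.json", "w", encoding="utf-8") as f:
        json.dump(jobs, f, indent=2)
    if jobs:
        with open("indeed_jobs.csv", "w", newline="", encoding="utf-8") as f: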
            writer = csv.DictWriter(f, fieldnames=jobs[0].keys())
            writer.writeheader()
            writer.writerows(jobs)

def main():
    jobs = scrape_indeed_search("Web Developer", "Virginia")
    save_results(jobs)
    print(f"Saved {len(jobs)} jobs to indeed_jobs.json and indeed_jobs.csv")

if __name__ == "__main__":
    main()

What the output looks like

Run the full script with python scraper.py and you get a list of clean structured records, one per posting, written to both indeed_jobs.json and indeed_jobs.csv.

json
[
  {
    "title": "Front Desk Agent",
    "company": "The Inn at Little Washington",
    "location": "Washington, VA 22747",
    "salary": "22 hourly",
    "posted": "20 days ago",
    "job_key": "72ed373141879fd4",
    "link": "https://www.indeed.com/viewjob?jk=72ed373141879fd4"
  },
  {
    "title": "Web Developer",
    "company": "Acme Digital",
    "location": "Richmond, VA",
    "salary": "75000-95000 yearly",
    "posted": "3 days ago",
    "job_key": "6a45faa5d8d817fa",
    "link": "https://www.indeed.com/viewjob?jk=6a45faa5d8d817fa"
  }
]

Listings without a public salary come back with an empty salary string, which is expected since not every employer posts pay. The same records sit in the CSV with one row per job, ready to open in a spreadsheet or load into a database for analysis.
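
As a small illustration of how those records behave once they are in CSV form, the sketch below writes a few rows to an in-memory buffer the same way save_results writes to disk, then reads them back. The first two rows reuse values from the sample output above; the third row is invented to show a listing with no public salary.

```python
import csv
import io

records = [
    {"title": "Front Desk Agent", "salary": "22 hourly"},
    {"title": "Web Developer", "salary": "75000-95000 yearly"},
    {"title": "QA Tester", "salary": ""},  # invented row with no public salary
]

# Write to an in-memory buffer instead of a file so the example has no side effects.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=records[0].keys())
writer.writeheader()
writer.writerows(records)

# Read it back: every value is a string, and a missing salary is an empty string.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
with_salary = [row for row in rows if row["salary"]]
print(f"{len(with_salary)} of {len(rows)} rows include a salary")
```

Because the salary column is free text such as "22 hourly", any numeric analysis would need to split the amount from the unit first.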

Staying unblocked

Even with rendering and rotation handled, Indeed watches closely for scraper-shaped traffic and is one of the more aggressive job boards about it. A few habits keep a run healthy, and they apply to any hard commercial target.

  • Pace your requests. Keep the time.sleep between pages and vary your queries instead of crawling one search path at full speed.
  • Lean on rotation. A pool of residential IPs spreads requests across many real-user addresses so no single one trips a rate limit. The Crawling API handles this for you; if you roll your own stack, this is the part to get right.
  • Read the status codes. A run that starts returning CAPTCHA challenges or non-200 codes is telling you the current rate or IP tier is no longer enough. Treat that as signal to back off, not noise to ignore.
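
Backing off can be sketched as a small retry wrapper with growing delays. This is one possible approach, not a Crawlbase feature: it assumes a fetch function like the crawl helper in this guide that returns None on failure, and the names fetch_with_backoff, retries, and base_delay are hypothetical.

```python
import time

def fetch_with_backoff(fetch, url, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until it returns HTML, doubling the wait after each failure."""
    for attempt in range(retries + 1):
        html = fetch(url)
        if html:
            return html
        if attempt < retries:
            sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s with the defaults
    return None  # give up; the caller decides whether to stop the run

# Demo with a stand-in fetcher that fails twice, then succeeds; no network involved.
responses = iter([None, None, "<html>ok</html>"])
waits = []
result = fetch_with_backoff(lambda url: next(responses), "https://example.com", sleep=waits.append)
print(result, waits)
```

Passing sleep as a parameter keeps the demo instant; in a real run you would leave the default so the delays actually happen. If every retry fails, stopping the crawl is usually a better response than retrying harder.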

For the broader playbook, see how to scrape websites without getting blocked and our deeper dive on how to crawl JavaScript websites, which covers the rendering problem that trips up most job-board scrapers. If you are building a wider job dataset, the same JSON-extraction pattern works on other boards: see how to scrape Monster jobs with Python and how to scrape Glassdoor.

Is it legal to scrape Indeed?

Whether scraping Indeed is allowed depends on Indeed's terms of service, your jurisdiction, and what you do with the data. Indeed's terms restrict automated access, and the site actively deploys CAPTCHAs and rate limits to discourage it, so scraping can run against those terms regardless of how careful your tooling is. None of the code here changes that; it just makes the technical part work. Read the Indeed Terms of Service and its robots.txt, and treat both as the boundary for what you collect.

A few lines worth holding to. Collect only public job-listing data: the job title, company, location, public salary range, and posting link that anyone can see on a public search page without signing in. Respect Indeed's stated rate expectations and keep your request volume low enough that you are not straining its servers. When personal data is involved, privacy laws such as GDPR and CCPA apply, so scope your work strictly to the public posting itself.

This guide is deliberately limited to public job listings because that is the line that keeps the work defensible. It does not cover applicant data, resumes, candidate profiles, recruiter contact details, or anything behind a login or paid tier, and it does not attempt to bypass authentication. Job seekers' and recruiters' personal information is exactly the kind of data to leave alone. Indeed also runs an official Publisher API and partner program for licensed access to its job data, and if your project needs more than public postings at scale, that program is the correct path, not a cleverer scraper.

Recap

Key takeaways

  • Indeed is JavaScript-rendered and defended. A plain fetch returns an empty shell and a flagged IP gets a CAPTCHA, so you must render behind a trusted IP before you parse.
  • The data is embedded, not in the markup. Every search page hides its listings as JSON in the mosaic-provider-jobcards variable; a short regex plus json.loads is more stable than CSS selectors.
  • Map only the fields you need. Pull title, company, formattedLocation, extractedSalary, and the jobkey link, guarding optional fields so a missing salary never crashes the run.
  • Paginate with start and export both formats. Walk the start parameter in steps of 10 up to a cap from the tier summaries, then write JSON and CSV.
  • Stay on public listings. Respect Indeed's ToS and robots.txt, prefer the official Publisher API for scale, and never touch applicant or recruiter personal data.

Frequently Asked Questions (FAQs)

Is it possible to scrape Indeed?

Technically yes, but Indeed defends against it with CAPTCHA challenges and per-IP rate limits, and its terms restrict automated access. To fetch a public search page you need the page rendered in a real browser behind a trusted IP, which is what the Crawling API's JS token handles before you ever parse. For licensed access at scale, Indeed's official Publisher API and partner program are the sanctioned route.

Do I need the normal token or the JS token for Indeed?

The JS token. Indeed renders its results with JavaScript, so the normal token returns the same empty shell a plain fetch would. The JS token renders the page in a real browser first, which is what makes the embedded mosaic-provider-jobcards JSON present in the HTML you parse.

Why parse embedded JSON instead of CSS selectors?

Indeed ships the full listing data as a JSON blob inside the page, assigned to window.mosaic.providerData["mosaic-provider-jobcards"]. Reading that one object gives you every field already structured, and it survives the hashed, build-generated class names that break selector-based scrapers on every deploy. A single regex extracts it, then json.loads turns it into a Python dictionary.

How do I handle Indeed's pagination?

Indeed offsets results with a start query parameter in steps of 10. Read the total match count from the tier summaries on the first page, cap it at a sensible max_results, then generate page URLs at offsets of 10, 20, 30, and so on, fetching and parsing each. Pace the loop with a short sleep so you are not throttled.

Can I scrape applicant data or full resumes from Indeed?

No, and this guide does not cover it. Applicant data, resumes, and candidate or recruiter profiles are personal information or sit behind a login, not public job-listing data. Scraping login-walled content or bypassing authentication runs against Indeed's terms and triggers privacy laws like GDPR and CCPA. Keep your scope to the public postings on search pages.

How do I avoid getting blocked while scraping Indeed?

Keep your per-IP request rate low, vary your queries instead of looping one search path, and route through rotating residential IPs so no single address trips a rate limit. The Crawling API manages rotation and a trusted IP pool for you; if you build your own stack, that is the part to invest in. Watch the status codes and back off the moment you start seeing CAPTCHA challenges.
