These days, job searching has largely moved to online platforms, making it easier than ever to find job opportunities. However, with this convenience comes the challenge of sifting through a vast amount of information to find the right job listings. This is where web scraping, a powerful technique in the world of data extraction, comes into play.
Web scraping enables you to transform the way you approach job hunting by automating the collection and organization of job postings. Rather than spending hours manually searching through different job boards and websites, you can create custom web scraping scripts to gather, filter, and present job listings tailored to your preferences. This not only saves you valuable time but also ensures you don't miss out on hidden job opportunities that might be buried deep within the web.
In this comprehensive guide, we will explore how to build an Indeed scraper using the Crawlbase Crawling API to streamline your job search on one of the most prominent job listing websites. Whether you're a job seeker looking for that perfect career opportunity or a data enthusiast interested in mastering web scraping techniques, this step-by-step Python guide will equip you with the skills to automate your job search and make it more effective and efficient. Join us as we dive into the world of web scraping and uncover the countless opportunities it offers in simplifying your job search on Indeed.
What is a Job Scraper?
A job scraper is a piece of software or code that gathers job postings from different online sources, like job boards, company sites, or career hubs. These tools pull out important details such as job titles, descriptions, requirements, and how to apply. People often use the data they collect to study job trends, research the job market, or fill up job search websites.
Web scraping plays a crucial role in simplifying and optimizing the job searching process. Here's how:

- Aggregating Job Listings: Web scraping allows you to aggregate job listings from various sources and websites into a single dataset. This means you can access a wide range of job opportunities all in one place, saving you the effort of visiting multiple websites. 
- Automating Data Retrieval: Instead of manually copying and pasting job details, web scraping automates the data retrieval process. With the right scraping script, you can extract job titles, company names, job descriptions, locations, and more without repetitive manual tasks. 
- Customized Searches: Web scraping empowers you to customize your job search. You can set up specific search criteria and filters to extract job listings that match your preferences. This level of customization helps you focus on the most relevant opportunities. 
- Real-Time Updates: By scheduling web scraping scripts to run at regular intervals, you can receive real-time updates on new job listings. This ensures that you're among the first to know about job openings in your desired field.
In the following sections, we'll explore how to leverage web scraping, specifically using the Crawlbase Crawling API, to efficiently scrape job posts from Indeed. This step-by-step guide will equip you with the skills to automate your job search and make it more effective and efficient.
Getting Started with Crawlbase Crawling API
In your journey to harness the power of web scraping for job hunting on Indeed, understanding the Crawlbase Crawling API is paramount. This section dives into the technical aspects of Crawlbase's API and equips you with the knowledge needed to seamlessly integrate it into your Python job scraping project.
Sending Request With Crawling API
Crawlbase's Crawling API is designed for simplicity and ease of integration into your web scraping projects. All API URLs begin with the base part: https://api.crawlbase.com. Making your first API call is as straightforward as executing a command in your terminal:
```bash
curl 'https://api.crawlbase.com/?token=YOUR_CRAWLBASE_TOKEN&url=https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories'
```
Here, you'll notice the token parameter, which serves as your authentication key for accessing Crawlbase's web scraping capabilities. Crawlbase offers two types of tokens: a normal (TCP) token and a JavaScript (JS) token. Use the normal token for static websites whose content doesn't depend on the browser. If a site only renders its content through client-side JavaScript, as Indeed does for its job listings, use the JavaScript token.
API Response Time and Format
When interacting with the Crawlbase Crawling API, it's crucial to understand the response times and how to interpret success or failure. Here's a closer look at these aspects:
Response Times: Typically, the API response time falls within a range of 4 to 10 seconds. To ensure a seamless experience and accommodate any potential delays, it's advisable to set a timeout of at least 90 seconds for your calls. This ensures that your application can handle variations in response times without interruptions.
Response Formats: When making a request to Crawlbase, you can choose between HTML and JSON response formats based on your preferences and parsing requirements. Pass the "format" query parameter with the value "html" or "json" to select the required format.
If you select the HTML response format (which is the default), you'll receive the HTML content of the web page as the response. The response parameters are added to the response headers for easy access. Here's an example response:
```
Headers:
  url: https://github.com/crawlbase?tab=repositories
  original_status: 200
  pc_status: 200

Body:
  HTML of the crawled page
```
If you opt for the JSON response format, you'll receive a structured JSON object that can be easily parsed in your application. This object contains all the information you need, including response parameters. Here's an example response:
```json
{
  "original_status": 200,
  "pc_status": 200,
  "url": "https://github.com/crawlbase?tab=repositories",
  "body": "..."
}
```
Response Headers: Both HTML and JSON responses include essential headers that provide valuable information about the request and its outcome:
- url: The original URL that was sent in the request or the URL of any redirects that Crawlbase followed.
- original_status: The status response received by Crawlbase when crawling the URL sent in the request. It can be any valid HTTP status code.
- pc_status: The Crawlbase (pc) status code, which can be any status code and is the one that ultimately counts. For instance, if a website returns an original_status of 200 with a CAPTCHA challenge, the pc_status may be 503.
- body (JSON only): Available in the JSON format, this parameter contains the content of the web page that Crawlbase retrieved by proxy crawling the URL sent in the request.
These response parameters empower you to assess the outcome of your requests and determine whether your web scraping operation was successful.
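To make this concrete, here is a minimal sketch of requesting a page in JSON format and checking those parameters, using the plain requests library; the token and target URL are placeholders:

```python
import requests

response = requests.get(
    "https://api.crawlbase.com/",
    params={
        "token": "YOUR_CRAWLBASE_TOKEN",
        "url": "https://github.com/crawlbase?tab=repositories",
        "format": "json",
    },
    timeout=90,  # API calls usually take 4-10 seconds; allow generous headroom
)

data = response.json()
# Treat the request as successful only if both status fields report 200.
if str(data.get("original_status")) == "200" and str(data.get("pc_status")) == "200":
    html = data.get("body", "")  # the crawled page content
    print("Crawled:", data.get("url"))
    print("Received", len(html), "characters of HTML")
else:
    print("Request failed:", data.get("original_status"), data.get("pc_status"))
```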
Crawling API Parameters
Crawlbase offers a comprehensive set of parameters that allow developers to customize their web crawling requests. These parameters enable fine-tuning of the crawling process to meet specific requirements. For instance, you can specify response formats like JSON or HTML using the "format" parameter, or control page waiting times with "page_wait" when working with JavaScript-generated content.
Additionally, you can extract cookies and headers, set custom user agents, capture screenshots, and even choose geolocation preferences using parameters such as "get_cookies", "user_agent", "screenshot", and "country". These options provide flexibility and control over the web crawling process. For example, to retrieve cookies set by the original website, you can simply include "&get_cookies=true" in your API request, and Crawlbase will return the cookies in the response headers.
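As a rough illustration, here is how such a request URL could be assembled in Python; the parameter names come from the Crawlbase documentation, while the values are placeholders you would adapt to your own crawl:

```python
from urllib.parse import urlencode

params = {
    "token": "YOUR_JS_TOKEN",
    "url": "https://www.indeed.com/jobs?q=Web+Developer&l=Virginia",
    "page_wait": 5000,       # milliseconds to wait for JavaScript-rendered content
    "get_cookies": "true",   # return the site's cookies in the response headers
    "country": "US",         # geolocate the request to the United States
}

# urlencode percent-encodes the target URL and joins the parameters for us.
api_url = "https://api.crawlbase.com/?" + urlencode(params)
print(api_url)
```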
You can read more about Crawlbase Crawling API parameters here.
Free Trial, Charging Strategy, and Rate Limit
Crawlbase provides a free trial that includes the first 1,000 requests, allowing you to explore its capabilities before committing. To get the most value, use this trial period wisely.
Crawlbase operates on a "pay for what you use" model. Importantly, Crawlbase only charges for successful requests, making it cost-effective and efficient for your web scraping needs. Successful requests are determined by checking the original_status and pc_status in the response parameters.
The API is rate-limited to a maximum of 20 requests per second, per token. If you require a higher rate limit, you can contact support to discuss your specific needs.
Crawlbase Python library
The Crawlbase Python library offers a simple way to interact with the Crawlbase Crawling API. You can use this lightweight and dependency-free Python class as a wrapper for the Crawlbase API. To begin, initialize the Crawling API class with your Crawlbase token. Then, you can make GET requests by providing the URL you want to scrape and any desired options, such as custom user agents or response formats. For example, you can scrape a web page and access its content like this:
```python
from crawlbase import CrawlingAPI
```
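Here is a minimal sketch of that flow, based on the usage shown in the crawlbase library's README; the token, URL, and options are placeholders, and the body-decoding step is a defensive assumption about how the library returns content:

```python
from crawlbase import CrawlingAPI

# Initialize the wrapper with your token (the JavaScript token for JS-rendered sites like Indeed).
api = CrawlingAPI({'token': 'YOUR_JS_TOKEN'})

# Options such as 'user_agent', 'format', or 'page_wait' go in an optional second dict.
response = api.get('https://www.indeed.com/jobs?q=Web+Developer&l=Virginia',
                   {'page_wait': 3000})

if response['status_code'] == 200:
    body = response['body']
    # Depending on the library version, the body may be returned as bytes.
    html = body.decode('utf-8') if isinstance(body, bytes) else body
    print(html[:500])  # preview the first 500 characters of the page HTML
```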
This library simplifies the process of fetching web data and is particularly useful for scenarios where dynamic content, IP rotation, and other advanced features of the Crawlbase API are required.
Scrape Indeed Data Like Job Listings
To effectively scrape job postings from Indeed, it's essential to understand its website structure and how job listings are organized.

Homepage: When you first land on Indeed's homepage, you'll encounter a straightforward search bar where you can input keywords, job titles, or company names. This search functionality is your gateway to finding specific job listings. You can also specify location details to narrow down your search to a particular city, state, or country.
Search Results: Upon entering your search criteria and hitting the "Search" button, Indeed displays a list of job listings that match your query. These listings are typically organized in reverse chronological order, with the most recent postings appearing at the top. Each listing provides essential details such as the job title, company name, location, and a brief job description.
Filters: Indeed offers various filters on the left-hand side of the search results page. These filters allow you to refine your search further. You can filter job listings by job type (e.g., full-time, part-time), salary estimate, location, company, and more. Using these filters can help you find job postings that precisely match your criteria.
Pagination: When there are numerous job listings that match your search, Indeed implements pagination. You'll notice that only a limited number of job postings are displayed on each page. To access more listings, you'll need to click on the page numbers or the "Next" button at the bottom of the search results. Understanding how pagination works is crucial for scraping multiple pages of job listings.
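For example, the second page of the Web Developer search used later in this guide would typically look like this, with the start offset increasing in steps of 10 as you move through the result pages:

```
https://www.indeed.com/jobs?q=Web+Developer&l=Virginia&start=10
```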
Setting Up Your Development Environment
Before you can dive into web scraping Indeed job postings with Python, you need to set up your development environment. This involves installing the necessary tools and libraries and choosing the right Integrated Development Environment (IDE) for your coding tasks.
Installing Python
Python is the primary programming language we'll use for web scraping. If you don't already have Python installed on your system, follow these steps:
- Download Python: Visit the official Python website at python.org and download the latest version of Python. Choose the appropriate installer for your operating system (Windows, macOS, or Linux). 
- Installation: Run the downloaded installer and follow the installation instructions. During installation, make sure to check the option that adds Python to your system's PATH. This step is crucial for running Python from the command line.
- Verify Installation: Open a command prompt or terminal and enter the following command to check if Python is installed correctly: 
```bash
python --version
```
You should see the installed Python version displayed.
Installing Required Libraries
Python offers a rich ecosystem of libraries that simplify web scraping. For this project, you'll need the crawlbase library for making web requests with the Crawlbase API and the Beautiful Soup library for parsing HTML content. To install these libraries, use the following commands:
- Crawlbase: The crawlbase library is a Python wrapper for the Crawlbase API, which will enable us to make web requests efficiently.
```bash
pip install crawlbase
```
- Beautiful Soup: Beautiful Soup is a library for parsing HTML and XML documents. It's especially useful for extracting data from web pages.
```bash
pip install beautifulsoup4
```
With these libraries installed, you'll have the tools you need to fetch web pages using the Crawlbase API and parse their content during the scraping process.
Choosing the Right Development IDE
An Integrated Development Environment (IDE) provides a coding environment with features like code highlighting, auto-completion, and debugging tools. While you can write Python code in a simple text editor, using an IDE can significantly improve your development experience.
Here are a few popular Python IDEs to consider:
- PyCharm: PyCharm is a robust IDE with a free Community Edition. It offers features like code analysis, a visual debugger, and support for web development. 
- Visual Studio Code (VS Code): VS Code is a free, open-source code editor developed by Microsoft. It has a vast extension library, making it versatile for various programming tasks, including web scraping. 
- Jupyter Notebook: Jupyter Notebook is excellent for interactive coding and data exploration. It's commonly used in data science projects.
- Spyder: Spyder is an IDE designed for scientific and data-related tasks. It provides features like a variable explorer and an interactive console.
Choose the IDE that best suits your preferences and workflow. Once you have Python installed, the required libraries set up, and your chosen IDE ready, you're all set to start building your Indeed job scraper in Python.
Building Your Indeed Job Scraper
In this section, we will guide you through the process of creating a powerful Indeed job scraper using Python. This scraper will enable you to gather job listings, handle pagination on job search pages, extract detailed information from job posting pages, and efficiently save this data into an SQLite database.
Scraping Job Listings
To begin scraping job listings from Indeed.com, we need to understand how to make requests to the website and parse the results. If you visit Indeed's homepage and submit a job search query, you'll notice that the website redirects you to a search URL with specific parameters, like this:
```
https://www.indeed.com/jobs?q=Web+Developer&l=Virginia
```
Here, we're searching for Web Developer jobs in Virginia, and the URL includes parameters such as q=Web+Developer for the job query and l=Virginia for the location. To replicate this in your Python code using the Crawlbase library, you can use the following example:
```python
from crawlbase import CrawlingAPI
```
This code snippet demonstrates how to send a GET request to Indeed's job search page. Once you have the HTML content of the job listing page, you can parse it to extract the job listings.
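As a starting point, here is a minimal sketch of that request through the Crawlbase Python library; the JS token, the page_wait value, and the body-decoding step are assumptions you may need to adjust:

```python
from crawlbase import CrawlingAPI

api = CrawlingAPI({'token': 'YOUR_JS_TOKEN'})

def fetch_search_page(query, location):
    # Build the same search URL Indeed uses and fetch it through the Crawling API.
    url = f'https://www.indeed.com/jobs?q={query}&l={location}'
    response = api.get(url, {'page_wait': 3000})  # give the JS-rendered job cards time to load
    if response['status_code'] != 200:
        raise RuntimeError(f"Request failed with status {response['status_code']}")
    body = response['body']
    return body.decode('utf-8') if isinstance(body, bytes) else body

html = fetch_search_page('Web+Developer', 'Virginia')
print(len(html), 'characters of HTML received')
```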
We could parse the HTML document using CSS or XPath selectors, but there's an easier way: all of the job listing data is hidden away deep in the HTML as a JSON document. We can use regular expressions to extract this JSON data efficiently. Let's update the previous example to handle scraping of the job listings.
```python
import re
```
The parse_search_page_html function extracts job listing data from the HTML source code of an Indeed job search page. It uses a regular expression to locate a specific JavaScript variable, mosaic-provider-jobcards, which contains structured job listing information in JSON format. It then parses this JSON data, extracting two main components: "results", which contains the job listings, and "meta", which contains metadata about the job listings, such as the number of results in various categories. The function returns this structured data as a Python dictionary for further processing.
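The sketch below shows one way such a parser could look. The regular expression and the JSON key paths reflect Indeed's markup at the time of writing and are assumptions to verify against the live page source, since Indeed changes its internals regularly:

```python
import json
import re

def parse_search_page_html(html):
    # The job cards are embedded as JSON in a script assignment to
    # window.mosaic.providerData["mosaic-provider-jobcards"] (subject to change).
    match = re.search(
        r'window\.mosaic\.providerData\["mosaic-provider-jobcards"\]\s*=\s*(\{.+?\});',
        html,
        re.DOTALL,
    )
    if not match:
        return {"results": [], "meta": []}

    data = json.loads(match.group(1))
    # Key path inside the embedded JSON; confirm it against the current page source.
    model = data["metaData"]["mosaicProviderJobCardsModel"]
    return {
        "results": model["results"],      # the individual job listings
        "meta": model["tierSummaries"],   # counts of results per category
    }
```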
Example Output:
```json
{
  "results": [ ... ],
  "meta": [ ... ]
}
```
Handling Pagination
Indeed's job search results are typically paginated. To handle pagination and collect multiple pages of job listings, you can modify the URL parameters and send additional requests. To scrape multiple pages, you can adjust the URL's start parameter or extract pagination information from the HTML.
```python
import json
```
The scrape_indeed_search function starts by making an initial request to the Indeed search page using the provided query and location. It then checks the response status code to ensure that the request was successful (status code 200). If successful, it proceeds to parse the job listing data from the HTML of the first page.
To handle pagination, the code calculates the total number of job listings available for the given query and location. It also determines how many pages need to be scraped to reach the maximum result limit set by the user. To collect the URLs of the remaining pages, it generates a list of page URLs, each with an incremental offset to fetch the next set of results.
It then initiates a Crawling API request for each of the generated page URLs. As each page is fetched, its job listings are extracted and added to the results list. This approach ensures that the script can handle pagination seamlessly, scraping all relevant job listings while efficiently managing the retrieval of multiple pages.
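Putting those pieces together, here is a condensed sketch of scrape_indeed_search that reuses the parse_search_page_html helper from the previous sketch; the jobCount field and the step-of-10 start offset are assumptions about how Indeed structures its search results:

```python
from urllib.parse import urlencode

from crawlbase import CrawlingAPI

api = CrawlingAPI({'token': 'YOUR_JS_TOKEN'})

def fetch_html(url):
    # Fetch a page through the Crawling API and return its HTML as a string.
    response = api.get(url, {'page_wait': 3000})
    if response['status_code'] != 200:
        raise RuntimeError(f"Request failed: {response['status_code']}")
    body = response['body']
    return body.decode('utf-8') if isinstance(body, bytes) else body

def scrape_indeed_search(query, location, max_results=50):
    base = 'https://www.indeed.com/jobs?' + urlencode({'q': query, 'l': location})

    # First page: parse the embedded job-card JSON (parse_search_page_html from the earlier sketch).
    data = parse_search_page_html(fetch_html(base))
    results = data['results']

    # Estimate how many listings exist from the metadata; 'jobCount' is an assumed field name.
    total = sum(category.get('jobCount', 0) for category in data['meta'])
    total = min(total, max_results)

    # Indeed paginates in steps of 10 via the 'start' query parameter.
    for offset in range(10, total, 10):
        page = parse_search_page_html(fetch_html(f'{base}&start={offset}'))
        results.extend(page['results'])

    return results[:max_results]

jobs = scrape_indeed_search('Web Developer', 'Virginia', max_results=30)
print(f'Collected {len(jobs)} job listings')
```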
Extracting Data from Job Posting Page
Once you have the job listings, you may want to extract more details by scraping the full job posting pages. The search results include nearly all job listing information, except for certain specifics like the full job description. To extract this missing information, we need the job ID, conveniently located in the jobkey field of our search results:
```json
{
  "jobkey": "..."
}
```
Leveraging this jobkey, we can send a request for the complete job details page. Much like our initial search, we can parse the embedded data instead of the HTML structure. This data is tucked away within the _initialData variable, and we can retrieve it using a straightforward regular expression pattern. Here's how you can do it:
```python
import json
```
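Below is a compact sketch of that step; the /viewjob?jk= URL format, the _initialData pattern, and the key path into the parsed JSON are assumptions to confirm against the live page:

```python
import json
import re

from crawlbase import CrawlingAPI

api = CrawlingAPI({'token': 'YOUR_JS_TOKEN'})

def parse_job_page_html(html):
    # The full job details live in a script assignment to the _initialData variable.
    match = re.search(r'_initialData\s*=\s*(\{.+?\});', html, re.DOTALL)
    if not match:
        return None
    data = json.loads(match.group(1))
    # The exact key path changes over time; inspect the page source to confirm it.
    return data.get('jobInfoWrapperModel', {}).get('jobInfoModel')

def scrape_job(job_key):
    # Indeed serves individual postings at /viewjob?jk=<jobkey>.
    url = f'https://www.indeed.com/viewjob?jk={job_key}'
    response = api.get(url, {'page_wait': 3000})
    body = response['body']
    html = body.decode('utf-8') if isinstance(body, bytes) else body
    return parse_job_page_html(html)

details = scrape_job('SOME_JOB_KEY')  # placeholder jobkey from your search results
print(details)
```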
Saving Data into an SQLite Database
To store the extracted job data, you can use an SQLite database. Here's example code showing how to create a database, create a table for job postings, and insert data into it.
```python
import json
```
This code starts by initializing the database structure, creating a table named "jobs" to store information such as job titles, company names, locations, and job descriptions. The initialize_database function initializes the SQLite database and returns both the connection and cursor. The save_to_database function is responsible for inserting job details into this table.
The actual web scraping process happens in the scrape_and_save function, which takes a job key (a unique identifier for each job posting) and an SQLite cursor as input. This function constructs the URL for a specific job posting, sends an HTTP request to Indeed's website, retrieves the HTML content of the job page, and then parses it using the parse_job_page_html function. This parsed data, including job title, company name, location, and job description, is then saved into the SQLite database using the save_to_database function.
The main function orchestrates the entire process. It initializes the database connection and Crawling API instance, defines a list of job keys to scrape, and runs the scraping and saving tasks for each job key. Once all the job details have been scraped and stored, the database connection is closed.
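Here is a condensed sketch of that structure; the table schema and the field names read from the parsed job details (jobTitle, companyName, and so on) are illustrative assumptions, and scrape_job refers to the helper sketched in the previous section:

```python
import sqlite3

def initialize_database(path='indeed_jobs.db'):
    # Create (or open) the SQLite database and make sure the jobs table exists.
    connection = sqlite3.connect(path)
    cursor = connection.cursor()
    cursor.execute(
        '''CREATE TABLE IF NOT EXISTS jobs (
               job_key     TEXT PRIMARY KEY,
               title       TEXT,
               company     TEXT,
               location    TEXT,
               description TEXT
           )'''
    )
    connection.commit()
    return connection, cursor

def save_to_database(cursor, job_key, title, company, location, description):
    # Insert or update a single job posting.
    cursor.execute(
        'INSERT OR REPLACE INTO jobs VALUES (?, ?, ?, ?, ?)',
        (job_key, title, company, location, description),
    )

def main():
    connection, cursor = initialize_database()
    job_keys = ['SOME_JOB_KEY_1', 'SOME_JOB_KEY_2']  # placeholders for real jobkey values

    for job_key in job_keys:
        details = scrape_job(job_key)  # from the job-page sketch above
        if details:
            # Field names are illustrative; match them to whatever parse_job_page_html returns.
            save_to_database(
                cursor,
                job_key,
                details.get('jobTitle', ''),
                details.get('companyName', ''),
                details.get('formattedLocation', ''),
                details.get('description', ''),
            )

    connection.commit()
    connection.close()

if __name__ == '__main__':
    main()
```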
By following these detailed steps, you can build a comprehensive Indeed job scraper in Python, scrape job listings, handle pagination, extract data from job posting pages, and save the data into an SQLite database for further analysis or use.
Optimize your Indeed Scraper with Python and Crawlbase
Online platforms stand at the forefront for job hunters, offering many opportunities at their fingertips. However, this ease comes with the daunting task of sifting through an ocean of information. Web scraping is a game-changer for data collection that reshapes our job-seeking strategies.
By employing web scraping, we can revolutionize how we hunt for jobs. It automates the tedious process of gathering and sorting job listings from various portals. You no longer need to spend countless hours manually searching different job boards. With tailored web scraping scripts, you can easily gather, categorize, and display job openings that align with your preferences. This saves time and ensures that no potential job offer, no matter how obscure, slips through the cracks.
Our comprehensive guide highlights web scraping capabilities through the Crawlbase Crawling API, focusing on its application to the renowned job listing site Indeed. Whether you're looking for an ideal career match or you're a tech enthusiast keen on mastering scraping techniques, this Python guide provides the tools to automate and refine your job search. Journey with us as we showcase how web scraping can simplify and optimize your quest for the perfect job on Indeed.
Frequently Asked Questions
Is it possible to scrape Indeed?
You can scrape job postings from Indeed, but it goes against their rules. Indeed tries to stop scraping and uses measures like CAPTCHAs and rate limits to prevent automated access. If you break these rules, you might face legal trouble or get your IP address blocked. Instead of scraping, Indeed offers APIs and other ways for approved partners to get its data, which is a more above-board way to access what it has.
How to scrape leads from Indeed?
If you choose to gather job postings or potential leads from Indeed (even though it's risky), here are the basic steps you'd take:
- Pick your target URLs: Figure out which job listings or search pages on Indeed you want to collect data from.
- Look at how the site is built: Use your browser's developer tools to find the HTML tags that hold job titles, descriptions, company names, and locations.
- Create a program to collect data: Use a coding language like Python, with tools such as BeautifulSoup or Scrapy, to pull the info out of these HTML tags (see the short sketch after this list).
- Deal with CAPTCHAs and limits: Come up with ways to get past CAPTCHAs and slow down your requests so the site doesn't block you.
- Keep the information: Save what you've collected to a database or CSV file so you can work with it later.
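For step 3, a bare-bones sketch with requests and BeautifulSoup might look like the following; the CSS selectors are placeholders to confirm in your browser's developer tools, and a direct request like this is likely to run into Indeed's anti-bot defenses, which is exactly where a crawling service becomes useful:

```python
import requests
from bs4 import BeautifulSoup

html = requests.get(
    'https://www.indeed.com/jobs?q=Web+Developer&l=Virginia',
    headers={'User-Agent': 'Mozilla/5.0'},
    timeout=30,
).text

soup = BeautifulSoup(html, 'html.parser')
# Placeholder selectors for a job card, its title, and its company name.
for card in soup.select('div.job_seen_beacon'):
    title = card.select_one('h2.jobTitle')
    company = card.select_one('span.companyName')
    if title:
        print(title.get_text(strip=True), '-',
              company.get_text(strip=True) if company else 'N/A')
```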
What is the best job scraper?
The best job scraper depends on your specific needs, such as the platform you are targeting and the scale of the data collection. For a comprehensive and reliable solution, Crawlbase stands out as one of the top choices for job scraping.
Crawlbase Crawling API offers innovative features like:
- Versatile Parameter Options: Crawlbase provides a rich set of parameters, allowing developers to tailor their API requests precisely. Parameters such as "format", "user_agent", "page_wait", and more enable customization to suit specific crawling needs.
- Response Format Control: Developers can choose between JSON and HTML response formats based on their preferences and data processing requirements. This flexibility simplifies data extraction and manipulation.
- Cookies and Headers Handling: With the ability to retrieve cookies and headers from the original website using parameters like "get_cookies" and "get_headers", developers can access valuable information that may be crucial for certain web scraping tasks.
- Dynamic Content Handling: Crawlbase excels in handling dynamic content, making it suitable for crawling JavaScript-rendered pages. Parameters like "page_wait" and "ajax_wait" enable developers to ensure that the API captures the fully rendered content, even when it takes time to load or includes AJAX requests.
- IP Rotation: Crawlbase offers the capability to rotate IP addresses, providing anonymity and reducing the risk of being blocked by websites. This feature ensures a higher success rate for web crawling tasks.
- Geolocation Options: Developers can specify a country for geolocated requests using the "country" parameter. This is particularly useful for scenarios where data from specific geographic regions is required.
- Tor Network Support: For crawling onion websites over the Tor network, the "tor_network" parameter can be enabled, enhancing privacy and access to content on the dark web.
- Screenshot Capture: The API can capture screenshots of web pages with the "screenshot" parameter, providing visual context for crawled data.
- Data Scraping with Scrapers: Crawlbase offers the option to utilize predefined data scrapers, streamlining the extraction of specific information from web pages. This simplifies data retrieval for common use cases.
- Asynchronous Crawling: In cases where asynchronous crawling is needed, the API supports the "async" parameter. Developers receive a request identifier (RID) for retrieving crawled data from the cloud storage.
- Autoparsing: The "autoparse" parameter simplifies data extraction by returning parsed information in JSON format, reducing the need for extensive post-processing of HTML content.