How to extract Foursquare Data in Easy Steps

2024-10-24T14:00:00.000Z

Scraping data from Foursquare is super helpful for developers and businesses looking to get data on venues, user reviews, or location-based insights. Foursquare is one of the most popular location-based services, with over 50 million active monthly users and over 95 million locations worldwide. By scraping data from Foursquare, you can get valuable information for market research, business development, or building location-based applications.

This article will show you how to easily scrape Foursquare data using the Crawlbase Crawling API, which is well suited to sites like Foursquare that rely heavily on JavaScript rendering. Whether you want to get search listings or detailed venue information, we will show you how to do it step by step.

Let’s dive into the process, from setting up your Python environment to extracting and storing Foursquare data in a structured way.

Why Extract Data from Foursquare?
Key Data Points to Extract from Foursquare
Crawlbase Crawling API for Foursquare Scraping

Crawlbase Python Library

Setting Up Your Python Environment

Installing Python and Required Libraries
Choosing an IDE

Scraping Foursquare Search Listings

Inspecting the HTML for Selectors
Writing the Foursquare Search Listings Scraper
Handling Pagination
Storing Data in a JSON File
Complete Code Example

Scraping Foursquare Venue Details

Inspecting the HTML for Selectors
Writing the Foursquare Venue Details Scraper
Storing Data in a JSON File
Complete Code Example

Final Thoughts
Frequently Asked Questions

Why Extract Data from Foursquare?

Foursquare is a massive platform with location data on millions of places like restaurants, cafes, parks and more. Whether you’re building a location-based app, doing market research or analyzing venue reviews, Foursquare has got you covered. By extracting this data you can get insights to inform your decisions and planning.

Businesses can use Foursquare data to learn more about customer preferences, popular venues and regional trends. Developers can use this data to build custom apps like travel guides or recommendation engines. Foursquare has venue names, addresses, ratings and reviews, so extracting data from this platform can be a total game changer for your project.

Key Data Points to Extract from Foursquare

When scraping Foursquare, you need to know what data you can collect. Here’s what you can get from Foursquare:

Venue Name: The business or location name, e.g., restaurant, cafe, park.
Address: Full address of the venue, including street, city, and postal code.
Category: Type of venue (e.g., restaurant, bar, museum) to categorize the data.
Ratings and Reviews: User-generated ratings and reviews that indicate customer satisfaction.
Operating Hours: Business hours of the venue, useful for time-sensitive applications.
Contact Information: Phone numbers, email addresses, or websites to contact the venue.
Photos: User-uploaded photos that give an idea of the venue.
Geo-coordinates: Latitude and longitude of the venue for mapping and location-based apps.
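
If you plan to collect several of these data points, it can help to decide on a record shape before writing any scraper. The sketch below is only an illustration: the class name `Venue` and its field names are our own choices, not something defined by Foursquare or Crawlbase, and optional fields default to empty values because not every page exposes every data point.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional

@dataclass
class Venue:
    """Illustrative record for one scraped venue (field names are our own)."""
    name: str
    address: str = ''
    category: str = ''
    rating: Optional[float] = None
    ratings_count: Optional[int] = None
    hours: str = ''
    phone: str = ''
    website: str = ''
    photos: List[str] = field(default_factory=list)
    latitude: Optional[float] = None
    longitude: Optional[float] = None

# Build a record from whatever fields were found, then convert it to a plain dict
venue = Venue(name='Thai Diner', address='186 Mott St (at Kenmare), New York', category='Thai')
record = asdict(venue)  # ready to pass to json.dump
```

Keeping missing values as `None` or empty strings makes it easy to tell "not found on the page" apart from a real value later on.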

Crawlbase Crawling API for Foursquare Scraping

Foursquare uses JavaScript to load its content dynamically, which makes it hard to scrape using traditional methods. This is where Crawlbase Crawling API comes in. It’s designed to handle websites with heavy JavaScript rendering by mimicking real user interactions and rendering the full page.

Here’s why using the Crawlbase Crawling API for scraping Foursquare is a great choice:

JavaScript Rendering: It takes care of loading all the dynamic content on Foursquare pages, so you get complete data without missing important information.
IP Rotation and Proxies: Crawlbase automatically rotates IP addresses and uses smart proxies to avoid getting blocked by the site.
Easy Integration: The Crawlbase API is simple to integrate with Python and offers flexible options to control scraping, such as scroll intervals, wait times, and pagination handling.

Crawlbase Python Library

Crawlbase also provides a Python library that makes it easy to use Crawlbase products in your projects. You’ll need an access token, which you can get by signing up with Crawlbase.

Here’s an example of sending a request to the Crawlbase Crawling API:

from crawlbase import CrawlingAPI

# Initialize Crawlbase API with your access token
crawling_api = CrawlingAPI({'token': 'YOUR_CRAWLBASE_TOKEN'})

def make_crawlbase_request(url):
    response = crawling_api.get(url)

    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        return html_content
    else:
        print(f"Failed to fetch the page. Crawlbase status code: {response['headers']['pc_status']}")
        return None

Note: Crawlbase provides two types of tokens: a Normal Token for static sites and a JavaScript (JS) Token for dynamic or browser-rendered content, which is necessary for scraping Foursquare. Crawlbase also offers 1,000 free requests to help you get started, and you can sign up without a credit card. For more details, check the Crawlbase Crawling API documentation.
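
Since the token is a credential, you may prefer not to hard-code it in the script. Here is a minimal sketch that reads it from an environment variable instead. The variable name `CRAWLBASE_JS_TOKEN` and the helper `get_crawlbase_token` are our own assumptions, not part of the Crawlbase library.

```python
import os

def get_crawlbase_token(env_var='CRAWLBASE_JS_TOKEN'):
    """Read the token from an environment variable (the variable name is our choice)."""
    token = os.environ.get(env_var, '').strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your Crawlbase JS token.")
    return token

# Usage (requires the crawlbase package and a real token):
# from crawlbase import CrawlingAPI
# crawling_api = CrawlingAPI({'token': get_crawlbase_token()})
```

Failing early with a clear message is usually easier to debug than a rejected request later in the run.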

In the next section, we’ll go through how to set up your Python environment to start scraping.

Setting Up Your Python Environment

Before we can start scraping Foursquare data, we need to set up the right Python environment. This includes installing Python and the necessary libraries, and choosing an IDE (Integrated Development Environment) to write and run our code.

Installing Python and Required Libraries

First, make sure that you have Python installed on your computer. You can download the latest version of Python from python.org. Once installed, you can check if Python is working by running this command in your terminal or command prompt:

python --version

Next, you’ll need to install the required libraries. For this tutorial, we’ll use the Crawlbase library to fetch pages and BeautifulSoup to parse the HTML. You can install these libraries by running the following command:

pip install crawlbase beautifulsoup4

These libraries will help you interact with the Crawlbase Crawling API, extract useful information from the HTML, and organize the data.

Choosing an IDE

To write and run your Python scripts, you need an IDE. Here are some options:

VS Code: A lightweight code editor with great Python support.
PyCharm: A more advanced Python IDE with lots of features.
Jupyter Notebook: Great for interactive coding and data analysis.

Choose whichever you like. Once you have your environment set up, you can start writing the code for Foursquare search listings.

Scraping Foursquare Search Listings

In this section, we will scrape search listings from Foursquare. Foursquare search listings have various details about places, such as name, address, category, and more. We’ll break this down into the following steps:

Inspecting the HTML for Selectors

Before writing the scraper, we need to inspect the Foursquare search page to identify the HTML structure and CSS selectors that contain the data we want to extract. Here’s how you can do that:

Open the Search Listings Page: Go to a Foursquare search results page (e.g. search for “restaurants” in a specific location).
Inspect the Page: Right-click on the page and select “Inspect” or press Ctrl + Shift + I to open Developer Tools.

Find the Relevant Elements:

Place Name: The place name is inside a <div> tag with the class .venueName, and the actual name is inside an <a> tag within this div.
Address: The address is inside a <div> tag with the class .venueAddress.
Category: The category of the place can be extracted from a <span> tag with the class .categoryName.
Link: The link to the place details is in the href attribute of the same <a> tag within the .venueName div.

Inspect the page for any additional data you want to extract, such as the rating or number of reviews.
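
Before running a full scrape, it can be worth checking the selectors against a small piece of HTML. The snippet below is a rough sketch: the markup in `sample_html` is a simplified, hypothetical stand-in for what you see in Developer Tools, so swap in HTML copied from the real page when you try it.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup that mirrors the selectors listed above
sample_html = """
<li class="singleRecommendation">
  <div class="venueName"><a href="/v/thai-diner/5e46e2ec5791a10008c55728">Thai Diner</a></div>
  <div class="venueAddress">186 Mott St (at Kenmare), New York</div>
  <span class="categoryName">Thai</span>
</li>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

selectors = {
    'name': 'div.venueName a',
    'address': 'div.venueAddress',
    'category': 'span.categoryName',
    'link': 'div.venueName a[href]',
}

# Count how many elements each selector matches; zero means the selector needs another look
matches = {label: len(soup.select(css)) for label, css in selectors.items()}
for label, count in matches.items():
    print(f"{label}: {count} match(es)")
```

A selector that matches nothing here is a sign that the class name has changed or that the element is nested differently than expected.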

Writing the Foursquare Search Listings Scraper

Now that we have our CSS selectors for the data points, we can start writing the scraper. We will use the Crawlbase Crawling API to handle JavaScript rendering and AJAX requests, utilizing its ajax_wait and page_wait parameters.

Here’s the code:

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import json

# Initialize Crawlbase API with your access token
crawling_api = CrawlingAPI({'token': 'YOUR_CRAWLBASE_TOKEN'})

def make_crawlbase_request(url):
    options = {
        'ajax_wait': 'true',
        'page_wait': '5000'
    }

    response = crawling_api.get(url, options)

    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        return html_content
    else:
        print(f"Failed to fetch the page. Crawlbase status code: {response['headers']['pc_status']}")
        return None

def scrape_foursquare_listings(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    data = []

    # Select all place listings
    listings = soup.select('ul.recommendationList > li.singleRecommendation')

    for listing in listings:
        name = listing.select_one('div.venueName a').text.strip() if listing.select_one('div.venueName a') else ''
        address = listing.select_one('div.venueAddress').text.strip() if listing.select_one('div.venueAddress') else ''
        category = listing.select_one('span.categoryName').text.strip() if listing.select_one('span.categoryName') else ''
        link = listing.select_one('div.venueName a')['href'] if listing.select_one('div.venueName a') else ''

        # Add extracted data to the list
        data.append({
            'name': name,
            'address': address,
            'category': category,
            'link': f"https://foursquare.com{link}" if link else ''  # Full URL, or empty if no link was found
        })

    return data

In the above code, we make a request to the Foursquare search page using the Crawlbase Crawling API. We use BeautifulSoup to parse the HTML and extract the data points using the CSS selectors. Then we store the data in a list of dictionaries.
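
The scraper calls select_one twice for every field, once to check that the element exists and once to read it. One possible way to tidy this up is a pair of small helpers. The names `text_or_default` and `attr_or_default` are our own and not part of BeautifulSoup; the HTML is a made-up fragment for illustration.

```python
from bs4 import BeautifulSoup

def text_or_default(parent, selector, default=''):
    """Return the stripped text of the first match, or a default if nothing matches."""
    element = parent.select_one(selector)
    return element.text.strip() if element else default

def attr_or_default(parent, selector, attr, default=''):
    """Return an attribute of the first match, or a default if it is missing."""
    element = parent.select_one(selector)
    return element.get(attr, default) if element else default

# Made-up listing fragment with a name and link but no address
html = '<li><div class="venueName"><a href="/v/raku/5aea422a033693002bf0c1cb"> Raku </a></div></li>'
listing = BeautifulSoup(html, 'html.parser')

name = text_or_default(listing, 'div.venueName a')
link = attr_or_default(listing, 'div.venueName a', 'href')
address = text_or_default(listing, 'div.venueAddress')  # missing, so the default is used
```

The behavior matches the inline conditionals in the scraper, so the helpers can replace them one field at a time.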

Handling Pagination

Foursquare search listings use button-based pagination. To handle pagination, we will use the css_click_selector parameter provided by the Crawlbase Crawling API. This allows us to simulate a button click to load the next set of results.

We will set the css_click_selector to the button class or ID responsible for pagination (usually a “See more results” button).

def make_crawlbase_request_with_pagination(url):
    options = {
        'ajax_wait': 'true',
        'page_wait': '5000',
        'css_click_selector': 'li.moreResults > button'
    }

    response = crawling_api.get(url, options)

    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        return html_content
    else:
        print(f"Failed to fetch the page. Crawlbase status code: {response['headers']['pc_status']}")
        return None

Storing Data in a JSON File

Once we have scraped the data, we can store it in a JSON file for later use. JSON is a popular format for storing and exchanging data.

def save_data_to_json(data, filename='foursquare_data.json'):
    with open(filename, 'w') as f:
        json.dump(data, f, indent=4)
    print(f"Data saved to {filename}")
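
Venue names often contain accented or non-Latin characters. As an optional variation, the sketch below writes the file as UTF-8 and keeps those characters readable instead of escaping them. The function names `save_data_to_json_utf8` and `load_data_from_json` are our own, and the sample record in the comment is invented.

```python
import json

def save_data_to_json_utf8(data, filename='foursquare_data.json'):
    """Write data as UTF-8 JSON, keeping non-ASCII characters as-is."""
    with open(filename, 'w', encoding='utf-8') as f:
        # ensure_ascii=False stores "Café" instead of "Caf\u00e9"
        json.dump(data, f, indent=4, ensure_ascii=False)
    print(f"Data saved to {filename}")

def load_data_from_json(filename='foursquare_data.json'):
    """Read the saved data back, for example to check the result or resume later."""
    with open(filename, 'r', encoding='utf-8') as f:
        return json.load(f)

# Example with an invented record:
# save_data_to_json_utf8([{'name': 'Café Example', 'category': 'Café'}])
# print(load_data_from_json())
```

Setting the encoding explicitly avoids surprises on systems where the default file encoding is not UTF-8.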

Complete Code Example

Below is the complete code that combines all the steps:

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import json

# Initialize Crawlbase API with your access token
crawling_api = CrawlingAPI({'token': 'YOUR_CRAWLBASE_TOKEN'})

def make_crawlbase_request_with_pagination(url):
    options = {
        'ajax_wait': 'true',
        'page_wait': '5000',
        'css_click_selector': 'li.moreResults > button'
    }

    response = crawling_api.get(url, options)

    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        return html_content
    else:
        print(f"Failed to fetch the page. Crawlbase status code: {response['headers']['pc_status']}")
        return None

def scrape_foursquare_listings(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    data = []

    # Select all place listings
    listings = soup.select('ul.recommendationList > li.singleRecommendation')

    for listing in listings:
        name = listing.select_one('div.venueName a').text.strip() if listing.select_one('div.venueName a') else ''
        address = listing.select_one('div.venueAddress').text.strip() if listing.select_one('div.venueAddress') else ''
        category = listing.select_one('span.categoryName').text.strip() if listing.select_one('span.categoryName') else ''
        link = listing.select_one('div.venueName a')['href'] if listing.select_one('div.venueName a') else ''

        # Add extracted data to the list
        data.append({
            'name': name,
            'address': address,
            'category': category,
            'link': f"https://foursquare.com{link}" if link else ''  # Full URL, or empty if no link was found
        })

    return data

def save_data_to_json(data, filename='foursquare_data.json'):
    with open(filename, 'w') as f:
        json.dump(data, f, indent=4)
    print(f"Data saved to {filename}")

if __name__ == "__main__":
    url = "https://foursquare.com/explore?near=New%20York&q=Food"
    html_content = make_crawlbase_request_with_pagination(url)

    if html_content:
        data = scrape_foursquare_listings(html_content)  # Extract data from HTML content
        save_data_to_json(data)

Example Output:

[
    {
        "name": "Thai Diner",
        "address": "186 Mott St (at Kenmare), New York",
        "category": "Thai",
        "link": "https://foursquare.com/v/thai-diner/5e46e2ec5791a10008c55728"
    },
    {
        "name": "Fish Cheeks",
        "address": "55 Bond St (btwn Lafayette & Bowery St), New York",
        "category": "Thai",
        "link": "https://foursquare.com/v/fish-cheeks/57c169e3498e784947e307aa"
    },
    {
        "name": "Raku",
        "address": "48 Macdougal St (King and Macdougal), New York",
        "category": "Udon",
        "link": "https://foursquare.com/v/raku/5aea422a033693002bf0c1cb"
    },
    {
        "name": "Los Tacos No. 1",
        "address": "229 W 43rd St (btwn 7th & 8th Ave), New York",
        "category": "Tacos",
        "link": "https://foursquare.com/v/los-tacos-no-1/59580ce6db1d8148fee3d383"
    },
    {
        "name": "Mah-Ze-Dahr Bakery",
        "address": "28 Greenwich Ave (Charles Street), New York",
        "category": "Bakery",
        "link": "https://foursquare.com/v/mahzedahr-bakery/568c0ce238fafac5f5ffe631"
    },
    .... more
]
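
The search URL in the script is written out by hand, including the %20 for the space in "New York". If you want to run the same scraper for other cities or search terms, a small helper can build the URL for you. This is a sketch that assumes the near and q query parameters seen in the URL above; the function name `build_explore_url` is our own.

```python
from urllib.parse import urlencode, quote

def build_explore_url(near, query, base='https://foursquare.com/explore'):
    """Build a search URL using the same 'near' and 'q' parameters as the example."""
    params = {'near': near, 'q': query}
    # quote_via=quote encodes spaces as %20 rather than +
    return f"{base}?{urlencode(params, quote_via=quote)}"

print(build_explore_url('New York', 'Food'))
# https://foursquare.com/explore?near=New%20York&q=Food
```

You can then loop over a list of cities and pass each generated URL to make_crawlbase_request_with_pagination.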

Scraping Foursquare Venue Details

In this section, we will learn how to scrape individual Foursquare venue details. After scraping the listings, we can then dig deeper and collect specific data from each venue’s page.

Inspecting the HTML for Selectors

Before we write our scraper, we first need to inspect the Foursquare venue details page to figure out which HTML elements contain the data we want. Here’s what you should do:

Visit a Venue Page: Open a Foursquare venue page in the browser.
Use Developer Tools: Right-click on the page and select “Inspect” (or press Ctrl + Shift + I) to open Developer Tools.

Identify CSS Selectors: Look for the HTML elements that contain the information you want. Here are some common details and their possible selectors:

Venue Name: Found in an <h1> tag with the class .venueName.
Address: Found in a <div> tag with the class .venueAddress.
Phone Number: Found in a <span> tag with the attribute itemprop="telephone".
Rating: Found in a <span> tag with the attribute itemprop="ratingValue".
Reviews Count: Found in a <div> tag with the class .numRatings.

Writing the Foursquare Venue Details Scraper

Now that we have the CSS selectors for the venue details, let’s write the scraper. Just like we did in the previous section, we’ll use Crawlbase to handle the JavaScript rendering and AJAX requests.

Here’s a sample Python script to scrape venue details using Crawlbase and BeautifulSoup:

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import json

# Initialize Crawlbase API with your token
crawling_api = CrawlingAPI({'token': 'YOUR_CRAWLBASE_TOKEN'})

def fetch_venue_details(url):
    # Crawlbase request options
    options = {
        'ajax_wait': 'true',  # Wait for JavaScript to load
        'page_wait': '5000'   # Wait 5 seconds for full page render
    }

    response = crawling_api.get(url, options)

    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        return html_content
    else:
        print(f"Error fetching venue details. Status: {response['headers']['pc_status']}")
        return None

def scrape_venue_details(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')

    # Extract venue details
    name = soup.select_one('h1.venueName').text.strip() if soup.select_one('h1.venueName') else ''
    address = soup.select_one('div.venueAddress').text.strip() if soup.select_one('div.venueAddress') else ''
    phone = soup.select_one('span[itemprop="telephone"]').text.strip() if soup.select_one('span[itemprop="telephone"]') else ''
    rating = soup.select_one('span[itemprop="ratingValue"]').text.strip() if soup.select_one('span[itemprop="ratingValue"]') else ''
    ratings_count = soup.select_one('div.numRatings').text.strip() if soup.select_one('div.numRatings') else ''

    # Return the details in a dictionary
    return {
        'name': name,
        'address': address,
        'phone': phone,
        'rating': rating,
        'ratings_count': ratings_count
    }

We use Crawlbase’s get() method to fetch the venue page. The ajax_wait and page_wait options make sure the page loads completely before we start scraping. We use BeautifulSoup to parse the HTML and find the venue’s name, address, phone number, rating, and ratings count. If Crawlbase can’t fetch the page, the function prints an error message and returns None.
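
Note that the scraper returns every field as a string. If you plan to sort or filter venues by rating, a small conversion step can help. The helpers below are a sketch of one way to do it. The names `to_float` and `to_int` are our own, and stripping thousands separators is an assumption about how larger counts might be displayed.

```python
def to_float(value, default=None):
    """Convert a scraped string such as '9.5' to a float, or return default."""
    try:
        return float(value.replace(',', '').strip())
    except (AttributeError, ValueError):
        return default

def to_int(value, default=None):
    """Convert a scraped string such as '298' to an int, or return default."""
    try:
        return int(value.replace(',', '').strip())
    except (AttributeError, ValueError):
        return default

# Convert the string fields of a scraped record in place
details = {'rating': '9.5', 'ratings_count': '298'}
details['rating'] = to_float(details['rating'])
details['ratings_count'] = to_int(details['ratings_count'])
```

Empty strings, which the scraper uses for missing fields, come back as `None` rather than raising an error.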

Storing Data in a JSON File

After you’ve scraped the venue details, you need to store the data for later use. We’ll save the extracted data to a JSON file.

Here’s the function to save the data:

def save_venue_data(data, filename='foursquare_venue_details.json'):
    with open(filename, 'w') as f:
        json.dump(data, f, indent=4)
    print(f"Data successfully saved to {filename}")

You can now call this function after scraping the venue details, passing the data as an argument.
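
If you want details for several venues, for example the links collected from the search listings, you can loop over the URLs and save all results in one file. The function below is a rough sketch: `scrape_venues` is our own name, and it takes the fetch and parse functions as arguments so you can pass in fetch_venue_details and scrape_venue_details from this section. The one-second pause between requests is an arbitrary choice.

```python
import time

def scrape_venues(urls, fetch, parse, delay_seconds=1.0):
    """Fetch and parse each venue URL, skipping pages that fail to load."""
    results = []
    for index, url in enumerate(urls):
        html = fetch(url)
        if html:
            details = parse(html)
            details['url'] = url  # keep track of where each record came from
            results.append(details)
        # Pause between requests, but not after the last one
        if index < len(urls) - 1:
            time.sleep(delay_seconds)
    return results

# Usage with the functions from this section (needs a real token and network access):
# venue_urls = ['https://foursquare.com/v/thai-diner/5e46e2ec5791a10008c55728']
# all_venues = scrape_venues(venue_urls, fetch_venue_details, scrape_venue_details)
# save_venue_data(all_venues)
```

Because json.dump accepts lists as well as dictionaries, save_venue_data works unchanged with the list this function returns.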

Complete Code Example

Here is the complete code example that puts everything together.

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import json

# Initialize Crawlbase API with your token
crawling_api = CrawlingAPI({'token': 'YOUR_CRAWLBASE_TOKEN'})

def fetch_venue_details(url):
    # Crawlbase request options
    options = {
        'ajax_wait': 'true',  # Wait for JavaScript to load
        'page_wait': '5000'   # Wait 5 seconds for full page render
    }

    response = crawling_api.get(url, options)

    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        return html_content
    else:
        print(f"Error fetching venue details. Status: {response['headers']['pc_status']}")
        return None

def scrape_venue_details(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')

    # Extract venue details
    name = soup.select_one('h1.venueName').text.strip() if soup.select_one('h1.venueName') else ''
    address = soup.select_one('div.venueAddress').text.strip() if soup.select_one('div.venueAddress') else ''
    phone = soup.select_one('span[itemprop="telephone"]').text.strip() if soup.select_one('span[itemprop="telephone"]') else ''
    rating = soup.select_one('span[itemprop="ratingValue"]').text.strip() if soup.select_one('span[itemprop="ratingValue"]') else ''
    ratings_count = soup.select_one('div.numRatings').text.strip() if soup.select_one('div.numRatings') else ''

    # Return the details in a dictionary
    return {
        'name': name,
        'address': address,
        'phone': phone,
        'rating': rating,
        'ratings_count': ratings_count
    }

def save_venue_data(data, filename='foursquare_venue_details.json'):
    with open(filename, 'w') as f:
        json.dump(data, f, indent=4)
    print(f"Data successfully saved to {filename}")

# Example usage:
if __name__ == "__main__":
    url = 'https://foursquare.com/v/thai-diner/5e46e2ec5791a10008c55728'
    html_content = fetch_venue_details(url)

    if html_content:
        venue_data = scrape_venue_details(html_content)
        save_venue_data(venue_data)

Example Output:

{
  "name": "Thai Diner",
  "address": "186 Mott St (at Kenmare)New York, NY 10012United States",
  "phone": "(646) 559-4140",
  "rating": "9.5",
  "ratings_count": "298"
}
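
In the output above, the address parts run together ("Kenmare)New York"). This usually happens when the parts sit in separate child elements and .text joins them without any separator. The sketch below shows one possible fix using BeautifulSoup's get_text() with a separator. The markup in `sample_html` is hypothetical, since the real page may nest the address differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real page may nest the address parts differently
sample_html = """
<div class="venueAddress">
  <span>186 Mott St (at Kenmare)</span><span>New York, NY 10012</span><span>United States</span>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
address_el = soup.select_one('div.venueAddress')

# .text concatenates the child strings directly
joined = address_el.text.strip()

# get_text() with a separator strips each part and joins them with ', '
separated = address_el.get_text(separator=', ', strip=True)

print(joined)
print(separated)
```

In scrape_venue_details, this would mean replacing .text.strip() with get_text(separator=', ', strip=True) for the address field only.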

Final Thoughts

In this blog, we learned how to scrape data from Foursquare using the Crawlbase Crawling API and BeautifulSoup. We covered the important bits, like inspecting the HTML for selectors, writing scrapers for search listings and venue details, and handling pagination. Scraping Foursquare data can be super useful, but you have to do it responsibly and respect the website’s terms of service.

Using the methods outlined in this blog, you can collect all sorts of venue information like names, addresses, phone numbers, ratings, and reviews from Foursquare. This data can be useful for research, analysis or building your apps.

If you want to do more web scraping, check out our guides on scraping other key websites.

📜 Scrape Costco Product Data Easily
📜 How to Scrape Houzz Data
📜 How to Scrape Tokopedia
📜 Scrape OpenSea Data with Python
📜 How to Scrape Gumtree Data in Easy Steps

If you have any questions or want to give feedback, our support team can help you with web scraping. Happy scraping!

Frequently Asked Questions

Q. Is it legal to scrape data from websites?

The legality of web scraping depends on the website’s terms of service. Many websites allow personal use, but some don’t. Always check the website’s policies before scraping data. Be respectful of these rules, and don’t overload the website’s servers. If in doubt, reach out to the website owner for permission.

Q. How can I scrape venue details from Foursquare?

To scrape venue details from Foursquare, you need to use a web scraping tool or library like BeautifulSoup along with a service like Crawlbase. First, inspect the HTML of the venue page to find the CSS selectors for the details you want, such as the venue’s name, address, and ratings. Then, write a script that fetches the page content and extracts the data using the identified selectors.

Q. How do I handle pagination while scraping Foursquare?

Foursquare has a “See more results” button to show more venues. One of the solutions for handling this in your scraper is the Crawlbase Crawling API. By using the css_click_selector parameter from this API, your scraper can click the “See more results” button to fetch more results. This way, you will capture all the data while scraping.

Scrape OpenSea Data with Python

2024-10-22T14:00:00.000Z

Scraping data from OpenSea is super useful, especially if you’re into NFTs (Non-Fungible Tokens), which have surged in popularity over the last few years. NFTs are unique digital assets (art, collectibles, virtual goods) secured on blockchain technology. As one of the largest NFT marketplaces, OpenSea has millions of NFTs across categories, so it’s a go-to for collectors, investors, and developers. Whether you’re tracking trends, prices, or specific collections, having this data is gold.

But OpenSea uses JavaScript to load most of its data, so traditional scraping won’t work. That’s where the Crawlbase Crawling API comes in: it can handle JavaScript-heavy pages, which makes it a good fit for scraping OpenSea data.

In this post we’ll show you how to scrape OpenSea data, collection pages and individual NFT detail pages using Python and the Crawlbase Crawling API. Let’s get started!

Why Scrape OpenSea for NFT Data?
What Data Can You Extract From OpenSea?
OpenSea Scraping with Crawlbase Crawling API
Setting Up Your Python Environment

Installing Python and Required Libraries
Choosing an IDE

Scraping OpenSea Collection Pages

Inspecting the HTML for CSS Selectors
Writing the Collection Page Scraper
Handling Pagination in Collection Pages
Storing Data in a CSV File
Complete Code Example

Scraping OpenSea NFT Detail Pages

Inspecting the HTML for CSS Selectors
Writing the NFT Detail Page Scraper
Storing Data in a CSV File
Complete Code Example

Final Thoughts
Frequently Asked Questions

Why Scrape OpenSea for NFT Data?

Scraping OpenSea can help you track and analyze valuable NFT data, including prices, trading volumes, and ownership information. Whether you’re an NFT collector, a developer building NFT-related tools, or an investor looking to understand market trends, extracting data from OpenSea gives you the insights you need to make informed decisions.

Here are some reasons why scraping OpenSea is important:

Track NFT Prices: Monitor individual NFT prices or an entire collection over time.
Analyze Trading Volumes: Understand how in-demand certain NFTs are based on sales and trading volumes.
Discover Trends: Find out which NFT collections and tokens are the hottest in real time.
Monitor NFT Owners: Scrape ownership data to see who owns specific NFTs or how many tokens a wallet owns.
Automate Data Collection: Instead of checking OpenSea manually, you can collect the data automatically and save it in different formats like CSV or JSON.

OpenSea’s website uses JavaScript rendering, so scraping it can be tricky. But with the Crawlbase Crawling API, you can handle this problem and extract the data easily.

What Data Can You Extract From OpenSea?

When scraping OpenSea, it’s important to know what data to focus on. The platform has a ton of information about NFTs (Non-Fungible Tokens), and extracting the right data will help you track performance, analyze trends, and make decisions. Here’s what to extract:

NFT Name: The name of the individual NFT, which often reflects its branding or collection.
Collection Name: The NFT collection to which the individual NFT belongs. Collections usually represent sets or series of NFTs.
Price: The NFT listing price. This is important for understanding market trends and determining the value of NFTs.
Last Sale Price: The price the NFT was previously sold at. It gives historical context for the NFT’s market performance.
Owner: The NFT’s present holder (usually a wallet address).
Creator: The artist or creator of the NFT. Creator information is important for tracking provenance and originality.
Number of Owners: Some NFTs have multiple owners, which indicates how widely held the token is.
Rarity/Attributes: Many NFTs have traits that make them unique and more desirable.
Trading Volume: The overall volume of sales and transfers of the NFT or the entire collection.
Token ID: The unique identifier for the NFT on the blockchain, useful for tracking specific tokens across platforms.

OpenSea Scraping with Crawlbase Crawling API

The Crawlbase Crawling API makes OpenSea data scraping easy. Since OpenSea uses JavaScript to load its content, traditional scraping methods will fail. But the Crawlbase API works like a real browser so you can get all the data you need.

Why Use Crawlbase Crawling API for OpenSea

Handles Dynamic Content: The Crawlbase Crawling API can handle JavaScript-heavy pages and ensures that scraping only happens after all NFT data (prices, ownership) has loaded.
IP Rotation: To prevent getting blocked by OpenSea’s security, Crawlbase rotates IP addresses. So you can scrape multiple pages without worrying about rate limits or bans.
Fast Performance: Crawlbase is fast and efficient for scraping large data volumes, saving you time especially when you have many NFTs and collections.
Customizable Requests: You can adjust headers, cookies and other parameters to fit your scraping needs and get the data you want.
Scroll-Based Pagination: Crawlbase supports scroll-based pagination so you can get more items on collection pages without having to manually click through each page.

Crawlbase Python Library

Crawlbase also provides a Python library that makes it easy to use Crawlbase products in your projects. You’ll need an access token, which you can get by signing up with Crawlbase.

Here’s an example of sending a request to the Crawlbase Crawling API:

from crawlbase import CrawlingAPI

# Initialize Crawlbase API with your access token
crawling_api = CrawlingAPI({'token': 'CRAWLBASE_JS_TOKEN'})

def make_crawlbase_request(url):
    response = crawling_api.get(url)

    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        return html_content
    else:
        print(f"Failed to fetch the page. Crawlbase status code: {response['headers']['pc_status']}")
        return None

Note: Crawlbase provides two types of tokens: a Normal Token for static sites and a JavaScript (JS) Token for dynamic or browser-rendered content, which is necessary for scraping OpenSea. Crawlbase also offers 1,000 free requests to help you get started, and you can sign up without a credit card. For more details, check the Crawlbase Crawling API documentation.

In the next section, we’ll set up your Python environment for scraping OpenSea effectively.

Setting Up Your Python Environment

Before scraping data from OpenSea, you need to set up your Python environment. This setup will ensure you have all the necessary tools and libraries to make your scraping process smooth and efficient. Here’s how to do it:

Installing Python and Required Libraries

Install Python: Download Python from the official website and follow the installation instructions. Make sure to check “Add Python to PATH” during installation.

Set Up a Virtual Environment (optional but recommended): This keeps your project organized. Run these commands in your terminal:

cd your_project_directory
python -m venv venv
venv\Scripts\activate  # Windows
# or
source venv/bin/activate  # macOS/Linux

Install Required Libraries: Run the following command to install necessary libraries:

pip install beautifulsoup4 crawlbase pandas

beautifulsoup4: For parsing and extracting data from HTML.
crawlbase: For using the Crawlbase Crawling API.
pandas: For handling and saving data in CSV format.

Choosing an IDE

Select an Integrated Development Environment (IDE) to write your code. Popular options include:

Visual Studio Code: Free and lightweight, with Python support.
PyCharm: A feature-rich IDE for Python.
Jupyter Notebook: Great for interactive coding and data analysis.

Now that your Python environment is set up, you’re ready to start scraping OpenSea collection pages. In the next section, we will inspect the HTML for CSS selectors.

Scraping OpenSea Collection Pages

In this section, we will scrape collection pages from OpenSea. Collection pages show various NFTs grouped under specific categories or themes. To do this efficiently we will go through the following steps:

Inspecting the HTML for CSS Selectors

Before we write our scraper, we need to understand the structure of the HTML on the OpenSea collection pages. Here’s how to find the CSS selectors:

Open the Collection Page: Go to the OpenSea website and navigate to any collection page.
Inspect the Page: Right-click on the page and select “Inspect” or press Ctrl + Shift + I to open the Developer Tools.

Find Relevant Elements: Look for the elements that contain the NFT details. Common data points are:

Title: In a <span> with data-testid="ItemCardFooter-name".
Price: Located within a <div> with data-testid="ItemCardPrice", specifically in a nested <span> with data-id="TextBody".
Image URL: In an <img> tag with the image source in the src attribute.
Link: The NFT detail page link is in an <a> tag with the class Asset--anchor.

Writing the Collection Page Scraper

Now we have the CSS selectors, we can write our scraper. We will use the Crawlbase Crawling API to handle JavaScript rendering by using its ajax_wait and page_wait parameters. Below is the implementation of the scraper:

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import pandas as pd

# Initialize Crawlbase API with your access token
crawling_api = CrawlingAPI({'token': 'CRAWLBASE_JS_TOKEN'})

def make_crawlbase_request(url):
    options = {
        'ajax_wait': 'true',
        'page_wait': '5000'
    }

    response = crawling_api.get(url, options)

    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        return html_content
    else:
        print(f"Failed to fetch the page. Crawlbase status code: {response['headers']['pc_status']}")
        return None

def scrape_opensea_collection(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    data = []

    # Find all NFT items in the collection
    nft_items = soup.select('div.Asset--loaded > article.AssetSearchList--asset')

    for item in nft_items:
        title = item.select_one('span[data-testid="ItemCardFooter-name"]').text.strip() if item.select_one('span[data-testid="ItemCardFooter-name"]') else ''
        price = item.select_one('div[data-testid="ItemCardPrice"] span[data-id="TextBody"]').text.strip() if item.select_one('div[data-testid="ItemCardPrice"] span[data-id="TextBody"]') else ''
        image = item.select_one('img')['src'] if item.select_one('img') else ''
        link = item.select_one('a.Asset--anchor')['href'] if item.select_one('a.Asset--anchor') else ''

        # Add the extracted data to the list
        data.append({
            'title': title,
            'price': price,
            'image_url': image,
            'link': f"https://opensea.io{link}"  # Construct the full URL
        })

    return data

Here we initialize the Crawlbase Crawling API and create a function make_crawlbase_request to get the collection page. With ajax_wait enabled and page_wait set to 5000 milliseconds, the API waits for AJAX requests to complete and gives the page 5 seconds to render. The function then returns the HTML, which we pass to the scrape_opensea_collection function.

In scrape_opensea_collection, we parse the HTML with BeautifulSoup and extract details about each NFT item using the CSS selectors we defined earlier. We get the title, price, image URL and link for each NFT and store this in a list which is returned to the caller.
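
One detail worth noting in the scraper above: when no anchor is found, link is an empty string, so the stored value becomes the bare site root. A minimal sketch of a safer link builder follows, assuming relative hrefs should resolve against https://opensea.io. The helper name to_absolute_url is hypothetical and not part of the original code.

```python
from urllib.parse import urljoin

BASE_URL = "https://opensea.io"  # site root used by the scraper above


def to_absolute_url(href, base=BASE_URL):
    """Return an absolute URL for href, or '' when href is missing."""
    if not href:
        return ""
    # urljoin resolves relative paths and leaves absolute URLs untouched
    return urljoin(base, href)
```

In the scraper, 'link': to_absolute_url(link) could replace the f-string so that missing links stay empty in the output.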

OpenSea uses scroll-based pagination, so more items load as you scroll down the page. We can use the scroll and scroll_interval parameters for this. This way we don’t need to manage pagination explicitly.

options = {
    'ajax_wait': 'true',
    'scroll': 'true',
    'scroll_interval': '20'  # Scroll for 20 seconds
}

This will make the crawler scroll for 20 seconds so we get more items.
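
If you want to vary the scroll time per collection, the options can be built by a small helper. The sketch below is illustrative: build_scroll_options is a hypothetical name, and the 60-second cap is an assumption that should be checked against the Crawlbase Crawling API documentation.

```python
MAX_SCROLL_SECONDS = 60  # assumed upper bound; confirm in the Crawlbase docs


def build_scroll_options(scroll_seconds=20):
    """Build Crawling API options that scroll the page for scroll_seconds."""
    # Clamp the value into the range 1..MAX_SCROLL_SECONDS
    seconds = max(1, min(int(scroll_seconds), MAX_SCROLL_SECONDS))
    return {
        'ajax_wait': 'true',
        'scroll': 'true',
        'scroll_interval': str(seconds),  # the API options are passed as strings
    }
```

The result can be passed directly as the options argument of crawling_api.get.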

Storing Data in a CSV File

After we scrape the data we can store it in a CSV file. This is a common format and easy to analyze later. Here’s how:

def save_data_to_csv(data, filename='opensea_data.csv'):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)
    print(f"Data saved to {filename}")

Complete Code Example

Here’s the complete code that combines all the steps:

from crawlbase import CrawlingAPI
import pandas as pd
from bs4 import BeautifulSoup

# Initialize Crawlbase API with your access token
crawling_api = CrawlingAPI({'token': 'CRAWLBASE_JS_TOKEN'})

def make_crawlbase_request(url):
    options = {
        'ajax_wait': 'true',
        'scroll': 'true',
        'scroll_interval': '20'  # Scroll for 20 seconds
    }

    response = crawling_api.get(url, options)

    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        return html_content
    else:
        print(f"Failed to fetch the page. Crawlbase status code: {response['headers']['pc_status']}")
        return None

def scrape_opensea_collection(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    data = []

    # Find all NFT items in the collection
    nft_items = soup.select('div.Asset--loaded > article.AssetSearchList--asset')

    for item in nft_items:
        title = item.select_one('span[data-testid="ItemCardFooter-name"]').text.strip() if item.select_one('span[data-testid="ItemCardFooter-name"]') else ''
        price = item.select_one('div[data-testid="ItemCardPrice"] span[data-id="TextBody"]').text.strip() if item.select_one('div[data-testid="ItemCardPrice"] span[data-id="TextBody"]') else ''
        image = item.select_one('img')['src'] if item.select_one('img') else ''
        link = item.select_one('a.Asset--anchor')['href'] if item.select_one('a.Asset--anchor') else ''

        # Add the extracted data to the list
        data.append({
            'title': title,
            'price': price,
            'image_url': image,
            'link': f"https://opensea.io{link}"  # Construct the full URL
        })

    return data

def save_data_to_csv(data, filename='opensea_data.csv'):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)
    print(f"Data saved to {filename}")

if __name__ == "__main__":
    url = "https://opensea.io/collection/courtyard-nft"
    html_content = make_crawlbase_request(url)

    if html_content:
        data = scrape_opensea_collection(html_content)  # Extract data from HTML content
        save_data_to_csv(data)

opensea_data.csv Snapshot:

Scraping OpenSea NFT Detail Pages

In this section, we will learn how to scrape NFT detail pages on OpenSea. Each NFT has its own detail page that has more information such as title, description, price history and other details. We will follow these steps:

Inspecting the HTML for CSS Selectors

Before we write our scraper, we need to find the HTML structure of the NFT detail pages. Here’s how to do it:

Open an NFT Detail Page: Go to OpenSea and open any NFT detail page.
Inspect the Page: Right-click on the page and select “Inspect” or press Ctrl + Shift + I to open the Developer Tools.

Locate Key Elements: Search for the elements that hold the NFT details. Here are the common data points to look for:

Title: In an <h1> tag with class item--title.
Description: In a <div> tag with class item--description.
Price: In a <div> tag with class Price--amount.
Image URL: In an <img> tag inside a <div> with class media-container.
Link to the NFT page: The current URL of the NFT detail page.

Writing the NFT Detail Page Scraper

Now that we have our CSS selectors, we can write our scraper. We’ll use the Crawlbase Crawling API to render JavaScript. Below is an example of how to scrape data from an NFT detail page:

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import pandas as pd

# Initialize Crawlbase API with your access token
crawling_api = CrawlingAPI({'token': 'CRAWLBASE_JS_TOKEN'})

def make_crawlbase_request(url):
    options = {
        'ajax_wait': 'true',
        'page_wait': '5000'
    }

    response = crawling_api.get(url, options)

    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        return html_content
    else:
        print(f"Failed to fetch the NFT detail page. Crawlbase status code: {response['headers']['pc_status']}")
        return None

def scrape_opensea_nft_detail(html_content, url):
    soup = BeautifulSoup(html_content, 'html.parser')

    title = soup.select_one('h1.item--title').text.strip() if soup.select_one('h1.item--title') else ''
    description = soup.select_one('div.item--description').text.strip() if soup.select_one('div.item--description') else ''
    price = soup.select_one('div.Price--amount').text.strip() if soup.select_one('div.Price--amount') else ''
    image_urls = [img['src'] for img in soup.select('div.media-container img')]
    link = url  # The link is the current URL

    nft_data = {
        'title': title,
        'description': description,
        'price': price,
        'images_url': image_urls,
        'link': link
    }

    return nft_data
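
The make_crawlbase_request function above returns None whenever the status is not 200. Rendered pages can fail intermittently, so a retry wrapper may help. This is a minimal sketch, not part of the original tutorial: fetch_with_retries is a hypothetical helper, and the attempt count and delays are arbitrary starting values.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until it returns content or attempts run out.

    fetch is any callable that returns HTML text on success and None on
    failure, such as make_crawlbase_request.
    """
    for attempt in range(attempts):
        html = fetch(url)
        if html is not None:
            return html
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))
    return None
```

Usage would look like html_content = fetch_with_retries(make_crawlbase_request, nft_url). The sleep parameter exists only so the delay can be swapped out in tests.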

Storing Data in a CSV File

Once we have scraped the NFT details, we can save them in a CSV file. This allows us to easily analyze the data later. Here’s how to do it:

def save_nft_data_to_csv(data, filename='opensea_nft_data.csv'):
    df = pd.DataFrame([data])  # Convert the single NFT data dictionary to a DataFrame
    df.to_csv(filename, index=False)
    print(f"NFT data saved to {filename}")

Complete Code Example

Here’s the complete code that combines all the steps for scraping NFT detail pages:

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import pandas as pd

# Initialize Crawlbase API with your access token
crawling_api = CrawlingAPI({'token': 'CRAWLBASE_JS_TOKEN'})

def make_crawlbase_request(url):
    options = {
        'ajax_wait': 'true',
        'page_wait': '5000'
    }

    response = crawling_api.get(url, options)

    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        return html_content
    else:
        print(f"Failed to fetch the NFT detail page. Crawlbase status code: {response['headers']['pc_status']}")
        return None

def scrape_opensea_nft_detail(html_content, url):
    soup = BeautifulSoup(html_content, 'html.parser')

    title = soup.select_one('h1.item--title').text.strip() if soup.select_one('h1.item--title') else ''
    description = soup.select_one('div.item--description').text.strip() if soup.select_one('div.item--description') else ''
    price = soup.select_one('div.Price--amount').text.strip() if soup.select_one('div.Price--amount') else ''
    image_urls = [img['src'] for img in soup.select('div.media-container img')]
    link = url  # The link is the current URL

    nft_data = {
        'title': title,
        'description': description,
        'price': price,
        'images_url': image_urls,
        'link': link
    }

    return nft_data

def save_nft_data_to_csv(data, filename='opensea_nft_data.csv'):
    df = pd.DataFrame([data])  # Convert the single NFT data dictionary to a DataFrame
    df.to_csv(filename, index=False)
    print(f"NFT data saved to {filename}")

# Example usage
if __name__ == "__main__":
    nft_url = "https://opensea.io/assets/matic/0x251be3a17af4892035c37ebf5890f4a4d889dcad/94953658332979117398233379364809351909803379308836092246404100025584049123386"
    html_content = make_crawlbase_request(nft_url)

    if html_content:
        nft_data = scrape_opensea_nft_detail(html_content, nft_url)  # Extract data from HTML content
        save_nft_data_to_csv(nft_data)  # Save NFT data to CSV

opensea_nft_data.csv Snapshot:

Optimize OpenSea NFT Data Scraping

Scraping OpenSea opens up a whole world of NFTs and market data. Throughout this blog, we covered how to scrape OpenSea using Python and Crawlbase Crawling API. By understanding the layout of the site and using the right tools, you can get valuable insights while keeping ethics in mind.

As you get deeper into your scraping projects, remember to store the data in human-readable formats, such as CSV files, to make analysis easier. The NFT space moves fast, and staying aware of new trends and technologies will help you get the most out of your data collection efforts. With the right mindset and tools, you can find valuable insights in the NFT market.

If you want to do more web scraping, check out our guides on scraping other key websites.

📜 How to Scrape Monster.com
📜 How to Scrape Groupon
📜 How to Scrape TechCrunch
📜 How to Scrape X.com Tweet Pages
📜 How to Scrape Clutch.co

If you have any questions or want to give feedback, our support team can help you with web scraping. Happy scraping!

Frequently Asked Questions

Q. Why should I web scrape OpenSea?

Web scraping is a way to automatically extract data from websites. By scraping OpenSea, you can grab important information about NFTs, such as their prices, descriptions, and images. This data helps you analyze market trends, track specific collections or compare prices across NFTs. Overall, web scraping provides valuable insights that can enhance your understanding of the NFT marketplace.

Q. Is it legal to scrape data from OpenSea?

Web scraping is a gray area when it comes to legality. Many websites including OpenSea allow data collection for personal use but always read the terms of service before you start. Make sure your scraping activities comply with the website’s policies and copyright laws. Ethical scraping means using the data responsibly and not flooding the website’s servers.

Q. What tools do I need to start scraping OpenSea?

To start scraping OpenSea, you’ll need a few tools. Install Python and libraries like BeautifulSoup and pandas for data parsing and manipulation. You’ll also use Crawlbase Crawling API to handle dynamic content and JavaScript rendering on OpenSea. With these tools in place you’ll be ready to scrape and analyze NFT data.

How to scrape Gumtree Data in Easy Steps

2024-10-17T18:00:00.000Z

Gumtree is one of the most popular online classifieds websites, where users can buy and sell products or services locally. Whether you’re looking for cars, furniture, property, electronics, or even jobs, Gumtree has millions of listings that update regularly. With over 15 million unique monthly visitors and more than 1.5 million active ads at any time, Gumtree provides a wealth of data that can be used for price comparison, competitor analysis, or tracking trends.

In this blog, we will walk you through how to scrape Gumtree search listings and individual product pages using Python. We will also show how to store the data in CSV files for easy analysis. At the end, we’ll discuss how to optimize the process using Crawlbase Smart Proxy to avoid issues like IP blocking.

Let’s dive in!

Why Scrape Gumtree Data?
Key Data Points to Extract from Gumtree
Setting Up Your Python Environment

Installing Python and Required Libraries
Choosing an IDE

Scraping Gumtree Search Listings

Inspecting the HTML for CSS Selectors
Writing the Search Listings Scraper
Handling Pagination in Gumtree
Storing Data in a CSV File
Complete Code Example

Scraping Gumtree Product Pages

Inspecting the HTML for CSS Selectors
Writing the Product Page Scraper
Storing Data in a CSV File
Complete Code Example

Optimizing Scraping with Crawlbase Smart Proxy

Benefits of Crawlbase Smart Proxy
Integrating Crawlbase Smart Proxy

Final Thoughts
Frequently Asked Questions

Why Scrape Gumtree Data?

Scraping Gumtree data is useful for many things. As the leading online classifieds platform, Gumtree connects buyers and sellers for a wide range of products. Here are some reasons to scrape Gumtree:

Market Trend Analysis: See product prices and availability to track the market.
Competitor Research: Monitor competitors’ listings and pricing to stay ahead.
Identify Popular Products: Find trending items and high-demand products.
Informed Business Decisions: Use data to make buying and selling choices.
Price Tracking: Track price changes over time to find deals or trends.
User Behavior Insights: Analyse listings to see what users want.
Enhanced Marketing Strategies: Refine your marketing based on current trends.

In the following sections, we will show you how to effectively scrape Gumtree search listings and product pages.

Key Data Points to Extract from Gumtree

When scraping Gumtree, you need to know what data to grab. Here are the key data points to focus on when scraping Gumtree:

Product Title: The title of the product is usually in the main heading of the listing. This is the most important part.
Price: The listing price is what the seller is asking for the product. Monitoring prices will help you work out the market value.
Location: The location of the seller is usually in the listing. This is useful for understanding regional demand and supply.
Description: The product description has all the details of the item, condition, features and specs.
Image URL: The image URL is important for visual representation and helps you understand the condition and appeal of the product.
Listing URL: The direct link to the product page is needed to get more details or contact the seller.
Date Listed: The date the listing was posted helps you track how long the item has been available and can indicate demand.
Seller’s Username: The name of the seller can give you an idea of trustworthiness and reliability especially if you’re comparing multiple listings.
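
One way to keep these fields consistent between scrapers is a small record type. The sketch below is only an illustration: the class name GumtreeListing and its field names are hypothetical, and every field defaults to an empty string because not every page exposes every data point.

```python
from dataclasses import dataclass, asdict


@dataclass
class GumtreeListing:
    """One scraped listing; fields mirror the data points listed above."""
    title: str = ''
    price: str = ''
    location: str = ''
    description: str = ''
    image_url: str = ''
    listing_url: str = ''
    date_listed: str = ''
    seller_name: str = ''


# asdict() turns a record into a plain dict, ready for a DataFrame or CSV row
example = asdict(GumtreeListing(title='Headset', price='£20'))
```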

Setting Up Your Python Environment

Before you can start scraping Gumtree, you need to set up your Python environment. This involves installing Python and the required libraries. This will give you the tools to send requests, extract data and store it for analysis.

Installing Python and Required Libraries

First make sure you have Python installed on your machine. If you don’t have Python installed, you can download it from the official Python website. Once installed, open a terminal or command prompt and install the required libraries with pip.

Here’s a list of the key libraries needed for scraping Gumtree:

Requests: To send HTTP requests and receive responses.
BeautifulSoup: For parsing HTML and extracting data.
Pandas: For organizing and saving data in CSV format.

Run the following command to install these libraries:

pip install requests beautifulsoup4 pandas

Choosing an IDE

An Integrated Development Environment (IDE) makes coding easier and more efficient. Here are some popular IDEs for Python:

PyCharm: A powerful, full-featured IDE with smart code assistance and debugging tools.
Visual Studio Code: A lightweight code editor with a wide range of extensions for Python development.
Jupyter Notebook: Ideal for running code in smaller chunks, making it easier to test and debug.

Once your environment is set up, let’s start scraping Gumtree listings. Next, we’ll look at the HTML structure to find the CSS selectors of the elements holding the data we need.

Scraping Gumtree Search Listings

In this section, we will learn how to scrape search listings from Gumtree. We’ll inspect the HTML structure, write the scraper, handle pagination, and store the data in a CSV file.

Inspecting the HTML for CSS Selectors

To get data from Gumtree, we first need to find the HTML elements that contain the information. Open your browser’s Developer Tools and inspect a listing.

Here are some key selectors:

Title: Found in a <div> tag with the attribute data-q="tile-title".
Price: Located in a <div> tag with the attribute data-testid="price".
Location: Found in a <div> tag with the attribute data-q="tile-location".
URL: The product link is in the href attribute of the <a> tag identified by data-q="search-result-anchor".

We will use these CSS selectors to extract the required data.

Writing the Search Listings Scraper

Let’s write a function that sends a request to Gumtree, extracts the required data, and returns it.

import requests
from bs4 import BeautifulSoup

def scrape_gumtree_search(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }

    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    listings = []

    for listing in soup.select('article[data-q="search-result"]'):
        title = listing.select_one('div[data-q="tile-title"]').text.strip()
        price = listing.select_one('div[data-testid="price"]').text.strip()
        location = listing.select_one('div[data-q="tile-location"]').text.strip()
        link = listing.select_one('a[data-q="search-result-anchor"]')['href']
        listings.append({
            'title': title,
            'price': price,
            'location': location,
            'URL': f'https://www.gumtree.com{link}'
        })

    return listings

This function extracts titles, prices, locations, and URLs from the search results page.
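
The function calls .text directly on the result of select_one, which raises an AttributeError if a listing lacks one of the elements (select_one returns None in that case). A small guard avoids this. The helper name text_or_default is hypothetical, and the sketch works with any object that exposes a text attribute.

```python
def text_or_default(element, default=''):
    """Return the stripped text of a parsed element, or default if it is None."""
    if element is None:
        return default
    return element.text.strip()
```

Inside the loop, a line such as title = text_or_default(listing.select_one('div[data-q="tile-title"]')) would then tolerate listings with missing fields.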

Handling Pagination in Gumtree

To scrape multiple pages, we need to handle pagination. The URL for subsequent pages usually contains a page parameter, such as page=2, appended to the query string. We can modify the scraper to fetch data from multiple pages.

def scrape_gumtree_multiple_pages(base_url, max_pages):
    all_listings = []

    for page in range(1, max_pages + 1):
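        # Use '&' when base_url already has a query string (e.g. ?q=headset)
        separator = '&' if '?' in base_url else '?'
        url = f'{base_url}{separator}page={page}'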
        listings = scrape_gumtree_search(url)
        all_listings.extend(listings)

    return all_listings

This function iterates through a specified number of pages and collects the listings from each page.
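
When looping over many pages, it can help to pause between requests and to stop as soon as a page comes back empty. The sketch below illustrates that idea under stated assumptions: collect_pages and its fetch_page argument are hypothetical, the one-second delay is an arbitrary default, and an empty page is assumed to mean the end of the results.

```python
import time


def collect_pages(fetch_page, max_pages, delay_seconds=1.0, sleep=time.sleep):
    """Gather listings page by page, pausing between requests.

    fetch_page is any callable that takes a page number and returns a list
    of listings, for example a thin wrapper around scrape_gumtree_search.
    """
    all_listings = []
    for page in range(1, max_pages + 1):
        listings = fetch_page(page)
        if not listings:
            # An empty page is treated as the end of the results
            break
        all_listings.extend(listings)
        if page < max_pages:
            sleep(delay_seconds)
    return all_listings
```

The sleep parameter is injectable so the pause can be replaced in tests without waiting.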

Storing Data in a CSV File

To store the scraped data, we’ll use the pandas library to write the results into a CSV file.

import pandas as pd

def save_to_csv(data, filename):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)

This function takes a list of listings and saves it into a CSV file with the specified filename.

Complete Code Example

Here’s the complete code to scrape Gumtree search listings, handle pagination, and save the results to a CSV file.

import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_gumtree_search(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }

    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    listings = []

    for listing in soup.select('article[data-q="search-result"]'):
        title = listing.select_one('div[data-q="tile-title"]').text.strip()
        price = listing.select_one('div[data-testid="price"]').text.strip()
        location = listing.select_one('div[data-q="tile-location"]').text.strip()
        link = listing.select_one('a[data-q="search-result-anchor"]')['href']
        listings.append({
            'title': title,
            'price': price,
            'location': location,
            'URL': f'https://www.gumtree.com{link}'
        })

    return listings

def scrape_gumtree_multiple_pages(base_url, max_pages):
    all_listings = []

    for page in range(1, max_pages + 1):
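        # base_url already contains ?q=..., so the page parameter needs '&'
        separator = '&' if '?' in base_url else '?'
        url = f'{base_url}{separator}page={page}'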
        listings = scrape_gumtree_search(url)
        all_listings.extend(listings)

    return all_listings

def save_to_csv(data, filename):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)

def main():
    base_url = 'https://www.gumtree.com/search?q=headset'
    max_pages = 5
    listings = scrape_gumtree_multiple_pages(base_url, max_pages)
    save_to_csv(listings, 'gumtree_listings.csv')
    print(f'Scraped {len(listings)} listings and saved to gumtree_listings.csv')

if __name__ == '__main__':
    main()

This script scrapes Gumtree search listings for a product, handles pagination, and saves the data in a CSV file for further analysis.

gumtree_listings.csv Snapshot:

Scraping Gumtree Product Pages

Now that we’ve scraped the search listings, the next step is to scrape individual product pages for more information. We will inspect the HTML structure of product pages, write the scraper, and save the data in a CSV file.

Inspecting the HTML for CSS Selectors

First, inspect the Gumtree product pages to find the HTML elements that contain the data. Open a product page in your browser and use the Developer Tools to find:

Product Title: Located in an <h1> tag with the attribute data-q="vip-title".
Price: Found inside an <h3> tag with the attribute data-q="ad-price".
Description: Located in a <p> tag with the attribute itemprop="description".
Seller Name: Inside an <h2> tag with the class seller-rating-block-name.
Product Image URL: Found in <img> tags within a <div> that has the attribute data-testid="carousel", with the image URL stored in the src attribute.

Writing the Product Page Scraper

We’ll now create a function that takes a product page URL, fetches the page’s HTML content, and extracts the required information.

import requests
from bs4 import BeautifulSoup

def scrape_gumtree_product_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }

    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extracting product details
    title = soup.select_one('h1[data-q="vip-title"]').text.strip()
    price = soup.select_one('h3[data-q="ad-price"]').text.strip()
    description = soup.select_one('p[itemprop="description"]').text.strip()
    seller_name = soup.select_one('h2.seller-rating-block-name').text.strip()
    images_url = [img['src'] for img in soup.select('div[data-testid="carousel"] img') if 'src' in img.attrs]


    return {
        'title': title,
        'price': price,
        'description': description,
        'seller_name': seller_name,
        'images_url': images_url,
        'product_url': url
    }

This function sends a request to the product page URL, parses the HTML, and extracts the title, price, description, seller name, and product image URL.
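
The price is returned as display text. If you plan to compare or sort prices, converting it to a number first is useful. The sketch below assumes price strings look like "£1,250" or "£19.99"; the function name parse_price and that format are assumptions, so adjust the pattern to what the pages actually return.

```python
import re


def parse_price(price_text):
    """Convert text such as '£1,250' to 1250.0; return None if no number is found."""
    # Match digits with optional thousands separators and an optional decimal part
    match = re.search(r'\d[\d,]*(?:\.\d+)?', price_text or '')
    if not match:
        return None
    return float(match.group(0).replace(',', ''))
```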

Storing Data in a CSV File

Once we have scraped the data, we will store it in a CSV file. We can reuse the save_to_csv function we used earlier for search listings.

import pandas as pd

def save_to_csv(data, filename):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)

Complete Code Example

Here’s the complete code to scrape product pages, extract the required details, and store them in a CSV file.

import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_gumtree_product_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }

    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extracting product details
    title = soup.select_one('h1[data-q="vip-title"]').text.strip()
    price = soup.select_one('h3[data-q="ad-price"]').text.strip()
    description = soup.select_one('p[itemprop="description"]').text.strip()
    seller_name = soup.select_one('h2.seller-rating-block-name').text.strip()
    images_url = [img['src'] for img in soup.select('div[data-testid="carousel"] img') if 'src' in img.attrs]


    return {
        'title': title,
        'price': price,
        'description': description,
        'seller_name': seller_name,
        'images_url': images_url,
        'product_url': url
    }

def save_to_csv(data, filename):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)

def main():
    product_urls = [
        'https://www.gumtree.com/p/bmw/bmw-1-series-118d-sport-5dr-nav-/1488114476',
        'https://www.gumtree.com/p/kia/diesel-estate-12-months-mot-px-welcome-nationwide-delivery-available/1483456978',
        # Add more product URLs here
    ]

    product_data = []

    for url in product_urls:
        product_info = scrape_gumtree_product_page(url)
        product_data.append(product_info)

    save_to_csv(product_data, 'gumtree_product_data.csv')
    print(f'Scraped {len(product_data)} product pages and saved to gumtree_product_data.csv')

if __name__ == '__main__':
    main()

This script scrapes product details from individual Gumtree product pages and saves the extracted information in a CSV file. You can add more product URLs to the product_urls list to scrape multiple pages.

gumtree_product_data.csv Snapshot:

Optimizing Scraping with Crawlbase Smart Proxy

When scraping websites like Gumtree, you may run into rate limits or IP bans. To keep your scraper running smoothly and efficiently, use Crawlbase Smart Proxy. This service helps you bypass restrictions and improve your scraping.

Benefits of Crawlbase Smart Proxy

Avoid IP Blocking: Crawlbase rotates IP addresses so your requests are anonymous and you won’t get blocked.
CAPTCHA Handling: It handles CAPTCHA challenges for you so you can scrape without interruptions.
Faster Scraping: By using multiple IPs you can make requests quickly and gather data faster.
Geolocation: Choose proxies from specific locations to scrape localized data and get more relevant results.

Integrating Crawlbase Smart Proxy

To use Crawlbase Smart Proxy in your Gumtree scraper, set up your requests to route through the proxy. Here’s an example of how to do this:

import requests
from bs4 import BeautifulSoup

# Replace '_USER_TOKEN_' with your Crawlbase token
proxy_url = "http://_USER_TOKEN_@smartproxy.crawlbase.com:8012"
proxies = {"http": proxy_url, "https": proxy_url}

def scrape_gumtree_product_page(url):
    response = requests.get(url, proxies=proxies, verify=False)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extracting product details
        title = soup.select_one('h1[data-q="vip-title"]').text.strip()
        price = soup.select_one('h3[data-q="ad-price"]').text.strip()
        description = soup.select_one('p[itemprop="description"]').text.strip()
        seller_name = soup.select_one('h2.seller-rating-block-name').text.strip()
        images_url = [img['src'] for img in soup.select('div[data-testid="carousel"] img') if 'src' in img.attrs]


        return {
            'title': title,
            'price': price,
            'description': description,
            'seller_name': seller_name,
            'images_url': images_url,
            'product_url': url
        }
    else:
        print(f"Failed to retrieve data. Status code: {response.status_code}")
        return None

# Example usage
if __name__ == "__main__":
    product_url = "https://www.gumtree.com/product-page-url"  # Replace with the actual product URL
    product_data = scrape_gumtree_product_page(product_url)
    print(product_data)

In this code snippet, replace '_USER_TOKEN_' with your actual Crawlbase token. You can get one by creating an account on Crawlbase. The proxies dictionary routes your requests through the Crawlbase Smart Proxy, helping you avoid blocks and maintain fast scraping speeds.
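
Rather than hardcoding the token in the script, you may prefer to read it from an environment variable. The following is a minimal sketch: build_proxies is a hypothetical helper, CRAWLBASE_TOKEN is a variable name chosen for this example, and the host and port are taken from the snippet above.

```python
import os


def build_proxies(token, host='smartproxy.crawlbase.com', port=8012):
    """Build the proxies mapping that requests expects for Smart Proxy."""
    if not token:
        raise ValueError('A Crawlbase token is required')
    proxy_url = f'http://{token}@{host}:{port}'
    return {'http': proxy_url, 'https': proxy_url}


# CRAWLBASE_TOKEN is a hypothetical variable name chosen for this sketch
token = os.environ.get('CRAWLBASE_TOKEN')
proxies = build_proxies(token) if token else None
```

If the variable is not set, proxies is None and requests falls back to a direct connection.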

By optimizing your Gumtree scraping process with Crawlbase Smart Proxy, you can gather data more effectively and handle larger volumes without facing common web scraping issues.

Optimize Gumtree Scraping with Crawlbase

Scraping Gumtree data can be very useful for your projects. In this blog we have shown how to scrape search listings and product pages using Python. By inspecting the HTML and using the Requests library you can extract useful data such as titles, prices and descriptions.

Make sure your scraping runs smoothly by using tools like Crawlbase Smart Proxy. It will help you avoid IP blocks and maintain fast scraping speeds so you can focus on getting the data you need.

If you’re interested in exploring scraping from other e-commerce platforms, feel free to explore the following comprehensive guides.

📜 How to Scrape Amazon
📜 How to scrape Walmart
📜 How to Scrape AliExpress
📜 How to Scrape Houzz Data
📜 How to Scrape Tokopedia

Contact our support if you have any questions. Happy scraping.

Frequently Asked Questions

Q. Is it legal to scrape data from Gumtree?

Yes, it is generally legal to scrape Gumtree data as long as you comply with their terms of service. Check the website’s policies to make sure you’re not breaking any rules, and use the scraped data responsibly and ethically.

Q. What data can I scrape from Gumtree?

You can scrape various types of data from Gumtree, including product titles, prices, descriptions, images, and seller information. This data can help you analyze market trends or compare prices across different listings.

Q. How can I avoid getting blocked while scraping?

To avoid getting blocked while scraping, consider using a rotating proxy service like Crawlbase Smart Proxy. This will help you manage IP addresses so your scraping looks like regular user behavior. Also, implement delays between requests to reduce the chances of getting blocked.

How to Scrape Tokopedia Data

2024-10-15T10:00:00.000Z

Tokopedia, one of Indonesia’s biggest e-commerce platforms, has 90+ million active users and 350 million monthly visits. The platform offers a wide range of products, from electronics and fashion to groceries and personal care. For businesses and developers, scraping Tokopedia data can provide insights into product trends, pricing strategies, and customer preferences.

Because Tokopedia uses JavaScript to render its content, traditional scraping methods don’t work. The Crawlbase Crawling API helps by handling JavaScript-rendered content seamlessly. In this tutorial, you’ll learn how to use Python and Crawlbase to scrape Tokopedia search listings and product pages for product names, prices, and ratings.

Let’s get started!

Why Scrape Tokopedia Data?
Key Data Points to Extract from Tokopedia
Crawlbase Crawling API for Tokopedia Scraping

Crawlbase Python Library

Setting Up Your Python Environment

Installing Python and Required Libraries
Choosing an IDE

Scraping Tokopedia Search Listings

Inspecting the HTML for CSS Selectors
Writing the Search Listings Scraper
Handling Pagination in Tokopedia
Storing Data in a JSON File
Complete Code

Scraping Tokopedia Product Pages

Inspecting the HTML for CSS Selectors
Writing the Product Page Scraper
Storing Data in a JSON File
Complete Code

Final Thoughts
Frequently Asked Questions

Why Scrape Tokopedia Data?

Scraping Tokopedia data can be beneficial for businesses and developers. As one of Indonesia’s biggest e-commerce platforms, Tokopedia holds a lot of information about products, prices, and customer behavior. By extracting this data, you can get ahead in the online market.

There are many reasons why one would choose to scrape data from Tokopedia:

Market Research: Knowing the current demand will help you with inventory and marketing planning. Opportunities can often be found by looking at general trends.
Price Comparison: Scraping Tokopedia lets you collect prices for products across various categories, so you can adjust your own prices to remain competitive.
Competitor Analysis: Compiling data about competitors’ products will help you understand how they position themselves and where their weak points are.
Customer Insights: Looking into product reviews and ratings will help you understand the major pros and cons of various goods from the customers’ point of view.
Product Availability: Monitor products so that you know when popular ones are running low, and restock in time to keep customers satisfied.

In the next section we will see what we can scrape from Tokopedia.

Key Data Points to Extract from Tokopedia

When scraping Tokopedia, focus on the important data points and you’ll get actionable insights for your business or research. Here are the data points to grab:

Product Name: Identifies the product.
Price: For price monitoring and competition analysis.
Ratings and Reviews: For user experience and product usability.
Availability: For stock level and product availability.
Seller Information: Details on third-party vendors, seller ratings and location.
Product Images: Images for visual representation and understanding of the product.
Product Description: For the details of the product.
Category and Tags: For arrangement of products and categorized analysis.

Concentrating on these data points lets you collect useful insights from Tokopedia and make better decisions. Next, we will look at how the Crawlbase Crawling API helps with scraping Tokopedia.

Crawlbase Crawling API for Tokopedia Scraping

The Crawlbase Crawling API makes scraping Tokopedia fast and straightforward. Since Tokopedia’s website uses dynamic content, much of the data is loaded via JavaScript, making it challenging to scrape with traditional methods. But Crawlbase Crawling API renders the pages like a real browser so you can access the data.

Here’s why Crawlbase Crawling API is good for scraping Tokopedia:

Handles Dynamic Content: Crawlbase handles JavaScript heavy pages so all product data is fully loaded and ready to scrape.
IP Rotation: To prevent getting blocked by Tokopedia’s security systems, Crawlbase automatically rotates IPs, letting you scrape without worrying about rate limits or bans.
Fast Performance: Crawlbase allows you to efficiently scrape massive amounts of data while saving time and resources.
Customizable Requests: You can change the headers, cookies and control requests to fit your needs.

With these features, Crawlbase Crawling API makes scraping Tokopedia easier and more efficient.

Crawlbase Python Library

Crawlbase also provides a Python library to make web scraping even easier. To use this library you will need an access token that you can get by signing up to Crawlbase.

Here’s an example function to send a request to Crawlbase Crawling API:

from crawlbase import CrawlingAPI

# Initialize Crawlbase API with your access token
crawling_api = CrawlingAPI({ 'token': 'YOUR_CRAWLBASE_TOKEN' })

def make_crawlbase_request(url):
    response = crawling_api.get(url)

    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        return html_content
    else:
        print(f"Failed to fetch the page. Crawlbase status code: {response['headers']['pc_status']}")
        return None

Note: Crawlbase provides two types of tokens: a Normal Token for static sites and a JavaScript (JS) Token for dynamic or browser-rendered content. The JS Token is required for scraping Tokopedia. Crawlbase offers 1,000 free requests to help you get started, and you can sign up without a credit card. For more details, refer to the Crawlbase Crawling API documentation.

In the next section, we’ll learn how to set up a Python environment for Tokopedia scraping.

Setting Up Your Python Environment

To start scraping Tokopedia, you need to set up your Python environment. Follow these steps to get started:

Installing Python and Required Libraries

Make sure Python is installed on your machine. You can download it from the official Python website. After installation, run the following command to install the necessary libraries:

pip install crawlbase beautifulsoup4

Crawlbase: For interacting with the Crawlbase Crawling API to handle dynamic content.
BeautifulSoup: For parsing and extracting data from HTML.

These tools are essential for scraping Tokopedia’s data efficiently.

Selecting an IDE

Choose an IDE for seamless development:

Visual Studio Code: Lightweight and frequently used.
PyCharm: A full-featured IDE with powerful Python capabilities.
Jupyter Notebook: Ideal for interactive coding and testing.

Once your environment is set up, you can begin scraping Tokopedia. Next, we’ll cover how to create a Tokopedia search listings scraper.

Scraping Tokopedia Search Listings

Now that you have your Python environment ready, we can start scraping Tokopedia’s search listings. In this section, we’ll guide you through inspecting the HTML, writing the scraper, handling pagination and storing the data in a JSON file.

Inspecting the HTML Structure

First, you need to inspect the HTML of the Tokopedia search results page from which you want to scrape product listings. For this example, we’ll be scraping headset listings from the following URL:

https://www.tokopedia.com/search?q=headset

Open the developer tools in your browser and navigate to this URL.

Here are some key selectors to focus on:

Product Title: Found in a <span> tag with class OWkG6oHwAppMn1hIBsC3pQ== which contains the name of the product.
Price: In a <div> tag with class ELhJqP-Bfiud3i5eBR8NWg== that displays the product price.
Store Name: Found in a <span> tag with class X6c-fdwuofj6zGvLKVUaNQ==.
Product Link: The product page link is found in an <a> tag with class Nq8NlC5Hk9KgVBJzMYBUsg==, accessible via the href attribute.
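
The class names above end in ==, and = is not a valid character in a bare CSS class selector, so it has to be backslash-escaped before the selector is handed to BeautifulSoup. Here is a minimal sketch of a helper that builds such selectors. The name css_class_selector is hypothetical and not part of any library.

```python
def css_class_selector(tag, class_name):
    """Build a `tag.class` CSS selector, backslash-escaping '=' characters.

    Hypothetical helper for illustration; only '=' is escaped because it is
    the only special character in the class names listed above.
    """
    escaped = class_name.replace('=', '\\=')
    return f"{tag}.{escaped}"


title_selector = css_class_selector('span', 'OWkG6oHwAppMn1hIBsC3pQ==')
print(title_selector)
```

The resulting string is the same selector that the scraper code writes by hand with doubled backslashes.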

Writing the Search Listings Scraper

We’ll write a function that makes a request to the Crawlbase Crawling API, retrieves the HTML, and then parses the data using BeautifulSoup.

Here’s the code to scrape the search listings:

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import json

# Initialize Crawlbase API with your access token
crawling_api = CrawlingAPI({ 'token': 'YOUR_CRAWLBASE_TOKEN' })

# Function to get HTML content from Crawlbase
def fetch_html(url):
    options = {
        'ajax_wait': 'true',
        'page_wait': '5000'
    }
    response = crawling_api.get(url, options)
    if response['headers']['pc_status'] == '200':
        return response['body'].decode('utf-8')
    else:
        print(f"Failed to fetch page. Status code: {response['headers']['pc_status']}")
        return None

# Function to parse and extract product data
def parse_search_listings(html):
    soup = BeautifulSoup(html, 'html.parser')
    products = []

    for product in soup.select('div[data-testid="divSRPContentProducts"] div.css-5wh65g'):
        name = product.select_one('span.OWkG6oHwAppMn1hIBsC3pQ\\=\\=').text.strip() if product.select_one('span.OWkG6oHwAppMn1hIBsC3pQ\\=\\=') else 'N/A'
        price = product.select_one('div.ELhJqP-Bfiud3i5eBR8NWg\\=\\=').text.strip() if product.select_one('div.ELhJqP-Bfiud3i5eBR8NWg\\=\\=') else 'N/A'
        store = product.select_one('span.X6c-fdwuofj6zGvLKVUaNQ\\=\\=').text.strip() if product.select_one('span.X6c-fdwuofj6zGvLKVUaNQ\\=\\=') else 'N/A'
        product_url = product.select_one('a.Nq8NlC5Hk9KgVBJzMYBUsg\\=\\=')['href'] if product.select_one('a.Nq8NlC5Hk9KgVBJzMYBUsg\\=\\=') else 'N/A'

        products.append({
            'name': name,
            'price': price,
            'store': store,
            'product_url': product_url
        })

    return products

The fetch_html function retrieves the HTML through the Crawlbase Crawling API, and parse_search_listings then uses BeautifulSoup to extract the product information.

Handling Pagination in Tokopedia

Tokopedia’s search results are spread across multiple pages. To scrape all listings, we need to handle pagination. Each subsequent page can be accessed by appending a page parameter to the search URL, such as &page=2.

Here’s how to handle pagination:

# Function to scrape multiple pages of search listings
def scrape_multiple_pages(base_url, max_pages):
    all_products = []

    for page in range(1, max_pages + 1):
        paginated_url = f"{base_url}&page={page}"
        html_content = fetch_html(paginated_url)

        if html_content:
            products = parse_search_listings(html_content)
            all_products.extend(products)
        else:
            break

    return all_products

This function loops through the search result pages, scrapes the product listings from each page, and aggregates the results.
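
The f-string in scrape_multiple_pages assumes that base_url already contains a query string. A small sketch of a more defensive alternative using the standard library's urllib.parse is shown here. build_page_url is a hypothetical helper name, and the page parameter name is taken from the code above.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse


def build_page_url(base_url, page):
    """Return base_url with its `page` query parameter set to `page`."""
    parts = urlparse(base_url)
    # Keep existing parameters (e.g. q=headset) but drop any old page value
    params = [(k, v) for k, v in parse_qsl(parts.query) if k != 'page']
    params.append(('page', str(page)))
    return urlunparse(parts._replace(query=urlencode(params)))


print(build_page_url('https://www.tokopedia.com/search?q=headset', 2))
```

Inside the loop, paginated_url = build_page_url(base_url, page) could replace the f-string if you want URLs without a query string to work as well.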

Storing Data in a JSON File

After scraping the data, you can store it in a JSON file for easy access and future use. Here’s how you can do it:

# Function to save data to a JSON file
def save_to_json(data, filename='tokopedia_search_results.json'):
    with open(filename, 'w') as json_file:
        json.dump(data, json_file, indent=4)
    print(f"Data saved to {filename}")

Complete Code Example

Below is the complete code to scrape Tokopedia search listings for headsets, including pagination and saving the data to a JSON file:

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import json

# Initialize Crawlbase API with your access token
crawling_api = CrawlingAPI({ 'token': 'YOUR_CRAWLBASE_TOKEN' })

def fetch_html(url):
    options = {
        'ajax_wait': 'true',
        'page_wait': '5000'
    }
    response = crawling_api.get(url, options)
    if response['headers']['pc_status'] == '200':
        return response['body'].decode('utf-8')
    else:
        print(f"Failed to fetch page. Status code: {response['headers']['pc_status']}")
        return None

def parse_search_listings(html):
    soup = BeautifulSoup(html, 'html.parser')
    products = []

    for product in soup.select('div[data-testid="divSRPContentProducts"] div.css-5wh65g'):
        name = product.select_one('span.OWkG6oHwAppMn1hIBsC3pQ\\=\\=').text.strip() if product.select_one('span.OWkG6oHwAppMn1hIBsC3pQ\\=\\=') else 'N/A'
        price = product.select_one('div.ELhJqP-Bfiud3i5eBR8NWg\\=\\=').text.strip() if product.select_one('div.ELhJqP-Bfiud3i5eBR8NWg\\=\\=') else 'N/A'
        store = product.select_one('span.X6c-fdwuofj6zGvLKVUaNQ\\=\\=').text.strip() if product.select_one('span.X6c-fdwuofj6zGvLKVUaNQ\\=\\=') else 'N/A'
        product_url = product.select_one('a.Nq8NlC5Hk9KgVBJzMYBUsg\\=\\=')['href'] if product.select_one('a.Nq8NlC5Hk9KgVBJzMYBUsg\\=\\=') else 'N/A'

        products.append({
            'name': name,
            'price': price,
            'store': store,
            'product_url': product_url
        })

    return products

def scrape_multiple_pages(base_url, max_pages):
    all_products = []

    for page in range(1, max_pages + 1):
        paginated_url = f"{base_url}&page={page}"
        html_content = fetch_html(paginated_url)

        if html_content:
            products = parse_search_listings(html_content)
            all_products.extend(products)
        else:
            break

    return all_products

def save_to_json(data, filename='tokopedia_search_results.json'):
    with open(filename, 'w') as json_file:
        json.dump(data, json_file, indent=4)
    print(f"Data saved to {filename}")

# Scraping data from Tokopedia search listings
base_url = 'https://www.tokopedia.com/search?q=headset'
max_pages = 5  # Adjust the number of pages you want to scrape
search_results = scrape_multiple_pages(base_url, max_pages)

# Save results to a JSON file
save_to_json(search_results)

Example Output:

[
    {
        "name": "Ipega PG-R008 Gaming Headset for P4 /X1 series/N-Switch Lite/Mobile/ta",
        "price": "Rp178.000",
        "store": "ipegaofficial",
        "product_url": "https://www.tokopedia.com/ipegaofficial/ipega-pg-r008-gaming-headset-for-p4-x1-series-n-switch-lite-mobile-ta?extParam=ivf%3Dfalse&src=topads"
    },
    {
        "name": "Hippo Toraz Handsfree Earphone Stereo Sound - headset, Putih",
        "price": "Rp13.000",
        "store": "HippoCenter",
        "product_url": "https://www.tokopedia.com/hippocenter88/hippo-toraz-handsfree-earphone-stereo-sound-headset-putih?extParam=ivf%3Dfalse&src=topads"
    },
    {
        "name": "HEADSET ORIGINAL COPOTAN VIVO OPPO XIAOMI REALMI JACK 3.5MM SUPERBASS - OPPO",
        "price": "Rp5.250",
        "store": "BENUAACELL",
        "product_url": "https://www.tokopedia.com/bcbenuacell/headset-original-copotan-vivo-oppo-xiaomi-realmi-jack-3-5mm-superbass-oppo?extParam=ivf%3Dfalse&src=topads"
    },
    {
        "name": "earphone bluetooth wireless headset gaming full bass",
        "price": "Rp225.000",
        "store": "Kopi 7 Huruf",
        "product_url": "https://www.tokopedia.com/kopi7huruf/earphone-bluetooth-wireless-headset-gaming-full-bass?extParam=ivf%3Dfalse&src=topads"
    },
    {
        "name": "Earphone In-Ear 4D Stereo Super Bass dengan Mic with Kabel Jack 3.5mm Headset Crystal-Clear Sound - Putih",
        "price": "Rp15.000Rp188.000",
        "store": "MOCUTE STORE",
        "product_url": "https://www.tokopedia.com/mocutestore/earphone-in-ear-4d-stereo-super-bass-dengan-mic-with-kabel-jack-3-5mm-headset-crystal-clear-sound-putih-97573?extParam=ivf%3Dtrue&src=topads"
    },
    .... more
]
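
The price field comes back as display text such as "Rp178.000", and one listing above contains two amounts run together ("Rp15.000Rp188.000"). Here is a minimal sketch for turning these strings into integers, assuming the dot is only ever a thousands separator. parse_rupiah_prices is a hypothetical helper name.

```python
import re


def parse_rupiah_prices(price_text):
    """Extract every 'Rp...' amount in the text as an integer number of rupiah."""
    amounts = re.findall(r'Rp\s*(\d[\d.]*)', price_text)
    # Assumption: '.' is used only as a thousands separator in these strings
    return [int(amount.replace('.', '')) for amount in amounts]


print(parse_rupiah_prices('Rp178.000'))
print(parse_rupiah_prices('Rp15.000Rp188.000'))
```

The output does not say which of two joined amounts is the current price, so the sketch returns both and leaves that decision to the caller.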

In the next section, we’ll cover scraping individual product pages on Tokopedia to get detailed information.

Scraping Tokopedia Product Pages

Now that we have scraped search listings, let’s move on to scraping product details from individual product pages. In this section, we will scrape product name, price, store name, description and image URL from a Tokopedia product page.

Inspecting the HTML for CSS Selectors

Before we write the scraper, we need to inspect the HTML structure of the product page to find the correct CSS selectors for the data we want to scrape. For this example, we’ll scrape the product page from the following URL:

https://www.tokopedia.com/thebigboss/headset-bluetooth-tws-earphone-bluetooth-stereo-bass-tbb250-beige-8d839

Open the developer tools in your browser and navigate to this URL.

Here’s what we need to focus on:

Product Name: Found in an <h1> tag with the attribute data-testid="lblPDPDetailProductName".
Price: The price is located in a <div> tag with the attribute data-testid="lblPDPDetailProductPrice".
Store Name: The store name is inside an <a> tag with the attribute data-testid="llbPDPFooterShopName".
Product Description: Located in a <div> tag with the attribute data-testid="lblPDPDescriptionProduk" which contains detailed information about the product.
Images URL: The main product image is found within a <button> tag with the attribute data-testid="PDPImageThumbnail", in an <img> tag with class css-1c345mg.

Writing the Product Page Scraper

Now that we have inspected the page, we can start writing the scraper. Below is a Python function that uses the Crawlbase Crawling API to fetch the HTML and BeautifulSoup to parse the content.

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import json

# Initialize Crawlbase API with your token
crawling_api = CrawlingAPI({ 'token': 'YOUR_CRAWLBASE_TOKEN' })

def scrape_product_page(url):
    options = {
        'ajax_wait': 'true',
        'page_wait': '5000'
    }
    response = crawling_api.get(url, options)

    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        soup = BeautifulSoup(html_content, 'html.parser')

        # Extracting Product Data
        product_data = {}
        product_data['name'] = soup.select_one('h1[data-testid="lblPDPDetailProductName"]').text.strip()
        product_data['price'] = soup.select_one('div[data-testid="lblPDPDetailProductPrice"]').text.strip()
        product_data['store_name'] = soup.select_one('a[data-testid="llbPDPFooterShopName"]').text.strip()
        product_data['description'] = soup.select_one('div[data-testid="lblPDPDescriptionProduk"]').text.strip()
        product_data['images_url'] = [img['src'] for img in soup.select('button[data-testid="PDPImageThumbnail"] img.css-1c345mg')]

        return product_data
    else:
        print(f"Failed to fetch the page. Status code: {response['headers']['pc_status']}")
        return None
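
The scraper above calls .text directly on the result of select_one, which raises an AttributeError if a selector stops matching and select_one returns None. Here is a minimal sketch of a guard, using a hypothetical helper named text_or_default:

```python
def text_or_default(node, default='N/A'):
    """Return the stripped text of a parsed element, or a default if it is missing.

    `node` is expected to be whatever soup.select_one(...) returned:
    an element with a .text attribute, or None when nothing matched.
    """
    if node is None:
        return default
    return node.text.strip()


# Example use inside scrape_product_page (soup comes from BeautifulSoup):
# product_data['name'] = text_or_default(
#     soup.select_one('h1[data-testid="lblPDPDetailProductName"]'))
```

With this guard, a missing field shows up as 'N/A' in the output, which matches the fallback value the search listings scraper uses.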

Storing Data in a JSON File

After scraping the product details, it’s good practice to store the data in a structured format like JSON. Here’s how to write the scraped data into a JSON file.

def store_data_in_json(data, filename='tokopedia_product_data.json'):
    with open(filename, 'w') as json_file:
        json.dump(data, json_file, indent=4)
    print(f"Data stored in {filename}")
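
By default, json.dump escapes non-ASCII characters as \uXXXX sequences, so product descriptions containing symbols or emoji are harder to read in the saved file. If you would rather keep them as-is, one option is sketched here. store_data_in_json_utf8 is a hypothetical variant of the function above.

```python
import json


def store_data_in_json_utf8(data, filename='tokopedia_product_data.json'):
    """Write data as JSON, keeping non-ASCII characters unescaped."""
    # encoding='utf-8' makes the file encoding explicit instead of
    # relying on the platform default
    with open(filename, 'w', encoding='utf-8') as json_file:
        json.dump(data, json_file, indent=4, ensure_ascii=False)
    print(f"Data stored in {filename}")
```

Both variants produce valid JSON that loads back to the same data, so the choice only affects how the file looks in an editor.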

Complete Code Example

Here’s the complete code that scrapes the product page and stores the data in a JSON file.

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import json

# Initialize Crawlbase API with your token
crawling_api = CrawlingAPI({ 'token': 'YOUR_CRAWLBASE_TOKEN' })

# Function to scrape Tokopedia product page
def scrape_product_page(url):
    options = {
        'ajax_wait': 'true',
        'page_wait': '5000'
    }
    response = crawling_api.get(url, options)

    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        soup = BeautifulSoup(html_content, 'html.parser')

        # Extracting Product Data
        product_data = {}
        product_data['name'] = soup.select_one('h1[data-testid="lblPDPDetailProductName"]').text.strip()
        product_data['price'] = soup.select_one('div[data-testid="lblPDPDetailProductPrice"]').text.strip()
        product_data['store_name'] = soup.select_one('a[data-testid="llbPDPFooterShopName"]').text.strip()
        product_data['description'] = soup.select_one('div[data-testid="lblPDPDescriptionProduk"]').text.strip()
        product_data['images_url'] = [img['src'] for img in soup.select('button[data-testid="PDPImageThumbnail"] img.css-1c345mg')]

        return product_data
    else:
        print(f"Failed to fetch the page. Status code: {response['headers']['pc_status']}")
        return None

# Function to store scraped data in a JSON file
def store_data_in_json(data, filename='tokopedia_product_data.json'):
    with open(filename, 'w') as json_file:
        json.dump(data, json_file, indent=4)
    print(f"Data stored in {filename}")

# Scraping product page and saving data
url = 'https://www.tokopedia.com/thebigboss/headset-bluetooth-tws-earphone-bluetooth-stereo-bass-tbb250-beige-8d839'
product_data = scrape_product_page(url)

if product_data:
    store_data_in_json(product_data)

Example Output:

{
  "name": "headset bluetooth tws earphone bluetooth stereo bass tbb250 - Beige",
  "price": "Rp299.000",
  "store_name": "The Big Boss 17",
  "description": "1.Efek suara surround Audio 6D DirectionalMenggunakan teknologi konduksi udara, suara headset Bluetooth ini diarahkan ke telinga Anda, secara efektif mengurangi 90% kebocoran suara, sekaligus menjaga liang telinga tetap segar dan menghindari rasa malu di tempat umum.2.Buka desain non-in-earDesain memakai gaya anting-anting; berlari, menari, bermain skateboard, bersepeda, dan tantangan olahraga intensitas tinggi lainnya, kenyamanan pemakaian jangka panjang yang nyata, tidak ada perasaan memakai, dan tidak dapat dibuang.3.Menggunakan bahan silikon lembut, bahannya sangat ringan, dan berat masing-masing telinga hanya 4,5g, yang dapat mengurangi tekanan pada telinga; dapat diregangkan hingga 75\u00b0, sehingga lebih nyaman dipakai4.Bluetooth 5.3Chip Bluetooth generasi baru dapat mengurangi penundaan mendengarkan musik dan menonton video. Koneksi stabil dalam jarak 10 meter, koneksi langsung dalam 1 detik setelah membuka penutup.5.Desain sentuh cerdasIni dapat dioperasikan dengan satu tangan, dan sentuhannya sensitif dan nyaman; ganti lagu kapan saja, jawab panggilan, asisten panggilan, dan kendalikan dengan bebas6.Panggilan peredam bising dua arahMikrofon peredam bising bawaan dapat secara efektif memfilter suara sekitar selama panggilan, mengidentifikasi suara manusia secara akurat, dan membuat setiap percakapan Anda lebih jelas.7. IPx5 tahan air\ud83d\udca7Tingkat tahan air IPx5, efektif menahan keringat dan tetesan hujan kecil, jangan khawatir berkeringat atau hujan.8.Baterai tahan lama\ud83d\udd0bBaterai earphone juga dapat digunakan selama 5 jam, dan waktu siaga hingga 120 jam, memberi Anda waktu mendengarkan yang lebih lamaDaftar aksesori headphone* Earphone x 2 (kiri & kanan)* Kotak pengisian daya* Kabel pengisi daya USB-C* Panduan Cepat & Garansi",
  "images_url": [
    "https://images.tokopedia.net/img/cache/100-square/VqbcmM/2024/7/28/3119dca0-2d66-45d7-b6a1-445d0782b15a.jpg.webp?ect=4g",
    "https://images.tokopedia.net/img/cache/100-square/VqbcmM/2024/7/28/9d9ddcff-7f52-43cc-8271-c5e135de392b.jpg.webp?ect=4g",
    "https://images.tokopedia.net/img/cache/100-square/VqbcmM/2024/7/28/d35975e6-222c-4264-b9f2-c2eacf988401.jpg.webp?ect=4g",
    "https://images.tokopedia.net/img/cache/100-square/VqbcmM/2024/7/28/5aba89e3-a37a-4e3a-b1f8-429a68190817.jpg.webp?ect=4g",
    "https://images.tokopedia.net/img/cache/100-square/VqbcmM/2024/7/28/c6c3bb2d-3215-4993-b908-95b309b29ddd.jpg.webp?ect=4g",
    "https://assets.tokopedia.net/assets-tokopedia-lite/v2/zeus/kratos/85cc883d.svg",
    "https://assets.tokopedia.net/assets-tokopedia-lite/v2/zeus/kratos/85cc883d.svg",
    "https://assets.tokopedia.net/assets-tokopedia-lite/v2/zeus/kratos/85cc883d.svg",
    "https://assets.tokopedia.net/assets-tokopedia-lite/v2/zeus/kratos/85cc883d.svg",
    "https://assets.tokopedia.net/assets-tokopedia-lite/v2/zeus/kratos/85cc883d.svg",
    "https://assets.tokopedia.net/assets-tokopedia-lite/v2/zeus/kratos/85cc883d.svg",
    "https://assets.tokopedia.net/assets-tokopedia-lite/v2/zeus/kratos/85cc883d.svg"
  ]
}

This complete example shows how to extract product details from a Tokopedia product page and save them to a JSON file. It handles dynamic content, so it works well for scraping data from JavaScript-rendered pages.
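
The images_url list in the example output ends with repeated .svg entries from assets.tokopedia.net. These look like placeholders rather than product photos, though that is an assumption based only on the output above. Here is a sketch of a filter that drops them and removes duplicates, using a hypothetical helper name filter_product_images:

```python
def filter_product_images(urls):
    """Keep unique image URLs in order, skipping '.svg' entries (assumed placeholders)."""
    seen = set()
    kept = []
    for url in urls:
        # Ignore the query string when checking the file extension
        path = url.split('?', 1)[0]
        if path.lower().endswith('.svg') or url in seen:
            continue
        seen.add(url)
        kept.append(url)
    return kept
```

It could be applied to product_data['images_url'] just before the data is stored.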

Optimize Tokopedia Scraping with Crawlbase

Scraping Tokopedia can help you get product data for research, price comparison or market analysis. With the Crawlbase Crawling API, you can navigate dynamic websites and extract data quickly, even from JavaScript-heavy pages.

In this blog, we covered how to set up the environment, find CSS selectors in the HTML, and write the Python code to scrape product listings and product pages from Tokopedia. With the method used in this blog, you can easily collect useful information like product names, prices, descriptions, and images from Tokopedia and store them in a structured format like JSON.

If you’re interested in exploring scraping from other e-commerce platforms, feel free to explore the following comprehensive guides.

📜 How to Scrape Amazon
📜 How to scrape Walmart
📜 How to Scrape AliExpress
📜 How to Scrape Zalando
📜 How to Scrape Costco

Contact our support if you have any questions. Happy scraping.

Frequently Asked Questions

Q. Is it legal to scrape data from Tokopedia?

Scraping data from Tokopedia can be legal as long as you follow their terms of service and use the data responsibly. Always review the website’s rules and avoid scraping sensitive or personal data. It’s important to use the data for ethical purposes, like research or analysis, without violating Tokopedia’s policies.

Q. Why should I use Crawlbase Crawling API for scraping Tokopedia?

Tokopedia uses dynamic content that loads through JavaScript, making it harder to scrape using traditional methods. Crawlbase Crawling API makes this process easier by rendering the website in a real browser. It also controls IP rotation to prevent blockages, making scraping more effective and dependable.

Q. What key data points can I extract from Tokopedia product pages?

When scraping Tokopedia product pages, you can extract several important data points, including the product title, price, description, ratings, and image URLs. These details are useful for analysis, price comparison or building a database of products to understand market trends.

How to Get Around IP Bans in 2024

2024-10-11T13:58:44.000Z

We’ve all been there - trying to get into a website or service to find ourselves locked out by an IP ban. It’s annoying, but don’t stress; we’re here to help. This guide will show you how to bypass an IP ban. To navigate today’s online world, where privacy and access matter more than ever, you need to understand how IP address blocking works.

We’re going to explore different ways to get around IP bans, from easy methods like using a VPN to more complex approaches involving proxy rotation. We’ll also check out how CAPTCHAs and MAC addresses play a part in IP banning and how you can handle them. On top of that, we’ll give you some tips to avoid future IP bans, helping you stay ahead of the game.

This guide has valuable information for everyone from regular internet users to tech experts who want to understand and work around online access limits.

What is an IP ban?

An IP ban is a safety measure that website owners and online service providers use to limit or stop access from specific IP addresses. This approach is often used to guard against different types of harmful activity and to keep users safe online. When an IP address gets banned, any attempts to connect from that address will be turned away, stopping the user from getting into the service.

How IP bans work

IP bans work by spotting and keeping tabs on the unique IP addresses given to devices hooked up to the internet. When someone visits a website, the system checks if their IP address is on the blacklist. If it is, the system turns down the user’s request to access, and they can’t interact with the server. IP bans can be temporary or permanent, based on how bad the rule-breaking was and what rules the service provider has in place.
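
The check described above can be sketched with Python's standard ipaddress module. This is an illustration only. The blocklist entries are documentation-range addresses chosen for the example, and real services usually enforce such lists in a firewall or database rather than in application code.

```python
import ipaddress

# Hypothetical blocklist: one single address and one whole network
BLOCKLIST = [
    ipaddress.ip_network('203.0.113.7/32'),
    ipaddress.ip_network('198.51.100.0/24'),
]


def is_banned(client_ip):
    """Return True if client_ip falls inside any blocked network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BLOCKLIST)


print(is_banned('198.51.100.42'))  # inside the blocked /24 network
print(is_banned('192.0.2.10'))     # not on the blocklist
```

Blocking a whole network, as in the second entry, is one reason a user can be locked out without having done anything themselves.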

Common reasons why People Experience IP bans

There are several reasons why a service might ban an IP address:

Breaking the rules: Doing things the website says you can’t, like sending spam or being mean to others.
Odd behavior: Computer systems often spot strange actions or possible security risks.
Location limits: Some online services only work in certain places.
Using too much: Websites might block IP addresses that send far more requests or use far more bandwidth than normal.
Scraping data: Trying to take information from a website without asking first.

Smart Ways to Get Around IP Bans

VPNs

One of the best ways to get around an IP ban is to use a Virtual Private Network (VPN). A VPN hides your actual IP address and gives you one from its server network. This lets you get past IP bans and see blocked content. When you hook up to a VPN, it sends your internet traffic through a coded tunnel, which makes it hard for websites to figure out where you are or who you are. To use a VPN to beat IP bans, just pick a good VPN service, put the app on your device, choose where you want your server to be, and connect. Once you’re connected, you can visit websites that were blocked before.

Proxy servers

Another good way to get around IP bans is to use proxy servers. Proxies stand between your device and the internet, concealing your actual IP address. They work well for web scraping and handling multiple profiles. Rotating residential proxies are especially useful because they use real IP addresses from home devices worldwide, making your traffic look more like a human’s and less suspicious. You can set up proxies, scale them, and use them to target specific locations to bypass geo-blocking. But keep in mind that you might need to maintain them, and they may not protect you as well as VPNs if you choose cheaper options like data center proxies.

Changing your IP address

If you’re tech-savvy, you can change your IP address to get around IP bans. This involves either tweaking your router settings to ask for a new IP address or disconnecting and reconnecting your internet to trigger an IP change. While this approach doesn’t cost anything extra or need setup, it can take up a lot of time if you do it often and doesn’t offer any extra privacy features.

Advanced Strategies to Deal with IP Bans

Here are some advanced ways to deal with stringent IP bans:

Avoid browser fingerprinting

When you’re up against ongoing IP bans, you need to know that many websites use browser fingerprinting to spot and block users. This method gathers info about your operating system, browser type, screen resolution, and other details to create a unique digital fingerprint. To bypass an IP ban, we should focus on dodging these fingerprinting techniques.

One way to do this is to use special tools that change your browser’s fingerprint. These tools can tweak the data points websites use to identify you, making it tougher for them to keep tabs on and ban what you’re doing. But keep in mind this approach needs some tech know-how and might affect how you browse.
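
To make the idea of a fingerprint concrete, here is a rough sketch of how a handful of attributes can be combined into one identifier. It is a simplification for illustration. Real fingerprinting scripts collect many more signals, and the attribute names below are hypothetical.

```python
import hashlib
import json


def fingerprint(attributes):
    """Hash a dict of browser attributes into a short, stable identifier."""
    # sort_keys makes the result independent of the order of the dict keys
    canonical = json.dumps(attributes, sort_keys=True)
    return hashlib.sha256(canonical.encode('utf-8')).hexdigest()[:16]


visitor = {'os': 'Windows 11', 'browser': 'Firefox 130', 'screen': '1920x1080'}
same_visitor_new_ip = dict(visitor)
changed_screen = {**visitor, 'screen': '1366x768'}

print(fingerprint(visitor) == fingerprint(same_visitor_new_ip))
print(fingerprint(visitor) == fingerprint(changed_screen))
```

The identifier does not depend on the IP address at all, which is why changing your IP alone may not be enough. Changing even one attribute produces a different value, and that is the effect the fingerprint-changing tools rely on.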

Residential proxies

Residential proxies have become a popular way to get around persistent IP bans. These proxies use IP addresses that Internet Service Providers (ISPs) give to homeowners, making them look like regular users. This method reduces the chances of getting blocked because residential IPs are less likely to show up in bot-tracking databases.

When you use residential proxies, think about using a proxy rotation strategy. This means switching between multiple proxy servers, which makes it seem like multiple users are accessing from different places. This technique helps lower the risk of detection and makes it easier to get around IP bans.

Crawlbase’s Smart Proxy utilizes millions of residential proxies to outsmart IP detection and avoid bans from websites.

Best Practices to Avoid Future IP Bans

Respecting website policies

To steer clear of an IP ban and dodge future problems, it’s key to follow website rules. We should get to know the target site’s terms of service, community rules, and acceptable use policies. By sticking to these guidelines, we can cut down the chance of getting banned. One big thing is to check and obey the robots.txt file, which tells us which parts of the website we shouldn’t process or scan. Even though ignoring these rules might not cause an instant IP ban, it’s seen as good practice to follow them.
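As a rough illustration of the robots.txt check described above, here is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt content and the example.com URLs are hypothetical and inlined so the snippet runs offline; the helper names `build_parser` and `is_allowed` are ours, not part of any library.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable parser object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def is_allowed(parser: RobotFileParser, url: str, user_agent: str = "*") -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)

parser = build_parser(SAMPLE_ROBOTS_TXT)
print(is_allowed(parser, "https://example.com/products/1"))
print(is_allowed(parser, "https://example.com/private/data"))
```

Against a live site you would typically call `parser.set_url(...)` with the site's robots.txt address followed by `parser.read()` instead of parsing an inline string.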

Putting rate limits in place

Putting rate limiting into action is crucial to get around IP bans and keep access to websites. We need to add delays between requests to copy human browsing habits. This method helps stop servers from getting too many requests.
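One simple way to add such delays is to sleep for a randomized interval between requests. The sketch below assumes a hypothetical `fetch` callable that you supply (for example, a thin wrapper around your HTTP client); the delay bounds are illustrative defaults, not recommended values for any particular site.

```python
import random
import time
from typing import Callable, Iterable, List

def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    min_delay: float = 1.0,
    max_delay: float = 3.0,
) -> List[str]:
    """Call fetch(url) for each URL, pausing a random interval between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Randomized pause so requests do not arrive at a fixed rhythm.
            time.sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

For instance, `fetch_politely(urls, lambda u: requests.get(u).text)` would space out a batch of page downloads made with the requests library.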

Using multiple IP addresses

To dodge an IP ban, using several IP addresses through proxy rotation is a strong approach. This method involves switching between different IP addresses for each request, making it tougher for websites to spot and block our actions. But it’s key to use dependable proxy solutions when doing automated jobs like web scraping or managing social media. By putting these best practices into action, we can lower the chance of IP bans and keep access to the online resources we want.
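A basic round-robin rotation can be sketched as follows. The proxy addresses are hypothetical placeholders, and `proxy_rotation` is an illustrative helper of our own; it yields dictionaries in the shape the requests library accepts for its `proxies` argument.

```python
from itertools import cycle
from typing import Dict, Iterator, List

# Hypothetical proxy endpoints; replace with ones from your provider.
PROXY_POOL: List[str] = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
    "http://proxy3.example.com:8000",
]

def proxy_rotation(pool: List[str]) -> Iterator[Dict[str, str]]:
    """Yield proxy settings in round-robin order, one per request."""
    for proxy_url in cycle(pool):
        yield {"http": proxy_url, "https": proxy_url}

rotation = proxy_rotation(PROXY_POOL)
for _ in range(4):
    # The fourth value wraps around to the first proxy again.
    print(next(rotation)["http"])
```

Each request would then pick up the next entry, for example `requests.get(url, proxies=next(rotation))`. A managed rotating proxy service does this switching for you, so a hand-rolled pool is mainly useful for small experiments.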

Final Thoughts

Getting around IP bans is a common challenge when surfing the web, but with the right know-how and tools, people can get past these roadblocks. This guide has looked into different ways to get around IP restrictions, from using VPNs and proxies to more complex methods like avoiding browser fingerprinting. By grasping why IP bans happen and putting good practices into action, users can keep accessing the online content they want while following website rules.

In the end, finding ways around IP bans is about striking a balance between getting to content and staying within ethical limits. While these methods can come in handy, it’s key to use them the right way and to think about the legal and moral issues. Follow the steps laid out in this guide, and join other business owners in choosing Crawlbase for your web scraping needs.

Try now for free!

Frequently Asked Questions (FAQs)

How can I reverse an IP ban?

You have several options to reverse an IP ban. You can use proxy servers as middlemen between your device and the internet. VPNs are another choice. The Tor network is also an option. You can change your IP address. Switching to mobile data might work. Web scraping techniques can also help you get around an IP ban.

What are some methods to bypass a website’s IP block?

You can try a few ways to get past an IP block. Using a VPN is a common approach. Changing your MAC address might do the trick. Proxy servers can be useful, too. It’s key to follow the website’s rules. Refrain from spamming or posting too much. Stay away from stolen login info. Make sure you don’t break any content or copyright rules.

Is it possible to change your IP address after being IP banned?

Yes, you can change your IP address if the ban targets it. You can do this by using a proxy or setting up a VPN. Remember to clear your cookies before trying these methods to make sure you bypass the ban.

Is it legal for a company or website to ban an IP address?

Yes, companies and websites can ban IP addresses. They often see this as necessary to enforce rules or guard against misuse, even though the people affected might think it’s unfair.

How to Scrape Houzz Data

2024-10-09T10:00:00.000Z

Houzz is a platform where homeowners, designers and builders come together to find products, inspiration and services. It’s one of the top online platforms for home renovation, interior design and furniture shopping. With over 65 million unique users and 10 million product listings, Houzz is a treasure trove of data for businesses, developers and researchers. The platform offers insights that can be used to build an e-commerce business, do market research or analyze design trends.

In this blog, we’ll walk you through how to scrape Houzz search listings and product pages using Python. We’ll show you how to optimize your scraper using Crawlbase Smart Proxy so you can scrape smoothly and efficiently even from websites with anti-scraping measures.

Let’s get started!

Why Scrape Houzz Data?
Key Data Points to Extract from Houzz
Setting Up Your Python Environment

Installing Python and Required Libraries
Choosing an IDE

Scraping Houzz Search Listings

Inspecting the HTML Structure
Writing the Houzz Search Listings Scraper
Handling Pagination
Storing Data in a JSON File
Complete Code Example

Scraping Houzz Product Pages

Inspecting the HTML Structure
Writing the Houzz Product Page Scraper
Storing Data in a JSON File
Complete Code Example

Optimizing with Crawlbase Smart Proxy

Why Use Crawlbase Smart Proxy?
How to Add It to Your Scraper?

Final Thoughts
Frequently Asked Questions (FAQs)

Why Scrape Houzz Data?

Scraping Houzz data can be incredibly useful for a variety of reasons. With its large collection of home products, furniture, and decor, Houzz offers a lot of data that can help businesses and individuals make informed decisions. Following are some of the reasons to scrape Houzz data.

Market Research: If you’re in the home decor or furniture industry, you can analyze product trends, pricing strategies and customer preferences by scraping product details and customer reviews from Houzz.
Competitor Analysis: For e-commerce businesses, scraping Houzz will give you competitor pricing, product availability and customer ratings so you can stay competitive.
Product Data Aggregation: If you’re building a website or app that compares products across multiple platforms, scrape Houzz to include its massive product catalog in your data.
Customer Sentiment Analysis: Collect reviews and ratings to analyze customer sentiment about specific products or brands. This can help brands improve their offerings and help buyers make better decisions.
Data-Driven Decisions: Scrape Houzz to make informed decisions on what products to stock, how to price them and what customers are looking for.

Key Data Points to Extract from Houzz

When scraping from Houzz, you can focus on several key pieces of information. Here are the data points to extract from Houzz:

Name: The product name.
Price: The product price.
Description: Full details on features and materials.
Images: High-resolution images of the product.
Ratings and Reviews: Customer feedback on the product.
Specifications: Dimensions, materials etc.
Seller: Information on the seller or store.
Company: Business name.
Location: Business location.
Phone: Business phone number.
Website: Business website.
Email: Business email (if on website).

Setting Up Your Python Environment

To get started scraping Houzz data you need to set up your Python environment. This involves installing Python, the necessary libraries and an Integrated Development Environment (IDE) to make coding easier.

Installing Python and Required Libraries

First, you need to install Python on your computer. You can download the latest version from python.org. After installing, open a terminal or command prompt and confirm that Python is installed by typing:

python --version

Next, you’ll need to install the libraries for web scraping. The two main ones are requests for fetching web pages and BeautifulSoup for parsing the HTML. Install these by typing:

pip install requests beautifulsoup4

These libraries are essential for extracting data from Houzz’s HTML structure and making the process smooth.

Choosing an IDE

An IDE makes writing and managing your Python code easier. Some popular options include:

Visual Studio Code: A lightweight, free editor with great extensions for Python development.
PyCharm: A dedicated Python IDE with many built-in features for debugging and code navigation.
Jupyter Notebook: Great for interactive coding and seeing your results immediately.

Choose the IDE that suits you and your coding style. Once your environment is set up you’ll be ready to start building your Houzz scraper.

Scraping Houzz Search Listings

In this section, we will focus on scraping Houzz search listings, which display the products in a given category. We will cover how to find CSS selectors by inspecting the HTML, write a scraper to extract data, handle pagination, and store the data in a JSON file.

Inspecting the HTML Structure

First of all, you need to inspect the HTML of the Houzz page from which you want to scrape product listings. For example, to scrape bathroom vanities and sink consoles, use the URL:

https://www.houzz.com/products/bathroom-vanities-and-sink-consoles/best-sellers--best-sellers

Open the developer tools in your browser and navigate to this URL.

Here are some key selectors to focus on:

Product Title: Found in an a tag with class hz-product-card__product-title, which contains the product name.
Price: In a span tag with class hz-product-price, which displays the product price.
Rating: In a span tag with class star-rating, which shows the product’s average rating (accessible via the aria-label attribute).
Image URL: The product image is in an img tag, and you can get the URL from the src attribute.
Product Link: Each product links to its detail page through an a tag, and the URL can be read from the href attribute.

By looking at these selectors you can target the data you need for your scraper.

Writing the Houzz Search Listings Scraper

Now that you know where the data is located, let’s write the scraper. The following code uses the requests library to fetch the page and BeautifulSoup to parse the HTML.

import requests
from bs4 import BeautifulSoup

def scrape_houzz_search_listings(url):
    products = []

    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        for item in soup.select('div[data-container="Product List"] > div.hz-product-card'):
            # Fall back to 'N/A' whenever an element is missing from the card
            title = item.select_one('a.hz-product-card__product-title').text.strip() if item.select_one('a.hz-product-card__product-title') else 'N/A'
            price = item.select_one('span.hz-product-price').text.strip() if item.select_one('span.hz-product-price') else 'N/A'
            rating = item.select_one('span.star-rating')['aria-label'].replace('Average rating: ', '') if item.select_one('span.star-rating') else 'N/A'
            image_url = item.find('img')['src'] if item.find('img') else 'N/A'
            product_link = item.find('a')['href'] if item.find('a') else 'N/A'

            product_data = {
                'title': title,
                'price': price,
                'rating': rating,
                'image_url': image_url,
                'product_link': product_link,
            }
            products.append(product_data)

    else:
        print(f'Failed to retrieve the page: {response.status_code}')

    return products

Handling Pagination

To scrape multiple pages, we need to implement a separate function that will handle pagination logic. This function will check if there is a “next page” link and return the URL for that page. We can then loop through all the listings.

Here’s how you can write the pagination function:

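def get_next_page_url(soup):
    # Find the "next page" link; return its absolute URL, or None on the last page
    next_button = soup.find('a', class_='hz-pagination-link--next')
    return 'https://www.houzz.com' + next_button['href'] if next_button else None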

We will call this function in our main scraping function to continue fetching products from all available pages.
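Whether the "next page" href is relative or absolute depends on the page markup, so it can help to resolve it against the current page URL rather than concatenating strings. The following is a minimal sketch using the standard library's `urljoin`; the helper name `resolve_next_page_url` and the `/p/36` path are hypothetical and used only for illustration.

```python
from typing import Optional
from urllib.parse import urljoin

def resolve_next_page_url(current_url: str, href: Optional[str]) -> Optional[str]:
    """Turn the href of a "next page" link into an absolute URL.

    Returns None when there is no next link, which ends a pagination loop.
    """
    if not href:
        return None
    # urljoin leaves absolute hrefs untouched and resolves relative ones.
    return urljoin(current_url, href)

base = "https://www.houzz.com/products/bathroom-vanities-and-sink-consoles"
# Hypothetical relative href, as it might appear on a "next" link.
print(resolve_next_page_url(base, "/products/bathroom-vanities-and-sink-consoles/p/36"))
print(resolve_next_page_url(base, None))
```

The design choice here is defensive: if the site ever switches between relative and absolute links, the scraper keeps producing valid URLs.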

Storing Data in a JSON File

Next, we’ll create a function to save the scraped data into a JSON file. This function can be called after retrieving the listings.

import json

def save_to_json(data, filename='houzz_products.json'):
    with open(filename, 'w') as json_file:
        json.dump(data, json_file, indent=4)
    print(f'Data saved to {filename} successfully!')

Complete Code Example

Now, let’s combine everything, including pagination, into a complete code snippet.

import requests
from bs4 import BeautifulSoup
import json

def scrape_houzz_search_listings(url):
    products = []

    while url:
        print(f'Scraping {url}')
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')

            for item in soup.select('div[data-container="Product List"] > div.hz-product-card'):
                title = item.select_one('a.hz-product-card__product-title').text.strip() if item.select_one('a.hz-product-card__product-title') else 'N/A'
                price = item.select_one('span.hz-product-price').text.strip() if item.select_one('span.hz-product-price') else 'N/A'
                rating = item.select_one('span.star-rating')['aria-label'].replace('Average rating: ', '') if item.select_one('span.star-rating') else 'N/A'
                image_url = item.find('img')['src'] if item.find('img') else 'N/A'
                product_link = item.find('a')['href'] if item.find('a') else 'N/A'

                product_data = {
                    'title': title,
                    'price': price,
                    'rating': rating,
                    'image_url': image_url,
                    'product_link': product_link,
                }
                products.append(product_data)

            # Handle pagination
            url = get_next_page_url(soup)

        else:
            print(f'Failed to retrieve the page: {response.status_code}')
            break

    return products

def get_next_page_url(soup):
    next_button = soup.find('a', class_='hz-pagination-link--next')
    return 'https://www.houzz.com' + next_button['href'] if next_button else None

def save_to_json(data, filename='houzz_products.json'):
    with open(filename, 'w') as json_file:
        json.dump(data, json_file, indent=4)
    print(f'Data saved to {filename} successfully!')

# Main function to run the scraper
if __name__ == '__main__':
    start_url = 'https://www.houzz.com/products/bathroom-vanities-and-sink-consoles/best-sellers--best-sellers'
    listings = scrape_houzz_search_listings(start_url)
    save_to_json(listings)

This complete scraper will extract product listings from Houzz, handling pagination smoothly.

Example Output:

[
    {
        "title": "The Sequoia Bathroom Vanity, Acacia, 30\", Single Sink, Freestanding",
        "price": "$948",
        "rating": "4.9 out of 5 stars",
        "image_url": "https://st.hzcdn.com/fimgs/abd13d5d04765ce7_1626-w458-h458-b1-p0--.jpg",
        "product_link": "https://www.houzz.com/products/the-sequoia-bathroom-vanity-acacia-30-single-sink-freestanding-prvw-vr~170329010"
    },
    {
        "title": "Bosque Bath Vanity, Driftwood, 42\", Single Sink, Undermount, Freestanding",
        "price": "$1,249",
        "rating": "4.699999999999999 out of 5 stars",
        "image_url": "https://st.hzcdn.com/fimgs/4b81420b03f91a0a_3904-w458-h458-b1-p0--.jpg",
        "product_link": "https://www.houzz.com/products/bosque-bath-vanity-driftwood-42-single-sink-undermount-freestanding-prvw-vr~107752516"
    },
    {
        "title": "Render Bathroom Vanity, Oak White",
        "price": "$295",
        "rating": "4.5 out of 5 stars",
        "image_url": "https://st.hzcdn.com/fimgs/4b31b0e601395a74_7516-w458-h458-b1-p0--.jpg",
        "product_link": "https://www.houzz.com/products/render-bathroom-vanity-oak-white-prvw-vr~176775440"
    },
    {
        "title": "The Wailea Bathroom Vanity, Single Sink, 42\", Weathered Fir, Freestanding",
        "price": "$1,354",
        "rating": "4.9 out of 5 stars",
        "image_url": "https://st.hzcdn.com/fimgs/81e1d4ca045d1069_1635-w458-h458-b1-p0--.jpg",
        "product_link": "https://www.houzz.com/products/the-wailea-bathroom-vanity-single-sink-42-weathered-fir-freestanding-prvw-vr~188522678"
    },
    .... more
]
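The scraped values are plain strings, and ratings can carry floating-point noise such as "4.699999999999999 out of 5 stars". If you plan to sort or filter the results, a small normalization step may help. This is a sketch with hypothetical helper names (`parse_price`, `parse_rating`), assuming prices look like "$1,249" and ratings start with a number.

```python
import re
from typing import Optional

def parse_price(text: str) -> Optional[float]:
    """Convert a price string such as "$1,249" to a float, or None if absent."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    return float(match.group().replace(",", "")) if match else None

def parse_rating(text: str) -> Optional[float]:
    """Extract the leading number from a rating string, rounded to 1 decimal."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)", text)
    return round(float(match.group(1)), 1) if match else None

print(parse_price("$1,249"))
print(parse_rating("4.699999999999999 out of 5 stars"))
print(parse_rating("N/A"))
```

Both helpers return None for the 'N/A' placeholder the scraper writes when a field is missing.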

Next, we will explore how to scrape individual product pages for more detailed information.

Scraping Houzz Product Pages

After scraping the search listings, next we gather more information from individual product pages. This will give us more info about each product, including specs and extra images. In this section, we will look at the HTML of a product page, write a scraper to extract the data and then store that data in a JSON file.

Inspecting the HTML Structure

To scrape product pages, you first need to look at the HTML structure of a specific product page.

https://www.houzz.com/products/the-sequoia-bathroom-vanity-acacia-30-single-sink-freestanding-prvw-vr~170329010

Open the developer tools in your browser and navigate to this URL.

Here are some key selectors to focus on:

Product Title: Within a span with class view-product-title.
Price: Within a span with class pricing-info__price.
Description: Within a div with class vp-redesign-description.
Images: Additional images within img tags within div.alt-images__thumb.

Knowing this is key to writing your scraper.

Writing the Houzz Product Page Scraper

Now that we know where to find the data, we can create a function to scrape the product page. Here’s how you can write the code to extract the necessary details:

import requests
from bs4 import BeautifulSoup

def scrape_houzz_product_page(url):
    response = requests.get(url)
    product_data = {}

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        title = soup.select_one('span.view-product-title').text.strip() if soup.select_one('span.view-product-title') else 'N/A'
        price = soup.select_one('span.pricing-info__price').text.strip() if soup.select_one('span.pricing-info__price') else 'N/A'
        description = soup.select_one('div.vp-redesign-description').text.strip() if soup.select_one('div.vp-redesign-description') else 'N/A'
        image_urls = [img['src'] for img in soup.select('div.alt-images__thumb > img')] if soup.select('div.alt-images__thumb > img') else 'N/A'

        product_data = {
            'title': title,
            'price': price,
            'description': description,
            'image_urls': image_urls,
            'product_link': url
        }
    else:
        print(f'Failed to retrieve the product page: {response.status_code}')

    return product_data

Storing Data in a JSON File

Just like the search listings, we can save the data we scrape from the product pages into a JSON file for easy access and analysis. Here’s a function that takes the product data and saves it in a JSON file:

import json

def save_product_to_json(product_data, filename='houzz_product.json'):
    with open(filename, 'w') as json_file:
        json.dump(product_data, json_file, indent=4)
    print(f'Product data saved to {filename} successfully!')

Complete Code Example

To combine everything we’ve discussed, here’s a complete code example that includes both scraping individual product pages and saving that data to a JSON file:

import requests
from bs4 import BeautifulSoup
import json

def scrape_houzz_product_page(url):
    response = requests.get(url)
    product_data = {}

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        title = soup.select_one('span.view-product-title').text.strip() if soup.select_one('span.view-product-title') else 'N/A'
        price = soup.select_one('span.pricing-info__price').text.strip() if soup.select_one('span.pricing-info__price') else 'N/A'
        description = soup.select_one('div.vp-redesign-description').text.strip() if soup.select_one('div.vp-redesign-description') else 'N/A'
        image_urls = [img['src'] for img in soup.select('div.alt-images__thumb > img')] if soup.select('div.alt-images__thumb > img') else 'N/A'

        product_data = {
            'title': title,
            'price': price,
            'description': description,
            'image_urls': image_urls,
            'product_link': url
        }
    else:
        print(f'Failed to retrieve the product page: {response.status_code}')

    return product_data

def save_product_to_json(product_data, filename='houzz_product.json'):
    with open(filename, 'w') as json_file:
        json.dump(product_data, json_file, indent=4)
    print(f'Product data saved to {filename} successfully!')

# Main function to run the product page scraper
if __name__ == '__main__':
    product_url = 'https://www.houzz.com/products/the-sequoia-bathroom-vanity-acacia-30-single-sink-freestanding-prvw-vr~170329010'
    product_details = scrape_houzz_product_page(product_url)
    save_product_to_json(product_details)

This code will scrape detailed information from a single Houzz product page and save it to a JSON file.

Example Output:

{
    "title": "The Sequoia Bathroom Vanity, Acacia, 30\", Single Sink, Freestanding",
    "price": "$948",
    "description": "The 30\" Sequoia single sink bathroom vanity will be the centerpiece of your bathroom remodel. Skillfully constructed of 100% solid fir wood to last a lifetime. Wood is skillfully finished with raised grain to give a distressed and reclaim wood look. One solid wood dovetail drawer with full extension glides gives you all the necessary storage room for your daily toiletries, coupled with a quartz countertop.Solid fir wood constructionBeautiful chevron front door designSolid wood dovetail drawers boxSoft closing drawer with full extension glidesWood finished to prevent warping, cracking and withstand bathroom humidity levelsWhite quartz countertopAssembled dimensions: 30 in. W x 22 in. D x 34.50 in. HBlack hardwarePre drilled for 8 inch widespread faucetFinished in Weathered Fir - rustic and reclaim wood look.",
    "image_urls": [
        "https://st.hzcdn.com/fimgs/abd13d5d04765ce7_1626-w100-h100-b0-p0--.jpg",
        "https://st.hzcdn.com/fimgs/9c617c9c04765ce8_1626-w100-h100-b0-p0--.jpg",
        "https://st.hzcdn.com/fimgs/7af1287304765cea_1626-w100-h100-b0-p0--.jpg",
        "https://st.hzcdn.com/fimgs/a651c05404765ced_1626-w100-h100-b0-p0--.jpg",
    .... more
    ],
    "product_link": "https://www.houzz.com/products/the-sequoia-bathroom-vanity-acacia-30-single-sink-freestanding-prvw-vr~170329010"
}
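To connect the two scrapers, you could feed each product_link from the search listings into the product page scraper. The sketch below is one possible way to do that; `scrape_product_details` is a hypothetical helper, the page scraper is passed in as a callable, and the two-second pause is an arbitrary illustrative default.

```python
import time
from typing import Callable, Dict, List

def scrape_product_details(
    listings: List[Dict],
    scrape_page: Callable[[str], Dict],
    delay_seconds: float = 2.0,
) -> List[Dict]:
    """Visit each listing's product_link and collect the detail records."""
    details = []
    for listing in listings:
        link = listing.get("product_link")
        if not link or link == "N/A":
            continue  # the listing scraper writes 'N/A' for missing links
        record = scrape_page(link)
        if record:  # the page scraper returns {} when a request fails
            details.append(record)
        time.sleep(delay_seconds)  # pause between product pages
    return details
```

With the functions from this guide, a call might look like `scrape_product_details(listings, scrape_houzz_product_page)`, and the resulting list can be written out with `json.dump` in the same way as the listings.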

In the next section, we will discuss how to optimize your scraping process with Crawlbase Smart Proxy.

Optimizing with Crawlbase Smart Proxy

When scraping sites like Houzz, IP blocks and CAPTCHAs can slow you down. Crawlbase Smart Proxy helps bypass these issues by rotating IPs and handling CAPTCHAs automatically. This allows you to scrape data without interruptions.

Why Use Crawlbase Smart Proxy?

IP Rotation: Avoid IP bans by using a pool of thousands of rotating proxies.
CAPTCHA Handling: Crawlbase automatically bypasses CAPTCHAs, so you don’t have to solve them manually.
Increased Efficiency: Scrape data faster by making requests without interruptions from rate limits or blocks.
Global Coverage: You can scrape data from any location by selecting proxies from different regions worldwide.

How to Add It to Your Scraper?

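To integrate Crawlbase Smart Proxy, pass it as the proxy for your requests instead of connecting to Houzz directly:

import requests

# Replace _USER_TOKEN_ with your Crawlbase Token
# You can get one by creating an account on Crawlbase
CRAWLBASE_PROXY_URL = 'http://_USER_TOKEN_@smartproxy.crawlbase.com:8012'
PROXIES = {'http': CRAWLBASE_PROXY_URL, 'https': CRAWLBASE_PROXY_URL}

def scrape_houzz_product_page(url):
    # Send the request through Smart Proxy; the target URL itself is unchanged.
    # verify=False is an assumption here (the proxy re-signs HTTPS traffic);
    # confirm the recommended setting in the Crawlbase documentation.
    response = requests.get(url, proxies=PROXIES, verify=False)
    # Scraper code as before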

This will ensure your scraper can run smoothly and efficiently while scraping Houzz.

Optimize Houzz Scraper with Crawlbase

Houzz provides valuable insights for your projects. You can explore home improvement trends and analyze market prices. By following the steps in this blog, you can easily gather important information like product details, prices, and customer reviews.

Using Python libraries like Requests and BeautifulSoup simplifies the scraping process. Plus, using Crawlbase Smart Proxy helps you access the data you need without facing issues like IP bans or CAPTCHAs.

If you’re interested in exploring scraping from other e-commerce platforms, feel free to explore the following comprehensive guides.

📜 How to Scrape Amazon
📜 How to Scrape Walmart
📜 How to Scrape AliExpress
📜 How to Scrape Zalando
📜 How to Scrape Costco

If you have any questions or feedback, our support team is always available to assist you. Good luck with your scraping journey!

Frequently Asked Questions

Q. Is it legal to scrape product data from Houzz?

Yes, scraping product data from Houzz is allowed as long as you follow their terms of service. Make sure to read Houzz’s TOS and respect their robots.txt file so you scrape responsibly and ethically.

Q. Why should I use a proxy like Crawlbase Smart Proxy for scraping Houzz?

Using a proxy like Crawlbase Smart Proxy prevents IP bans which can happen if you make too many requests to a website in a short span of time. Proxies also bypass CAPTCHA challenges and geographic restrictions so you can scrape data from Houzz or any other website smoothly.

Q. Can I scrape both product listings and product details from Houzz?

Yes, you can scrape both. In this blog, we’ve demonstrated how to extract essential information from Houzz’s search listings and individual product pages. By following similar steps, you can extend your scraper to gather various data points, such as pricing, reviews, specifications, and even business contact details.

ISP Proxies vs. Residential Proxies (Main Differences)

2024-10-04T13:58:44.000Z

Protecting data privacy is key to keeping security and open access. Online content has become essential, and proxy use has shot up. People use proxies to scrape websites, get around location blocks, or stay anonymous online. Proxies help users hide their IP addresses and keep their online activities private.

But not all proxies work the same way, and picking the right kind can make a big difference in how well you do these tasks. ISP proxies and residential proxies are two of the most common types. They might look alike at first glance, but they serve different purposes. Each one has its strengths, depending on what the user needs to do.

This blog explores the key differences between ISP proxies and residential proxies. We’ll look at how they work, their special features, and which one suits different situations best.

What Are Proxies?

A proxy serves as a middle server between you and the internet. When you go online using a proxy, it sends your requests through the proxy server, hiding your actual IP address. Your device doesn’t talk to websites directly. Instead, the proxy server does this for you, giving you more privacy and keeping you anonymous.
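In code, using a proxy usually comes down to telling the HTTP client where the proxy server is. Here is a minimal sketch with Python's standard `urllib.request` module; the proxy address is a hypothetical placeholder and `build_proxied_opener` is an illustrative helper name.

```python
import urllib.request

# Hypothetical proxy address; substitute your provider's endpoint.
PROXY_URL = "http://proxy.example.com:8080"

def build_proxied_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Create an opener whose HTTP and HTTPS requests go via the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)

opener = build_proxied_opener(PROXY_URL)
# A call such as opener.open("https://example.com") would now be sent to the
# proxy server, which contacts the website on your behalf.
```

The website then sees the proxy's IP address rather than yours, which is the basis for every use case discussed in the rest of this article.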

Why Proxies Matter

Proxies have gained importance as key tools for many online tasks. They protect users’ identities and make their internet experience better. Here’s why proxies play such a crucial role:

Privacy and Anonymity: Proxies hide your real identity by concealing your IP address. This makes it tough for websites or bad actors to keep tabs on what you do online.
Accessing Geo-restricted Content: A lot of websites limit access based on where you are. When you use proxies, you can get around these limits and see content as if you were in a different country.
Web Scraping: Proxies play a crucial role in web scraping. They allow users to collect vast amounts of data from websites without getting blocked or flagged. They make it possible to send multiple requests from different IP addresses to avoid detection. Services like Crawlbase offer rotating gateway proxies that combine millions of residential and data center proxies. This ensures smooth and effective scraping to meet your data collection needs.
Better Security: Some proxies come with security features such as encryption. This provides extra protection to guard against cyber threats.

What Are ISP Proxies?

ISP (Internet Service Provider) proxies blend the benefits of data center and residential proxies. Actual ISPs assign these proxies, but data centers host them. This gives users IP addresses that look genuine and residential while enjoying the speed and reliability of data center infrastructure.

ISP proxies offer the best of both worlds: they have the trustworthiness of residential proxies and the high performance of data center proxies. These proxies come from ISPs, so they’re seen as authentic. This allows users to do things without raising eyebrows.

Characteristics of ISP Proxies

Speed & Reliability: ISP proxies run on data centers, so they give you fast connections and work well. This makes them great for jobs that need quick data gathering or many connections at once.
Authenticity with ISPs: ISP proxies have real IP addresses from ISPs. This means they look more genuine, and websites are less likely to flag or block them when they think someone’s using bots or scraping data.

Use Cases for ISP Proxies

ISP proxies work well for big, demanding jobs such as:

Competitive Intelligence: Businesses can collect market info, keep an eye on rivals, and follow pricing without getting caught or stopped.
Automation: Jobs that need automated steps, like running several social media profiles or buying in bulk, can gain from ISP proxies’ quickness and steadiness.
Accessing Restricted Content: ISP proxies work great to get around geo-blocks, as their realness makes websites less likely to spot and block them.

What Are Residential Proxies?

Residential proxies send internet traffic through real IP addresses given to actual homes. These proxies differ from data center or ISP proxies. Internet Service Providers (ISPs) assign these IP addresses to households. This makes traffic from these proxies look like it’s from real users. Because of this, websites trust residential proxies and see them as authentic. This lowers the chance of getting flagged or blocked.

Characteristics of Residential Proxies

Authenticity & Trustworthiness: Residential proxies use IP addresses that belong to real devices in homes. Websites consider these more genuine and less suspicious. This makes them great for tasks that need to stay under the radar and be discreet.
Higher Costs and Slower Speed: Getting and keeping residential proxies is complex, which makes them pricier. Also, residential proxies use home networks, so they’re slower than ISP or data center proxies. But they give you better anonymity.

Use Cases of Residential Proxies

Residential proxies shine when you need to seem real and avoid getting caught, like in these cases:

Web Scraping: Residential proxies excel at web scraping jobs that involve pulling large data sets from websites with anti-bot systems. These proxies help keep scraping requests from getting blocked or tagged.
Market Research: Businesses use residential proxies to collect data from global markets without limits or discovery. This includes checking product prices, watching competitors, and gathering customer feedback.
Ad Verification: Residential proxies let companies check that their ads show up right and reach the correct audiences in different places around the world.
Bypassing Geo-restrictions: Because they have real residential IPs, these proxies work well to get around geo-blocks, giving users access to content that might be off-limits in some areas.

What’s the Difference Between ISP and Residential Proxies?

IP Address Origins

ISP Proxies: These proxies use IP addresses that Internet Service Providers supply but host on data centers. This arrangement gives them a residential IP look while taking advantage of the data center setup.
Residential Proxies: These proxies send traffic through actual devices in people’s homes, which makes their IP addresses genuine residential IPs. The IPs go straight from ISPs to real users’ devices.

Speed and Performance

ISP Proxies: ISP proxies are faster and more reliable thanks to their data center setup. They can handle large amounts of traffic and demanding jobs more effectively.
Residential Proxies: Residential proxies give better privacy, but they tend to be slower as they use home networks. This slight drop in speed is a trade-off for better privacy and trustworthiness.

Cost

ISP Proxies: ISP proxies cost less because they combine the quickness and scalability of data center proxies with the legitimacy of ISP-assigned IP addresses. They offer a budget-friendly option for many online jobs.
Residential Proxies: Residential proxies have a higher price tag due to the challenges of getting and keeping IPs from real users. The increased cost shows their improved authenticity and ability to avoid detection.

Usage Restrictions

ISP Proxies: ISP proxies work best for jobs that need speed, like keeping tabs on competitors, pulling data from websites, and automating tasks. But websites with strong systems to catch bots might spot and stop them more often.
Residential Proxies: Residential proxies shine when you need to gather data over time or do tasks that must look real, such as checking ads or getting around location blocks. They look like regular users, so they’re less likely to get stopped. But they cost more and run slower.

Pros and Cons of ISP Proxies

Pros

High Speed: ISP proxies benefit from fast data center infrastructure, making them perfect for jobs that need quick, efficient processing.
Reliability: Because data centers host them, ISP proxies stay very stable, giving steady performance even when handling big workloads.
Affordability: ISP proxies cost less than residential proxies, so they fit the budget of people who need to use lots of proxies.

Cons

Not as Genuine: ISP proxies come across as more legitimate than typical data center proxies, but they still fall short of residential proxies in terms of authenticity. Websites might spot them as coming from data centers.
Greater Chance of Getting Blocked: Some websites use tough anti-bot systems. These systems can flag and block ISP proxies more often because they can tell these proxies come from data centers, not home addresses.

Pros and Cons of Residential Proxies

Pros

Better Privacy: Residential proxies give you more privacy because they use real home IP addresses, making them look like regular users online. This helps you avoid getting caught by websites that have strong ways to spot bots.
Getting Around Location Blocks: Since residential proxies come from real homes, they can get past location blocks well, letting you see content and use services that are for certain areas.

Cons

More Expensive: You’ll pay more for residential proxies. They’re pricier because it’s tricky to get real residential IP addresses and keep their networks running. This makes them cost more than other types of proxies.
Not as Fast: Residential proxies use regular home internet connections. This means they’re slower than ISP proxies, which use faster data center setups. If you need to do things that require quick internet speeds, this could be a problem.

When to Choose ISP Proxies

ISP proxies are the way to go when you need speed, the ability to scale up, and good value for your money. They give you the fast performance of data center proxies but with a bonus: IP addresses from real internet service providers. They’re a good fit for:

Competitive Research: Jobs like keeping tabs on rival prices, watching market shifts, or analyzing SEO data often need fast data collection on a big scale. ISP proxies give the speed and stability these jobs need.
Automation: Running many accounts, using bots, or doing the same tasks over and over on websites works better with ISP proxies’ high throughput and ability to scale. Their data center speed makes sure things run smoothly when automating on a large scale.
Data Scraping: To gather tons of data without slow links getting in the way, ISP proxies offer the right mix of speed and realness.

In short, ISP proxies work best for jobs where speed matters and the risk of getting blocked is acceptable. This makes them great for keeping an eye on rivals, pulling data from the web, and automating tasks.

When to Choose Residential Proxies

Residential proxies stand out as the top pick for jobs that need to stay hidden and look real. These proxies use actual home IP addresses making them tough to spot. This matters a lot for delicate work such as:

Ad Verification: Residential proxies help to check if ads show up right in different places and on many platforms. They look real, so they can get past systems that catch fake ads and check where ads go without getting stopped.
Seamless Web Scraping: When you need to go unnoticed, like when you’re getting info from websites with tough bot blockers, residential proxies help you avoid getting caught. They make you look like real users, so you’re less likely to get banned or hit with CAPTCHAs.
Getting Around Location Blocks: Residential proxies work great to access stuff that’s blocked in certain areas, like streaming services, local websites, or apps you can’t use everywhere. Their real home-like IP addresses make it hard for sites to block or limit access.

Which Proxy Network Best Suits My Business Needs?

ISP proxies and residential proxies have different uses and benefits. ISP proxies live in data centers with ISP-given IPs. They’re fast, easy to scale, and cheap. You can use them to research competitors, automate tasks, and scrape lots of data. Residential proxies send traffic through real people’s devices. They give you more privacy and look more authentic. These work great for checking ads, scraping the web without getting caught, and seeing content that’s blocked in some places.

To pick between ISP and residential proxies, think about what you need. If you want speed, reliability, and low cost, go for ISP proxies. But if you need to stay hidden, see blocked stuff, or do sensitive things that require more privacy, residential proxies are better.
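The decision rule above can be sketched as a small helper. This is only an illustration of the trade-offs described in this article; the function name and its two flags are hypothetical and not part of any Crawlbase product.

```python
def suggest_proxy_type(strict_anti_bot_targets: bool, geo_restricted_content: bool) -> str:
    """Return 'residential' or 'isp' based on the trade-offs described above.

    Both flags are hypothetical inputs chosen for this sketch.
    """
    # Stealth-related needs take priority: real home IPs are harder to detect
    # and work better against geo-blocks.
    if strict_anti_bot_targets or geo_restricted_content:
        return "residential"
    # Otherwise favour speed, reliability, and lower cost.
    return "isp"


# Example: fast price monitoring on sites without tough bot blockers
print(suggest_proxy_type(strict_anti_bot_targets=False, geo_restricted_content=False))
# Example: ad verification in another region
print(suggest_proxy_type(strict_anti_bot_targets=False, geo_restricted_content=True))
```

Real projects usually weigh more factors (budget, volume, session length), so treat this as a starting point rather than a rule.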

Both types of proxies have their strong points, so look at your needs. Crawlbase’s Smart Proxy uses millions of residential and data center proxies. It does this through a rotating gateway proxy that’s easy to set up. Whether you want to scrape data fast or check ads on the down-low, our Smart Proxy can help with your web scraping goals.

Scrape Costco Product Data Easily

2024-10-03T18:00:00.000Z

Costco is one of the largest warehouse retailers in the world, with over 800 warehouses globally and millions of customers. Its inventory ranges from groceries to electronics, home goods and clothes. With such a vast range of products, Costco product data could be gold in the eyes of businesses, researchers, and developers.

You can extract data from Costco to get insights into product prices, product availability, customer feedback and more. Using the data you pull from Costco, you can make informed decisions and track market trends. In this article, you will learn how to scrape Costco product data with Crawlbase’s Crawling API and Python.

Let’s jump right into the process!

Why Scrape Costco for Product Data?
Key Data Points to Extract from Costco
Crawlbase Crawling API for Costco Scraping

Crawlbase Python Library

Setting Up Your Python Environment

Installing Python and Required Libraries
Choosing an IDE

Scraping Costco Search Listings

Inspecting the HTML for Selectors
Writing the Costco Search Listings Scraper
Handling Pagination
Storing Data in a JSON File
Complete Code

Scraping Costco Product Pages

Inspecting the HTML for Selectors
Writing the Costco Product Page Scraper
Storing Data in a JSON File
Complete Code

Final Thoughts
Frequently Asked Questions

Why Scrape Costco for Product Data?

Costco is known for its variety of great quality products at low prices, making it popular among millions. Costco’s product data can be used for many purposes, including price comparison, market research, inventory management and product analysis. With this data, businesses can monitor product trends, track pricing strategies and understand customer preferences.

Whether you’re a developer building an app, a business owner doing market research or just someone curious about product pricing, scraping Costco can be super useful. By extracting product information such as price, availability and product description you can make more informed decisions or have automated systems that keep you updated in real time.

In the next sections, we will cover the key data points to consider and walk you through the step-by-step process of setting up a scraper to get Costco’s product data.

Key Data Points to Extract from Costco

When scraping Costco for product data you want to focus on getting useful information to make informed decisions. Here are the key data points to consider:

Product Name: The product name is important for identifying and organizing items.
Price: The price of each product helps with price comparison and tracking price changes over time.
Product Description: Detailed descriptions give insights into the features and benefits of each item.
Ratings and Reviews: Collecting customer reviews and star ratings gives valuable feedback on product quality and customer satisfaction.
Image URL: The product image is useful for visual references and marketing purposes.
Availability: Stock status shows whether a product can currently be purchased, which helps with inventory tracking.
SKU (Stock Keeping Unit): Unique product identifiers like SKUs are important for tracking inventory and managing data.

Once you have these data points, you can build a product database to support your business needs such as market research, inventory management and competitive analysis. Next we’ll look at how Crawlbase Crawling API can help with scraping Costco.
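As a rough sketch, the data points above could be modelled as a single record type before you write any scraping code. The class name and field names here are assumptions chosen for illustration; they are not defined by Costco or Crawlbase.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ProductRecord:
    """Hypothetical container for the data points listed above."""
    name: str
    price: Optional[str] = None
    description: Optional[str] = None
    rating: Optional[str] = None
    image_url: Optional[str] = None
    availability: Optional[str] = None
    sku: Optional[str] = None


# Fields that were not scraped simply stay None
record = ProductRecord(name="Larissa Fabric Chaise Sofa", price="$1,899.99")

# asdict() gives a plain dict that is easy to dump to JSON later
print(asdict(record))
```

Keeping every field optional except the name makes it easy to store partial results when a page is missing some details.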

Crawlbase Crawling API for Costco Scraping

Crawlbase’s Crawling API makes scraping Costco’s website easy and fast. Costco’s website uses dynamic content, which means some product data is loaded via JavaScript. That makes scraping harder, but Crawlbase Crawling API renders the page like a real browser.

Here’s why Crawlbase Crawling API is a great choice for scraping Costco:

Handles Dynamic Content: It handles JavaScript heavy pages, so all data is loaded and accessible for scraping.
IP Rotation: To avoid getting blocked by Costco, Crawlbase does IP rotation for you, so you don’t have to worry about rate limits or bans.
High Performance: With Crawlbase, you can scrape large volumes of data quickly and efficiently, saving you time and resources.
Customizable Requests: You can set custom headers, cookies or even control the requests behavior to fit your needs.

With these advantages, Crawlbase Crawling API simplifies the entire process, making it a good fit for extracting product data from Costco. In the next sections, we’ll look at the Crawlbase Python library and set up a Python environment for Costco scraping.

Crawlbase Python Library

Crawlbase has a Python library that makes web scraping a lot easier. This library requires an access token to authenticate. You can get a token after creating an account on Crawlbase.

Here’s an example function demonstrating how to use the Crawlbase Crawling API to send requests:

from crawlbase import CrawlingAPI

# Initialize Crawlbase API with your access token
crawling_api = CrawlingAPI({ 'token': 'YOUR_CRAWLBASE_TOKEN' })

def make_crawlbase_request(url):
    response = crawling_api.get(url)

    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        return html_content
    else:
        print(f"Failed to fetch the page. Crawlbase status code: {response['headers']['pc_status']}")
        return None

Note: Crawlbase offers two types of tokens:

Normal Token for static sites.
JavaScript (JS) Token for dynamic or browser-based requests.

For scraping dynamic sites like Costco, you’ll need the JS Token. Crawlbase provides 1,000 free requests to get you started, and no credit card is required for this trial. For more details, check out the Crawlbase Crawling API documentation.
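Rather than pasting a token into your script, you may prefer to read it from an environment variable. The sketch below assumes two variable names, CRAWLBASE_JS_TOKEN and CRAWLBASE_NORMAL_TOKEN, which are hypothetical names picked for this example; use whatever naming fits your setup.

```python
import os


def get_crawlbase_token(use_js_token: bool = True) -> str:
    """Read a Crawlbase token from the environment.

    The environment variable names are assumptions made for this sketch.
    """
    var_name = "CRAWLBASE_JS_TOKEN" if use_js_token else "CRAWLBASE_NORMAL_TOKEN"
    token = os.environ.get(var_name)
    if not token:
        # Fail early with a clear message instead of sending unauthenticated requests
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return token
```

You could then initialize the client with CrawlingAPI({'token': get_crawlbase_token()}) so the token never ends up in version control.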

Setting Up Your Python Environment

Before you start scraping Costco, you need to set up a proper Python environment. This involves installing Python, the required libraries, and an IDE to write and test your code.

Installing Python and Required Libraries

Install Python: Download and install Python from the official Python website. Choose the latest stable version for your operating system.
Install Required Libraries: After installing Python, you’ll need some libraries to work with Crawlbase Crawling API and to handle the scraping process. Open your terminal or command prompt and run the following commands:

pip install beautifulsoup4
pip install crawlbase

beautifulsoup4: BeautifulSoup makes it easier to parse and navigate through the HTML structure of the web pages.
crawlbase: This is the official library from Crawlbase that you’ll use to connect with their API.

Choosing an IDE

Choosing the right Integrated Development Environment (IDE) can make coding easier and more efficient. Here are a few popular options:

VS Code: Simple and lightweight, multi-purpose, free with Python extensions.
PyCharm: A robust Python IDE with many built-in tools for professional development.
Jupyter Notebooks: Good for running code interactively, especially for data projects.

Now that you have Python and the required libraries installed, and you’ve chosen an IDE, you can start scraping Costco product data. In the next section we will go step by step on how to scrape Costco search listings.

How to Scrape Costco Search Listings

Now that we’ve set up the Python environment, let’s get into scraping Costco search listings. In this section we’ll cover how to inspect the HTML for selectors, write a scraper using Crawlbase and BeautifulSoup, handle pagination and store the scraped data in a JSON file.

Inspecting the HTML for Selectors

To scrape the Costco product listings efficiently we need to inspect the HTML structure. Here’s what you’ll typically need to find:

Product Title: Found in a div with data-testid starting with Text_ProductTile_.
Product Price: Located in a div with data-testid starting with Text_Price_.
Product Rating: Found in a div with data-testid starting with Rating_ProductTile_.
Product URL: Embedded in an a tag with data-testid="Link".
Image URL: Found in an img tag under the src attribute.

Additionally, product listings are inside div[id="productList"], with items grouped under div[data-testid="Grid"].

Writing the Costco Search Listings Scraper

Crawlbase Crawling API provides multiple parameters you can pass with each request. With Crawlbase’s JS Token you can handle dynamic content loading on Costco, and the ajax_wait and page_wait parameters give the page time to load.

Let’s write a scraper that collects the product title, price, rating, product URL and image URL from the Costco search results page using Crawlbase Crawling API and BeautifulSoup.

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup

# Initialize Crawlbase API
crawling_api = CrawlingAPI({'token': 'CRAWLBASE_JS_TOKEN'})

# Function to fetch HTML content from Costco search results
def fetch_search_listings(url):
    options = {
        'ajax_wait': 'true',
        'page_wait': '5000'
    }
    response = crawling_api.get(url, options)
    if response['headers']['pc_status'] == '200':
        return response['body'].decode('utf-8')
    else:
        print(f"Failed to fetch the page. Status code: {response['headers']['pc_status']}")
        return None

# Scrape product listings from a page
def scrape_costco_search_listings(url):
    html_content = fetch_search_listings(url)
    if html_content:
        soup = BeautifulSoup(html_content, 'html.parser')
        product_list = []
        product_items = soup.select('div[id="productList"] > div[data-testid="Grid"]')

        for item in product_items:
            title = item.select_one('div[data-testid^="Text_ProductTile_"]').text.strip() if item.select_one('div[data-testid^="Text_ProductTile_"]') else 'N/A'
            price = item.select_one('div[data-testid^="Text_Price_"]').text.strip() if item.select_one('div[data-testid^="Text_Price_"]') else 'N/A'
            rating = item.select_one('div[data-testid^="Rating_ProductTile_"] > div')['aria-label'] if item.select_one('div[data-testid^="Rating_ProductTile_"] > div') else 'N/A'
            product_url = item.select_one('a[data-testid="Link"]')['href'] if item.select_one('a[data-testid="Link"]') else 'N/A'
            image_url = item.find('img')['src'] if item.find('img') else 'N/A'

            product_list.append({
                'title': title,
                'price': price,
                'rating': rating,
                'product_url': product_url,
                'image_url': image_url
            })
        return product_list
    else:
        return []

# Example usage
url = "https://www.costco.com/s?dept=All&keyword=sofas"
products = scrape_costco_search_listings(url)
print(products)

In this code:

fetch_search_listings(): This function uses the Crawlbase API to fetch the HTML content from the Costco search listings page.
scrape_costco_search_listings(): This function parses the HTML using BeautifulSoup to extract product details like title, price, rating, product URL, and image URL.
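Because the fetch function returns None when a request fails, you may want to retry a few times before giving up. The following is a minimal sketch of that idea, not part of the Crawlbase library: fetch_with_retries, max_attempts and base_delay are hypothetical names, and the function accepts any fetch callable that returns HTML or None.

```python
import time
from typing import Callable, Optional


def fetch_with_retries(
    fetch: Callable[[str], Optional[str]],
    url: str,
    max_attempts: int = 3,
    base_delay: float = 2.0,
) -> Optional[str]:
    """Call fetch(url) until it returns HTML or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        html = fetch(url)
        if html is not None:
            return html
        if attempt < max_attempts:
            # Wait longer after each failure: base_delay, 2x, 4x, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
    return None
```

With the scraper above, this could be used as fetch_with_retries(fetch_search_listings, url) in place of a direct call to fetch_search_listings(url).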

Handling Pagination

Costco search results can span multiple pages. To scrape all products, we need to handle pagination. Costco uses the &currentPage= parameter in the URL to load different pages.

Here’s how to handle pagination:

def scrape_all_pages(base_url, total_pages):
    all_products = []

    for page_num in range(1, total_pages + 1):
        paginated_url = f"{base_url}&currentPage={page_num}"
        print(f"Scraping page {page_num}")

        products = scrape_costco_search_listings(paginated_url)
        all_products.extend(products)

    return all_products

# Example usage
total_pages = 5  # Adjust based on the number of pages to scrape
base_url = "https://www.costco.com/s?dept=All&keyword=sofas"
all_products = scrape_all_pages(base_url, total_pages)
print(f"Total products scraped: {len(all_products)}")

This code will scrape multiple pages of search results by appending the &currentPage= parameter to the base URL.
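String concatenation works as long as the base URL already contains a query string. As an alternative sketch, the standard library's urllib.parse can add or overwrite the currentPage parameter safely; build_page_url is a hypothetical helper name used only for this example.

```python
from urllib.parse import urlencode, urlparse, parse_qsl, urlunparse


def build_page_url(base_url: str, page_num: int) -> str:
    """Return base_url with the currentPage query parameter set to page_num."""
    parts = urlparse(base_url)
    # Keep the existing parameters (dept, keyword, ...) and set the page number.
    # Note: dict() keeps only the last value if a parameter is repeated.
    query = dict(parse_qsl(parts.query))
    query["currentPage"] = str(page_num)
    return urlunparse(parts._replace(query=urlencode(query)))


print(build_page_url("https://www.costco.com/s?dept=All&keyword=sofas", 2))
```

For the example search URL this prints https://www.costco.com/s?dept=All&keyword=sofas&currentPage=2, and calling it again on that result with a different page number replaces the value instead of appending a second currentPage.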

How to Save Data in a JSON File

Once you’ve scraped the product data, it’s important to store it for later use. Here’s how you can save the product listings into a JSON file:

import json

def save_to_json(data, filename='costco_product_listings.json'):
    with open(filename, 'w') as f:
        json.dump(data, f, indent=2)
    print(f"Data saved to {filename}")

# Example usage
save_to_json(all_products)

This function will write the scraped product details into a costco_product_listings.json file.
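If product titles contain non-ASCII characters, it can help to write the file as UTF-8 explicitly and to have a matching loader for later analysis. This is a small variation on the function above; save_to_json_utf8 and load_from_json are names made up for this sketch.

```python
import json


def save_to_json_utf8(data, filename="costco_product_listings.json"):
    """Write data as UTF-8 JSON, keeping non-ASCII characters readable."""
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)


def load_from_json(filename="costco_product_listings.json"):
    """Read back a JSON file written by save_to_json_utf8."""
    with open(filename, "r", encoding="utf-8") as f:
        return json.load(f)
```

Loading the file back gives you the same list of dictionaries the scraper produced, ready for filtering or further processing.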

Complete Code

Here’s the complete code to scrape Costco search listings, handle pagination, and store the data in a JSON file:

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import json

# Initialize Crawlbase API
crawling_api = CrawlingAPI({'token': 'CRAWLBASE_JS_TOKEN'})

# Fetch HTML content
def fetch_search_listings(url):
    options = {
        'ajax_wait': 'true',
        'page_wait': '5000'
    }
    response = crawling_api.get(url, options)
    if response['headers']['pc_status'] == '200':
        return response['body'].decode('utf-8')
    else:
        print(f"Failed to fetch the page. Status code: {response['headers']['pc_status']}")
        return None

# Scrape product listings from a page
def scrape_costco_search_listings(url):
    html_content = fetch_search_listings(url)
    if html_content:
        soup = BeautifulSoup(html_content, 'html.parser')
        product_list = []
        product_items = soup.select('div[id="productList"] > div[data-testid="Grid"]')

        for item in product_items:
            title = item.select_one('div[data-testid^="Text_ProductTile_"]').text.strip() if item.select_one('div[data-testid^="Text_ProductTile_"]') else 'N/A'
            price = item.select_one('div[data-testid^="Text_Price_"]').text.strip() if item.select_one('div[data-testid^="Text_Price_"]') else 'N/A'
            rating = item.select_one('div[data-testid^="Rating_ProductTile_"] > div')['aria-label'] if item.select_one('div[data-testid^="Rating_ProductTile_"] > div') else 'N/A'
            product_url = item.select_one('a[data-testid="Link"]')['href'] if item.select_one('a[data-testid="Link"]') else 'N/A'
            image_url = item.find('img')['src'] if item.find('img') else 'N/A'

            product_list.append({
                'title': title,
                'price': price,
                'rating': rating,
                'product_url': product_url,
                'image_url': image_url
            })
        return product_list
    else:
        return []

# Scrape all pages
def scrape_all_pages(base_url, total_pages):
    all_products = []
    for page_num in range(1, total_pages + 1):
        paginated_url = f"{base_url}&currentPage={page_num}"
        print(f"Scraping page {page_num}")
        products = scrape_costco_search_listings(paginated_url)
        all_products.extend(products)
    return all_products

# Save data to a JSON file
def save_to_json(data, filename='costco_product_listings.json'):
    with open(filename, 'w') as f:
        json.dump(data, f, indent=2)
    print(f"Data saved to {filename}")

# Example usage
base_url = "https://www.costco.com/s?dept=All&keyword=sofas"
total_pages = 5
all_products = scrape_all_pages(base_url, total_pages)
save_to_json(all_products)

Example Output:

[
  {
    "title": "Coddle Aria Fabric Sleeper Sofa with Reversible Chaise Gray",
    "price": "$1,299.99",
    "rating": "Average rating is 4.65 out of 5 stars. Based on 1668 reviews.",
    "product_url": "https://www.costco.com/coddle-aria-fabric-sleeper-sofa-with-reversible-chaise-gray.product.4000223041.html",
    "image_url": "https://cdn.bfldr.com/U447IH35/at/nx2pbmjk76t8c5k4h3qpsg6/4000223041-847_gray_1.jpg?auto=webp&format=jpg&width=350&height=350&fit=bounds&canvas=350,350"
  },
  {
    "title": "Larissa Fabric Chaise Sofa",
    "price": "$1,899.99",
    "rating": "Average rating is 4.03 out of 5 stars. Based on 87 reviews.",
    "product_url": "https://www.costco.com/larissa-fabric-chaise-sofa.product.4000052035.html",
    "image_url": "https://cdn.bfldr.com/U447IH35/as/ck2h3n29gz2j6m7c9f7x4rhm/4000052035-847_gray_1?auto=webp&format=jpg&width=350&height=350&fit=bounds&canvas=350,350"
  },
  {
    "title": "Ridgewin Leather Power Reclining Sofa",
    "price": "$1,499.99",
    "rating": "Average rating is 4.63 out of 5 stars. Based on 1377 reviews.",
    "product_url": "https://www.costco.com/ridgewin-leather-power-reclining-sofa.product.4000079113.html",
    "image_url": "https://cdn.bfldr.com/U447IH35/as/xsmmcftqhmgws76mr625rgx/1653285-847__1?auto=webp&format=jpg&width=350&height=350&fit=bounds&canvas=350,350"
  },
  {
    "title": "Thomasville Langdon Fabric Sectional with Storage Ottoman",
    "price": "$1,499.99",
    "rating": "Average rating is 4.52 out of 5 stars. Based on 1981 reviews.",
    "product_url": "https://www.costco.com/thomasville-langdon-fabric-sectional-with-storage-ottoman.product.4000235345.html",
    "image_url": "https://cdn.bfldr.com/U447IH35/at/p3qmw24rtkkrtf77hmxvmpg/4000235345-847__1.jpg?auto=webp&format=jpg&width=350&height=350&fit=bounds&canvas=350,350"
  },
  .... more
]
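The scraped price and rating are plain strings, so they usually need converting before you can sort or compare products. Here is a minimal sketch that assumes the formats shown in the example output above; parse_price and parse_rating are hypothetical helpers, and other page layouts may need different patterns.

```python
import re
from typing import Optional, Tuple


def parse_price(price_text: str) -> Optional[float]:
    """Turn a string like '$1,299.99' into 1299.99, or None if no number is found."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", price_text)
    if not match:
        return None
    return float(match.group().replace(",", ""))


def parse_rating(rating_text: str) -> Tuple[Optional[float], Optional[int]]:
    """Extract (average rating, review count) from the aria-label text."""
    match = re.search(r"([\d.]+) out of 5 stars\. Based on (\d+) reviews", rating_text)
    if not match:
        return None, None
    return float(match.group(1)), int(match.group(2))


print(parse_price("$1,299.99"))
print(parse_rating("Average rating is 4.65 out of 5 stars. Based on 1668 reviews."))
```

Both helpers return None values for missing data such as 'N/A', so they can be applied to every scraped item without extra checks.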

How to Scrape Costco Product Pages

Now that we’ve covered how to scrape Costco search listings, the next step is to extract detailed product information from individual product pages. In this section we’ll cover how to inspect the HTML for selectors, write a scraper for Costco product pages, and store the data in a JSON file.

Inspecting the HTML for Selectors

To scrape individual Costco product pages we need to inspect the HTML structure of the page. Here’s what you’ll typically need to find:

Product Title: The title is found inside an h1 tag with the attribute automation-id="productName".
Product Price: The price is located within a span tag with the attribute automation-id="productPriceOutput".
Product Rating: The rating is found within a div tag with the attribute itemprop="ratingValue".
Product Description: Descriptions are located inside a div tag with the id product-tab1-espotdetails.
Images: The product image URL is extracted from an img tag with the class thumbnail-image by grabbing the src attribute.
Specifications: The specifications are stored in structured HTML, typically in rows of div tags with classes like .spec-name, and the values are found in sibling div tags.

Writing the Costco Product Page Scraper

We’ll now create a scraper that extracts detailed information from individual product pages, including the product title, price, rating, description, image and specifications. The scraper will use the Crawlbase Crawling API with the ajax_wait and page_wait parameters to fetch the content and BeautifulSoup to parse the HTML.

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup

# Initialize Crawlbase API
crawling_api = CrawlingAPI({'token': 'CRAWLBASE_JS_TOKEN'})

# Function to fetch HTML content of product page
def fetch_product_page(url):
    options = {
        'ajax_wait': 'true',
        'page_wait': '5000'
    }
    response = crawling_api.get(url, options)
    if response['headers']['pc_status'] == '200':
        return response['body'].decode('utf-8')
    else:
        print(f"Failed to fetch the page. Status code: {response['headers']['pc_status']}")
        return None

# Function to scrape Costco product details
def scrape_costco_product_page(url):
    html_content = fetch_product_page(url)

    if html_content:
        soup = BeautifulSoup(html_content, 'html.parser')

        title = soup.select_one('h1[automation-id="productName"]').text.strip() if soup.select_one('h1[automation-id="productName"]') else 'N/A'
        price = soup.select_one('span[automation-id="productPriceOutput"]').text.strip() if soup.select_one('span[automation-id="productPriceOutput"]') else 'N/A'
        rating = soup.select_one('div[itemprop="ratingValue"]').text.strip() if soup.select_one('div[itemprop="ratingValue"]') else 'N/A'
        description = soup.select_one('div[id="product-tab1-espotdetails"]').text.strip() if soup.select_one('div[id="product-tab1-espotdetails"]') else 'N/A'
        images_url = soup.find('img', class_='thumbnail-image')['src'] if soup.find('img', class_='thumbnail-image') else 'N/A'
        specifications = {row.select_one('.spec-name').text.strip(): row.select_one('div:not(.spec-name)').text.strip() for row in soup.select('div.product-info-description .row') if row.select_one('.spec-name')}

        product_details = {
            'title': title,
            'price': price,
            'rating': rating,
            'description': description,
            'images_url': images_url,
            'specifications': specifications,
        }

        return product_details
    else:
        return {}

# Example usage
product_url = "https://www.costco.com/example-product-page.html"
product_details = scrape_costco_product_page(product_url)
print(product_details)

In this code:

fetch_product_page(): This function uses Crawlbase to fetch the HTML content from a Costco product page.
scrape_costco_product_page(): This function uses BeautifulSoup to parse the HTML and extract relevant details like the product title, price, description, and image URL.

Storing Data in a JSON File

Once we have scraped the product details, we can store them in a JSON file for later use.

import json

# Function to save product details to a JSON file
def save_product_to_json(data, filename='costco_product_details.json'):
    with open(filename, 'w') as f:
        json.dump(data, f, indent=2)
    print(f"Data saved to {filename}")

# Example usage
save_product_to_json(product_details)

This code will write the scraped product details into a costco_product_details.json file.

Complete Code

Here’s the complete code that fetches and stores Costco product page details, using Crawlbase and BeautifulSoup:

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import json

# Initialize Crawlbase API
crawling_api = CrawlingAPI({'token': 'CRAWLBASE_JS_TOKEN'})

# Fetch HTML content of product page
def fetch_product_page(url):
    options = {
        'ajax_wait': 'true',
        'page_wait': '5000'
    }
    response = crawling_api.get(url, options)
    if response['headers']['pc_status'] == '200':
        return response['body'].decode('utf-8')
    else:
        print(f"Failed to fetch the page. Status code: {response['headers']['pc_status']}")
        return None

# Scrape product details from a Costco product page
def scrape_costco_product_page(url):
    html_content = fetch_product_page(url)

    if html_content:
        soup = BeautifulSoup(html_content, 'html.parser')

        title = soup.select_one('h1[automation-id="productName"]').text.strip() if soup.select_one('h1[automation-id="productName"]') else 'N/A'
        price = soup.select_one('span[automation-id="productPriceOutput"]').text.strip() if soup.select_one('span[automation-id="productPriceOutput"]') else 'N/A'
        rating = soup.select_one('div[itemprop="ratingValue"]').text.strip() if soup.select_one('div[itemprop="ratingValue"]') else 'N/A'
        description = soup.select_one('div[id="product-tab1-espotdetails"]').text.strip() if soup.select_one('div[id="product-tab1-espotdetails"]') else 'N/A'
        images_url = soup.find('img', class_='thumbnail-image')['src'] if soup.find('img', class_='thumbnail-image') else 'N/A'
        specifications = {row.select_one('.spec-name').text.strip(): row.select_one('div:not(.spec-name)').text.strip() for row in soup.select('div.product-info-description .row') if row.select_one('.spec-name')}

        product_details = {
            'title': title,
            'price': price,
            'rating': rating,
            'description': description,
            'images_url': images_url,
            'specifications': specifications,
        }

        return product_details
    else:
        return {}

# Save product details to a JSON file
def save_product_to_json(data, filename='costco_product_details.json'):
    with open(filename, 'w') as f:
        json.dump(data, f, indent=2)
    print(f"Data saved to {filename}")

# Example usage
product_url = "https://www.costco.com/coddle-aria-fabric-sleeper-sofa-with-reversible-chaise-gray.product.4000223041.html"
product_details = scrape_costco_product_page(product_url)
save_product_to_json(product_details)

With this code, you can now scrape individual Costco product pages and store detailed information like product titles, prices, descriptions, and images in a structured format.

Example Output:

{
  "title": "Coddle Aria Fabric Sleeper Sofa with Reversible Chaise Gray",
  "price": "- -.- -",
  "rating": "4.7",
  "description": "[ProductDetailsESpot_Tab1]\n\n\nCostco Direct Savings\nPurchase multiple Costco Direct items on the same order to receive additional savings. Items must ship to the same address to receive savings.\n\nBuy 2 Items, Save $100\nBuy 3 Items, Save $200\nBuy 4 Items, Save $300\nBuy 5 or more Items, Save $400\nWhile supplies last. Online-Only. Limit 2 redemptions per member. Costco Direct Savings can be combined with other promotions.",
  "images_url": "https://cdn.bfldr.com/U447IH35/as/x8sjfsx359hh3w273f285x97/4000223041-847_gray_1?auto=webp&format=jpg&width=150&height=150&fit=bounds&canvas=150,150",
  "specifications": {
    "Back Style": "Cushion Back",
    "Brand": "Coddle",
    "Costco Direct": "Costco Direct",
    "Design": "Stationary",
    "Features": "Convertible",
    "Frame Material": "Wood",
    "Number of Pieces": "2 Piece(s)",
    "Number of USB-A Ports": "1 Port",
    "Number of USB-C Ports": "1 Port",
    "Orientation": "Reversible",
    "Overall Sectional Dimensions: W x L x H": "37.4 in. x 89.4 in. x 37.4 in.",
    "Overall Sectional Weight": "300.3 lb.",
    "Seating Capacity": "4 Person",
    "Style": "Transitional",
    "Upholstery Material": "Fabric"
  }
}
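Notice that the price in this example came back as the placeholder "- -.- -", which can happen when the real price is not present in the rendered HTML. A small post-processing step can turn such placeholders into None so they are easy to filter out later. This is only a sketch: clean_product_details and the PLACEHOLDERS set are assumptions based on the output above.

```python
import re

# Values treated as "missing" in this sketch: the scraper's fallback,
# the placeholder price seen in the example output, and empty strings
PLACEHOLDERS = {"N/A", "- -.- -", ""}


def clean_product_details(details: dict) -> dict:
    """Collapse whitespace in string fields and map placeholders to None."""
    cleaned = {}
    for key, value in details.items():
        if isinstance(value, str):
            # Replace runs of spaces and newlines with a single space
            value = re.sub(r"\s+", " ", value).strip()
            cleaned[key] = None if value in PLACEHOLDERS else value
        else:
            # Leave nested data such as the specifications dict untouched
            cleaned[key] = value
    return cleaned


print(clean_product_details({"title": " Coddle Aria\n\nSofa ", "price": "- -.- -"}))
```

Collapsing whitespace also flattens the line breaks in the description, so skip that step if you want to keep the original paragraph structure.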

Optimize Costco Scraper with Crawlbase

Scraping product data from Costco can be a powerful tool for tracking prices, product availability and market trends. With Crawlbase Crawling API and BeautifulSoup you can automate the process and store the data in JSON for analysis.

Follow this guide to build a scraper for your needs, whether it’s for competitor analysis, research, or inventory tracking. Just make sure to follow the website’s terms of service. If you’re interested in scraping other e-commerce platforms, check out the following comprehensive guides.

📜 How to Scrape Amazon
📜 How to scrape Walmart
📜 How to Scrape AliExpress
📜 How to Scrape Flipkart
📜 How to Scrape Etsy

If you have any questions or feedback, our support team is always available to assist you on your web scraping journey. Good luck with your scraping!

Frequently Asked Questions

Q. Is scraping Costco legal?

Scraping Costco or any website must be done responsibly and within the website’s legal guidelines. Always check the site’s terms of service to make sure you’re allowed to scrape the data. Don’t scrape too aggressively to prevent overwhelming their servers. Using tools like Crawlbase which respects rate limits and manages IP rotation can help keep your scraping activity within acceptable boundaries.

Q. Why use Crawlbase Crawling API for scraping Costco?

The Crawlbase Crawling API is designed to handle complex, JavaScript-heavy websites like Costco. Many websites load content dynamically, which makes traditional scraping methods unreliable. Crawlbase gets around this limitation by rendering JavaScript and returning the full HTML of the page, making it easier to scrape the required data. It also manages proxies and rotates IPs, which helps prevent blocks when scraping large amounts of data.

Q. What data can I extract from Costco using this scraper?

Using this scraper, you can extract key data points from Costco product pages such as product names, prices, descriptions, ratings and image URLs. You can also capture product page links and handle pagination to scrape through multiple pages of search listings efficiently. This data can be stored in a structured format like JSON for easy access and analysis.

Scrape Goodreads for Book Ratings and Comments

2024-10-01T10:00:00.000Z

Goodreads stands out as a top online destination for people to share their thoughts on books. With its community of over 90 million signed-up users, the site buzzes with reviews, comments, and ratings on countless books. This wealth of user-created content offers a goldmine to anyone looking to extract valuable information such as book scores and reader feedback.

This post will guide you through making a program to gather book ratings and comments using Python and the Crawlbase Crawling API. We’ll walk you through setting up your workspace, dealing with page-by-page results, and saving the information in an organized way.

Ready to dive in?

Why Scrape Goodreads?
Key Data Points to Extract from Goodreads
Crawlbase Crawling API for Goodreads Scraping

Why Use Crawlbase for Goodreads Scraping?
Crawlbase Python Library

Setting Up Your Python Environment

Installing Python and Required Libraries
Choosing an IDE

Scraping Goodreads for Book Ratings and Comments

Inspecting the HTML for Selectors
Writing the Goodreads Scraper for Ratings and Comments
Handling Pagination
Storing Data in a JSON File
Complete Code Example

Final Thoughts
Frequently Asked Questions

Why Scrape Goodreads?

Goodreads is a great place for book lovers, researchers, and businesses. Scraping Goodreads gives you a lot of user-generated data that you can use to analyze book trends, gather user feedback, or build a list of popular books. Here are a few reasons why scraping Goodreads can be useful:

Rich Data: Goodreads provides ratings, reviews, and comments on books, making it an ideal place to understand the preferences of readers.
Large User Base: With millions of active users, Goodreads has a massive dataset, ideal for in-depth analysis.
Market Research: Data available from Goodreads can be used to help businesses understand market trends, popular books, and customer feedback that can be useful for marketing or product development.
Personal Projects: Scraping Goodreads can be handy if you are working on a personal project, like building your own book recommendation engine or analyzing reading habits.

Key Data Points to Extract from Goodreads

When scraping Goodreads, you should focus on the most important data points to get useful insights. Here are the key ones to collect:

Book Title: This is essential for any analysis or reporting.
Author Name: To categorize and organize books and to track popular authors.
Average Rating: Goodreads average rating based on user reviews. This is the key to understanding the book’s popularity.
Number of Ratings: The total number of ratings, which indicates how many people have rated the book.
User Comments/Reviews: User reviews are great for qualitative analysis. What did readers like or dislike?
Genres: Goodreads books are often tagged with genres. Helps to categorize and recommend similar books.
Publication Year: Useful to track trends over time or compare books published in the same year.
Book Synopsis: The synopsis provides a summary of the book’s plot and gives context to what the book is about.
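
The data points above can be gathered into a single record type so every scraped book has the same shape. The following is only a sketch: the class name BookRecord and its field names are illustrative choices, and the scraper later in this guide fills in just the title, rating, and reviews.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class BookRecord:
    """Illustrative container for the Goodreads data points listed above."""
    title: str
    author: Optional[str] = None
    average_rating: Optional[float] = None
    ratings_count: Optional[int] = None
    genres: List[str] = field(default_factory=list)
    publication_year: Optional[int] = None
    synopsis: Optional[str] = None
    reviews: List[dict] = field(default_factory=list)


record = BookRecord(title="The Great Gatsby", average_rating=3.93)
record.reviews.append({"user": "Inge", "review": "It was short."})

# asdict() converts the record to a plain dict that json.dump can serialize
print(asdict(record)["title"])  # The Great Gatsby
```

Optional fields default to None or an empty list, so a record can still be created when a page is missing some of the data.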

Crawlbase Crawling API for Goodreads Scraping

When scraping dynamic websites like Goodreads, traditional request methods struggle due to JavaScript rendering and complex pagination. This is where the Crawlbase Crawling API comes in handy. It handles JavaScript rendering, paginated content, and captchas so Goodreads scraping is smoother.

Why Use Crawlbase for Goodreads Scraping?

JavaScript Rendering: Crawlbase handles the JavaScript Goodreads uses to display ratings, comments and other dynamic content.
Effortless Pagination: With dynamic pagination, navigating through multiple pages of reviews becomes automatic.
Prevention Against Blocks: Crawlbase manages proxies and captchas for you, reducing the risk of being blocked or detected.

Crawlbase Python Library

Crawlbase has a Python library that makes web scraping a lot easier. This library requires an access token to authenticate. You can get a token after creating an account on Crawlbase.

Here’s an example function demonstrating how to use the Crawlbase Crawling API to send requests:

from crawlbase import CrawlingAPI

# Initialize Crawlbase API with your access token
crawling_api = CrawlingAPI({ 'token': 'YOUR_CRAWLBASE_TOKEN' })

def make_crawlbase_request(url):
    response = crawling_api.get(url)

    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        return html_content
    else:
        print(f"Failed to fetch the page. Crawlbase status code: {response['headers']['pc_status']}")
        return None
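
Since the function above returns None when a request fails, it can be handy to retry a few times before giving up. Here is a minimal sketch of a retry wrapper with exponential backoff. The name fetch_with_retries and its parameters are hypothetical and not part of the Crawlbase library; it accepts any function that takes a URL and returns HTML or None.

```python
import time
from typing import Callable, Optional


def fetch_with_retries(
    fetch: Callable[[str], Optional[str]],
    url: str,
    attempts: int = 3,
    base_delay: float = 1.0,
) -> Optional[str]:
    """Call fetch(url) until it returns HTML or the attempts run out."""
    for attempt in range(attempts):
        html = fetch(url)
        if html is not None:
            return html
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            time.sleep(base_delay * (2 ** attempt))
    return None
```

For example, fetch_with_retries(make_crawlbase_request, url) would call the function above up to three times, waiting one and then two seconds between attempts.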

Note: Crawlbase offers two types of tokens:

Normal Token for static sites.
JavaScript (JS) Token for dynamic or browser-based requests.

For scraping dynamic sites like Goodreads, you’ll need the JS Token. Crawlbase provides 1,000 free requests to get you started, and no credit card is required for this trial. For more details, check out the Crawlbase Crawling API documentation.

Setting Up Your Python Environment

Before scraping Goodreads for book ratings and comments, you need to set up your Python environment properly. Here’s a quick guide to get started.

Installing Python and Required Libraries

Download Python: Go to the Python website and download the latest version for your operating system. During installation, remember to add Python to the system PATH.
Verify the Installation: Confirm that Python installed successfully by running the following command in your terminal or command prompt:

python --version

Install Libraries: Use pip to install the required libraries: crawlbase, for making HTTP requests through the Crawlbase Crawling API, and beautifulsoup4 (imported as bs4), for parsing web pages:

pip install crawlbase
pip install beautifulsoup4

Choosing an IDE

A good IDE simplifies your coding. Below are some of the popular ones:

VS Code: Simple and lightweight, multi-purpose, free with Python extensions.
PyCharm: A robust Python IDE with many built-in tools for professional development.
Jupyter Notebooks: Good for running code interactively, especially for data projects.

With your environment ready, you can now move on to scraping Goodreads.

Scraping Goodreads for Book Ratings and Comments

When scraping book ratings and comments from Goodreads, keep in mind that the content changes constantly. Comments and reviews are loaded asynchronously, and pagination is handled through buttons. This section describes how to get this information and handle pagination through Crawlbase, using a JS Token and the css_click_selector parameter for button navigation.

Inspecting the HTML for Selectors

First, inspect the HTML of the Goodreads page you want to scrape. For example, to scrape reviews for The Great Gatsby, use the URL:

https://www.goodreads.com/book/show/4671.The_Great_Gatsby/reviews

Open the developer tools in your browser and navigate to this URL.

Here are some key selectors to focus on:

Book Title: Found in an h1 tag with class H1Title, specifically in an anchor tag with data-testid="title".
Ratings: Located in a div with class RatingStatistics, with the value in a span tag of class RatingStars (using the aria-label attribute).
Reviews: Each review is within an article inside a div with class ReviewsList and class ReviewCard. Each review includes:
- User’s name in a div with data-testid="name".
- Review text in a section with class ReviewText, containing a span with class Formatted.
Load More Button: The “Show More Reviews” button in the review section for pagination, identified by button:has(span[data-testid="loadMore"]).

Writing the Goodreads Scraper for Ratings and Comments

The Crawlbase Crawling API provides multiple parameters you can pass with each request. With Crawlbase’s JS Token, you can handle dynamic content loading on Goodreads. The ajax_wait and page_wait parameters give the page time to load.

Here’s a Python script to scrape Goodreads for book details, ratings, and comments using Crawlbase Crawling API.

from crawlbase import CrawlingAPI
import json
from bs4 import BeautifulSoup

# Initialize Crawlbase API with JS Token
crawling_api = CrawlingAPI({ 'token': 'CRAWLBASE_JS_TOKEN' })

# Function to fetch and process Goodreads book details and reviews
def scrape_goodreads_reviews(base_url):
    page_data = []

    # Fetch initial page and reviews
    response = crawling_api.get(base_url, {
        'ajax_wait': 'true',
        'page_wait': '5000'
    })

    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        page_data = extract_book_details(html_content)

    return page_data

# Function to extract the book title, rating, and reviews from the page
def extract_book_details(html):
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.select_one('h1.H1Title a[data-testid="title"]').text.strip()
    rating = soup.select_one('div.RatingStatistics span.RatingStars')['aria-label']

    reviews = []
    for review_div in soup.select('div.ReviewsList article.ReviewCard'):
        user = review_div.select_one('div[data-testid="name"]').text.strip()
        review_text = review_div.select_one('section.ReviewText span.Formatted').text.strip()
        reviews.append({'user': user, 'review': review_text})

    return {'title': title, 'rating': rating, 'reviews': reviews}
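
One thing to watch: BeautifulSoup’s select_one returns None when a selector matches nothing, so the .text and ['aria-label'] lookups above raise an error if Goodreads changes its markup or a review has no text. Here is a minimal sketch of two defensive helpers. The names safe_text and safe_attr are hypothetical and not part of BeautifulSoup.

```python
from typing import Any, Optional


def safe_text(node: Any, default: Optional[str] = None) -> Optional[str]:
    """Return node.text stripped, or default when the selector matched nothing."""
    if node is None:
        return default
    return node.text.strip()


def safe_attr(node: Any, name: str, default: Optional[str] = None) -> Optional[str]:
    """Return an attribute value, or default when the node or attribute is missing."""
    if node is None:
        return default
    # BeautifulSoup tags expose attributes through a dict-like get()
    return node.get(name, default)
```

With these in place, the title lookup could be written as safe_text(soup.select_one('h1.H1Title a[data-testid="title"]')), which yields None instead of raising when the element is missing.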

Handling Pagination

Goodreads uses button-based pagination to load more reviews. You can use Crawlbase’s css_click_selector parameter to simulate clicking the “Show More Reviews” button and capture the additional reviews it loads. This helps you collect as many reviews as possible.

Here’s how the pagination can be handled:

def scrape_goodreads_reviews_with_pagination(base_url):
    page_data = []

    # Fetch initial page and reviews
    response = crawling_api.get(base_url, {
        'ajax_wait': 'true',
        'page_wait': '5000',
        'css_click_selector': 'button:has(span[data-testid="loadMore"])'
    })

    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        page_data = extract_book_details(html_content)

    return page_data

Storing Data in a JSON File

After extracting the book details and reviews, you can write the scraped data to a JSON file. This format keeps the data structured and is easy to process later.

Here’s how to save the data:

# Function to save scraped reviews to a JSON file
def save_reviews_to_json(data, filename='goodreads_reviews.json'):
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)

# Example usage
book_reviews = scrape_goodreads_reviews_with_pagination('https://www.goodreads.com/book/show/4671.The_Great_Gatsby/reviews')
save_reviews_to_json(book_reviews)
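
Once the file is saved, you can load it back for a quick sanity check. The sketch below reads the JSON file and computes a few simple statistics. The function name summarize_reviews and the statistics it reports are illustrative choices, and it assumes the file has the title, rating, and reviews keys produced by the scraper.

```python
import json
from typing import Any, Dict


def summarize_reviews(filename: str = "goodreads_reviews.json") -> Dict[str, Any]:
    """Load the saved JSON file and compute a few simple statistics."""
    with open(filename, encoding="utf-8") as f:
        data = json.load(f)

    # The scraper above saves an empty list when the request fails
    if not isinstance(data, dict):
        data = {}

    reviews = data.get("reviews", [])
    lengths = [len(r.get("review", "")) for r in reviews]
    return {
        "title": data.get("title"),
        "rating": data.get("rating"),
        "review_count": len(reviews),
        "avg_review_length": sum(lengths) / len(lengths) if lengths else 0.0,
    }
```

Calling summarize_reviews() after a scrape shows at a glance whether any reviews were captured.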

Complete Code Example

Here is the complete code that scrapes Goodreads for book ratings and reviews, handles button-based pagination, and saves the data in a JSON file:

from crawlbase import CrawlingAPI
import json
from bs4 import BeautifulSoup

# Initialize Crawlbase API with JS Token
crawling_api = CrawlingAPI({ 'token': 'CRAWLBASE_JS_TOKEN' })

# Function to extract book details and reviews from the HTML content
def extract_book_details(html):
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.select_one('h1.H1Title a[data-testid="title"]').text.strip()
    rating = soup.select_one('div.RatingStatistics span.RatingStars')['aria-label']

    reviews = []
    for review_div in soup.select('div.ReviewsList article.ReviewCard'):
        user = review_div.select_one('div[data-testid="name"]').text.strip()
        review_text = review_div.select_one('section.ReviewText span.Formatted').text.strip()
        reviews.append({'user': user, 'review': review_text})

    return {'title': title, 'rating': rating, 'reviews': reviews}

# Function to scrape Goodreads with pagination
def scrape_goodreads_reviews_with_pagination(base_url):
    page_data = []

    # Fetch initial page and reviews
    response = crawling_api.get(base_url, {
        'ajax_wait': 'true',
        'page_wait': '5000',
        'css_click_selector': 'button:has(span[data-testid="loadMore"])'
    })

    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        page_data = extract_book_details(html_content)

    return page_data

# Function to save the reviews in JSON format
def save_reviews_to_json(data, filename='goodreads_reviews.json'):
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)

# Example usage
book_reviews = scrape_goodreads_reviews_with_pagination('https://www.goodreads.com/book/show/4671.The_Great_Gatsby/reviews')
save_reviews_to_json(book_reviews)

By using Crawlbase’s JS Token and handling button-based pagination, this scraper efficiently extracts Goodreads book ratings and reviews and stores them in a usable format.

Example Output:

{
    "title": "The Great Gatsby",
    "rating": "Rating 3.93 out of 5",
    "reviews": [
        {
            "user": "Alex",
            "review": "The Great Gatsby is your neighbor you're best friends with until you find out he's a drug dealer. It charms you with some of the most elegant English prose ever published, making it difficult to discuss the novel without the urge to stammer awestruck about its beauty. It would be evidence enough to argue that F. Scott Fitzgerald was superhuman, if it wasn't for the fact that we know he also wrote This Side of Paradise.But despite its magic, the rhetoric is just that, and it is a cruel facade. Behind the stunning glitter lies a story with all the discontent and intensity of the early Metallica albums. At its heart, The Great Gatsby throws the very nature of our desires into a harsh, shocking light. There may never be a character who so epitomizes tragically misplaced devotion as Jay Gatsby, and Daisy, his devotee, plays her part with perfect, innocent malevolence. Gatsby's competition, Tom Buchanan, stands aside watching, taunting and provoking with piercing vocal jabs and the constant boast of his enviable physique. The three jostle for position in an epic love triangle that lays waste to countless innocent victims, as well as both Eggs of Long Island. Every jab, hook, and uppercut is relayed by the instantly likable narrator Nick Carraway, seemingly the only voice of reason amongst all the chaos. But when those boats are finally borne back ceaselessly by the current, no one is left afloat. It is an ethical massacre, and Fitzgerald spares no lives; there is perhaps not a single character of any significance worthy even of a Sportsmanship Award from the Boys and Girls Club.In a word, The Great Gatsby is about deception; Fitzgerald tints our glasses rosy with gorgeous prose and a narrator you want so much to trust, but leaves the lenses just translucent enough for us to see that Gatsby is getting the same treatment. And if Gatsby represents the truth of the American Dream, it means trouble for us all. 
Consider it the most pleasant insult you'll ever receive."
        },
        {
            "user": "Lisa of Troy",
            "review": "Fitzgerald, you have ruined me.Fitzgerald can set a scene so perfectly, flawlessly. He paints a world of magic and introduces one of the greatest characters of all time, Jay Gatsby. Gatsby is the embodiment of hope, and no one can dissuade him from his dreams. Have you ever had a dream that carried you to heights you could never have dreamed otherwise? When Gatsby is reunited with Daisy Buchanan, he fills the space to the brim with flowers, creating a living dream. How is anyone supposed to compete with that?The Great Gatsby perfectly makes use of a narrator, Nick. Why is Gatsby so great? Because Nick tells us. If Gatsby told us, we would just think that he is a braggard, the least humble person in the world. This book is wildly addictive, so intricate yet perfectly woven together, a brilliant literary masterpiece. I have to keep going back to reconnect with Jay Gatsby, a naïve but beautiful and charming hope, perfectly imperfect, a relentless dreamer.2025 Reading ScheduleJan\tA Town Like AliceFeb\tBirdsongMar\tCaptain Corelli's Mandolin - Louis De BerniereApr\tWar and PeaceMay\tThe Woman in WhiteJun\tAtonementJul\tThe Shadow of the WindAug\tJude the ObscureSep\tUlyssesOct\tVanity FairNov\tA Fine BalanceDec\tGerminalConnect With Me!Blog Twitter BookTube Facebook Insta My Bookstore at Pango"
        },
        {
            "user": "Kemper",
            "review": "Jay Gatsby, you poor doomed bastard. You were ahead of your time. If you would have pulled your scam after the invention of reality TV, you would have been a huge star on a show like The Bachelor and a dozen shameless Daisy-types would have thrown themselves at you. Mass media and modern fame would have embraced the way you tried to push your way into a social circle you didn’t belong to in an effort to fulfill a fool’s dream as your entire existence became a lie and you desperately sought to rewrite history to an ending you wanted. You had a talent for it, Jay, but a modern PR expert would have made you bigger than Kate Gosselin. Your knack for self-promotion and over the top displays of wealth to try and buy respectability would have fit right in these days. I can just about see you on a red carpet with Paris Hilton. And the ending would have been different. No aftermath for rich folks these days. Lawyers and pay-off money would have quietly settled the matter. No harm, no foul. But then you’d have realized how worthless Daisy really was at some point. I’m sure you couldn’t have dealt with that. So maybe it is better that your story happened in the Jazz Age where you could keep your illusions intact to the bitter end.The greatest American novel? I don’t know if there is such an animal. But I think you'd have to include this one in the conversation."
        },
        {
            "user": "Inge",
            "review": "There was one thing I really liked about The Great Gatsby.It was short."
        },
        {
            "user": "may ➹",
            "review": "the only thing I got from this is that Nick is gay2.5"
        },
        .... more
    ]
}
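
In the output above, the rating comes through as the full aria-label text ("Rating 3.93 out of 5") rather than a number. If you need a numeric value for sorting or averaging, something like the following sketch could work. The helper name parse_rating is hypothetical, and it assumes the label keeps the "<number> out of <number>" wording shown above.

```python
import re
from typing import Optional


def parse_rating(label: str) -> Optional[float]:
    """Extract 3.93 from a label such as "Rating 3.93 out of 5"."""
    match = re.search(r"(\d+(?:\.\d+)?)\s+out of\s+\d+", label)
    return float(match.group(1)) if match else None


print(parse_rating("Rating 3.93 out of 5"))  # 3.93
```

The function returns None for labels that do not follow this pattern, so unexpected markup does not crash the scraper.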

Final Thoughts

Scraping Goodreads for book ratings and comments gives you valuable insights from readers. Using Python with the Crawlbase Crawling API makes this easier, especially when dealing with dynamic content and button-based pagination on Goodreads. With Crawlbase handling the technical complexities, you can focus on extracting the data.

Follow the steps in this guide and you’ll be scraping reviews and ratings and storing the data in a structured format for analysis. If you want to do more web scraping, check out our guides on scraping other key websites.

📜 How to Scrape Monster.com
📜 How to Scrape Groupon
📜 How to Scrape TechCrunch
📜 How to Scrape X.com Tweet Pages
📜 How to Scrape Clutch.co

If you have questions or want to give feedback, our support team can help with web scraping. Happy scraping!

Frequently Asked Questions

Q. What is the best way to scrape Goodreads for book ratings and comments?

The best way to scrape Goodreads is to use Python with the Crawlbase Crawling API. This combination lets you scrape dynamic content like book ratings and comments. The Crawlbase Crawling API handles JavaScript rendering and pagination, so you can get the data you need reliably.

Q. What data points can I extract when scraping Goodreads?

When scraping Goodreads, you can extract the following data points: book titles, authors, average ratings, individual user ratings, comments, and total reviews. This data gives you insight into how readers are receiving books and helps you make informed decisions for book recommendations or analysis.

Q. How does pagination work when scraping reviews from Goodreads?

Goodreads uses button-based pagination to load more reviews. With the Crawlbase Crawling API, you can click the “Show More Reviews” button programmatically by setting the css_click_selector parameter in the API call. This loads additional reviews without you having to navigate the site manually.

How To Create a Zalando Scraper

2024-09-25T16:00:00.000Z

Looking to scrape Zalando? You’re in the right place. Zalando is one of the top fashion online shopping sites with a huge range of stuff from clothes to accessories. Maybe you’re doing market research or building a fashion app - either way, knowing how to get good data straight from the site can be handy.

In this blog, we’ll show you how to create a reliable Zalando scraper with Puppeteer - a well-known web scraping tool. You’ll learn how to pull out product details such as prices, sizes, and stock levels. We’ll also give you tips on how to handle CAPTCHA, IP blocking and how to scale your scraper with Crawlbase Smart Proxy.

Let’s get started!

Why Scrape Zalando for Product Data?
Key Data Points to Extract from Zalando
Setting Up Your Node.js Environment

Installing Node.js
Installing Required Libraries
Choosing an IDE

Scraping Zalando Product Listings

Inspecting the HTML for Selectors
Writing the Zalando Product Listings Scraper
Handling Pagination
Storing Data in a JSON File

Scraping Zalando Product Details

Inspecting the HTML for Selectors
Writing the Zalando Product Details Scraper
Storing Data in a JSON File

Optimizing with Crawlbase Smart Proxy

What is Crawlbase Smart Proxy?
How to Use Crawlbase Smart Proxy with Puppeteer
Benefits of Using Crawlbase Smart Proxy

Final Thoughts
Frequently Asked Questions

Why Scrape Zalando for Product Data?

Scraping Zalando is a great way to get product data for various purposes. Whether you’re monitoring prices, tracking product availability, or analyzing fashion trends, having access to this data gives you an edge. Zalando is one of the largest online fashion platforms in Europe with a wide range of products from shoes and clothes to accessories.

By scraping Zalando, you can extract product names, prices, reviews, and availability. This data can be used to compare prices, create data-driven marketing strategies, or even build an automated price tracker. If you run an eCommerce business or just want to keep an eye on the latest fashion trends, scraping Zalando’s product data will help you stay ahead.

Using a scraper to get data from Zalando saves you the time and effort of manually searching and copying product information. With the right setup, you can collect thousands of product details quickly, making your data collection process more streamlined.

Key Data Points to Extract from Zalando

When scraping Zalando, you can extract several important pieces of product information. These details are useful for tracking trends, understanding prices, or analyzing market behavior. Below are the main data points to focus on:

Product Name: The name of the product helps you identify and categorize what is being sold.
Product Price: Knowing the price, including discounts, is essential for monitoring price trends and comparing competitors.
Product Description: This gives specific information about the product, such as material, style, and other key features.
Product Reviews: Reviews provide information about product quality and popularity and are useful for sentiment analysis.
Product Availability: Checking if a product is in stock helps you understand demand and how quickly items are selling.
Product Images: Images give a clear view of the product, which is important for understanding fashion trends and styles.
Brand Name: Knowing the brand allows for better analysis of brand performance and comparison across different brands.

Setting Up Your Node.js Environment

In order to efficiently scrape Zalando, you will need to configure your Node.js environment. This process involves installing Node.js, the necessary libraries, and choosing a suitable Integrated Development Environment (IDE). Here’s how to do it step by step:

Installing Node.js

Download Node.js: Go to the official Node.js website to get its latest version for your operating system. Node.js comes with npm (Node Package Manager), which you’ll use to install other libraries.
Install Node.js: Follow the installation instructions for your operating system. You may verify if it is installed by opening your terminal or command prompt and typing:

node -v

This command should display the installed version of Node.js.

Installing Required Libraries

Create a New Project Folder: Create a folder for your scraping project. Open the terminal inside this folder.
Initialize npm: Inside your project folder, run:

npm init -y

This command creates a package.json file that keeps track of your project’s dependencies.

Install Required Libraries: You’ll need a few libraries to make scraping easier. Install Puppeteer and any other libraries you may require:

npm install puppeteer axios

Create the Main File: In your project folder, create a file named scraper.js. This file will contain your scraping code.

Choosing an IDE

Selecting an IDE can make coding easier. Some of the popular ones include:

Visual Studio Code: Popular editor with lots of extensions for working with JavaScript.
WebStorm: A powerful IDE specifically designed for JavaScript and web development, but it isn’t free.
Atom: A hackable text editor that is customizable and user-friendly.

Now that you have your environment set up and scraper.js created, let’s get started with scraping Zalando product listings.

Scraping Zalando Product Listings

After setting up the environment, we can start creating the scraper for Zalando product listings. We will scrape the handbags section from this URL:

https://en.zalando.de/catalogue/?q=handbags

We’ll extract the product page URL, title, brand name, price, and image URL from each listing. We will also handle pagination to go through multiple pages.

Inspecting the HTML for Selectors

First we have to inspect the HTML of the product listings page to find the correct selectors. Open the developer tools in your browser and navigate to the handbag listings.

You’ll typically look for elements like:

Product Page URL: This is the link to the individual product page.
Product Title: Usually in an h3 tag within a div element.
Brand Name: This may be found in an h3 tag within a div element.
Price: Found in a span tag with a price class.
Image URL: Contained in the img tag within each product card.

Writing the Zalando Product Listings Scraper

Now that you have the selectors, you can write a scraper to collect product listings. Here’s an example code snippet using Puppeteer:

const puppeteer = require('puppeteer');

// Function to scrape product listings from Zalando
async function scrapeProductListings(page) {
  await page.goto('https://en.zalando.de/catalogue/?q=handbags', { timeout: 0 });

  // Scraping product listings
  const products = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('div[data-zalon-partner-target="true"] > div.cYylcv.BaerYO')).map(
      (card) => {
        const title = card.querySelector('div.Zhr-fS h3:last-child')?.innerText; // Product title
        const brandName = card.querySelector('div.Zhr-fS h3:first-child')?.innerText; // Brand name
        const price = card.querySelector('span.sDq_FX.lystZ1')?.innerText; // Price
        const productUrl = card.querySelector('a')?.href; // Product URL
        const thumbnail = card.querySelector('img:first-child')?.src; // Image URL

        return { title, brandName, price, productUrl, thumbnail };
      },
    );
  });

  return products;
}

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const productListings = await scrapeProductListings(page);
  console.log('Product Listings:', productListings);

  await browser.close();
})();

Code Explanation:

scrapeProductListings Function: This function navigates to the Zalando listings page with the navigation timeout disabled (timeout: 0) and extracts each product’s title, brand name, price, URL, and image URL.
Data Collection: The function returns an array of product objects containing the scraped information.

Example Output:

Product Listings: [
  {
    title: 'Handbag - black',
    brandName: 'Anna Field',
    price: '34,99 €',
    productUrl: 'https://en.zalando.de/anna-field-handbag-black-an651h0x2-q11.html',
    thumbnail: 'https://img01.ztat.net/article/spp-media-p1/4ce13463cf9a4dda9828bfc44f65bb6e/45133485dd0c4b03b1b122f0deeb0801.jpg?imwidth=300&filter=packshot'
  },
  {
    title: 'LEATHER - Handbag - black',
    brandName: 'Zign',
    price: '49,99 €',
    productUrl: 'https://en.zalando.de/zign-handbag-black-zi151h08a-q11.html',
    thumbnail: 'https://img01.ztat.net/article/spp-media-p1/a86e1fd894b33f8388ed33009cb6cfd2/62c903c4162141fa8c1452be53635f02.jpg?imwidth=300&filter=packshot'
  },
  {
    title: 'NOELLE TOP ZIP SHOULDER BAG - Handbag - coal logo',
    brandName: 'Guess',
    price: '124,95 €',
    productUrl: 'https://en.zalando.de/guess-noelle-top-zip-shoulder-bag-handbag-coal-logo-gu151h4zp-c11.html',
    thumbnail: 'https://img01.ztat.net/article/spp-media-p1/b6c00ad1942e4b439808bf3099e035ab/38798e461de54ddfad6a33d6f1ab5e42.jpg?imwidth=300&filter=packshot'
  },
  .... more
]

Handling Pagination

To gather more listings, you need to handle pagination. Zalando uses the p parameter in the URL (for example, &p=2) to navigate between pages. Here’s how to modify your scraper to handle multiple pages:

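async function scrapeAllProductListings(page, totalPages) {
  let allProducts = [];

  for (let i = 1; i <= totalPages; i++) {
    const url = `https://en.zalando.de/catalogue/?q=handbags&p=${i}`;
    await page.goto(url, { timeout: 0 });

    // Extract the cards directly from the page that was just loaded.
    // Calling scrapeProductListings(page) here would navigate back to the
    // first page, because that function always opens its own fixed URL.
    const products = await page.evaluate(() => {
      return Array.from(document.querySelectorAll('div[data-zalon-partner-target="true"] > div.cYylcv.BaerYO')).map(
        (card) => ({
          title: card.querySelector('div.Zhr-fS h3:last-child')?.innerText, // Product title
          brandName: card.querySelector('div.Zhr-fS h3:first-child')?.innerText, // Brand name
          price: card.querySelector('span.sDq_FX.lystZ1')?.innerText, // Price
          productUrl: card.querySelector('a')?.href, // Product URL
          thumbnail: card.querySelector('img:first-child')?.src, // Image URL
        }),
      );
    });

    allProducts = allProducts.concat(products); // Combine products from all pages
  }

  return allProducts;
}

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const totalPages = 5; // Specify the total number of pages you want to scrape
  const allProductListings = await scrapeAllProductListings(page, totalPages);
  console.log('All Product Listings:', allProductListings);

  await browser.close();
})();

Code Explanation:

scrapeAllProductListings Function: This function loops through the specified number of pages, constructs the URL for each page, navigates to it, and extracts the product cards with the same selectors used in scrapeProductListings.
Pagination Handling: Products from all pages are combined into a single array.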

Storing Data in a JSON File

Finally, it’s useful to store the scraped data in a JSON file for later analysis. Here’s how to do that:

const puppeteer = require('puppeteer');
const fs = require('fs');

// Copy scrapeAllProductListings and scrapeProductListings functions from previous code snippets

// Function to save scraped data to a JSON file
function saveDataToJson(data, filename = 'zalando_product_listings.json') {
  fs.writeFileSync(filename, JSON.stringify(data, null, 2));
  console.log(`Data successfully saved to ${filename}`);
}

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const totalPages = 5; // Specify the total number of pages you want to scrape
  const allProductListings = await scrapeAllProductListings(page, totalPages);

  // Save scraped product listings to a JSON file
  saveDataToJson(allProductListings);

  await browser.close();
})();

Code Explanation:

saveDataToJson Function: This function saves the scraped product listings to a JSON file (zalando_product_listings.json) so you can easily access the data later.

Next up we will cover how to scrape product data from individual product pages.

Scraping Zalando Product Details

Now that you have scraped the listings, the next step is to gather data from individual product pages. This allows you to get more specific data like product descriptions, material details, and customer reviews, which are not available on the listing pages.

To scrape the product details, we’ll first inspect the structure of the product page and identify the relevant HTML elements that contain the data we need.

Inspecting the HTML for Selectors

Visit any individual product page from Zalando and use your browser’s developer tools to inspect the HTML structure.

You’ll typically need to find elements like:

Product Title: Usually within a <span> tag with classes like EKabf7 R_QwOV.
Brand Name: Usually within a <span> tag with classes like z2N-Fg yOtBvf.
Product Details: Located in a <div> with data-testid="pdp-accordion-details".
Price: In a <span> tag with classes like dgII7d Km7l2y.
Available Sizes: Often listed in a <div> with data-testid="pdp-accordion-size_fit".
Image URLs: Contained in the <img> tags within a <ul> with classes like XLgdq7 _0xLoFW.

Writing the Zalando Product Details Scraper

Once you have the correct selectors, you can write a scraper to collect product details like the title, description, price, available sizes, and image URLs.

Here’s an example code to scrape Zalando product details using Puppeteer:

const puppeteer = require('puppeteer');

// Function to scrape product details from a single product URL
async function scrapeProductDetails(page, productUrl) {
  await page.goto(productUrl, { timeout: 0 });

  // Selectors for the accordion buttons that expand the details and sizes sections
  const detailsButtonSelector = 'div[data-testid="pdp-accordion-details"] button';
  const sizesButtonSelector = 'div[data-testid="pdp-accordion-size_fit"] button';

  // Wait for the details button and click it
  await page.waitForSelector(detailsButtonSelector);
  await page.click(detailsButtonSelector);

  // Wait for the sizes button and click it
  await page.waitForSelector(sizesButtonSelector);
  await page.click(sizesButtonSelector);

  // Scraping product details
  const productDetails = await page.evaluate(() => {
    const title = document.querySelector('span.EKabf7.R_QwOV')?.innerText; // Product title
    const brandName = document.querySelector('span.z2N-Fg.yOtBvf')?.innerText; // Brand name
    const details = Object.fromEntries(
      Array.from(document.querySelectorAll('div[data-testid="pdp-accordion-details"] div.qMOFyE')).map((item) => [
        item.querySelector('dt')?.innerText.trim(),
        item.querySelector('dd')?.innerText.trim(),
      ]),
    ); // Product details
    const price = document.querySelector('span.dgII7d.Km7l2y')?.innerText; // Price
    const sizes = Object.fromEntries(
      Array.from(document.querySelectorAll('div[data-testid="pdp-accordion-size_fit"] div.qMOFyE')).map((item) => [
        item.querySelector('dt')?.innerText.trim(),
        item.querySelector('dd')?.innerText.trim(),
      ]),
    ); // Available Sizes
    const imageUrls = Array.from(document.querySelectorAll('ul.XLgdq7._0xLoFW li img')).map((img) => img.src); // Product images URL

    return { title, brandName, details, price, sizes, imageUrls };
  });

  return { url: productUrl, ...productDetails };
}

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const productUrls = [
    'https://en.zalando.de/anna-field-handbag-black-an651h0x2-q11.html',
    'https://en.zalando.de/zign-handbag-black-zi151h08a-q11.html',
    // Add more product URLs here
  ];

  const allProductDetails = [];

  for (const url of productUrls) {
    const details = await scrapeProductDetails(page, url);
    allProductDetails.push(details);
  }

  console.log('Product details scraped successfully:', allProductDetails);
  await browser.close();
})();

Code Explanation:

scrapeProductDetails Function: This function navigates to the product URL, waits for the content to load, and scrapes the product title, description, price, available sizes, and image URLs. To access the relevant content, the function first waits for the “Details” and “Sizes” buttons to become visible using await page.waitForSelector(), then clicks them with await page.click(). This expands the respective sections, enabling the extraction of their content.
Product URLs Array: This array contains the product page URLs you want to scrape.

Example Output:

Product details scraped successfully: [
  {
    url: 'https://en.zalando.de/anna-field-handbag-black-an651h0x2-q11.html',
    title: 'Handbag - black',
    brandName: 'Anna Field',
    details: { 'Fastening:': 'Zip', 'Pattern:': 'Plain', 'Details:': 'Buckle' },
    price: '108,95 €',
    sizes: {
      'Height:': '28 cm (Size One Size)',
      'Length:': '36 cm (Size One Size)',
      'Width:': '12 cm (Size One Size)'
    },
    imageUrls: [
      'https://img01.ztat.net/article/spp-media-p1/3359d0e0d8484d9ba930544c6c71a861/7859902ec50b4d88899541e3c1cf976b.jpg?imwidth=762',
      'https://img01.ztat.net/article/spp-media-p1/4ce13463cf9a4dda9828bfc44f65bb6e/45133485dd0c4b03b1b122f0deeb0801.jpg?imwidth=762&filter=packshot',
      'https://img01.ztat.net/article/spp-media-p1/4ce13463cf9a4dda9828bfc44f65bb6e/45133485dd0c4b03b1b122f0deeb0801.jpg?imwidth=156&filter=packshot',
      .... more
    ]
  },
  {
    url: 'https://en.zalando.de/zign-handbag-black-zi151h08a-q11.html',
    title: 'LEATHER - Handbag - black',
    brandName: 'Zign',
    details: { 'Fastening:': 'Zip', 'Pattern:': 'Plain' },
    price: '51,99 €',
    sizes: {
      'Height:': '25 cm (Size One Size)',
      'Length:': '36 cm (Size One Size)',
      'Width:': '11 cm (Size One Size)'
    },
    imageUrls: [
      'https://img01.ztat.net/article/spp-media-p1/a86e1fd894b33f8388ed33009cb6cfd2/62c903c4162141fa8c1452be53635f02.jpg?imwidth=762&filter=packshot',
      'https://img01.ztat.net/article/spp-media-p1/cb7586f888fe39bc8e160d909a2403e3/194701c057bb4c6595849a0ffe13da24.jpg?imwidth=762',
      'https://img01.ztat.net/article/spp-media-p1/a86e1fd894b33f8388ed33009cb6cfd2/62c903c4162141fa8c1452be53635f02.jpg?imwidth=156&filter=packshot',
      .... more
    ]
  }
]

Storing Data in a JSON File

After scraping the product details, it’s a good idea to save data in a JSON file. This makes it easier to access and analyze later. Here’s how to save the scraped product details to a JSON file.

const puppeteer = require('puppeteer');
const fs = require('fs');

// Copy scrapeProductDetails function from previous code snippet

// Function to save scraped data to a JSON file
function saveDataToJson(data, filename = 'zalando_product_details.json') {
  fs.writeFileSync(filename, JSON.stringify(data, null, 2));
  console.log(`Data successfully saved to ${filename}`);
}

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const productUrls = [
    'https://en.zalando.de/anna-field-handbag-black-an651h0x2-q11.html',
    'https://en.zalando.de/zign-handbag-black-zi151h08a-q11.html',
    // Add more product URLs here
  ];

  const allProductDetails = [];

  for (const url of productUrls) {
    const details = await scrapeProductDetails(page, url);
    allProductDetails.push(details);
  }

  // Save scraped product details to a JSON file
  saveDataToJson(allProductDetails);

  await browser.close();
})();

Code Explanation:

saveDataToJson Function: This function writes the scraped product details to a JSON file (zalando_product_details.json), formatted for easy reading.
Data Storage: After scraping the details, the data is passed to the function to be saved in a structured format.

In the next section, we’ll look at how you can optimize your scraper using Crawlbase Smart Proxy to avoid getting blocked while scraping.

Optimizing with Crawlbase Smart Proxy

When scraping Zalando, you might get blocked or throttled. To avoid this, use a proxy service. Crawlbase Smart Proxy helps you scrape safely and fast. Here’s how to integrate it into your Zalando scraper.

How to Use Crawlbase Smart Proxy with Puppeteer

Integrating Crawlbase Smart Proxy into your Puppeteer script is straightforward. You’ll need your Crawlbase API key to get started.

Here’s how to set it up:

Sign Up for Crawlbase: Go to the Crawlbase website and create an account. After signing up you’ll get an API Token.
Update Your Puppeteer Script: Modify your existing scraper to use the Crawlbase proxy.

Here’s an updated version of your Zalando product scraper with Crawlbase Smart Proxy:

const puppeteer = require('puppeteer');

const proxyServer = 'http://smartproxy.crawlbase.com:8012';
const proxyToken = '_USER_TOKEN_'; // Replace _USER_TOKEN_ with your token

async function scrapeProductDetails(page, productUrl) {
  await page.goto(productUrl, { timeout: 0 });

  // Scraping product details
  const productDetails = await page.evaluate(() => {
    const title = document.querySelector('h1')?.innerText;
    const description = document.querySelector('.product-description')?.innerText;
    const price = document.querySelector('.price')?.innerText;
    const sizes = Array.from(document.querySelectorAll('.size-options')).map((size) => size.innerText);
    const imageUrl = document.querySelector('img')?.src;

    return { title, description, price, sizes, imageUrl };
  });

  return { url: productUrl, ...productDetails };
}

(async () => {
  const browser = await puppeteer.launch({
    // Chrome ignores credentials embedded in --proxy-server, so pass only the host and port
    args: [`--proxy-server=${proxyServer}`], // Use Crawlbase proxy
  });
  const page = await browser.newPage();

  // Send the token as the proxy username with an empty password (same as http://TOKEN@host:port)
  await page.authenticate({ username: proxyToken, password: '' });

  const productUrls = [
    'https://en.zalando.de/anna-field-handbag-black-an651h0x2-q11.html',
    'https://en.zalando.de/zign-handbag-black-zi151h08a-q11.html',
    // Add more product URLs here
  ];

  const allProductDetails = [];

  for (const url of productUrls) {
    const details = await scrapeProductDetails(page, url);
    allProductDetails.push(details);
  }

  console.log('Product details scraped successfully:', allProductDetails);
  await browser.close();
})();

Code Explanation:

Proxy Setup: Replace _USER_TOKEN_ with your actual Crawlbase token. This tells Puppeteer to use the Crawlbase proxy for all requests.
Browser Launch Options: The args parameter in the puppeteer.launch() method specifies the proxy server to use. This way, all your requests go through the Crawlbase proxy.

Optimize your Zalando Scraper with Crawlbase

Scraping Zalando can provide useful information for your projects. In this blog, we showed you how to set up your Node.js environment and scrape product listings and details. Always check Zalando’s scraping rules to stay within their limits.

Using Puppeteer with Crawlbase Smart Proxy makes your scraping faster and more robust. Storing your data in JSON makes it easy to manage and analyze. Remember, website layouts can change, so keep your scrapers up to date.

If you’re interested in exploring scraping from other e-commerce platforms, feel free to explore the following comprehensive guides.

📜 How to Scrape Amazon
📜 How to scrape Walmart
📜 How to Scrape AliExpress
📜 How to Scrape Flipkart
📜 How to Scrape Etsy

If you have any questions or feedback, our support team is always available to assist you. Good luck with your web scraping journey!

Frequently Asked Questions

Q. Is scraping Zalando legal?

Scraping data from Zalando can have legal implications. Make sure to review the website’s terms of service to see what they say about data scraping. Some websites will explicitly not allow scraping, while others will allow it under certain conditions. By following the website’s rules, you can avoid legal issues and be ethical.

Q. What tools do I need to scrape Zalando?

To scrape Zalando, you need specific tools because the site uses JavaScript rendering. First, install Node.js, which allows you to run JavaScript code outside a browser. Then, use Puppeteer, a powerful library that controls a headless Chrome browser so you can interact with JavaScript-rendered content. Also, consider using Crawlbase Crawling API, which can help with IP rotation and bypassing blocks. Together, these tools will help you extract data from Zalando’s dynamic pages.

Q. Why use Crawlbase Smart Proxy while scraping Zalando?

Crawlbase Smart Proxy helps with Zalando scraping in several ways. It rotates IP addresses to mimic regular user behavior, which reduces the chance of being blocked by the website, so you can collect data continuously with fewer interruptions. It can also speed up your scraping, letting you collect data more efficiently.

How to Parse XML in Python

2024-09-24T10:00:00.000Z

XML (Extensible Markup Language) is a common format for storing and transferring data between different platforms and systems. As a Python developer working on web services, config files, or data transfer, you need to know how to parse XML files. You can use Python libraries to make XML parsing easy and fast.

This article will cover various ways to parse XML in Python, both built-in libraries and external tools. You’ll learn how to handle XML files of all sizes, convert XML to dictionaries and save parsed data to CSV and JSON. We’ll also look at parsing invalid or malformed XML with more lenient tools.

Let’s dive into the details of how to parse XML in Python.

What is XML?

XML, or Extensible Markup Language, is a data format for storing and exchanging data between different systems. It’s both human-readable and machine-readable, which is why it’s used in web services, configuration files, string translations, and more.

Why XML?

XML is used because it’s a flexible and organized way to represent complex data. Unlike CSV or plain text, XML allows you to create a hierarchy of elements and attributes, so it’s easier to understand and manipulate the data.

Here are a few reasons why XML is preferred:

Platform independence: XML can be used with any operating system and programming language.
Scalability: XML files can contain simple and complex data structures.
Readability: Non-developers can read and understand XML.

What is XML Parsing?

XML parsing is the process of reading and processing an XML document to extract data. In Python, parsing XML allows you to browse XML documents, extract data, and change it as needed. This is especially important when working with APIs or other data exchange systems that use XML as their protocol.

Python has built-in libraries and third-party tools to parse XML data, whether it’s a small config file or a big data source. In the next sections, we’ll see how.

Parsing XML Using Python’s Built-in Libraries

Python has powerful built-in libraries for working with XML data. These libraries allow you to parse XML files, extract what you need, and manipulate the data as required. Two popular libraries in Python for parsing XML are xml.etree.ElementTree and xml.dom.minidom.

Parsing XML with `xml.etree.ElementTree`

xml.etree.ElementTree is a lightweight library that ships with Python by default. It was designed to make parsing and navigating XML files easy.

For example, this is how you might use ElementTree to parse an XML string:

import xml.etree.ElementTree as ET

# Example XML data
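xml_data = """
<products>
    <product>
        <name>Wireless Mouse</name>
        <price>29.99</price>
        <category>Electronics</category>
    </product>
    <product>
        <name>Office Chair</name>
        <price>89.99</price>
        <category>Furniture</category>
    </product>
</products>
"""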

# Parse the XML data
root = ET.fromstring(xml_data)

# Access and print data
for product in root.findall('product'):
    name = product.find('name').text
    price = product.find('price').text
    category = product.find('category').text
    print(f"Product Name: {name}, Price: ${price}, Category: {category}")

In this example, we start by importing the ElementTree module. The fromstring() method parses the XML string and returns the root element. We then use findall() to locate every product element and find() to reach a child tag and read its text.
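
Real-world XML often has optional elements, and find() returns None when a tag is absent, so reading .text on the result raises an AttributeError. The following is a minimal sketch of a more defensive variant that uses findtext() with a default value and reads an attribute with get(). The sample data, including the id attribute and the missing category, is hypothetical and only serves the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second product has no <category> element
xml_data = """
<products>
    <product id="p1">
        <name>Wireless Mouse</name>
        <price>29.99</price>
        <category>Electronics</category>
    </product>
    <product id="p2">
        <name>Office Chair</name>
        <price>89.99</price>
    </product>
</products>
"""

root = ET.fromstring(xml_data)

rows = []
for product in root.findall('product'):
    rows.append({
        'id': product.get('id'),  # attribute lookup
        'name': product.findtext('name'),  # text of the <name> child
        'price': float(product.findtext('price', default='0')),
        'category': product.findtext('category', default='Unknown'),
    })

for row in rows:
    print(row)
```

Converting the price to float at parse time is a design choice; keep it as a string if you need to preserve the original formatting.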

Parsing XML with xml.dom.minidom

xml.dom.minidom is another built-in library that uses the Document Object Model (DOM) to parse and manipulate XML. It gives you node-level control over the document, but it is more verbose than ElementTree for most tasks.

The same XML data can be parsed with minidom as follows:

from xml.dom.minidom import parseString

# Example XML data
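xml_data = """
<products>
    <product>
        <name>Wireless Mouse</name>
        <price>29.99</price>
        <category>Electronics</category>
    </product>
    <product>
        <name>Office Chair</name>
        <price>89.99</price>
        <category>Furniture</category>
    </product>
</products>
"""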

# Parse the XML data
dom = parseString(xml_data)

# Access and print data
products = dom.getElementsByTagName('product')
for product in products:
    name = product.getElementsByTagName('name')[0].childNodes[0].nodeValue
    price = product.getElementsByTagName('price')[0].childNodes[0].nodeValue
    category = product.getElementsByTagName('category')[0].childNodes[0].nodeValue
    print(f"Product Name: {name}, Price: ${price}, Category: {category}")

In this example, parseString() is used to load the XML into a DOM object. We then use getElementsByTagName() to find the product, name, price, and category elements, and childNodes[0].nodeValue to extract the text. While minidom offers detailed control, it’s generally less efficient than ElementTree for simple tasks.
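
One caveat with childNodes[0].nodeValue is that it raises an IndexError when an element is empty or missing. A small helper can make this safer. The sketch below is one possible approach; get_text is a hypothetical helper name and the sample XML is invented for the illustration.

```python
from xml.dom.minidom import parseString

def get_text(parent, tag, default=''):
    """Return the text of the first <tag> under parent, or default if absent or empty."""
    nodes = parent.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return default
    return nodes[0].firstChild.nodeValue

# Hypothetical sample: <price/> is empty and <category> is missing
dom = parseString("<products><product><name>Desk Lamp</name><price/></product></products>")
product = dom.getElementsByTagName('product')[0]

name = get_text(product, 'name')
price = get_text(product, 'price', default='n/a')
category = get_text(product, 'category', default='n/a')
print(name, price, category)
```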

Working with External XML Parsing Libraries

For simple tasks, the built-in libraries are usually enough. For more complex requirements, or for better handling of malformed XML, an external library offers more features and flexibility. In this part, we will discuss two popular external XML parsing libraries: lxml and BeautifulSoup.

Parsing XML with lxml

lxml is a fast, feature-rich library for working with XML and HTML documents. Its strong support for XPath and XSLT makes it a powerful XML processor.

To get started with lxml, you’ll need to install it. You can do this via pip:

pip install lxml

Here’s an example of how to use lxml to parse XML data:

from lxml import etree

# Example XML data
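xml_data = """
<products>
    <product>
        <name>Wireless Mouse</name>
        <price>29.99</price>
        <category>Electronics</category>
    </product>
    <product>
        <name>Office Chair</name>
        <price>89.99</price>
        <category>Furniture</category>
    </product>
</products>
"""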

# Parse the XML data
root = etree.fromstring(xml_data)

# Access and print data
for product in root.xpath('//product'):
    name = product.find('name').text
    price = product.find('price').text
    category = product.find('category').text
    print(f"Product Name: {name}, Price: ${price}, Category: {category}")

In this example, we use lxml’s etree module to parse XML. With the xpath() method, you can write expressive queries that extract exactly the elements you need, even from a messy XML structure.

How to Handle Malformed XML with BeautifulSoup

BeautifulSoup is often used for parsing HTML, but it can also handle malformed XML gracefully. This makes it a good choice for dealing with XML documents that may not be well-formed.

To use BeautifulSoup for XML parsing, install the library along with a parser like lxml:

pip install beautifulsoup4 lxml

Here’s an example of using BeautifulSoup to parse XML:

from bs4 import BeautifulSoup

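# Example XML data that is malformed: the closing </products> tag is missing
xml_data = """
<products>
    <product>
        <name>Wireless Mouse</name>
        <price>29.99</price>
        <category>Electronics</category>
    </product>
    <product>
        <name>Office Chair</name>
        <price>89.99</price>
        <category>Furniture</category>
    </product>
"""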

# Parse the XML data
soup = BeautifulSoup(xml_data, 'lxml-xml')

# Access and print data
for product in soup.find_all('product'):
    name = product.find('name').get_text()
    price = product.find('price').get_text()
    category = product.find('category').get_text()
    print(f"Product Name: {name}, Price: ${price}, Category: {category}")

In this case, BeautifulSoup helps parse incomplete or broken XML documents. It is especially useful when you run into XML that isn’t well-formed.
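
To see why a lenient parser matters, it helps to look at how a strict parser reacts to the same kind of input. The sketch below uses the built-in ElementTree, which raises ParseError on XML that is not well-formed. The try_strict_parse helper and the sample string are hypothetical. A function like this could be a first attempt, with a fallback to a lenient parser when it returns None.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the closing </products> tag is missing
broken_xml = "<products><product><name>Wireless Mouse</name></product>"

def try_strict_parse(text):
    """Return the root element, or None if the XML is not well-formed."""
    try:
        return ET.fromstring(text)
    except ET.ParseError as err:
        print(f"Strict parser rejected the document: {err}")
        return None

root = try_strict_parse(broken_xml)
print("Parsed" if root is not None else "Needs a lenient parser")
```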

In the next part, we will look at how to transform XML data into Python dictionaries for better manipulation.

How to Convert XML to Dictionary in Python

Working with XML data can be difficult when you need to edit or extract specific elements. A common workaround is to convert the XML into a Python dictionary, which stores data as key-value pairs and makes it easier to work with. Let’s explore two popular libraries for this: xmltodict, which converts XML into a dictionary, and untangle, which converts it into Python objects.

Using xmltodict

xmltodict is a simple library that can convert XML data to a dictionary in a few lines of code. It simplifies and speeds up the processing of XML data.

To get started, you’ll need to install the library using pip:

pip install xmltodict

Here’s an example of how to use xmltodict to convert XML into a dictionary:

import xmltodict

# Example XML data
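xml_data = """
<store>
    <item>
        <name>Notebook</name>
        <price>5.99</price>
        <quantity>100</quantity>
    </item>
    <item>
        <name>Pencil</name>
        <price>0.99</price>
        <quantity>500</quantity>
    </item>
</store>
"""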

# Convert XML to a dictionary
data_dict = xmltodict.parse(xml_data)

# Access and print data
for item in data_dict['store']['item']:
    name = item['name']
    price = item['price']
    quantity = item['quantity']
    print(f"Item: {name}, Price: ${price}, Quantity: {quantity}")

In this example, xmltodict.parse() converts the XML data into a Python dictionary, allowing you to work with it as if it were a standard dictionary. This makes it much easier to retrieve and manipulate data from XML.

Using `untangle`

Another great library for parsing XML into Python objects is untangle. Unlike xmltodict, which converts XML into a dictionary, untangle turns the XML into Python objects that you can easily access through attributes.

First, install the library using pip:

pip install untangle

Here’s an example of how to use untangle:

import untangle

# Example XML data
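xml_data = """
<store>
    <item>
        <name>Notebook</name>
        <price>5.99</price>
        <quantity>100</quantity>
    </item>
    <item>
        <name>Pencil</name>
        <price>0.99</price>
        <quantity>500</quantity>
    </item>
</store>
"""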

# Parse XML into Python objects
data = untangle.parse(xml_data)

# Access and print data
for item in data.store.item:
    name = item.name.cdata
    price = item.price.cdata
    quantity = item.quantity.cdata
    print(f"Item: {name}, Price: ${price}, Quantity: {quantity}")

In this example, untangle converts the XML structure into Python objects. Each XML tag becomes an attribute of the object, and you can easily access the content using cdata (character data).

Next, we’ll look at how to save the parsed XML data into different formats like CSV or JSON for further use.

How to Save Parsed XML Data

After parsing XML data, you’ll often want to save it in a more familiar format such as CSV or JSON. This makes the data easier to store, exchange, and analyze in most applications. In this part, we’ll look at two ways to save parsed XML data: exporting it to CSV with pandas and saving it as JSON.

Exporting to CSV with `pandas`

CSV (Comma-Separated Values) files are commonly used to store tabular data. Python’s pandas package makes it simple to save parsed XML data to a CSV file. To get started, make sure pandas is installed:

pip install pandas

Here’s an example of how to convert XML data into a CSV file using pandas:

import xml.etree.ElementTree as ET
import pandas as pd

# Example XML data
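xml_data = """
<store>
    <item>
        <name>Notebook</name>
        <price>5.99</price>
        <quantity>100</quantity>
    </item>
    <item>
        <name>Pencil</name>
        <price>0.99</price>
        <quantity>500</quantity>
    </item>
</store>
"""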

# Parse XML
root = ET.fromstring(xml_data)

# Extract data and create a list of dictionaries
data = []
for item in root.findall('item'):
    name = item.find('name').text
    price = item.find('price').text
    quantity = item.find('quantity').text
    data.append({'Name': name, 'Price': price, 'Quantity': quantity})

# Convert list of dictionaries to a pandas DataFrame
df = pd.DataFrame(data)

# Save DataFrame to a CSV file
df.to_csv('store_items.csv', index=False)

print("Data has been saved to store_items.csv")

In this example, we use xml.etree.ElementTree to parse the XML data, and then we extract relevant information (like name, price, and quantity) into a list of dictionaries. pandas is then used to create a DataFrame and save the data to a CSV file.
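
If you would rather not add pandas as a dependency, the standard library’s csv module can produce the same kind of file. This is an alternative technique, not the pandas approach shown above. The sketch writes to an in-memory buffer so it runs without touching the disk, and the sample data is a compact, hypothetical version of the store example.

```python
import csv
import io
import xml.etree.ElementTree as ET

xml_data = """
<store>
    <item><name>Notebook</name><price>5.99</price><quantity>100</quantity></item>
    <item><name>Pencil</name><price>0.99</price><quantity>500</quantity></item>
</store>
"""

root = ET.fromstring(xml_data)

# One dictionary per <item>, keyed by child tag name
rows = [
    {child.tag: child.text for child in item}
    for item in root.findall('item')
]

# In-memory buffer for the demo; use open('store_items.csv', 'w', newline='') for a real file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['name', 'price', 'quantity'])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```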

Saving Data to JSON

JSON (JavaScript Object Notation) is a lightweight data format used in web applications and APIs. Python has a built-in module called json that can convert parsed XML to JSON.

Here’s how to convert XML to JSON and save to a file:

import xmltodict
import json

# Example XML data
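xml_data = """
<store>
    <item>
        <name>Notebook</name>
        <price>5.99</price>
        <quantity>100</quantity>
    </item>
    <item>
        <name>Pencil</name>
        <price>0.99</price>
        <quantity>500</quantity>
    </item>
</store>
"""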

# Convert XML to a dictionary using xmltodict
data_dict = xmltodict.parse(xml_data)

# Convert dictionary to JSON and save to a file
with open('store_items.json', 'w') as json_file:
    json.dump(data_dict, json_file, indent=4)

print("Data has been saved to store_items.json")

In this example, we use xmltodict to convert the XML to a dictionary and then the json module to convert that dictionary to JSON. The JSON is saved to a file called store_items.json.
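
The same conversion can be sketched with only the standard library, which may be handy when installing xmltodict is not an option. The element_to_dict helper below is hypothetical and deliberately simple: it ignores attributes and mixed content, and it turns repeated child tags into lists. Its output is therefore similar to, but not guaranteed identical to, what xmltodict produces.

```python
import json
import xml.etree.ElementTree as ET

def element_to_dict(element):
    """Convert an element to a dict; repeated child tags become lists."""
    children = list(element)
    if not children:
        return element.text
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # Second occurrence of a tag: promote the entry to a list
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result

xml_data = (
    "<store>"
    "<item><name>Notebook</name><price>5.99</price></item>"
    "<item><name>Pencil</name><price>0.99</price></item>"
    "</store>"
)

root = ET.fromstring(xml_data)
data = {root.tag: element_to_dict(root)}
json_text = json.dumps(data, indent=4)
print(json_text)
```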

Next, we will cover how to handle large XML files.

Handling Large XML Files

Loading the entire file into memory can be slow and inefficient when dealing with large XML files. To address this, it’s better to use memory-friendly strategies that allow for processing the XML in smaller chunks. One effective way is to parse the XML file incrementally, reducing memory usage and speeding up processing time for large datasets.

Stream Parsing with `iterparse`

Stream parsing is an efficient technique for handling large XML files by processing them in chunks, instead of reading the whole file at once. The iterparse() function in xml.etree.ElementTree allows you to process XML data as it is being parsed, making it ideal for XML files that are too large to fit into memory.

Here’s how iterparse works:

Parse events: With iterparse, you can define events like ‘start’ or ‘end’ to trigger actions when an XML element starts or ends. This gives you control over how and when each part of the XML is processed.
Memory management: After processing each element, you can clear it from memory to minimize memory usage, which is crucial when handling large XML files.

Example:

import xml.etree.ElementTree as ET

# Stream parse the XML file
for event, element in ET.iterparse('large_file.xml', events=('end',)):
    if element.tag == 'product':
        # Extract product data
        name = element.find('name').text
        category = element.find('category').text
        price = element.find('price').text
        print(f"Product: {name}, Category: {category}, Price: {price}")

        # Clear the processed element from memory
        element.clear()

This example processes each element individually and then destroys the object to keep memory usage down. This method is very helpful when dealing with XML files with thousands or millions of elements.
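
One detail worth knowing: element.clear() empties the processed element, but the root still keeps a reference to each (now empty) child, so memory can grow slowly on very large files. A common refinement is to grab the root from the first 'start' event and clear it after each record. The sketch below demonstrates this with a small in-memory stand-in for a large file. The sample data is hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Small in-memory stand-in for a large file (hypothetical data)
xml_bytes = b"""<products>
    <product><name>Wireless Mouse</name><price>29.99</price></product>
    <product><name>Office Chair</name><price>89.99</price></product>
</products>"""

names = []
context = ET.iterparse(io.BytesIO(xml_bytes), events=('start', 'end'))
_, root = next(context)  # the first 'start' event yields the root element

for event, element in context:
    if event == 'end' and element.tag == 'product':
        names.append(element.findtext('name'))
        root.clear()  # drop processed children so the root does not keep growing

print(names)
```

For a real file, pass the file path to iterparse() instead of the BytesIO object.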

Final Thoughts

Python has multiple flexible tools for dealing with XML, from the built-in xml.etree.ElementTree to more advanced third-party packages like lxml and BeautifulSoup. Whether you need simple parsing, XML-to-dictionary conversion, or handling of large and malformed files, there is a library that fits.

With the right tool, you can parse XML quickly and export it to CSV or JSON. Using the methods discussed in this blog, you can easily handle XML parsing in Python.

For more tutorials like these, follow our blog. If you have any questions or feedback, our support team is here to help you.

Frequently Asked Questions (FAQs)

Q. Is Python good for parsing XML?

Yes, Python is excellent for parsing XML. Built-in libraries like xml.etree.ElementTree and xml.dom.minidom make XML parsing easy and efficient. Third-party libraries like lxml and BeautifulSoup offer more advanced features and are better suited to complex or malformed XML data.

Q. What is the best Python library for XML parsing?

The best library depends on your needs. ElementTree is often enough for simple tasks. lxml or BeautifulSoup is more suitable if you need to handle poorly formed XML or want faster processing.

Q. How can I convert XML to a dictionary in Python?

Use libraries like xmltodict or untangle to convert XML into Python data structures. xmltodict returns a dictionary, while untangle returns Python objects, and both let you access the data easily.

Scrape Rotten Tomatoes to Find Movie Ratings

2024-09-18T11:00:00.000Z

Rottentomatoes.com is a popular website for movie ratings and reviews. The platform offers valuable information on movies, TV shows and even audience opinions. Rotten Tomatoes has data for movie lovers, researchers, and developers.

This blog will show you how to scrape Rotten Tomatoes to get movie ratings using Python. Since Rotten Tomatoes uses JavaScript rendering, we will use the Crawlbase Crawling API to handle the dynamic content loading. By the end of this blog, you will know how to extract key movie data like ratings, release dates, and reviews and store it in a structured format like JSON.

Now, let’s dive into the process step-by-step!

Why Scrape Rotten Tomatoes for Movie Ratings?
Key Data Points to Extract from Rotten Tomatoes
Crawlbase Crawling API for Rotten Tomatoes Scraping
Setting Up Your Python Environment

Installing Python and Required Libraries
Setting Up a Virtual Environment
Choosing an IDE

Scraping Rotten Tomatoes Movie Listings

Inspecting the HTML Structure
Writing the Rotten Tomatoes Movie Listings Scraper
Handling Pagination
Storing Data in a JSON File
Complete Code Example

Scraping Rotten Tomatoes Movie Details

Inspecting the HTML Structure
Writing the Rotten Tomatoes Movie Details Scraper
Storing Data in a JSON File
Complete Code Example

Final Thoughts
Frequently Asked Questions

Why Scrape Rotten Tomatoes for Movie Ratings?

Rotten Tomatoes is a reliable source for movie ratings and reviews so it’s a good website to scrape for movie data. Whether you’re a movie lover, a data analyst or a developer, you can scrape Rotten Tomatoes to gain insights into movie trends, audience preferences and critic ratings.

Here are a few reasons why you should scrape Rotten Tomatoes for movie ratings.

Access to Critic Reviews: Rotten Tomatoes provides critic reviews, allowing you to see how individual professionals perceive the movie.
Audience Scores: Get audience ratings, which show how the general public feels about a movie.
Film Details: Rotten Tomatoes offers data such as titles, genres, release dates, and more.
Popularity Tracking: By scraping ratings over time, you can monitor trends in genres, directors, or actors’ popularity.
Create a Personal Movie Database: Gather ratings and reviews to build a custom database for research, recommendations, or personal projects.

Key Data Points to Extract from Rotten Tomatoes

When scraping movie ratings from Rotten Tomatoes, focus on the most important data points. They give you the essential information about a movie’s reception, popularity and performance. Here are the data points you should extract:

Movie Title: The title is the first piece of information to capture. It lets you tie every other scraped value to a specific film.
Critic Score (Tomatometer): Rotten Tomatoes has a critics score known as the ‘Tomatometer’. It is computed from reviews by approved critics and gives an overall picture of a movie’s critical consensus.
Audience Score: The audience score reflects how the general public rates the movie. It is useful for comparing professional and public opinion on a film.
Number of Reviews: Both critic and audience scores are calculated from the reviews available. Extract the review counts as well so you can judge how reliable each score is.
Release Date: The release date can be used to compare movies from different years and analyze changes over time.
Genre: Rotten Tomatoes classifies movies by genre, such as drama, action, or comedy. Genres let you group movie ratings according to viewers’ interests.
Movie Synopsis: The brief description or synopsis gives you background on the movie’s storyline and themes.
Cast and Crew: Rotten Tomatoes has the cast and the crew list of the movies. This data is useful to track movies of a particular director, actor or writer.
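One way to keep these data points organized is a small record type. The sketch below is only an illustration: the class name MovieRecord and its field names are assumptions made for this example, not a schema that Rotten Tomatoes or Crawlbase defines.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class MovieRecord:
    """One scraped movie. Field names are illustrative, not an official schema."""
    title: str
    critics_score: Optional[int] = None   # Tomatometer, 0-100
    audience_score: Optional[int] = None  # audience score, 0-100
    review_count: Optional[int] = None    # None until the count has been scraped
    release_date: Optional[str] = None
    genres: list = field(default_factory=list)
    synopsis: str = ""
    cast_and_crew: dict = field(default_factory=dict)


# Values taken from the example output shown later in this post.
record = MovieRecord(
    title="Beetlejuice Beetlejuice",
    critics_score=77,
    audience_score=81,
    genres=["Comedy", "Fantasy"],
)
print(asdict(record))
```

Fields that have not been scraped yet stay at their defaults, and asdict() turns the record into a plain dictionary that can be written to JSON.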

Crawlbase Crawling API for Rotten Tomatoes Scraping

We will use the Crawlbase Crawling API to get movie ratings and other data from Rotten Tomatoes. Scraping Rotten Tomatoes with simple approaches is hard because the website loads its content dynamically using JavaScript. The Crawlbase Crawling API is designed to handle dynamic, JavaScript-heavy websites, which makes it a fast and easy way to scrape Rotten Tomatoes.

Why Scrape Rotten Tomatoes with Crawlbase?

JavaScript is used to load dynamic content from Rotten Tomatoes pages, including audience scores, reviews, and ratings. Such websites are tough for standard web scraping libraries like requests to handle since they cannot manage JavaScript rendering. The Crawlbase Crawling API solves this issue by making sure you get fully-loaded HTML, complete with all necessary data, through server-side JavaScript rendering.

Here’s why Crawlbase is a solid choice for scraping Rotten Tomatoes:

JavaScript Rendering: It automatically deals with pages that depend on JavaScript to load content, like ratings or reviews.
Built-in Proxies: Crawlbase includes rotating proxies to avoid IP blocks and captchas, keeping your scraping smooth.
Customizable Parameters: You can adjust API parameters like ajax_wait and page_wait to make sure every piece of content is fully loaded before you start scraping.
Reliable and Fast: Designed for efficiency, Crawlbase lets you scrape large datasets from Rotten Tomatoes quickly, with minimal interruptions.
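As a rough sketch of how the customizable parameters fit together, the helper below builds the options dictionary that is passed alongside the URL. The function name build_crawl_options and its defaults are assumptions for illustration. The parameter names ajax_wait, page_wait and css_click_selector are the ones used in this post, and the values are strings, matching the scraper code in this post.

```python
def build_crawl_options(page_wait_ms=5000, ajax_wait=True, click_selector=None):
    """Build an options dict for the Crawling API (hypothetical helper)."""
    options = {
        # Wait for AJAX requests to finish before the HTML is captured.
        'ajax_wait': 'true' if ajax_wait else 'false',
        # Extra time, in milliseconds, to let the page render.
        'page_wait': str(page_wait_ms),
    }
    if click_selector:
        # Optional: an element to click before capture, e.g. a "Load More" button.
        options['css_click_selector'] = click_selector
    return options


print(build_crawl_options())
print(build_crawl_options(click_selector='button[data-qa="dlp-load-more-button"]'))
```

Keeping the options in one place makes it easier to tune the wait time for slow pages without editing every request.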

Crawlbase Python Library

Crawlbase provides its own Python library for simplicity. To use it, you’ll need an access token from Crawlbase, which you can obtain by registering an account.

Here’s a sample function to send requests with the Crawlbase Crawling API:

from crawlbase import CrawlingAPI

# Initialize Crawlbase API with your token
crawling_api = CrawlingAPI({'token': 'YOUR_CRAWLBASE_TOKEN'})

def make_crawlbase_request(url):
    response = crawling_api.get(url)

    if response['headers']['pc_status'] == '200':
        return response['body'].decode('utf-8')
    else:
        print(f"Failed to fetch the page. Crawlbase status code: {response['headers']['pc_status']}")
        return None

Note: For static sites, use the Normal Token. For dynamic sites like Rotten Tomatoes, use the JS Token. Crawlbase offers 1,000 free requests to get started, with no credit card needed. For more details, refer to the Crawlbase Crawling API documentation.

In the next sections we will cover how to set up your Python environment and the code to scrape movie data using Crawlbase Crawling API.

Setting Up Your Python Environment

To scrape data from Rotten Tomatoes, the first thing you’ll need to do is set up your Python environment. This includes installing Python, creating a virtual environment, and ensuring all the libraries you need are in place.

Installing Python

The first step is making sure Python is installed on your system. Head over to the official Python website to download the latest version. Remember to choose the version that matches your operating system (Windows, macOS or Linux).

Setting Up a Virtual Environment

Using a virtual environment is a smart way to manage your project dependencies. It helps you keep things clean and prevents conflicts with other projects. Here’s how you can do it:

Open your terminal or command prompt.
Go to your project folder.
Run the following command to create a virtual environment:
python -m venv myenv
Activate the virtual environment:

On Windows:
myenv\Scripts\activate
On macOS/Linux:
source myenv/bin/activate
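To confirm that the activation worked, you can ask Python itself. This is a minimal check using only the standard library; the function name in_virtual_environment is made up for this example.

```python
import sys


def in_virtual_environment():
    """Return True when the interpreter is running inside a venv.

    Inside a venv, sys.prefix points at the environment directory while
    sys.base_prefix still points at the base Python installation.
    """
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtual_environment())
print("Interpreter prefix:", sys.prefix)
```

If the first line prints False, the environment is not active in the current terminal session.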

Installing Required Libraries

Next you’ll need to install the required libraries, including Crawlbase and BeautifulSoup for data handling. Run the following command in your terminal:

pip install crawlbase beautifulsoup4

Crawlbase: Used to interact with the Crawlbase products including Crawling API.
BeautifulSoup: For parsing HTML and extracting the required data.
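A quick way to verify the installation is to check that both packages can be found by the interpreter. The sketch below uses importlib from the standard library; the helper name missing_packages and the REQUIRED mapping are assumptions for this example. Note that beautifulsoup4 is imported under the module name bs4.

```python
import importlib.util

# Maps the importable module name to the name used with pip.
REQUIRED = {"crawlbase": "crawlbase", "bs4": "beautifulsoup4"}


def missing_packages(required=REQUIRED):
    """Return the pip names of required packages that cannot be imported."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages()
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All required packages are installed.")
```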

Choosing an IDE

To write and run your Python scripts smoothly, use a good IDE (Integrated Development Environment). Here are a few:

Visual Studio Code: Lightweight and highly customizable. Many developers love it.
PyCharm: Full of features, built for Python.
Jupyter Notebook: For interactive coding and quick tests.

Now that you have your environment set up, let’s start scraping Rotten Tomatoes for movie ratings with Python. In the next section we’ll get into the code that will extract movie ratings and other info from Rotten Tomatoes.

Scraping Rotten Tomatoes Movie Listings

Here we will scrape movie listings from Rotten Tomatoes. We will look at the HTML, write the scraper, handle pagination and organize the data. We will use the Crawlbase Crawling API for JavaScript and dynamic content.

Inspecting the HTML Structure

Before we write the scraper we need to inspect the Rotten Tomatoes page to see what the structure looks like. This will help us know what to target.

Open the Rotten Tomatoes page: Go to the page you want to scrape. In this example, we are scraping the Top Box Office movies list.
Open Developer Tools: Right-click on the page and select “Inspect” or press Ctrl+Shift+I (Windows) or Cmd+Option+I (Mac).

Identify the Movie Container: Movies on Rotten Tomatoes are usually inside <div> elements with class flex-container, which sit inside a parent <div> with the attribute data-qa="discovery-media-list".
Locate Key Data:
- Title: Usually inside a <span> with an attribute like data-qa="discovery-media-list-item-title". This is the movie title.
- Critics Score: Inside an rt-text element with slot="criticsScore". This is the critics score for the movie.
- Audience Score: Also inside an rt-text element but with slot="audienceScore". This is the audience score for the movie.
- Link: The movie link is usually inside an <a> tag with a data-qa attribute that starts with discovery-media-list-item. You can extract the href attribute from this element to get the link to the movie’s page.

Writing the Rotten Tomatoes Movie Listings Scraper

Now we know the structure, let’s write the scraper. We will use Crawlbase Crawling API to fetch the HTML and BeautifulSoup to parse the page and extract titles, ratings and links.

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup

# Initialize Crawlbase API with your token
crawling_api = CrawlingAPI({'token': 'CRAWLBASE_JS_TOKEN'})

def fetch_html(url):
    options = {
        'ajax_wait': 'true',
        'page_wait': '5000'
    }
    response = crawling_api.get(url, options)
    if response['headers']['pc_status'] == '200':
        return response['body'].decode('utf-8')
    else:
        print(f"Failed to fetch data. Status code: {response['headers']['pc_status']}")
        return None

def parse_movies(html):
    soup = BeautifulSoup(html, 'html.parser')
    movies = soup.select('div[data-qa="discovery-media-list"] > div.flex-container')

    movie_data = []
    for movie in movies:
        title = movie.select_one('span[data-qa="discovery-media-list-item-title"]').text.strip() if movie.select_one('span[data-qa="discovery-media-list-item-title"]') else ''
        criticsScore = movie.select_one('rt-text[slot="criticsScore"]').text.strip() if movie.select_one('rt-text[slot="criticsScore"]') else ''
        audienceScore = movie.select_one('rt-text[slot="audienceScore"]').text.strip() if movie.select_one('rt-text[slot="audienceScore"]') else ''
        link = movie.select_one('a[data-qa^="discovery-media-list-item"]')['href'] if movie.select_one('a[data-qa^="discovery-media-list-item"]') else ''

        movie_data.append({
            'title': title,
            'critics_score': criticsScore,
            'audience_score': audienceScore,
            'link': 'https://www.rottentomatoes.com' + link
        })

    return movie_data
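One detail worth handling is the link. The scraper above prepends the site address to whatever href it found, so a listing without a link ends up pointing at the home page. A possible alternative, sketched here with the standard library, is to build the absolute URL with urljoin and keep missing links empty. The helper name absolute_link is an assumption for this example.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.rottentomatoes.com'


def absolute_link(href):
    """Return an absolute URL, or '' when the listing had no link."""
    if not href:
        return ''
    # urljoin resolves relative paths and leaves absolute URLs untouched.
    return urljoin(BASE_URL, href)


print(absolute_link('/m/twisters'))
print(repr(absolute_link('')))
```

Inside parse_movies, the 'link' value could then be written as absolute_link(link).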

Handling Pagination

Rotten Tomatoes uses button-based pagination for its movie listings. We need to handle pagination by clicking the “Load More” button. The Crawlbase Crawling API lets us do this with the css_click_selector parameter.

def fetch_html_with_pagination(url):
    options = {
        'ajax_wait': 'true',
        'page_wait': '5000',
        'css_click_selector': 'button[data-qa="dlp-load-more-button"]'  # CSS Selector for "Load More" button
    }
    response = crawling_api.get(url, options)
    if response['headers']['pc_status'] == '200':
        return response['body'].decode('utf-8')
    else:
        print(f"Failed to fetch data. Status code: {response['headers']['pc_status']}")
        return None

This code will click the “Load More” button to load more movie listings before scraping the data.

Storing Data in a JSON File

After scraping the movie data, we can save it to a file in a structured format like JSON.

import json

def save_to_json(data, filename='movies.json'):
    with open(filename, 'w') as file:
        json.dump(data, file, indent=4)
    print(f"Data saved to {filename}")
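If you plan to read the file back later, it can help to write it with an explicit encoding so titles with non-ASCII characters survive the round trip. The sketch below is a variant of the function above, not a replacement required by Crawlbase. The load_from_json helper and the temporary file name are assumptions for this example.

```python
import json
import os
import tempfile


def save_to_json(data, filename='movies.json'):
    # ensure_ascii=False keeps accented characters readable in the file.
    with open(filename, 'w', encoding='utf-8') as file:
        json.dump(data, file, indent=4, ensure_ascii=False)


def load_from_json(filename='movies.json'):
    """Read back a list of movie dictionaries saved by save_to_json."""
    with open(filename, 'r', encoding='utf-8') as file:
        return json.load(file)


sample = [{'title': 'Alien: Romulus', 'critics_score': '80%', 'audience_score': '85%'}]
path = os.path.join(tempfile.gettempdir(), 'movies_roundtrip.json')
save_to_json(sample, path)
print(load_from_json(path) == sample)
```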

Complete Code Example

Here is the full code that brings it all together, fetching the HTML, parsing the movie data, handling pagination and saving the results to a JSON file.

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import json

# Initialize Crawlbase API with your token
crawling_api = CrawlingAPI({'token': 'CRAWLBASE_JS_TOKEN'})

def fetch_html_with_pagination(url):
    options = {
        'ajax_wait': 'true',
        'page_wait': '5000',
        'css_click_selector': 'button[data-qa="dlp-load-more-button"]'  # CSS Selector for "Load More" button
    }
    response = crawling_api.get(url, options)
    if response['headers']['pc_status'] == '200':
        return response['body'].decode('utf-8')
    else:
        print(f"Failed to fetch data. Status code: {response['headers']['pc_status']}")
        return None

def parse_movies(html):
    soup = BeautifulSoup(html, 'html.parser')
    movies = soup.select('div[data-qa="discovery-media-list"] > div.flex-container')

    movie_data = []
    for movie in movies:
        title = movie.select_one('span[data-qa="discovery-media-list-item-title"]').text.strip() if movie.select_one('span[data-qa="discovery-media-list-item-title"]') else ''
        criticsScore = movie.select_one('rt-text[slot="criticsScore"]').text.strip() if movie.select_one('rt-text[slot="criticsScore"]') else ''
        audienceScore = movie.select_one('rt-text[slot="audienceScore"]').text.strip() if movie.select_one('rt-text[slot="audienceScore"]') else ''
        link = movie.select_one('a[data-qa^="discovery-media-list-item"]')['href'] if movie.select_one('a[data-qa^="discovery-media-list-item"]') else ''

        movie_data.append({
            'title': title,
            'critics_score': criticsScore,
            'audience_score': audienceScore,
            'link': 'https://www.rottentomatoes.com' + link
        })

    return movie_data

def save_to_json(data, filename='movies.json'):
    with open(filename, 'w') as file:
        json.dump(data, file, indent=4)
    print(f"Data saved to {filename}")

if __name__ == "__main__":
    url = 'https://www.rottentomatoes.com/browse/movies_in_theaters/sort:top_box_office'
    html_content = fetch_html_with_pagination(url)
    if html_content:
        movies_data = parse_movies(html_content)
        save_to_json(movies_data)

Example Output:

[
    {
        "title": "Beetlejuice Beetlejuice",
        "critics_score": "77%",
        "audience_score": "81%",
        "link": "https://www.rottentomatoes.com/m/beetlejuice_beetlejuice"
    },
    {
        "title": "Deadpool & Wolverine",
        "critics_score": "79%",
        "audience_score": "95%",
        "link": "https://www.rottentomatoes.com/m/deadpool_and_wolverine"
    },
    {
        "title": "Alien: Romulus",
        "critics_score": "80%",
        "audience_score": "85%",
        "link": "https://www.rottentomatoes.com/m/alien_romulus"
    },
    {
        "title": "It Ends With Us",
        "critics_score": "57%",
        "audience_score": "91%",
        "link": "https://www.rottentomatoes.com/m/it_ends_with_us"
    },
    {
        "title": "The Forge",
        "critics_score": "73%",
        "audience_score": "99%",
        "link": "https://www.rottentomatoes.com/m/the_forge"
    },
    {
        "title": "Twisters",
        "critics_score": "75%",
        "audience_score": "91%",
        "link": "https://www.rottentomatoes.com/m/twisters"
    },
    {
        "title": "Blink Twice",
        "critics_score": "74%",
        "audience_score": "69%",
        "link": "https://www.rottentomatoes.com/m/blink_twice"
    },
    ... more
]
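The scores come back as strings such as "77%", so a small conversion step is needed before you can compare them. As an illustration, the sketch below ranks movies by the gap between audience and critics scores, using three entries from the output above. The function names score_to_int and rank_by_score_gap are made up for this example.

```python
def score_to_int(score):
    """Convert '77%' to 77. Empty or malformed strings become None."""
    digits = score.strip().rstrip('%')
    return int(digits) if digits.isdigit() else None


def rank_by_score_gap(movies):
    """Sort movies by audience score minus critics score, largest gap first."""
    rows = []
    for movie in movies:
        critics = score_to_int(movie.get('critics_score', ''))
        audience = score_to_int(movie.get('audience_score', ''))
        if critics is None or audience is None:
            continue  # skip movies with a missing score
        rows.append((movie['title'], audience - critics))
    return sorted(rows, key=lambda row: row[1], reverse=True)


movies = [
    {'title': 'It Ends With Us', 'critics_score': '57%', 'audience_score': '91%'},
    {'title': 'The Forge', 'critics_score': '73%', 'audience_score': '99%'},
    {'title': 'Blink Twice', 'critics_score': '74%', 'audience_score': '69%'},
]
for title, gap in rank_by_score_gap(movies):
    print(f"{title}: {gap:+d}")
# It Ends With Us: +34
# The Forge: +26
# Blink Twice: -5
```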

In the next section, we will discuss scraping individual movie details.

Scraping Rotten Tomatoes Movie Details

Now, we’ll move on to scraping individual movie details from Rotten Tomatoes. Once you have the movie listings, you need to extract detailed information for each movie, like release date, director, genre etc. This part will walk you through inspecting the HTML of a movie page, writing the scraper and saving the data to a JSON file.

Inspecting the HTML Structure

Before we start writing the scraper, we need to inspect the HTML structure of a specific movie’s page.

Open Movie Page: Go to a movie’s page from the list you scraped earlier.
Open Developer Tools: Right-click on the page and select “Inspect” or press Ctrl+Shift+I (Windows) or Cmd+Option+I (Mac).
Locate Key Data:

Title: Usually in an <h1> with a slot="titleIntro" attribute.
Synopsis: In a <div> with a class of synopsis-wrap and an rt-text element without the .key class.
Movie Details: These are in a list format with <dt> elements (keys) and <dd> elements (values). The data is usually in rt-link and rt-text tags.

Writing the Rotten Tomatoes Movie Details Scraper

Now that we have the HTML structure, we can write the scraper to get the details from each movie page. Let’s write the scraper code.

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup

# Initialize Crawlbase API with your token
crawling_api = CrawlingAPI({'token': 'CRAWLBASE_JS_TOKEN'})

def fetch_html(url):
    options = {
        'ajax_wait': 'true',
        'page_wait': '5000'
    }
    response = crawling_api.get(url, options)
    if response['headers']['pc_status'] == '200':
        return response['body'].decode('utf-8')
    else:
        print(f"Failed to fetch data. Status code: {response['headers']['pc_status']}")
        return None

def fetch_movie_details(movie_url):
    html = fetch_html(movie_url)
    if html:
        soup = BeautifulSoup(html, 'html.parser')

        # Extract movie title
        title_tag = soup.select_one('h1[slot="titleIntro"]')
        title = title_tag.text.strip() if title_tag else 'N/A'

        # Extract synopsis
        synopsis = soup.select_one('div.synopsis-wrap rt-text:not(.key)').text.strip() if soup.select_one('div.synopsis-wrap rt-text:not(.key)') else 'N/A'

        # Get all movie details
        movie_details = {dt.text.strip(): ', '.join([item.text.strip() for item in dd.find_all(['rt-link', 'rt-text']) if item.name != 'rt-text' or 'delimiter' not in item.get('class', [])]) for dt, dd in zip(soup.select('dt.key rt-text'), soup.select('dd'))}

        # Return the collected details as a dictionary
        return {
            'title': title,
            'synopsis': synopsis,
            'movie_details': movie_details
        }
    else:
        print("Failed to fetch movie details.")
        return None

The fetch_movie_details function fetches the movie title, the synopsis, and a dictionary of movie details (such as director, genre and release date) from the movie URL. It uses BeautifulSoup for HTML parsing and structures the data into a dictionary.

Storing Movie Details in a JSON File

After scraping the movie details, you will want to save the data in a structured format like JSON. Here is the code to save the movie details to a JSON file.

import json

def save_movie_details_to_json(movie_data, filename='movie_details.json'):
    with open(filename, 'w') as file:
        json.dump(movie_data, file, indent=4)
    print(f"Movie details saved to {filename}")

Complete Code Example

Here is the complete code to scrape movie details from Rotten Tomatoes and save to a JSON file.

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import json

# Initialize Crawlbase API with your token
crawling_api = CrawlingAPI({'token': 'CRAWLBASE_JS_TOKEN'})

def fetch_html(url):
    options = {
        'ajax_wait': 'true',
        'page_wait': '5000'
    }
    response = crawling_api.get(url, options)
    if response['headers']['pc_status'] == '200':
        return response['body'].decode('utf-8')
    else:
        print(f"Failed to fetch data. Status code: {response['headers']['pc_status']}")
        return None

def fetch_movie_details(movie_url):
    html = fetch_html(movie_url)
    if html:
        soup = BeautifulSoup(html, 'html.parser')

        # Extract movie title
        title_tag = soup.select_one('h1[slot="titleIntro"]')
        title = title_tag.text.strip() if title_tag else 'N/A'

        # Extract synopsis
        synopsis = soup.select_one('div.synopsis-wrap rt-text:not(.key)').text.strip() if soup.select_one('div.synopsis-wrap rt-text:not(.key)') else 'N/A'

        # Get all movie details
        movie_details = {dt.text.strip(): ', '.join([item.text.strip() for item in dd.find_all(['rt-link', 'rt-text']) if item.name != 'rt-text' or 'delimiter' not in item.get('class', [])]) for dt, dd in zip(soup.select('dt.key rt-text'), soup.select('dd'))}

        # Return the collected details as a dictionary
        return {
            'title': title,
            'synopsis': synopsis,
            'movie_details': movie_details
        }
    else:
        print("Failed to fetch movie details.")
        return None

def save_movie_details_to_json(movie_data, filename='movie_details.json'):
    with open(filename, 'w') as file:
        json.dump(movie_data, file, indent=4)
    print(f"Movie details saved to {filename}")

if __name__ == "__main__":
    movie_url = 'https://www.rottentomatoes.com/m/beetlejuice_beetlejuice'
    movie_details = fetch_movie_details(movie_url)
    if movie_details:
        save_movie_details_to_json(movie_details)

Example Output:

{
  "title": "Beetlejuice Beetlejuice",
  "synopsis": "Beetlejuice is back! After an unexpected family tragedy, three generations of the Deetz family return home to Winter River. Still haunted by Beetlejuice, Lydia's life is turned upside down when her rebellious teenage daughter, Astrid, discovers the mysterious model of the town in the attic and the portal to the Afterlife is accidentally opened. With trouble brewing in both realms, it's only a matter of time until someone says Beetlejuice's name three times and the mischievous demon returns to unleash his very own brand of mayhem.",
  "movie_details": {
    "Director": "Tim Burton",
    "Producer": "Marc Toberoff, Dede Gardner, Jeremy Kleiner, Tommy Harper, Tim Burton",
    "Screenwriter": "Alfred Gough, Miles Millar",
    "Distributor": "Warner Bros. Pictures",
    "Production Co": "Tommy Harper, Plan B Entertainment, Marc Toberoff, Tim Burton Productions",
    "Rating": "PG-13 (Macabre and Bloody Images|Brief Drug Use|Some Suggestive Material|Strong Language|Violent Content)",
    "Genre": "Comedy, Fantasy",
    "Original Language": "English",
    "Release Date (Theaters)": "Sep 6, 2024, Wide",
    "Box Office (Gross USA)": "$111.0M",
    "Runtime": "1h 44m",
    "Aspect Ratio": "Flat (1.85:1)"
  }
}
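All values in movie_details are plain strings. If you want to sort or filter on them, you may want to normalize a few fields. The sketch below converts the "Runtime" and "Release Date (Theaters)" values from the output above. Both helper names are assumptions for this example, and the date parsing assumes English month abbreviations (the %b format is locale-dependent).

```python
import re
from datetime import datetime


def runtime_to_minutes(runtime):
    """Convert '1h 44m' to 104. Returns None when the text does not match."""
    match = re.fullmatch(r'\s*(?:(\d+)h)?\s*(?:(\d+)m)?\s*', runtime)
    if not match or not any(match.groups()):
        return None
    hours, minutes = (int(part) if part else 0 for part in match.groups())
    return hours * 60 + minutes


def parse_release_date(text):
    """Convert 'Sep 6, 2024, Wide' to a date object. Returns None on failure."""
    # Keep only the 'Sep 6, 2024' part and drop trailing labels such as 'Wide'.
    date_part = ', '.join(text.split(', ')[:2])
    try:
        return datetime.strptime(date_part, '%b %d, %Y').date()
    except ValueError:
        return None


print(runtime_to_minutes('1h 44m'))               # 104
print(parse_release_date('Sep 6, 2024, Wide'))    # 2024-09-06
```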

Scrape Movie Ratings on Rotten Tomatoes with Crawlbase

Scraping movie ratings from Rotten Tomatoes has many uses. You can collect data to analyze, research or just have fun. It helps you see what’s popular. The Crawlbase Crawling API can deal with dynamic content so you can get the data you need. When you scrape Rotten Tomatoes, you can get public opinions, box office stats or movie details to use in any project you like.

This blog showed you how to scrape movie listings and ratings and get details like release dates, directors, and genres. We used Python, Crawlbase, BeautifulSoup, and JSON to gather and sort the data so you can use and study it.

If you want to do more web scraping, check out our guides on scraping other key websites.

📜 How to Scrape Monster.com
📜 How to Scrape Groupon
📜 How to Scrape TechCrunch
📜 How to Scrape X.com Tweet Pages
📜 How to Scrape Clutch.co

If you have any questions or feedback, our support team can help you with web scraping. Happy scraping!

Frequently Asked Questions

Q. How do I scrape Rotten Tomatoes if the site changes its layout?

If Rotten Tomatoes alters its layout or HTML, your scraper will stop working. To address this:

Keep an eye on the site for any changes.
Examine the new HTML and revise your CSS selectors.
Make changes to your scraper’s code.
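A layout change often shows up as empty fields rather than as an error, so a simple sanity check on the scraped records can act as an early warning. The heuristic below is only a sketch: the function name, the required fields and the 50% threshold are all assumptions you would tune for your own scraper.

```python
def looks_like_layout_change(records, required_fields=('title', 'link'), max_empty_ratio=0.5):
    """Heuristic: True when nothing was scraped or too many records lack required fields."""
    if not records:
        return True
    incomplete = sum(
        1 for record in records
        if any(not record.get(field) for field in required_fields)
    )
    return incomplete / len(records) > max_empty_ratio


good = [{'title': 'Twisters', 'link': 'https://www.rottentomatoes.com/m/twisters'}]
broken = [{'title': '', 'link': ''}, {'title': '', 'link': ''}]
print(looks_like_layout_change(good))    # False
print(looks_like_layout_change(broken))  # True
```

Running a check like this after each scrape tells you when it is time to inspect the page again and update the selectors.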

Q. What should I keep in mind when scraping Rotten Tomatoes?

When you scrape Rotten Tomatoes or similar websites:

Look at the Robots.txt: Make sure the site allows scraping by checking its robots.txt file.
Put throttling to use: Add some time between requests to avoid overwhelming the server and reduce your chances of getting blocked.
Deal with errors: Include ways to handle request failures or changes in how the site is built.
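The throttling and error-handling advice can be combined in one small wrapper. This is a minimal sketch, assuming your fetch function returns None on failure, as the fetch_html function in this post does. The name fetch_with_retries, the retry count and the delays are illustrative choices.

```python
import time


def fetch_with_retries(fetch, url, max_attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until it returns a non-None result or attempts run out.

    The wait doubles after every failed attempt (2s, 4s, ...), which also
    spaces out requests so the server is not overwhelmed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = fetch(url)
        except Exception as error:
            print(f"Attempt {attempt} raised {error!r}")
            result = None
        if result is not None:
            return result
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))
    return None
```

With the scraper from this post, a call could look like fetch_with_retries(fetch_html, url). The sleep argument exists so the delay can be replaced in tests.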

Q. How do I manage pagination with Crawlbase Crawling API while scraping Rotten Tomatoes?

Rotten Tomatoes may use different ways to show more content, like “Load More” buttons or endless scrolling.

For Buttons: Use the css_click_selector parameter in the Crawlbase Crawling API to click the “Load More” button.
For Infinite Scrolling: Use page_wait or ajax_wait parameter in the Crawlbase Crawling API to wait for all content to load before you capture.

Scrape Valuable Data from Forbes

2024-09-17T20:00:00.000Z

Forbes is a business and financial news site with great information on industries, companies, and people around the world. Forbes gets millions of visits every month. They have billionaire rankings, business trends, and analysis. Forbes uses JavaScript to load their content dynamically so it’s a bit tricky to scrape with traditional tools.

This tutorial will show you how to scrape Forbes data using Puppeteer, a headless browser. Once you get the basics down, we’ll cover how to use the Crawlbase Crawling API to optimize your data extraction. With these tools, you can collect Forbes data for research, analysis, or personal projects.

Why Scrape Data From Forbes?
Key Data Points to Scrape from Forbes
Setting Up Your Scraping Environment

Installing Puppeteer
Setting Up Your Project
Installing Required Libraries

Scraping Forbes with Puppeteer

Inspecting the HTML Structure
Writing the Puppeteer Scraper
Storing Data in a JSON File

Optimize Forbes Scraping with Crawlbase Crawling API

Introduction to Crawlbase Crawling API
How to Use Crawlbase with Forbes
Code Example with Crawlbase

Final Thoughts
Frequently Asked Questions

Why Scrape Data from Forbes?

There is no denying that Forbes has a wealth of business, financial, and lifestyle information. Scraping Forbes data lets you follow topics ranging from current business trends to the wealth of the world’s billionaires. Here are some key reasons to scrape data from Forbes:

Billionaire Rankings: Forbes is well known for its global billionaire rankings. This data can be scraped to see how wealth has evolved over time.
Company Information: Forbes publishes detailed company profiles, which are useful for looking at how a business is doing.
Industry Insights: Forbes provides articles on various sectors including technology, finance, healthcare and more. Scrape this data to follow specific industries and trends.
Financial News: Forbes publishes real-time news and updates on the world economy and markets. Scrape this data to keep track of significant financial events.

Key Data Points to Scrape from Forbes

While scraping Forbes, you may want to extract many data points. Some of the essential ones are:

Billionaire Profiles: Forbes provides in-depth biographies of the wealthiest individuals on the planet. These profiles contain wealth source, industry, net worth, and country of origin.
Company Profiles: Forbes provides comprehensive data about businesses, such as revenue, headcount, and sector. Use this data to compare businesses or keep an eye on particular industries over time.
Top Lists: Forbes is well-known for its “Top” lists, which include the top 100 billionaires, the top multinational corporations, and the top startups.
Articles and News: Forbes features breaking news and in-depth articles on business, finance, and lifestyle. To keep up with the most recent news, trends, and expert opinions from the sector, scrape Forbes articles.
Market Data: Financial information such as stock prices, market trends, and economic projections are available on the website. To keep track of the financial markets and gain real-time insights, scrape Forbes market data.

Setting Up Your Scraping Environment

To scrape Forbes data, we need to set up the project environment. This means installing Node.js, Puppeteer, and other required libraries. Follow the steps below.

Installing Puppeteer

Puppeteer is a Node.js library that provides a high-level API to control Chrome or Chromium, perfect for scraping dynamic content like Forbes. To install Puppeteer, follow these steps:

Make sure Node.js is installed on your system. You can download it from the official Node.js website.
Once you have Node.js, open your terminal and run the following command to install Puppeteer:

npm install puppeteer

This command will install Puppeteer along with Chromium, which Puppeteer uses to run a headless browser for scraping websites.

Setting Up Your Project

Puppeteer is installed. Now set up your project folder and initialize Node.js. Follow these steps:

Create a new directory for your project:

mkdir forbes-scraper
cd forbes-scraper

Initialize a new Node.js project by running the following command:

npm init -y

This command will create a package.json file, which manages your project dependencies.

This completes the setup for your Forbes scraping environment. Next, we’ll dive into writing the Puppeteer scraper.

Scraping Forbes with Puppeteer

Now that we have our environment set up, we’ll start scraping Forbes with Puppeteer. In this section, we’ll inspect the HTML, write the scraper, handle dynamic content, and store the scraped data in a JSON file. For this example, we’ll be scraping the Forbes World’s Billionaires List 2024.

Inspecting the HTML Structure

Before we write the scraper, let’s inspect the Forbes website’s HTML. This will help us identify the key elements that contain the data.

Inspecting the Billionaires List Page

Visit the Page: Go to the Forbes World’s Billionaires List.
Open Developer Tools: Right-click anywhere on the page and select “Inspect” or press Ctrl+Shift+I to open Developer Tools.

Look for Key Elements:

Billionaire Names/Links: Typically contained in <a> tags with classes like color-link. This is where you get the link to each billionaire’s profile.

Scraping Each Billionaire’s Profile

Navigate to a Profile: Click on a link from the list to open the billionaire’s profile page.
Open Developer Tools: Right-click anywhere on the page and select “Inspect” or press Ctrl+Shift+I to open Developer Tools.

Key Elements to Look For:

Rank: Look for the rank, typically inside an element with a class like listuser-item__list--rank.
Name: Usually inside a header tag, like <h1> with a class like listuser-header__name.
Organization: Found in an <a> element (or a similar element) with organization-related classes such as listuser-header__organization.
Net Worth: Typically inside a <div> with classes like profile-info__item-value.

Biography: Often found inside an unordered list (<ul>).

Additional Data: Titles and texts could be found in elements with classes like profile-stats__title and profile-stats__text.

Writing the Puppeteer Scraper

Now, we can write the Puppeteer scraper. The following code demonstrates how to launch Puppeteer, open the Forbes page, and scrape key data points.

Example Code:

const puppeteer = require('puppeteer');
const fs = require('fs');

async function scrapeBillionaires() {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Go to Forbes Billionaires list
  await page.goto(
    'https://www.forbes.com/sites/chasewithorn/2024/04/02/forbes-worlds-billionaires-list-2024-the-top-200/?sh=67b3016430a7',
    {
      timeout: 0,
    },
  );

  const links = await page.$$eval('a.color-link', (links) => links.slice(2).map((link) => link.href));
  const billionaireList = [];

  for (let link of links) {
    try {
      await page.goto(link, { timeout: 0 });

      // Get rank
      const rank = await page.$eval('.listuser-item__list--rank', (el) => el.innerText.trim()).catch(() => 'N/A');

      // Get name
      const name = await page.$eval('h1.listuser-header__name', (el) => el.innerText.trim()).catch(() => 'N/A');

      // Get title
      const title = await page
        .$eval('div.listuser-header__headline-default', (el) => el.innerText.trim())
        .catch(() => 'N/A');

      // Get organization
      const organization = await page
        .$eval('a.listuser-header__organization', (el) => el.innerText.trim())
        .catch(() => 'N/A');

      // Get net worth
      const netWorth = await page.$eval('div.profile-info__item-value', (el) => el.innerText.trim()).catch(() => 'N/A');

      // Get biography text
      const bio = await page.$eval('ul', (el) => el.innerText.trim()).catch(() => 'N/A');

      // Get additional stack data
      const stackData = await page.evaluate(() => {
        let data = {};
        const titles = Array.from(document.querySelectorAll('.profile-stats__title'));
        const texts = Array.from(document.querySelectorAll('.profile-stats__text'));
        titles.forEach((title, i) => (data[title.innerText.trim()] = texts[i].innerText.trim()));
        return data;
      });

      // Push data to billionaireList
      billionaireList.push({
        Rank: rank,
        Name: name,
        Title: title,
        Organization: organization,
        NetWorth: netWorth,
        Stack: stackData,
        Bio: bio,
      });
    } catch (err) {
      console.log(`Error scraping ${link}: ${err}`);
    }
  }

  await browser.close();
  return billionaireList;
}

scrapeBillionaires().then((data) => {
  console.log(data); // Output data to console
});

Storing Data in a JSON File

Once the data is scraped, we need to save it in a structured format like JSON for later use.

Example Code:

async function saveDataToFile(data, filename = 'forbes_billionaires.json') {
  fs.writeFileSync(filename, JSON.stringify(data, null, 2), 'utf-8');
  console.log(`Data saved to ${filename}`);
}

scrapeBillionaires().then((data) => {
  saveDataToFile(data);
});

This will store all the scraped billionaire profiles in a forbes_billionaires.json file, making the data easy to access and use in the future.

Complete Code Example

Here’s the complete code that combines all the steps:

const puppeteer = require('puppeteer');
const fs = require('fs');

async function scrapeBillionaires() {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Go to Forbes Billionaires list
  await page.goto(
    'https://www.forbes.com/sites/chasewithorn/2024/04/02/forbes-worlds-billionaires-list-2024-the-top-200/?sh=67b3016430a7',
    {
      timeout: 0,
    },
  );

  const links = await page.$$eval('a.color-link', (links) => links.slice(2).map((link) => link.href));
  const billionaireList = [];

  for (let link of links) {
    try {
      await page.goto(link, { timeout: 0 });

      // Get rank
      const rank = await page.$eval('.listuser-item__list--rank', (el) => el.innerText.trim()).catch(() => 'N/A');

      // Get name
      const name = await page.$eval('h1.listuser-header__name', (el) => el.innerText.trim()).catch(() => 'N/A');

      // Get title
      const title = await page
        .$eval('div.listuser-header__headline-default', (el) => el.innerText.trim())
        .catch(() => 'N/A');

      // Get organization
      const organization = await page
        .$eval('a.listuser-header__organization', (el) => el.innerText.trim())
        .catch(() => 'N/A');

      // Get net worth
      const netWorth = await page.$eval('div.profile-info__item-value', (el) => el.innerText.trim()).catch(() => 'N/A');

      // Get biography text
      const bio = await page.$eval('ul', (el) => el.innerText.trim()).catch(() => 'N/A');

      // Get additional stack data
      const stackData = await page.evaluate(() => {
        let data = {};
        const titles = Array.from(document.querySelectorAll('.profile-stats__title'));
        const texts = Array.from(document.querySelectorAll('.profile-stats__text'));
        titles.forEach((title, i) => (data[title.innerText.trim()] = texts[i].innerText.trim()));
        return data;
      });

      // Push data to billionaireList
      billionaireList.push({
        Rank: rank,
        Name: name,
        Title: title,
        Organization: organization,
        NetWorth: netWorth,
        Stack: stackData,
        Bio: bio,
      });
    } catch (err) {
      console.log(`Error scraping ${link}: ${err}`);
    }
  }

  await browser.close();
  return billionaireList;
}

async function saveDataToFile(data, filename = 'forbes_billionaires.json') {
  fs.writeFileSync(filename, JSON.stringify(data, null, 2), 'utf-8');
  console.log(`Data saved to ${filename}`);
}

scrapeBillionaires().then((data) => {
  saveDataToFile(data);
});

Example Output:

[
  {
    "Rank":"#1",
    "Name":"Bernard Arnault & family",
    "Title":"Chairman And CEO, LVMH Moët Hennessy Louis Vuitton",
    "Organization":"LVMH Moët Hennessy Louis Vuitton",
    "NetWorth":"$219.2B",
    "Stack":{
      "Age":"75",
      "Source of Wealth":"LVMH",
      "Residence":"Paris, France",
      "Citizenship":"France",
      "Marital Status":"Married",
      "Children":"5",
      "Education":"Bachelor of Arts/Science, Ecole Polytechnique de Paris"
    },
    "Bio":"Bernard Arnault oversees the LVMH empire of 75 fashion and cosmetics brands, including Louis Vuitton and Sephora.\nLVMH acquired American jeweler Tiffany & Co in 2021 for $15.8 billion, believed to be the biggest luxury brand acquisition ever.\nArnault's holding company Agache backs venture capital firm Aglaé Ventures, which has investments in businesses such as Netflix and TikTok parent company ByteDance.\nHis father made a small fortune in construction; Arnault got his start by putting up $15 million from that business to buy Christian Dior in 1984.\nArnault's five children all work at LVMH; in July 2022, he proposed a reorganization of his holding company Agache to give them equal stakes."
  },
  {
    "Rank":"#2",
    "Name":"Elon Musk",
    "Title":"CEO, Tesla",
    "Organization":"Tesla",
    "NetWorth":"$189.2B",
    "Stack":{
      "Age":"52",
      "Source of Wealth":"Tesla, SpaceX, Self Made",
      "Self-Made Score":"8",
      "Philanthropy Score":"1",
      "Residence":"Austin, Texas",
      "Citizenship":"United States",
      "Marital Status":"Single",
      "Children":"11",
      "Education":"Bachelor of Arts/Science, University of Pennsylvania"
    },
    "Bio":"Elon Musk cofounded six companies, including electric car maker Tesla, rocket producer SpaceX and tunneling startup Boring Company.\nHe owns about 12% of Tesla excluding options, but has pledged more than half his shares as collateral for personal loans of up to $3.5 billion.\nIn early 2024, a Delaware judge voided Musk's 2018 deal to receive options equaling an additional 9% of Tesla. Forbes has discounted the options by 50% pending Musk's appeal.\nSpaceX, founded in 2002, is worth nearly $180 billion after a December 2023 tender offer of up to $750 million; SpaceX stock has quintupled its value in four years.\nMusk bought Twitter in 2022 for $44 billion, after later trying to back out of the deal. He owns an estimated 74% of the company, now called X.\nForbes estimates that Musk's stake in X is now worth nearly 70% less than he paid for it based on investor Fidelity's valuation of the company as of December 2023."
  },
  {
    "Rank":"#3",
    "Name":"Jeff Bezos",
    "Title":"Chairman And Founder, Amazon",
    "Organization":"Amazon",
    "NetWorth":"$202.4B",
    "Stack":{
      "Age":"60",
      "Source of Wealth":"Amazon, Self Made",
      "Self-Made Score":"8",
      "Philanthropy Score":"2",
      "Residence":"Miami, Florida",
      "Citizenship":"United States",
      "Marital Status":"Engaged",
      "Children":"4",
      "Education":"Bachelor of Arts/Science, Princeton University"
    },
    "Bio":"Jeff Bezos founded e-commerce giant Amazon in 1994 out of his Seattle garage.\nBezos stepped down as CEO to become executive chairman in 2021. He owns a bit less than 10% of the company.\nHe and his wife MacKenzie divorced in 2019 after 25 years of marriage and he transferred a quarter of his then-16% Amazon stake to her.\nBezos donated more than $1.1 million worth of stock to nonprofits in 2023, though it's unclear which organizations received those shares\nHe owns The Washington Post and Blue Origin, an aerospace company developing rockets; he briefly flew to space in one in July 2021.\nBezos said in a November 2022 interview with CNN that he plans to give away the majority of his wealth in his lifetime, without disclosing specific details."
  },
  {
    "Rank":"#4",
    "Name":"Mark Zuckerberg",
    "Title":"Cofounder, Meta Platforms",
    "Organization":"Meta Platforms",
    "NetWorth":"$184.3B",
    "Stack":{
      "Age":"39",
      "Source of Wealth":"Facebook, Self Made",
      "Self-Made Score":"8",
      "Philanthropy Score":"2",
      "Residence":"Palo Alto, California",
      "Citizenship":"United States",
      "Marital Status":"Married",
      "Children":"3",
      "Education":"Drop Out, Harvard University"
    },
    "Bio":"A 19-year-old Mark Zuckerberg started Facebook in 2004 for students to match names with photos of classmates.\nZuckerberg took Facebook public in 2012; he now owns about 13% of the company's stock.\nFacebook changed its name to Meta in 2021 to shift the company's focus to the metaverse.\nIn 2015, Zuckerberg and his wife, Priscilla Chan, pledged to give away 99% of their Meta stake over their lifetimes."
  },
  .... more
]

In the next section, we’ll discuss how to optimize Forbes scraping using Crawlbase Crawling API.

Optimize Forbes Scraping with Crawlbase Crawling API

Puppeteer is great for scraping dynamic websites, but it can be slow when dealing with large volumes of data or JavaScript-heavy pages like Forbes. To improve performance, we can use the Crawlbase Crawling API, which simplifies handling JavaScript-rendered content and gives you more control and efficiency.

Introduction to Crawlbase Crawling API

Crawlbase Crawling API handles common web scraping challenges like CAPTCHAs, dynamic content loading, and complex HTML structures. For scraping Forbes, Crawlbase offers a streamlined solution by handling JavaScript rendering directly, making it a more efficient alternative to Puppeteer for large scraping projects.

Why use Crawlbase for Forbes scraping?

Handles dynamic content: Optimized for JavaScript-heavy pages like Forbes.
Improved speed and scalability: No need to run headless browsers yourself, so scraping is faster.
Simplifies the process: Easy API calls to scrape data, with built-in handling of CAPTCHAs and anti-scraping mechanisms.

How to Use Crawlbase with Forbes

To scrape Forbes using Crawlbase, you need to sign up and get your API token. Here’s how to get started:

Sign up for Crawlbase: Create an account on Crawlbase and get your API token. You need the JS token for Forbes.
Install Crawlbase Library: In your Node.js environment, install the Crawlbase Crawling API library, along with cheerio, which the example below uses for HTML parsing:

npm install crawlbase cheerio

Set up your request: Initialize the Crawlbase API with your token and make GET requests to scrape Forbes data.

Code Example with Crawlbase

Here’s a code example using the Crawlbase JavaScript library to scrape Forbes data more efficiently:

Example Code:

const { CrawlingAPI } = require('crawlbase');
const cheerio = require('cheerio');
const fs = require('fs');

// Initialize Crawlbase API with your access token
const api = new CrawlingAPI({ token: 'CRAWLBASE_JS_TOKEN' });

async function fetchForbesHTML(url) {
  try {
    const response = await api.get(url, {
      ajax_wait: 'true', // Wait for AJAX requests to complete
      page_wait: '5000',
    });

    if (response.statusCode === 200) {
      return response.body;
    } else {
      console.log(`Failed to fetch data. Status code: ${response.statusCode}`);
      return null;
    }
  } catch (error) {
    console.error(`Error fetching data: ${error}`);
    return null;
  }
}

async function parseForbesData(html) {
  const $ = cheerio.load(html);

  // Collect profile links first: cheerio's .each() does not wait for async callbacks,
  // so awaiting inside it would return an empty list
  const links = $('.color-link')
    .slice(2)
    .map((i, el) => $(el).attr('href'))
    .get();

  const billionaireList = [];
  for (const link of links) {
    try {
      const detailPageHtml = await fetchForbesHTML(link);
      if (!detailPageHtml) continue; // Skip profiles that could not be fetched
      const $page = cheerio.load(detailPageHtml);

      const rank = $page('.listuser-item__list--rank').text().trim() || 'N/A';
      const name = $page('h1.listuser-header__name').text().trim() || 'N/A';
      const title = $page('div.listuser-header__headline-default').text().trim() || 'N/A';
      const organization = $page('a.listuser-header__organization').text().trim() || 'N/A';
      const netWorth = $page('div.profile-info__item-value').text().trim() || 'N/A';
      const bio = $page('ul').text().trim() || 'N/A';
      const stackData = {};

      // Pair each stat title with the text at the same position
      $page('.profile-stats__title').each((i, el) => {
        const statTitle = $page(el).text().trim();
        const statText = $page('.profile-stats__text').eq(i).text().trim();
        stackData[statTitle] = statText;
      });

      billionaireList.push({
        Rank: rank,
        Name: name,
        Title: title,
        Organization: organization,
        NetWorth: netWorth,
        Stack: stackData,
        Bio: bio,
      });
    } catch (err) {
      console.error(`Error parsing data for ${link}: ${err}`);
    }
  }

  return billionaireList;
}

async function saveToFile(data, filename = 'forbes_billionaires.json') {
  fs.writeFileSync(filename, JSON.stringify(data, null, 2), 'utf-8');
  console.log(`Data saved to ${filename}`);
}

(async function () {
  const url =
    'https://www.forbes.com/sites/chasewithorn/2024/04/02/forbes-worlds-billionaires-list-2024-the-top-200/?sh=67b3016430a7';
  const html = await fetchForbesHTML(url);
  if (html) {
    const data = await parseForbesData(html);
    await saveToFile(data);
  }
})();

Explanation of the Code:

Initialize Crawlbase: CrawlingAPI is initialized with your Crawlbase token to access the API for scraping.
Get request: We use api.get() to scrape the Forbes URL. We use ajax_wait and page_wait to make sure all dynamic content loads.
HTML Parsing: We use cheerio to parse the HTML and extract key data points.
Data Storage: The extracted data is saved to a JSON file.

This way, scraping Forbes is more efficient because Crawlbase handles JavaScript rendering and complex content structures.

Optimize Forbes Scraping with Crawlbase

Whether you’re analyzing business trends, financial news or top company rankings, scraping data from Forbes can be very useful. While tools like Puppeteer are great for handling JavaScript-rendered pages, they are time-consuming and resource-heavy. Using Crawlbase Crawling API simplifies the process and makes scraping dynamic content faster.

Follow this guide to scrape Forbes data and scale your projects with Crawlbase. This method is a reliable and optimized way to scrape websites like Forbes. If you’re looking to expand your web scraping capabilities, consider exploring the following guides on scraping other important websites.

📜 How to Scrape Monster.com
📜 How to Scrape Groupon
📜 How to Scrape TechCrunch
📜 How to Scrape X.com Tweet Pages
📜 How to Scrape Clutch.co

If you have any questions or feedback, our support team is always available to assist you on your web scraping journey. Happy scraping!

Frequently Asked Questions

Q. Is scraping Forbes legal?

Scraping any website, including Forbes, should be done in compliance with their terms of service. Always check the website’s robots.txt file and ensure you are not violating any terms regarding data extraction. Using APIs like Crawlbase helps you scrape efficiently while adhering to best practices.

Q. Why should I use Crawlbase Crawling API instead of Puppeteer for scraping Forbes?

While Puppeteer is a powerful tool for handling JavaScript rendering, it can be slow and resource-intensive. Crawlbase Crawling API simplifies the process by offering pre-configured options for handling dynamic content, which speeds up scraping and reduces the effort needed to manage browser sessions manually.

Q. How can I handle dynamic content on Forbes when scraping?

Forbes uses JavaScript to load much of its content dynamically. Using Puppeteer or Crawlbase Crawling API with options like ajax_wait and page_wait, you can ensure the content is fully loaded before scraping. This ensures you capture all relevant data from the page.

What is Browser Fingerprinting?

2024-09-13T07:39:50.000Z

Your online privacy faces ongoing scrutiny. Browser fingerprints serve as one of the most subtle yet powerful tools to track your online activities. This unique identifier goes beyond cookies, allowing websites and advertisers to recognize your device across different browsing sessions. As you browse the web, your browser leaves a trail of information that can be used to create a distinct profile of your online behavior.

It’s essential to understand browser fingerprinting to protect your privacy and manage your online presence. This article will explore the mechanics of browser fingerprints, examining how they work and their applications in various industries. We’ll also look at the impact of browser fingerprinting on web scraping activities and discuss ways to lessen its effects. By the end, you’ll have a thorough understanding of this technology and how it affects your online experiences.

What is Browser Fingerprinting?

Browser fingerprinting is a technique for spotting and following your device as you browse the web. It covers a set of tools and methods that gather data from your browser and device to build a unique ID, or “digital fingerprint”. Unlike regular cookies, this ID stays the same, which makes it a dependable way to know who’s visiting a site.

Browser fingerprinting gathers lots of info about your device and browser setup. Here’s what it looks like:

What browser you use, and which version
Your operating system and its version
How big your screen is and how many colors it shows
What fonts and plugins you have
Where you are in the world and what language you speak
If you block ads
Your IP address
What your browser tells websites about itself
Details about your device (like if it has a touchscreen)
All the fonts and file types your computer can handle
Data from Flash and Silverlight

Scripts running behind the scenes in your browser put all this together. They check out your software and hardware setup without changing anything or getting in your way.

Uniqueness of Fingerprints

The resulting “fingerprint” is a one-of-a-kind mix of these features, creating a distinct profile. Although many people use the same type of device, each user’s setup differs, and with so many factors in play it is hard to remain anonymous. In fact, device fingerprinting can identify users with 90 to 99% accuracy.

This uniqueness allows websites and advertisers to recognize your device across different browsing sessions, tracing your online activities. While this technology has legitimate uses, like preventing fraud and verifying users, it also raises big privacy concerns as it can track your online behavior without your clear permission.

How Browser Fingerprinting Works

Browser fingerprinting identifies and tracks your device across different browsing sessions without cookies. This method gathers and examines various data points from your web browser and device to create a unique identifier.

JavaScript and API Usage

Scripts run in the background of your browser, checking your software and hardware setup without interrupting your browsing. These scripts gather details like your browser type and version, operating system, screen resolution, color depth, installed fonts and plugins, time zone, language settings, and even your use of ad blockers.

The data gathered gets merged into one identifier, which stays the same in both regular and private browsing modes. This identifier doesn’t change and doesn’t need cookies or explicit user consent.
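To make the merging step concrete, here is a minimal Python sketch of how collected attributes could be combined into one identifier. The attribute names and values are made up for illustration, real fingerprinting scripts run in the browser and use many more signals, and hashing a sorted JSON dump with SHA-256 is only one plausible way to combine them.

```python
import hashlib
import json


def fingerprint(attributes):
    """Combine device attributes into one stable identifier (illustrative only)."""
    # Sort keys so the same attributes always serialize to the same string
    canonical = json.dumps(attributes, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical attribute values, not taken from a real device
device = {
    "browser": "Chrome 128",
    "os": "Windows 11",
    "screen": "1920x1080",
    "color_depth": 24,
    "timezone": "Europe/Paris",
    "language": "fr-FR",
}

first_visit = fingerprint(device)
# Same data collected in a different order on a later visit
second_visit = fingerprint(dict(reversed(list(device.items()))))
# One attribute changed
changed = fingerprint({**device, "language": "en-US"})

print(first_visit == second_visit)  # True: same setup, same identifier
print(first_visit == changed)       # False: one changed attribute, new identifier
```

Because nothing here depends on cookies or stored state, the same setup produces the same identifier on every visit.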

Canvas Fingerprinting

Canvas fingerprinting uses the HTML5 canvas element to spot unique features of your device. Here’s how it works:

The script draws complex shapes, text, or other graphics on an invisible canvas.
Your device’s specific mix of hardware and software affects how it shows these elements.
The script captures the image data pixel by pixel and creates a hash value or digital signature.
Even small changes in pixel output lead to a different hash, resulting in a unique fingerprint.

This technique works well because it takes advantage of differences in font rendering, anti-aliasing, and graphics processing across various devices.
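The hashing step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pixel values are simulated rather than read from a real canvas, and SHA-256 stands in for whatever hash a fingerprinting script actually uses.

```python
import hashlib


def canvas_hash(pixels):
    """Hash raw pixel bytes, the way a script might hash canvas image data."""
    return hashlib.sha256(bytes(pixels)).hexdigest()


# Simulated RGBA data for a 4x4 canvas (64 bytes); a real script reads this from the canvas
device_a = [(i * 7) % 256 for i in range(64)]

# A second device renders the same drawing with one channel of one pixel off by 1,
# standing in for a small anti-aliasing difference
device_b = list(device_a)
device_b[10] = (device_b[10] + 1) % 256

print(canvas_hash(device_a) == canvas_hash(device_a))  # True: same rendering, same hash
print(canvas_hash(device_a) == canvas_hash(device_b))  # False: tiny difference, new hash
```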

Audio Fingerprinting

Audio fingerprinting uses the Web Audio API to create a unique identifier based on how your device handles audio. The process includes:

Making an AudioContext instance with specific settings.
Creating a sound source with an oscillator.
Using a compressor to change the original signal.
Handling the audio snippet and figuring out a single value from the resulting array.

This method has value because it’s one-of-a-kind and consistent. It gets these qualities from the Web Audio API’s inner workings and the math behind how it makes sound.

How Browser Fingerprinting Gets Used

Keeping tabs on users and crunching numbers

Browser fingerprinting is a strong way to track users and do analytics. Websites gather info about your device’s hardware and software setup to make a unique ID for your browser. This lets them follow your online actions across different sessions, even without regular cookies. Browser fingerprinting can spot users with up to 99.5% accuracy, giving useful insights into how you use websites. This info helps companies make their sites better, boost user experience, and make smart choices about their online plans. For example, marketing folks can use this data to customize content and deals based on your web habits and likes.

Fraud prevention

Browser fingerprinting has a major impact on fraud prevention. Websites can spot fishy activities and block unwanted access by recognizing your device’s unique features. This is key for banks and online stores. Browser fingerprinting helps to:

Spot attempts to hijack accounts
Stop people from making lots of fake accounts
Find possible threats that want to grab your private details
Cut down on refunds linked to online payment scams

Personalized content delivery

Browser fingerprinting allows websites to give you custom content without making your experience more complicated. By grasping what you like based on how you browse, websites can:

Adjust the content shown on their site in real-time
Give more useful suggestions
Boost user involvement and sales

This tailoring also applies to ads, enabling more focused and successful advertising efforts.

How Does Browser Fingerprinting Affect Web Scraping

Browser fingerprinting plays a big role in web scraping. When you scrape websites, you’ll notice that anti-bot systems use fingerprinting methods to spot and stop automated scrapers. These systems check the hardware and software setup of your scraping tools, matching them against a list of human-like configurations.

When your scraper tries to access a website, it sends a unique mix of data points. These include HTTP headers, TLS version, and details from JavaScript execution. This combo creates a digital fingerprint that websites use to spot and keep tabs on your scraping activities. Even if you switch your IP address or wipe your cookies, the fingerprint stays the same. This makes it tough to hide your scraper’s identity.

To show how this affects things, think about accessing a Cloudflare-protected site from a virtual machine. You’ll run into extra problems as Cloudflare sees that the traffic comes from a data center instead of a normal user’s setup. This sets off alarm bells and kicks in anti-bot measures.

To get past these roadblocks, you have a few choices:

Use scraping APIs that handle fingerprint management
Use anti-detect browsers or AI-powered browsers to change fingerprints
Use headless browsers or HTTP request libraries to build custom fingerprints

When you’re making custom fingerprints, it’s key to keep everything matching up. For instance, the browser versions need to work with the OS you’ve picked, and you should pair certain plugins with specific browsers. Pretending to be a mobile device can work well, as there’s less variety in plugins and fonts, which means a smaller fingerprint.
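A simple sanity check can catch mismatched pieces before you send any requests. The Python sketch below is a rough illustration: the profile fields, the is_consistent helper, and the marker-to-platform table are assumptions made for this example, and a real setup would need to check many more attributes.

```python
# Assumed mapping from an OS marker in the User-Agent string to the platform value
# a matching desktop browser would report; extend it for the setups you emulate
EXPECTED_PLATFORM = {
    "Windows NT": "Win32",
    "Macintosh": "MacIntel",
    "X11; Linux": "Linux x86_64",
}


def is_consistent(profile):
    """Return True if the declared platform matches the OS named in the User-Agent."""
    for marker, platform in EXPECTED_PLATFORM.items():
        if marker in profile["user_agent"]:
            return profile["platform"] == platform
    return False  # Unknown OS marker: treat the profile as inconsistent


windows_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
)

good_profile = {"user_agent": windows_ua, "platform": "Win32"}
bad_profile = {"user_agent": windows_ua, "platform": "MacIntel"}

print(is_consistent(good_profile))  # True
print(is_consistent(bad_profile))   # False: Windows User-Agent with a Mac platform
```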

Final Thoughts

Browser fingerprinting has become a key player in the online world, and it has a big effect on our privacy and security on the internet. This tech has an influence on many parts of what we do online, from getting personalized content to stopping fraud. Its knack for spotting users with great accuracy has caused a revolution in how we track users and do analytics, giving companies useful info about how people act and what they like online.

As we deal with the challenges of the digital world, we need to understand how browser fingerprinting works and what it can lead to. It has some good points for security and user experience, but it also brings up big questions about privacy and consent. Going forward, you will need to find a reliable proxy provider, especially if you are scraping other websites. Crawlbase provides a suite of products that helps you scrape data without hassles. Our products help shape an online world that respects user privacy in a secure manner.

Frequently Asked Questions

What is browser fingerprinting, and how does it function?

Browser fingerprinting is the act of gathering data from a user’s browser settings and software details as they surf the web. This info helps to create a unique ID or “fingerprint” for the user.

Can you explain how cross-browser fingerprinting is conducted?

Cross-browser fingerprinting collects data points such as browser type and version, language, and local databases across several browsers. It zeroes in on info that stays the same across different systems to identify users.

How accurate is browser fingerprinting in identifying users?

Browser fingerprinting has a strong track record. It can stop fraud, spam, and account takeovers with up to 99.5% success on web and mobile platforms.

What is WebGL fingerprinting, and how is it implemented?

WebGL fingerprinting makes use of the WebGL API to check how a device’s graphics hardware renders and what it can do. This unique marker helps to track users as they move between different websites and sessions.

How to Read JSON Files in Python

2024-09-10T13:11:04.000Z

JSON (JavaScript Object Notation) is a popular format to store and exchange data. It’s used in web applications, APIs, and databases because of its lightweight and human-readable format. JSON structures data in key-value pairs, so it’s easy to work with across different platforms.

In Python, working with JSON files is easy because of Python’s built-in libraries, such as json, which allows easy reading, parsing, and processing of JSON data. Whether you’re working with local files or fetching data from a web service, Python has tools to handle JSON data nicely.

This article will cover how to read JSON files in Python, load and parse JSON data, and work with its structures. You’ll also learn how to modify and write JSON data. Let’s get started!

Here’s a simple tutorial on how to read JSON files in Python:

What are JSON Files?
Loading JSON Data in Python

Using json.load() to Read JSON from a File
Reading JSON from a String with json.loads()

Common Operations with JSON in Python

Accessing Nested Data
Modifying JSON Data
Writing JSON Data to a File with json.dump()

Error Handling When Reading JSON Files
Final Thoughts
Frequently Asked Questions (FAQs)

What are JSON Files?

A JSON file (JavaScript Object Notation file) is a text file used to store and exchange data. It stores data in a structured and readable way, using key-value pairs, where each key is associated with a value. This makes it easy for humans and machines to read and write data. JSON is used in web applications, APIs, and configurations because it’s lightweight and easy to use.

Here’s an example of a simple JSON file:

{
  "name": "Jane Smith",
  "age": 28,
  "isDeveloper": true,
  "languages": ["Python", "JavaScript", "SQL"]
}

In this example, the file contains information about a person named “Jane Smith” who is 28 and a developer. The languages key holds an array of programming languages she knows.

You can read, write and save JSON files out of the box with Python. Let’s see how.

Loading JSON Data in Python

Loading JSON in Python is easy, thanks to the built-in json module. Whether the JSON is in a file or a string, Python has methods to load and parse it. In this section, we will cover two ways to load JSON: from a file and from a string.

Using `json.load()` to Read JSON from a File

When working with a JSON file in Python, you can use json.load() to load the data directly from the file. This method reads the file and parses it into a Python object (usually a dictionary).

Here’s an example of how to read a JSON file:

import json

# Open the JSON file and load its content
with open('data.json', 'r') as file:
    data = json.load(file)

# Print the parsed data
print(data)

In this code, we use the open() function to open the file in read mode ('r') and then pass the file object to json.load() to read and parse the JSON into a Python dictionary. Then, you can access the data using the keys.

For example, if your data.json file contains the following:

{
  "name": "Alice",
  "age": 25,
  "isEmployed": true
}

The output will be:

{'name': 'Alice', 'age': 25, 'isEmployed': True}

Reading JSON from a String with `json.loads()`

If the JSON is a string, you can use json.loads() to parse it. This is useful when working with JSON data retrieved from an API or other external source.

Here’s an example:

import json

# JSON string
json_string = '{"product": "Laptop", "price": 999, "inStock": true}'

# Parse the string into a Python dictionary
data = json.loads(json_string)

# Print the parsed data
print(data)

In this example, json.loads() takes the JSON string and turns it into a Python dictionary. You can then access the data as if it was from a file.

For example, the output will be:

{'product': 'Laptop', 'price': 999, 'inStock': True}
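The parsed values are ordinary Python objects, which is why true shows up as True in the output. As a small sketch, the snippet below prints the Python type that json.loads() produces for each value; the tags and discount keys are extra fields added here only to show how arrays and null are converted.

```python
import json

# "tags" and "discount" are extra fields added for this illustration
json_string = (
    '{"product": "Laptop", "price": 999, "inStock": true, '
    '"tags": ["office", "sale"], "discount": null}'
)
data = json.loads(json_string)

# Show each value together with the Python type it was converted to
for key, value in data.items():
    print(f"{key}: {value!r} ({type(value).__name__})")
```

JSON objects become dictionaries, arrays become lists, strings become str, whole numbers become int, true and false become True and False, and null becomes None.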

Next up we’ll cover the common operations you can do on JSON in Python.

Common Operations with JSON in Python

Once you have loaded JSON data into Python, you can do various things with it, like access nested data, modify it, and save the changes back to a file. Let’s go through these one by one with simple examples.

Accessing Nested Data

JSON data often has nested structures like dictionaries within dictionaries or lists within lists. Accessing this nested data in Python is easy using key-value access or list indexing.

Example:

import json

# Sample JSON data
json_data = '''
{
    "name": "Alice",
    "age": 25,
    "address": {
        "city": "New York",
        "zipcode": "10001"
    },
    "skills": ["Python", "Machine Learning"]
}
'''

# Parse the JSON data
data = json.loads(json_data)

# Accessing nested data
city = data['address']['city']
skill = data['skills'][0]

print(f"City: {city}")
print(f"First Skill: {skill}")

Here:

We first load the JSON string using json.loads().
We access nested data, such as the city inside the address dictionary and the first skill from the skills list.

Modifying JSON Data

You can easily modify JSON data in Python after loading it. Modifications can include updating values, adding new data, or removing existing keys.

Example:

import json

# Sample JSON data
json_data = '''
{
    "name": "Alice",
    "age": 25,
    "address": {
        "city": "New York",
        "zipcode": "10001"
    }
}
'''

# Parse the JSON data
data = json.loads(json_data)

# Modify the age and add a new skill
data['age'] = 26
data['skills'] = ["Python", "Data Science"]

# Print the modified data
print(json.dumps(data, indent=4))

In this example:

We modify the age value from 25 to 26.
We add a new key skills with an array of values.
The json.dumps() function is used to print the modified JSON data in a readable format with indentation.

Writing JSON Data to a File with `json.dump()`

After modifying JSON data, you might want to save it back to a file. The json.dump() function helps you write the data back to a file in JSON format.

Example:

import json

# Sample data to write to file
data = {
    "name": "Alice",
    "age": 26,
    "address": {
        "city": "New York",
        "zipcode": "10001"
    },
    "skills": ["Python", "Data Science"]
}

# Write JSON data to a file
with open('modified_data.json', 'w') as file:
    json.dump(data, file, indent=4)

print("JSON data successfully written to file.")

In this example:

We define the data as a Python dictionary.
The json.dump() method writes the data to a file named modified_data.json.
The indent=4 parameter makes the JSON file readable by adding indentation.

Learning these common operations (accessing nested data, modifying it, and saving it to a file) is very important for working with JSON files in Python. They allow you to manipulate and organize your data for many use cases.
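
To check that a write worked, you can read the file back and compare it with the original object. The sketch below does this in a temporary directory so it leaves no files behind. The file name is reused from the example above, and the encoding argument is an addition for portability:

```python
import json
import os
import tempfile

original = {"name": "Alice", "skills": ["Python", "Data Science"]}

# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "modified_data.json")

    with open(path, "w", encoding="utf-8") as file:
        json.dump(original, file, indent=4)

    with open(path, "r", encoding="utf-8") as file:
        restored = json.load(file)

print(restored == original)
```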

Next up, we’ll cover error handling when reading JSON files so your programs don’t crash.

Error Handling When Reading JSON Files

When working with JSON data in Python, you need to handle errors that can happen when reading or parsing JSON files. Errors can occur due to many reasons such as invalid JSON syntax, incorrect file paths, or file encoding issues. Proper error handling will ensure your Python script runs smoothly and can recover from unexpected errors.

Let’s explore some common errors and how to handle them effectively in Python.

Handling File Not Found Error

If the specified JSON file does not exist or the file path is incorrect, Python raises a FileNotFoundError. You can use a try-except block to catch this error and display a user-friendly message.

Example:

import json

try:
    # Attempt to open a non-existent file
    with open('data.json', 'r') as file:
        data = json.load(file)
except FileNotFoundError:
    print("Error: The file was not found. Please check the file path.")

In this code:

We attempt to read the data.json file.
If the file does not exist, the FileNotFoundError is caught, and a meaningful error message is printed.

Handling Invalid JSON Syntax

If the JSON file contains invalid syntax (e.g., missing commas, braces, or brackets), Python raises a json.JSONDecodeError. You can handle this error using a try-except block to prevent your program from crashing.

Example:

import json

# The closing brace of the outer object is missing
invalid_json = '''
{
    "name": "Alice",
    "age": 25,
    "address": {
        "city": "New York"
    }
'''

try:
    # Attempt to parse invalid JSON
    data = json.loads(invalid_json)
except json.JSONDecodeError as e:
    print(f"Error: Invalid JSON format. {e}")

Here:

The invalid_json string is missing the closing brace of the outer object.
The json.JSONDecodeError is caught, and an error message specifying the issue is printed.

Handling Incorrect File Encoding

Sometimes, the JSON file might be saved with an encoding other than UTF-8, which can cause decoding errors when reading the file. Python raises a UnicodeDecodeError in such cases. You can specify the correct encoding when opening the file to avoid the issue, and catch the error if decoding still fails.

Example:

import json

try:
    # Open a file with a specific encoding
    with open('data.json', 'r', encoding='utf-8') as file:
        data = json.load(file)
except UnicodeDecodeError:
    print("Error: Unable to decode the file. Please check the file's encoding.")

In this code:

We specify encoding='utf-8' when reading the file.
If there is a problem with the file encoding, a UnicodeDecodeError is caught and an appropriate error message is displayed.

General Exception Handling

You can also use a general except block to catch any other unexpected errors that might occur when reading or working with JSON files.

Example:

import json

try:
    # Attempt to read JSON file
    with open('data.json', 'r') as file:
        data = json.load(file)
except Exception as e:
    print(f"An unexpected error occurred: {e}")

This code:

Uses a general Exception to catch any errors that don’t fall into specific categories.
Prints the error message to help identify the problem.

Error handling is an essential part of working with JSON files, as it helps you manage issues like missing files, incorrect formats, and encoding problems. By catching these errors early, you can ensure that your Python scripts run more smoothly and are easier to debug.
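
The individual handlers can also be combined in one place. Here is a minimal sketch, assuming a hypothetical helper named load_json_file that returns None whenever the file cannot be read or parsed:

```python
import json

def load_json_file(path, encoding="utf-8"):
    """Return the parsed JSON from path, or None if it cannot be read or parsed."""
    try:
        with open(path, "r", encoding=encoding) as file:
            return json.load(file)
    except FileNotFoundError:
        print(f"Error: '{path}' was not found. Please check the file path.")
    except UnicodeDecodeError:
        print(f"Error: '{path}' could not be decoded as {encoding}.")
    except json.JSONDecodeError as e:
        print(f"Error: '{path}' contains invalid JSON. {e}")
    return None

# A file name that is assumed not to exist, to show the fallback
data = load_json_file("missing_example.json")
print(data)
```

Catching the specific exceptions first keeps the messages precise. A broad except Exception is then only needed for cases you did not anticipate.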

Final Thoughts

Reading and working with JSON files in Python is a crucial skill for developers, especially when dealing with APIs, web applications, or data storage. Python’s built-in json module makes it easy to handle JSON data, whether you’re reading from a file, parsing a string, or modifying the data.

By learning how to load, manipulate, and write JSON data in Python, you can efficiently manage structured data in your projects. JSON’s flexibility and readability make it one of the most widely used formats today, and with Python’s tools, you can easily integrate it into any application.

With the techniques discussed in this blog, you should be well-equipped to handle JSON data in your Python projects, no matter the complexity of the data structures. For more tutorials like these, follow our blog. If you have any questions or feedback, our support team is here to help you.

Frequently Asked Questions (FAQs)

Q. What is the difference between `json.load()` and `json.loads()` in Python?

json.load() is used to read and parse JSON data from a file, while json.loads() is used to parse JSON data from a string. The s in json.loads() stands for “string”, so it’s useful when you have JSON data as a string, not in a file.
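
Example (a small sketch where io.StringIO stands in for an open file):

```python
import io
import json

text = '{"name": "John", "age": 30}'

# json.loads() parses a string that is already in memory
from_string = json.loads(text)

# json.load() reads from any file-like object; StringIO stands in for an open file here
from_file = json.load(io.StringIO(text))

print(from_string == from_file)
```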

Q. How do I convert a JSON string to a Python object?

You can convert a JSON string to a Python object using json.loads(). This function parses the JSON string and returns a Python dictionary or list.

Example:

import json

json_string = '{"name": "John", "age": 30}'
data = json.loads(json_string)
print(data)

Q. How do I write JSON data to a file in Python?

To write JSON data to a file, use json.dump(). Open the file in write mode, then pass the Python object and file to json.dump() to store the JSON data in the file.

Example:

import json

data = {"name": "John", "age": 30}

with open('data.json', 'w') as file:
    json.dump(data, file)

Q. How do I handle errors when working with JSON in Python?

Common errors when working with JSON include invalid JSON format or incorrect file paths. To handle these, you can use Python’s try-except blocks to catch exceptions like json.JSONDecodeError or FileNotFoundError.

Example:

import json

try:
    with open('data.json', 'r') as file:
        data = json.load(file)
except FileNotFoundError:
    print("File not found!")
except json.JSONDecodeError:
    print("Invalid JSON format!")

Unstructured Data vs Structured (Key Characteristics Compared)

2024-09-05T10:02:50.000Z

Big data has caused a revolution in how companies work and choose what to do. A key part of this change is the difference between unstructured data and structured data. As you deal with the complex world of data analytics and business intelligence, it’s essential to understand these two types of data to use them in your company.

This article looks into the main features that make unstructured data different from structured data. You’ll learn about their definitions and forms, see the challenges and opportunities in storing and managing data, and find out how each type affects data analysis and processing. By the end of this article, you’ll see how these data types shape machine learning and web scraping, and how understanding them helps you make better business choices.

What is Structured Data?

Structured data means info that follows a set layout and order. It fits a specific data model so both people and machines can read and grasp it. You’ll typically see structured data in relational databases or spreadsheets set up in rows and columns with fixed fields.

The main features of structured data are:

Clear structure with identifiable traits
Same order and format throughout
People and computer programs can access and use it
Stored in preset schemas like databases

Some structured data examples are customer files with names and addresses, credit card numbers, stock info, and number-based survey answers.

What is Unstructured Data?

Unstructured data doesn’t follow a set data model or pattern. This kind of information takes many shapes and can’t fit into regular databases. Unstructured data is more about quality and needs special methods to analyze it well.

Unstructured data examples:

Text files (Word documents, PDFs)
Emails and posts on social media
Pictures, sound files, and videos
Data from IoT device sensors

Structured vs Unstructured Data

To get a good grasp on how structured and unstructured data formats differ, let’s look at their main features:

Storage: People usually keep structured data in relational databases (RDBMS) that use SQL. On the other hand, unstructured data finds its home in non-relational (NoSQL) databases or data lakes.
Organization: You’ll find structured data arranged in tables with rows and columns. In contrast, unstructured data doesn’t have a set structure and stays in its original form.
Querying: SQL makes it a breeze to search and work with structured data. However, when it comes to unstructured data, you need special tools and methods to analyze it.
Flexibility: Structured data has limitations when it comes to adding new types of information, as schema changes need significant database updates. Unstructured data gives you more room to work within this area.
Processing: Machine learning systems can handle structured data with ease, but unstructured data often calls for more advanced methods to get meaningful insights.

Storage and Management

Structured and unstructured data extraction pose different challenges and offer various opportunities when it comes to data management and storage. Let’s take a closer look at how organizations store and manage these two types of data in various settings.

Structured Data Storage

Relational databases and data warehouses store structured data. These systems use a predefined schema, often called “schema-on-write,” which means you decide on the data structure before storing it. You’ll find that Structured Query Language (SQL) manages structured data, making it easy to input, search, and change data.

Data warehouses, with their strict schemas, work well to store structured data. But this strictness can cause problems when it needs to change. Any changes to the schema might force you to update all the existing structured data, which can take a long time and disrupt your work.

Unstructured Data Storage

Unstructured data lacks a predefined data model. Users store this data in its original format and process it when necessary, a method called “schema-on-read.” To handle the huge amounts of unstructured data, which can make up to 90% of company data, you’ll need more adaptable storage options.

Cloud data lakes have gained popularity to store unstructured data. They provide enormous storage abilities with pricing based on usage, making them cost-effective and easy to scale. NoSQL databases offer another choice, allowing you to store different data formats without a fixed structure.
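
To make the difference between schema-on-write and schema-on-read concrete, here is a small sketch using Python’s built-in sqlite3 and json modules. The table, field names, and records are made up for illustration, and an in-memory SQLite database stands in for a relational store:

```python
import json
import sqlite3

# Schema-on-write: the table layout is fixed before any row is stored
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)")
conn.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Alice", "New York"))

try:
    # A field the schema does not know about is rejected at write time
    conn.execute("INSERT INTO customers (name, email) VALUES (?, ?)", ("Bob", "bob@example.com"))
    rejected = False
except sqlite3.OperationalError:
    rejected = True

# Schema-on-read: raw records are stored as-is and interpreted only when read
raw_records = [
    '{"name": "Alice", "city": "New York"}',
    '{"name": "Bob", "email": "bob@example.com", "tags": ["vip"]}',
]
emails = [json.loads(record).get("email") for record in raw_records]

print(rejected, emails)
conn.close()
```

The relational table refuses the unknown email column until the schema is changed, while the raw JSON records accept any shape and leave the interpretation to the reader.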

Management Challenges

Unstructured data management poses several hurdles. The massive volume, diverse types, and rapid influx of unstructured data can overwhelm traditional storage systems. As your data expands, you’ll need a storage infrastructure that manages data efficiently.

To analyze unstructured data, you need special tools and methods, like natural language processing, machine learning, and AI. These advanced technologies can help you gain valuable insights from various data types, such as text documents, images, and videos.

To tackle these issues, think about putting a data management plan into action that includes:

Adaptable data models to handle new fields and data types
Strong storage systems supporting quick responses and speedy data updates
Data archiving that works well to stop data loss and cut storage costs
Solutions that can scale up as your data needs grow

Data Analysis and Processing

Looking at and working with data is different for organized and messy information. Knowing these differences is key to getting useful insights from your data.

Structured Data Analysis

Structured data analysis deals with information that follows a set format often found in tables or databases. This data type has a clear organization and people can search it using standard methods. The consistent and reliable nature of structured data adds to the quality and trustworthiness of the analysis process.

You can use structured data to:

Carry out precise and quick analysis
Use advanced analytical methods like statistical models and machine learning
Build reports, dashboards, and visuals to gain useful insights
Search, filter, and sort data with ease for focused exploration
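
As an illustration of how little code it takes to search, filter, sort, and aggregate structured data, here is a sketch with an in-memory SQLite table. The orders table and its rows are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("Alice", "East", 120.0), ("Bob", "West", 80.0), ("Cara", "East", 200.0), ("Dan", "West", 50.0)],
)

# Filter and sort: orders of at least 100, largest first
large_orders = conn.execute(
    "SELECT customer, amount FROM orders WHERE amount >= 100 ORDER BY amount DESC"
).fetchall()

# Aggregate: total amount per region
totals = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

print(large_orders)
print(totals)
conn.close()
```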

Unstructured Data Analysis

Unstructured data analysis aims to make sense of information that doesn’t fit into typical rows and columns. This includes text, images, videos, and more. The process involves looking at, cleaning up, changing, and modeling data using different analytical and statistical tools.

Key aspects of unstructured data analysis include:

Natural Language Processing (NLP) to analyze text
Techniques to analyze images and videos
Methods to process audio
Analysis of sensor data from IoT devices

Processing Techniques

To handle both structured and unstructured data well, you need to use different processing methods:

Data Classification: Group data by metadata, like file type or content, to boost management and follow rules better.
Metadata Analysis: Use “data about data” to gain insights for unstructured stuff like blog posts or pictures.
Machine Learning: Use AI systems to study and find meaning in unstructured data, like spotting things in images or sorting text.
Data Visualization: Show data in pictures or graphs so people can understand and study it more.
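
The data classification step, for instance, can start with something as simple as grouping files by type. Below is a rough sketch using Python’s built-in mimetypes module. The helper name classify_by_type and the file names are hypothetical, and the guess relies on the file extension only, not the file contents:

```python
import mimetypes
from collections import defaultdict

def classify_by_type(filenames):
    """Group file names by the top-level MIME type guessed from their extension."""
    groups = defaultdict(list)
    for name in filenames:
        mime_type, _ = mimetypes.guess_type(name)
        category = mime_type.split("/")[0] if mime_type else "unknown"
        groups[category].append(name)
    return dict(groups)

files = ["report.pdf", "photo.jpg", "call.mp3", "notes.txt", "clip.mp4", "data.xyz123"]
print(classify_by_type(files))
```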

Final Thoughts

The way businesses handle and use their information assets depends on whether data is structured or unstructured. Structured data has an organized format, which makes it easy to analyze and query. This makes it perfect for traditional database systems. In contrast, unstructured data gives more flexibility and can capture many different types of information. However, to analyze it well, you need special tools.

As data keeps getting more extensive and more diverse, companies need to come up with plans to handle both structured and unstructured data well. This means putting money into storage solutions that can grow, using cutting-edge analytics methods, and applying machine learning to get insights from different data sources. By getting to know what makes each type of data unique, businesses can tap into the full power of their data to spark new ideas and make intelligent choices.

FAQs

What is structured vs. unstructured data?
Structured data has an organization that allows it to fit into tables or databases. It includes specific types such as numbers, short texts, or dates. Unstructured data, however, is hard to organize because of its nature or size. This type includes formats like audio, video, and large text documents.

Can you list five key differences between structured and unstructured data?
Sure, here are the main differences: Structured data has standardization and searchability, while unstructured data often stays in its original form. Structured data is quantitative, so you can measure and count it, but unstructured data is qualitative, focusing more on descriptions. Also, structured data lives in data warehouses, while unstructured data ends up in data lakes.

What best describes unstructured data?
One standout thing about unstructured data is that it doesn’t follow a specific data model. This sets it apart from structured data, which sticks to a clear model and organization.

What are the characteristics of structured data?
Structured data sticks to a data model with a clear structure that puts info into rows and columns. This setup makes sure that the data’s definition, format, and meaning are well-defined and stay that way.

How to Scrape Monster Jobs with Python

2024-08-27T10:00:00.000Z

Monster.com is one of the top sites for job seekers and employers, with millions of job listings across diverse industries. It’s a great place for job hunters and employers looking for employees. By scraping Monster.com, you can get tons of job data to use for analytics, to monitor the job market, or to build a custom job search tool.

Monster.com attracts more than 6.7 million visitors monthly and hosts thousands of active job listings, making it a goldmine of helpful information. Yet, its ever-changing nature means you need an intelligent strategy to scrape Monster.com. This guide will show you how to set up a Python environment, build a Monster.com page scraper, and make it better with the Crawlbase Crawling API. This API helps deal with tricky stuff like JavaScript rendering and endless scroll pagination.

This guide will provide you with the knowledge and tools you need to scrape Monster.com without any hurdles. Let’s get started!

Why Scrape Monster.com?
Key Data Points to Extract from Monster.com
Crawlbase Crawling API for Monster.com Scraping

Why Use Crawlbase Crawling API?
Crawlbase Python Library

Setting Up Your Python Environment

Installing Python
Setting Up a Virtual Environment
Installing Required Libraries
Choosing the Right IDE

Scraping Monster.com Job Listings

Inspecting Monster.com Job Listing Page
Writing the Monster.com Listing Scraper
Handling Scroll Pagination
Storing Data in a JSON File
Complete Code Example

Scraping Monster.com Job Pages

Inspecting Monster.com Job Page
Writing the Monster.com Job Page Scraper
Storing Data in a JSON File
Complete Code Example

Final Thoughts
Frequently Asked Questions

Why Scrape Monster.com?

Scraping Monster.com gives you access to a huge amount of job market data, offering key insights that are hard to gather manually. By automating how you collect job listings, you can get info like job titles, locations, salaries, company names, and job descriptions. This data is key to understanding current market trends and making choices based on facts.

You might be a recruiter who wants to look at competitor job posts, a data analyst studying job trends, or a job seeker who wants to keep track of changes across industries. If so, Monster.com is a great source. With millions of job listings, the platform is perfect for anyone who needs fresh and detailed job data.

Automating this data gathering saves time and gives you reliable, accurate information. Rather than looking through and collecting data by hand, you can zero in on making sense of it. This lets you build useful things like job finders, trend trackers, or ways to compare pay.

Key Data Points to Extract from Monster.com

When you’re pulling data from Monster.com, you should zero in on getting the key info that gives you useful insights. Here’s what you need to grab:

Job Title: This tells you what jobs are open right now.
Job Description: This sums up the job, covering duties, skills needed, work experience required, and any perks.
Company Name: Identifies the employers offering the jobs.
About the Company: This gives a quick look at the company’s story and goals.
Company Website: The company’s main web page.
Company Size: This tells you how big the company is, like how many people work there.
Year Founded: The year the company was established.
Location: Shows where the job is based, helping filter positions by region.
Salary Information: If it’s there, this helps you know what pay to expect.
Job Posting Date: This enables you to see how new the job ads are.
Job Type: Whether the job is full-time, part-time, contract, etc.
Application Link: This takes you straight to where you can apply for the job.
Industry: This points out what field the job is in, like tech, healthcare, or finance.
Required Skills: Skills needed for the role.
Job ID: A unique identifier for each job posting, useful for tracking and updating listings.

Having a clear idea of what to extract before starting your Monster.com scraper helps you stay focused and ensures that your scraper collects meaningful data.
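
One way to pin these fields down before writing any scraping code is a small data class. The sketch below is only an assumption about how you might model a posting. The class name, field names, and sample values are hypothetical and cover a subset of the data points listed above:

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional

@dataclass
class JobPosting:
    job_id: str
    title: str
    company: str
    location: str
    job_type: Optional[str] = None          # e.g. full-time, part-time, contract
    salary: Optional[str] = None            # not every listing includes pay
    posted_date: Optional[str] = None
    application_link: Optional[str] = None
    required_skills: List[str] = field(default_factory=list)

job = JobPosting(job_id="example-123", title="Java Developer", company="Example Corp", location="New York, NY")
print(asdict(job))
```

Optional fields default to None so a listing with missing details still produces a complete, predictable record.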

Crawlbase Crawling API for Monster.com Scraping

When scraping Monster.com, handling JavaScript-rendered content and navigating dynamic pages can be challenging with simple scraping techniques. That’s where the Crawlbase Crawling API comes in handy. This tool helps manage these complexities by rendering JavaScript and handling pagination efficiently.

Why Use Crawlbase Crawling API?

Monster.com relies on JavaScript to load job listings and other content dynamically. Traditional scraping methods, which only fetch static HTML, often fail to capture this content. Crawlbase Crawling API overcomes these limitations by simulating a real browser, ensuring that all JavaScript-rendered elements are loaded and accessible.

Key Features of Crawlbase Crawling API

JavaScript Rendering: Crawlbase can handle the execution of JavaScript on the page, allowing you to scrape data that is loaded dynamically.
Avoid IP Blocking and CAPTCHAs: Crawlbase automatically rotates IPs and bypasses CAPTCHAs, allowing uninterrupted scraping of Monster.com without facing IP blocks or CAPTCHA challenges.
Handling Pagination: The API allows for all kinds of paging methods, including the “infinite scrolling” found on job boards like Monster.
Request Options: Customize your scraping requests with options for handling cookies, setting user agents, and more, making your scraping efforts more robust and reliable.

Crawlbase Python Library

Crawlbase has a Python library that makes web scraping a lot easier. This library requires an access token to authenticate, which you can get by registering an account with Crawlbase.

Here’s an example function demonstrating how to use the Crawlbase Crawling API to send requests:

from crawlbase import CrawlingAPI

# Initialize Crawlbase API with your access token
crawling_api = CrawlingAPI({ 'token': 'YOUR_CRAWLBASE_TOKEN' })

def make_crawlbase_request(url):
    response = crawling_api.get(url)

    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        return html_content
    else:
        print(f"Failed to fetch the page. Crawlbase status code: {response['headers']['pc_status']}")
        return None

Note: Crawlbase offers two types of tokens:
Normal Token for static sites.
JavaScript (JS) Token for dynamic or browser-based requests.

For scraping dynamic sites like Monster.com, you’ll need the JS Token. Crawlbase provides 1,000 free requests to get you started, and no credit card is required for this trial. For more details, check out the Crawlbase Crawling API documentation.
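
Requests can fail from time to time, and make_crawlbase_request returns None when they do. As a hedged sketch, a small retry wrapper could call any such function a few times before giving up. The name fetch_with_retries and the stand-in flaky_fetch are hypothetical and not part of the Crawlbase library:

```python
import time

def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0):
    """Call fetch(url) until it returns a non-None result or attempts run out."""
    for attempt in range(1, attempts + 1):
        result = fetch(url)
        if result is not None:
            return result
        if attempt < attempts:
            # Wait longer after each failure: base_delay, then 2x, then 4x, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
    return None

# Stand-in for make_crawlbase_request: fails twice, then succeeds
calls = {"count": 0}

def flaky_fetch(url):
    calls["count"] += 1
    return "<html>ok</html>" if calls["count"] >= 3 else None

html = fetch_with_retries(flaky_fetch, "https://example.com", base_delay=0.01)
print(html, calls["count"])
```

In a real script you would pass make_crawlbase_request in place of flaky_fetch. Keep in mind that every retry is another API request.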

In the next section, we’ll guide you through setting up a Python environment for this project. Let’s get started with the setup!

Setting Up Your Python Environment

We need to set up your Python environment before creating the Monster.com scraper. This section covers the essentials: installing Python and libraries, setting up a virtual environment, and choosing an IDE.

Installing Python

First, ensure you have Python installed on your computer. Python is a very flexible language used for many things, one of which is web scraping. You can download it from the official Python website. Follow the installation instructions specific to your operating system.

Setting Up a Virtual Environment

A virtual environment makes it easier to manage project dependencies without affecting other Python projects. Here’s how to set one up:

Create a Virtual Environment: Navigate to your project directory in the terminal and run:

python -m venv monster_env

Activate the Virtual Environment:

On Windows:
monster_env\Scripts\activate
On macOS/Linux:
source monster_env/bin/activate

Installing Required Libraries

With the virtual environment activated, you’ll need to install a few libraries for web scraping and data processing.

Crawlbase: The main library for sending requests with the Crawlbase Crawling API.
BeautifulSoup4: For parsing HTML and XML documents.
Pandas: For handling and analyzing data.

You can install these libraries using pip. Open your terminal or command prompt and run:

pip install crawlbase beautifulsoup4 pandas

Choosing the Right IDE

An Integrated Development Environment (IDE) makes coding easier by providing useful features like syntax highlighting, debugging tools, and project management. Here are a few popular IDEs for Python development:

PyCharm: A professional IDE for Python with lots of really cool features.
Visual Studio Code (VS Code): A nice lightweight general use editor with good Python support via plugins.
Jupyter Notebook: Good for interactive coding and data analysis, especially when you need to visualize the data.

Choose an IDE that fits your preferences and workflow to streamline your coding experience.

Now that we have Python installed, libraries downloaded, and the development environment configured, we can proceed to the next phase: writing the Monster.com scraper.

Scraping Monster.com Job Listings

Let’s start by scraping Monster.com job listings with Python. Since Monster.com uses dynamic content loading and scroll-based pagination, simple scraping methods won’t be enough. We’ll use Crawlbase’s Crawling API to take care of the JavaScript rendering and the scroll pagination, so that we capture as many job postings as possible.

Inspecting Monster.com Job Listing Page

The first thing to do is to examine the HTML structure of the job posting page before jumping into the code. Knowing the hierarchy allows us to figure out the proper CSS selectors to get job information, such as the job title, company, location, and job URL.

Visit the URL: Open Monster.com and navigate to a job listing page.
Open Developer Tools: Right-click anywhere on the page and select “Inspect” to open the Developer Tools.
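Identify Key Elements: Job listings are typically found within <article> elements with the attribute data-testid="svx_jobCard" inside a <div> with the ID JobCardGrid. The key elements include:
Job Title: An <a> element with data-testid="jobTitle".
Company: A <span> element with data-testid="company".
Location: A <span> element with data-testid="jobDetailLocation".
Job Link: The href attribute of the job title <a> element.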

Writing the Monster.com Listing Scraper

Now, let’s write the scraper to extract job details from Monster.com. We’ll use the Crawlbase Crawling API, which simplifies handling dynamic content.

Here’s the code:

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import json

# Initialize Crawlbase API with your access token
crawling_api = CrawlingAPI({'token': 'YOUR_CRAWLBASE_TOKEN'})

def scrape_monster_jobs(url):
    options = {
        'ajax_wait': 'true',
        'page_wait': '5000'  # Wait for the page to fully load
    }
    response = crawling_api.get(url, options)

    if response['headers']['pc_status'] == '200':
        soup = BeautifulSoup(response['body'], 'html.parser')
        job_cards = soup.select('div#JobCardGrid article[data-testid="svx_jobCard"]')

        job_listings = []
        for job in job_cards:
            title = job.select_one('a[data-testid="jobTitle"]').text.strip() if job.select_one('a[data-testid="jobTitle"]') else ''
            company = job.select_one('span[data-testid="company"]').text.strip() if job.select_one('span[data-testid="company"]') else ''
            location = job.select_one('span[data-testid="jobDetailLocation"]').text.strip() if job.select_one('span[data-testid="jobDetailLocation"]') else ''
            link = job.select_one('a[data-testid="jobTitle"]')['href'] if job.select_one('a[data-testid="jobTitle"]') else ''

            job_listings.append({
                'Job Title': title,
                'Company': company,
                'Location': location,
                'Job Link': link
            })

        return job_listings
    else:
        print(f"Failed to fetch data. Status code: {response['headers']['pc_status']}")
        return None

The options parameter includes settings like ajax_wait for handling asynchronous content loading and page_wait to wait 5 seconds before scraping, allowing all elements to load properly. You can read about Crawlbase Crawling API parameters here.

Handling Scroll Pagination

Monster.com uses scroll-based pagination to load more job listings dynamically. To capture all job listings, we’ll utilize the scroll and scroll_interval parameters provided by Crawlbase Crawling API.

scroll=true: Enables scroll-based pagination.
scroll_interval=60: Sets the scroll duration to 60 seconds, which is the maximum allowed time. Since we added time for scrolling, there’s no need to explicitly set page_wait.

Here’s how you can handle it:

def scrape_monster_with_pagination(url):
    options = {
        'ajax_wait': 'true',
        'scroll': 'true',  # Enables scroll pagination
        'scroll_interval': '60'  # Scroll duration set to 60 seconds
    }

    response = crawling_api.get(url, options)
    if response['headers']['pc_status'] == '200':
        soup = BeautifulSoup(response['body'], 'html.parser')
        job_cards = soup.select('div#JobCardGrid article[data-testid="svx_jobCard"]')

        all_jobs = []
        for job in job_cards:
            title = job.select_one('a[data-testid="jobTitle"]').text.strip() if job.select_one('a[data-testid="jobTitle"]') else ''
            company = job.select_one('span[data-testid="company"]').text.strip() if job.select_one('span[data-testid="company"]') else ''
            location = job.select_one('span[data-testid="jobDetailLocation"]').text.strip() if job.select_one('span[data-testid="jobDetailLocation"]') else ''
            link = job.select_one('a[data-testid="jobTitle"]')['href'] if job.select_one('a[data-testid="jobTitle"]') else ''

            all_jobs.append({
                'Job Title': title,
                'Company': company,
                'Location': location,
                'Job Link': link
            })

        return all_jobs
    else:
        print(f"Failed to fetch data. Status code: {response['headers']['pc_status']}")
        return None
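
The two option sets used so far differ only in how they wait for content. As a sketch, a hypothetical helper named build_options could produce either one. It uses only the parameters mentioned above, passes values as strings like the examples do, and caps scroll_interval at the 60-second maximum:

```python
def build_options(scroll=False, scroll_interval=60, page_wait=5000):
    """Build the options dict passed to crawling_api.get(), with string values."""
    options = {"ajax_wait": "true"}
    if scroll:
        # Scrolling already adds waiting time, so page_wait is left out
        options["scroll"] = "true"
        options["scroll_interval"] = str(min(scroll_interval, 60))  # 60 seconds is the stated maximum
    else:
        options["page_wait"] = str(page_wait)
    return options

print(build_options())
print(build_options(scroll=True))
```

You could then call crawling_api.get(url, build_options(scroll=True)) instead of repeating the dictionary in each function.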

Storing Data in a JSON File

Once you have scraped the job data, you can easily store it in a JSON file for future use or analysis:

def save_to_json(data, filename='monster_jobs.json'):
    with open(filename, 'w') as file:
        json.dump(data, file, indent=4)
    print(f"Data saved to {filename}")

# Example usage after scraping
if jobs:
    save_to_json(jobs)
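
Scrolling can surface the same posting more than once, so you may want to remove duplicates before saving. The sketch below assumes that the path of the job link identifies a posting and that the query string only carries tracking values. That is an assumption about Monster.com URLs, not a documented rule, and the sample links are made up:

```python
from urllib.parse import urlsplit

def dedupe_jobs(jobs):
    """Keep the first job for each link, ignoring the query string; jobs without a link are kept."""
    seen = set()
    unique = []
    for job in jobs:
        link = job.get("Job Link", "")
        if link:
            # Compare links without their query string and fragment
            key = urlsplit(link)._replace(query="", fragment="").geturl()
            if key in seen:
                continue
            seen.add(key)
        unique.append(job)
    return unique

sample = [
    {"Job Title": "Java Developer", "Job Link": "https://www.monster.com/job-openings/java-dev--abc?sid=1"},
    {"Job Title": "Java Developer", "Job Link": "https://www.monster.com/job-openings/java-dev--abc?sid=2"},
    {"Job Title": "Backend Developer", "Job Link": ""},
]
print(len(dedupe_jobs(sample)))
```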

Complete Code Example

Here’s the full code combining everything discussed:

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import json

# Initialize Crawlbase API with your access token
crawling_api = CrawlingAPI({'token': 'YOUR_CRAWLBASE_TOKEN'})

def scrape_monster_with_pagination(url):
    options = {
        'ajax_wait': 'true',
        'scroll': 'true',  # Enables scroll pagination
        'scroll_interval': '60'  # Scroll duration set to 60 seconds
    }

    response = crawling_api.get(url, options)
    if response['headers']['pc_status'] == '200':
        soup = BeautifulSoup(response['body'], 'html.parser')
        job_cards = soup.select('div#JobCardGrid article[data-testid="svx_jobCard"]')

        all_jobs = []
        for job in job_cards:
            title = job.select_one('a[data-testid="jobTitle"]').text.strip() if job.select_one('a[data-testid="jobTitle"]') else ''
            company = job.select_one('span[data-testid="company"]').text.strip() if job.select_one('span[data-testid="company"]') else ''
            location = job.select_one('span[data-testid="jobDetailLocation"]').text.strip() if job.select_one('span[data-testid="jobDetailLocation"]') else ''
            link = job.select_one('a[data-testid="jobTitle"]')['href'] if job.select_one('a[data-testid="jobTitle"]') else ''

            all_jobs.append({
                'Job Title': title,
                'Company': company,
                'Location': location,
                'Job Link': link
            })

        return all_jobs
    else:
        print(f"Failed to fetch data. Status code: {response['headers']['pc_status']}")
        return None

def save_to_json(data, filename='monster_jobs.json'):
    with open(filename, 'w') as file:
        json.dump(data, file, indent=4)
    print(f"Data saved to {filename}")

if __name__ == "__main__":
    base_url = 'https://www.monster.com/jobs/search?q=Java+Developers&where=New+York&page=1&so=p.s.lh'
    jobs = scrape_monster_with_pagination(base_url)

    if jobs:
        save_to_json(jobs)

This code shows you how to set up the scraper, handle scroll pagination, and store the data in a structured JSON format, making it easy for you to use the scraped data later on.

Example Output:

[
    {
        "Job Title": "Java Developer(Core Java)",
        "Company": "Georgia IT Inc.",
        "Location": "New York, NY",
        "Job Link": "https://www.monster.com/job-openings/java-developer-core-java-new-york-ny--1abe38e2-8183-43d3-a152-ecdf208db3bf?sid=3a00f5d1-d543-4de0-ab00-0f9e9c8079f8&jvo=m.mo.s-svr.1&so=p.s.lh&hidesmr=1"
    },
    {
        "Job Title": "Java Backend Developer(Java, Spring, Microservices, Maven)",
        "Company": "Diverse Lynx",
        "Location": "Manhattan, NY",
        "Job Link": "https://www.monster.com/job-openings/java-backend-developer-java-spring-microservices-maven-manhattan-ny--7228b274-60bb-41d7-b8d9-8a51ff8c8d1c?sid=3a00f5d1-d543-4de0-ab00-0f9e9c8079f8&jvo=m.mo.s-svr.2&so=p.s.lh&hidesmr=1"
    },
    {
        "Job Title": "Java Backend Developer(Java, Spring, Microservices, Maven)",
        "Company": "Diverse Linx",
        "Location": "Manhattan, NY",
        "Job Link": "https://www.monster.com/job-openings/java-backend-developer-java-spring-microservices-maven-manhattan-ny--2d652118-1c17-43e3-8cef-7b940a8b0490?sid=3a00f5d1-d543-4de0-ab00-0f9e9c8079f8&jvo=m.mo.s-svr.3&so=p.s.lh&hidesmr=1"
    },
    {
        "Job Title": "JAVA FULL STACK DEVELOPER",
        "Company": "HexaQuEST Global",
        "Location": "Brooklyn, NY",
        "Job Link": "https://www.monster.com/job-openings/java-full-stack-developer-brooklyn-ny--c60fb5f3-5adf-43a7-bfac-03c93853dd4e?sid=3a00f5d1-d543-4de0-ab00-0f9e9c8079f8&jvo=m.mo.s-svr.4&so=p.s.lh&hidesmr=1"
    },
    .... more
]

Scraping Monster.com Job Pages

After collecting job listings from Monster.com, the next step is to scrape detailed information from each job page. In this section, we’ll guide you on how to extract specific details like job descriptions, requirements, and company information from the job pages.

Inspecting Monster.com Job Page

To begin scraping a job page, you need to inspect its HTML structure to identify the elements that hold the data you need. Here’s how you can do it:

Visit a Job Page: Click on a job listing to open its detailed page.
Open Developer Tools: Right-click on the page and select “Inspect” to open the Developer Tools.
Identify Key Elements: Look for elements containing information such as:
- Job Title: Found within an <h2> tag with the attribute data-testid="jobTitle".
- Job Description: Usually located within a <div> tag with the attribute data-testid="svx-description-container-inner".
- Numbers & Facts: Typically found in a table row (<tr>) inside a table with the attribute data-testid="svx-jobview-details-table".
- About Company: Located within a <div> tag with a class containing "about-styles__AboutCompanyContainerInner".

Writing the Monster.com Job Page Scraper

Once you know the structure, you can create a scraper to extract these details. The Crawlbase Crawling API will help handle any dynamic content while ensuring a smooth scraping process.

Here’s an example of the code:

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import json

# Initialize Crawlbase API with your access token
crawling_api = CrawlingAPI({'token': 'YOUR_CRAWLBASE_TOKEN'})

def scrape_job_page(url):
    options = {
        'ajax_wait': 'true',
        'page_wait': '5000'  # Wait for the page to fully load
    }

    response = crawling_api.get(url, options)
    if response['headers']['pc_status'] == '200':
        soup = BeautifulSoup(response['body'].decode('utf-8'), 'html.parser')

        job_title = soup.select_one('h2[data-testid="jobTitle"]').text.strip() if soup.select_one('h2[data-testid="jobTitle"]') else ''
        job_description = soup.select_one('div[data-testid="svx-description-container-inner"]').text.strip() if soup.select_one('div[data-testid="svx-description-container-inner"]') else ''
        # Collect the details table as a single dict of label/value pairs, matching the example output
        numbersAndfacts = {tr.select_one('td:first-child').text.strip(): tr.select_one('td:last-child').text.strip() for tr in soup.select('table[data-testid="svx-jobview-details-table"] tr')}
        about_company = soup.select_one('div[class*="about-styles__AboutCompanyContainerInner"]').text.strip() if soup.select_one('div[class*="about-styles__AboutCompanyContainerInner"]') else ''

        job_details = {
            'Job Title': job_title,
            'Job Description': job_description,
            'Numbers & Facts': numbersAndfacts,
            'About Company': about_company
        }

        return job_details
    else:
        print(f"Failed to fetch data. Status code: {response['headers']['pc_status']}")
        return None

Storing Data in a JSON File

Once the job details are extracted, you can save the data in a JSON file for easier access and analysis:

def save_job_details_to_json(data, filename='job_details.json'):
    with open(filename, 'w') as file:
        json.dump(data, file, indent=4)
    print(f"Data saved to {filename}")

Complete Code Example

Here’s the full code that ties everything together:

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import json

# Initialize Crawlbase API with your access token
crawling_api = CrawlingAPI({'token': 'YOUR_CRAWLBASE_TOKEN'})

def scrape_job_page(url):
    options = {
        'ajax_wait': 'true',
        'page_wait': '5000'  # Wait for the page to fully load
    }

    response = crawling_api.get(url, options)
    if response['headers']['pc_status'] == '200':
        soup = BeautifulSoup(response['body'].decode('utf-8'), 'html.parser')

        job_title = soup.select_one('h2[data-testid="jobTitle"]').text.strip() if soup.select_one('h2[data-testid="jobTitle"]') else ''
        job_description = soup.select_one('div[data-testid="svx-description-container-inner"]').text.strip() if soup.select_one('div[data-testid="svx-description-container-inner"]') else ''
        # Collect the details table as a single dict of label/value pairs, matching the example output
        numbersAndfacts = {tr.select_one('td:first-child').text.strip(): tr.select_one('td:last-child').text.strip() for tr in soup.select('table[data-testid="svx-jobview-details-table"] tr')}
        about_company = soup.select_one('div[class*="about-styles__AboutCompanyContainerInner"]').text.strip() if soup.select_one('div[class*="about-styles__AboutCompanyContainerInner"]') else ''

        job_details = {
            'Job Title': job_title,
            'Job Description': job_description,
            'Numbers & Facts': numbersAndfacts,
            'About Company': about_company
        }

        return job_details
    else:
        print(f"Failed to fetch data. Status code: {response['headers']['pc_status']}")
        return None

def save_job_details_to_json(data, filename='job_details.json'):
    with open(filename, 'w') as file:
        json.dump(data, file, indent=4)
    print(f"Data saved to {filename}")

if __name__ == "__main__":
    job_url = 'https://www.monster.com/job-openings/d94729d8-929e-4c61-8f26-fd480c31e931'
    job_details = scrape_job_page(job_url)

    if job_details:
        save_job_details_to_json(job_details)

The code above demonstrates how to inspect Monster.com job pages, extract specific job information, and store it in a JSON file.

Example Output:

{
  "Job Title": "Delivery Station Warehouse Associate",
  "Job Description": "Amazon Delivery Station Warehouse AssociateJob OverviewYou\u2019ll be part of the dedicated Amazon team at the delivery station \u2013 the last stop before we deliver smiles to customers. Our fast-paced, active roles receive trucks full of orders, then prepare them for delivery. You\u2019ll load conveyor belts, and transport and stage deliveries to be picked up by drivers.Duties & ResponsibilitiesSome of your duties may include:Receive and prepare inventory for deliveryUse technology like smartphones and handheld devices to sort, scan, and prepare ordersBuild, wrap, sort, and transport pallets and packagesYou\u2019ll also need to be able to:Lift up to 49 poundsReceive truck deliveriesView prompts on screens and follow direction for some tasksStand, walk, push, pull, squat, bend, and reach during shiftsUse carts, dollies, hand trucks, and other gear to move items aroundGo up and down stairs (where applicable)Work at a height of up to 40 feet on a mezzanine (where applicable)What it\u2019s like at an Amazon Delivery StationSurroundings. You\u2019ll be working around moving machines \u2013 order pickers, stand-up forklifts, turret trucks, and mobile carts.Activity. Some activities may require standing in one place for long periods, walking around, or climbing stairs.Temperature. Even with climate controls, temperatures can vary between 60\u00b0F and 90\u00b0F in some parts of the warehouse; on hot days, temperatures can be over 90\u00b0F in the truck yard or inside trailers.Noise level. It can get noisy at times. We provide hearing protection if you need it.Dress code. Relaxed, with a few rules to follow for safety. Comfortable, closed-toe shoes are required and protective safety footwear are required in select business units. 
Depending on the role or location, Amazon provides a $110 Zappos gift code towards the purchase of shoes, in order to have you prepared for your first day on the job.Why You\u2019ll Love AmazonWe have jobs that fit any lifestyle, state-of-the-art workplaces, teams that support and listen to our associates, and company-driven initiatives and benefits to help support your goals.Our jobs are nearby, with great pay, and offer work-life balance.Schedule flexibility. Depending on where you work, schedules may include full-time (40 hours), reduced-time (30-36 hours) or part-time (20 hours or less), all with the option of working additional hours if needed. Learn more about our schedules .Shift options. Work when it works for you. Shifts may include overnight, early morning, day, evening, and weekend. You can even have four-day workweeks and three-day weekends. Find out more about our shifts .Anytime Pay. You can instantly cash out up to 70% of your earnings immediately after your shift (for select employee groups). Learn more about Anytime Pay .Our workplace is unlike any other.State-of-the-art facilities. We have modern warehouses that are clean and well-organized.Safety. Your safety is important to us. All teams share safety tips daily and we make sure protective gear is available onsite. Please note, wearing a hard hat and/or safety shoes while working is a requirement for some roles at certain sites.Our team supports and listens to you.Culture. Be part of an inclusive workplace that offers a variety of DEI programs and affinity groups.Team environment. Work on small or large teams that support each other in a workplace that\u2019s been ranked among the best workplaces in the world.New skills. Depending on the role and location, you\u2019ll learn how to use the latest Amazon technology \u2013 including handheld devices and robotics.Our company supports your goals.Benefits. 
Many of our shifts come with a range of benefits that may include pay and savings options, healthcare, peace of mind for you and your family, and more.Career advancement. We have made a pledge to upskill our employees and offer a variety of free training and development programs, and we also have tuition support options for select employee groups. See where your Amazon journey can take you.Learn more about all the reasons to choose Amazon . A full list of benefits and criteria for each to be offered a successful applicant can be found here .Requirements:Candidates must be 18 years or older with the ability to understand and adhere to all job requirement and safety guidelines in English.How To Get StartedYou can begin by applying above. If you need help with your application or to learn more about our hiring process, you can find support here: https://hiring.amazon.com/hiring-process# /.Please note that if you already have an active application but are looking to switch to a different site, instead of applying for a new role, you can reach out to Application Help at https://hiring.amazon.com/contact-us#/ for next steps.If you have a disability and need an accommodation during the application and hiring process, including support for the New Hire Event, or need to initiate a request prior to starting your Day 1, please visit https://hiring.amazon.com/people-with-disabilities#/ or contact the Applicant-Candidate Accommodation Team (ACAT). You can reach us by phone at 888-435-9287, Monday through Friday, between 6 a.m. and 4 p.m. PT.Equal EmploymentAmazon is committed to a diverse and inclusive workplace. Amazon is an equal opportunity employer and does not discriminate on the basis of race, national origin, gender, gender identity, sexual orientation, protected veteran status, disability, age, or other legally protected status.",
  "Numbers & Facts": {
    "Location": "Revere, MA",
    "Job Type": "Temporary, Part-time",
    "Industry": "Transport and Storage - Materials",
    "Salary": "$18.50 Per Hour",
    "Company Size": "10,000 employees or more",
    "Year Founded": "1994",
    "Website": "http://Amazon.com/militaryroles"
  },
  "About Company": "At Amazon, we don\u2019t wait for the next big idea to present itself. We envision the shape of impossible things and then we boldly make them reality. So far, this mindset has helped us achieve some incredible things. Let\u2019s build new systems, challenge the status quo, and design the world we want to live in. We believe the work you do here will be the best work of your life. \nWherever you are in your career exploration, Amazon likely has an opportunity for you. Our research scientists and engineers shape the future of natural language understanding with Alexa. Fulfillment center associates around the globe send customer orders from our warehouses to doorsteps. Product managers set feature requirements, strategy, and marketing messages for brand new customer experiences. And as we grow, we\u2019ll add jobs that haven\u2019t been invented yet. \nIt\u2019s Always Day 1 \nAt Amazon, it\u2019s always \u201cDay 1.\u201d Now, what does this mean and why does it matter? It means that our approach remains the same as it was on Amazon\u2019s very first day \u2013 to make smart, fast decisions, stay nimble, invent, and stay focused on delighting our customers. In our 2016 shareholder letter, Amazon CEO Jeff Bezos shared his thoughts on how to keep up a Day 1 company mindset. \u201cStaying in Day 1 requires you to experiment patiently, accept failures, plant seeds, protect saplings, and double down when you see customer delight,\u201d he wrote. \u201cA customer-obsessed culture best creates the conditions where all of that can happen.\u201d You can read the full letter here \nOur Leadership Principles\nOur Leadership Principles help us keep a Day 1 mentality. They aren\u2019t just a pretty inspirational wall hanging. Amazonians use them, every day, whether they\u2019re discussing ideas for new projects, deciding on the best solution for a customer\u2019s problem, or interviewing candidates. 
To read through our Leadership Principles from Customer Obsession to Bias for Action, visit https://www.amazon.jobs/principles"
}
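
The listing scraper and the job page scraper can be chained: each "Job Link" collected from the search results is passed to the job page scraper. The following is a minimal sketch of that idea, assuming listings in the format shown earlier. The function name scrape_all_job_pages, its scrape_fn parameter, and the delay value are illustrative and not part of the Crawlbase library.

```python
import time

def scrape_all_job_pages(jobs, scrape_fn, delay=1.0):
    """Fetch details for every listing that has a 'Job Link'.

    jobs      -- list of dicts as returned by the listing scraper
    scrape_fn -- callable taking a URL and returning a dict or None
                 (for example, scrape_job_page from the code above)
    delay     -- pause in seconds between requests (assumed value)
    """
    detailed_jobs = []
    for job in jobs:
        link = job.get('Job Link')
        if not link:
            continue  # skip listings without a usable link
        details = scrape_fn(link)
        if details:
            # Merge the listing fields with the detail fields
            detailed_jobs.append({**job, **details})
        time.sleep(delay)  # avoid sending requests in a tight loop
    return detailed_jobs
```

Passing the scraper in as a parameter keeps this helper independent of any particular API client, so it can be tried out with a stub function before spending real requests.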

Scrape Monster Jobs with Crawlbase

Extracting job postings from Monster.com can revolutionize your approach to gathering employment data. This guide walked you through setting up a Python environment, building a Monster.com page scraper, handling scroll-based pagination, and storing your data.

For sites like Monster.com that rely on JavaScript, tools such as the Crawlbase Crawling API prove invaluable. You can tailor this method to suit your requirements, whether you’re building a personal job-tracking system or amassing data for a large-scale project. Keep expanding your knowledge and remember that effective web scraping hinges on selecting suitable tools and techniques.

To broaden your web scraping skills, check out our other guides on extracting data from key websites.

📜 How to Scrape Bloomberg
📜 How to Scrape Wikipedia
📜 How to Scrape Google Finance
📜 How to Scrape Google News
📜 How to Scrape Clutch.co

If you have any questions or feedback, our support team is always available to assist you on your web scraping journey. Happy Scraping!

Frequently Asked Questions

Q. Can I scrape job listings from Monster.com using basic Python libraries?

You can try to scrape Monster.com using basic Python libraries like requests and BeautifulSoup. However, the site relies on JavaScript to display content, so simple scraping methods often fail to retrieve the job listings. To handle JavaScript rendering and dynamic content, we suggest using the Crawlbase Crawling API.

Q. How do I handle pagination while scraping Monster.com?

Monster.com loads more jobs as you scroll down the page. This is called scroll-based pagination. To handle this, you can use the scroll and scroll_interval parameters in the Crawlbase Crawling API. This method makes sure your scraper acts like a real user scrolling and gets as many job listings as possible.

Q. Is scraping Monster.com legal?

You need to check the site’s terms of service to be certain about what you can scrape legally. It’s also important to scrape responsibly by respecting robots.txt rules and avoiding excessive requests that could strain their servers.
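
As a starting point for respecting robots.txt, Python’s standard library includes urllib.robotparser. The sketch below checks whether a URL is allowed under a given set of rules. The rules, the user agent name, and the example.com URLs are made up for illustration; in practice you would load the real file from the target site.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical rules for illustration only
sample_rules = """
User-agent: *
Disallow: /private/
"""

print(is_allowed(sample_rules, 'my-scraper', 'https://example.com/jobs/search'))   # True
print(is_allowed(sample_rules, 'my-scraper', 'https://example.com/private/data'))  # False
```

Note that robots.txt covers only crawling rules; it does not replace reading the site’s terms of service.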

What is AI Data Extraction, And How Does It Work?

2024-08-23T14:25:50.000Z

Intelligence is no longer limited to human reasoning, as more businesses and individuals rely on artificial intelligence and machine learning to make reliable decisions. Recent studies by Forbes reveal that over 60% of business owners say AI will increase productivity. Increasingly, professionals use these systems to predict potential outcomes and increase accuracy.

The world of web scraping has also seen growing use of AI data extraction. Scrapers now use AI solutions to complete all kinds of scraping activities. For instance, Crawlbase’s Smart Proxy relies on AI to guarantee fast and accurate extraction results.

This article will look into the fundamentals of AI data extraction, how it works, and how your business can leverage it for your web scraping needs.

What is AI Data Extraction?

Artificial intelligence data extraction is the process of automating information retrieval from multiple sources to save time and reduce errors. Without the need for human interaction, an AI-powered data extraction tool can identify and extract data such as phone numbers, addresses, or names from different fields in documents. This is made possible by AI’s use of machine learning and natural language processing to gather, process, and analyze data to extract valuable information.

Traditional Data Extraction vs AI-powered Methods

Traditionally, information extraction was done through spreadsheets or the old-fashioned way, with pen and paper. This approach usually requires a lot of resources and is error-prone. Most of the time, extracting data from documents manually is difficult because limited computing resources make it hard to deliver optimal results.

Automated data extraction, on the other hand, ensures that each data field is scraped accurately and in a timely manner, eliminating redundancies. Furthermore, artificial intelligence is capable of scraping data from various unstructured sources, including chats, emails, and more.

How AI Data Extraction Works

Artificial intelligence mimics human behavior on the internet, making it easier to extract data from several sources without being flagged. In the past, people would peruse a website and then manually transfer its content into the appropriate computer file. AI extraction software gathers data through a number of procedures and enhances the quality of scanned images or text.

Here’s an overview of how AI data extraction works:

Data Collection

This involves collecting data from a wide range of sources, including structured, unstructured, and semi-structured ones. The source type can determine how the data is presented. At this stage, information is cleaned to remove errors and inconsistencies. After that, the data is formatted into content types that the system can easily understand and extract from.

Data Analysis

This is where the action takes place. First, raw data is transformed into numerical values that machine learning models can understand. This data is then fed into ML models, which have been pre-trained on vast datasets to recognize patterns. Each model is evaluated based on performance to ensure accuracy and reliability.
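
To make the first step concrete, here is a minimal sketch of turning raw text into numerical values using a simple bag-of-words count. This is only one of many possible encodings and is not necessarily what any particular AI extraction tool uses. The function names and sample documents are illustrative.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocabulary(documents):
    """Map every distinct token to a fixed column index (sorted for stability)."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def vectorize(text, vocabulary):
    """Return a list of token counts, one entry per vocabulary word."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for tok, idx in vocabulary.items():
        vector[idx] = counts.get(tok, 0)
    return vector

# Made-up sample documents
documents = ["Invoice total: 120 USD", "Invoice date: 2024-08-23"]
vocabulary = build_vocabulary(documents)
vectors = [vectorize(doc, vocabulary) for doc in documents]
```

Each document becomes a fixed-length list of integers, which is the kind of input a machine learning model can consume. Real systems typically use richer representations such as learned embeddings.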

Data Extraction

At this stage, the model is ready to extract from the datasets. AI analyzes the desired information based on the identified patterns and pulls the data points. Lastly, the extracted data goes through quality checks to ensure data integrity.

Benefits of AI Data Extraction

Artificial intelligence ensures the reliability and accuracy of your data in general. Here are some advantages of AI data extraction:

Ability to handle large volumes of data: AI data extraction tools can efficiently manage the gathering of information from several sources in a matter of minutes, increasing the pace of extraction. Also, they are able to adapt to the ever-changing web pages with little or no human intervention.
Scalability: Since it can handle large volumes of data, it saves time and effort that could be used to focus on other innovative activities. Businesses can scale the resources devoted to extracting information up or down by adjusting the parameters.
Data accuracy and consistency: Through deep learning, AI data scrapers are trained to perform extraction tasks, which ensures high degrees of accuracy. Compared to manual data extraction methods, these systems produce consistent results.
Maximizes synergistic workflow: AI data extraction maximizes team accessibility. As a result, team members from anywhere may access data and submit reports. AI data extraction enables dynamic collaboration without requiring physical proximity. Additionally, by maximizing the benefits of an AI RFP, organizations can ensure that the chosen solution integrates seamlessly with existing workflows and enhances overall efficiency.

Legal and Privacy Concerns of AI Data Extraction

Although AI is an excellent choice for data extraction, there are concerns about how AI systems handle and manage data. Since most AI data scrapers are third-party tools integrated solely for extraction, there are gray areas around whether sensitive information is being exposed.

To mitigate this, it is best to pick data scrapers that comply with privacy rules like GDPR and CCPA. You can also implement internal policies to ensure proper use of data within your organization.

Applications of AI Data Extraction

Artificial intelligence is transforming industries through its efficiency and reliability. Here are a few real-world applications for AI data extraction:

Finance

AI-driven data extraction has revolutionized the financial services sector, especially in the area of fraud detection. AI technologies support fraud prevention efforts by quickly identifying fraudulent activity and trends by closely examining real-time transaction data. Financial organizations have avoided possible losses of millions of dollars because of this priceless technology. Additionally, some AI models in finance utilize high-performance hardware like H100 GPUs to accelerate data processing and model training for real-time analytics. Personalized services are also facilitated by AI-driven data extraction. Financial institutions increase client happiness and loyalty by providing customized financial planning and investment advice based on the analysis of customer data.

Healthcare

AI-powered data extraction is essential in the field of healthcare. Healthcare providers improve diagnostic and treatment results by revealing patterns from vast amounts of patient data. The capacity of AI to evaluate medical images such as MRIs, CT scans, and X-rays is a prominent example. These systems identify minute irregularities, facilitating quicker and more precise diagnosis and, eventually, enhancing patient welfare.

Web Scraping

Ultimately, the essence of data extraction with AI tools is to get information from other websites for your business growth. Websites are the primary source for scraping, and AI makes sure it’s done accurately. Crawlbase’s Crawling API integrates with your existing system easily, providing you with a smooth web scraping experience. To optimize your web scraping process, tools like our Smart Proxy alter the IP addresses of every request to maximize the effectiveness of data extraction.

Use Crawlbase’s Smart Proxy to Optimize your Data Extraction

Smart Proxy uses advanced artificial intelligence to allocate your connection requests to a randomly rotating IP address in a pool of proxies before reaching the target website. You can rely on its millions of residential and data center proxies.

Smart Proxy combines machine learning and artificial intelligence to circumvent CAPTCHAs and blocks, making it more successful than a standard proxy at avoiding bans. It also enables you to connect to a proxy network several times using a single node. The main advantage of this kind of proxy pool is that you can remain anonymous and make many more requests without getting blocked than you could with a single proxy.
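
The general idea of a rotating proxy pool can be sketched in a few lines of Python. This illustrates the concept only and is not how Smart Proxy is implemented. The proxy addresses are placeholders from a documentation-only IP range, and the function name is hypothetical.

```python
import random

# Placeholder proxy addresses (documentation-only IP range), not real endpoints
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

def pick_proxy(pool, rng=random):
    """Choose one proxy at random and return it in the mapping format
    that HTTP clients such as requests accept for their `proxies` argument."""
    proxy = rng.choice(pool)
    return {"http": proxy, "https": proxy}

# Each call may return a different proxy, so consecutive requests
# do not all originate from the same IP address.
for _ in range(3):
    print(pick_proxy(PROXY_POOL))
```

A managed service handles this selection for you, along with retiring blocked addresses, which a hand-rolled pool like this one does not do.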

Groupon Scraper: Find the Hottest Deals and Coupons with Python

2024-08-20T10:00:00.000Z

If you’re looking for great deals on products, experiences, and coupons, Groupon is a top platform. With millions of active users and thousands of daily deals, Groupon helps people save money while enjoying activities like dining, travel, and shopping. By scraping Groupon, you can access valuable data on these deals, helping you stay updated on the latest offers or even build your own deal-tracking application.

In this blog, we’ll explore how to build a powerful Groupon Scraper in Python to find the hottest deals and coupons. Given that Groupon uses JavaScript to dynamically render its content, simple scraping methods won’t work efficiently. To handle this, we’ll leverage the Crawlbase Crawling API, which seamlessly deals with JavaScript rendering and other challenges.

Let’s dive in and learn how to Scrape Groupon for deals and coupons, step by step.

Why Scrape Groupon Deals and Coupons?
Key Data Points to Extract from Groupon
Crawlbase Crawling API for Groupon Scraping

Why Use the Crawlbase Crawling API?
Crawlbase Python Library

Setting Up Your Python Environment

Installing Python
Setting Up a Virtual Environment
Installing Required Libraries
Choosing the Right IDE

Scraping Groupon Deals

Understanding Groupon’s Website Structure
Writing the Groupon Scraper
Handling Pagination
Storing Data in a JSON File
Complete Code Example

Scraping Groupon Coupons

Inspecting the HTML Structure
Writing the Groupon Coupon Scraper
Storing Data in a JSON File
Complete Code Example

Final Thoughts
Frequently Asked Questions

Why Scrape Groupon Deals and Coupons?

Scraping Groupon deals and coupons helps you keep track of the newest discounts and offers. Groupon posts many deals each day, making it hard to check them all by hand. A good Groupon Scraper does this job for you, gathering and studying offers in areas like food, travel, electronics, and more.

Through Groupon Scraping, you can pull out essential info such as what the deal is, how much it costs, how big the discount is, and when it ends. This has benefits for businesses that want to watch what their rivals offer, developers creating a site that lists deals, or anyone who just wants to find the best bargains.

We aim to scrape Groupon deals and coupons efficiently, pulling out all the essential info while tackling issues like dynamically loaded content. Because Groupon relies on JavaScript to show its content, regular scraping methods struggle to get the data. This is where our solution, powered by the Crawlbase Crawling API, comes in handy. It lets us collect deals without breaking a sweat by getting around these common roadblocks.

In the following parts, we’ll look at the key pieces of info to pull from Groupon and get our setup ready for a smooth data collection process.

Key Data Points to Extract from Groupon

When you’re using a Groupon Scraper, you need to pinpoint the critical data that makes your scraping work count. Groupon has tons of deals in different categories, and pulling out the correct info can help you get the most from your scraping project. Here’s what you should focus on when you scrape Groupon:

Deal Titles: The name or title of a deal grabs attention first. It gives a quick idea of what’s on offer.
Deal Descriptions: In-depth descriptions offer more details about the product or service, helping people understand what the offer includes.
Original and Discounted Prices: These play a crucial role in understanding the available savings. By getting both the original price and the discounted price, you can work out the percentage of savings.
Discount Percentage: Many Groupon deals show the percentage of discounts right away. Getting this data point saves you time in figuring out the savings yourself.
Deal Expiry Date: Knowing when a deal ends helps to filter out old offers. Getting the expiry date makes sure you look at active deals.
Deal Location: Certain offers apply to specific areas. Getting location info lets you sort deals by region, which helps a lot with local marketing efforts.
Deal Category: Groupon puts deals into groups like food, travel, electronics, and so on. Grabbing category details makes it simple to break down the deals for study or display.
Ratings and Reviews: What customers say and how they score deals shows how popular and trustworthy an offer is. This info proves helpful in judging the quality of deals.
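
When a deal lists only the original and discounted prices, the saving can be computed directly. Here is a small sketch. The function name and the sample prices are illustrative.

```python
def discount_percentage(original_price, discounted_price):
    """Return the saving as a percentage of the original price, rounded to 1 decimal."""
    if original_price <= 0:
        raise ValueError("original_price must be positive")
    saving = original_price - discounted_price
    return round(saving / original_price * 100, 1)

print(discount_percentage(80.0, 52.0))  # 35.0
```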

By zeroing in on these key bits of data, you can make sure your Groupon Scraping gives you info you can use and that matters. The next parts will show you how to set up your tools and build a scraper that can pull deals from Groupon efficiently.

Crawlbase Crawling API for Groupon Scraping

Working on a Groupon Scraper project can be tough when you need to deal with dynamic content loaded by JavaScript. Groupon’s website uses a lot of JavaScript to show deals and offers, so simple HTTP requests are not enough to get the data you want. This is where the Crawlbase Crawling API comes in handy. It helps you avoid these issues and extract data from Groupon without running into problems with JavaScript loading, CAPTCHAs, or IP blocking.

Why Use the Crawlbase Crawling API?

Handle JavaScript Rendering: The biggest hurdle when you grab deals from Groupon is handling content that JavaScript creates. Crawlbase’s API takes care of JavaScript rendering, which allows you to pull the data from the fully loaded page.
Avoid IP Blocking and CAPTCHAs: If you scrape too much, Groupon might stop your IP or throw up CAPTCHAs. Crawlbase changes IPs on its own and beats CAPTCHAs, so you can keep pulling Groupon data non-stop.
Easy Integration: You can add the Crawlbase Crawling API to your Python code without much trouble. This lets you focus on getting the data you need, while the API handles the tricky stuff in the background.
Scalable Scraping: Crawlbase offers flexible choices to handle Groupon scraping projects of any size. You can use it to gather small datasets or to carry out large-scale data collection efforts.

Crawlbase Python Library

Crawlbase offers its own Python library to help its customers. You need an access token to authenticate when you use it. You can get this token after you create an account.

Here’s an example function that shows how to use the Crawling API from the Crawlbase library to send requests.

from crawlbase import CrawlingAPI

crawling_api = CrawlingAPI({ 'token': 'YOUR_CRAWLBASE_TOKEN' })

def make_crawlbase_request(url):
  response = crawling_api.get(url)

  if response['headers']['pc_status'] == '200':
    html_content = response['body'].decode('utf-8')
    return html_content
  else:
    print(f"Failed to fetch the page. Crawlbase status code: {response['headers']['pc_status']}")
    return None

Note: Crawlbase offers two token types: a Normal Token for static sites and a JavaScript (JS) Token for dynamic or browser-based requests. For Groupon, you’ll need a JS Token. You can start with 1,000 free requests, no credit card needed. Check out the Crawlbase Crawling API docs here.
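Requests can fail for temporary reasons, so it may help to wrap the request function in a small retry loop. The sketch below is an assumption on our part rather than a Crawlbase feature: fetch_with_retries is a hypothetical helper that accepts any function returning the page HTML or None, which is how make_crawlbase_request above behaves.

```python
import time

def fetch_with_retries(fetch, url, attempts=3, delay_seconds=2.0):
    """Call fetch(url) until it returns a non-None result or attempts run out."""
    for attempt in range(1, attempts + 1):
        result = fetch(url)
        if result is not None:
            return result
        if attempt < attempts:
            # Wait a little longer after each failed attempt.
            time.sleep(delay_seconds * attempt)
    return None
```

With the function from this section, the call would look like fetch_with_retries(make_crawlbase_request, url). Keep in mind that every retry counts as another API request.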

Next up, we’ll walk you through setting up Python and building Groupon scrapers that use the Crawlbase Crawling API to handle JavaScript and other scraping challenges. Let’s jump into the setup process.

Setting Up Your Python Environment

Before we start writing the Groupon scraper, we need to create a solid Python setup. Follow the steps below.

Installing Python

First, you’ll need Python on your computer to scrape Groupon. You can get the newest version of Python from python.org.

Setting Up a Virtual Environment

We suggest using a virtual environment to keep different projects from clashing. To make a virtual environment, run these commands:

# Create a virtual environment
python -m venv groupon_env

# Activate the virtual environment
# On Windows:
groupon_env\Scripts\activate

# On macOS/Linux:
source groupon_env/bin/activate

This keeps your project’s dependencies separate and makes them easier to manage.

Installing Required Libraries

Now, install the required libraries inside the virtual environment:

pip install crawlbase beautifulsoup4

Here’s a brief overview of each library:

crawlbase: The main library for sending requests using the Crawlbase Crawling API, which handles JavaScript rendering needed to scrape Groupon.
json: Python’s built-in module (no installation needed), used in this guide to store the scraped data.
beautifulsoup4: To parse and navigate through the HTML structure of Groupon pages.

Choosing the Right IDE

You can write your code in any text editor, but using an Integrated Development Environment (IDE) can make coding easier. Some popular IDEs include VS Code, PyCharm, and Jupyter Notebook. These tools have features that help you code better, like highlighting syntax, completing code, and finding bugs. These features come in handy when you’re building a Groupon Scraper.

Now that you’ve set up your environment and have your tools ready, you can start writing the scraper. In the next section, we’ll create a Groupon deals scraper.

Scraping Groupon Deals

In this part, we’ll explain how to get deals from Groupon with Python and the Crawlbase Crawling API. Groupon uses JavaScript rendering and scroll-based pagination, so simple scraping methods don’t work. We’ll use Crawlbase’s Crawling API, which handles both JavaScript rendering and scroll pagination.

The URL we’ll scrape is: https://www.groupon.com/local/washington-dc

Inspecting the HTML Structure

Before writing the code, it’s crucial to inspect the HTML structure of Groupon’s deals page. This helps you determine the correct CSS selectors needed to extract the data.

Visit the URL: Open the URL in your browser.
Open Developer Tools: Right-click and select “Inspect” to open Developer Tools.

Identify Key Elements: Groupon deal listings are typically found within <div> elements with the class cui-content. Each deal has the following details:

Title: Found within a <div> tag with the class cui-udc-title.
Link: The link is contained within the href attribute of the <a> tag.
Original Price: Displayed in a <div> with the class cui-price-original.
Discount Price: Displayed in a <div> with the class cui-price-discount.
Location: Optional, usually in a <span> with the class cui-location-name.

Writing the Groupon Scraper

We’ll begin by coding a simple function to get the deal info from the page. We’ll use the Crawlbase Crawling API to handle dynamic content loading because Groupon relies on JavaScript for rendering.

Here’s the code:

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import json

# Initialize CrawlingAPI with your access token
crawling_api = CrawlingAPI({'token': 'CRAWLBASE_TOKEN'})

def scrape_groupon_with_pagination(base_url):
    options = {
        'ajax_wait' : 'true',
        'page_wait': '5000'
    }

    response = crawling_api.get(base_url, options)
    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')

        soup = BeautifulSoup(html_content, 'html.parser')
        deals = soup.find_all('div', class_='cui-content')
        all_deals = []

        for deal in deals:
            title = deal.find('div', class_='cui-udc-title').text.strip() if deal.find('div', class_='cui-udc-title') else ''
            link = deal.find('a')['href'] if deal.find('a') else ''
            original_price = deal.find('div', class_='cui-price-original').text.strip() if deal.find('div', class_='cui-price-original') else ''
            discounted_price = deal.find('div', class_='cui-price-discount').text.strip().encode("ascii", "ignore").decode("utf-8") if deal.find('div', class_='cui-price-discount') else ''
            location = deal.find('span', class_='cui-location-name').text.strip() if deal.find('span', class_='cui-location-name') else ''

            all_deals.append({
                'title': title,
                'original_price': original_price,
                'discounted_price': discounted_price,
                'link': link,
                'location': location
            })

        return all_deals
    else:
        print(f"Failed to fetch data. Status code: {response['headers']['pc_status']}")
        return None

Groupon uses scroll-based pagination to load additional deals dynamically. To capture all the deals, we’ll leverage the scroll and scroll_interval options in the Crawlbase Crawling API.

scroll=true: Enables scroll-based pagination.
scroll_interval=10: Sets the scroll time to 10 seconds (60 max allowed).

Here’s how you can integrate it:

def scrape_groupon_with_pagination(url):
    options = {
        'ajax_wait' : 'true',
        'scroll': 'true',
        'scroll_interval': '10'
    }

    response = crawling_api.get(url, options)
    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')

        soup = BeautifulSoup(html_content, 'html.parser')
        deals = soup.find_all('div', class_='cui-content')
        all_deals = []

        for deal in deals:
            title = deal.find('div', class_='cui-udc-title').text.strip() if deal.find('div', class_='cui-udc-title') else ''
            link = deal.find('a')['href'] if deal.find('a') else ''
            original_price = deal.find('div', class_='cui-price-original').text.strip() if deal.find('div', class_='cui-price-original') else ''
            discounted_price = deal.find('div', class_='cui-price-discount').text.strip().encode("ascii", "ignore").decode("utf-8") if deal.find('div', class_='cui-price-discount') else ''
            location = deal.find('span', class_='cui-location-name').text.strip() if deal.find('span', class_='cui-location-name') else ''

            all_deals.append({
                'title': title,
                'original_price': original_price,
                'discounted_price': discounted_price,
                'link': link,
                'location': location
            })

        return all_deals
    else:
        print(f"Failed to fetch data. Status code: {response['headers']['pc_status']}")
        return None

In this function, we’ve added scroll-based pagination handling using Crawlbase’s options, so that as many of the available deals as possible are captured.
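Because the scroll interval is capped at 60 seconds, it can be convenient to build the options dictionary through a small helper that rejects invalid values. This is only a sketch: build_scroll_options is a hypothetical helper, and the lower bound of 1 second is our assumption, not a documented limit.

```python
def build_scroll_options(scroll_interval=10, max_interval=60):
    """Build the options dict for a scrolled Crawling API request."""
    # The 60-second ceiling comes from the text above; the floor of 1 is assumed.
    if not 1 <= scroll_interval <= max_interval:
        raise ValueError(
            f"scroll_interval must be between 1 and {max_interval} seconds"
        )
    return {
        'ajax_wait': 'true',
        'scroll': 'true',
        'scroll_interval': str(scroll_interval),
    }

print(build_scroll_options(10))
```

The returned dictionary can be passed as the second argument to crawling_api.get(url, options), just like the hand-written options in the function above.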

Storing Data in a JSON File

Once you’ve collected the data, it’s easy to store it in a JSON file:

import json

def save_to_json(data, filename='groupon_deals.json'):
    with open(filename, 'w') as file:
        json.dump(data, file, indent=4)
    print(f"Data saved to {filename}")

# Example usage after scraping
if deals:
    save_to_json(deals)

Complete Code Example

Here’s the full code combining everything discussed:

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import json

# Initialize CrawlingAPI with your access token
crawling_api = CrawlingAPI({'token': 'CRAWLBASE_TOKEN'})

def scrape_groupon_with_pagination(url):
    options = {
        'ajax_wait' : 'true',
        'scroll': 'true',
        'scroll_interval': '60'
    }

    response = crawling_api.get(url, options)
    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')

        soup = BeautifulSoup(html_content, 'html.parser')
        deals = soup.find_all('div', class_='cui-content')
        all_deals = []

        for deal in deals:
            title = deal.find('div', class_='cui-udc-title').text.strip() if deal.find('div', class_='cui-udc-title') else ''
            link = deal.find('a')['href'] if deal.find('a') else ''
            original_price = deal.find('div', class_='cui-price-original').text.strip() if deal.find('div', class_='cui-price-original') else ''
            discounted_price = deal.find('div', class_='cui-price-discount').text.strip().encode("ascii", "ignore").decode("utf-8") if deal.find('div', class_='cui-price-discount') else ''
            location = deal.find('span', class_='cui-location-name').text.strip() if deal.find('span', class_='cui-location-name') else ''

            all_deals.append({
                'title': title,
                'original_price': original_price,
                'discounted_price': discounted_price,
                'link': link,
                'location': location
            })

        return all_deals
    else:
        print(f"Failed to fetch data. Status code: {response['headers']['pc_status']}")
        return None

def save_to_json(data, filename='groupon_deals.json'):
    with open(filename, 'w') as file:
        json.dump(data, file, indent=4)
    print(f"Data saved to {filename}")

if __name__ == "__main__":
    url = 'https://www.groupon.com/local/washington-dc'
    deals = scrape_groupon_with_pagination(url)

    if deals:
        save_to_json(deals)

Test the Scraper:

Create a new file named groupon_deals_scraper.py, copy the code above into it, and save it. Run the script using the following command:

python groupon_deals_scraper.py

You should see output similar to the example below in the groupon_deals.json file.

[
    {
        "title": "Chimney Pro",
        "original_price": "$400",
        "discounted_price": "$69",
        "link": "https://www.groupon.com/deals/chimney-pro-1-6",
        "location": ""
    },
    {
        "title": "Spa World",
        "original_price": "$40",
        "discounted_price": "$35",
        "link": "https://www.groupon.com/deals/spa-world-26",
        "location": "Centreville"
    },
    {
        "title": "Kings Dominion",
        "original_price": "$79.99",
        "discounted_price": "$42.99",
        "link": "https://www.groupon.com/deals/gl-kings-dominion-amusement-park",
        "location": "Kings Dominion"
    },
    {
        "title": "30% Off First 5 Weeks + Free Shipping (Blue Apron Coupon)",
        "original_price": "",
        "discounted_price": "",
        "link": "https://www.groupon.com/deals/cpn-blueapron-q3sl",
        "location": ""
    },
    {
        "title": "Valvoline Instant Oil Change - VA",
        "original_price": "$50.99",
        "discounted_price": "$39.99",
        "link": "https://www.groupon.com/deals/valvoline-instant-oil-change-dc-4",
        "location": "Multiple Locations"
    },
    .... more
]
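As the sample output shows, some entries (such as coupon-style listings) come back with empty prices. If you plan to analyze prices, a small clean-up step can help. The sketch below is a hypothetical helper, clean_deals, that keeps only deals with both prices and also drops repeated links, on the assumption that a scrolled page may list the same deal more than once.

```python
def clean_deals(deals):
    """Keep deals that have both prices, skipping repeated or missing links."""
    seen_links = set()
    cleaned = []
    for deal in deals:
        link = deal.get('link', '')
        if not link or link in seen_links:
            continue  # no link to identify the deal, or already processed
        seen_links.add(link)
        if deal.get('original_price') and deal.get('discounted_price'):
            cleaned.append(deal)
    return cleaned
```

You could call it right after scraping, for example deals = clean_deals(deals), before saving the result to JSON.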

Scraping Groupon Coupons

In this part, we’ll learn how to get coupons from Groupon with Python and the Crawlbase Crawling API. Groupon’s coupon page looks a bit different from its deals page, so we need to inspect its HTML structure separately. We’ll use the Crawlbase API to get coupon titles, callouts, descriptions, and coupon types.

We’ll scrape this URL: https://www.groupon.com/coupons/amazon

Inspecting the HTML Structure

To scrape Groupon coupons effectively, it’s essential to identify the key HTML elements that hold the data:

Visit the URL: Open the URL in your browser.

Open Developer Tools: Right-click on the webpage and choose “Inspect” to open the Developer Tools.

Locate the Coupon Containers: Groupon’s coupon listings are usually within <div> tags with the class coupon-offer-tile. Each coupon block contains:

Title: Found inside an <h2> element with the class coupon-tile-title.
Callout: Found within the <div> element with the class coupon-tile-callout.
Description: Usually found in a <p> with the class coupon-tile-description.
Coupon Type: Found inside a <span> tag with the class coupon-tile-type.

Writing the Groupon Coupon Scraper

We’ll write a function that uses the Crawlbase Crawling API to handle dynamic content rendering while scraping the coupon data. Here’s the implementation:

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import json

# Initialize the Crawlbase CrawlingAPI with your access token
crawling_api = CrawlingAPI({'token': 'YOUR_CRAWLBASE_TOKEN'})

def scrape_groupon_coupons(url):
    options = {
        'ajax_wait' : 'true',
        'page_wait': '5000'
    }

    response = crawling_api.get(url, options)

    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')

        soup = BeautifulSoup(html_content, 'html.parser')
        coupons = soup.select('li.coupons-list-row > div.coupon-offer-tile')

        scraped_coupons = []
        for coupon in coupons:
            title = coupon.find('h2', class_='coupon-tile-title').text.strip().encode("ascii", "ignore").decode("utf-8") if coupon.find('h2', class_='coupon-tile-title') else ''
            callout = coupon.find('div', class_='coupon-tile-callout').text.strip() if coupon.find('div', class_='coupon-tile-callout') else ''
            description = coupon.find('p', class_='coupon-tile-description').text.strip().encode("ascii", "ignore").decode("utf-8") if coupon.find('p', class_='coupon-tile-description') else ''
            # Named coupon_type to avoid shadowing Python's built-in type()
            coupon_type = coupon.find('span', class_='coupon-tile-type').text.strip() if coupon.find('span', class_='coupon-tile-type') else ''

            scraped_coupons.append({
                'title': title,
                'callout': callout,
                'description': description,
                'type': coupon_type
            })

        return scraped_coupons
    else:
        print(f"Failed to retrieve data. Status code: {response['headers']['pc_status']}")
        return None

Storing Data in a JSON File

Once you have the coupon data, you can store it in a JSON file for easy access and analysis:

def save_coupons_to_json(data, filename='groupon_coupons.json'):
    with open(filename, 'w') as file:
        json.dump(data, file, indent=4)
    print(f"Coupon data saved to {filename}")

Complete Code Example

Here is the complete code for scraping Groupon coupons:

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import json

# Initialize the Crawlbase CrawlingAPI with your access token
crawling_api = CrawlingAPI({'token': 'YOUR_CRAWLBASE_TOKEN'})

def scrape_groupon_coupons(url):
    options = {
        'ajax_wait' : 'true',
        'page_wait': '5000'
    }

    response = crawling_api.get(url, options)

    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')

        soup = BeautifulSoup(html_content, 'html.parser')
        coupons = soup.select('li.coupons-list-row > div.coupon-offer-tile')

        scraped_coupons = []
        for coupon in coupons:
            title = coupon.find('h2', class_='coupon-tile-title').text.strip().encode("ascii", "ignore").decode("utf-8") if coupon.find('h2', class_='coupon-tile-title') else ''
            callout = coupon.find('div', class_='coupon-tile-callout').text.strip() if coupon.find('div', class_='coupon-tile-callout') else ''
            description = coupon.find('p', class_='coupon-tile-description').text.strip().encode("ascii", "ignore").decode("utf-8") if coupon.find('p', class_='coupon-tile-description') else ''
            # Named coupon_type to avoid shadowing Python's built-in type()
            coupon_type = coupon.find('span', class_='coupon-tile-type').text.strip() if coupon.find('span', class_='coupon-tile-type') else ''

            scraped_coupons.append({
                'title': title,
                'callout': callout,
                'description': description,
                'type': coupon_type
            })

        return scraped_coupons
    else:
        print(f"Failed to retrieve data. Status code: {response['headers']['pc_status']}")
        return None

def save_coupons_to_json(data, filename='groupon_coupons.json'):
    with open(filename, 'w') as file:
        json.dump(data, file, indent=4)
    print(f"Coupon data saved to {filename}")

if __name__ == "__main__":
    url = 'https://www.groupon.com/coupons/amazon'
    coupons = scrape_groupon_coupons(url)

    if coupons:
        save_coupons_to_json(coupons)

Test the Scraper:

Save the code to a file named groupon_coupons_scraper.py. Run the script using the following command:

python groupon_coupons_scraper.py

After running the script, you should find the coupon data saved in a JSON file named groupon_coupons.json.

[
    {
        "title": "Amazon Promo Code",
        "callout": "Promo Code",
        "description": "Click here and save with coupons and promo codes on household, beauty, and, well, EVERYTHING else Amazon sells",
        "type": "Coupon Code"
    },
    {
        "title": "Amazon Prime Exclusive Promo Codes",
        "callout": "Up to 80% Off",
        "description": "A ton of Amazon promo codes, coupons, and more are right this way. New deals added daily!",
        "type": "Coupon Code"
    },
    {
        "title": "Up to 65% OFF Amazon Promo Code",
        "callout": "Up to 65% Off",
        "description": "Save up to 65% on daily  deals. Click here for the most up-to-date listings and availability.",
        "type": "Coupon Code"
    },
    {
        "title": "Spend $50, Save 15%",
        "callout": "15% Off",
        "description": "This one is dead simple. Spend $50 on Amazon products and take 15% off your purchase total. No promo code required!",
        "type": "Promo"
    },
    {
        "title": " & above | Amazon Promo & Coupon Codes",
        "callout": "Promo Code",
        "description": "A big bunch of vetted and accurate promo codes for a huge variety of top-rated products. This is the real deal!",
        "type": "Coupon Code"
    },
    .... more
]
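Once the coupons are in this structure, a quick summary by coupon type is easy to produce. The following sketch uses a hypothetical helper, count_coupon_types, and assumes each coupon is a dictionary with a type key as in the output above.

```python
from collections import Counter

def count_coupon_types(coupons):
    """Count coupons per type, grouping empty values under 'Unknown'."""
    return Counter(coupon.get('type') or 'Unknown' for coupon in coupons)

sample_coupons = [
    {'title': 'Amazon Promo Code', 'type': 'Coupon Code'},
    {'title': 'Spend $50, Save 15%', 'type': 'Promo'},
    {'title': 'Amazon Prime Exclusive Promo Codes', 'type': 'Coupon Code'},
]
print(count_coupon_types(sample_coupons).most_common())
```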

Final Thoughts

Building a Groupon scraper helps you stay in the loop about the best deals and coupons. Python and the Crawlbase Crawling API let you scrape Groupon pages without much trouble. You can handle dynamic content and pull out useful data.

This guide showed you how to set up your environment, write the Groupon deals and coupons scraper, deal with pagination, and save your data. A well-designed Groupon scraper can automate the process if you want to track deals in a specific place or find the newest coupons.

If you’re looking to expand your web scraping capabilities, consider exploring the following guides on scraping other important websites.

📜 How to Scrape Google Finance
📜 How to Scrape Google News
📜 How to Scrape Google Scholar Results
📜 How to Scrape Google Search Results
📜 How to Scrape Google Maps
📜 How to Scrape Yahoo Finance
📜 How to Scrape Zillow

If you have any questions or feedback, our support team is always available to assist you on your web scraping journey. Happy Scraping!

Frequently Asked Questions

Q. Is scraping Groupon legal?

Scraping Groupon doesn’t break the rules if you do it for yourself and stick to what the site allows. But make sure to look at Groupon’s rules to check if what you’re doing is okay. If you want to scrape Groupon data for commercial purposes, you should ask the website first so you don’t get into trouble.

Q. Why use the Crawlbase Crawling API instead of simpler methods?

Groupon depends a lot on JavaScript to show content. Regular scraping tools like requests and BeautifulSoup can’t handle this. The Crawlbase Crawling API helps get around these problems. It lets you grab deals and coupons even when there’s JavaScript and you need to scroll to see more items.

Q. How can I store scraped Groupon data?

You have options to keep Groupon data you’ve scraped in different formats like JSON, CSV, or even a database. In this guide, we’ve focused on saving data in a JSON file because it’s easy to handle and works well for most projects. JSON also keeps the structure of the data intact, which makes it simple to analyze later.
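If you prefer CSV over JSON, Python’s built-in csv module is enough. The sketch below is one possible approach; save_to_csv and the default filename are illustrative, and it assumes every scraped row is a dictionary with the same keys.

```python
import csv

def save_to_csv(rows, filename='groupon_deals.csv'):
    """Write a list of dictionaries to a CSV file with a header row."""
    if not rows:
        print("No data to save")
        return
    # Use the keys of the first row as the column names.
    fieldnames = list(rows[0].keys())
    with open(filename, 'w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    print(f"Data saved to {filename}")
```

Opening the file with newline='' follows the csv module’s recommendation and avoids blank lines between rows on Windows.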

How to Efficiently Scrape Emails from Websites

2024-08-14T14:25:50.000Z

In our digital world, getting hold of the correct contact details can make a big difference for your company. If you want to grow your connections, get in touch with possible customers, or do market studies, learning how to pull emails from websites can give you a leg up. This handy method lets you collect valuable information, which opens doors to new chances for growth and getting your message out there.

This guide shows you how to scrape emails from websites. It covers everything from the basics to advanced methods.

What is Email Scraping?

Email scraping is an automated approach to gathering email addresses from various online sources. This involves using specialized software tools called email scrapers to pull out contact information from websites, social media platforms, forums, and other digital spaces. These tools scan web pages to look for patterns that look like email addresses, such as “name@example.com,” and put them together into a list.

Benefits of Email Scraping

Email scraping gives businesses and marketers several plus points:

Saves time: It makes collecting email addresses automatic, helping you build focused contact lists fast.
Find leads: You can gather lots of potential client contacts.
Helps with market research: It gives you useful data to study trends and how consumers act.
Reaches specific groups: By pulling out relevant info, you can aim your marketing at particular audiences.

Common Use Cases for Scraping Emails from Websites

Email scraping has many uses across different industries:

Marketing campaigns: Create email lists to target specific groups and send cold emails.
Lead generation: Find and gather contact details of potential customers.
Market intelligence: Collect data to examine industry shifts and what competitors are doing.
Customer engagement: Find mentions of your brand on social media to interact with users.
Sales acceleration: Streamline the process of discovering and reaching out to prospects, freeing up sales teams to focus on selling instead of manual work.

How to Set Up Your Email Scraping Environment

Pick a Programming Language

To begin your email scraping adventure, you need to pick a good programming language. Python is a popular option for web scraping because it’s easy to use, flexible, and has lots of helpful libraries. Its dynamic typing and concise syntax also keep scraping scripts short and quick to write.

Essential Libraries and Tools to Scrape Emails from Websites

After you’ve decided on Python, you’ll need to get some essential libraries to make email scraping easier:

BeautifulSoup: A great tool to break down HTML and XML documents.
Requests: The go-to way in Python to send HTTP requests.
Scrapy: A complete package to build web crawlers.
Selenium: Comes in handy to scrape websites that change a lot and to mimic how a browser acts.

These libraries give you the tools you need to pull email addresses from websites.

How to Get your Workspace Ready

To set up your workspace:

Get Python: Go to the official website, download the newest version, and install it.
Get pip3: This is the tool that installs packages for Python 3.
Pick an IDE: Choose a text editor or IDE like Visual Studio Code, PyCharm, or Sublime Text.
Make a virtual environment: Use the “venv” module to create a separate space for your project.
Install the libraries you need: Use pip3 to add the necessary libraries to your virtual environment.

Here’s how to install BeautifulSoup:

pip install beautifulsoup4

How to Put Email Scraping Methods into Action

To efficiently scrape emails from websites, you need to combine several methods. Let’s look at the key steps to build an email scraping solution that works.

How to Break Down HTML with BeautifulSoup

BeautifulSoup is a strong Python library to break down HTML content. To use it well:

Set up BeautifulSoup with pip: pip install beautifulsoup4
Bring the library into your script: from bs4 import BeautifulSoup
Break down the HTML content: soup = BeautifulSoup(response.text, 'html.parser')

BeautifulSoup makes it easy to search and navigate HTML structures, which is great for pulling out specific elements.

HTTP requests

To get web pages, you need to make HTTP requests. Python’s Requests library works well for this:

Install Requests: pip install requests
Import the library: import requests
Send a GET request: response = requests.get(url)

This gets the HTML content of the webpage you want, which you can then break down with BeautifulSoup.

How to Extract Email Addresses with Regex

Regular expressions (regex) are key to finding email patterns in text. Here’s a basic regex pattern to get emails:

import re
email_pattern = r'[\w.-]+@[\w.-]+\.\w+'  # the dot before the domain ending must be escaped
emails = re.findall(email_pattern, text)

This pattern looks for sequences that match common email structures. You can tweak it more to boost accuracy or handle specific cases.

By combining these methods, you can build a robust email scraping tool. Remember to respect each website’s terms of service and the applicable laws when you set up your scraper.
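To show how the regex step might look once tightened up, here is a minimal sketch that works on HTML you have already downloaded. The extract_emails helper and the stricter pattern are our own illustration, not a standard; the pattern can still produce false positives (for example, file names like image@2x.png), so treat it as a starting point.

```python
import re
from html import unescape

# Local part, "@", one or more domain labels, then a top-level domain of 2+ letters.
EMAIL_PATTERN = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)*\.[A-Za-z]{2,}')

def extract_emails(html_text):
    """Return unique, lower-cased email addresses in order of first appearance."""
    text = unescape(html_text)  # turn entities such as &#64; back into '@'
    seen = set()
    emails = []
    for match in EMAIL_PATTERN.findall(text):
        email = match.strip('.').lower()
        if email not in seen:
            seen.add(email)
            emails.append(email)
    return emails

sample_html = (
    '<a href="mailto:Sales@Example.com">Sales@Example.com</a> '
    'or write to support&#64;example.co.uk.'
)
print(extract_emails(sample_html))
```

Lower-casing and de-duplicating matter here because the same address often appears both in a mailto: link and in the visible text.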

Best Practices and Legal Issues of Scraping Emails from Websites

Ethical scraping rules: When you’re scraping emails from websites, it’s key to stick to ethical rules to make sure you’re collecting data responsibly. Always honor website owners’ wishes and their rules. Don’t take emails from private places or areas that need passwords, as this is against the law and can get you in big trouble. Instead, stick to information that’s out in the open, but keep data privacy laws in mind.

To keep things ethical:

Ask for permission when you can
Use good tools and services for scraping
Don’t scrape too often to avoid putting too much stress on servers
Don’t use the emails you get to send spam or lots of unwanted emails
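One simple way to avoid stressing a server is to pause between requests. The sketch below is a hypothetical helper, crawl_politely, that takes any fetch function (for example, one built on requests.get) and waits a random interval between calls; the default delays of 2 to 5 seconds are arbitrary assumptions that you should adjust per site.

```python
import random
import time

def crawl_politely(urls, fetch, min_delay=2.0, max_delay=5.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing a random interval between requests."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            # No pause before the first request, only between requests.
            sleep(random.uniform(min_delay, max_delay))
        results[url] = fetch(url)
    return results
```

The sleep parameter exists only so the pause can be swapped out in tests; in normal use you can leave it at its default.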

Paying attention to robots.txt: The robots.txt file plays a key role in ethical web scraping. It tells web crawlers which website sections they can crawl. To follow robots.txt rules:

Get the file by sending an HTTP request to the root domain + “/robots.txt”
Read and study its contents to grasp crawling rules
Look for “Disallow” or “Allow” rules for your user agent
Check for listed crawl-rate limits or visit times
Make sure your scraping program follows these rules

If you ignore robots.txt, your scraper might get blocked or face legal issues.
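The checks listed above can be automated with Python’s standard urllib.robotparser module. The sketch below parses robots.txt text that has already been downloaded; the sample rules, the example.com URLs, and the user agent name my-email-scraper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

def build_robots_parser(robots_txt):
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

sample_rules = """User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = build_robots_parser(sample_rules)
print(parser.can_fetch("my-email-scraper", "https://example.com/contact"))
print(parser.can_fetch("my-email-scraper", "https://example.com/private/staff"))
print(parser.crawl_delay("my-email-scraper"))
```

In a real scraper you would call can_fetch before each request and skip any URL it rejects, and use crawl_delay (when the site sets one) as the minimum pause between requests.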

Legal implications of email scraping: The law around email scraping isn’t clear-cut. It depends on things like where you get the emails, why you’re scraping them, and what laws apply where you are. In general, it’s okay to scrape email addresses that are out in the open for anyone to see. But you need to think about privacy laws and whether people have said it’s okay to use their emails.

Here are the primary legal things to keep in mind:

Follow privacy laws like GDPR and the CAN-SPAM Act
Don’t use scraped emails to send spam or unwanted ads
Remember that breaking a website’s rules could get you in trouble with the law
Keep in mind that taking people’s emails without asking might invade their privacy

Scrape Emails From Other Websites with Crawlbase

Email scraping has emerged as a powerful tool to gather valuable contact information efficiently. This guide has explored the fundamentals of email scraping, from setting up the right environment to implementing effective techniques and navigating legal considerations. By leveraging tools like BeautifulSoup and regex patterns, businesses can streamline their data collection processes and open up new avenues for growth and communication.

Crawlbase enables you to scrape emails from other websites with ease. We provide businesses and individuals with innovative web scraping products like Smart Proxy, Crawler, and Crawling API. Sign up now to start scraping websites with ease.

FAQs

Is it legal to scrape websites?

Web scraping isn’t against the law, and many companies use it to collect data to analyze. But in some cases other laws or rules might make web scraping illegal.

Can ChatGPT be used to scrape email addresses?

ChatGPT can work as an email parser to get email addresses. To use ChatGPT this way inside a Zap (a Zapier automation), you need to have a paid OpenAI/ChatGPT account because using the app in a Zap costs a small amount for each request.

Can you scrape data from websites?

Yes, you can scrape data that’s out in the open on websites, but there are some rules to follow. It’s worth pointing out that web scraping isn’t the same as stealing data. In fact, many companies rely on it to run their business.

How can I use Python to scrape email addresses from a website?

To scrape email addresses from a website with Python, here’s what you need to do:

Step 1: Get the libraries you need and install them.
Step 2: Bring in the libraries and start a session.
Step 3: Grab buttons from the website.
Step 4: Find and pull out email addresses from the website.
Step 5: Look at how to use it with an example. Also, you might want to check out the top five Python libraries that are key for web scraping in 2024.
