Forbes is a business and financial news site with detailed information on industries, companies, and people around the world. It gets millions of visits every month and publishes billionaire rankings, business trends, and analysis. Forbes loads its content dynamically with JavaScript, so it’s a bit tricky to scrape with traditional tools.
This tutorial will show you how to scrape Forbes data using Puppeteer, a Node.js library that drives a headless browser. Once you get the basics down, we’ll cover how to use the Crawlbase Crawling API to optimize your data extraction. With these tools, you can collect Forbes data for research, analysis, or personal projects.
Table of Contents
- Why Scrape Data From Forbes?
- Key Data Points to Scrape from Forbes
- Setting Up Your Scraping Environment
- Installing Puppeteer
- Setting Up Your Project
- Installing Required Libraries
- Inspecting the HTML Structure
- Writing the Puppeteer Scraper
- Storing Data in a JSON File
- Introduction to Crawlbase Crawling API
- How to Use Crawlbase with Forbes
- Code Example with Crawlbase
Why Scrape Data from Forbes?
Forbes holds a wealth of business, financial, and lifestyle information. Scraping it lets you follow current business trends, track how billionaires’ wealth changes, and more. Here are some key reasons to scrape data from Forbes:
- Billionaire Rankings: Forbes is widely known for its global billionaire rankings. This data can be scraped to see how wealth has evolved over time.
- Company Information: Forbes publishes detailed company profiles that help you see how a business is doing.
- Industry Insights: Forbes provides articles on various sectors, including technology, finance, healthcare, and more. Scrape this data to follow specific industries and trends.
- Financial News: Forbes publishes real-time news and updates on the world economy and markets. Scrape this data to keep track of significant financial events.
Key Data Points to Scrape from Forbes
When scraping Forbes, there are many data points you may want to extract. Some of the essential ones are:
- Billionaire Profiles: Forbes provides in-depth biographies of the wealthiest individuals on the planet. These profiles contain wealth source, industry, net worth, and country of origin.
- Company Profiles: Forbes provides comprehensive data about businesses, such as revenue, headcount, and sector. Use this data to compare businesses or keep an eye on particular industries over time.
- Top Lists: Forbes is well-known for its “Top” lists, which include the top 100 billionaires, the top multinational corporations, and the top startups.
- Articles and News: Forbes features breaking news and in-depth articles on business, finance, and lifestyle. To keep up with the most recent news, trends, and expert opinions from the sector, scrape Forbes articles.
- Market Data: Financial information such as stock prices, market trends, and economic projections is available on the website. To keep track of the financial markets and gain real-time insights, scrape Forbes market data.
Setting Up Your Scraping Environment
To scrape Forbes data, we first need to set up the project environment: Node.js, Puppeteer, and the other required libraries. Follow the steps below.
Installing Puppeteer
Puppeteer is a Node.js library that provides a high-level API to control Chrome or Chromium, perfect for scraping dynamic content like Forbes. To install Puppeteer, follow these steps:
- Make sure Node.js is installed on your system. You can download it from the official Node.js website.
- Once you have Node.js, open your terminal and run the following command to install Puppeteer:
npm install puppeteer
This command will install Puppeteer along with Chromium, which Puppeteer uses to run a headless browser for scraping websites.
Setting Up Your Project
Puppeteer is installed. Now set up your project folder and initialize Node.js. Follow these steps:
- Create a new directory for your project:
mkdir forbes-scraper
- Initialize a new Node.js project by running the following command:
npm init -y
This command will create a package.json file, which manages your project dependencies.
This completes the setup for your Forbes scraping environment. Next, we’ll dive into writing the Puppeteer scraper.
Scraping Forbes with Puppeteer
Now that we have our environment set up, we’ll start scraping Forbes with Puppeteer. In this section, we’ll inspect the HTML, write the scraper, handle dynamic content, and store the scraped data in a JSON file. For this example, we’ll be scraping the Forbes World’s Billionaires List 2024.
Inspecting the HTML Structure
Before we write the scraper, let’s inspect the Forbes website’s HTML. This will help us identify the key elements that contain the data.
Inspecting the Billionaires List Page
- Visit the Page: Go to the Forbes World’s Billionaires List.
- Open Developer Tools: Right-click anywhere on the page and select “Inspect” or press Ctrl+Shift+I to open Developer Tools.
- Look for Key Elements:
  - Billionaire Names/Links: Typically contained in <a> tags with classes like color-link. This is where you get the link to each billionaire’s profile.
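The link-collection step could be sketched as follows. This is a minimal sketch, not the tutorial's original code: it assumes the a.color-link selector mentioned above still matches the list page (Forbes changes its markup from time to time, so check it in Developer Tools first), that profile pages live under /profile/, and that page is a Puppeteer page already opened on the list. The helper names normalizeProfileLinks and collectProfileLinks are hypothetical.

```javascript
// Hypothetical helper: resolve hrefs against the list URL, keep only
// profile links, and remove duplicates while preserving order.
function normalizeProfileLinks(hrefs, baseUrl) {
  const seen = new Set();
  const links = [];
  for (const href of hrefs) {
    if (!href) continue; // skip anchors without an href
    let url;
    try {
      url = new URL(href, baseUrl);
    } catch {
      continue; // skip malformed hrefs
    }
    // Assumption: profile pages live under /profile/ on the Forbes site.
    if (!url.pathname.startsWith('/profile/')) continue;
    url.search = ''; // drop query strings so duplicates collapse
    url.hash = '';
    const clean = url.toString();
    if (!seen.has(clean)) {
      seen.add(clean);
      links.push(clean);
    }
  }
  return links;
}

// `page` is a Puppeteer Page already navigated to the billionaires list.
async function collectProfileLinks(page, selector = 'a.color-link') {
  // $$eval runs the callback in the browser over every matching element.
  const hrefs = await page.$$eval(selector, (anchors) =>
    anchors.map((a) => a.getAttribute('href'))
  );
  return normalizeProfileLinks(hrefs, page.url());
}
```

Keeping the URL clean-up in a plain function separate from the browser call makes it easy to check the filtering logic without launching a browser.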
Scraping Each Billionaire’s Profile
- Navigate to a Profile: Click on a link from the list to open the billionaire’s profile page.
- Open Developer Tools: Right-click anywhere on the page and select “Inspect” or press Ctrl+Shift+I to open Developer Tools.
- Key Elements to Look For:
  - Rank: Look for the rank, typically inside a <div> or <span> with a class like listuser-item__list--rank.
  - Name: Usually inside a header tag, like <h1> with a class like listuser-header__name.
  - Organization: Found in either an <a> or <span> element with organization-related classes.
  - Net Worth: Typically inside a <div> with classes like profile-info__item-value.
  - Biography: Often found inside an unordered list (<ul>) element.
  - Additional Data: Titles and texts could be found in elements with classes like profile-stats__title and profile-stats__text.
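One way to turn the elements listed above into code is a selector map plus a small extraction function. The sketch below is illustrative only: the selectors are taken from the class names above and are assumptions about the live markup, the organization and biography fields are left out because no concrete selector is given for them, and the names PROFILE_SELECTORS, cleanText, readField, and extractProfile are hypothetical.

```javascript
// Selector map built from the class names listed above. These are
// assumptions about Forbes's markup and may need updating.
const PROFILE_SELECTORS = {
  rank: '.listuser-item__list--rank',
  name: '.listuser-header__name',
  netWorth: '.profile-info__item-value',
};

// Collapse runs of whitespace and trim the ends.
function cleanText(text) {
  return (text || '').replace(/\s+/g, ' ').trim();
}

// Read one field; return null instead of throwing when the element is missing.
async function readField(page, selector) {
  try {
    const raw = await page.$eval(selector, (el) => el.textContent);
    return cleanText(raw);
  } catch {
    return null; // $eval rejects when nothing matches the selector
  }
}

// `page` is a Puppeteer Page already navigated to a billionaire's profile.
async function extractProfile(page) {
  const profile = { url: page.url() };
  for (const [field, selector] of Object.entries(PROFILE_SELECTORS)) {
    profile[field] = await readField(page, selector);
  }
  // Pair each stats title with the text at the same position.
  const titles = await page.$$eval('.profile-stats__title', (els) =>
    els.map((el) => el.textContent)
  );
  const texts = await page.$$eval('.profile-stats__text', (els) =>
    els.map((el) => el.textContent)
  );
  profile.stats = {};
  titles.forEach((title, i) => {
    profile.stats[cleanText(title)] = cleanText(texts[i]);
  });
  return profile;
}
```

Returning null for a missing field, rather than letting the whole profile fail, keeps a long scraping run going when a single page deviates from the expected layout.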
Writing the Puppeteer Scraper
Now, we can write the Puppeteer scraper. The following code demonstrates how to launch Puppeteer, open the Forbes page, and scrape key data points.
Example Code:
const puppeteer = require('puppeteer');
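Only the first line of the original listing is shown here, so the following is a hedged sketch of the launch, navigate, and clean-up flow rather than the tutorial's exact code. The helper name withPage is hypothetical, and the Puppeteer module is passed in as an argument so the flow can be tried with a stub; in real use you would pass require('puppeteer').

```javascript
// Run `task(page)` in a fresh headless browser and always close the browser,
// even when the task throws.
async function withPage(puppeteerLib, url, task) {
  const browser = await puppeteerLib.launch({ headless: true });
  try {
    const page = await browser.newPage();
    // 'networkidle2' resolves once there have been no more than two network
    // connections for at least 500 ms, which gives JS-loaded content time to arrive.
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 60000 });
    return await task(page);
  } finally {
    await browser.close();
  }
}

// Example usage (requires `npm install puppeteer` and network access).
// The list URL below is an assumption; substitute the page you inspected.
// const puppeteer = require('puppeteer');
// withPage(puppeteer, 'https://www.forbes.com/billionaires/', (page) => page.title())
//   .then((title) => console.log(title));
```

The try/finally block matters in longer runs: without it, an exception inside the task would leave a browser process running in the background.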
Storing Data in a JSON File
Once the data is scraped, we need to save it in a structured format like JSON for later use.
Example Code:
async function saveDataToFile(data, filename = 'forbes_billionaires.json') {
This will store all the scraped billionaire profiles in a forbes_billionaires.json file, making the data easy to access and use in the future.
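Since only the function signature is shown above, here is one possible body for it. This is a sketch under assumptions, not the original implementation: it uses Node's built-in fs/promises module, pretty-prints the JSON, and returns the absolute path of the written file, which is an illustrative addition.

```javascript
const fs = require('fs/promises');
const path = require('path');

// Write `data` as pretty-printed JSON and return the absolute file path.
async function saveDataToFile(data, filename = 'forbes_billionaires.json') {
  const target = path.resolve(filename);
  // Create the parent directory in case the caller passed a nested path.
  await fs.mkdir(path.dirname(target), { recursive: true });
  // Two-space indentation keeps the file easy to inspect by hand.
  await fs.writeFile(target, JSON.stringify(data, null, 2), 'utf8');
  return target;
}
```

A call such as saveDataToFile(profiles) would then write the array of scraped profiles to forbes_billionaires.json in the current working directory.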
Complete Code Example
Here’s the complete code that combines all the steps:
const puppeteer = require('puppeteer');
Example Output:
[
In the next section, we’ll discuss how to optimize Forbes scraping using Crawlbase Crawling API.
Optimize Forbes Scraping with Crawlbase Crawling API
Puppeteer is great for scraping dynamic websites, but it can be slow when dealing with large volumes of data or JavaScript-heavy pages like Forbes. To improve performance, we can use the Crawlbase Crawling API, which handles JavaScript-rendered content for us and removes the need to manage a browser ourselves.
Introduction to Crawlbase Crawling API
Crawlbase Crawling API handles common web scraping challenges like CAPTCHAs, dynamic content loading, and complex HTML structures. For scraping Forbes, Crawlbase offers a streamlined solution by handling JavaScript rendering on its side, making it a more efficient alternative to Puppeteer for large scraping projects.
Why use Crawlbase for Forbes scraping?
- Handles dynamic content: Optimized for JavaScript-heavy pages like Forbes.
- Improved speed and scalability: No need to run headless browsers yourself, so scraping is faster.
- Simplifies the process: Simple API calls to scrape data, with built-in handling of CAPTCHAs and anti-scraping mechanisms.
How to Use Crawlbase with Forbes
To scrape Forbes using Crawlbase, you need to sign up and get your API token. Here’s how to get started:
- Sign up for Crawlbase: Create an account on Crawlbase and get your API token. You need the JS token for Forbes, because its pages are rendered with JavaScript.
- Install Crawlbase Library: In your Node.js environment, install the Crawlbase Crawling API library using:
npm install crawlbase
- Set up your request: Initialize the Crawlbase API with your token and make GET requests to scrape Forbes data.
Code Example with Crawlbase
Here’s a code example using the Crawlbase JavaScript library to scrape Forbes data more efficiently:
Example Code:
const { CrawlingAPI } = require('crawlbase');
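The listing above is cut off after its first line, so here is a hedged sketch of how the request and parsing steps might look. It assumes the crawlbase package's CrawlingAPI class with a get(url, options) method that resolves to an object with statusCode and body, the ajax_wait and page_wait options, and the cheerio package for parsing (installed separately with npm install cheerio). The selectors come from the class names discussed earlier in this tutorial and may not match the live page; fetchRenderedHtml and parseProfile are hypothetical helper names.

```javascript
// Fetch JavaScript-rendered HTML through the Crawling API.
// `api` is a CrawlingAPI instance; it is passed in so the function can
// also be exercised with a stub.
async function fetchRenderedHtml(api, url) {
  const response = await api.get(url, {
    ajax_wait: 'true', // wait for AJAX requests to finish
    page_wait: 5000, // extra wait in milliseconds before the page is captured
  });
  if (response.statusCode !== 200) {
    throw new Error(`Crawlbase request failed with status ${response.statusCode}`);
  }
  return response.body;
}

// Parse a profile page with cheerio. The selectors are assumptions.
function parseProfile(html) {
  const cheerio = require('cheerio'); // loaded lazily, only when parsing
  const $ = cheerio.load(html);
  return {
    name: $('.listuser-header__name').first().text().trim(),
    netWorth: $('.profile-info__item-value').first().text().trim(),
  };
}

// Example usage (requires a Crawlbase JS token and network access):
// const { CrawlingAPI } = require('crawlbase');
// const api = new CrawlingAPI({ token: 'YOUR_CRAWLBASE_JS_TOKEN' });
// fetchRenderedHtml(api, 'https://www.forbes.com/profile/<name>/')
//   .then((html) => console.log(parseProfile(html)));
```

Checking the status code before parsing avoids silently writing empty records when a request is blocked or times out.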
Explanation of the Code:
- Initialize Crawlbase: CrawlingAPI is initialized with your Crawlbase token to access the API for scraping.
- Get request: We use api.get() to scrape the Forbes URL. We use ajax_wait and page_wait to make sure all dynamic content loads.
- HTML Parsing: We use cheerio to parse the HTML and extract key data points.
- Data Storage: The extracted data is saved to a JSON file.
This way, scraping Forbes is more efficient, because Crawlbase handles JavaScript rendering and complex content structures.
Optimize Forbes Scraping with Crawlbase
Whether you’re analyzing business trends, financial news, or top company rankings, scraping data from Forbes can be very useful. While tools like Puppeteer are great for handling JavaScript-rendered pages, they are time-consuming and resource-heavy. Using Crawlbase Crawling API simplifies the process and makes scraping dynamic content faster.
Follow this guide to scrape Forbes data and scale your projects with Crawlbase. This method is a reliable and optimized way to scrape websites like Forbes. If you’re looking to expand your web scraping capabilities, consider exploring the following guides on scraping other important websites.
📜 How to Scrape Monster.com
📜 How to Scrape Groupon
📜 How to Scrape TechCrunch
📜 How to Scrape Clutch.co
If you have any questions or feedback, our support team is always available to assist you on your web scraping journey. Happy scraping!
Frequently Asked Questions
Q. Can I extract data from Forbes?
Yes, it is possible to extract data from Forbes. Scraping any website, including Forbes, should be done in compliance with their terms of service. Always check the website’s robots.txt file and ensure you are not violating any terms regarding data extraction. Using APIs like Crawlbase helps you scrape efficiently while adhering to best practices.
Q. Why should I use Crawlbase Crawling API instead of Puppeteer for scraping Forbes?
While Puppeteer is a powerful tool for handling JavaScript rendering, it can be slow and resource-intensive. Crawlbase Crawling API simplifies the process by offering pre-configured options for handling dynamic content, which speeds up scraping and reduces the effort needed to manage browser sessions manually.
Q. How can I handle dynamic content on Forbes when scraping?
Forbes uses JavaScript to load much of its content dynamically. With Puppeteer, you can wait for the page or for specific elements to finish loading; with Crawlbase Crawling API, options like ajax_wait and page_wait do the same job. Either way, you can ensure the content is fully loaded before scraping, so you capture all relevant data from the page.
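On the Puppeteer side, waiting for a specific element is often more reliable than a fixed delay. The following is a minimal sketch, assuming page is a Puppeteer page and that Puppeteer's timeout errors carry the name TimeoutError; the helper name waitForContent and the 15-second default are illustrative choices.

```javascript
// Wait for a selector to appear. Returns true when it shows up and
// false when the wait times out, instead of throwing.
async function waitForContent(page, selector, timeoutMs = 15000) {
  try {
    await page.waitForSelector(selector, { timeout: timeoutMs });
    return true;
  } catch (err) {
    // Treat a timeout as "content not there"; rethrow anything else.
    if (err && err.name === 'TimeoutError') return false;
    throw err;
  }
}
```

A scraper can call this before extracting data and skip or retry the page when it returns false.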