Web crawling refers to how search engines like Google explore the web to index information, while scraping involves extracting specific data from websites.
This is a hands-on article, so if you want to follow along, make sure that you have a Crawlbase account. Getting one is straightforward and free, so go ahead and create one here.
Upon registering with Crawlbase, you will see that we don’t have any complex interface where you add the URLs that you want to crawl. We created a simple and easy-to-use API that you can call at any time. Learn more about the Crawling API here.
So let’s say we want to crawl and scrape the information from the following page, which is built entirely in React.js. This is the URL that we will use for demo purposes:
If you try to load that URL from your console or terminal, you will see that you don’t get all the HTML of the page. That is because the markup is rendered on the client side by React, so with a regular curl command, where there is no browser, that JavaScript is never executed.
You can do the test with the following command in your terminal:
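(The demo URL below is an assumption: the emoji-search React page on ahfarmer.github.io, which matches the host that appears in the crawled HTML later in this article.)

```bash
# Fetch the page without a browser: the JavaScript never runs,
# so the React-rendered content is missing from the response.
# The URL is an assumed demo page; replace it with the page you want to test.
curl "https://ahfarmer.github.io/emoji-search/"
```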
For this tutorial, we will use the demo token 5aA5rambtJS2, but if you are following along, make sure to get your own from the My Account page.
First, we need to make sure that we escape the URL so that any special characters won’t collide with the rest of the API call.
For example, if we are using Ruby, we could do the following:
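A minimal sketch using Ruby’s standard CGI module (again, the demo URL is an assumption):

```ruby
require 'cgi'

# Percent-encode the target URL so it can be passed safely as a query parameter.
# The URL below is an assumed demo page; replace it with the page you want to crawl.
CGI.escape('https://ahfarmer.github.io/emoji-search/')
```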
This will bring back the following:
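```
https%3A%2F%2Fahfarmer.github.io%2Femoji-search%2F
```

(The exact value depends, of course, on the URL you encoded.)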
The Crawlbase API will take care of the browser rendering for us. We just have to make a request to the following URL:
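```
https://api.crawlbase.com/?token=YOUR_TOKEN&url=THE_URL
```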
So you will need to replace YOUR_TOKEN with your token (remember, for this tutorial we are using 5aA5rambtJS2), and THE_URL will have to be replaced by the URL we just encoded.
Let’s do it in Ruby!
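A minimal sketch of that request with Ruby’s Net::HTTP (the token is the demo token from above, and the demo URL is still an assumption):

```ruby
require 'net/http'
require 'cgi'

# Demo token used throughout this tutorial; replace it with your own.
token = '5aA5rambtJS2'

# Assumed demo page; swap in the URL you actually want to crawl.
target = CGI.escape('https://ahfarmer.github.io/emoji-search/')

# Ask the Crawling API to fetch the page with a real browser and return the rendered HTML.
uri = URI("https://api.crawlbase.com/?token=#{token}&url=#{target}")
html = Net::HTTP.get(uri)
puts html
```

This should bring back the full HTML of the rendered page, which starts like this: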
```html
<html lang="en" class="gr__ahfarmer_github_io">
```
There is now only one part missing, which is extracting the actual content from the HTML.
This can be done in many different ways, depending on the language you are using to build your application. We always suggest using one of the many open-source libraries out there to handle the scraping part of the returned HTML; in Ruby, for example, Nokogiri is a popular choice.
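A minimal sketch of that extraction step with Nokogiri (the CSS selector is hypothetical; it depends entirely on the markup of the page you crawled):

```ruby
require 'nokogiri'

# `html` is the response body returned by the Crawling API request above.
doc = Nokogiri::HTML(html)

# Hypothetical selector: print the text of every <h1> on the page.
doc.css('h1').each do |heading|
  puts heading.text
end
```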
- Mimic Human Behavior: Incorporate delays between requests and interactions so that your traffic is not flagged as coming from a bot.
Websites implement measures to deter scraping, including CAPTCHAs, IP blocking, and user-agent detection. To bypass these, rotate IP addresses, mimic human behavior, and use proxy servers to avoid getting blocked. Implementing delays and limiting request frequency also helps avoid detection.
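As a simple illustration of the delay idea, here is a short Ruby sketch (the URL list and timing values are purely illustrative):

```ruby
require 'net/http'

# Illustrative list of pages to fetch politely.
urls = [
  'https://example.com/page-1',
  'https://example.com/page-2'
]

urls.each do |url|
  body = Net::HTTP.get(URI(url))
  puts "Fetched #{url} (#{body.bytesize} bytes)"

  # Pause for a random 2 to 5 seconds between requests to mimic human pacing
  # and keep the request rate low.
  sleep(rand(2.0..5.0))
end
```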
The future belongs to AI-powered tools like Crawlbase, enabling more efficient scraping, better handling of dynamic elements, and enhanced compliance with legal norms.
Following best practices remains a prerequisite. Leveraging sophisticated tools such as Crawlbase, staying updated on legal boundaries, and maintaining ethical conduct will ensure successful scraping. Adapting to technological advancements and evolving ethical standards is the fundamental principle here.