Crawling websites is not an easy task. Once you start sending thousands or millions of requests, your server will begin to suffer and will eventually get blocked.
As you probably know, Crawlbase (formerly ProxyCrawl) can help you avoid this situation. In this article, however, we are not going to talk about that. Instead, we are going to look at how you can easily scrape and crawl any website.
This is a hands-on tutorial, so if you want to follow along, make sure that you have a working Crawlbase (formerly ProxyCrawl) account. It’s free, so go ahead and create one here.
The first thing you will notice when registering with Crawlbase (formerly ProxyCrawl) is that we don’t have any fancy interface where you add the URLs you want to crawl. That is deliberate: we want you to have complete freedom, so we created an API that you can call.
So let’s say we want to crawl and scrape the information for the iPhone X on Amazon.com. As of today, this is the product URL:
How can we crawl Amazon securely with Crawlbase?
For this tutorial, we will use the following demo token:
caA53amvjJ24. If you are following the tutorial, make sure to get your own token from your account page.
The Amazon URL has some special characters, so we have to make sure we encode it properly. For example, in Ruby we could do the following:
This will return the following:
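As a minimal sketch of the encoding step and its result, the snippet below uses `CGI.escape` from Ruby's standard library. The product URL is a hypothetical placeholder chosen for illustration, not the URL used in this tutorial, and the constant names are made up as well.

```ruby
require "cgi"

# Hypothetical placeholder URL; substitute the real product page you want to crawl.
PRODUCT_URL = "https://www.amazon.com/dp/B0EXAMPLE?th=1"

# CGI.escape percent-encodes reserved characters such as ":", "/", "?" and "=",
# so the whole URL can be passed safely as a single query parameter.
ENCODED_URL = CGI.escape(PRODUCT_URL)

puts ENCODED_URL
# prints: https%3A%2F%2Fwww.amazon.com%2Fdp%2FB0EXAMPLE%3Fth%3D1
```

Decoding the result with `CGI.unescape` gives back the original URL, which is a quick way to check that nothing was lost.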
Great! We have our URL ready to be scraped with Crawlbase (formerly ProxyCrawl).
The next thing we have to do is make the actual request.
The Crawlbase (formerly ProxyCrawl) API will help us with that. We just have to make a request to the following URL:
So we just have to replace YOUR_TOKEN with our token (caA53amvjJ24 for demo purposes) and THE_URL with the URL we just encoded.
Let’s do it in Ruby!
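Here is a minimal sketch of that request using only Ruby's standard library. The endpoint form (`https://api.crawlbase.com/?token=...&url=...`) and the product URL are assumptions made for illustration, and `build_request_uri` and `fetch_html` are hypothetical helper names, so check the API documentation for your account before relying on them. The request itself only runs when a `CRAWLBASE_TOKEN` environment variable is set.

```ruby
require "cgi"
require "net/http"
require "uri"

# Assumed endpoint form; confirm it against the API documentation for your account.
API_ENDPOINT = "https://api.crawlbase.com/"

# Builds the request URI by filling in the token and the encoded page URL.
def build_request_uri(token, page_url)
  URI("#{API_ENDPOINT}?token=#{CGI.escape(token)}&url=#{CGI.escape(page_url)}")
end

# Sends a GET request and returns the response body (the HTML of the page).
def fetch_html(token, page_url)
  response = Net::HTTP.get_response(build_request_uri(token, page_url))
  unless response.is_a?(Net::HTTPSuccess)
    raise "Request failed with status #{response.code}"
  end
  response.body
end

# Only touch the network when a token is supplied through the environment.
if ENV["CRAWLBASE_TOKEN"]
  # Hypothetical placeholder product URL.
  html = fetch_html(ENV["CRAWLBASE_TOKEN"], "https://www.amazon.com/dp/B0EXAMPLE")
  puts html[0, 200]
end
```

Keeping the URI construction in its own method makes it easy to inspect the exact URL being requested before sending anything.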
Done. We have made our first request to Amazon via Crawlbase (formerly ProxyCrawl): secure, anonymous, and without getting blocked!
Now we should have the HTML from Amazon back. It should look something like this:
<!doctype html><html class="a-no-js" data-19ax5a9jf="dingo"><!-- sp:feature:head-start -->
So now there is only one part missing: extracting the actual content.
This can be done in a million different ways, and it always depends on the language you are programming in. We always suggest using one of the many libraries that are available.
Here are some that can help you do the scraping part with the returned HTML:
We hope you enjoyed this tutorial, and we hope to see you soon at Crawlbase (formerly ProxyCrawl). Happy crawling!