Web crawling, also referred to as web spidering or screen scraping, is often described by software developers as “writing software to iterate over a set of web pages to extract content”. It is a great tool for extracting data from the web for a variety of reasons.
Using a web crawler, you can scrape data from a set of articles, mine a large blog, or scrape quantitative data from Amazon for price monitoring and machine learning. It also lets you get content from sites that have no official API, or simply build your own prototype for the next better web.
In this tutorial, we will teach you the basics of crawling and scraping using Crawlbase (formerly ProxyCrawl) and Scrapy. As an example, we will use Amazon search result pages to extract product ASIN URLs and titles. By the end of this tutorial, you will hopefully have a fully functional web scraper that runs through a series of Amazon pages, extracts data from each page, and prints it to your screen.
The example scraper can easily be extended and used as a solid base for your own projects that crawl and scrape data from the web.
Prerequisites
To complete this tutorial successfully, you will need a free Crawlbase (formerly ProxyCrawl) API token for scraping web pages anonymously, and Python 3 installed on your local machine for development.
Step 1 — Creating the Basic Amazon Scraper
Scrapy is a Python scraping framework; it includes most of the common tools that help when scraping. It speeds up the scraping process and is maintained by an open source community that loves scraping and crawling the web.
Crawlbase (formerly ProxyCrawl) has a Python scraping library; combined with Scrapy, it ensures that our crawler runs anonymously at scale without being blocked by sites. The Crawlbase (formerly ProxyCrawl) API is a powerful thin layer that acts as a middleware on top of any site.
Crawlbase (formerly ProxyCrawl) & Scrapy both have Python packages on PyPI (installable with pip). PyPI, the Python Package Index, is maintained by the Python community as a repository for the various libraries that developers need.
Install Crawlbase (formerly ProxyCrawl) and Scrapy with the following commands:
pip install proxycrawl
pip install scrapy
Create a new folder for the scraper:
mkdir amazon-scraper
Navigate to the scraper directory that you created above:
cd amazon-scraper
Create a Python file for your scraper. All the code in this tutorial will go in this file. We use the touch command in the console for this tutorial; you can use any other editor you prefer.
touch myspider.py
Let us create our first basic Scrapy spider, AmazonSpider, which inherits from scrapy.Spider. As per the Scrapy documentation, subclasses have two required attributes: name, which is the name of our spider, and start_urls, a list of URLs to start from; we will use a single URL for this example. We also route the start URLs through the Crawlbase (formerly ProxyCrawl) API instead of going directly to the original sites, to avoid blocks and captcha pages.
Paste the following code in myspider.py.
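Since we installed the proxycrawl package above, you could build these URLs with it; the sketch below instead builds the API URL by hand with Python's standard library so the wrapping is explicit. The endpoint format, the YOUR_TOKEN placeholder, and the example search URL are assumptions for illustration; check your Crawlbase (formerly ProxyCrawl) dashboard and docs for the exact values.

```python
import scrapy
from urllib.parse import quote_plus

# Assumption: the token/url query-parameter format of the Crawlbase (formerly
# ProxyCrawl) API; replace YOUR_TOKEN with your own token and verify the
# endpoint against the current docs.
API_TOKEN = 'YOUR_TOKEN'
API_ENDPOINT = 'https://api.proxycrawl.com/?token={}&url={}'


def api_url(url):
    # Wrap the target URL so the request goes through the API instead of
    # hitting the site directly.
    return API_ENDPOINT.format(API_TOKEN, quote_plus(url))


class AmazonSpider(scrapy.Spider):
    name = 'amazon'

    # A single Amazon search results page for this example (placeholder query).
    start_urls = [api_url('https://www.amazon.com/s?k=laptops')]

    def parse(self, response):
        # No extraction yet; we add the parsing logic in Step 2.
        pass
```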
Run the scraper. It does not extract data yet, but it should be pointed at the Crawlbase (formerly ProxyCrawl) API endpoint, and you should get a Crawled (200) line from Scrapy.
scrapy runspider myspider.py
The result should look something like this; notice that the request to the Amazon results page through Crawlbase (formerly ProxyCrawl) passed with a 200 response code.
2018-07-09 02:05:23 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
Since we did not write the parsing logic yet, the code simply loaded the start_urls, which is just one Amazon search results URL routed through the Crawlbase (formerly ProxyCrawl) API, and handed the response to our parse callback, which does nothing yet. It is now time for the next step: writing a simple parser to extract the data we need.
Step 2 — Scraping Amazon ASIN URLs and Titles
Let us enhance the AmazonSpider class with a simple parser that extracts the URLs and titles of all ASIN products on the search results page. For that, we need to know which CSS selectors to use so we can instruct Scrapy to fetch the data from those tags. At the time of writing this tutorial, the ASIN URLs are found under the .a-link-normal CSS selector.
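If you want to sanity-check the selector before wiring it into the spider, Scrapy's interactive shell is handy. The token and URL-encoded search page below are placeholders; pass the same API-wrapped URL you use in your spider.

```
scrapy shell "https://api.proxycrawl.com/?token=YOUR_TOKEN&url=https%3A%2F%2Fwww.amazon.com%2Fs%3Fk%3Dlaptops"
>>> len(response.css('.a-link-normal'))               # how many product links matched
>>> response.css('.a-link-normal')[0].attrib['href']  # URL of the first match
```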
Enhancing our spider class with this simple parser gives us code along the following lines.
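As before, this is a sketch rather than an exact listing: the token, endpoint, and search URL are placeholders, and the way titles are pulled out of .a-link-normal may need adjusting as Amazon changes its markup.

```python
import scrapy
from urllib.parse import quote_plus

# Placeholders/assumptions: replace YOUR_TOKEN and verify the endpoint format
# against the Crawlbase (formerly ProxyCrawl) docs.
API_TOKEN = 'YOUR_TOKEN'
API_ENDPOINT = 'https://api.proxycrawl.com/?token={}&url={}'


def api_url(url):
    return API_ENDPOINT.format(API_TOKEN, quote_plus(url))


class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    start_urls = [api_url('https://www.amazon.com/s?k=laptops')]

    def parse(self, response):
        # Each ASIN product link on the search results page.
        for link in response.css('.a-link-normal'):
            url = link.attrib.get('href')
            title = link.xpath('normalize-space(.)').get()
            if url and title:
                # Print to the screen, as described in the introduction; in a
                # larger project you would typically yield items instead.
                print(url, title)
```

Printing keeps the output on your screen as promised in the introduction; yielding dicts instead would let Scrapy's feed exports write the results to JSON or CSV.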
Running Scrapy again should print some nice URLs of ASIN pages and their titles. 😄
...
Results & Summary
In this tutorial, we learnt the fundamentals of web crawling and scraping, and we used the Crawlbase (formerly ProxyCrawl) API combined with Scrapy to keep our scraper undetected by sites that might block our requests. We also learnt the basics of extracting Amazon product URLs and titles from Amazon search result pages. You can adapt the example to any site you need and start crawling thousands of pages anonymously with Scrapy & Crawlbase (formerly ProxyCrawl).
We hope you enjoyed this tutorial and we hope to see you soon at Crawlbase (formerly ProxyCrawl). Happy Crawling and Scraping!