Web crawling, also known as web spidering or screen scraping, is often defined by software developers as “writing software to iterate over a set of web pages to extract content.” It is a great tool for pulling data from the web for all kinds of purposes.

Using a web page crawler, you can crawl and scrape data from a set of articles, mine a large blog, or scrape quantitative data from Amazon for price monitoring and machine learning. You can also work around the lack of an official API on some sites, or simply build your own prototype for the next big thing on the web.

In this tutorial, we will teach you the basics of using Crawlbase and Scrapy to crawl and scrape web pages. As an example, we will use Amazon search result pages to extract product ASIN URLs and titles. By the end of this tutorial, you’ll have a fully functional web scraper that runs through a series of pages on Amazon, extracts data from each page, and prints it to your screen.

The example scraper can easily be extended and used as a solid foundation for your own projects that crawl and scrape data from the web.

Blog Objectives:

  • Get to know the Scrapy framework: its features, architecture, and how it operates.
  • Learn how to create your own Amazon scraper in Python with Scrapy and Crawlbase.
  • Learn the basics of how to extract Amazon product pages from Amazon search result pages.

Prerequisites

To complete this tutorial successfully, you’ll need a free Crawlbase API token for scraping web pages anonymously, and Python 3 installed on your local machine for development.

Step 1 — Creating the Amazon Basic Scraper

Scrapy is a Python scraping framework that includes most of the common tools you will need when scraping. It speeds up the scraping process and is maintained by an open source community that loves scraping and crawling the web.

Crawlbase provides a Python scraping library of its own; combined with Scrapy, it lets our crawler run anonymously at scale without being blocked by sites. The Crawlbase API is a thin but powerful middleware layer that sits on top of any site.

Crawlbase & Scrapy have Python packages on PyPI (known as pip). PyPI, the Python Package manager, is maintained by the Python community as a repository for various libraries that developers need.

Install Crawlbase and Scrapy with the following commands:

pip install crawlbase
pip install scrapy

Create a new folder for the scraper:

mkdir amazon-scraper

Navigate to the scraper directory that you created above:

cd amazon-scraper

Create a Python file for your scraper; all the code in this tutorial will live in this file. We use the touch command in the console here, but you can create the file any way you prefer.

touch myspider.py

Let us create our first basic Scrapy spider, AmazonSpider, which inherits from scrapy.Spider. As per the Scrapy documentation, subclasses have two required attributes: name, which names our spider, and start_urls, a list of URLs to crawl; we will use a single URL for this example. We also import the Crawlbase API, which lets us build URLs that go through the Crawlbase API instead of hitting the original sites directly. This avoids blocks and captcha pages.

Paste the following code into myspider.py:

import scrapy

from crawlbase.crawlbase_api import CrawlingAPI

# Get your API token from Crawlbase and replace YOUR_TOKEN with it
api = CrawlingAPI({ 'token': 'YOUR_TOKEN' })

class AmazonSpider(scrapy.Spider):
    name = 'amazonspider'

    # Amazon search result URLs to extract ASIN pages and titles
    urls = ['https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=cold-brew+coffee+makers&page=1']

    # Make the URLs go through the Crawlbase API
    start_urls = list(map(lambda url: api.buildURL({ 'url': url }), urls))

Run the scraper. It does not extract data yet, but it should already be pointed at the Crawlbase API endpoint and get a Crawled (200) response back through Scrapy.

scrapy runspider myspider.py

The result should look something like this; notice that the request to the Amazon result page over Crawlbase passed with a 200 response code.

2018-07-09 02:05:23 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-07-09 02:05:23 [scrapy.utils.log] INFO: Versions: lxml 4.2.3.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.4.0, Python 2.7.15 (default, May 1 2018, 16:44:37) - [GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Darwin-16.7.0-x86_64-i386-64bit
2018-07-09 02:05:23 [scrapy.crawler] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2018-07-09 02:05:23 [scrapy.middleware] INFO: Enabled extensions:
...
2018-07-09 02:05:23 [scrapy.core.engine] INFO: Spider opened
2018-07-09 02:05:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-09 02:05:23 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-07-09 02:05:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://api.crawlbase.com/?token=TOKEN_HIDDEN&url=https%3A%2F%2Fwww.amazon.com%2Fs%2Fref%3Dnb_sb_noss%3Furl%3Dsearch-alias%253Daps%26field-keywords%3Dcold-brew%2Bcoffee%2Bmakers%26page%3D1&> (referer: None)
NotImplementedError: AmazonSpider.parse callback is not defined
2018-07-09 02:05:25 [scrapy.core.engine] INFO: Closing spider (finished)
2018-07-09 02:05:25 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 390,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 146800,
'downloader/response_count': 1,
...
'start_time': datetime.datetime(2018, 7, 8, 23, 5, 23, 681766)}
2018-07-09 02:05:25 [scrapy.core.engine] INFO: Spider closed (finished)

Since we have not written the parser yet, the code simply loaded start_urls, which is just one Amazon search result URL routed through the Crawlbase API, and handed the response to the default parse callback, which is not implemented (hence the NotImplementedError in the log). It is time for our next step: writing a simple parser to extract the data we need.

Step 2 - Scraping Amazon ASIN URLs and Titles

Let us enhance the AmazonSpider class with a simple parser that extracts the URLs and titles of all ASIN products on the search result page. For that, we need to know which CSS selectors to use so Scrapy can fetch the data from the right tags. At the time of writing this tutorial, the ASIN URLs are found under the .a-link-normal CSS selector.
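Before wiring these selectors into the spider, it can help to try them out interactively. Below is a minimal sketch using Scrapy’s interactive shell; it assumes the page markup still matches the selectors above, and in practice you may want to pass a Crawlbase-built URL instead of the plain Amazon one to avoid blocks.

scrapy shell 'https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=cold-brew+coffee+makers&page=1'

# Inside the shell, test the selectors before adding them to the spider
>>> results = response.css('li.s-result-item')
>>> len(results)  # number of products matched on the page
>>> results[0].css('a.a-link-normal ::text').extract_first()  # first product title
>>> results[0].css('a.a-link-normal.a-text-normal ::attr(href)').extract_first()  # first ASIN URL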

Enhancing our spider class with our simple parser will give us the following code:

import scrapy

from crawlbase.crawlbase_api import CrawlingAPI

# Get your API token from Crawlbase and replace YOUR_TOKEN with it
api = CrawlingAPI({ 'token': 'YOUR_TOKEN' })

class AmazonSpider(scrapy.Spider):
    name = 'amazonspider'

    # Amazon search result URLs
    urls = ['https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=cold-brew+coffee+makers&page=1']

    # Make the URLs go through the Crawlbase API
    start_urls = list(map(lambda url: api.buildURL({ 'url': url }), urls))

    def parse(self, response):
        # Each product on the search result page is an <li class="s-result-item">
        for link in response.css('li.s-result-item'):
            yield {
                'url': link.css('a.a-link-normal.a-text-normal ::attr(href)').extract_first(),
                'title': link.css('a.a-link-normal ::text').extract_first()
            }

Running Scrapy again should print some nice ASIN page URLs and their titles. 😄

...
2018-07-09 04:01:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://api.crawlbase.com/?token=TOKEN_HIDDEN&url=https%3A%2F%2Fwww.amazon.com%2Fs%2Fref%3Dnb_sb_noss%3Furl%3Dsearch-alias%253Daps%26field-keywords%3Dcold-brew%2Bcoffee%2Bmakers%26page%3D1&>
{'url': 'https://www.amazon.com/Airtight-Coffee-Maker-Infuser-Spout/dp/B01CTIYU60/ref=sr_1_5/135-1769709-1970912?ie=UTF8&qid=1531098077&sr=8-5&keywords=cold-brew+coffee+makers', 'title': 'Airtight Cold Brew Iced Coffee Maker and Tea Infuser with Spout - 1.0L/34oz Ovalware RJ3 Brewing Glass Carafe with Removable Stainless Steel Filter'}
2018-07-09 04:01:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://api.crawlbase.com/?token=TOKEN_HIDDEN&url=https%3A%2F%2Fwww.amazon.com%2Fs%2Fref%3Dnb_sb_noss%3Furl%3Dsearch-alias%253Daps%26field-keywords%3Dcold-brew%2Bcoffee%2Bmakers%26page%3D1&>
{'url': 'https://www.amazon.com/KitchenAid-KCM4212SX-Coffee-Brushed-Stainless/dp/B06XNVZDC7/ref=sr_1_6/135-1769709-1970912?ie=UTF8&qid=1531098077&sr=8-6&keywords=cold-brew+coffee+makers', 'title': 'KitchenAid KCM4212SX Cold Brew Coffee Maker, Brushed Stainless Steel'}
...
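Printing to the console is enough for this tutorial, but if you want to keep the results, Scrapy’s built-in feed exports can write the yielded items straight to a file. The products.json file name below is just an example:

scrapy runspider myspider.py -o products.json

Each dictionary yielded by parse becomes one entry in the exported JSON file.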

The Architecture of Python Scrapy Web Scraper Framework

Scrapy is a robust, open-source web crawling framework carefully built in Python. Web scrapers and developers alike love Scrapy because it helps them extract and scrape data from websites with remarkable efficiency.

When a project is started, several files come into play that interact with Scrapy’s principal components. At the core of the framework sits the engine, which drives four key parts:

  • Spiders
  • Item Pipelines
  • The Downloader
  • The Scheduler

Let’s break down how they work together into simple steps:

  1. In the initial phase of Scrapy web scraping, communication happens through Spiders: classes that define the scraping methods users invoke. Through them, the Scrapy web scraper sends requests to the engine, including the URLs to be scraped and the information to extract. The Scrapy engine is crucial: it controls the data flow and triggers the key events in scraping, serving as the manager of the entire operation.
  2. Once a request is received by the engine, it is directed to the Scheduler, which manages the order of tasks to be executed. If multiple URLs are provided, the Scheduler enqueues them for processing.
  3. The engine also receives requests from the Scheduler, which has pre-arranged the tasks. These requests are then sent to the Downloader module. The Downloader’s job is to fetch the HTML code of the specified web page and convert it into a Response object.
  4. After that, the Spider receives the Response object and uses it to run the specific scraping methods defined in the Spider class. The processed data then moves to the ItemPipeline module, where it goes through further transformation steps such as cleaning, data validation, or insertion into a database.

To summarize, the Scrapy framework is built from a few key parts: Spiders, which define the scraping methods; the Scrapy Engine, which controls the data flow; the Scheduler, which manages the order of tasks; the Downloader, which fetches web page content; and the ItemPipeline, which applies processing steps to the extracted data. Each component plays a vital role in keeping Scrapy web scraping efficient and organized. A minimal pipeline for our Amazon items is sketched below.
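To make the ItemPipeline part concrete, here is a minimal sketch of a pipeline that cleans the scraped titles and drops items without a URL. The AmazonCleanupPipeline class and the pipelines.py file name are our own illustration rather than part of the tutorial’s code; in a full Scrapy project you would enable the pipeline through the ITEM_PIPELINES setting.

# pipelines.py (illustrative example, not part of the tutorial's spider)
from scrapy.exceptions import DropItem

class AmazonCleanupPipeline:
    def process_item(self, item, spider):
        # Discard results that have no ASIN URL at all
        if not item.get('url'):
            raise DropItem('Missing ASIN URL')
        # Strip surrounding whitespace from the title, if one was scraped
        if item.get('title'):
            item['title'] = item['title'].strip()
        return item

# In settings.py (or custom_settings on the spider) the pipeline would be enabled with:
# ITEM_PIPELINES = {'myproject.pipelines.AmazonCleanupPipeline': 300}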

Why Should Scrapy Be Your First Choice?

When you have to crawl a web page, the Scrapy web scraper emerges as an unrivaled choice for several compelling reasons. Let’s discuss the features of Scrapy and understand why it should be your go-to web scraping tool.

Versatility

What sets the Scrapy web scraper apart is its innate versatility and flexibility. The framework gracefully handles the obstacles web pages throw at it as well as the complexities of data extraction. It acts as a reliable companion for those who live and breathe web scraping, making the process both accessible and efficient.

Easy Web Page Crawling

The Scrapy framework excels at crawling web pages with precision. It provides a simple way to crawl a web page, letting developers focus on data extraction rather than the hard technical plumbing.

Whether you’re scraping through the complicated structure of a website or extracting data, Scrapy provides a smooth experience. It simplifies the complex task of traversing web pages, making it an ideal companion for both beginners and expert developers.

Python Simplification

Among programming languages, Python is known for its simplicity and readability, and Scrapy, being a Python-based framework, inherits these qualities. This makes it an excellent choice for those starting out in web scraping: the Python code in Scrapy feels familiar, which eases the learning curve and makes for a pleasant environment in which to craft scraping logic.

Scrapy provides functionality that genuinely simplifies the way you crawl a web page. From initiating requests to parsing and storing data, it streamlines each step with finesse, offering a full framework for building scalable, efficient spiders customized for diverse scraping projects; the pagination sketch below gives a taste of this.
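For instance, extending our Amazon spider to follow additional search result pages only takes a few extra lines in parse. The sketch below is an assumption-laden example: it reuses the api.buildURL helper from the tutorial, hard-codes a five-page limit, and relies on the page query parameter continuing to work on Amazon’s side.

    # Sketch: following more search result pages from inside AmazonSpider.parse
    def parse(self, response):
        for link in response.css('li.s-result-item'):
            yield {
                'url': link.css('a.a-link-normal.a-text-normal ::attr(href)').extract_first(),
                'title': link.css('a.a-link-normal ::text').extract_first()
            }

        # Queue the next page through the Crawlbase API, stopping after page 5
        self.page = getattr(self, 'page', 1) + 1
        if self.page <= 5:
            next_url = ('https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps'
                        '&field-keywords=cold-brew+coffee+makers&page={}'.format(self.page))
            yield scrapy.Request(api.buildURL({ 'url': next_url }), callback=self.parse)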

Anonymity with Scrapy Proxy Support

Maintaining anonymity in web scraping is crucial, and Scrapy excels in this aspect. With built-in proxy support, Scrapy ensures that your crawler remains incognito. This feature is invaluable when dealing with IP bans or CAPTCHAs and adds an extra layer of resilience to your scraping process.
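As a concrete illustration, Scrapy’s built-in HttpProxyMiddleware honors a proxy address set on a request’s meta dictionary. The proxy address below is a placeholder; in this tutorial the Crawlbase API already plays the anonymizing role, so you would only reach for this when wiring up proxies of your own.

import scrapy

class ProxiedSpider(scrapy.Spider):
    name = 'proxiedspider'

    def start_requests(self):
        # 'http://user:pass@proxyhost:8080' is a placeholder, not a real proxy
        yield scrapy.Request(
            'https://www.amazon.com/',
            callback=self.parse,
            meta={'proxy': 'http://user:pass@proxyhost:8080'}
        )

    def parse(self, response):
        self.logger.info('Fetched %s through the proxy', response.url)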

Powerful Web Scraping Capabilities

The Scrapy web scraper is synonymous with powerful web scraping capabilities. It streamlines the entire process, from sending requests to parsing and storing data, and its efficiency and reliability make it indispensable for extracting data from diverse websites. Whether you are a beginner developer or have ten years of experience in the field, Scrapy gives you the tools you need to navigate huge volumes of web data with ease. Its component-based architecture encourages customization, letting developers shape scraping workflows to meet their project requirements; a small example of such per-spider customization follows.
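One simple way this customization shows up is in per-spider settings. The values below are illustrative rather than recommendations from the tutorial; they show how a spider can override throttling and retry behavior without touching the project-wide configuration.

# Per-spider setting overrides added to the tutorial's AmazonSpider (example values only)
class AmazonSpider(scrapy.Spider):
    name = 'amazonspider'

    custom_settings = {
        'DOWNLOAD_DELAY': 1,        # pause roughly a second between requests
        'CONCURRENT_REQUESTS': 4,   # limit parallel downloads
        'RETRY_TIMES': 3,           # retry failed requests a few times
    }

    # start_urls and parse stay exactly as in Step 2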

Experience Next-level Scraping with Crawlbase

In this tutorial, we learned the basics of web crawling and scraping. We used Crawlbase API with Scrapy to keep our scraper hidden from sites that might block our requests. We also learned the basics of how to extract Amazon product pages from Amazon search result pages. You can change the example to any site that you require and start crawling thousands of pages anonymously with Scrapy & Crawlbase.

We hope you enjoyed this tutorial and we hope to see you soon in Crawlbase. Happy Crawling and Scraping!