Web crawling, also known as web spidering or screen scraping, is defined by software developers as “writing software to iterate on a set of web pages to extract content.” It is a great tool for extracting data from the web for various reasons.
Using a web crawler, you can scrape data from a set of articles, mine a large blog, scrape quantitative data from Amazon for price monitoring and machine learning, get content from sites that have no official API, or simply build your own prototype for the next better web.
In this tutorial, we will teach you the basics of crawling and scraping using Crawlbase and Scrapy. As an example, we will use Amazon search result pages to extract product ASIN URLs and titles. By the end of this tutorial, you should have a fully functional web scraper that runs through a series of pages on Amazon, extracts data from each page, and prints it to your screen.
The scraper example can be easily extended and used as a solid layer for your personal projects on crawling and scraping data from the web.
- Get to know the Scrapy framework, its features, architecture, and operations.
- Learn to create your own Amazon scraper in Python Scrapy using Crawlbase.
- Learn the basics of how to extract Amazon product pages from Amazon search result pages.
To complete this tutorial successfully, you’ll need a free Crawlbase API token for scraping web pages anonymously, and Python 3 installed on your local machine for development.
Scrapy is a Python scraping library; it includes most of the common tools that will help us when scraping. It speeds up the scraping process and it is maintained by an open source community that loves scraping and crawling the web.
Crawlbase also has a Python scraping library; combined with Scrapy, it helps our crawler run anonymously at large scale without being blocked by sites. The Crawlbase API is a powerful thin layer that acts as middleware on top of any site.
Crawlbase and Scrapy both have Python packages on PyPI, which you install with pip. PyPI, the Python Package Index, is maintained by the Python community as a repository for the various libraries that developers need.
Install Crawlbase and Scrapy with the following commands:
pip install crawlbase
pip install scrapy
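After installing, a quick sanity check can confirm that both packages are importable. This is only a minimal sketch; it assumes the import names match the package names used in the pip commands above (`crawlbase` and `scrapy`), and the helper name `check_installed` is made up for illustration.

```python
from importlib.util import find_spec


def check_installed(packages=("crawlbase", "scrapy")):
    """Return {package_name: bool} telling whether each package can be imported."""
    # find_spec returns None for a top-level package that is not installed.
    return {name: find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    for name, ok in check_installed().items():
        print(f"{name}: {'installed' if ok else 'missing'}")
```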
Create a new folder for the scraper:
Navigate to the scraper directory that you created above:
Create a Python file for your scraper. All the code of this tutorial will be placed in this file. We use the touch command in the console for this tutorial, but you can use any editor that you prefer.
Let us create our first basic Scrapy spider, AmazonSpider, which inherits from scrapy.Spider. As per the Scrapy documentation, our subclass defines two key attributes: name, which is a name for our spider, and start_urls, which is a list of URLs. We will use one URL for this example. We also import the Crawlbase API so that we can build URLs that go through the Crawlbase API instead of going directly to the original sites, which avoids blocks and captcha pages.
Paste the following code in myspider.py:
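A minimal sketch of such a spider might look like the following. Several things here are assumptions rather than the tutorial's own listing: the token is a placeholder, the search URL is illustrative, and the Crawlbase URL is built by hand with `urllib` (assuming the endpoint accepts `token` and `url` query parameters) instead of through the `crawlbase` package. The placeholder `parse` only logs the response so that nothing is extracted yet.

```python
from urllib.parse import urlencode

try:
    import scrapy
    BaseSpider = scrapy.Spider
except ImportError:
    # Fallback so the sketch can still be imported where Scrapy is missing.
    BaseSpider = object

# Assumptions: placeholder token and an illustrative Amazon search URL.
CRAWLBASE_TOKEN = "YOUR_CRAWLBASE_TOKEN"
AMAZON_SEARCH_URL = "https://www.amazon.com/s?k=cold+brew+coffee+makers"


def through_crawlbase(url, token=CRAWLBASE_TOKEN):
    """Wrap a target URL so the request is sent to the Crawlbase endpoint."""
    # Assumption: the endpoint takes `token` and `url` as query parameters.
    return "https://api.crawlbase.com/?" + urlencode({"token": token, "url": url})


class AmazonSpider(BaseSpider):
    # Name used by Scrapy to identify this spider.
    name = "amazonspider"
    # One search URL, routed through Crawlbase instead of hitting Amazon directly.
    start_urls = [through_crawlbase(AMAZON_SEARCH_URL)]

    def parse(self, response):
        # No extraction yet: just confirm that the page came back.
        self.logger.info("Fetched %s (status %s)", response.url, response.status)
```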
Run the scraper. It does not extract data yet, but it should be pointed at the Crawlbase API endpoint, and you should see Crawled (200) in the Scrapy output.
scrapy runspider myspider.py
The result should look something like this. Notice that the request to the Amazon results page through Crawlbase passed with a 200 response code.
2018-07-09 02:05:23 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
Since we did not write the parser yet, the code simply loaded the start_urls, which is just one URL to Amazon search results over the Crawlbase API, and handed the result to a parse callback that does not extract anything. It is now time for our next step: writing a simple parser to extract the data we need.
Let us enhance the AmazonSpider class with a simple parser that extracts the URLs and titles of all ASIN products on the search result page. For that, we need to know which CSS selectors to use to instruct Scrapy to fetch the data from those tags. At the time of writing this tutorial, the ASIN URLs are found in the .a-link-normal CSS selector.
Enhancing our spider class with our simple parser will give us the following code:
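One way the enhanced spider could look is sketched below. It is not the tutorial's original listing: the token, search URL, and Crawlbase endpoint format are assumptions, and the `a.a-link-normal` selector reflects the Amazon markup described above, which may have changed since. Relative links are resolved against the Amazon domain rather than the response URL, because the response comes from the Crawlbase endpoint.

```python
from urllib.parse import urlencode, urljoin

try:
    import scrapy
    BaseSpider = scrapy.Spider
except ImportError:
    BaseSpider = object  # fallback so the sketch imports without Scrapy

AMAZON_BASE = "https://www.amazon.com"
SEARCH_URL = AMAZON_BASE + "/s?k=cold+brew+coffee+makers"  # illustrative query
CRAWLBASE_ENDPOINT = "https://api.crawlbase.com/"  # assumed endpoint format
TOKEN = "YOUR_CRAWLBASE_TOKEN"  # placeholder token


class AmazonSpider(BaseSpider):
    name = "amazonspider"
    start_urls = [
        CRAWLBASE_ENDPOINT + "?" + urlencode({"token": TOKEN, "url": SEARCH_URL})
    ]

    def parse(self, response):
        # Each matching anchor is expected to link to a product (ASIN) page.
        for link in response.css("a.a-link-normal"):
            href = link.css("::attr(href)").get()
            # Join all text nodes inside the anchor and collapse whitespace.
            title = " ".join("".join(link.css("*::text").getall()).split())
            if not href or not title:
                continue  # skip anchors without a link or visible text
            url = urljoin(AMAZON_BASE, href)
            print(url, title)
            yield {"url": url, "title": title}
```

Yielding a dictionary in addition to printing lets Scrapy treat each result as an item, so it can later be exported or passed through item pipelines.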
Running scrapy again should print us some nice URLs of ASIN pages and their titles. 😄
Scrapy is a robust, open-source web crawling framework written in Python. Popular with web scrapers and developers alike, it helps them extract data from websites efficiently.
When you start a project, several files are created that interact with Scrapy’s principal components. At the center of the framework is the engine, which coordinates four key components:
- Spiders
- Item Pipelines
- The Downloader
- The Scheduler
Let’s break down how they work together into simple steps:
- Scraping starts with Spiders. Spiders are classes that define the scraping methods. Scrapy invokes these methods, which send requests to the engine, including the URLs to be scraped and the information to extract. The Scrapy engine controls the entire data flow and triggers the essential events throughout the scraping process. It serves as the manager of the entire operation.
- Once a request is received by the engine, it is directed to the Scheduler, which manages the order of tasks to be executed. If multiple URLs are provided, the Scheduler enqueues them for processing.
- The engine then takes requests back from the Scheduler in the order it has arranged and sends them to the Downloader. The Downloader’s job is to fetch the HTML code of the specified web page and convert it into a Response object.
- Subsequently, the Response object is handed over to the Spider, where specific scraping methods defined in the Spider class are invoked. Afterward, the processed data moves on to the ItemPipeline module, where successive transformation steps occur. These steps may include cleaning, data validation, or inserting into a database.
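As an illustration of the last step, here is a minimal sketch of an item pipeline that cleans a title and stores the item in SQLite. The class name, table layout, and item fields (`url`, `title`) are hypothetical; `open_spider`, `process_item`, and `close_spider` are the hook names Scrapy calls on pipeline classes.

```python
import sqlite3


class SQLitePipeline:
    """Pipeline sketch: normalize the title, then store the item in SQLite."""

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path
        self.connection = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the database.
        self.connection = sqlite3.connect(self.db_path)
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS products (url TEXT PRIMARY KEY, title TEXT)"
        )

    def process_item(self, item, spider):
        # Cleaning step: collapse runs of whitespace in the title.
        item["title"] = " ".join(str(item.get("title", "")).split())
        # Storage step: insert the item, replacing any older row for this URL.
        self.connection.execute(
            "INSERT OR REPLACE INTO products (url, title) VALUES (?, ?)",
            (item["url"], item["title"]),
        )
        self.connection.commit()
        return item  # hand the item on to the next pipeline step

    def close_spider(self, spider):
        # Called once when the spider finishes: release the database.
        self.connection.close()
```

In a Scrapy project, a class like this is typically switched on through the `ITEM_PIPELINES` setting.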
To summarize, the key components in the Scrapy framework include Spiders (defining scraping methods), the Scrapy Engine (controlling data flow), Scheduler (managing task order), Downloader (fetching web page content), and ItemPipeline (applying transformation steps to extracted data). Each component plays a vital role in ensuring an efficient and organized Scrapy web scraping process.
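The data flow between these components can be pictured with a toy model. The sketch below is not Scrapy's implementation; it is a simplified, synchronous imitation in plain Python in which every class, function, and URL is made up for illustration, and a dictionary stands in for the network.

```python
from collections import deque

# Stand-in for the web: URL -> HTML (hypothetical pages).
FAKE_WEB = {
    "https://example.com/page/1": "<title> First page </title>",
    "https://example.com/page/2": "<title>Second page</title>",
}


class Scheduler:
    """Keeps requests in the order in which they should be processed."""

    def __init__(self):
        self._queue = deque()

    def enqueue(self, url):
        self._queue.append(url)

    def next_request(self):
        return self._queue.popleft() if self._queue else None


def download(url):
    """Downloader: fetch the page and wrap it in a response-like dict."""
    return {"url": url, "body": FAKE_WEB[url]}


class TitleSpider:
    """Spider: says where to start and how to turn a response into items."""

    start_urls = list(FAKE_WEB)

    def parse(self, response):
        body = response["body"]
        start = body.index("<title>") + len("<title>")
        end = body.index("</title>")
        yield {"url": response["url"], "title": body[start:end]}


def clean_title(item):
    """Pipeline step: trim surrounding whitespace from the title."""
    item["title"] = item["title"].strip()
    return item


def run_engine(spider, scheduler, pipeline_steps):
    """Engine: moves data between spider, scheduler, downloader, and pipeline."""
    for url in spider.start_urls:
        scheduler.enqueue(url)
    results = []
    url = scheduler.next_request()
    while url is not None:
        response = download(url)
        for item in spider.parse(response):
            for step in pipeline_steps:
                item = step(item)
            results.append(item)
        url = scheduler.next_request()
    return results


items = run_engine(TitleSpider(), Scheduler(), [clean_title])
for item in items:
    print(item)
```

The real framework runs these stages asynchronously and adds middleware between them, but the order of hand-offs is the same as in the steps above.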
When you have to crawl a web page, Scrapy stands out as a strong choice for several compelling reasons. Let’s discuss the features of Scrapy and understand why it should be your go-to web scraping tool.
What sets Scrapy apart is its versatility and flexibility. The framework handles the obstacles that web pages present and the complexities of data extraction, making the process both accessible and efficient for those who live and breathe web scraping.
Scrapy excels at crawling web pages precisely. It provides a straightforward and user-friendly approach, allowing developers to focus on the extraction logic rather than on low-level technical details.
Whether you’re scraping through the complicated structure of a website or extracting data, Scrapy provides a smooth experience. It simplifies the complex task of traversing web pages, making it an ideal companion for both beginners and expert developers.
Among programming languages, Python is known for its simplicity and readability. Scrapy, being a Python-based framework, inherits these qualities. This makes it an excellent choice for those starting out with web scraping. The familiarity of Python code in Scrapy ensures a smoother learning curve and a more expressive environment for crafting scraping logic.
Scrapy provides functionality that simplifies the way you crawl a web page. From initiating requests to parsing and storing data, Scrapy streamlines each step. It offers a comprehensive framework for building scalable and efficient spiders customized for diverse scraping projects.
Maintaining anonymity in web scraping is crucial, and Scrapy helps here. With built-in proxy support, Scrapy can route requests through a proxy so that your crawler’s own IP address stays hidden. This is invaluable when dealing with IP bans or CAPTCHAs, and it adds an extra layer of resilience to your scraping process.
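As a rough illustration of the proxy support, Scrapy's built-in proxy handling reads a `proxy` key from `request.meta`. The sketch below is a small downloader middleware that sets this key on every request; the class name and the `FIXED_PROXY_URL` setting are hypothetical, and the proxy address is a placeholder you would supply yourself.

```python
class FixedProxyMiddleware:
    """Downloader middleware sketch: send every request through one proxy."""

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        # FIXED_PROXY_URL is a hypothetical custom setting, not a Scrapy built-in.
        return cls(crawler.settings.get("FIXED_PROXY_URL"))

    def process_request(self, request, spider):
        # Only set the proxy when configured and not already chosen per request.
        if self.proxy_url:
            request.meta.setdefault("proxy", self.proxy_url)
        return None  # returning None lets Scrapy continue processing the request
```

A middleware like this is enabled through the `DOWNLOADER_MIDDLEWARES` setting. For a single request, setting `meta={"proxy": ...}` on the request itself has the same effect.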
Scrapy is synonymous with powerful web scraping capabilities. It streamlines the entire process, from sending requests to parsing and storing data. Its efficiency and reliability make it an indispensable tool for extracting data from diverse websites. Whether you are a beginner or have 10 years of experience in the field, Scrapy gives you the tools needed to navigate large volumes of web data with ease. Its component-based architecture encourages customization, allowing developers to tailor their scraping workflows to specific project requirements.
In this tutorial, we learned the fundamentals of web crawling and scraping, and we used the Crawlbase API combined with Scrapy to keep our scraper undetected by sites that might block our requests. We also learned the basics of how to extract Amazon product pages from Amazon search result pages. You can adapt the example to any site that you require and start crawling thousands of pages anonymously with Scrapy and Crawlbase.
We hope you enjoyed this tutorial and we hope to see you soon in Crawlbase. Happy Crawling and Scraping!