Web scraping is one of the technologies that has helped the web grow into what it is today. This is especially true of search engines and other data-intensive web apps. Web scrapers are now both numerous and useful, largely thanks to the availability of open source web scraping libraries.
The web, and technology in general, has been shaped so deeply by open source projects that we can’t do without them. The same holds for web scraping: open source libraries are the way to go if you intend to build your own scraping tool.
With that in mind, we want to review the top 5 open source web scraping libraries available today. There are countless such libraries, and new ones keep popping up, but in this post we’ll review the ones we think are the best.
Below are the five best open source web scraping libraries to follow and use.
1. Osmosis
Osmosis, the NodeJS-based open source web scraping library by Rchipka on GitHub, isn’t the only JavaScript/NodeJS scraping library, but it’s one of the few that made our list of the five best. That’s because it has proven to be one of the best the industry has at the moment.
Features of Osmosis web scraping library:
HTML parser features:
- Fast parsing
- Very fast searching
- Small memory footprint
HTTP request features:
- Logs URLs, redirects, and errors
- Cookie jar and custom cookies/headers/user agent
- Form submission, session cookies
- Single proxy or multiple proxies, and handles proxy failure
- Retries and redirect limits
HTML DOM features:
- Load and search ajax content
- DOM interaction and events
- Execute embedded and remote scripts
- Execute code in the DOM
Some other features of Osmosis are:
- Uses native libxml C bindings.
- Doesn’t have large dependencies like jQuery, cheerio, or jsdom
- Has support for CSS 3.0 and XPath 1.0 selector hybrids
- And a lot more
Complete documentation and examples for Osmosis can be found on GitHub.
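To make the retry limit and proxy failover features listed above more concrete, here is a minimal Python sketch of the general idea. It is not Osmosis code and does not use the Osmosis API: the names `fetch_with_failover`, `FetchError`, and `flaky_fetch`, as well as the proxy labels, are hypothetical and exist only for illustration.

```python
from typing import Callable, Optional, Sequence


class FetchError(Exception):
    """Raised by a fetcher when a request through a proxy fails."""


def fetch_with_failover(
    url: str,
    proxies: Sequence[str],
    fetch: Callable[[str, str], str],
    max_tries: int = 3,
) -> str:
    """Try `url` through the proxies in turn, giving up after `max_tries` attempts."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    last_error: Optional[Exception] = None
    for attempt in range(max_tries):
        # Rotate to the next proxy on every attempt so that one dead proxy
        # does not consume the whole retry budget.
        proxy = proxies[attempt % len(proxies)]
        try:
            return fetch(url, proxy)
        except FetchError as exc:
            last_error = exc
    raise FetchError(f"gave up on {url} after {max_tries} tries") from last_error


def flaky_fetch(url: str, proxy: str) -> str:
    """Stand-in fetcher (hypothetical): only the proxy named 'proxy-b' works."""
    if proxy != "proxy-b":
        raise FetchError(f"{proxy} is unreachable")
    return f"<html>fetched {url} via {proxy}</html>"


if __name__ == "__main__":
    print(fetch_with_failover("http://example.com", ["proxy-a", "proxy-b"], flaky_fetch))
```

In this sketch the first attempt through `proxy-a` fails and the second attempt succeeds through `proxy-b`. A real scraper would replace `flaky_fetch` with an actual HTTP call and would probably add a pause between attempts.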
2. X-ray
X-ray, as its developer Matthew Mueller puts it, is the next web scraper that sees through the <html> noise. X-ray is also a JavaScript-based open source web scraping library, with the flexibility and other features that make it the go-to choice for many developers’ web scraping projects. Some of its features are:
- Flexible schema: X-ray has a flexible schema with support for strings, arrays, arrays of objects, and nested object structures.
- Composable: The X-ray API is completely composable, giving you great flexibility in how you scrape each web page.
- Pagination support: Paginate through websites, scraping each page. X-ray supports a request delay and a pagination limit. Pages scraped with X-ray can be streamed to a file, which gives you control over errors on individual scraped pages.
- Predictable flow: Scraping with X-ray starts on one page and moves easily to the next. The flow is predictable, following a breadth-first crawl through each of the web pages.
- Responsible: X-ray supports concurrency, throttles, delays, timeouts, and limits to keep your scraping responsible and well controlled.
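The breadth-first flow, pagination limit, and request delay described above can be sketched in a few lines of Python. This is only an illustration of the crawl order, not X-ray’s API: the `SITE` dictionary is a made-up in-memory stand-in for real pages, and `crawl_breadth_first` is a hypothetical helper.

```python
import time
from collections import deque
from typing import Callable, Dict, List

# Hypothetical in-memory "site": each page maps to the pages it links to.
SITE: Dict[str, List[str]] = {
    "/page/1": ["/page/2", "/about"],
    "/page/2": ["/page/3"],
    "/about": [],
    "/page/3": ["/page/4"],
    "/page/4": [],
}


def crawl_breadth_first(
    start: str,
    get_links: Callable[[str], List[str]],
    limit: int,
    delay: float = 0.0,
) -> List[str]:
    """Visit pages level by level, stopping after `limit` pages."""
    visited: List[str] = []
    seen = {start}
    queue = deque([start])
    while queue and len(visited) < limit:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
        if delay and queue and len(visited) < limit:
            time.sleep(delay)  # pause between requests to stay polite
    return visited


if __name__ == "__main__":
    print(crawl_breadth_first("/page/1", lambda url: SITE.get(url, []), limit=4))
```

With a limit of 4, the crawl visits `/page/1`, then both pages it links to (`/page/2` and `/about`), and only then goes one level deeper to `/page/3`. That level-by-level order is what makes the flow predictable.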
Check out X-ray on GitHub.
3. Nokogiri
Nokogiri is the first Ruby-based open source web scraping library on our list. According to its developers at Nokogiri.org, Nokogiri is an HTML, XML, SAX, and Reader parser that can search documents via XPath and CSS3 selectors.
Some of the many features that have made Nokogiri the choice of Ruby developers building web scrapers are:
- XML/HTML DOM parser that also handles broken HTML
- XML/HTML SAX parser
- XML/HTML Push parser
- XPath 1.0 and CSS3 support for document searching
- XML/HTML builder
- XSLT transformer
Check the Nokogiri website for the full tutorial and documentation.
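To give a feel for what searching a document with XPath looks like, here is a small sketch using Python’s standard `xml.etree.ElementTree` module. It is not Nokogiri, and ElementTree supports only a limited subset of XPath and does not repair broken HTML; the sample document and its URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A small, well-formed sample document (made up for this example).
DOC = """
<html>
  <body>
    <ul id="libs">
      <li class="js"><a href="/osmosis">Osmosis</a></li>
      <li class="js"><a href="/x-ray">X-ray</a></li>
      <li class="ruby"><a href="/nokogiri">Nokogiri</a></li>
    </ul>
  </body>
</html>
"""

root = ET.fromstring(DOC)

# XPath subset: every <a> directly inside an <li> whose class attribute is "js".
js_links = root.findall(".//li[@class='js']/a")
names = [a.text for a in js_links]
hrefs = [a.get("href") for a in js_links]

if __name__ == "__main__":
    print(names)
    print(hrefs)
```

The single expression `.//li[@class='js']/a` picks out only the two JavaScript entries, which is the kind of targeted lookup that XPath and CSS selectors make convenient in any scraping library.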
4. Scrapy
Scrapy is the most popular Python-based open source web scraping library. If you’ve done any web scraping, you have probably heard of Scrapy at some point. It is the number one choice of Python developers for web scraping, which is one more reason it’s on our list. The Scrapy project can be found on the Scrapy website and on GitHub.
With the Scrapy open source web scraping framework, you’ll be able to scrape the data you need from websites quickly and simply using Python.
Scrapy has a huge community around it.
A rundown of Scrapy’s features:
- Fast and powerful.
- Very big community.
- Ability to add new functions without having to touch the core.
- Portable: Scrapy is written in Python and runs on Linux, Windows, and BSD (Unix).
- Extensive documentation available online.
With Scrapy, all you need to concern yourself with is writing the scraping rules, while Scrapy does the rest of the job for you.
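The idea of writing only the rules while a generic engine does the rest can be illustrated with a small standard-library sketch. This is not Scrapy’s API and uses none of its classes: `RULES`, `RuleEngine`, `scrape`, and the sample `PAGE` are all hypothetical names invented for this example.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple

# Scraping "rules": output field -> (tag name, class attribute) to capture.
RULES: Dict[str, Tuple[str, str]] = {
    "title": ("h2", "title"),
    "price": ("span", "price"),
}

# Made-up sample page.
PAGE = (
    '<div><h2 class="title">Blue Widget</h2><span class="price"> $9.99 </span></div>'
    '<div><h2 class="title">Red Widget</h2><span class="price">$12.50</span></div>'
)


class RuleEngine(HTMLParser):
    """Generic engine: walks the HTML and applies whatever rules it is given."""

    def __init__(self, rules: Dict[str, Tuple[str, str]]) -> None:
        super().__init__()
        self.rules = rules
        self.results: Dict[str, List[str]] = {field: [] for field in rules}
        self._active: Optional[Tuple[str, str]] = None  # (field, tag) being captured

    def handle_starttag(self, tag, attrs):
        if self._active is not None:
            return  # already capturing; this sketch ignores nested matches
        css_class = dict(attrs).get("class")
        for field, (rule_tag, rule_class) in self.rules.items():
            if tag == rule_tag and css_class == rule_class:
                self._active = (field, tag)
                self.results[field].append("")
                break

    def handle_data(self, data):
        if self._active is not None:
            self.results[self._active[0]][-1] += data

    def handle_endtag(self, tag):
        if self._active is not None and tag == self._active[1]:
            field = self._active[0]
            self.results[field][-1] = self.results[field][-1].strip()
            self._active = None


def scrape(html: str, rules: Dict[str, Tuple[str, str]]) -> Dict[str, List[str]]:
    """Run the engine over `html` and return the captured text per field."""
    engine = RuleEngine(rules)
    engine.feed(html)
    engine.close()
    return engine.results


if __name__ == "__main__":
    print(scrape(PAGE, RULES))
```

Changing what gets scraped means editing only `RULES`; the engine stays untouched. A real framework adds much more on top of this (requests, scheduling, pipelines), and this sketch matches only exact class values on non-nested tags.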
5. Goutte
Goutte is the first PHP-based open source web scraping library on our list. While not as popular as the other libraries mentioned above, Goutte is a simple library built on PHP to make web scraping simpler. It is used for both web crawling and screen scraping.
Features of Goutte:
- Extracts data from HTML response.
- Extracts data from XML response.
- Nice API for web crawling.
- Compatible with multiple PHP versions.
For the complete tutorial, documentation, and technical info, check out the Goutte fork on GitHub.
Those are, in our view, the top 5 scraping libraries across different languages, though there are certainly more.
The good thing is that all of them work with Crawlbase (formerly ProxyCrawl), so regardless of the language or the library you choose, you will be able to use them without problems.