The internet is full of stories, many of them confusing, and here is a new one: there are not really whitehat and blackhat bots. Bots are automated programs that do something for us.
But what really is a good crawling bot? Does such a thing even exist? Let’s take a deeper look.
Crawling bots are software created to crawl and scrape websites in order to collect their content. For example, Google runs a crawling/scraping bot that is considered a whitelisted bot. Why? Good question. We might say that it’s whitelisted because it has been online for so many years and it provides some “good” content, so people are happy with it.
But does that mean your bot is a blackhat bot? Not really, and that’s why we said at the beginning that there are not really whitehat and blackhat bots.
Let’s take a look at what the internet community considers “whitehat crawlers”.
In the world of whitehat scrapers, we can distinguish 3 main categories:
- Search engine crawlers
- Commercial bots
- Feed crawlers
Search engine crawlers or scrapers are bots whose main purpose is to aggregate content in order to show it to their users in response to a search. The most common of them all is Google’s:
- Googlebot: The bot that provides content to Google, the most used search engine on the internet.
- Yahoo bot: A crawling bot that provides content 24/7 to the Yahoo search engine.
- Baiduspider: Baidu, the leading search engine in China, also has its own bot crawling the internet for fresh content to aggregate.
- Haosou 360 spider: The second most used search engine in China has its own bot as well.
- Yandex bot: Another whitehat bot created to scrape content for the Yandex search engine in Russia.
- MSN/Bing bot: The bot for the Bing search engine from Microsoft.
- Google AdsBot: Google makes money with ads, so it runs a bot that crawls its clients’ landing pages to check their quality and display proper ads.
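In practice, a website usually recognizes these crawlers by the User-Agent header they send. Here is a minimal sketch of that idea in Python. The token strings below are assumptions based on commonly seen User-Agent values, not something taken from this article, and the function name `identify_search_crawler` is just a hypothetical helper.

```python
# Substrings commonly seen in the User-Agent header of each crawler.
# These tokens are assumptions for illustration; check each vendor's
# documentation before relying on them.
SEARCH_ENGINE_TOKENS = {
    "AdsBot-Google": "Google AdsBot",
    "Googlebot": "Googlebot",
    "Slurp": "Yahoo bot",
    "Baiduspider": "Baiduspider",
    "360Spider": "Haosou 360 spider",
    "YandexBot": "Yandex bot",
    "bingbot": "MSN/Bing bot",
    "msnbot": "MSN/Bing bot",
}


def identify_search_crawler(user_agent):
    """Return the crawler name a User-Agent string claims to be, or None."""
    lowered = user_agent.lower()
    for token, name in SEARCH_ENGINE_TOKENS.items():
        # Case-insensitive substring match against the known tokens.
        if token.lower() in lowered:
            return name
    return None


if __name__ == "__main__":
    ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    print(identify_search_crawler(ua))
```

Keep in mind that a User-Agent is only a claim: any client can send any string, so a match here tells you what a visitor says it is, not what it really is.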
We can categorize commercial bots and crawlers as software that companies run to collect data not related to search engines, usually to provide a service that relies on that data. A clear example is Pinterest, which crawls the internet in search of data for its service.
- Pinterest bot: As previously mentioned, it scrapes the internet in search of content to feed its database, which users then share as photos and collections.
- SEMrush bot: SEMrush runs a bot to get data for its SEO tools, keyword research tools and graphs.
- Ahrefs bot: A scraper run by Ahrefs, a marketing and SEO tool used as a backlink checker by millions of users.
- Alexa bot: Alexa provides data and rankings for the internet, and it usually extracts that data from the web with its whitehat bot.
Feed crawlers are often confused with commercial bots, as they are also commercial, but the main difference is that they gather the content and offer it back to you later instead of building a service on that data. A clear example would be Twitter, which visits your site with the Twitter bot to get your site data and present it to the user as it is, without modification.
- Twitter bot: The famous bot from Twitter, which visits your site after someone shares a link to it and grabs enough information to display a small preview.
- Telegram bot: Whenever you share a link in Telegram, the famous chat application, the Telegram bot visits your site to get some metadata to present in the chats.
- Facebook Mobile app: A small bot created to fetch sites shared in the Facebook mobile app.
- FeedBurner: If you have used FeedBurner or any other RSS feed service in the past, you know that a feed fetcher needs to fetch the actual content of the website it’s trying to present. This is FeedBurner’s bot.
- Android Framework bot: The Android Runtime environment retrieves content for mobile apps, and this is the bot responsible for doing that.
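To give an idea of what "grabbing enough information for a small preview" can look like, here is a rough Python sketch that pulls the page title and Open Graph `og:` meta tags out of an HTML document. Reading Open Graph tags is an assumption on our side (it is one common convention); this article does not say which metadata each bot actually reads, and `PreviewParser` and `extract_preview` are hypothetical names.

```python
from html.parser import HTMLParser


class PreviewParser(HTMLParser):
    """Collect the page <title> and any Open Graph <meta> tags."""

    def __init__(self):
        super().__init__()
        self.preview = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            prop = attributes.get("property") or ""
            content = attributes.get("content")
            # Assumption: preview data lives in og:* properties.
            if prop.startswith("og:") and content:
                self.preview[prop] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Keep the first non-empty text found inside <title>.
        if self._in_title and data.strip():
            self.preview.setdefault("title", data.strip())


def extract_preview(html):
    """Return a dict with the title and og:* values found in the HTML."""
    parser = PreviewParser()
    parser.feed(html)
    return parser.preview
```

A real feed crawler would first download the page and then run something like `extract_preview` on the response body. The sketch skips the download on purpose so it stays self-contained.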
So after categorizing the bots and scraping scripts, we repeat that there is no real whitehat or blackhat, only what is known and accepted by people and other websites.
So we can say that all scraping and all bots are whitehat. It’s just a matter of whether sites allow you to scrape their content or block you from doing so. If you are being blocked, don’t forget that Crawlbase (formerly ProxyCrawl) is here to help.
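One common way a site states which bots it allows is a robots.txt file. As a small sketch, Python’s standard `urllib.robotparser` module can tell you whether a given bot may fetch a URL. The robots.txt content, the `MyCrawler` name and the `is_allowed` helper below are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot may fetch everything,
# every other bot is kept out of /private/.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Check a URL against robots.txt rules for the given bot name."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/private/page"
    print(is_allowed(ROBOTS_TXT, "Googlebot", url))
    print(is_allowed(ROBOTS_TXT, "MyCrawler", url))
```

Note that robots.txt is only a statement of what the site would like bots to do. Sites that really want to block a bot usually do it on the server side.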