Your web scraper had been working for days and then suddenly stopped. You looked deeper and found that the website you are scraping is blocking your proxy. You got frustrated and did not know what to do. It is all right, we have all been there. In this blog post we are going to shed some light on what you can do to unblock your proxy and continue scraping without being detected.
There are many ways that websites such as Amazon, Google, LinkedIn, and Zillow, to name a few, detect a web scraper.
Here are a few common mistakes you may have been making and need to stop.
Your requests leave the same footprint every time: you are sending the same request headers, the same user agent, and the same proxy over and over again. Websites commonly detect web scrapers by spotting such usage patterns. For instance, Google runs a CAPTCHA check to find out whether the visitor behind a given IP is a bot. Bots normally do not solve CAPTCHAs, although that is not the case for Crawlbase (formerly ProxyCrawl) bots, for which CAPTCHA solving is not an issue.
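To make the point about footprints concrete, here is a minimal Python sketch of varying the headers and the proxy per request. The header profiles, the proxy addresses, and the function name pick_request_profile are all hypothetical placeholders, not values from any real service; treat it as an illustration of the idea rather than a recipe that defeats detection.

```python
import random

# Hypothetical pool of browser-like header profiles. In practice you would
# fill this with header sets captured from browsers you want to resemble.
HEADER_PROFILES = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/120.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    },
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                      "AppleWebKit/605.1.15 (KHTML, like Gecko) "
                      "Version/17.0 Safari/605.1.15",
        "Accept-Language": "en-GB,en;q=0.8",
    },
]

# Hypothetical proxy endpoints (example.com is a placeholder domain).
PROXIES = [
    "http://proxy-a.example.com:8000",
    "http://proxy-b.example.com:8000",
    "http://proxy-c.example.com:8000",
]


def pick_request_profile(rng=random):
    """Pick a header set and a proxy independently at random.

    Returns a (headers, proxy) tuple. The headers dict is a copy, so the
    caller can add cookies or other fields without touching the pool.
    """
    headers = dict(rng.choice(HEADER_PROFILES))
    proxy = rng.choice(PROXIES)
    return headers, proxy
```

Each outgoing request would call pick_request_profile() and pass the result to whatever HTTP client you use, so consecutive requests no longer share one identical signature.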
Your requests trigger lots of redirects, or the pages your bot crawls contain mistakes and lead to 404 pages that never existed. If the sites you are requesting run on HTTPS, avoid sending requests to HTTP URLs, as those are likely to redirect. Sending them over and over again makes the site notice that you are running a bot, which is exactly where your bot does not want to go. The key is for your bot to stay anonymous; otherwise you have to keep replacing your proxy lists and removing whatever stops working, which becomes frustrating.
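The two habits described above, requesting the HTTPS URL directly and not hammering pages that return 404, can be sketched in a few lines of Python. This assumes the target site really does serve HTTPS on the default port; the names force_https and DeadUrlTracker are made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def force_https(url):
    """Rewrite http:// to https:// so the request skips the redirect hop.

    URLs with an explicit port are left untouched, because swapping the
    scheme there would likely point at the wrong service.
    """
    parts = urlsplit(url)
    if parts.scheme == "http" and parts.port is None:
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)


class DeadUrlTracker:
    """Remember URLs that answered 404 so the crawler does not retry them."""

    def __init__(self):
        self._dead = set()

    def record(self, url, status_code):
        # Only 404 marks a URL as dead; other statuses may be temporary.
        if status_code == 404:
            self._dead.add(url)

    def should_fetch(self, url):
        return url not in self._dead
```

A crawl loop would pass every URL through force_https() before requesting it, call record() with each response status, and check should_fetch() before queueing a URL again.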
If you do not want to maintain proxies, you can use the Crawlbase (formerly ProxyCrawl) smart backconnect proxy, which is designed to rotate on every request. And if you do not want to deal with proxies at all, or do not want your requests to carry your scraper's footprint, the Crawling API of Crawlbase (formerly ProxyCrawl) is there to help. It masks all the traffic, and you never have to see how the engineers behind the scenes get your requests past a website without being noticed.
This will never work. Do not do it.
curl -x proxy.crawlbase.com:9000 "https://www.similarweb.com/website/crawlbase.com"
The above curl command sends a request to SimilarWeb to fetch the data for the crawlbase.com website. It goes through a proxy: the option -x proxy.crawlbase.com:9000 points at a rotating backconnect proxy on port 9000, used here just as an example; you can use any proxy you have or any proxy service. The request will not return anything useful, even if you send proper request headers and cookies or use residential proxies. Try it from your home residential IP: remove the proxy option and see for yourself.
Your request is very likely to land on an ugly block page that you never want to see.
How to make it work?
You will need to connect the proxy to a headless browser, such as headless Chrome or headless Firefox, to name a few, that can run a real browser in a headless environment. Websites like SimilarWeb deny HTTP GET requests that do not match the footprint of a real browser. When the request itself gives you away like that, no proxy is going to help you.
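As a rough illustration of connecting a proxy to a headless browser, the Python sketch below builds a command line for headless Chrome and runs it. It assumes a Chrome binary named google-chrome is on your PATH (the name differs between systems) and that your Chrome version accepts the --headless, --proxy-server, and --dump-dom switches; the function names are invented for this example. A proxy that needs a username and password is not covered here, and a headless browser alone does not guarantee that a site will let you through.

```python
import subprocess


def headless_chrome_command(url, proxy, chrome_binary="google-chrome"):
    """Build the argument list for one headless Chrome page load.

    chrome_binary is an assumption: on your machine the executable may be
    called chromium, chrome.exe, or live at a full path.
    """
    return [
        chrome_binary,
        "--headless",                # run without a visible window
        f"--proxy-server={proxy}",   # route the browser traffic via the proxy
        "--dump-dom",                # print the rendered HTML to stdout
        url,
    ]


def fetch_rendered_html(url, proxy, timeout=60):
    """Run headless Chrome and return the rendered page as text."""
    result = subprocess.run(
        headless_chrome_command(url, proxy),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if Chrome exits with an error
    )
    return result.stdout
```

Calling fetch_rendered_html("https://www.similarweb.com/website/crawlbase.com", "proxy.crawlbase.com:9000") would send the same request as the curl example, but through a real browser engine instead of a bare HTTP client.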
In a case like the one above, you can use the Crawling API of Crawlbase (formerly ProxyCrawl) instead and change your curl request so that it goes through the API.
This works because the Crawling API does all the magic behind the scenes: it runs real browsers for you, retries errors if they happen, and gives you the data without you having to worry about the complexities the Crawlbase (formerly ProxyCrawl) bot handles in the background.
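For illustration only, here is how a Crawling API request URL might be put together in Python. The endpoint https://api.crawlbase.com/ and the token and url query parameters are assumptions on my part, as is the placeholder YOUR_TOKEN; check the Crawlbase documentation for the exact request format and for which kind of token a JavaScript-heavy site requires.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify it against the Crawlbase documentation.
API_ENDPOINT = "https://api.crawlbase.com/"


def crawling_api_url(target_url, token):
    """Build the API request URL for a page we want fetched.

    The target URL is percent-encoded so that its own slashes and query
    string cannot be confused with the API's parameters.
    """
    return API_ENDPOINT + "?" + urlencode({"token": token, "url": target_url})


if __name__ == "__main__":
    # YOUR_TOKEN is a placeholder, not a working credential.
    print(crawling_api_url(
        "https://www.similarweb.com/website/crawlbase.com", "YOUR_TOKEN"))
```

The resulting URL can be passed to curl in place of the SimilarWeb address, with no -x proxy option, since the API side takes care of the proxies and the browser.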
The result is now different: we get the data we need without risking our proxies getting blocked.