NFB

Reputation: 682

Using Scrapy on a Google cache of a website

Under the heading "Avoiding getting banned", the Scrapy documentation advises:

if possible, use Google cache to fetch pages, instead of hitting the sites directly

It refers to http://www.googleguide.com/cached_pages.html, which was last updated in 2011.

I'm attempting to do that to scrape a website that requires captchas I cannot get around. However, Google then creates the same problem.

I keep the spider on the Google cache version of the links using this middleware:

class GoogleCacheMiddleware(object):
    def process_request(self, request, spider):
        # Rewrite each request to its Google cache equivalent, unless it
        # already points at the cache
        if getattr(spider, 'use_google_cache', False) and 'googleusercontent' not in request.url:
            new_url = 'https://webcache.googleusercontent.com/search?q=cache:' + request.url
            # Returning a new Request makes Scrapy schedule it in place of the original
            return request.replace(url=new_url)
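
For completeness, the middleware only takes effect once it's registered in the settings; a minimal sketch (the myproject.middlewares path is a placeholder for wherever the class actually lives, and 900 is an arbitrary priority):

# In settings.py, or the spider's custom_settings; the module path is a placeholder
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.GoogleCacheMiddleware': 900,
}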

In the spider itself, I crawl politely with settings such as:

custom_settings = {
    'AUTOTHROTTLE_ENABLED': True,
    'CONCURRENT_REQUESTS': 2,  # or 1
    'DOWNLOAD_DELAY': 8  # increased this to as much as 10
}

I've also tried using Selenium on both the original site and the Google cached version of the site. This sometimes succeeds in crawling for a few minutes and returning data, but finally lands at https://support.google.com/websearch/answer/86640, which states that Google detects "Unusual traffic" from your computer network, and requires a captcha to proceed.

It appears the Scrapy documentation is simply in conflict with Google's terms of use. Am I correct? Either way, is there a recommended way to get around the captchas, or to scrape from a Google cache of a site despite this limitation?

UPDATE, 7-9-18:

When this spider runs several times over a week or more, it eventually yields complete or fuller results, evidently because the initially scraped URLs change on each crawl and succeed before the captcha kicks in. I'm still interested in a solution consistent with the documentation, or in a specific workaround.
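
In case it helps anyone trying the same approach: Scrapy's built-in HTTP cache should make these repeated runs cumulative by persisting successful responses between crawls, so each run only re-fetches what previously failed. A minimal sketch using the standard HTTPCACHE_* settings (the ignored status codes are my guess at likely captcha/ban responses):

custom_settings = {
    'HTTPCACHE_ENABLED': True,       # store responses on disk between runs
    'HTTPCACHE_EXPIRATION_SECS': 0,  # 0 means cached responses never expire
    'HTTPCACHE_IGNORE_HTTP_CODES': [403, 429, 503],  # don't cache likely captcha/ban pages
}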

Upvotes: 6

Views: 7129

Answers (1)

joker91

Reputation: 79

I am not well versed in Scrapy, but it seems the website must be blocking the cache view. Have you tried checking the cache with https://www.seoweather.com/google-cache-search/ ?

You can get around the Google blocking by using proxies, though, preferably back-connect proxies, as you'll need a lot of them when scraping Google.
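
From the Scrapy docs, its stock HttpProxyMiddleware picks the proxy up from request.meta, so a rotating-proxy downloader middleware might be as small as this (the proxy addresses are placeholders for whatever your provider gives you):

import random

# Placeholder pool; back-connect providers usually give you one endpoint
# that rotates for you, in which case the pool is a single entry.
PROXIES = [
    'http://user:pass@proxy1.example.com:8000',
    'http://user:pass@proxy2.example.com:8000',
]

class RandomProxyMiddleware(object):
    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honours this meta key
        request.meta['proxy'] = random.choice(PROXIES)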

Another option might be to try to scrape the https://archive.org/web/ version of a page. Actually, they even have an API you might be able to use: https://archive.org/help/wayback_api.php
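
For what it's worth, the availability endpoint of that API is simple to query; a minimal sketch using requests (the looked-up URL is just an example):

import requests

# Ask the Wayback Machine for the closest archived snapshot of a page
resp = requests.get('https://archive.org/wayback/available',
                    params={'url': 'example.com/some-page'})
closest = resp.json().get('archived_snapshots', {}).get('closest')
if closest and closest.get('available'):
    print(closest['url'])  # scrape this archived copy instead of the live site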

Upvotes: 2
