NFB

Reputation: 682

Using Scrapy on a Google cache of a website

Under the heading "Avoiding getting banned", the Scrapy documentation advises:

if possible, use Google cache to fetch pages, instead of hitting the sites directly

It refers to http://www.googleguide.com/cached_pages.html, which was last updated in 2011.

I'm attempting to do that to scrape a website that requires captchas I cannot get around. However, Google then creates the same problem.

I keep the spider on the Google cache version of the links using this middleware:

class GoogleCacheMiddleware(object):
    def process_request(self, request, spider):
        # Rewrite each request to its Google cache equivalent, unless it
        # already points at the cache
        if getattr(spider, 'use_google_cache', False) and 'googleusercontent' not in request.url:
            new_url = 'https://webcache.googleusercontent.com/search?q=cache:' + request.url
            # Returning a new Request makes Scrapy schedule it in place of the original
            return request.replace(url=new_url)
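
For completeness, the middleware only takes effect once it's registered in the settings; a minimal sketch (the myproject.middlewares path is a placeholder for wherever the class actually lives, and 900 is an arbitrary priority):

# In settings.py, or the spider's custom_settings; the module path is a placeholder
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.GoogleCacheMiddleware': 900,
}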

In the spider itself, I crawl politely with settings such as:

custom_settings = {
    'AUTOTHROTTLE_ENABLED': True,
    'CONCURRENT_REQUESTS': 2,  # or 1
    'DOWNLOAD_DELAY': 8  # increased this to as much as 10
}

I've also tried using Selenium on both the original site and the Google cached version of the site. This sometimes succeeds in crawling for a few minutes and returning data, but finally lands at https://support.google.com/websearch/answer/86640, which states that Google detects "Unusual traffic" from your computer network, and requires a captcha to proceed.

It appears the Scrapy documentation is simply in conflict with Google's terms of use. Am I correct? Either way, is there a recommended way to get around the captchas, or to scrape from a Google cache of a site despite this limitation?

UPDATE, 7-9-18:

When this spider runs several times over a week or more, it eventually yields complete or fuller results, evidently because the initially scraped URLs change on each crawl and succeed before the captcha kicks in. I'm still interested in a solution consistent with the documentation, or in a specific workaround.
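
In case it helps anyone trying the same approach: Scrapy's built-in HTTP cache should make these repeated runs cumulative by persisting successful responses between crawls, so each run only re-fetches what previously failed. A minimal sketch using the standard HTTPCACHE_* settings (the ignored status codes are my guess at likely captcha/ban responses):

custom_settings = {
    'HTTPCACHE_ENABLED': True,       # store responses on disk between runs
    'HTTPCACHE_EXPIRATION_SECS': 0,  # 0 means cached responses never expire
    'HTTPCACHE_IGNORE_HTTP_CODES': [403, 429, 503],  # don't cache likely captcha/ban pages
}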

Upvotes: 6

Views: 7129

Answers (1)

joker91

Reputation: 79

I am not well versed in Scrapy, but it seems the website must be blocking the cache view. Have you tried checking the cache with https://www.seoweather.com/google-cache-search/ ?

You can get around the Google blocking by using proxies, though, preferably back-connect proxies, as you'll need a lot of them when scraping Google.
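
From the Scrapy docs, its stock HttpProxyMiddleware picks the proxy up from request.meta, so a rotating-proxy downloader middleware might be as small as this (the proxy addresses are placeholders for whatever your provider gives you):

import random

# Placeholder pool; back-connect providers usually give you one endpoint
# that rotates for you, in which case the pool is a single entry.
PROXIES = [
    'http://user:pass@proxy1.example.com:8000',
    'http://user:pass@proxy2.example.com:8000',
]

class RandomProxyMiddleware(object):
    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honours this meta key
        request.meta['proxy'] = random.choice(PROXIES)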

Another option might be to try to scrape the https://archive.org/web/ version of a page. Actually, they even have an API you might be able to use: https://archive.org/help/wayback_api.php
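
For what it's worth, the availability endpoint of that API is simple to query; a minimal sketch using requests (the looked-up URL is just an example):

import requests

# Ask the Wayback Machine for the closest archived snapshot of a page
resp = requests.get('https://archive.org/wayback/available',
                    params={'url': 'example.com/some-page'})
closest = resp.json().get('archived_snapshots', {}).get('closest')
if closest and closest.get('available'):
    print(closest['url'])  # scrape this archived copy instead of the live site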

Upvotes: 2
