Reputation: 455
I am using latest scrapy version, v1.3
I crawl a webpage page by page, by following urls in pagination. In some pages, website detects that I use a bot and gives me an error in html. Since it is a successful request, it caches the page and when I run it again, I get the same error.
What I need is how can I prevent that page get into cache? Or if I cannot do that, I need to remove it from cache after I realize the error in parse method. Then I can retry and get the correct one.
I have a partial solution, I yield all requests with "dont_cache":False parameter in meta so I make sure they use cache. Where I detect the error and retry the request, I put dont_filter=True along with "dont_cache":True to make sure I get the fresh copy of the erroneous url.
def parse(self, response):
page = response.meta["page"] + 1
html = Selector(response)
counttext = html.css('h2#s-result-count::text').extract_first()
if counttext is None:
page = page - 1
yield Request(url=response.url, callback=self.parse, meta={"page":page, "dont_cache":True}, dont_filter=True)
I also tried a custom retry middleware, where I managed to get it working before cache, but I couldnt read the response.body successfully. I suspect that it is zipped somehow, as it is binary data.
class CustomRetryMiddleware(RetryMiddleware):
def process_response(self, request, response, spider):
with open('debug.txt', 'wb') as outfile:
outfile.write(response.body)
html = Selector(text=response.body)
url = response.url
counttext = html.css('h2#s-result-count::text').extract_first()
if counttext is None:
log.msg("Automated process error: %s" %url, level=log.INFO)
reason = 'Automated process error %d' %response.status
return self._retry(request, reason, spider) or response
return response
Any suggestion is appreciated.
Thanks
Mehmet
Upvotes: 3
Views: 3657
Reputation: 632
For Scrapy v2.8!
Change:
def is_cached_response_fresh(self, response, request):
To:
def is_cached_response_fresh(self, cachedresponse, request):
Another solution would be, to make a copy of 'RFC2616Policy' and add same logic from above code. i.e.,
def is_cached_response_fresh(self, cachedresponse, request):
if "refresh_cache" in request.meta:
# continue RFCRFC2616Policy code
return False
return True
def is_cached_response_valid(self, cachedresponse, response, request):
if "refresh_cache" in request.meta:
# continue RFCRFC2616Policy code
return False
return True
if "refresh_cache" in request.meta:
Checking 'refresh_cache' in meta before implementing RFCRFC2616 Policy should keep cache of specific links updated all the time while keeping old cache of remaining sections of the website.
Upvotes: 1
Reputation: 455
Thanks to mizhgun, I managed to develop a solution using custom policies.
Here is what I did,
from scrapy.utils.httpobj import urlparse_cached
class CustomPolicy(object):
def __init__(self, settings):
self.ignore_schemes = settings.getlist('HTTPCACHE_IGNORE_SCHEMES')
self.ignore_http_codes = [int(x) for x in settings.getlist('HTTPCACHE_IGNORE_HTTP_CODES')]
def should_cache_request(self, request):
return urlparse_cached(request).scheme not in self.ignore_schemes
def should_cache_response(self, response, request):
return response.status not in self.ignore_http_codes
def is_cached_response_fresh(self, response, request):
if "refresh_cache" in request.meta:
return False
return True
def is_cached_response_valid(self, cachedresponse, response, request):
if "refresh_cache" in request.meta:
return False
return True
And when I catch the error, (after caching occurred of course)
def parse(self, response):
html = Selector(response)
counttext = html.css('selector').extract_first()
if counttext is None:
yield Request(url=response.url, callback=self.parse, meta={"refresh_cache":True}, dont_filter=True)
When you add refresh_cache into meta, that can be catched in custom policy class.
Don't forget to add dont_filter otherwise second request will be filtered as duplicate.
Upvotes: 6
Reputation: 1887
Middleware responsible for requests/response caching is HttpCacheMiddleware
. Under the hood it is driven by the cache policies - special classes which dispatch what requests and responses should or shouldn't be cached. You can implement your own cache policy class and use it with the setting
HTTPCACHE_POLICY = 'my.custom.cache.Class'
More information in docs: https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
Source code of built-in policies: https://github.com/scrapy/scrapy/blob/master/scrapy/extensions/httpcache.py#L18
Upvotes: 6