How to cache only HTTP status 200 in Scrapy?

Question

I am using scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware to cache Scrapy requests. I'd like it to cache a response only if its status is 200. Is that the default behavior? Or do I need to set HTTPCACHE_IGNORE_HTTP_CODES to everything except 200?

Granitosaurus · Accepted Answer

By default HttpCacheMiddleware runs a DummyPolicy for the requests. It pretty much does nothing special on its own, so yes, you need to set HTTPCACHE_IGNORE_HTTP_CODES to everything except 200.
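As a rough sketch of what "everything except 200" could look like in a project's settings.py: the 100-599 range is my own assumption about which status codes are worth listing, not something Scrapy requires.

```python
# settings.py (sketch)

# Turn the HTTP cache on; it is disabled by default.
HTTPCACHE_ENABLED = True

# Assumption: list every status code from 100 to 599 except 200,
# so that DummyPolicy refuses to cache anything but a 200 response.
HTTPCACHE_IGNORE_HTTP_CODES = [code for code in range(100, 600) if code != 200]
```

This keeps the default policy and only changes its configuration, which is probably the least invasive option.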

Here's the source for the DummyPolicy; these are the lines that actually matter:

class DummyPolicy(object):

    def __init__(self, settings):
        self.ignore_http_codes = [int(x) for x in settings.getlist('HTTPCACHE_IGNORE_HTTP_CODES')]

    def should_cache_response(self, response, request):
        return response.status not in self.ignore_http_codes

So you can also extend this class and override should_cache_response() to check for 200 explicitly, i.e. return response.status == 200, and then set it as your cache policy via the HTTPCACHE_POLICY setting.
