Sam Lee

Reputation: 10463

How to cache only HTTP status 200 in Scrapy?

I am using scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware to cache Scrapy responses. I'd like it to cache a response only if its status is 200. Is that the default behavior? Or do I need to set HTTPCACHE_IGNORE_HTTP_CODES to everything except 200?

Upvotes: 3

Views: 1030

Answers (2)

bowman han

Reputation: 1135

No, that is not the default behavior, and you do not need to list every other code in HTTPCACHE_IGNORE_HTTP_CODES. Instead, write a custom CachePolicy and update settings.py to enable it. I put the CachePolicy class in middlewares.py:

from scrapy.extensions.httpcache import DummyPolicy


class CachePolicy(DummyPolicy):
    """Cache a response only when its HTTP status is 200."""

    def should_cache_response(self, response, request):
        return response.status == 200

Then append the following line to settings.py:

HTTPCACHE_POLICY = 'yourproject.middlewares.CachePolicy'

Upvotes: 3

Granitosaurus

Reputation: 21436

Yes, you need to configure it. By default, HttpCacheMiddleware runs DummyPolicy, which caches every response whose status is not listed in HTTPCACHE_IGNORE_HTTP_CODES. It does nothing special on its own, so you need to set HTTPCACHE_IGNORE_HTTP_CODES to everything except 200.
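Listing every non-200 code by hand is tedious, so it can be generated instead. This is only a minimal sketch for settings.py, and it assumes that every status your spider sees falls in the standard 100-599 range:

```python
# settings.py (sketch): ignore every status code except 200.
# Assumption: all statuses fall in the standard 100-599 range.
HTTPCACHE_IGNORE_HTTP_CODES = [code for code in range(100, 600) if code != 200]
```

A status outside that range would not be in the list and would therefore still be cached.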

Here's the source for DummyPolicy; these are the lines that actually matter:

class DummyPolicy(object):

    def __init__(self, settings):
        self.ignore_http_codes = [int(x) for x in settings.getlist('HTTPCACHE_IGNORE_HTTP_CODES')]

    def should_cache_response(self, response, request):
        return response.status not in self.ignore_http_codes

So in practice you can also subclass DummyPolicy and override should_cache_response() to check for 200 explicitly, i.e. return response.status == 200, and then set it as your cache policy via the HTTPCACHE_POLICY setting.
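To compare the two behaviors without running a crawl, here is a rough sketch that mirrors the logic using stand-in classes. It does not import Scrapy: FakeResponse, IgnoreListPolicy, and Only200Policy are hypothetical names made up for this illustration, and IgnoreListPolicy only imitates the DummyPolicy lines quoted above.

```python
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a Scrapy response; only .status is used."""
    status: int


class IgnoreListPolicy:
    """Imitates the quoted DummyPolicy logic: cache unless the code is ignored."""

    def __init__(self, ignore_http_codes):
        self.ignore_http_codes = [int(x) for x in ignore_http_codes]

    def should_cache_response(self, response, request):
        return response.status not in self.ignore_http_codes


class Only200Policy(IgnoreListPolicy):
    """Override that checks for status 200 explicitly."""

    def should_cache_response(self, response, request):
        return response.status == 200


# An empty ignore list corresponds to leaving HTTPCACHE_IGNORE_HTTP_CODES unset.
default_style = IgnoreListPolicy([])
strict = Only200Policy([])

for status in (200, 301, 404, 503):
    response = FakeResponse(status)
    print(
        status,
        default_style.should_cache_response(response, None),
        strict.should_cache_response(response, None),
    )
```

With an empty ignore list, the default-style policy agrees to cache every status, while the override agrees only for 200.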

Upvotes: 0
