Joey Coder

Reputation: 3519

scrapinghub: Difference between DeltaFetch and HTTPCACHE_ENABLED

I'm struggling to understand the difference between DeltaFetch and HttpCacheMiddleware. Don't both have the goal of only scraping pages I haven't requested before?

Upvotes: 0

Views: 233

Answers (2)

Fernando César

Reputation: 861

They have very different purposes:

  • HttpCacheMiddleware

Every time a new request is made, it fetches that data and saves it locally. Every time the same request is made again, the response is fetched from disk instead (a local cache).

This is very useful during development, when you will probably want to fetch the same page multiple times until your spider parses and saves the data you want correctly. With this feature you only fetch the page from the remote/origin server once.

However, if the data changes, you will be working with an old copy (which is usually fine for development purposes).

HttpCacheMiddleware docs
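
As an illustration, enabling the cache is just a matter of a few settings in settings.py. HTTPCACHE_ENABLED, HTTPCACHE_EXPIRATION_SECS, HTTPCACHE_DIR, HTTPCACHE_IGNORE_HTTP_CODES and HTTPCACHE_STORAGE are standard Scrapy settings; the values below are only illustrative:

    # settings.py -- illustrative values, adjust to taste
    HTTPCACHE_ENABLED = True            # turn the cache middleware on
    HTTPCACHE_EXPIRATION_SECS = 0       # 0 = cached responses never expire
    HTTPCACHE_DIR = 'httpcache'         # stored under the project's .scrapy dir
    HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503]  # don't cache server errors
    HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'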

  • DeltaFetch

DeltaFetch keeps a fingerprint of every request that has already been fetched and turned into an Item (or dict). If the spider outputs a request that has been seen before, it will be ignored.

This is useful in production, when a site has multiple links to the same content, as it avoids requesting duplicated items.

DeltaFetch assumes a 1-to-1 relation between requests/links and items. So if you're extracting multiple items from the same request this can be problematic, as the request will be skipped on later runs once the first item from it has been stored (this is a somewhat convoluted corner case).

DeltaFetch docs
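
To illustrate, enabling DeltaFetch means registering its spider middleware. The middleware path, order 100 and setting names below follow the scrapy-deltafetch README, but double-check them against the version you install:

    # settings.py -- sketch based on the scrapy-deltafetch README
    SPIDER_MIDDLEWARES = {
        'scrapy_deltafetch.DeltaFetch': 100,
    }
    DELTAFETCH_ENABLED = True       # turn the fingerprint store on
    DELTAFETCH_DIR = 'deltafetch'   # where the fingerprint database is kept
    # Per the README, passing the spider argument deltafetch_reset
    # (e.g. scrapy crawl myspider -a deltafetch_reset=1) wipes the
    # stored fingerprints so everything is fetched again.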

  • DUPEFILTER_CLASS (not mentioned but similar)

By default, Scrapy will not fetch duplicated requests. You can customize what counts as a "duplicate request". For instance, maybe the query part of a URL should be ignored when comparing requests.

DUPEFILTER_CLASS docs
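
As a rough sketch of that idea, you could plug in a custom dupefilter that strips the query string before fingerprinting. The class and module path below are hypothetical; RFPDupeFilter is Scrapy's default dupefilter, but check how request fingerprinting is wired in your Scrapy version:

    # myproject/dupefilters.py -- hypothetical module, minimal sketch
    from scrapy.dupefilters import RFPDupeFilter


    class QueryStrippingDupeFilter(RFPDupeFilter):
        """Treat two requests as duplicates even if their query strings differ."""

        def request_fingerprint(self, request):
            # Drop everything after '?' so /page?utm=1 and /page hash the same.
            stripped = request.replace(url=request.url.split('?')[0])
            return super().request_fingerprint(stripped)

    # settings.py
    # DUPEFILTER_CLASS = 'myproject.dupefilters.QueryStrippingDupeFilter'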

Upvotes: 1

Gallaecio

Reputation: 3857

The HTTP cache middleware saves the pages locally, so that the next time their URL is requested the response is loaded from disk instead of the network. You switch from network speed to disk speed.

After reading the README, I think scrapy-deltafetch does not load previous requests from disk, but instead ignores them completely.

If you crawl half a website, stop, and then resume the spider, the cache approach would parse all content from scratch the second time (previously visited pages would simply load faster), while DeltaFetch would only parse the remaining half.

Upvotes: 0
