Joey Coder

Reputation: 3519

scrapinghub: Difference between DeltaFetch and HTTPCACHE_ENABLED

I'm struggling to understand the difference between DeltaFetch and HttpCacheMiddleware. Don't both have the goal of only scraping pages I haven't requested before?

Upvotes: 0

Views: 233

Answers (2)

Fernando César

Reputation: 861

They have very different purposes:

  • HttpCacheMiddleware

Every time a new request is made, it fetches that data and saves it locally. Every time the same request is made again, the response is fetched from disk instead (a local cache).

This is very useful during development, when you will probably want to fetch the same page multiple times until your spider parses and saves the data you want correctly. With this feature you only fetch the page from the remote/origin server once.

However, if the data changes, you will be working with an old copy (which is usually fine for development purposes).

HttpCacheMiddleware docs
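
As an illustration, enabling the cache is just a matter of a few settings in settings.py. HTTPCACHE_ENABLED, HTTPCACHE_EXPIRATION_SECS, HTTPCACHE_DIR, HTTPCACHE_IGNORE_HTTP_CODES and HTTPCACHE_STORAGE are standard Scrapy settings; the values below are only illustrative:

    # settings.py -- illustrative values, adjust to taste
    HTTPCACHE_ENABLED = True            # turn the cache middleware on
    HTTPCACHE_EXPIRATION_SECS = 0       # 0 = cached responses never expire
    HTTPCACHE_DIR = 'httpcache'         # stored under the project's .scrapy dir
    HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503]  # don't cache server errors
    HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'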

  • DeltaFetch

DeltaFetch keeps a fingerprint of every request that has already been fetched and turned into an Item (or dict). If the spider outputs a request that has been seen before, it will be ignored.

This is useful in production, when a site has multiple links to the same content, as it avoids requesting duplicated items.

DeltaFetch assumes a 1-to-1 relation between requests/links and items. So if you're extracting multiple items from the same request this can be problematic, as the request will be skipped on later runs once the first item from it has been stored (this is a somewhat convoluted corner case).

DeltaFetch docs
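
To illustrate, enabling DeltaFetch means registering its spider middleware. The middleware path, order 100 and setting names below follow the scrapy-deltafetch README, but double-check them against the version you install:

    # settings.py -- sketch based on the scrapy-deltafetch README
    SPIDER_MIDDLEWARES = {
        'scrapy_deltafetch.DeltaFetch': 100,
    }
    DELTAFETCH_ENABLED = True       # turn the fingerprint store on
    DELTAFETCH_DIR = 'deltafetch'   # where the fingerprint database is kept
    # Per the README, passing the spider argument deltafetch_reset
    # (e.g. scrapy crawl myspider -a deltafetch_reset=1) wipes the
    # stored fingerprints so everything is fetched again.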

  • DUPEFILTER_CLASS (not mentioned but similar)

By default, Scrapy will not fetch duplicated requests. You can customize what counts as a "duplicate request". For instance, maybe the query part of a URL should be ignored when comparing requests.

DUPEFILTER_CLASS docs
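
As a rough sketch of that idea, you could plug in a custom dupefilter that strips the query string before fingerprinting. The class and module path below are hypothetical; RFPDupeFilter is Scrapy's default dupefilter, but check how request fingerprinting is wired in your Scrapy version:

    # myproject/dupefilters.py -- hypothetical module, minimal sketch
    from scrapy.dupefilters import RFPDupeFilter


    class QueryStrippingDupeFilter(RFPDupeFilter):
        """Treat two requests as duplicates even if their query strings differ."""

        def request_fingerprint(self, request):
            # Drop everything after '?' so /page?utm=1 and /page hash the same.
            stripped = request.replace(url=request.url.split('?')[0])
            return super().request_fingerprint(stripped)

    # settings.py
    # DUPEFILTER_CLASS = 'myproject.dupefilters.QueryStrippingDupeFilter'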

Upvotes: 1

Gallaecio

Reputation: 3857

The HTTP cache middleware saves the pages locally, so that the next time their URL is requested the response is loaded from disk instead of the network. You switch from network speed to disk speed.

After reading the README, I think scrapy-deltafetch does not load previous requests from disk, but instead ignores them completely.

If you crawl half a website, stop, and then resume the spider, the cache approach would parse all content from scratch the second time (previously visited pages would simply load faster), while DeltaFetch would only parse the remaining half.

Upvotes: 0
