Dupefilter in Scrapy-Redis not working as expected

Question

I'm interested in using Scrapy-Redis to store scraped items in Redis. In particular, the Redis-based request duplicates filter seems like a useful feature.

To start off, I adapted the spider at https://doc.scrapy.org/en/latest/intro/tutorial.html#extracting-data-in-our-spider as follows:

import scrapy
from tutorial.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    custom_settings = {'SCHEDULER': 'scrapy_redis.scheduler.Scheduler',
                       'DUPEFILTER_CLASS': 'scrapy_redis.dupefilter.RFPDupeFilter',
                       'ITEM_PIPELINES': {'scrapy_redis.pipelines.RedisPipeline': 300}}

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').extract_first()
            item['author'] = quote.css('small.author::text').extract_first()
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item

where I generated the project using scrapy startproject tutorial at the command line and defined QuoteItem in items.py as

import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

Basically, I've implemented the settings in the "Usage" section of the README in the settings per-spider and made the spider yield an Item object instead of a regular Python dictionary. (I figured this would be necessary to trigger the Item Pipeline).

Now, if I crawl the spider using scrapy crawl quotes from the command line and then do redis-cli, I see a quotes:items key:

127.0.0.1:6379> keys *
1) "quotes:items"

which is a list of length 20:

127.0.0.1:6379> llen quotes:items
(integer) 20

If I run scrapy crawl quotes again, the length of the list doubles to 40:

127.0.0.1:6379> llen quotes:items
(integer) 40

However, I would expect the length of quotes:items to still be 20, since I have simply re-scraped the same pages. Am I doing something wrong here?

R. Max · Accepted Answer

Scrapy-redis doesn't filter duplicate items automatically.

The (requests) dupefilter is about the requests in a crawl. What you want seems to be something similar to the deltafetch middleware: https://github.com/scrapy-plugins/scrapy-deltafetch

You would need to adapt deltafetch to work with a distributed storage, perhaps redis' bitmap feature will fit this case.

Dupefilter in Scrapy-Redis not working as expected

Answers (2)

Related Questions