Divick

Reputation: 1273

scrapy filtering duplicate requests

What is the difference between the Duplicate Filter which exists in the Scheduler and the IgnoreVisitedItems middleware?

Google group thread which suggests that there is a duplicate filter present in the Scheduler: http://groups.google.com/group/scrapy-users/browse_thread/thread/8e218bcc5b293532

Upvotes: 4

Views: 5172

Answers (1)

Pablo Hoffman

Reputation: 1540

The duplicate filter in the scheduler only filters out URLs already seen within a single spider run (meaning it gets reset on subsequent runs). The IgnoreVisitedItems middleware keeps state between runs and avoids visiting URLs seen in the past, but only for final item URLs, so that the rest of the site can be re-crawled (in order to find new items).
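As a rough illustration of that second approach, here is a minimal sketch of a spider middleware that persists scraped item URLs between runs. This is not the actual IgnoreVisitedItems code; the VISITED_URLS_FILE setting, the is_item_page meta flag, and the JSON state file are assumptions made for the example.

```python
import json
import os

from scrapy import Request, signals


class PersistentItemUrlFilterMiddleware:
    """Sketch of a spider middleware that skips item pages scraped in
    earlier runs, while still letting listing pages through so newly
    added items can be discovered."""

    def __init__(self, state_file):
        self.state_file = state_file
        self.visited = set()
        # load URLs remembered from previous runs, if any
        if os.path.exists(state_file):
            with open(state_file) as f:
                self.visited = set(json.load(f))

    @classmethod
    def from_crawler(cls, crawler):
        # VISITED_URLS_FILE is a made-up setting name for this sketch
        mw = cls(crawler.settings.get("VISITED_URLS_FILE", "visited_urls.json"))
        crawler.signals.connect(mw.spider_closed, signal=signals.spider_closed)
        return mw

    def spider_closed(self, spider):
        # persist the set so the next run can skip these item pages
        with open(self.state_file, "w") as f:
            json.dump(sorted(self.visited), f)

    def process_spider_output(self, response, result, spider):
        for entry in result:
            if isinstance(entry, Request):
                # only requests explicitly flagged as final item pages are filtered
                if entry.meta.get("is_item_page") and entry.url in self.visited:
                    spider.logger.debug("Skipping already-scraped item: %s", entry.url)
                    continue
                yield entry
            else:
                # an item came out of this response: remember its URL for next time
                self.visited.add(response.url)
                yield entry
```

You would enable something like this through the SPIDER_MIDDLEWARES setting, and the spider would set `meta={"is_item_page": True}` on requests that lead directly to item pages, so only those get filtered across runs.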

Upvotes: 13
