Reputation: 1273
What is the difference between the Duplicate Filter which exists in the Scheduler and the IgnoreVisitedItems middleware?
This Google Groups thread suggests that there is a duplicate filter in the Scheduler: http://groups.google.com/group/scrapy-users/browse_thread/thread/8e218bcc5b293532
Upvotes: 4
Views: 5172
Reputation: 1540
The duplicate filter in the scheduler only filters out URLs already seen within a single spider run, so it is reset on every subsequent run. The IgnoreVisitedItems middleware keeps state between runs and avoids visiting URLs seen in the past, but only for final item URLs, so that the rest of the site can be re-crawled to find new items.
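To make the distinction concrete, here is a rough sketch in plain Python. It does not use Scrapy's actual classes; `InMemoryDupeFilter`, `PersistentItemFilter`, `simulate_run`, and the JSON state file are all hypothetical names I made up for illustration, assuming the behaviour is as described above.

```python
import json
import os


class InMemoryDupeFilter:
    """Hypothetical stand-in for the scheduler's filter: state lives only in memory."""

    def __init__(self):
        self.seen = set()

    def should_skip(self, url):
        # Skip a URL only if it was already requested during this run.
        if url in self.seen:
            return True
        self.seen.add(url)
        return False


class PersistentItemFilter:
    """Hypothetical stand-in for the middleware: item URLs are remembered on disk."""

    def __init__(self, path):
        self.path = path
        self.visited = set()
        if os.path.exists(path):
            with open(path) as f:
                self.visited = set(json.load(f))

    def should_skip(self, url, is_item_page):
        # Listing pages are never skipped, so new items can still be discovered.
        return is_item_page and url in self.visited

    def mark_scraped(self, url):
        self.visited.add(url)
        with open(self.path, "w") as f:
            json.dump(sorted(self.visited), f)


def simulate_run(state_path, urls):
    """One 'spider run': a fresh in-memory filter plus the reloaded persistent one."""
    dupes = InMemoryDupeFilter()
    items = PersistentItemFilter(state_path)
    fetched = []
    for url, is_item_page in urls:
        if dupes.should_skip(url) or items.should_skip(url, is_item_page):
            continue
        fetched.append(url)
        if is_item_page:
            items.mark_scraped(url)
    return fetched
```

Calling `simulate_run` twice with the same state file should show both effects: a listing URL repeated within one run is fetched only once per run, but is fetched again on the next run, while an item URL scraped in the first run is skipped in the second.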
Upvotes: 13