Reputation: 3
I have created a scraper that downloads all files from a website and saves the download links in a JSON
file using an item pipeline. How can I prevent the scraper from downloading the same file again if its link is already in the JSON
file?
Upvotes: 0
Views: 655
Reputation: 2204
Great question! Doing this programmatically in a generic way is fairly involved: you would have to write your own middleware or customise RFPDupeFilter. But you are in luck. Another generic way to achieve exactly what you want is to pause and resume the crawl, a feature that is already implemented and tested.
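If you do want to handle it yourself, here is a minimal sketch of the JSON check, not a definitive implementation. It assumes the JSON file holds a list of objects and that each object stores its download link under a field called `link`; that field name and both function names are made up for illustration, so adapt them to what your pipeline actually writes.

```python
import json
from pathlib import Path


def load_seen_links(json_path, field="link"):
    """Return the set of links already recorded in the JSON file.

    Assumes the file holds a JSON list of objects; `field` is a
    hypothetical key name, not something the scraper is known to use.
    """
    path = Path(json_path)
    if not path.exists():
        # First run: nothing has been downloaded yet.
        return set()
    try:
        items = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        # An interrupted run can leave a truncated file; treat it as empty.
        return set()
    if not isinstance(items, list):
        return set()
    return {
        item[field]
        for item in items
        if isinstance(item, dict) and field in item
    }


def filter_new_links(candidates, seen):
    """Keep only links not seen before, preserving order.

    Each accepted link is added to `seen`, so repeats within the same
    run are dropped as well.
    """
    fresh = []
    for link in candidates:
        if link not in seen:
            seen.add(link)
            fresh.append(link)
    return fresh
```

The idea would be to call `load_seen_links` once when the spider starts and to yield download requests only for what `filter_new_links` returns. For the pause/resume route, the Scrapy documentation describes the `JOBDIR` setting, for example `scrapy crawl somespider -s JOBDIR=crawls/somespider-1`; running the same command again resumes the crawl using the state kept in that directory.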
Upvotes: 1