Reputation:
I would like to run Scrapy periodically to get all new content. Yielded items are stored in a database. What would be the best way to ensure that, when Scrapy crawls again, already yielded items are not stored as duplicates?
Would giving items a hash be a good way to verify this? I don't want to end up having duplicate items in my database.
Thanks!
Upvotes: 1
Views: 1263
Reputation: 421
If you are scraping items simultaneously across different crawls, then checking the DB for duplicates in an item pipeline, as Tomáš Linhart suggests, is a sensible choice.
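If you go the pipeline route, a minimal sketch could look like the one below. It hashes each item's fields and drops the item when that hash is already stored; the DuplicatesPipeline name, the SQLite file items.db and the seen table are placeholders here, and you would point it at whatever database actually holds your items.

import hashlib
import json
import sqlite3

from scrapy.exceptions import DropItem


class DuplicatesPipeline(object):
    """Drop items whose content hash has already been stored."""

    def open_spider(self, spider):
        # 'items.db' and the 'seen' table are illustrative names for this sketch.
        self.conn = sqlite3.connect('items.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS seen (fingerprint TEXT PRIMARY KEY)')

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        # Hash the item's fields in a stable order to get a fingerprint.
        fingerprint = hashlib.sha1(
            json.dumps(dict(item), sort_keys=True).encode('utf-8')).hexdigest()
        already_seen = self.conn.execute(
            'SELECT 1 FROM seen WHERE fingerprint = ?', (fingerprint,)).fetchone()
        if already_seen:
            raise DropItem('Duplicate item: %s' % fingerprint)
        self.conn.execute('INSERT INTO seen VALUES (?)', (fingerprint,))
        self.conn.commit()
        return item

You would then enable it in settings.py via ITEM_PIPELINES, e.g. {'myproject.pipelines.DuplicatesPipeline': 300} (the module path is hypothetical).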
Otherwise, I think performing the deduplication at the Scrapy level is a better alternative. For example, the community provides scrapy-deltafetch, a spider middleware that filters out duplicates on incremental (delta) crawls.
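Enabling the plugin amounts to registering the middleware and switching it on in settings.py; the snippet below follows the scrapy-deltafetch README, with the optional storage directory shown only as a comment.

# settings.py -- after pip install scrapy-deltafetch
SPIDER_MIDDLEWARES = {
    'scrapy_deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True
# Optional: directory where DeltaFetch keeps its fingerprint database
# (otherwise it uses the project's .scrapy data directory).
# DELTAFETCH_DIR = 'deltafetch'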
DeltaFetch works by intercepting every Item and Request object generated in spider callbacks. For Items, it computes the fingerprint of the request that produced the item and stores it in a local database. For Requests, DeltaFetch computes the request fingerprint and drops the request if it already exists in the database.
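To make the "fingerprint" idea concrete, here is a small illustration using Scrapy's own request_fingerprint utility (the pre-2.7 API); DeltaFetch computes something equivalent internally, so this is only to show the concept:

from scrapy import Request
from scrapy.utils.request import request_fingerprint

# Two Request objects for the same URL hash to the same fingerprint,
# which is what gets stored to recognise already-crawled pages.
r1 = Request('https://example.com/article/1')
r2 = Request('https://example.com/article/1')
print(request_fingerprint(r1) == request_fingerprint(r2))  # True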
After you install and configure the plugin properly, run the crawler and take a look at the stats that Scrapy logs at the end. You will see deltafetch counters showing how many requests were skipped and how many new fingerprints were stored:
2017-12-25 16:36:17 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'deltafetch/skipped': 88,
'deltafetch/stored': 262,
'downloader/request_count': 286,
'finish_reason': 'finished',
...
'item_scraped_count': 262,
...
}
Upvotes: 2