Reputation: 1
I am just starting with Scrapy and trying to develop a project where I scrape news links from websites. For example, I would like to scrape the news from iltalehti.fi, say every 5 minutes. Since each crawl will return duplicates, how do I avoid those duplicates being stored in my database? The end result should be a database containing only unique entries, never the same news link twice (or 200 times, if I run the crawler every 5 minutes).
Any help is more than welcome, and please note that I know very little about Python!
Upvotes: 0
Views: 686
Reputation: 4378
Scrapy uses item pipelines to do extra processing (validating and filtering) on the data scraped from websites.
You can write a pipeline that checks items for uniqueness and drops those that are duplicates.
Here is an example from the Scrapy docs:
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        # Keeps the ids of all items seen during this crawl.
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            # Drop any item whose id has already been processed.
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
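Note that ids_seen lives in memory, so it only prevents duplicates within a single crawl; when the spider restarts every 5 minutes, the set starts empty again. To avoid storing the same news link twice across runs, the dedup state has to be persistent, for example in the database itself. Below is a minimal sketch using SQLite; the class name, the news_links.db file name, and the item's 'link' field are assumptions for illustration, not something Scrapy provides:

import sqlite3

from scrapy.exceptions import DropItem

class SqliteDeduplicationPipeline(object):

    def open_spider(self, spider):
        # Assumed local database file; swap in your own storage.
        self.connection = sqlite3.connect('news_links.db')
        # The PRIMARY KEY constraint makes the database itself
        # reject a link that has already been stored.
        self.connection.execute(
            'CREATE TABLE IF NOT EXISTS links (url TEXT PRIMARY KEY)')

    def close_spider(self, spider):
        self.connection.commit()
        self.connection.close()

    def process_item(self, item, spider):
        # INSERT OR IGNORE silently skips rows that violate the
        # PRIMARY KEY constraint, i.e. links we have seen before.
        cursor = self.connection.execute(
            'INSERT OR IGNORE INTO links (url) VALUES (?)',
            (item['link'],))
        if cursor.rowcount == 0:
            # rowcount 0 means the insert was ignored: duplicate link.
            raise DropItem("Duplicate link found: %s" % item['link'])
        self.connection.commit()
        return item

Because the uniqueness check is enforced by the database, this works no matter how often the crawler runs. Remember to enable the pipeline in ITEM_PIPELINES in your project's settings.py.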
More info on pipelines is available in the Scrapy Item Pipeline documentation.
Upvotes: 2