Hessu

Reputation: 1

When storing Scrapy results in a database, how do I avoid storing duplicates?

I am just starting with Scrapy and trying to develop a project where I scrape news links from websites. For example, I would like to scrape the news from iltalehti.fi, say every 5 minutes. Since each crawl will return links that earlier crawls already found, how do I keep those duplicates from being stored in my database? The end result should be a database where each news link appears only once, not twice (or 200 times, if I run the crawler every 5 minutes).

Any help is more than welcome, and please note that I know very little about Python!

Upvotes: 0

Views: 686

Answers (1)

asimhashmi

Reputation: 4378

Scrapy uses item pipelines for extra processing (such as validating and filtering) of the data scraped from websites.

You can write a pipeline that checks each item for uniqueness and drops the items that are duplicates.

Here is an example from the Scrapy docs:

from scrapy.exceptions import DropItem

class DuplicatesPipeline:

    def __init__(self):
        # IDs seen so far; this set lives in memory for one crawl only
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            # Raising DropItem stops the item from reaching later pipelines
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
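The set in the example above is kept in memory, so it starts empty on every run. For a crawler that is restarted every 5 minutes, the check probably has to live in the database itself. The following is only a rough sketch of that idea, not something from the Scrapy docs: it assumes SQLite, an item field called `url`, an optional `title` field, and a table named `links`, all of which are made-up names for illustration. The fallback `DropItem` class is there only so the sketch runs without Scrapy installed.

```python
import sqlite3

try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch can run without Scrapy installed.
    class DropItem(Exception):
        pass


class SqliteDedupPipeline:
    """Stores each link once. The 'links' table and the 'url' and
    'title' fields are assumptions made for this sketch."""

    def __init__(self, db_path="news.db"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called by Scrapy when the spider starts.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS links "
            "(url TEXT PRIMARY KEY, title TEXT)"
        )

    def close_spider(self, spider):
        # Called by Scrapy when the spider finishes.
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # The PRIMARY KEY on url makes the database reject repeats;
        # INSERT OR IGNORE turns that rejection into a no-op.
        cur = self.conn.execute(
            "INSERT OR IGNORE INTO links (url, title) VALUES (?, ?)",
            (item["url"], item.get("title")),
        )
        if cur.rowcount == 0:
            # Nothing was inserted, so this url was already stored.
            raise DropItem("Duplicate link: %s" % item["url"])
        self.conn.commit()
        return item
```

A pipeline only runs once it is listed in the `ITEM_PIPELINES` setting in the project's settings.py, for example under a path such as `myproject.pipelines.SqliteDedupPipeline` (the project name here is hypothetical). Other databases have their own equivalent of a unique constraint, so the same approach should carry over.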

More info on pipelines is available in the Scrapy documentation on item pipelines.

Upvotes: 2
