Matt

Reputation: 85

Scrapy: how to prevent duplicate data from being inserted into the database

Can someone help me out with this? I'm a little new to Scrapy/Python. I can't seem to prevent duplicate data from being inserted into the database. For example, if my database already has the price of a Mazda at $4000, I don't want the spider to insert the crawled data again if the car already exists, or if the same price-and-car pair exists.

price | car
-------------
$4000 | Mazda   <----
$3000 | Mazda 3 <----
$4000 | BMW
$4000 | Mazda 3 <---- I don't want two results like this
$4000 | Mazda   <---- or like this

Any help will be greatly appreciated - thanks!
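
To make the intent concrete, replaying the table's rows through a set keyed on the car name shows which rows would be kept and which dropped (illustration only, plain Python; tracking the car alone also covers the price-and-car case, since a repeated pair necessarily repeats the car):

rows = [('$4000', 'Mazda'), ('$3000', 'Mazda 3'),
        ('$4000', 'BMW'), ('$4000', 'Mazda 3'), ('$4000', 'Mazda')]
seen = set()
for price, car in rows:
    if car in seen:
        print('drop %s | %s' % (price, car))   # the two marked duplicates
    else:
        seen.add(car)
        print('keep %s | %s' % (price, car))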


pipeline.py
-------------------
from twisted.enterprise import adbapi
from scrapy.exceptions import DropItem
import MySQLdb
import MySQLdb.cursors

----------------------------------
When I add this piece of code, the crawled data does not get saved; when I remove it, the data does save into the database.



class DuplicatesPipeline(object):

    def __init__(self):
        self.car_seen = set()

    def process_item(self, item, spider):
        if item['car'] in self.car_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.car_seen.add(item['car'])
            return item
 --------------------------------------  
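
One limitation worth noting: car_seen lives in memory, so it only catches duplicates within a single crawl; anything stored by a previous run will be re-inserted. A minimal sketch of seeding the set from rows already in the database when the spider opens (assuming the same test database and data table used below):

from scrapy.exceptions import DropItem
import MySQLdb

class DuplicatesPipeline(object):

    def __init__(self):
        self.car_seen = set()

    def open_spider(self, spider):
        # Seed with cars already stored so duplicates are dropped
        # across runs, not just within one crawl.
        conn = MySQLdb.connect(db='test', user='root', passwd='test',
                               charset='utf8')
        cursor = conn.cursor()
        cursor.execute("SELECT car FROM data")
        for (car,) in cursor.fetchall():
            self.car_seen.add(car)
        conn.close()

    def process_item(self, item, spider):
        if item['car'] in self.car_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.car_seen.add(item['car'])
        return item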

class MySQLStorePipeline(object):

    def __init__(self):
        self.dbpool = adbapi.ConnectionPool('MySQLdb',
            db='test',
            user='root',
            passwd='test',
            cursorclass=MySQLdb.cursors.DictCursor,
            charset='utf8',
            use_unicode=False
        )

    def _conditional_insert(self, tx, item):
        # Only insert rows that actually carry a price.
        if item.get('price'):
            tx.execute(
                "INSERT INTO data (price, car) VALUES (%s, %s)",
                (item['price'], item['car'])
            )

    def process_item(self, item, spider):
        # runInteraction runs the insert in a thread pool and returns
        # a Deferred; the item is passed on to later pipelines either way.
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        return item
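
An alternative (or belt-and-braces complement) is to enforce uniqueness in MySQL itself, so the database rejects duplicates no matter what the spider does. A sketch, assuming car and price are VARCHAR columns and using a hypothetical key name uniq_car_price:

    # Run once against the test database (MySQL DDL):
    #   ALTER TABLE data ADD UNIQUE KEY uniq_car_price (car, price);

    def _conditional_insert(self, tx, item):
        # INSERT IGNORE silently skips rows that would violate the
        # unique key instead of raising an error.
        if item.get('price'):
            tx.execute(
                "INSERT IGNORE INTO data (price, car) VALUES (%s, %s)",
                (item['price'], item['car'])
            )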



settings.py
------------
SPIDER_MODULES = ['car.spiders']
NEWSPIDER_MODULE = 'car.spiders'
ITEM_PIPELINES = ['car.pipelines.MySQLStorePipeline'] 

Upvotes: 3

Views: 2286

Answers (1)

Matt

Reputation: 85

Found the problem: make sure DuplicatesPipeline runs first. In the ITEM_PIPELINES dict, lower numbers run earlier, so giving DuplicatesPipeline 100 and MySQLStorePipeline 200 drops duplicates before the insert happens.

settings.py
ITEM_PIPELINES = {
    'car.pipelines.DuplicatesPipeline': 100,
    'car.pipelines.MySQLStorePipeline': 200,
}

Upvotes: 3
