station

Reputation: 7145

Crawler produces duplicates when run twice?

I use the crawler framework Scrapy in Python, and I use the pipelines.py file to store my items in JSON format in a file. The code for doing this is given below:

import json

class AYpiPipeline(object):
    def __init__(self):
        self.file = open("a11ypi_dict.json", "ab+")

    # This method is called to process an item after it has been scraped.
    def process_item(self, item, spider):
        d = {}
        i = 0
        # Here we are iterating over the scraped items and creating a dictionary of dictionaries.
        try:
            while i < len(item["foruri"]):
                d.setdefault(item["foruri"][i], {}).setdefault(item["rec"][i], {})[item["foruri_id"][i]] = item['thisurl'] + ":" + item["thisid"][i]
                i += 1
        except IndexError:
            print "Index out of range"
        # Writing it to a file
        json.dump(d, self.file)
        return item

The problem is that when I run my crawler twice (say), I get duplicate scraped items in my file. I tried to prevent this by reading the file first and matching its contents against the new data to be written. Since the data read from the file is in JSON format, I decoded it with the json.loads() function, but it doesn't work:

import json

class AYpiPipeline(object):
    def __init__(self):
        self.file = open("a11ypi_dict.json", "ab+")
        self.temp = json.loads(self.file.read())

    # This method is called to process an item after it has been scraped.
    def process_item(self, item, spider):
        d = {}
        i = 0
        # Here we are iterating over the scraped items and creating a dictionary of dictionaries.
        try:
            while i < len(item["foruri"]):
                d.setdefault(item["foruri"][i], {}).setdefault(item["rec"][i], {})[item["foruri_id"][i]] = item['thisurl'] + ":" + item["thisid"][i]
                i += 1
        except IndexError:
            print "Index out of range"
        # Write it to the file only if the newly generated data is not already there.
        if d != self.temp:
            json.dump(d, self.file)
        return item

Please suggest a method to do this.

Note: I have to open the file in "append" mode since I may crawl a different set of links, but running the crawler twice with the same start_url ends up writing the same data to the file twice.

Upvotes: 1

Views: 977

Answers (1)

rmalouf

Reputation: 3443

You can filter out duplicates by using some custom middleware, e.g., this. To actually use this in your spider, though, you'll need two more things: some way of assigning ids to items so that the filter can identify duplicates, and some way of persisting the set of visited ids between spider runs. The second is easy -- you could use something Pythonic like shelve, or you could use one of the many key-value stores that are popular these days. The first part is going to be harder, though, and will depend on the problem you're trying to solve.
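As a minimal sketch of the persistence idea, here is a duplicate-dropping pipeline that keeps the ids it has seen in a shelve file so they survive between runs. The make_item_id() helper and the "seen_items.db" filename are hypothetical placeholders, not part of your code or the Scrapy API -- you would build the id from whatever fields make one of your items unique (for example thisurl together with foruri).

    import shelve

    from scrapy.exceptions import DropItem


    def make_item_id(item):
        # Hypothetical helper: build a stable, hashable id from the item's fields.
        return item["thisurl"] + "|" + "|".join(item["foruri"])


    class DuplicatesPipeline(object):
        def __init__(self):
            # The shelf lives on disk, so ids recorded in one run are still
            # known the next time the crawler is started.
            self.seen = shelve.open("seen_items.db")

        def process_item(self, item, spider):
            item_id = make_item_id(item)
            if item_id in self.seen:
                # Already written in this run or an earlier one: drop it.
                raise DropItem("Duplicate item: %s" % item_id)
            self.seen[item_id] = True
            self.seen.sync()
            return item

        def close_spider(self, spider):
            self.seen.close()

If you list this pipeline in the ITEM_PIPELINES setting ahead of your JSON-writing pipeline, items that were already seen should never reach AYpiPipeline.process_item, so nothing gets written twice.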

Upvotes: 1
