Reputation: 7145
I use the crawler framework Scrapy in Python, and I use the pipelines.py file to store my scraped items in JSON format in a file. The code for doing this is given below:

import json
class AYpiPipeline(object):
    def __init__(self):
        self.file = open("a11ypi_dict.json", "ab+")

    # This method is called to process an item after it has been scraped.
    def process_item(self, item, spider):
        d = {}
        i = 0
        # Iterate over the scraped items and build a dictionary of dictionaries.
        try:
            while i < len(item["foruri"]):
                d.setdefault(item["foruri"][i], {}).setdefault(item["rec"][i], {})[item["foruri_id"][i]] = item['thisurl'] + ":" + item["thisid"][i]
                i += 1
        except IndexError:
            print "Index out of range"
        # Write it to the file.
        json.dump(d, self.file)
        return item
The problem is that when I run my crawler twice (say), the file ends up containing duplicate scraped items. I tried to prevent this by reading from the file first and matching the existing data against the new data to be written. Since the data read from the file is JSON, I decoded it with json.loads(), but it doesn't work:
import json

class AYpiPipeline(object):
    def __init__(self):
        self.file = open("a11ypi_dict.json", "ab+")
        self.temp = json.loads(self.file.read())

    # This method is called to process an item after it has been scraped.
    def process_item(self, item, spider):
        d = {}
        i = 0
        # Iterate over the scraped items and build a dictionary of dictionaries.
        try:
            while i < len(item["foruri"]):
                d.setdefault(item["foruri"][i], {}).setdefault(item["rec"][i], {})[item["foruri_id"][i]] = item['thisurl'] + ":" + item["thisid"][i]
                i += 1
        except IndexError:
            print "Index out of range"
        # Write to the file only if the newly generated data doesn't match what is already there.
        if d != self.temp:
            json.dump(d, self.file)
        return item
Please suggest a method to do this.
Note: I have to open the file in "append" mode since I may crawl a different set of links, but running the crawler twice with the same start_url ends up writing the same data to the file twice.
Upvotes: 1
Views: 977
Reputation: 3443
You can filter out duplicates by using some custom middleware, e.g., this. To actually use this in your spider, though, you'll need two more things: some way of assigning ids to items so that the filter can identify duplicates, and some way of persisting the set of visited ids between spider runs. The second is easy -- you could use something pythonic like shelve, or you could use one of the many key-value stores that are popular these days. The first part is going to be harder, though, and will depend on the problem you're trying to solve.
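To make that concrete, here is a minimal sketch of a duplicate-dropping item pipeline (not the linked middleware itself). It persists the set of seen item keys in a shelve file so that a second run with the same start_url drops items that were already written. The key derivation in make_item_key() (built here, hypothetically, from the 'thisurl' and 'thisid' fields of your item) and the seen_items.db filename are assumptions you would adapt; it also assumes a Scrapy version where pipelines can raise DropItem and define close_spider.

import shelve

from scrapy.exceptions import DropItem


class DuplicatesPipeline(object):
    def __init__(self):
        # shelve persists the seen keys on disk, so they survive between crawler runs.
        self.seen = shelve.open("seen_items.db")

    def make_item_key(self, item):
        # Hypothetical key: combine whatever fields make an item unique for you.
        return "%s:%s" % (item["thisurl"], ",".join(item["thisid"]))

    def process_item(self, item, spider):
        key = self.make_item_key(item)
        if key in self.seen:
            # Already written in a previous run (or earlier in this run): skip it.
            raise DropItem("Duplicate item: %s" % key)
        self.seen[key] = True
        self.seen.sync()
        return item

    def close_spider(self, spider):
        self.seen.close()

You would enable it through the ITEM_PIPELINES setting, ordered before your JSON-writing pipeline, so that duplicates are dropped before they ever reach the file.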
Upvotes: 1