Reputation: 35
I'm trying to crawl a website. In my crawler I store the crawled persons in person_set and the queue of persons to crawl next in parse_queue. At the start of each person's crawl, I need to write these two data structures to a file so that I can continue later if crawling is interrupted by an exception or a bad connection.
I have three Python files: a main file, a spider, and a person model. Main instantiates the spider; the spider starts parsing and calls write and read when necessary. The person file contains the class Person, which is the model for storing person data.
I'm having problems reading back the data I wrote. I checked many questions about this error and it seems to be an import problem. But even though I imported the Person class into both main and spiders, it still gives me an error. It seems that the emergency_read method is not affected by my top-level import.
main.py
from spiders import Spider
from person import Person
import pickle

def main():
    ....
    spider = Spider("seed_input")
    spider.parse(client)
spiders.py
import pickle
from person import Person

class Spider:
    def __init__(self, filename):
        self.person_set = Set()
        self.file_to_seed(filename)
        for seed_url in self.seed_list:
            self.seed_to_parse_queue(seed_url)

    def parse(self, client):
        if os.path.exists('tmp.person_set'):
            print "Program wasnt ended properly, continuing from where it left"
            self.emergency_read()
        ... starts parsing

    def emergency_write(self):
        if os.path.exists('tmp.person_set'):
            self.delete_emergency_files()
        with open('tmp.person_set', 'wb') as f:
            pickle.dump(self.person_set, f)
        with open('tmp.parse_queue', 'wb') as f:
            pickle.dump(self.parse_queue, f)

    def emergency_read(self):
        with open('tmp.person_set', 'rb') as f:
            self.person_set = pickle.load(f)
        with open('tmp.parse_queue', 'rb') as f:
            self.parse_queue = pickle.load(f)
person.py
class Person:
    def __init__(self, name):
        self.name = name
        self.friend_set = Set()
        self.profile_url = ""
        self.id = 0
        self.color = "Grey"
        self.parent = None
        self.depth = 0

    def add_friend(self, friend):
        self.friend_set.add(friend)

    def __repr__(self):
        return "Person(%s, %s)" % (self.profile_url, self.name)

    def __eq__(self, other):
        if isinstance(other, Person):
            return ((self.profile_url == other.profile_url) and (self.name == other.name))
        else:
            return False

    def __ne__(self, other):
        return (not self.__eq__(other))

    def __hash__(self):
        return hash(self.__repr__())
Stacktrace
python main.py
Program wasnt ended properly, continuing from where it left
Traceback (most recent call last):
  File "main.py", line 47, in <module>
    main()
  File "main.py", line 34, in main
    spider.parse(client)
  File "/home/ynscn/py-workspace/lll/spiders.py", line 39, in parse
    self.emergency_read()
  File "/home/ynscn/py-workspace/lll/spiders.py", line 262, in emergency_read
    self.person_set = pickle.load(f)
  File "/usr/lib/python2.7/pickle.py", line 1378, in load
    return Unpickler(file).load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1198, in load_setitem
    dict[key] = value
  File "/home/ynscn/py-workspace/lll/person.py", line 30, in __hash__
    return hash(self.__repr__())
  File "/home/ynscn/py-workspace/lll/person.py", line 18, in __repr__
    return "Person(%s, %s)" % (self.profile_url, self.name)
AttributeError: Person instance has no attribute 'profile_url'
Upvotes: 0
Views: 1520
Reputation: 35247
Your code might serialize as is if you use dill instead of pickle. dill can pickle class objects, instances, methods, and attributes… and most everything in Python. dill can also store dynamically modified state for classes and class instances. I agree that it seems to be a pickle load error, as @nofinator points out. However, dill might let you get around it.
Probably even better: if you want to force an order for load and unload, you could try adding __getstate__ and __setstate__ methods.
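A minimal sketch of that idea follows, with one swap that should be stated plainly: it adds __reduce__ next to __setstate__. The assumption is that Person objects reference each other in cycles (A is in B's friend_set and B is in A's). In that case __setstate__ alone runs too late, because a Person can be inserted into another set before its own state has been restored. The constructor taking profile_url, the tuple-based __hash__, and the use of the built-in set in place of Set are illustrative changes, not the asker's code.

```python
import pickle


class Person(object):
    # Hypothetical variant of the question's Person: profile_url is a
    # constructor argument so it can be restored at creation time.
    def __init__(self, name, profile_url=""):
        self.name = name
        self.profile_url = profile_url
        self.friend_set = set()

    def add_friend(self, friend):
        self.friend_set.add(friend)

    def __repr__(self):
        return "Person(%s, %s)" % (self.profile_url, self.name)

    def __eq__(self, other):
        return (isinstance(other, Person)
                and self.profile_url == other.profile_url
                and self.name == other.name)

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        return hash((self.profile_url, self.name))

    def __reduce__(self):
        # The fields used by __hash__ travel as constructor arguments, so
        # they exist as soon as the object is recreated. The remaining
        # attributes follow as ordinary state.
        return (Person, (self.name, self.profile_url), self.__dict__.copy())

    def __setstate__(self, state):
        # Runs after the object already has name and profile_url.
        self.__dict__.update(state)


alice = Person("alice", "/alice")
bob = Person("bob", "/bob")
alice.add_friend(bob)
bob.add_friend(alice)  # cycle: each is in the other's friend_set

restored = pickle.loads(pickle.dumps(set([alice, bob])))
print(sorted(p.name for p in restored))
```

The design choice here is that only data without back-references (two strings) goes into the constructor arguments, so the hash is usable before any set insertion happens.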
Upvotes: 0
Reputation: 3023
Pickle does not guarantee that an instance's attributes are restored before that instance is used elsewhere in the object graph. This error happens during the load, before the Person.profile_url attribute has been deserialized. Notice that it fails during load_setitem, which means it is probably loading the friend_set attribute, which is a set, and inserting a Person into it.
Your custom __repr__() relies on an instance attribute, and your custom __hash__() (which is called when a Person is inserted into a set during unpickling) relies on __repr__().
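The failure can be reproduced in a few lines. This is a sketch under assumptions rather than the asker's actual data: it assumes two persons who are in each other's friend_set, uses the built-in set in place of Set, and is written for Python 3, so the exact error wording differs from the 2.7 traceback in the question.

```python
import pickle


class Person(object):
    def __init__(self, name):
        self.name = name
        self.friend_set = set()
        self.profile_url = ""

    def __repr__(self):
        return "Person(%s, %s)" % (self.profile_url, self.name)

    def __eq__(self, other):
        return isinstance(other, Person) and repr(self) == repr(other)

    def __hash__(self):
        # Needs profile_url and name to exist already.
        return hash(repr(self))


alice = Person("alice")
bob = Person("bob")
alice.friend_set.add(bob)
bob.friend_set.add(alice)  # cycle between the two objects

data = pickle.dumps(alice)  # dumping succeeds

error = None
try:
    # While loading, alice is created empty, then bob's friend_set is
    # rebuilt and alice is added to it before her attributes are set.
    pickle.loads(data)
except AttributeError as exc:
    error = exc

print(type(error).__name__)
```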
My recommendation is to use Python's default __hash__ method. Would that work?
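A minimal sketch of what that could look like, assuming it is acceptable for two Person objects to count as the same only when they are the same object. The custom __eq__ is dropped together with __hash__ so that equality and hashing stay consistent (on Python 3, defining __eq__ without __hash__ would make the class unhashable). The built-in set stands in for Set, and the profile_url constructor argument is illustrative.

```python
import pickle


class Person(object):
    def __init__(self, name, profile_url=""):
        self.name = name
        self.profile_url = profile_url
        self.friend_set = set()

    def add_friend(self, friend):
        self.friend_set.add(friend)

    def __repr__(self):
        return "Person(%s, %s)" % (self.profile_url, self.name)

    # No __eq__ or __hash__: the defaults are identity-based and do not
    # touch any attribute, so they work on a half-restored instance.


alice = Person("alice", "/alice")
bob = Person("bob", "/bob")
alice.add_friend(bob)
bob.add_friend(alice)

restored = pickle.loads(pickle.dumps(set([alice, bob])))
names = sorted(p.name for p in restored)
print(names)
```

The trade-off is that a set no longer removes duplicates by profile URL, so the crawler would need another way to detect a person it has already seen, for example a dict keyed by profile_url.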
Upvotes: 2