Reputation: 304
I have a dict with 20,000 keys and a total size of about 150 MB. I dump the dict to disk via pickle every hour and load the pickle file back on program startup. Here is the gist of the writing code:
import copy
import pickle

# self.cache is the dict
cache_copy = copy.deepcopy(self.cache)
pickle.dump(cache_copy, cache_file, pickle.HIGHEST_PROTOCOL)
Sometimes I get the following error:
    cache_copy = copy.deepcopy(self.cache)
  File "/usr/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python2.7/copy.py", line 256, in _deepcopy_dict
    for key, value in x.iteritems():
RuntimeError: dictionary changed size during iteration
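As far as I can tell, deepcopy iterates over the dict internally while my other threads keep writing to it. Here is a minimal sketch (hypothetical, not my real code) that reproduces the error on CPython 2.7; it is timing-dependent, so it may take a few runs to trigger:

import threading

d = dict((i, i) for i in xrange(1000000))

def writer():
    # Insert keys while the main thread is still iterating
    for i in xrange(1000000, 2000000):
        d[i] = i

t = threading.Thread(target=writer)
t.start()
try:
    for key, value in d.iteritems():  # the same loop _deepcopy_dict runs
        pass
except RuntimeError as e:
    print e  # dictionary changed size during iteration
t.join()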
What is the best way to do this? I really want to avoid thread locking, as it makes the code complex. If it is truly necessary, the locking should be as minimal and simple as possible and still allow some concurrency. There are several constraints in my code that could help in this direction.
Upvotes: 1
Views: 176
Reputation: 931
To avoid locking the dict during pickling, create a list that duplicates the dict:
# When you do this:
cache["new_key"] = "new_value"
# Also do this:
cache_list.append(("new_key", "new_value"))
And pickle the list instead.
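For example (a sketch with placeholder names; cache_list is the mirror list and cache.pkl the dump file): since dict() keeps the last value seen for a repeated key, appending a pair on every write means the rebuilt dict holds each key's most recent value.

import cPickle as pickle  # the C implementation is much faster on Python 2

# Dump the list of (key, value) pairs instead of the dict
with open("cache.pkl", "wb") as f:
    pickle.dump(cache_list, f, pickle.HIGHEST_PROTOCOL)

# On startup, rebuild the dict from the pairs; later pairs
# overwrite earlier ones, so updated keys get their latest value
with open("cache.pkl", "rb") as f:
    cache = dict(pickle.load(f))

Note that deletions are not captured this way: a key removed from the dict still comes back from the list, so if you delete keys you would have to record that as well.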
Preferably, append to cache_file instead of overwriting it. This way, you can empty cache_list after each write to the file (Python 2 lists have no clear() method, so use del cache_list[:]), avoiding wasted memory and disk writes. But there can be a bug: if some thread writes to the list right after the pickling, that value gets cleared away with the rest. Maybe you're OK with losing some values if this is just a cache; if not, use some locks, or just don't clear the list. I was incorrect below about doubling the memory usage: the list doesn't deepcopy the data, it only stores 20,000 tuples, each holding 2 references.
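Concretely, that could look like the sketch below (placeholder file name; assuming cache_list is a plain list). pickle happily writes several dumps back to back into one file, and pickle.load raises EOFError once the file is exhausted:

import cPickle as pickle

# Hourly flush: append one more pickle to the file
with open("cache.pkl", "ab") as f:
    pickle.dump(cache_list, f, pickle.HIGHEST_PROTOCOL)
del cache_list[:]  # empty the list (Python 2 has no list.clear())

# Startup: read back every appended chunk until end of file
cache = {}
with open("cache.pkl", "rb") as f:
    while True:
        try:
            cache.update(pickle.load(f))  # update() accepts an iterable of pairs
        except EOFError:
            break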
You need to lock all writes if you want to iterate over, copy, or pickle your dict. However, if you really don't want to lock the threads that use the dict, and you don't mind doubling your memory usage, I can suggest keeping a list of key-value pairs (which duplicates your dict), a lock flag for that list, and a queue of writes that arrive while the list is locked:
# When you do this:
cache["new_key"] = "new_value"
# Also do this:
if not list_is_locked:
    cache_list.append(("new_key", "new_value"))
else:
    # Don't wait for the list to unlock; queue the write immediately
    queue.append(("new_key", "new_value"))
And when the pickling time comes, you lock and pickle that list instead of the dict:
list_is_locked = True
# ... pickle cache_list here ...
list_is_locked = False
cache_list.extend(queue)
del queue[:]  # assuming queue is a plain list; Python 2 has no list.clear()
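Be aware that the boolean flag leaves a small check-then-act race: a thread can test list_is_locked just before it is set and still append during the pickling, and late queue appends can interleave with the extend/clear pair. If that matters to you, a real threading.Lock held only for an O(1) swap keeps the locking about as minimal as it gets (a sketch, not part of the scheme above):

import threading

list_lock = threading.Lock()  # guards only appends and the swap below

# Writer side: an O(1) append under the lock
with list_lock:
    cache_list.append(("new_key", "new_value"))

# Pickling side: swap in a fresh list, then pickle the old one
# with no lock held, so writers are never blocked for long
with list_lock:
    snapshot = cache_list
    cache_list = []
pickle.dump(snapshot, cache_file, pickle.HIGHEST_PROTOCOL)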
I'm not really familiar with pickle, but if it's possible to append to cache_file instead of overwriting it, you could also clear the list after pickling it. That would help to avoid that huge memory burden.
Upvotes: 1