Reputation: 215
I have produced a list of dictionaries of 8100000 bytes, with 9 million+ elements. Each element is a dictionary of 32 key-value pairs, though the same set of keys is used in every element.
I want to save it for future analysis. I tried dill.dump, but it took so long (more than 1 hour) that I had to interrupt the kernel. This is supposed to be fast and easy, right?
Here is what I have tried:
import dill
output_file=open('result_list', 'wb')
dill.dump(result_list, output_file)
output_file.close()
I also tried pickle with bzip2 compression:
import bz2
import pickle
output_file=bz2.BZ2File('result_list', 'w')
pickle.dump(result_list, output_file)
But that ran into a memory error.
Any tips on making this feasible and less time consuming? Thanks!
Upvotes: 1
Views: 1081
Reputation: 35217
I'm the dill author. You may want to try klepto for this case. dill (actually any serializer) will treat the entire dict as a single object... and for something of that size, you might want to treat it more like a database of entries... which is what klepto can do. The fastest approach is probably to use the archive that treats each entry as a different file in a single directory on disk:
>>> import klepto
>>> x = range(10000)
>>> d = dict(zip(x,x))
>>> a = klepto.archives.dir_archive('foo', d)
>>> a.dump()
The above makes a directory with 10000 subdirectories, each holding one entry. Keys and values are both stored. Note you can tweak the serialization method a bit as well, so check the docs to see how to do that for your custom case.
Alternately, you could iterate over the dict, and serialize each of the entries with dump inside a parallel map from a multiprocess.Pool.
(Side note, I'm the author of multiprocess and klepto as well).
UPDATE: as the question was changed from serializing a huge dict, to serializing a huge list of small dicts... this changes the answer.
klepto is built for large dict-like structures, so it's probably not what you want then. You may want to try dask, which is built for large array-like structures.
I think you could also iterate over the list, serializing each of the list entries individually... and as long as you loaded them in the same order, you'd be able to reconstitute your results. You could do something like store the position with the value, so that you can restore the list and then sort if they got out of order.
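As a rough illustration of the entry-by-entry idea, here is a minimal sketch that uses the standard-library pickle rather than dill (a swap made only to keep the example dependency-free). The function names dump_entries and load_entries are made up for this example. Each entry is pickled on its own and appended to a single file, so the whole list is never serialized as one object, and reading the file back yields the entries in the order they were written.

```python
import pickle


def dump_entries(entries, path):
    """Pickle each entry separately, one after another, into a single file."""
    with open(path, "wb") as f:
        for entry in entries:
            pickle.dump(entry, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_entries(path):
    """Yield the entries back one at a time, in the order they were written."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                # End of file reached: every pickled entry has been read.
                return
```

Calling list(load_entries(path)) rebuilds the full list, while looping over load_entries(path) directly processes one dict at a time without holding everything in memory.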
I'd also ask you to think about whether your results could be restructured into a better form...
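One possible restructuring, offered only as a sketch: since the question says every element uses the same set of keys, the keys could be stored once and each element reduced to a tuple of values. The helper names to_table and from_table are hypothetical, and the sketch assumes every dict has exactly the keys of the first one. Whether this helps enough for your data is something you would need to measure.

```python
def to_table(records):
    """Split a list of same-keyed dicts into (keys, rows of values)."""
    if not records:
        return [], []
    keys = list(records[0])  # assumption: all records share these keys
    rows = [tuple(rec[k] for k in keys) for rec in records]
    return keys, rows


def from_table(keys, rows):
    """Rebuild the original list of dicts from (keys, rows)."""
    return [dict(zip(keys, row)) for row in rows]
```

The (keys, rows) pair can then be serialized in place of the list of dicts, and from_table restores the original structure after loading.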
Upvotes: 6