O.rka

Reputation: 30687

How to speed up dill serialization to store Python object to file

According to the documentation, sys.getsizeof() reports sizes in bytes. I'm trying to store a data structure that is a dictionary of class instances and lists. Calling sys.getsizeof() on this dictionary returned 3352 bytes. I'm serializing it with dill so I can load it later, but it's taking a very long time.

The file is already 260 MB, far larger than the 3352 bytes reported by sys.getsizeof(). Does anyone know why the values differ and why it takes so long to store?

Is there a more efficient way to store objects like this on a MacBook Air with 4 GB of memory?

dill is an incredible tool, but I'm not sure whether there are any parameters I can tweak to help with my low-memory issue. I know pickle has protocol=2, but pickle doesn't seem to store the environment as well as dill does.

import sys, dill
sys.getsizeof(D_storage_Data)  # Output is 3352
with open("storage.obj", "wb") as f:  # close the file once the dump finishes
    dill.dump(D_storage_Data, f)

Upvotes: 0

Views: 2029

Answers (2)

Mike McKerns

Reputation: 35217

I'm the dill author. See my comment on the question "If Dill file is too large for RAM is there another way it can be loaded". In short, the answer is that it depends on what you are pickling, and if it's class instances, the answer is yes. Try the byref setting. Also, if you are looking to store a dict of objects, you might want to map your dict to a directory of files by using klepto. That way you can dump and load elements of the dict individually and still work through a dict API.

So, especially when using dill, and especially in an ipynb, check out dill.settings. Serialization (dill, pickle, or otherwise) recursively pulls objects into the pickle, and so can often pull in all of globals. Use dill.settings to change what is stored by reference and what is stored by pickling.
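The directory-of-files idea mentioned above can be illustrated with a minimal sketch. This is not klepto's API: `DirDict` is a hypothetical helper written only for illustration, it uses the standard-library pickle module as a stand-in, and it assumes the keys are strings that are safe to use as file names.

```python
import pickle
import tempfile
from pathlib import Path


class DirDict:
    """Hypothetical helper: a dict-like store keeping one pickle file per key."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, key):
        # Assumption: keys are strings that are valid file names.
        return self.directory / f"{key}.pkl"

    def __setitem__(self, key, value):
        # Only this one element is serialized and written.
        with open(self._path(key), "wb") as f:
            pickle.dump(value, f)

    def __getitem__(self, key):
        # Only this one element is read back into memory.
        try:
            with open(self._path(key), "rb") as f:
                return pickle.load(f)
        except FileNotFoundError:
            raise KeyError(key) from None

    def keys(self):
        return sorted(p.stem for p in self.directory.glob("*.pkl"))


store = DirDict(tempfile.mkdtemp())   # temporary directory for the demo
store["a"] = list(range(255))
store["b"] = {"name": "example"}
loaded = store["b"]                   # loads b.pkl without touching a.pkl
```

Because each element lives in its own file, no single dump or load has to hold the whole dictionary in memory at once. If pickle cannot serialize the objects, dill.dump and dill.load could be swapped in, since they take the same (object, file) arguments.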

Upvotes: 4

RobertB

Reputation: 1929

Watch this:

>>> import pickle, sys
>>> x = [i for i in range(255)]
>>> sys.getsizeof(x)
2216
>>> d = {1: x}
>>> sys.getsizeof(d)
288
>>> s = pickle.dumps(d)  # Dill is similar, I just don't have it installed on this computer
>>> sys.getsizeof(s)
557

The size of 'd' is just the size of the dict object itself (its internal table of keys and the overall structure of the dict) along with a pointer to 'x'. It does not include the size of 'x' at all. The exact numbers vary with the Python version and platform.

When you serialize 'd', however, the serializer has to store both 'd' and 'x' so that it can rebuild a meaningful dict later. This is why your file is bigger than the byte count from your sys.getsizeof() call. And as you can see, the serializer actually does a good job of packing it up.
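To see the gap between the shallow size and everything the dict refers to, here is a rough sketch. `deep_getsizeof` is a hypothetical helper, not a standard-library function. It assumes the data consists only of dicts, lists, tuples, sets, and plain class instances, and the total is only an estimate.

```python
import pickle
import sys


def deep_getsizeof(obj, seen=None):
    """Roughly sum sys.getsizeof over obj and everything it references."""
    if seen is None:
        seen = set()
    if id(obj) in seen:               # count shared objects only once
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        for key, value in obj.items():
            size += deep_getsizeof(key, seen) + deep_getsizeof(value, seen)
    elif isinstance(obj, (list, tuple, set, frozenset)):
        for item in obj:
            size += deep_getsizeof(item, seen)
    elif hasattr(obj, "__dict__"):    # plain class instances
        size += deep_getsizeof(vars(obj), seen)
    return size


x = list(range(255))
d = {1: x}

shallow = sys.getsizeof(d)            # the dict alone, excluding x
deep = deep_getsizeof(d)              # the dict plus x plus every int in x
pickled = len(pickle.dumps(d))        # bytes a serializer would write

print(shallow, deep, pickled)
```

The exact numbers depend on the Python version and platform, but the deep total is always larger than the shallow one, and it is the deep structure that ends up in the file.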

Upvotes: 4
