Reputation: 30687
The documentation says that the output of sys.getsizeof() is in bytes. I'm trying to store a data structure that is a dictionary of class instances and lists. I called sys.getsizeof() on this dictionary of class instances and it was 3352 bytes. I'm serializing it with dill so I can load it later, but it's taking a really, really long time.
The file size is already 260 MB, which is much larger than the 3352 bytes reported by sys.getsizeof(). Does anyone know why the values are different, and why storing is taking so long?
Is there a more efficient way to store objects like this when running on a 4GB memory Mac Air?
It's an incredible tool. I'm not sure if there are any parameters I can tweak to help with my low-memory issue. I know pickle has protocol=2, but it doesn't seem to store the environment as well as dill does.
sys.getsizeof(D_storage_Data) #Output is 3352
dill.dump(D_storage_Data,open("storage.obj","wb"))
Upvotes: 0
Views: 2029
Reputation: 35217
I'm the dill author. See my comment here: If Dill file is too large for RAM is there another way it can be loaded. In short, the answer is that it depends on what you are pickling… and if it's class instances, the answer is yes. Try the byref setting. Also, if you are looking to store a dict of objects, you might want to map your dict to a directory of files by using klepto -- that way you can dump and load individual elements of the dict individually, and still work out of a dict API.
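klepto itself isn't shown here, but the idea it implements -- mapping a dict to a directory of per-key files so entries can be dumped and loaded one at a time -- can be sketched with just the standard library. The DirDict name and file layout below are my own illustration, not klepto's actual API:

```python
import os
import pickle


class DirDict:
    """Minimal dict-like store that keeps each value in its own pickle
    file, so elements can be dumped and loaded individually instead of
    serializing the whole dict at once. (Illustrative only; klepto's
    real dir_archive is far more capable.)"""

    def __init__(self, path):
        self.path = path
        os.makedirs(path, exist_ok=True)

    def _file(self, key):
        return os.path.join(self.path, f"{key}.pkl")

    def __setitem__(self, key, value):
        # Only this one entry is written to disk.
        with open(self._file(key), "wb") as f:
            pickle.dump(value, f)

    def __getitem__(self, key):
        # Only this one entry is read back into memory.
        with open(self._file(key), "rb") as f:
            return pickle.load(f)

    def keys(self):
        return [name[:-4] for name in os.listdir(self.path)
                if name.endswith(".pkl")]


store = DirDict("storage_dir")
store["a"] = list(range(1000))
store["b"] = {"nested": True}
print(store["b"])  # loads one small file, not the whole store
```

Working this way keeps peak memory low on a 4 GB machine, because you never hold (or serialize) the entire dict at once.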
So especially when using dill, and especially in an ipynb, check out dill.settings... Serialization (dill, pickle, or otherwise) recursively pulls referenced objects into the pickle, and so can often pull in all of globals. Use dill.settings to change what is stored by reference and what is stored by pickling.
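For example (assuming dill is installed; the Point class and the dict here are just for illustration), flipping the byref setting changes whether class definitions travel inside the pickle or are looked up by name at load time:

```python
import dill


class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y


data = {f"p{i}": Point(i, i) for i in range(100)}

# Default (byref=False): dill pickles the class definition by value,
# carrying it along with the instances.
dill.settings['byref'] = False
by_value = dill.dumps(data)

# byref=True: classes are pickled by reference (as plain pickle does),
# which can shrink the payload and speed things up -- but the class
# must be importable when you load the file later.
dill.settings['byref'] = True
by_ref = dill.dumps(data)

print(len(by_value), len(by_ref))
```

Note that dill.settings is global state, so reset it if other code in the same process also pickles.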
Upvotes: 4
Reputation: 1929
Watch this:
>>> import sys, pickle
>>> x = [i for i in range(255)]
>>> sys.getsizeof(x)
2216
>>> d = { 1 : x }
>>> sys.getsizeof(d)
288
>>> s = pickle.dumps(d) # Dill is similar, I just don't have it installed on this computer
>>> sys.getsizeof(s)
557
The size of 'd' is just the size of the dict object itself (its internal structure plus a pointer to each key and value, including a pointer to 'x'). It does not include the size of 'x' at all.
When you serialize 'd', however, the pickler has to serialize both 'd' and 'x' so that it can de-serialize into a meaningful dict later. That is why your file is bigger than the number your sys.getsizeof() call reports. And as you can see, the serializer actually does a good job of packing things up.
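You can see the same effect by totaling sizes recursively. The helper below is a rough sketch of a "deep" getsizeof (it handles the common containers and uses a seen-set to count shared objects once; it is not a complete accounting for arbitrary objects):

```python
import pickle
import sys


def deep_getsizeof(obj, seen=None):
    """Rough recursive size: sys.getsizeof of the object plus
    everything it references, counting each object only once."""
    seen = set() if seen is None else seen
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size


x = list(range(255))
d = {1: x}
print(sys.getsizeof(d))      # shallow: the dict structure only
print(deep_getsizeof(d))     # includes x and all 255 ints
print(len(pickle.dumps(d)))  # bytes actually written when serializing
```

The shallow number is what the question measured; the deep number is closer to what a serializer actually has to walk and write out.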
Upvotes: 4