Reputation: 1023
I have a numpy array arr
(produced from multiple nested lists of mismatched lengths), which apparently takes only
sys.getsizeof(arr)/(1000*1000)
0.33848
MB of space in memory. However, when I attempt to save this data to disk with
myf=open('.\\test.npy', 'wb')
np.save(myf, arr)
myf.close()
the produced file test.npy
turns out to be over 100 MB.
Why is that? Did I make some mistake when measuring the actual data size in Python memory? Or, if not, is there some way to save the data more efficiently, so that it takes up only close to 0.33848 MB on the hard drive?
EDIT:
As requested in the comments, here are some more properties of arr
arr.shape
(14101, 6)
arr.dtype
dtype('O')
arr.itemsize
4
arr.nbytes
338424
Even though the dtype claims to be dtype('O')
, the array only contains numerical values (integers and floats). Perhaps the object dtype arises because of the mismatched dimensions of the nested lists?
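As a quick sanity check on that guess (with a toy ragged list here, not my real data), building an array from nested lists of unequal lengths does seem to fall back to an object array, and the size figures then only count the per-element pointers, not the Python objects they point to:
import sys
import numpy as np

# toy stand-in for my real nested lists (hypothetical data)
ragged = [[1, 2, 3], [4.5, 6.7]]

# ragged input can't form a rectangular numeric array, so numpy stores
# references to Python objects instead (newer numpy versions require
# dtype=object to be passed explicitly for ragged input)
a = np.array(ragged, dtype=object)

print(a.dtype)           # object
print(a.nbytes)          # size of the pointer buffer only
print(sys.getsizeof(a))  # array header + pointer buffer, not the referenced objects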
Upvotes: 1
Views: 402
Reputation: 231375
Make an array composed of several arrays:
In [98]: arr = np.array([np.ones(10), np.zeros((200,300)),np.arange(1000).reshape(100,10)],object)
Total memory use:
In [100]: sum([a.nbytes for a in arr]+[arr.nbytes])
Out[100]: 488104
Save it and check the file size
In [103]: np.save('test.npy', arr, allow_pickle=True)
In [104]: ll test.npy
-rw-rw-r-- 1 paul 488569 Jul 8 17:46 test.npy
That's close enough!
An npz archive takes about the same space:
In [106]: np.savez('test.npz', *arr)
In [107]: ll test.npz
-rw-rw-r-- 1 paul 488828 Jul 8 17:49 test.npz
But compressing helps significantly:
In [108]: np.savez_compressed('test.npz', *arr)
In [109]: ll test.npz
-rw-rw-r-- 1 paul 2643 Jul 8 17:50 test.npz
I suspect it's so compressible because the largest array is all 0s. With random-valued arrays of the same sizes, compression only gets the file down to 454909 bytes.
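To reproduce that last comparison, here's a rough sketch (same three shapes as above, random values, arbitrary file name):
import os
import numpy as np

# same shapes as in the object array above, but filled with random values,
# so the compressor has no long runs of identical bytes to exploit
rng = np.random.default_rng()
parts = [rng.random(10), rng.random((200, 300)), rng.random((100, 10))]

np.savez_compressed('test_random.npz', *parts)
print(os.path.getsize('test_random.npz'))  # not much smaller than the ~488 KB of raw data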
Upvotes: 1
Reputation: 2276
numpy.save
uses pickle
to store arrays that have the "object" dtype. From the numpy format documentation:
If the dtype contains Python objects (i.e. dtype.hasobject is True), then the data is a Python pickle of the array
The size of a pickled Python object is not the same as its size in memory, hence the discrepancy.
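A quick way to see this, assuming arr is an object array like the asker's (a toy one below), is to compare nbytes with the length of the pickled bytes:
import pickle
import numpy as np

# toy object array standing in for the asker's arr (hypothetical data)
arr = np.array([[1, 2, 3], [4.5, 6.7]], dtype=object)

print(arr.nbytes)              # only the pointer buffer
print(len(pickle.dumps(arr)))  # the full serialized form, every element included
# For object arrays, np.save writes essentially this pickled form (plus a small
# header), which is why the .npy file can be far larger than nbytes suggests.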
Upvotes: 1