Kagaratsch

Reputation: 1023

Why does numpy.save produce 100MB file for sys.getsizeof 0.33MB data?

I have a numpy array arr (produced from multiple nested lists of mismatching lengths), which apparently takes only

sys.getsizeof(arr)/(1000*1000)

0.33848

MB of space in memory. However, when I attempt to save this data to disk with

import numpy as np

myf = open('.\\test.npy', 'wb')
np.save(myf, arr)
myf.close()

the produced file test.npy turns out to be over 100 MB in size.

Why is that? Did I make a mistake measuring the actual data size in Python memory? Or, if not, is there some way to save the data more efficiently, taking up only about 0.33848 MB on the hard drive?

EDIT:

As requested in the comments, here are some more properties of arr:

arr.shape

(14101, 6)

arr.dtype

dtype('O')

arr.itemsize

4

arr.nbytes

338424

Even though the dtype claims to be dtype('O'), the array only contains numerical values (integers and floats). Perhaps the object dtype arises from the mismatching lengths of the nested lists?
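A minimal sketch (made-up ragged data, not arr itself) suggests so: numpy falls back to dtype('O') for ragged nested lists, and sys.getsizeof then counts only the buffer of object pointers, not the values they reference.

import sys
import numpy as np

# Ragged rows force numpy to store Python object references.
ragged = np.array([[1, 2, 3], [4.5, 6]], dtype=object)
print(ragged.dtype)           # object

# Only the pointer buffer plus the array header is counted here;
# the actual integers and floats live elsewhere in memory.
print(sys.getsizeof(ragged))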

Upvotes: 1

Views: 402

Answers (2)

hpaulj

Reputation: 231375

Make an array composed of several arrays:

In [98]: arr = np.array([np.ones(10), np.zeros((200,300)),np.arange(1000).reshape(100,10)],object)   

Total memory use:

In [100]: sum([a.nbytes for a in arr]+[arr.nbytes])                                                  
Out[100]: 488104

Save it and check the file size:

In [103]: np.save('test.npy', arr, allow_pickle=True)                                                
In [104]: ll test.npy                                                                                
-rw-rw-r-- 1 paul 488569 Jul  8 17:46 test.npy

That's close enough!

An npz archive takes about the same space:

In [106]: np.savez('test.npz', *arr)                                                                 
In [107]: ll test.npz                                                                                
-rw-rw-r-- 1 paul 488828 Jul  8 17:49 test.npz

But compressing helps significantly:

In [108]: np.savez_compressed('test.npz', *arr)                                                      
In [109]: ll test.npz                                                                                
-rw-rw-r-- 1 paul 2643 Jul  8 17:50 test.npz

I suspect it's so compressible because the largest array is all zeros. With random-valued arrays of the same size, compression only gets the file down to 454909 bytes.
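A quick sketch of that difference (hypothetical file names, fresh arrays of the same shape as the big one above):

import os
import numpy as np

rng = np.random.default_rng(0)

# An all-zero block compresses to almost nothing;
# random doubles have essentially no redundancy to exploit.
np.savez_compressed('zeros.npz', np.zeros((200, 300)))
np.savez_compressed('rand.npz', rng.random((200, 300)))
print(os.path.getsize('zeros.npz'), os.path.getsize('rand.npz'))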

Upvotes: 1

myrtlecat

Reputation: 2276

numpy.save uses pickle to store arrays that have the "object" dtype. From the numpy format documentation:

If the dtype contains Python objects (i.e. dtype.hasobject is True), then the data is a Python pickle of the array

The size of a pickled Python object is not the same as its size in memory, hence the discrepancy.
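A rough sketch of the gap (hypothetical object array, not the asker's data):

import pickle
import sys
import numpy as np

arr = np.array([np.ones(10), np.zeros((200, 300))], dtype=object)

# getsizeof counts only the array header and the object-pointer buffer...
print(sys.getsizeof(arr))
# ...whereas np.save writes a pickle of the whole array, including
# every referenced sub-array, so the file size tracks this instead.
print(len(pickle.dumps(arr)))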

Upvotes: 1
