Kagaratsch

Reputation: 1023

Why does numpy.save produce 100MB file for sys.getsizeof 0.33MB data?

I have a numpy array arr (produced from multiple nested lists of mismatching lengths), which apparently takes only

sys.getsizeof(arr)/(1000*1000)

0.33848

MB of space in memory. However, when I attempt to save this data to disk with

import numpy as np

myf = open('.\\test.npy', 'wb')
np.save(myf, arr)
myf.close()

the produced file test.npy turns out to be over 100 MB in size.

Why is that? Did I make a mistake measuring the actual data size in Python memory? Or, if not, is there some way to save the data more efficiently, taking up only about 0.33848 MB on the hard drive?

EDIT:

As requested in the comments, here are some more properties of arr:

arr.shape

(14101, 6)

arr.dtype

dtype('O')

arr.itemsize

4

arr.nbytes

338424

Even though the dtype claims to be dtype('O'), the array only contains numerical values (integers and floats). Perhaps the object dtype arises from the mismatching lengths of the nested lists?
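A minimal sketch (made-up ragged data, not arr itself) suggests so: numpy falls back to dtype('O') for ragged nested lists, and sys.getsizeof then counts only the buffer of object pointers, not the values they reference.

import sys
import numpy as np

# Ragged rows force numpy to store Python object references.
ragged = np.array([[1, 2, 3], [4.5, 6]], dtype=object)
print(ragged.dtype)           # object

# Only the pointer buffer plus the array header is counted here;
# the actual integers and floats live elsewhere in memory.
print(sys.getsizeof(ragged))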

Upvotes: 1

Views: 402

Answers (2)

hpaulj

Reputation: 231375

Make an array composed of several arrays:

In [98]: arr = np.array([np.ones(10), np.zeros((200,300)),np.arange(1000).reshape(100,10)],object)   

Total memory use:

In [100]: sum([a.nbytes for a in arr]+[arr.nbytes])                                                  
Out[100]: 488104

Save it and check the file size:

In [103]: np.save('test.npy', arr, allow_pickle=True)                                                
In [104]: ll test.npy                                                                                
-rw-rw-r-- 1 paul 488569 Jul  8 17:46 test.npy

That's close enough!

An npz archive takes about the same space:

In [106]: np.savez('test.npz', *arr)                                                                 
In [107]: ll test.npz                                                                                
-rw-rw-r-- 1 paul 488828 Jul  8 17:49 test.npz

But compressing helps significantly:

In [108]: np.savez_compressed('test.npz', *arr)                                                      
In [109]: ll test.npz                                                                                
-rw-rw-r-- 1 paul 2643 Jul  8 17:50 test.npz

I suspect it's so compressible because the largest array is all zeros. With random-valued arrays of the same size, compression only gets the file down to 454909 bytes.
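A quick sketch of that difference (hypothetical file names, fresh arrays of the same shape as the big one above):

import os
import numpy as np

rng = np.random.default_rng(0)

# An all-zero block compresses to almost nothing;
# random doubles have essentially no redundancy to exploit.
np.savez_compressed('zeros.npz', np.zeros((200, 300)))
np.savez_compressed('rand.npz', rng.random((200, 300)))
print(os.path.getsize('zeros.npz'), os.path.getsize('rand.npz'))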

Upvotes: 1

myrtlecat

Reputation: 2276

numpy.save uses pickle to store arrays that have the "object" dtype. From the numpy format documentation:

If the dtype contains Python objects (i.e. dtype.hasobject is True), then the data is a Python pickle of the array

The size of a pickled Python object is not the same as its size in memory, hence the discrepancy.
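A rough sketch of the gap (hypothetical object array, not the asker's data):

import pickle
import sys
import numpy as np

arr = np.array([np.ones(10), np.zeros((200, 300))], dtype=object)

# getsizeof counts only the array header and the object-pointer buffer...
print(sys.getsizeof(arr))
# ...whereas np.save writes a pickle of the whole array, including
# every referenced sub-array, so the file size tracks this instead.
print(len(pickle.dumps(arr)))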

Upvotes: 1
