Saving numpy arrays as a part of larger objects?

I'm working on some imaging-related ML tasks, and as a result of the preprocessing required, I'm creating objects of a class that contains important metadata attributes, along with a 3D numpy array of image data. I'd like to reduce the size of these objects and speed up writing and reading them.

As it stands, each object is saved to a file using pickle, but this does not seem like the most efficient method. The dill library is supposed to be better at saving numpy items, but I need to process many files and its overall performance is slower, so it seems unhelpful.

I also heard of the numpy.save method, but I wasn't sure how to implement this as part of my pickling process. I pickle items using pickle.dump and pickle.load.

Upvotes: 0

Views: 173

Answers (1)

hpaulj

Reputation: 231385

pickle relies on each object's own pickling support (its __reduce__ method or equivalent), whether it's a list, dict, or something else. The pickle format for numpy arrays is essentially the same as what np.save writes, so speed and file size should be similar. Conversely, np.save uses pickle to store non-array arguments or arrays that contain objects (note the allow_pickle parameter in np.save and np.load).

In [56]: import numpy as np
In [57]: import pickle
In [58]: x = np.ones((100,100,100))
In [59]: np.save('test.npy',x)
In [60]: !dir test.npy
 Volume in drive C is Windows
 Volume Serial Number is 4EEB-1BF0

 Directory of C:\Users\paul

01/18/2023  12:57 PM         8,000,128 test.npy
               1 File(s)      8,000,128 bytes
               0 Dir(s)  18,489,139,200 bytes free

In [61]: astr=pickle.dumps(x)
In [62]: len(astr)
Out[62]: 8000163
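
To illustrate the allow_pickle remark above, here is a minimal sketch (my own variable names, in-memory buffers instead of files). It assumes a reasonably recent numpy, where np.load refuses pickled data by default. A numeric array round-trips without pickle, while an object-dtype array only loads when pickling is explicitly allowed.

```python
import io

import numpy as np

# A plain numeric array: the .npy payload is a header plus the raw data buffer.
numeric = np.arange(6, dtype=np.float64).reshape(2, 3)
numeric_buf = io.BytesIO()
np.save(numeric_buf, numeric)
numeric_buf.seek(0)
numeric_back = np.load(numeric_buf)  # works with the default allow_pickle=False

# An object-dtype array: np.save has to pickle the elements.
objects = np.empty(2, dtype=object)
objects[0] = {"label": "scan"}
objects[1] = [1, 2, 3]
object_buf = io.BytesIO()
np.save(object_buf, objects)  # allow_pickle defaults to True when saving

# Loading the same data is refused unless pickling is explicitly allowed.
object_buf.seek(0)
try:
    np.load(object_buf)  # allow_pickle defaults to False when loading
    refused = False
except ValueError:
    refused = True

object_buf.seek(0)
objects_back = np.load(object_buf, allow_pickle=True)
```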

I've seen that some ML projects use HDF5/h5py to save the model and data, but I haven't paid much attention to that. I have answered h5py questions, but haven't tried it for large projects where speed and compression matter.

Multiple numpy arrays can also be saved with np.savez (or the compressed version, np.savez_compressed). That saves each array as a .npy file in a zip archive.
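
One way to apply this to an object with metadata plus an image array might look like the sketch below. The Scan class, its attribute names, and the save_scan/load_scan helpers are hypothetical stand-ins for your own class, and I'm assuming the metadata is simple enough to serialize as JSON. The JSON string is stored as a zero-dimensional string array, so no pickling is needed on load.

```python
import io
import json
from dataclasses import dataclass

import numpy as np


@dataclass
class Scan:
    """Hypothetical container: a few metadata fields plus a 3D image array."""
    patient_id: str
    spacing_mm: tuple
    image: np.ndarray


def save_scan(scan, file):
    """Write the image and JSON-encoded metadata into one compressed .npz archive."""
    meta = {"patient_id": scan.patient_id, "spacing_mm": list(scan.spacing_mm)}
    # The metadata becomes a 0-d unicode array, which is stored without pickle.
    np.savez_compressed(file, image=scan.image, meta=np.array(json.dumps(meta)))


def load_scan(file):
    """Rebuild a Scan from an archive written by save_scan."""
    with np.load(file) as archive:
        meta = json.loads(archive["meta"].item())
        image = archive["image"]
    return Scan(meta["patient_id"], tuple(meta["spacing_mm"]), image)


# Round trip through an in-memory buffer; a filename such as "scan.npz" works too.
original = Scan("p001", (1.0, 1.0, 2.5), np.zeros((4, 5, 6), dtype=np.float32))
buffer = io.BytesIO()
save_scan(original, buffer)
buffer.seek(0)
restored = load_scan(buffer)
```

Swapping np.savez_compressed for np.savez skips the compression step, which may be faster when the image data doesn't compress well.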

np.save is the most efficient means of saving an array. It essentially consists of a small header block, plus a byte copy of the array's data buffer. Unless the array has lots of the same values, there's little room for compression.
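
A rough way to check how much compression would buy you is to compare the sizes np.save and np.savez_compressed produce for your own data. The sketch below uses a hypothetical saved_size helper and two made-up arrays: one constant (highly repetitive) and one filled with random floats as a stand-in for noisy image data.

```python
import io

import numpy as np


def saved_size(save_func, array):
    """Return the number of bytes save_func writes for array (hypothetical helper)."""
    buf = io.BytesIO()
    save_func(buf, array)
    return buf.getbuffer().nbytes


constant = np.ones((50, 50, 50))            # lots of identical values
rng = np.random.default_rng(0)
noisy = rng.random((50, 50, 50))            # stand-in for noisy image data

raw_constant = saved_size(np.save, constant)
zip_constant = saved_size(np.savez_compressed, constant)
raw_noisy = saved_size(np.save, noisy)
zip_noisy = saved_size(np.savez_compressed, noisy)

print("constant:", raw_constant, "->", zip_constant)
print("noisy:   ", raw_noisy, "->", zip_noisy)
```

The constant array should shrink dramatically, while the random one stays close to its raw size. Real image data will fall somewhere in between, so it's worth measuring on a few of your own files.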

Upvotes: 2
