Reputation: 119
I'm working on some imaging-related ML tasks, and as a result of the preprocessing required, I'm creating objects of a class that contains important metadata attributes, along with a 3D numpy array of image data. I'd like to reduce the size of these objects and increase the speed at which they're written and read.
As it stands, the object is saved to a file using pickle, but this does not seem like the most efficient method. The dill library is supposed to be better at saving numpy items, but since I need to process many files and its overall performance is slower, it seems unhelpful.
I also heard of the numpy.save method, but I wasn't sure how to implement this as part of my pickling process. I pickle items using pickle.dump and pickle.load.
Upvotes: 0
Views: 173
Reputation: 231385
pickle depends on each object's own pickling method (such as __reduce__), whether it's a list, dict, or something else. The pickle formatting for numpy arrays is essentially the same as np.save, so speed and file size should be similar. Conversely, np.save uses pickle to format non-array arguments, or arrays that contain objects (note the allow_pickle parameter in save/load).
In [57]: import pickle; import numpy as np
In [58]: x = np.ones((100,100,100))
In [59]: np.save('test.npy',x)
In [60]: !dir test.npy
Volume in drive C is Windows
Volume Serial Number is 4EEB-1BF0
Directory of C:\Users\paul
01/18/2023 12:57 PM 8,000,128 test.npy
1 File(s) 8,000,128 bytes
0 Dir(s) 18,489,139,200 bytes free
In [61]: astr=pickle.dumps(x)
In [62]: len(astr)
Out[62]: 8000163
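The allow_pickle behaviour mentioned above can be sketched as follows. This is only an illustration: the roundtrip helper and the sample arrays are made up for the example, and it writes to an in-memory buffer rather than a file.

```python
import io
import numpy as np

def roundtrip(arr, allow_pickle):
    """Hypothetical helper: np.save to an in-memory buffer, then np.load it back."""
    buf = io.BytesIO()
    np.save(buf, arr, allow_pickle=allow_pickle)
    buf.seek(0)
    return np.load(buf, allow_pickle=allow_pickle)

# A plain numeric array needs no pickling at all.
numeric = np.ones((4, 4))
numeric_back = roundtrip(numeric, allow_pickle=False)

# An object-dtype array can only be stored by pickling its elements.
objects = np.empty(2, dtype=object)
objects[0] = {"a": 1}
objects[1] = "label"

try:
    roundtrip(objects, allow_pickle=False)
    object_needs_pickle = False
except ValueError:
    # np.save refuses object arrays when pickling is disabled
    object_needs_pickle = True

objects_back = roundtrip(objects, allow_pickle=True)
```

So for a purely numeric image array, np.save never touches pickle; pickle only comes into play for object-dtype content.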
I've seen that some ML projects use HDF5/h5py to save the model and data, but I haven't paid much attention to that. I have answered h5py questions, but haven't tried it for large projects where speed and compression matter.
Multiple numpy arrays can also be saved with np.savez (or the compressed version, np.savez_compressed). That saves each array as a npy file in a zip archive.
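For an object that holds metadata plus one 3D array, one possible approach is to put both into a single npz archive instead of pickling the whole object. The sketch below is an assumption about how that could look, not a tested recipe for the asker's class: Scan, save_scan and load_scan are hypothetical names, and the metadata is assumed to be JSON-serialisable.

```python
import json
import os
import tempfile
import numpy as np

class Scan:
    """Hypothetical stand-in for the asker's class: metadata plus a 3D array."""
    def __init__(self, image, metadata):
        self.image = image
        self.metadata = metadata

def save_scan(path, scan):
    # The image is stored as a normal npy member; the metadata is stored as a
    # JSON string (a 0-d string array), so no pickling is involved.
    np.savez_compressed(path, image=scan.image, metadata=json.dumps(scan.metadata))

def load_scan(path):
    with np.load(path, allow_pickle=False) as archive:
        image = archive["image"]
        metadata = json.loads(archive["metadata"].item())
    return Scan(image, metadata)

# Usage with made-up data
scan = Scan(np.zeros((8, 16, 16), dtype=np.float32),
            {"patient": "demo", "spacing": [1.0, 0.5, 0.5]})
path = os.path.join(tempfile.mkdtemp(), "scan.npz")
save_scan(path, scan)
restored = load_scan(path)
```

Swapping np.savez_compressed for np.savez skips the compression step, which may be faster when the image data does not compress well.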
np.save is the most efficient means of saving an array. It essentially consists of a small header block, plus a byte copy of the array's data buffer. Unless the array has lots of the same values, there's little room for compression.
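A rough way to check how much compression helps for a given kind of data is to compare the saved sizes directly. This is a minimal sketch with made-up arrays (one constant, one random); saved_size is a hypothetical helper, and real image data will fall somewhere between the two extremes.

```python
import io
import numpy as np

def saved_size(arr, compressed):
    """Hypothetical helper: bytes written by np.save or np.savez_compressed."""
    buf = io.BytesIO()
    if compressed:
        np.savez_compressed(buf, arr=arr)
    else:
        np.save(buf, arr)
    return len(buf.getvalue())

rng = np.random.default_rng(0)
constant = np.ones((100, 100, 100))   # lots of identical values
noisy = rng.random((100, 100, 100))   # essentially incompressible

sizes = {
    "constant_raw": saved_size(constant, compressed=False),
    "constant_zip": saved_size(constant, compressed=True),
    "noisy_raw": saved_size(noisy, compressed=False),
    "noisy_zip": saved_size(noisy, compressed=True),
}

for name, nbytes in sizes.items():
    print(f"{name}: {nbytes:,} bytes")
```

The constant array should shrink dramatically, while the random one should stay close to its raw size of roughly 8 MB.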
Upvotes: 2