Reputation: 353
I have a dataset of 40,000 examples with shape (40000, 2048). After processing, I would like to store and load this dataset efficiently. The dataset is in numpy format.
I used pickle to store this dataset, but it takes a long time to store and even longer to load, and I even get a memory error. I tried to split the dataset into several chunks as follows:
import pickle

# Split train_frames into chunks and pickle each chunk separately
with open('dataset_10000.sav', 'wb') as handle:
    pickle.dump(train_frames[:10000], handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('dataset_20000.sav', 'wb') as handle:
    pickle.dump(train_frames[10000:20000], handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('dataset_30000.sav', 'wb') as handle:
    pickle.dump(train_frames[20000:30000], handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('dataset_35000.sav', 'wb') as handle:
    pickle.dump(train_frames[30000:35000], handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('dataset_40000.sav', 'wb') as handle:
    pickle.dump(train_frames[35000:], handle, protocol=pickle.HIGHEST_PROTOCOL)
However, I still get a memory error, and the files are too heavy. What is the best/optimized way to save such a huge dataset to disk and load it back?
Upvotes: 0
Views: 4716
Reputation: 96127
For numpy.ndarray objects, use numpy.save, which you should prefer over pickle anyway, since it is more portable. It should be faster and require less memory during serialization. You can then load the data with numpy.load, which even provides a memmap option, allowing you to work with arrays that are larger than can fit into memory.
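A minimal sketch of that workflow (the array name train_frames and the shape come from the question; the file name and dtype are assumptions for illustration):

import numpy as np

# Stand-in for the 40,000 x 2048 array from the question (dtype is assumed)
train_frames = np.random.rand(40000, 2048).astype(np.float32)

# Save the whole array to a single .npy file
np.save('train_frames.npy', train_frames)

# Load it back fully into memory
loaded = np.load('train_frames.npy')

# Or memory-map it: the data stays on disk and slices are read on demand,
# so the full array never has to fit in RAM at once
mapped = np.load('train_frames.npy', mmap_mode='r')
first_batch = mapped[:10000]  # only this slice is actually read from disk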
Upvotes: 1