Reputation: 359
I have a huge dataset stored on my disk. Since the dataset is about 1.5 TB, I divide it into 32 chunks so that I can save each one with numpy.save('data_1.npy', data_1) in Python 2.7. Here is a sample of 9 sub-datasets; each one is about 30 GB. The shape of each .npy file is (number_of_examples, 224, 224, 19) and the values are floats.
data_1.npy
data_2.npy
data_3.npy
data_4.npy
data_5.npy
data_6.npy
data_7.npy
data_8.npy
data_9.npy
Saved with np.save('*.npy'), my dataset occupies 1.5 TB on my disk.
1) Is there an efficient way to compress my dataset in order to gain some free disk space?
2) Is there an efficient way of saving the files that takes up less space than np.save()?
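For reference, the saving pattern described above looks roughly like this; a minimal sketch where small placeholder arrays stand in for the real ~30 GB chunks:

import numpy as np

# Tiny placeholder chunks; in practice each chunk has shape
# (number_of_examples, 224, 224, 19) and is about 30 GB.
chunks = [np.random.rand(2, 224, 224, 19) for _ in range(3)]

for i, chunk in enumerate(chunks, start=1):
    # np.save takes the file name and the array to write;
    # this produces data_1.npy, data_2.npy, ...
    np.save('data_{}.npy'.format(i), chunk)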
Thank you
Upvotes: 4
Views: 4552
Reputation: 43
You might want to check out xz compression, mentioned in this answer. I've found it to be the best compression method while saving hundreds of thousands of .npy files adding up to a few hundred GB. The shell command for a directory called dataset containing your .npy files would be:
tar -vcJf dataset.tar.xz dataset/
Or with long arguments:
tar --verbose --create --xz --file=dataset.tar.xz dataset/
This only saves disk space while storing and moving the dataset; the archive has to be decompressed before the files can be loaded back into Python.
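A minimal sketch of unpacking and loading, assuming Python 3 (whose tarfile module reads xz archives natively; on Python 2.7, extract with tar -xJf dataset.tar.xz in the shell instead):

import tarfile
import numpy as np

# Extract the xz-compressed archive back into the dataset/ directory.
# ('r:xz' requires Python 3.3+; on Python 2.7 extract in the shell first.)
with tarfile.open('dataset.tar.xz', 'r:xz') as archive:
    archive.extractall()

# Load one of the restored chunks exactly as before.
data_1 = np.load('dataset/data_1.npy')
print(data_1.shape)  # (number_of_examples, 224, 224, 19)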
Upvotes: 1