eric lardon

Reputation: 359

Compress .npy data to save disk space

I have a huge dataset stored on my disk, about 1.5 TB in total. I divide it into 32 samples so that I can save each one with numpy.save() in Python 2.7. Here is a sample of 9 sub-datasets; each one is about 30 GB.

The shape of each .npy file is (number_of_examples, 224, 224, 19) and the values are floats.

data_1.npy
data_2.npy
data_3.npy
data_4.npy
data_5.npy
data_6.npy
data_7.npy
data_8.npy
data_9.npy

Saved with np.save() as .npy files, my dataset occupies 1.5 TB on my disk.
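For context, a minimal sketch of the save workflow described above, where get_chunk() is a hypothetical stand-in for however each chunk is produced:

import numpy as np

# Hypothetical helper: get_chunk(i) returns one chunk of the dataset as an
# array of shape (number_of_examples, 224, 224, 19), dtype float.
for i in range(1, 33):
    chunk = get_chunk(i)
    # np.save takes two arguments: the target filename and the array to write.
    np.save('data_%d.npy' % i, chunk)

Assuming the default float64 dtype, a single example is 224 * 224 * 19 * 8 bytes, roughly 7.6 MB (half that for float32), which is why the full dataset grows into the terabyte range.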

1) Is there an efficient way to compress my dataset in order to gain some free disk space?

2) Is there an efficient way of saving the files that takes up less disk space than np.save()?

Thank you

Upvotes: 4

Views: 4552

Answers (1)

mansi

Reputation: 43

You might want to check out xz compression mentioned in this answer. I've found it to be the best compression method when storing hundreds of thousands of .npy files adding up to a few hundred GB. The shell command for a directory called dataset containing your .npy files would be:

tar -vcJf dataset.tar.xz dataset/

Or with long arguments:

tar --verbose --create --xz --file=dataset.tar.xz dataset/

This is just to save disk space while storing and moving the dataset; it needs to be decompressed before the files can be loaded back into Python.
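For completeness, the archive can be unpacked again with tar's extract flag (-x in place of -c):

tar -xvJf dataset.tar.xz

after which the individual files can be read as before, e.g. np.load('dataset/data_1.npy') in Python.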

Upvotes: 1
