Alice Schwarze

Reputation: 579

Python/Numpy: Efficiently store non-sparse large symmetric arrays?

I am trying to store data on my hard drive that comes in the form of 2 million symmetric 100x100 matrices. Almost all elements of these matrices are non-zero. I am currently saving this data in 200 npy files, each of which is 5.1GB and contains a 100000x100x100 NumPy array. In total, this takes up more than 1TB of hard drive space.

Is there any way I can use the fact that the matrices are symmetric to save space on my hard drive?

Upvotes: 6

Views: 1965

Answers (2)

jpp

Reputation: 164673

Consider HDF5. The h5py Python library offers several compression options. Two of the most popular are lzf (fast decompression, moderate compression ratio) and gzip (slower decompression, good compression ratio). With gzip, you can choose the compression level. In addition, both gzip and lzf can be combined with a shuffle filter to improve compression ratios.

For dense arrays of uncompressed size ~8GB (csv), I typically see a 75% reduction after applying lzf in HDF5. I don't expect as large a benefit when going from npy to HDF5, but it could still be significant.

Another benefit is that HDF5 supports lazy reading. In Python, you can do this directly through h5py or via dask.array.

If you wish to go down this route, the h5py documentation has sample high-level code.
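As a rough illustration of the two points above (compression filters and lazy reading), here is a minimal sketch using h5py. The file name, dataset name, chunk shape, gzip level and the small random array are all assumptions made for the example, not values from the question. Random data compresses poorly, so the ratio you get on real data may be quite different.

```python
import os
import tempfile

import h5py
import numpy as np

# Small synthetic stand-in for one of the large files (assumption).
rng = np.random.default_rng(0)
data = rng.standard_normal((50, 100, 100)).astype(np.float32)

# Hypothetical file name in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "matrices.h5")

# Write the stack with gzip compression and the shuffle filter.
with h5py.File(path, "w") as f:
    f.create_dataset(
        "matrices",             # hypothetical dataset name
        data=data,
        chunks=(10, 100, 100),  # assumed chunk shape: 10 matrices per chunk
        compression="gzip",
        compression_opts=4,     # gzip level, 0-9
        shuffle=True,
    )

# Lazy reading: only the requested slice is read and decompressed.
with h5py.File(path, "r") as f:
    dset = f["matrices"]
    total_shape = dset.shape  # metadata only, no data read yet
    first_ten = dset[:10]     # reads just the first 10 matrices
```

Swapping `compression="gzip", compression_opts=4` for `compression="lzf"` selects the faster filter instead. Choosing a chunk shape that matches how the data will be read back usually matters for read speed.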

Disclaimer: I have no affiliation with h5py or dask. I just find these libraries useful for high-level data analysis.

Upvotes: 1

user545424

Reputation: 16189

To store only the upper half of the matrix (including the diagonal) you should be able to do something like:

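import numpy as np

# [filename] is a placeholder for the path of one of the npy files.
data = np.load([filename])

# Indices of the upper triangle, including the diagonal (5050 of the 10000 entries).
iu = np.triu_indices(100)

flat = [a[iu] for a in data]

# Saves each flattened triangle under the keys arr_0, arr_1, ...
np.savez([filename], *flat)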

And then to load them back:

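import numpy as np

# [filename] is a placeholder for the path of the npz file written above.
flat = np.load([filename])

# Compute the upper-triangle indices once.
iu = np.triu_indices(100)

data = []

# The arrays are stored under the keys arr_0, arr_1, ... in the order they were saved.
for name in flat.files:
    a = flat[name]
    arr = np.zeros((100, 100), dtype=a.dtype)
    arr[iu] = a
    # Mirror the upper triangle; subtract the diagonal so it is not counted twice.
    arr = arr + arr.T - np.diag(arr.diagonal())
    data.append(arr)

data = np.array(data)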
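The same idea can also be written without a Python loop by indexing the whole stack at once. The following is only a sketch under stated assumptions: the helper names `pack_upper` and `unpack_upper` are made up for this example, and the small random stack stands in for one of the real files. Each 100x100 matrix keeps 100*101/2 = 5050 of its 10000 entries, so the packed array is about half the size of the original.

```python
import numpy as np

N = 100  # matrix size from the question


def pack_upper(mats):
    """Keep only the upper triangle (with diagonal) of each matrix in a (k, n, n) stack."""
    n = mats.shape[-1]
    rows, cols = np.triu_indices(n)
    return mats[:, rows, cols]  # shape (k, n * (n + 1) // 2)


def unpack_upper(packed, n):
    """Rebuild the full symmetric (k, n, n) stack from packed upper triangles."""
    rows, cols = np.triu_indices(n)
    full = np.empty((packed.shape[0], n, n), dtype=packed.dtype)
    full[:, rows, cols] = packed  # fill the upper triangle
    full[:, cols, rows] = packed  # mirror into the lower triangle
    return full


# Small synthetic stand-in for one of the 100000x100x100 files (assumption).
rng = np.random.default_rng(0)
a = rng.standard_normal((8, N, N))
data = a + a.transpose(0, 2, 1)  # make every matrix symmetric

packed = pack_upper(data)
restored = unpack_upper(packed, N)
```

The packed result is a single 2-D array, so it can be written with one `np.save` call instead of storing one named array per matrix in an npz archive.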

Upvotes: 2
