perimosocordiae

Reputation: 17797

Saving many arrays of different lengths

I have ~8000 arrays of two-dimensional points, stored in memory as a Python list of numpy arrays. Each array has shape (x,2), where x is a number between ~600 and ~4000. Essentially, I have a jagged 3-d array.

I want to store this data in a convenient/fast format for reading/writing from disk. I'd rather not create ~8000 separate files, but I'd also rather not pad out a full (8000,4000,2) matrix with zeros if I can avoid it.

How should I store my data on disk, such that both filesize and parsing/serialization are minimized?

Upvotes: 7

Views: 3886

Answers (2)

ederollora

Reputation: 1181

There's a standard called HDF for storing large numerical data sets. You can find some information in the following link, but in general terms HDF defines a binary file format that can be used to store large amounts of data.

You can find an example here that stores large NumPy arrays on disk. In that post, the author compares Python's pickle with HDF5.

I also recommend this introduction to HDF5. Here's the h5py package, which is a Pythonic interface to the HDF5 binary data format.
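As a rough sketch of how this could look for a jagged list of arrays, assuming h5py is installed: each array is written as its own dataset inside a single file, so nothing is padded. The file name, the dataset naming scheme (`arr_00000`, ...), and the small stand-in data are made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Stand-in data: a short jagged list of (x, 2) arrays (the question has ~8000).
rng = np.random.default_rng(0)
arrays = [rng.random((n, 2)) for n in (600, 1500, 4000)]

# Hypothetical output path in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "points.h5")

# Write each array as its own dataset inside one file; no padding is needed.
with h5py.File(path, "w") as f:
    for i, arr in enumerate(arrays):
        # Zero-padded names keep the datasets in order when sorted by name.
        f.create_dataset("arr_%05d" % i, data=arr)

# Read everything back, sorted by name so the original order is restored.
with h5py.File(path, "r") as f:
    loaded = [f[name][()] for name in sorted(f.keys())]
```

Reading a single array later only needs `f["arr_00001"][()]`, so the whole file does not have to be loaded to get at one entry.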

Upvotes: 6

John1024

Reputation: 113844

Put all your numpy arrays into a single Python list and then serialize that list with pickle (or cPickle on Python 2).

For example:

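try:
    import cPickle  # Python 2
except ImportError:
    import pickle as cPickle  # Python 3: pickle already uses the C implementation
from numpy import ones, zeros
a = zeros((5,2))   # zeros/ones take a shape; array((5,2)) would just build [5, 2]
b = ones((10,2))
c = zeros((20,2))
all = [a,b,c]
cPickle.dump(all, open('all_my_arrays', 'wb'))  # pickles must be written in binary mode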

You can then retrieve them with:

all2 = cPickle.load(open('all_my_arrays', 'rb'))  # read in binary mode as well

Note that the list all does not require any massive new memory allocation. Because all is just a list of pointers to your numpy arrays, nothing has to be padded with zeros or otherwise copied.
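To make that point concrete, here is a small self-contained check, written as a sketch for Python 3 (so it uses `pickle` rather than `cPickle`). The temporary file path and the name `arrays` are only for illustration. It confirms that the list holds references to the existing arrays and that the differing shapes survive the round trip.

```python
import os
import pickle
import tempfile

import numpy as np

a = np.zeros((5, 2))
b = np.ones((10, 2))
c = np.zeros((20, 2))
arrays = [a, b, c]

# The list stores references to the existing arrays, not copies of them.
assert arrays[1] is b

# Hypothetical output path in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "all_my_arrays")

# The highest protocol is a compact binary format.
with open(path, "wb") as f:
    pickle.dump(arrays, f, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, "rb") as f:
    restored = pickle.load(f)

# Each array keeps its own length; nothing was padded.
shapes = [arr.shape for arr in restored]
print(shapes)  # [(5, 2), (10, 2), (20, 2)]
```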

Relative to pickle, HDF5 has the advantages of speed on large arrays and cross-application support (Octave, Perl, etc.). On the other hand, pickle has the advantages of not requiring any extra software installation (it is included with Python) and of natively understanding Python objects.

Upvotes: 2
