Nick

Reputation: 675

Is it possible to store multidimensional arrays of arbitrary shape in a PyTables cell?

PyTables supports the creation of tables from user-defined classes that inherit from the IsDescription class. This includes support for multidimensional cells, as in the following example from the documentation:

class Particle(IsDescription):
    name = StringCol(itemsize=16) # 16-character string
    lati = Int32Col() # integer
    longi = Int32Col() # integer
    pressure = Float32Col(shape=(2,3)) # array of floats (single-precision) 
    temperature = Float64Col(shape=(2,3)) # array of doubles (double-precision)

However, is it possible to store an arbitrarily-shaped multidimensional array in a single cell? Following the above example, I have in mind something like pressure = Float32Col(shape=(x, y)), where x and y are determined upon the insertion of each row.

If not, what is the preferred approach? Storing each (arbitrarily-shaped) multidimensional array in a CArray with a unique name and then storing those names in a master index table? The application I'm imagining is storing images and associated metadata, which I'd like to be able to both query and use numexpr on.

Any pointers toward PyTables best practices are much appreciated!

Upvotes: 1

Views: 2042

Answers (2)

kenm

Reputation: 23945

The long answer is "yes, but you probably don't want to."

PyTables probably doesn't support it directly, but HDF5 does support the creation of nested variable-length datatypes, allowing ragged arrays in multiple dimensions. Should you wish to go down that path, you'll want to use h5py and browse through the HDF5 User's Guide, Datatypes chapter, section 6.4.3.2.3, "Variable-length Datatypes". (I'd link it, but they apparently chose not to put anchors that deep.)
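As a caveat, h5py only exposes variable-length types for one ragged dimension, so a common workaround for arbitrary 2-D shapes is to flatten each array into a vlen cell and keep its shape in a companion dataset. A minimal sketch, assuming h5py >= 2.10 for vlen_dtype (file and dataset names are made up):

import h5py
import numpy as np

# Each vlen cell holds a 1-D float32 array of arbitrary length; to store a
# 2-D array, flatten it and record its shape alongside.
vlen_f32 = h5py.vlen_dtype(np.float32)  # older h5py: h5py.special_dtype(vlen=np.float32)

with h5py.File("ragged.h5", "w") as f:
    data = f.create_dataset("pressure", shape=(2,), dtype=vlen_f32)
    shapes = f.create_dataset("pressure_shape", shape=(2, 2), dtype="int64")

    a = np.arange(6, dtype=np.float32).reshape(2, 3)    # one row: 2x3
    b = np.arange(35, dtype=np.float32).reshape(5, 7)   # another row: 5x7

    data[0], shapes[0] = a.ravel(), a.shape
    data[1], shapes[1] = b.ravel(), b.shape

    # Reading back: restore the original shape from the companion dataset.
    restored = data[1].reshape(shapes[1])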

Personally, the way that I would arrange the data you've got is into groups of datasets, not into a single table. That is, something like:

/particles/particlename1/pressure
/particles/particlename1/temperature
/particles/particlename2/pressure
/particles/particlename2/temperature

and so on. The lat and long values would be attributes on the /particles/particlename group rather than datasets, though having small datasets for them is perfectly fine too.
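For instance, a minimal PyTables sketch of that layout (the particle name, attribute values, and array shapes here are all made up):

import numpy as np
import tables as tb

with tb.open_file("particles.h5", "w") as f:
    # One group per particle; lat/long live as attributes on the group.
    grp = f.create_group("/particles", "particlename1", createparents=True)
    grp._v_attrs.lati = 42
    grp._v_attrs.longi = -71

    # Each array gets its own dataset, so shapes can differ per particle.
    f.create_carray(grp, "pressure",
                    obj=np.zeros((480, 640), dtype=np.float32))
    f.create_carray(grp, "temperature",
                    obj=np.zeros((480, 640), dtype=np.float64))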

If you want to be able to do searches based on the lat and long, then having a dataset with the lat/long/name columns would be good. And if you wanted to get really fancy, there's an HDF5 datatype for references, allowing you to store a pointer to a dataset, or even to a subset of a dataset.
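A hedged sketch of such an index table in PyTables, building on the hypothetical layout above (here the path column stores the node path as a string rather than an HDF5 reference, and idx.where gives you the numexpr-backed querying the question asks about):

import tables as tb

class ParticleIndex(tb.IsDescription):
    name = tb.StringCol(itemsize=32)
    lati = tb.Int32Col()
    longi = tb.Int32Col()
    path = tb.StringCol(itemsize=128)  # e.g. b"/particles/particlename1"

with tb.open_file("particles.h5", "a") as f:
    idx = f.create_table("/", "index", ParticleIndex)
    row = idx.row
    row["name"] = b"particlename1"
    row["lati"], row["longi"] = 42, -71
    row["path"] = b"/particles/particlename1"
    row.append()
    idx.flush()

    # In-kernel (numexpr-backed) query, then follow the path to the arrays.
    for r in idx.where("(lati > 40) & (longi < 0)"):
        pressure = f.get_node(r["path"].decode(), "pressure")[:]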

Upvotes: 1

clh

Reputation: 829

The short answer is "no", and I think it's a "limitation" of HDF5 rather than PyTables.

I think the reason is that each unit of storage (the compound dataset) must have a well-defined size; if one or more components can change size, it obviously won't. Note that it is entirely possible to resize and extend a dataset in HDF5 (PyTables makes heavy use of this), but not the units of data within that array.

I suspect the best thing to do is one of the following:

a) Make it a well-defined size and provide a flag for overflow. This works well if the largest reasonable size is still pretty small and you are okay with tail events being thrown out. Note that you may be able to get rid of the unused disk space with HDF5 compression (see the sketch below).

b) Do as you suggest and create a new CArray in the same file, then just read it in when required. (To keep things tidy, you might want to put these all under their own group.)
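A minimal sketch of option (a), with made-up field names and a made-up maximum size; the zero padding compresses well, so most of the unused space disappears on disk:

import numpy as np
import tables as tb

MAX_H, MAX_W = 64, 64  # assumed "largest reasonable size"

class Record(tb.IsDescription):
    pressure = tb.Float32Col(shape=(MAX_H, MAX_W))  # fixed-size, zero-padded
    h = tb.Int32Col()          # true height
    w = tb.Int32Col()          # true width
    overflow = tb.BoolCol()    # set when the input had to be truncated

def append_array(table, arr):
    row = table.row
    row["overflow"] = arr.shape[0] > MAX_H or arr.shape[1] > MAX_W
    h, w = min(arr.shape[0], MAX_H), min(arr.shape[1], MAX_W)
    padded = np.zeros((MAX_H, MAX_W), dtype=np.float32)
    padded[:h, :w] = arr[:h, :w]
    row["pressure"], row["h"], row["w"] = padded, h, w
    row.append()

with tb.open_file("data.h5", "w") as f:
    t = f.create_table("/", "records", Record,
                       filters=tb.Filters(complevel=5, complib="zlib"))
    append_array(t, np.random.rand(10, 20).astype(np.float32))
    t.flush()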

HDF5 actually has a high-level API (H5IM, part of the HDF5 Image and Palette specification) which is designed and optimized for storing images in an HDF5 file. I don't think it's exposed in PyTables.

Upvotes: 0
