ali_m

Reputation: 74154

PyTables: indexing multiple dimensions of large arrays

I'm analysing some imaging data that consists of large 3-dimensional arrays of pixel intensities with dimensions [frame, x, y]. Since these are usually too big to hold in memory, they reside on the hard disk as PyTables arrays.

What I'd like to be able to do is read out the intensities in an arbitrary subset of pixels across all frames. The natural way to do this seems to be list indexing:

import numpy as np
import tables

tmph5 = tables.open_file('temp.hdf5', 'w')
bigarray = tmph5.create_array('/', 'bigarray', np.random.randn(1000, 200, 100))

roipixels = [[0, 1, 2, 4, 6], [34, 35, 36, 40, 41]]
roidata = bigarray[:, roipixels[0], roipixels[1]]
# IndexError: Only one selection list is allowed

Unfortunately it seems that PyTables currently only supports a single set of list indices. A further problem is that a list index can't contain duplicates - I couldn't simultaneously read pixels [1, 2] and [1, 3], since my list of pixel x-coordinates would contain [1, 1]. I know that I can iterate over rows in the array:

roidata = np.asarray([row[roipixels[0], roipixels[1]] for row in bigarray])

but these iterative reads become quite slow for the large number of frames I'm processing.

Is there a nicer way of doing this? I'm relatively new to PyTables, so if you have any tips on organising datasets in large arrays I'd love to hear them.

Upvotes: 3

Views: 991

Answers (1)

Joe Kington

Reputation: 284572

For whatever it's worth, I often do the same thing with 3D seismic data stored in hdf format.

The iterative read is slow because of the nested loops. If you only do a single loop (rather than looping over each row) it's quite fast (at least with h5py; I typically only use PyTables for table-like data) and does exactly what you want.

In most cases, you'll want to iterate over your lists of indices, rather than over each row.

Basically, you want:

roidata = np.vstack([bigarray[:, i, j] for i, j in zip(*roipixels)])

Instead of:

roidata = np.asarray([row[roipixels[0], roipixels[1]] for row in bigarray])

If this is your most common use case, adjusting the chunksize of the stored array will help dramatically. You'll want long, narrow chunks, with the longest length along the first axis, in your case.

(Caveat: I haven't tested this with pytables, but it works perfectly with h5py.)

Upvotes: 3
