durden2.0

Reputation: 9542

Optimize chunkshape parameter of pytables/HDF5 for reading entire column

I'm trying to improve the performance of my PyTables/HDF5 code by specifying the chunkshape when creating a table. I can't figure out what the real dimensions or format of the chunkshape parameter are. I can see from the code that it ultimately ends up as a tuple with a single element.

Is this single element supposed to be the number of rows, bytes, or what?

My specific issue is that I have existing code that creates an HDF5 table with 20 columns. I would like to change the table's chunking so that each column is stored contiguously on disk, thereby optimizing for reading an entire column at a time.

I tried just setting the chunkshape to 20 (the number of columns), but this dramatically decreased the performance of reading an entire column. Should the chunkshape be set to the width (in bytes) of a single row?
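
Here is a minimal sketch of the kind of code I'm working with (the file name, schema, and column names are illustrative, not my real table):

```python
import numpy as np
import tables

# Illustrative 20-column schema; the real table differs.
description = {"col%02d" % i: tables.Float64Col(pos=i) for i in range(20)}

with tables.open_file("data.h5", mode="w") as f:
    table = f.create_table("/", "mytable", description=description,
                           chunkshape=(20,))  # what I tried: 20 == number of columns
    table.append(np.zeros(100000, dtype=table.dtype))
    # Reading one whole column is the access pattern I want to be fast:
    col = table.col("col03")
```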

I would just like to know what the chunkshape should be if:

  1. I want to read an entire column as fast as possible.
  2. I know exactly how many columns are in the table.
  3. I cannot just simply change the table to have the existing rows as columns and vice-versa for backwards-compatibility reasons.

Upvotes: 1

Views: 1535

Answers (1)

Ümit

Reputation: 17489

The chunkshape in PyTables specifies the number of elements along each dimension that are stored contiguously on disk as a single chunk (that is why it is a tuple, with one entry per dimension).

So, for instance, if your dataset is 10,000 x 20 (10,000 rows, 20 columns) and you always access a single column at a time, then each chunk should contain as much of a column as possible, given your best chunk size (see here for more details).

If you know how many rows you will have and they are not that huge, you could specify a chunkshape of (10000, 1) (or fewer rows per chunk). Reading all 20 columns then takes 20 chunk accesses.
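
As a minimal sketch, assuming the 20 columns are stored as a 2-D CArray (a Table stores whole rows together, so per-column chunking only applies to array-style datasets; the file and node names here are made up):

```python
import numpy as np
import tables

with tables.open_file("data.h5", mode="w") as f:
    # chunkshape=(10000, 1): each chunk holds one entire column.
    carr = f.create_carray("/", "data", atom=tables.Float64Atom(),
                           shape=(10000, 20), chunkshape=(10000, 1))
    carr[:, :] = np.random.rand(10000, 20)

with tables.open_file("data.h5", mode="r") as f:
    col = f.root.data[:, 3]  # reading one column touches exactly one chunk
```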

Upvotes: 4
