Reputation: 9542
I'm trying to improve the performance of my PyTables/HDF5 code by specifying the chunkshape
when creating a table. I can't figure out what the real dimensions or format of the chunkshape
parameter are. I can see from the code that it ultimately ends up as a tuple with a single element.
Is this single element supposed to be the number of rows, the number of bytes, or something else?
My specific issue is that I have existing code that creates an HDF5 table with 20 columns. I would like to change the chunking of the table so that each column is stored contiguously on disk, which would optimize for reading an entire column in a single operation.
I tried just setting the chunkshape to 20 (number of columns), but this dramatically decreased the performance of reading an entire column. Should the chunk shape be set to the width (in bytes) of a single row?
I would just like to know what the chunkshape should be if:
Upvotes: 1
Views: 1535
Reputation: 17489
The chunkshape in PyTables specifies the number of elements along each dimension (rows and columns) that are stored together in one chunk on disk, which is why it is a tuple.
So, for instance, if your dataset is 10,000 x 20 (10,000 rows, 20 columns) and you always access a single column at a time, then each chunk should contain as much of a column as possible, given your best chunk size (see here for more details).
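To make the trade-off concrete, here is a rough back-of-the-envelope sketch in plain Python. It assumes 8-byte elements and that HDF5 reads whole chunks from disk. The helper names are made up for illustration and are not part of PyTables.

```python
import math

ITEMSIZE = 8  # assumption: float64 elements, 8 bytes each


def chunks_touched_by_column_read(shape, chunkshape):
    """Number of chunks that must be read to fetch one full column."""
    n_rows, _ = shape
    chunk_rows, _ = chunkshape
    # A single column lives in exactly one "column of chunks",
    # so only the row direction determines how many chunks are hit.
    return math.ceil(n_rows / chunk_rows)


def bytes_read_for_column(shape, chunkshape, itemsize=ITEMSIZE):
    """Bytes pulled from disk for one column, since chunks are read whole."""
    chunk_nbytes = chunkshape[0] * chunkshape[1] * itemsize
    return chunks_touched_by_column_read(shape, chunkshape) * chunk_nbytes


shape = (10_000, 20)
for chunkshape in [(10_000, 1), (500, 20), (1, 20)]:
    print(
        chunkshape,
        chunks_touched_by_column_read(shape, chunkshape),
        bytes_read_for_column(shape, chunkshape),
    )
```

Under these assumptions a column-shaped chunk reads only the 80,000 bytes that make up the column, while row-shaped chunks drag in all 20 columns for every chunk they touch.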
If you know how many rows you will have and the number is not huge, you could specify a chunkshape of (10000, 1) (or fewer rows). Reading all 20 columns then takes 20 accesses, one per column.
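A minimal sketch of that layout, assuming the data can be stored as a homogeneous 2-D `CArray` of float64 values rather than as a `Table`. That is an assumption on my part: a PyTables `Table` is one-dimensional, so its chunkshape has a single element that counts rows, and a two-element chunkshape only applies to array datasets. The file name and node name below are arbitrary.

```python
import os
import tempfile

import numpy as np
import tables

# Throwaway file in a temp directory (path and node name are illustrative).
path = os.path.join(tempfile.mkdtemp(), "chunk_demo.h5")
data = np.arange(10_000 * 20, dtype=np.float64).reshape(10_000, 20)

with tables.open_file(path, mode="w") as h5:
    arr = h5.create_carray(
        h5.root,
        "data",
        atom=tables.Float64Atom(),
        shape=data.shape,
        chunkshape=(10_000, 1),  # one chunk per column, 80,000 bytes each
    )
    arr[:] = data

with tables.open_file(path, mode="r") as h5:
    arr = h5.root.data
    stored_chunkshape = tuple(arr.chunkshape)
    column = arr[:, 3]  # reads a single column, i.e. a single chunk

print(stored_chunkshape, column.shape)
```

If the columns have different types and a `Table` is required, the usual alternatives are to let PyTables choose the chunkshape from `expectedrows`, or to store each column as its own array.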
Upvotes: 4