Reputation: 924
I have only limited knowledge of HDF5, but I would like to understand something about HDF5 selections.
To give some context, I'm interested in using HDF5 for applications in machine learning. Suppose you have a data matrix with n rows and p columns. In a typical k-fold cross-validation setting, you split the matrix into k subsets (each of size (n/k, p)) and repeatedly use k-1 of them for training and 1 for testing. Of course, storing all the training and testing sets separately would use a lot of space. This is where HDF5 selections could help.
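To make the split concrete, here is a minimal sketch of how the row ranges of each fold could be computed. The helper name fold_row_ranges is hypothetical, and spreading the leftover rows over the first folds when k does not divide n is my own assumption, not part of any library.

```python
def fold_row_ranges(n, k):
    """Return k half-open (start, stop) row ranges that together cover range(n)."""
    # Hypothetical helper: when k does not divide n, the first (n % k)
    # folds each receive one extra row.
    base, extra = divmod(n, k)
    ranges = []
    start = 0
    for i in range(k):
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


# Example: 10 rows split into 3 folds.
ranges = fold_row_ranges(10, 3)
```

Each fold's range is its test set; the training set for that fold is every row outside the range.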
If I understand correctly, a selection can refer to any subset of a dataset. Moreover, a selection can be stored in a dataset. Therefore, starting from an (n, p) dataset in HDF5, I could create k groups (one for each fold), each containing a training dataset (a subset of the rows in the original dataset) and a testing dataset (the remaining rows). Since these are only references, they won't use much space.
I have found some documentation about selections, but it's not very clear. The code examples are in C, which is a bit hard for me to adapt since I mainly use Python. I haven't found anything related to this in PyTables. I have found some examples in h5py, but I couldn't figure out how to put data in the subset.
Can anybody confirm that this is a sensible approach and provide some Python code for storing a subset of rows from a dataset as another dataset?
Upvotes: 3
Views: 1249
Reputation: 924
I have been able to do almost that, thanks to the h5py community. See the thread here.
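For readers who cannot follow the thread, here is a minimal sketch of the general idea using h5py region references. This is not necessarily what the thread arrived at. The file name, the dataset names ("data", "test_rows") and the tiny sizes are all assumptions for illustration, and it assumes k divides n. It stores one region reference per fold pointing at that fold's test rows, and rebuilds the training set in memory from the other folds.

```python
import os
import tempfile

import h5py
import numpy as np

n, p, k = 12, 3, 4   # small illustrative sizes (assumed)
fold = n // k        # assumes k divides n

# Hypothetical file location, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "cv.h5")

with h5py.File(path, "w") as f:
    data = f.create_dataset(
        "data", data=np.arange(n * p, dtype="f8").reshape(n, p)
    )
    # A small dataset holding one region reference per fold.
    ref_dtype = h5py.special_dtype(ref=h5py.RegionReference)
    refs = f.create_dataset("test_rows", (k,), dtype=ref_dtype)
    for i in range(k):
        # The reference records which rows are selected; no rows are copied.
        refs[i] = data.regionref[i * fold:(i + 1) * fold, :]

with h5py.File(path, "r") as f:
    data = f["data"]
    refs = f["test_rows"]
    i = 1  # the fold held out for testing
    test = data[refs[i]]
    # Training rows are read fold by fold and concatenated in memory.
    train = np.concatenate([data[refs[j]] for j in range(k) if j != i])
```

Only the references are written to the file, so the on-disk overhead is small. The training set is assembled from contiguous blocks here because I am not certain how reading through a single reference to a non-contiguous selection behaves.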
Upvotes: 1