Storing HDF5 subsets as datasets (in Python)

Question

I have only limited knowledge of HDF5, but I would like to understand HDF5 selections better.

To give some context, I'm interested in using HDF5 for machine learning applications. Suppose you have a data matrix with n rows and p columns. In a typical k-fold cross-validation setting, you split the matrix into k folds (each of size (n/k, p)) and repeatedly use k-1 of them for training and the remaining one for testing. Of course, storing every training and testing set separately would use a lot of space. This is where HDF5 selections could help.

If I understand correctly, a selection can refer to any subset of a dataset. Moreover, a selection can be stored in a dataset. Therefore, starting from an (n, p) dataset in HDF5, I could create k groups (one per fold), each containing a training dataset (a subset of the rows of the original dataset) and a testing dataset (the remaining rows). Since these are only references, they shouldn't use much space.
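To make the idea concrete, here is a minimal sketch of what I have in mind, assuming h5py's region references (`Dataset.regionref` and `h5py.regionref_dtype`) are the right tool. The file name, the dataset names `X` and `test_refs`, and the toy sizes are made up for illustration, and n is assumed to be divisible by k so that each test fold is a contiguous block of rows.

```python
import h5py
import numpy as np

n, p, k = 12, 3, 4          # toy sizes; n is assumed divisible by k
fold = n // k               # number of rows per fold
data = np.arange(n * p, dtype="f8").reshape(n, p)

# In-memory file (core driver) so the sketch leaves nothing on disk.
with h5py.File("folds_demo.h5", "w", driver="core", backing_store=False) as f:
    X = f.create_dataset("X", data=data)

    # One region reference per fold, each pointing at that fold's test rows.
    test_refs = f.create_dataset("test_refs", (k,), dtype=h5py.regionref_dtype)
    for i in range(k):
        test_refs[i] = X.regionref[i * fold:(i + 1) * fold]

    # Read fold 1 back: the reference is resolved against X on read,
    # so the test rows were never copied into a second dataset.
    i = 1
    ref = test_refs[i]
    X_test = X[ref]

    # The training rows are the complement of the test rows.
    X_train = np.concatenate([X[:i * fold], X[(i + 1) * fold:]])
```

Note that in this sketch a stored reference is not a dataset in its own right: it only records which part of `X` is selected, and it has to be applied to `X` whenever the rows are read.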

I have found some documentation about selections, but it's not very clear. The code examples are in C, which is hard for me to adapt since I mainly use Python. I haven't found anything related to this in PyTables. I have found some examples in h5py, but I couldn't figure out how to put data in the subset.

Can anybody confirm whether this is a sensible approach, and provide some Python code for storing a subset of rows from a dataset as another dataset?
