Reputation: 1779
The data set that I am using is too large to fit into memory, so I do the computations in batches and save each batch of results to file.
The problem is that my last batch is not saved correctly to my h5py file, almost certainly because the final batch size differs from all the previous ones. Is there any way I can make the chunks more flexible?
Consider the following MWE:
import h5py
import numpy as np
import pandas as pd
from more_itertools import chunked
df = pd.DataFrame({'data': np.random.random(size=113)})
chunk_size = 10
index_chunks = chunked(df.index, chunk_size)
with h5py.File('SO.h5', 'w') as f:
    dset = f.create_dataset('test', shape=(len(df),), maxshape=(None,), chunks=True, dtype=np.float32)
    for step, i in enumerate(index_chunks):
        temp_df = df.iloc[i]
        dset = f['test']
        start = step * len(i)
        dset[start:start + len(i)] = temp_df['data']
        dset.attrs['last_index'] = (step + 1) * len(i)

# check data
with h5py.File('SO.h5', 'r') as f:
    print('last entry:', f['test'][-10::])  # yields 3 empty values because it did not match the usual batch size
Upvotes: 1
Views: 319
Reputation: 114230
Your indexing is wrong. step, i goes like this:

 0,   0 ...   9
 1,  10 ...  19
 2,  20 ...  29
...
 9,  90 ...  99
10, 100 ... 109
11, 110 ... 112
For step == 11, len(i) == 3. That makes start = step * len(i) evaluate to 11 * 3 == 33, while you're expecting 11 * 10 == 110. You're simply writing to the wrong location: the last three values land at indices 33:36 instead of 110:113. If you inspect the data in the fourth chunk (indices 30:40), you will likely find that its fourth, fifth and sixth elements have been overwritten by the missing data.
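As a quick sanity check (a sketch, assuming the buggy script above has already been run and produced SO.h5), you can compare the misplaced region against the unwritten tail:

import h5py

# Indices 30:40 form the fourth chunk; positions 33:36 should hold
# the values that were meant for 110:113, which remain at the
# dataset's fill value (zeros).
with h5py.File('SO.h5', 'r') as f:
    print('fourth chunk :', f['test'][30:40])
    print('intended tail:', f['test'][110:113])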
Here is a possible workaround that tracks the running write offset explicitly instead of recomputing it from the batch size:

last = 0
for step, i in enumerate(index_chunks):
    temp_df = df.iloc[i]
    dset = f['test']
    first = last            # write offset carried over from the previous batch
    last = first + len(i)   # advances by the actual batch length, even for the short final batch
    dset[first:last] = temp_df['data']
    dset.attrs['last_index'] = last
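An equivalent one-line fix (a minimal variant, not from the original answer) keeps your loop as-is and derives the offset from the fixed chunk_size rather than the current batch length; it assumes the same with-block and variables as the MWE:

for step, i in enumerate(index_chunks):
    temp_df = df.iloc[i]
    dset = f['test']
    start = step * chunk_size            # offset from the fixed chunk size, not len(i)
    dset[start:start + len(i)] = temp_df['data']
    dset.attrs['last_index'] = start + len(i)

With this version the last batch writes to 11 * 10 == 110 as expected, so f['test'][-10:] is fully populated.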
Upvotes: 1