John Stud

Reputation: 1779

How to save variable-sized data to H5PY file?

The data set I am using is too large to fit into memory, so I do the computations in batches and save each batch's results to a file.

My problem is that the last batch is not saved to the h5py file, almost certainly because its size differs from that of all the previous batches. Is there any way to make the chunks more flexible?

Consider the following MWE:

import h5py
import numpy as np
import pandas as pd
from more_itertools import chunked

df = pd.DataFrame({'data': np.random.random(size=113)})
chunk_size = 10
index_chunks = chunked(df.index, chunk_size)

with h5py.File('SO.h5', 'w') as f:
    dset = f.create_dataset('test', shape=(len(df), ), maxshape=(None, ), chunks=True, dtype=np.float32)

    for step, i in enumerate(index_chunks):
        temp_df = df.iloc[i]
        dset = f['test']
        start = step*len(i)
        dset[start:start+len(i)] = temp_df['data']
        dset.attrs['last_index'] = (step+1)*len(i)
# check data
with h5py.File('SO.h5', 'r') as f:
    print('last entry:', f['test'][-10::])  # yields 3 empty values because it did not match the usual batch size

Upvotes: 1

Views: 319

Answers (1)

Mad Physicist

Reputation: 114230

Your indexing is wrong. The values of step and i go like this:

 0,   0 ...   9
 1,  10 ...  19
 2,  20 ...  29
...
 9,  90 ...  99
10, 100 ... 109
11, 110 ... 112

For step == 11, len(i) == 3. That makes start = step * len(i) evaluate to 11 * 3 == 33, while you are expecting 11 * 10 == 110. You are simply writing to the wrong location. If you inspect the fourth chunk (indices 30 to 39), you will likely find that its fourth, fifth and sixth elements (indices 33 to 35) have been overwritten by the missing data.
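The arithmetic can be checked without touching h5py at all. The following is only a sketch: it rebuilds the same 113-row, chunk-size-10 layout with plain lists (the names `mismatches`, `buggy_start` and `true_start` are mine, not from the question) and records every step where step * len(i) disagrees with the chunk's real starting index.

```python
# Rebuild the chunk layout from the question: 113 rows in chunks of 10.
n_rows, chunk_size = 113, 10
chunks = [
    list(range(s, min(s + chunk_size, n_rows)))
    for s in range(0, n_rows, chunk_size)
]

mismatches = []
for step, i in enumerate(chunks):
    buggy_start = step * len(i)  # offset computed from the *current* chunk length
    true_start = i[0]            # index where this chunk really begins
    if buggy_start != true_start:
        mismatches.append((step, len(i), buggy_start, true_start))

print(mismatches)  # [(11, 3, 33, 110)]
```

Only the final, shorter chunk is affected, which is why every other batch appears to be saved correctly.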

Here is a possible workaround, which keeps a running offset instead of multiplying by the current chunk length:

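last = 0  # running offset: end of the previous chunk
for step, i in enumerate(index_chunks):
    temp_df = df.iloc[i]
    dset = f['test']
    first = last           # this chunk starts where the previous one ended
    last = first + len(i)  # valid for any chunk length, including the short final one
    dset[first:last] = temp_df['data']
    dset.attrs['last_index'] = last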
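For completeness, here is a self-contained sketch of the same idea that can be run end to end, assuming h5py and NumPy are installed. It is a variation rather than your exact code: it slices a plain NumPy array by absolute offsets from range() instead of using pandas and chunked, and it writes to a scratch file in a temporary directory (the file name `sketch.h5` is hypothetical).

```python
import os
import tempfile

import h5py
import numpy as np

# Stand-in for the real data; cast to float32 so it matches the dataset dtype exactly.
data = np.random.random(size=113).astype(np.float32)
chunk_size = 10
path = os.path.join(tempfile.mkdtemp(), "sketch.h5")  # hypothetical scratch file

with h5py.File(path, "w") as f:
    dset = f.create_dataset("test", shape=(len(data),), maxshape=(None,),
                            chunks=True, dtype=np.float32)
    # Walk over absolute offsets; the final slice is simply shorter (3 items here).
    for first in range(0, len(data), chunk_size):
        last = min(first + chunk_size, len(data))
        dset[first:last] = data[first:last]
        dset.attrs["last_index"] = last

# Read everything back and compare with what was written.
with h5py.File(path, "r") as f:
    stored = f["test"][:]
    last_index = int(f["test"].attrs["last_index"])

print(np.array_equal(stored, data), last_index)  # True 113
```

Because the offsets never depend on the length of the current batch, the short final batch lands at positions 110 to 112 as intended.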

Upvotes: 1
