Reputation: 1779
The data set that I am using is too large to fit into memory, so I do the computations in batches and save each batch of results to file.
The problem is that my last batch is not saved correctly to my h5py file, almost certainly because the final batch size differs from all the previous ones. Is there any way I can make the chunks more flexible?
Consider the following MWE:
import h5py
import numpy as np
import pandas as pd
from more_itertools import chunked
df = pd.DataFrame({'data': np.random.random(size=113)})
chunk_size = 10
index_chunks = chunked(df.index, chunk_size)
with h5py.File('SO.h5', 'w') as f:
    dset = f.create_dataset('test', shape=(len(df),), maxshape=(None,), chunks=True, dtype=np.float32)
    for step, i in enumerate(index_chunks):
        temp_df = df.iloc[i]
        dset = f['test']
        start = step * len(i)
        dset[start:start + len(i)] = temp_df['data']
        dset.attrs['last_index'] = (step + 1) * len(i)

# check data
with h5py.File('SO.h5', 'r') as f:
    print('last entry:', f['test'][-10::])  # yields 3 empty values because it did not match the usual batch size
Upvotes: 1
Views: 319
Reputation: 114230
Your indexing is wrong. step, i goes like this:

 0,   0 ...   9
 1,  10 ...  19
 2,  20 ...  29
...
 9,  90 ...  99
10, 100 ... 109
11, 110 ... 112
For step == 11, len(i) == 3. That makes start = step * len(i) evaluate to 11 * 3 == 33, while you're expecting 11 * 10 == 110. You're simply writing to the wrong location: the last three values land at indices 33:36 instead of 110:113. If you inspect the data in the fourth chunk (indices 30:40), you will likely find that its fourth, fifth and sixth elements have been overwritten by the missing data.
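As a quick sanity check (a sketch, assuming the buggy script above has already been run and produced SO.h5), you can compare the misplaced region against the unwritten tail:

import h5py

# Indices 30:40 form the fourth chunk; positions 33:36 should hold
# the values that were meant for 110:113, which remain at the
# dataset's fill value (zeros).
with h5py.File('SO.h5', 'r') as f:
    print('fourth chunk :', f['test'][30:40])
    print('intended tail:', f['test'][110:113])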
Here is a possible workaround that tracks the running write offset explicitly instead of recomputing it from the batch size:

last = 0
for step, i in enumerate(index_chunks):
    temp_df = df.iloc[i]
    dset = f['test']
    first = last            # write offset carried over from the previous batch
    last = first + len(i)   # advances by the actual batch length, even for the short final batch
    dset[first:last] = temp_df['data']
    dset.attrs['last_index'] = last
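An equivalent one-line fix (a minimal variant, not from the original answer) keeps your loop as-is and derives the offset from the fixed chunk_size rather than the current batch length; it assumes the same with-block and variables as the MWE:

for step, i in enumerate(index_chunks):
    temp_df = df.iloc[i]
    dset = f['test']
    start = step * chunk_size            # offset from the fixed chunk size, not len(i)
    dset[start:start + len(i)] = temp_df['data']
    dset.attrs['last_index'] = start + len(i)

With this version the last batch writes to 11 * 10 == 110 as expected, so f['test'][-10:] is fully populated.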
Upvotes: 1