Reputation: 57
What is the correct way to go about creating a dask.dataframe from a list of HDF5 files? I basically want to do this but with a dataframe
import h5py
import dask.array as da
from glob import glob
dsets = [h5py.File(fn, 'r')['/data'] for fn in sorted(glob('myfiles.*.hdf5'))]
arrays = [da.from_array(dset, chunks=(1000, 1000)) for dset in dsets]
x = da.stack(arrays, axis=0)
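For reference, a self-contained version of this dask.array idiom using small synthetic files (the filenames, shapes, and chunk sizes are placeholders, not anything prescribed by dask):

```python
import os
import tempfile

import dask.array as da
import h5py
import numpy as np

tmpdir = tempfile.mkdtemp()
paths = [os.path.join(tmpdir, 'demo.%d.hdf5' % i) for i in range(2)]

# Write two small HDF5 files, each holding a 4x4 '/data' dataset.
for i, path in enumerate(paths):
    with h5py.File(path, 'w') as f:
        f.create_dataset('/data', data=np.full((4, 4), i))

# Wrap each on-disk dataset lazily, then stack along a new leading axis.
dsets = [h5py.File(path, 'r')['/data'] for path in paths]
arrays = [da.from_array(dset, chunks=(2, 2)) for dset in dsets]
x = da.stack(arrays, axis=0)

print(x.shape)            # (2, 4, 4)
print(x.sum().compute())  # 16
```

Note that the file handles must stay open for as long as the dask array is used, since computation reads from them lazily.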
Upvotes: 3
Views: 1230
Reputation: 57301
Briefly: if your individual files can be read with pd.read_hdf, then you can read and combine them with dd.read_hdf and dd.concat.
from glob import glob
import dask.dataframe as dd
dfs = [dd.read_hdf(fn, '/data') for fn in sorted(glob('myfiles.*.hdf5'))]
df = dd.concat(dfs)
But it would be useful (and easy) to support this idiom within dd.read_hdf
directly. I've created an issue for this and will try to get to it in the next couple of days.
Upvotes: 1