Reputation: 57
What is the correct way to go about creating a dask.dataframe from a list of HDF5 files? I basically want to do this but with a dataframe
import h5py
import dask.array as da
from glob import glob
dsets = [h5py.File(fn, 'r')['/data'] for fn in sorted(glob('myfiles.*.hdf5'))]
arrays = [da.from_array(dset, chunks=(1000, 1000)) for dset in dsets]
x = da.stack(arrays, axis=0)
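For reference, a self-contained version of this dask.array idiom using small synthetic files (the filenames, shapes, and chunk sizes are placeholders, not anything prescribed by dask):

```python
import os
import tempfile

import dask.array as da
import h5py
import numpy as np

tmpdir = tempfile.mkdtemp()
paths = [os.path.join(tmpdir, 'demo.%d.hdf5' % i) for i in range(2)]

# Write two small HDF5 files, each holding a 4x4 '/data' dataset.
for i, path in enumerate(paths):
    with h5py.File(path, 'w') as f:
        f.create_dataset('/data', data=np.full((4, 4), i))

# Wrap each on-disk dataset lazily, then stack along a new leading axis.
dsets = [h5py.File(path, 'r')['/data'] for path in paths]
arrays = [da.from_array(dset, chunks=(2, 2)) for dset in dsets]
x = da.stack(arrays, axis=0)

print(x.shape)            # (2, 4, 4)
print(x.sum().compute())  # 16
```

Note that the file handles must stay open for as long as the dask array is used, since computation reads from them lazily.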
Upvotes: 3
Views: 1230
Reputation: 57301
Briefly: if your individual files can be read with pd.read_hdf, then you can read and combine them with dd.read_hdf and dd.concat.
from glob import glob
import dask.dataframe as dd
dfs = [dd.read_hdf(fn, '/data') for fn in sorted(glob('myfiles.*.hdf5'))]
df = dd.concat(dfs)
But it would be useful (and easy) to support this idiom within dd.read_hdf
directly. I've created an issue for this and will try to get to it in the next couple of days.
Upvotes: 1