PythonSmurf

Reputation: 31

How to import hdf5 data with dask in parallel and create a dask dataframe?

I am completely stuck and would appreciate some advice. My aim is to read many HDF5 files in parallel, extract the multi-dimensional arrays inside, and store each array in one row (more precisely, one cell) of a dask dataframe. I am not using a pandas dataframe because I believe the data will be too big for it.

How would you implement this? xarray was also suggested to me, but I don't know which approach is best. Earlier I tried to collect all arrays in a multi-dimensional dask array, but the conversion to a dataframe is only possible for ndim=2.

Thank you for your advice. Have a good day.

import numpy as np
import h5py
import dask.dataframe as dd
import dask.array as da
import dask
print('This is dask version', dask.__version__)


ra=np.ones([10,3199,4000])

print(ra.shape)
file_list=[]
for i in range(4):
    fstr = 'data_{0}.h5'.format(i)
    # write one dataset per file; the context manager closes the file
    with h5py.File('./' + fstr, 'w') as hf:
        hf.create_dataset('dataset_{0}'.format(i), data=ra)
    file_list.append(fstr)

!ls

print(file_list)

for i,fn in enumerate(file_list):
    dd.read_hdf(fn,key='dataset_{0}'.format(str(i))) #breaks here

Upvotes: 1

Views: 1924

Answers (1)

Michael Delgado

Reputation: 15432

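dd.read_hdf is modelled on pandas.read_hdf and expects files in the pandas (PyTables) HDF format, so it cannot read plain datasets written with h5py. Instead, you can pre-process each file into a dataframe using dask.distributed and then combine the resulting futures into a single dask.dataframe using dask.dataframe.from_delayed.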

from dask.distributed import Client
import dask.dataframe as dd
import xarray as xr

client = Client()

def preprocess_hdf_file_to_dataframe(filepath):
    # process your data into a dataframe however you want, e.g.
    with xr.open_dataset(filepath) as ds:
        return ds.to_dataframe()

files = ['file1.hdf5', 'file2.hdf5']

futures = client.map(preprocess_hdf_file_to_dataframe, files)
df = dd.from_delayed(futures)

That said, this seems like a perfect use case for xarray, which can read HDF5 files and work with dask natively, e.g.

ds = xr.open_mfdataset(files)

This dataset is similar to a dask.dataframe, in that it contains references to dask.arrays which are read from the file. But xarray is built to handle N-dimensional arrays natively and can work much more naturally with the HDF5 format.
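
To give a feel for what "references to dask.arrays which are read from the file" means, here is a small sketch that builds such a lazy N-dimensional array by hand with dask.array and h5py, without xarray. This is not what open_mfdataset does internally, only an illustration of the idea; the file names, the "data" key and the tiny shape are assumptions for the example.

```python
import os
import tempfile

import dask.array as da
import h5py
import numpy as np

# Write a few tiny stand-in files (hypothetical names, key and shape).
tmpdir = tempfile.mkdtemp()
shape = (2, 3, 4)
paths = []
for i in range(3):
    path = os.path.join(tmpdir, "data_{0}.h5".format(i))
    with h5py.File(path, "w") as hf:
        hf.create_dataset("data", data=np.full(shape, float(i)))
    paths.append(path)

# Keep the files open while the lazy arrays are in use.
handles = [h5py.File(p, "r") for p in paths]

# Wrap each on-disk dataset as a lazy dask array; nothing is read yet.
lazy = [da.from_array(h["data"], chunks=shape) for h in handles]

# Stack along a new leading "file" axis: shape (n_files, 2, 3, 4).
stacked = da.stack(lazy, axis=0)

# Data is only read from disk when a result is computed.
total = stacked.sum().compute()
per_file_mean = stacked.mean(axis=(1, 2, 3)).compute()

for h in handles:
    h.close()
```

Reductions and slices on the stacked array stay lazy and N-dimensional, which is why this route avoids the ndim=2 limitation of the dataframe conversion.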

There are certainly areas where dataframes make more sense than a Dataset or DataArray, though, and converting between them can be tricky with larger-than-memory data, so the first approach is always an option if you want a dataframe.

Upvotes: 2
