Reputation: 31
I am completely stuck, hence I am looking for kind advice. My aim is to read many hdf5 files in parallel, extract the multi-dimensional arrays inside, and store each array in one row (more precisely, one cell) of a dask dataframe. I am not opting for a pandas dataframe because I believe the data will be too big to fit in memory.
So far I could not get read_hdf() from dask to work on the hdf5 files created with h5py (see the example below, which breaks on the last line).
What could I do to import thousands of hdf5 files with dask in parallel and get access to the multi-dimensional arrays inside? How would you realise this, please? xarray was suggested to me as well, but I don't know what the best way is. Earlier I tried to collect all arrays in one multi-dimensional dask array, but the conversion to a dataframe is only possible for ndim=2.
Thank you for your advice. Have a good day.
import numpy as np
import h5py
import dask.dataframe as dd
import dask.array as da
import dask
print('This is dask version', dask.__version__)
ra=np.ones([10,3199,4000])
print(ra.shape)
file_list=[]
for i in range(0, 4):
    #print(i)
    fstr = 'data_{0}.h5'.format(str(i))
    #print(fstr)
    hf = h5py.File('./' + fstr, 'w')
    hf.create_dataset('dataset_{0}'.format(str(i)), data=ra)
    hf.close()
    file_list.append(fstr)

!ls
print(file_list)

for i, fn in enumerate(file_list):
    dd.read_hdf(fn, key='dataset_{0}'.format(str(i)))  # breaks here
Upvotes: 1
Views: 1924
Reputation: 15432
You can pre-process the data into dataframes using dask.distributed and then convert the futures to a single dask.dataframe using dask.dataframe.from_delayed.
from dask.distributed import Client
import dask.dataframe as dd
import xarray as xr

client = Client()

def preprocess_hdf_file_to_dataframe(filepath):
    # process your data into a dataframe however you want, e.g.
    with xr.open_dataset(filepath) as ds:
        return ds.to_dataframe()

files = ['file1.hdf5', 'file2.hdf5']
futures = client.map(preprocess_hdf_file_to_dataframe, files)
df = dd.from_delayed(futures)
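Continuing from the snippet above: if dask has trouble inferring the output columns and dtypes from the futures, you can pass them explicitly via from_delayed's meta argument. A minimal sketch, where the column names value and file_id are hypothetical placeholders for whatever your preprocessing function actually returns:
import pandas as pd

# hypothetical column layout; replace with the columns your function returns
meta = pd.DataFrame({'value': pd.Series(dtype='float64'),
                     'file_id': pd.Series(dtype='int64')})
df = dd.from_delayed(futures, meta=meta)
print(df.head())  # triggers computation of the first partition only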
That said, this seems like a perfect use case for xarray, which can read HDF5 files and work with dask natively, e.g.
ds = xr.open_mfdataset(files)
This dataset is similar to a dask.dataframe, in that it contains references to dask.arrays which are read from the file. But xarray is built to handle N-dimensional arrays natively and can work much more naturally with the HDF5 format.
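For example, a minimal sketch of getting at the underlying dask array, continuing with the ds opened above (the variable name 'dataset_0' is hypothetical; use whatever names your files actually contain, and this assumes the files share dimensions so open_mfdataset can combine them):
arr = ds['dataset_0'].data          # hypothetical variable name; arr is a dask.array
print(arr.shape, arr.chunks)        # shape/chunking come from the files, nothing is loaded yet
print(arr.mean().compute())         # computation only happens on .compute()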
There are certainly areas where dataframes make more sense than a Dataset or DataArray, though, and converting between them can be tricky with larger-than-memory data, so the first approach is always an option if you want a dataframe.
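If you do end up needing a dataframe, a dask-backed Dataset can also be converted lazily with Dataset.to_dask_dataframe(); a minimal sketch, again using the ds from open_mfdataset above:
# one column per data variable, one row per point of the flattened dimensions;
# the result is a dask dataframe, so it stays lazy and larger-than-memory friendly
ddf = ds.to_dask_dataframe()
print(ddf.columns)
print(ddf.head())   # computes only the first partition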
Upvotes: 2