edesz

Reputation: 12396

Load HDF file into list of Python Dask DataFrames

I have an HDF5 file that I would like to load into a list of Dask DataFrames. I have set this up using a loop, following an abbreviated version of the Dask pipeline approach. Here is the code:

import pandas as pd
from dask import compute, delayed
import dask.dataframe as dd
import os, h5py

@delayed
def load(d,k):
    ddf = dd.read_hdf(os.path.join(d,'Cleaned.h5'), key=k)
    return ddf

if __name__ == '__main__':      
    d = r'C:\Users\User\FileD'  # raw string so the backslashes are not treated as escapes
    loaded = [load(d,'/DF'+str(i)) for i in range(1,10)]

    ddf_list = compute(*loaded)
    print(ddf_list[0].head(),ddf_list[0].compute().shape)

I get this error message:

C:\Python27\lib\site-packages\tables\group.py:1187: UserWarning: problems loading leaf ``/DF1/table``::

  HDF5 error back trace

  File "..\..\hdf5-1.8.18\src\H5Dio.c", line 173, in H5Dread
    can't read data
  File "..\..\hdf5-1.8.18\src\H5Dio.c", line 543, in H5D__read
    can't initialize I/O info
  File "..\..\hdf5-1.8.18\src\H5Dchunk.c", line 841, in H5D__chunk_io_init
    unable to create file chunk selections
  File "..\..\hdf5-1.8.18\src\H5Dchunk.c", line 1330, in H5D__create_chunk_file_map_hyper
    can't insert chunk into skip list
  File "..\..\hdf5-1.8.18\src\H5SL.c", line 1066, in H5SL_insert
    can't create new skip list node
  File "..\..\hdf5-1.8.18\src\H5SL.c", line 735, in H5SL_insert_common
    can't insert duplicate key

End of HDF5 error back trace

Problems reading the array data.

The leaf will become an ``UnImplemented`` node.
  % (self._g_join(childname), exc))

The message mentions a duplicate key. To test the code, I iterated over the first 9 keys and, in the loop, each iteration assembles a different key that I pass to dd.read_hdf. Across all iterations, the filename stays the same - only the key changes.

I need to use dd.concat(list, axis=0, ...) to vertically concatenate the contents of the file. My approach was to load the keys into a list first and then concatenate them.
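For reference, the vertical concatenation itself can be sketched with plain pandas; the frame contents below are made up purely for illustration and stand in for the per-key DataFrames read back from the HDF5 file:

```python
import pandas as pd

# Three small frames with identical columns, standing in for the
# DataFrames stored under keys '/DF1', '/DF2', ...
parts = [pd.DataFrame({'a': [i, i], 'b': [i * 10, i * 10]}) for i in range(3)]

# axis=0 stacks the frames vertically; ignore_index=True rebuilds the
# row index so it does not repeat across the pieces.
stacked = pd.concat(parts, axis=0, ignore_index=True)
print(stacked.shape)  # (6, 2)
```

dd.concat follows the same axis=0 convention as pd.concat, so the shape of the combined result is the sum of the row counts of the pieces.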

I have PyTables and h5py installed, and Dask version 0.14.3+2.

With Pandas 0.20.1, I seem to get this to work:

for i in range(1,10):
    hdf = pd.HDFStore(os.path.join(d,'Cleaned.h5'),mode='r')
    df = hdf.get('/DF{}'.format(i))
    print df.shape
    hdf.close()

Is there a way I can load this HDF5 file into a list of Dask DataFrames? Or is there another approach to vertically concatenate them together?

Upvotes: 2

Views: 2685

Answers (1)

MRocklin

Reputation: 57251

Dask.dataframe is already lazy, so there is no need to use dask.delayed to make it lazier. You can just call dd.read_hdf repeatedly:

keys = ['/DF{}'.format(i) for i in range(1, 10)]

ddfs = [dd.read_hdf(os.path.join(d, 'Cleaned.h5'), key=k)
        for k in keys]

ddf = dd.concat(ddfs)

Upvotes: 5
