ea42_gh

Reputation: 35

Use dask to store larger-than-memory CSV file(s) in an HDF5 file

Task: read larger-than-memory CSV files, convert them to arrays, and store them in HDF5. One simple way is to use pandas to read the files in chunks, but I wanted to use dask; so far without success.
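For reference, the chunked pandas route looks roughly like this (just a sketch; the file name, chunk size, and HDF5 key are placeholders):

import pandas as pd

fname = 'test.csv'

# Read the CSV in chunks and append each chunk to a single HDF5 table.
for chunk in pd.read_csv(fname, sep=',', header=None, chunksize=500000):
    # Give the columns string names so the HDF5 table format accepts them.
    chunk.columns = ['c%d' % i for i in chunk.columns]
    chunk.to_hdf('/tmp/test.h5', key='/x', mode='a', append=True, format='table')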

My latest dask attempt:

import dask.dataframe as dd

fname = 'test.csv'
dset = dd.read_csv(fname, sep=',', skiprows=0, header=None)
# Fails: the dask array produced by to_records() has unknown chunk sizes.
dset.to_records().to_hdf5('/tmp/test.h5', '/x')

How could I do this?

Actually, I have a set of CSV files representing 2D slices of a 3D array that I would like to assemble and store. A suggestion on how to do the latter would be welcome as well.

Given the comments below, here is one of many variations I tried:

import dask.dataframe as dd
import dask.array as da

# num_csv_records / num_csv_cols are helpers (not shown) returning row and column counts
dset  = dd.read_csv(fname, sep=',', skiprows=0, header=None, dtype='f8')
shape = (num_csv_records(fname), num_csv_cols(fname))
arr   = da.Array(dset.dask, 'arr12345', (500*10, shape[1]), 'f8', shape)
da.to_hdf5('/tmp/test.h5', '/x', arr)

which results in the error: KeyError: ('arr12345', 77, 0)

Upvotes: 2

Views: 1013

Answers (2)

mdurant

Reputation: 28673

You will probably want to do something like the following. The real crux of the problem is that, in the read_csv case, dask doesn't know the number of rows of the data before a full load, and therefore the resulting dataframe has an unknown length (as is the usual case for dataframes). Arrays, on the other hand, generally need to know their complete shape for most operations. In your case you have extra information, so you can sidestep the problem.

Here is an example.

Data

0,1,2
2,3,4

Code

import dask.dataframe as dd

dset = dd.read_csv('data', sep=',', skiprows=0, header=None)
# True (the lengths argument) tells dask to compute the chunk sizes up front
arr = dset.astype('float').to_dask_array(True)
arr.to_hdf5('/test.h5', '/x')

Here True means "find the lengths"; alternatively, you can supply your own set of chunk lengths.
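For example, with the two-row data above (which fits in a single partition), supplying the lengths yourself would look like this:

# Pass the known per-partition row counts instead of having dask compute them.
arr = dset.astype('float').to_dask_array(lengths=[2])
arr.to_hdf5('/test.h5', '/x')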

Upvotes: 2

MRocklin

Reputation: 57251

You should use the to_hdf method on dask dataframes instead of on dask arrays:

import dask.dataframe as dd
df = dd.read_csv('myfile.csv')
df.to_hdf('myfile.hdf', '/data')

Alternatively, you might consider using parquet. This will be faster and is simpler in many ways:

import dask.dataframe as dd
df = dd.read_csv('myfile.csv')
df.to_parquet('myfile.parquet')

For more information, see the documentation on creating and storing dask dataframes: http://docs.dask.org/en/latest/dataframe-create.html

For arrays

If for some reason you really want to convert to a dask array first, then you'll need to figure out how many rows each chunk of your data has and assign that to the chunks attribute. See http://docs.dask.org/en/latest/array-chunks.html#unknown-chunks . I don't recommend this approach, though; it's needlessly complex.
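For completeness, a rough sketch of that route, applied to the stack-of-2D-slices case from the question (slice file names are placeholders, and this assumes the CSVs contain only numeric values):

import dask.dataframe as dd
import dask.array as da

# One CSV file per 2D slice of the 3D array (placeholder names).
slice_files = ['slice_000.csv', 'slice_001.csv']

slices = []
for f in slice_files:
    df = dd.read_csv(f, header=None, dtype='f8')
    # Compute the row count of every partition so the chunk sizes are known.
    lengths = df.map_partitions(len).compute()
    slices.append(df.to_dask_array(lengths=tuple(lengths)))

# Stack the 2D slices into a 3D array and write it out.
vol = da.stack(slices, axis=0)
da.to_hdf5('volume.h5', '/x', vol)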

Upvotes: 1
