ea42_gh

Reputation: 35

Use dask to store larger-than-memory CSV file(s) in an HDF5 file

Task: read larger-than-memory CSV files, convert them to arrays, and store them in HDF5. One simple way is to use pandas to read the files in chunks, but I wanted to use dask; so far without success.
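For reference, the chunked pandas route looks roughly like this (just a sketch; the file name, chunk size, and HDF5 key are placeholders):

import pandas as pd

fname = 'test.csv'

# Read the CSV in chunks and append each chunk to a single HDF5 table.
for chunk in pd.read_csv(fname, sep=',', header=None, chunksize=500000):
    # Give the columns string names so the HDF5 table format accepts them.
    chunk.columns = ['c%d' % i for i in chunk.columns]
    chunk.to_hdf('/tmp/test.h5', key='/x', mode='a', append=True, format='table')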

My latest dask attempt:

import dask.dataframe as dd

fname = 'test.csv'
dset = dd.read_csv(fname, sep=',', skiprows=0, header=None)
# Fails: the dask array produced by to_records() has unknown chunk sizes.
dset.to_records().to_hdf5('/tmp/test.h5', '/x')

How could I do this?

Actually, I have a set of CSV files representing 2D slices of a 3D array that I would like to assemble and store. A suggestion on how to do the latter would be welcome as well.

Given the comments below, here is one of many variations I tried:

import dask.dataframe as dd
import dask.array as da

# num_csv_records / num_csv_cols are helpers (not shown) returning row and column counts
dset  = dd.read_csv(fname, sep=',', skiprows=0, header=None, dtype='f8')
shape = (num_csv_records(fname), num_csv_cols(fname))
arr   = da.Array(dset.dask, 'arr12345', (500*10, shape[1]), 'f8', shape)
da.to_hdf5('/tmp/test.h5', '/x', arr)

which results in the error: KeyError: ('arr12345', 77, 0)

Upvotes: 2

Views: 1013

Answers (2)

mdurant

Reputation: 28673

You will probably want to do something like the following. The real crux of the problem is that, in the read_csv case, dask doesn't know the number of rows of the data before a full load, and therefore the resulting dataframe has an unknown length (as is the usual case for dataframes). Arrays, on the other hand, generally need to know their complete shape for most operations. In your case you have extra information, so you can sidestep the problem.

Here is an example.

Data

0,1,2
2,3,4

Code

import dask.dataframe as dd

dset = dd.read_csv('data', sep=',', skiprows=0, header=None)
# True (the lengths argument) tells dask to compute the chunk sizes up front
arr = dset.astype('float').to_dask_array(True)
arr.to_hdf5('/test.h5', '/x')

Here True means "find the lengths"; alternatively, you can supply your own set of chunk lengths.
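For example, with the two-row data above (which fits in a single partition), supplying the lengths yourself would look like this:

# Pass the known per-partition row counts instead of having dask compute them.
arr = dset.astype('float').to_dask_array(lengths=[2])
arr.to_hdf5('/test.h5', '/x')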

Upvotes: 2

MRocklin

Reputation: 57251

You should use the to_hdf method on dask dataframes instead of on dask arrays:

import dask.dataframe as dd
df = dd.read_csv('myfile.csv')
df.to_hdf('myfile.hdf', '/data')

Alternatively, you might consider using parquet. This will be faster and is simpler in many ways:

import dask.dataframe as dd
df = dd.read_csv('myfile.csv')
df.to_parquet('myfile.parquet')

For more information, see the documentation on creating and storing dask dataframes: http://docs.dask.org/en/latest/dataframe-create.html

For arrays

If for some reason you really want to convert to a dask array first, then you'll need to figure out how many rows each chunk of your data has and assign that to the chunks attribute. See http://docs.dask.org/en/latest/array-chunks.html#unknown-chunks . I don't recommend this approach, though; it's needlessly complex.
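For completeness, a rough sketch of that route, applied to the stack-of-2D-slices case from the question (slice file names are placeholders, and this assumes the CSVs contain only numeric values):

import dask.dataframe as dd
import dask.array as da

# One CSV file per 2D slice of the 3D array (placeholder names).
slice_files = ['slice_000.csv', 'slice_001.csv']

slices = []
for f in slice_files:
    df = dd.read_csv(f, header=None, dtype='f8')
    # Compute the row count of every partition so the chunk sizes are known.
    lengths = df.map_partitions(len).compute()
    slices.append(df.to_dask_array(lengths=tuple(lengths)))

# Stack the 2D slices into a 3D array and write it out.
vol = da.stack(slices, axis=0)
da.to_hdf5('volume.h5', '/x', vol)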

Upvotes: 1
