Reputation: 35
Task: read larger-than-memory CSV files, convert them to arrays, and store them in HDF5. One simple way is to use pandas to read the files in chunks, but I wanted to use dask, so far without success:
Latest attempt:
import dask.dataframe as dd

fname = 'test.csv'
dset = dd.read_csv(fname, sep=',', skiprows=0, header=None)
dset.to_records().to_hdf5('/tmp/test.h5', '/x')
How could I do this?
Actually, I have a set of CSV files representing 2D slices of a 3D array that I would like to assemble and store. A suggestion on how to do the latter would be welcome as well.
Given the comments below, here is one of many variations I tried:
import dask.array as da

dset = dd.read_csv(fname, sep=',', skiprows=0, header=None, dtype='f8')
# num_csv_records / num_csv_cols are my own helpers that return the
# file's row and column counts
shape = (num_csv_records(fname), num_csv_cols(fname))
arr = da.Array(dset.dask, 'arr12345', (500 * 10, shape[1]), 'f8', shape)
da.to_hdf5('/tmp/test.h5', '/x', arr)
which results in the error: KeyError: ('arr12345', 77, 0)
Upvotes: 2
Views: 1013
Reputation: 28673
You will probably want to do something like the following. The real crux of the problem is that, in the read_csv case, dask doesn't know the number of rows in the data before a full load, so the resulting dataframe has an unknown length (as is usually the case for dataframes). Arrays, on the other hand, generally need to know their complete shape for most operations. In your case you have extra information, so you can sidestep the problem.
Here is an example.
Data
0,1,2
2,3,4
Code
import dask.dataframe as dd

dset = dd.read_csv('data', sep=',', skiprows=0, header=None)
# lengths=True tells dask to first compute each partition's row count,
# giving the resulting array a fully known shape
arr = dset.astype('float').to_dask_array(lengths=True)
arr.to_hdf5('/test.h5', '/x')
Where "True" means "find the lengths", or you can supply your own set of values.
Upvotes: 2
Reputation: 57251
You should use the to_hdf method on dask dataframes instead of on dask arrays:
import dask.dataframe as dd
df = dd.read_csv('myfile.csv')
df.to_hdf('myfile.hdf', '/data')
Alternatively, you might consider using Parquet. This will be faster and is simpler in many ways:
import dask.dataframe as dd
df = dd.read_csv('myfile.csv')
df.to_parquet('myfile.parquet')
For more information, see the documentation on creating and storing dask dataframes: http://docs.dask.org/en/latest/dataframe-create.html
If for some reason you really want to convert to a dask array first, then you'll need to figure out how many rows each chunk of your data has and assign that to the chunks attribute (a sketch follows). See http://docs.dask.org/en/latest/array-chunks.html#unknown-chunks . I don't recommend this approach, though; it's needlessly complex.
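For completeness, a minimal sketch of that route; the map_partitions(len) idiom computes one row count per partition at the cost of an extra pass over the data, and the file names here are hypothetical, following the examples above:

import dask.dataframe as dd
import dask.array as da

df = dd.read_csv('myfile.csv')

# Compute the number of rows in each partition (one extra pass over the
# data), then hand those lengths to to_dask_array so the chunks are known.
lengths = df.map_partitions(len).compute()
arr = df.to_dask_array(lengths=list(lengths))

# With fully known chunks, writing the array to HDF5 works.
da.to_hdf5('myfile.h5', '/data', arr)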
Upvotes: 1