Reputation: 57251
I have a dask dataframe object but would like to have a dask array. How do I accomplish this?
Upvotes: 6
Views: 3852
Reputation: 57251
There are three ways to do this:
1. Use the to_dask_array() method.
2. Use the .values attribute, or the to_records() method, like with Pandas.
3. Use map_partitions to call any function that converts a pandas dataframe into a numpy array on all of the partitions.
Here is an example doing all three.
>>> import dask
>>> df = dask.datasets.timeseries()
>>> df
Dask DataFrame Structure:
id name x y
npartitions=30
2000-01-01 int64 object float64 float64
2000-01-02 ... ... ... ...
... ... ... ... ...
2000-01-30 ... ... ... ...
2000-01-31 ... ... ... ...
Dask Name: make-timeseries, 30 tasks
>>> import numpy as np
>>> df.map_partitions(np.asarray)
dask.array<asarray, shape=(nan, 4), dtype=object, chunksize=(nan, 4)>
>>> df.to_dask_array()
dask.array<array, shape=(nan, 4), dtype=object, chunksize=(nan, 4)>
>>> df.values
dask.array<values, shape=(nan, 4), dtype=object, chunksize=(nan, 4)>
>>> df.to_records() # note that this returns a record array
dask.array<to_records, shape=(nan,), dtype=(numpy.record, [('timestamp', 'O'), ('id', '<i8'), ('name', 'O'), ('x', '<f8'), ('y', '<f8')]), chunksize=(nan,)>
>>> dask.__version__
'0.19.0'
Note that because Dask dataframes don't maintain the number of rows in each chunk, the resulting arrays also won't have this information (note the NaN values in the shape and chunk size).
Upvotes: 7