MRocklin

How do I convert a Dask Dataframe into a Dask Array?

I have a dask dataframe object but would like to have a dask array. How do I accomplish this?

Upvotes: 6

Views: 3852

Answers (1)

MRocklin

There are three ways to do this:

  1. Use the aptly named .to_dask_array() method
  2. Use the .values attribute or the .to_records() method, as you would with Pandas
  3. Use map_partitions to apply any function that converts a pandas DataFrame into a NumPy array across all of the partitions

Here is an example showing all three:

>>> import dask

>>> df = dask.datasets.timeseries()

>>> df
Dask DataFrame Structure:
                   id    name        x        y
npartitions=30                                 
2000-01-01      int64  object  float64  float64
2000-01-02        ...     ...      ...      ...
...               ...     ...      ...      ...
2000-01-30        ...     ...      ...      ...
2000-01-31        ...     ...      ...      ...
Dask Name: make-timeseries, 30 tasks

>>> import numpy as np

>>> df.map_partitions(np.asarray)
dask.array<asarray, shape=(nan, 4), dtype=object, chunksize=(nan, 4)>

>>> df.to_dask_array()
dask.array<array, shape=(nan, 4), dtype=object, chunksize=(nan, 4)>

>>> df.values
dask.array<values, shape=(nan, 4), dtype=object, chunksize=(nan, 4)>

>>> df.to_records()  # note that this returns a record array
dask.array<to_records, shape=(nan,), dtype=(numpy.record, [('timestamp', 'O'), ('id', '<i8'), ('name', 'O'), ('x', '<f8'), ('y', '<f8')]), chunksize=(nan,)

>>> dask.__version__
0.19.0

Note that because Dask DataFrames don't track the number of rows in each partition, the resulting arrays won't have this information either (note the nan values in the shape and chunksize above).

Upvotes: 7
