Reda E.

Reputation: 878

Build a dask dataframe from a list of dask delayed objects

I have a list of dask delayed objects, Portfolio_perfs:

type(Portfolio_perfs)
<class 'list'>
# print the first three elements
Portfolio_perfs[:3]
[Delayed('getitem-b7fd8629e2a0ecfe4e61ae6f39926140'), Delayed('getitem-af3225459229d541b73dc79319edaec2'), Delayed('getitem-0555389e6dd01031de85e293b8c42b85')]

Each delayed object computes to a numpy array of length 2:

Portfolio_perfs[0].compute()
array([0.75620425, 0.1835988 ])

I want to build the following dataframe without using dask.compute:

pd.DataFrame(dask.compute(*Portfolio_perfs))
            0         1
0    0.756204  0.183599
1    0.825101  0.195705
2    0.792804  0.189422
3    0.786267  0.178194
4    0.860377  0.220204
..        ...       ...
595  0.636857  0.139955
596  0.925144  0.218462
597  0.925077  0.213963
598  0.922016  0.206081
599  0.770950  0.170273

[600 rows x 2 columns]

How can I build this as a dask dataframe without calling dask.compute? Thank you.

Upvotes: 1

Views: 1825

Answers (2)

SultanOrazbayev

Reputation: 16581

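Since each delayed object computes to a numpy array, you are interested in da.from_delayed(). It wraps a single delayed object and needs the shape and dtype of the result, so apply it to each element of the list and stack the pieces:

dask_array = da.stack([da.from_delayed(p, shape=(2,), dtype=float) for p in Portfolio_perfs])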

Alternatively, convert each numpy array to a pandas dataframe inside a delayed call and then pass the resulting list to:

dd.from_delayed()

Note that it's not possible to do it with pd.DataFrame because pandas will not know what to do with the delayed objects, so you will need to use dask.dataframe for this task.

Upvotes: 1

Reda E.

Reputation: 878

I tried calling dd.from_delayed directly on the list of delayed numpy arrays, but got the following error:

dd.from_delayed(Portfolios_perfs)
TypeError: Expected partition to be DataFrame, Series, or Index, got numpy.ndarray

I had to convert each numpy array to a dataframe before using dd.from_delayed(). Each delayed object now computes to a one-row dataframe:

Portfolios_perfs[0].compute()
          0         1
0  0.764544  0.176615
#
dd_final=dd.from_delayed(Portfolios_perfs)
dd_final
Dask DataFrame Structure:
                       0        1
npartitions=300
                 float64  float64
                     ...      ...
...                  ...      ...
                     ...      ...
                     ...      ...
Dask Name: from-delayed, 900 tasks
#
#
dd_final.compute()
           0         1
0   0.764544  0.176615
0   0.753957  0.176094
0   0.891951  0.180247
0   0.813954  0.180084
0   1.089214  0.260875
..       ...       ...
0   0.655544  0.138117
0   0.944792  0.233119
0   0.720967  0.157746
0   0.774837  0.181025
0   0.770270  0.165283

[300 rows x 2 columns]

Upvotes: 1
