How to do row processing and item assignment in Dask

Question

Similar unanswered question: Row by row processing of a Dask DataFrame

I'm working with dataframes that are millions of rows long, so I'm trying to have all dataframe operations performed in parallel. One such operation that I need to convert to Dask is:

 for row in df.itertuples():
     ratio = row.ratio
     tmpratio = row.tmpratio
     tmplabel = row.tmplabel
     if tmpratio > ratio:
         df.loc[row.Index,'ratio'] = tmpratio
         df.loc[row.Index,'label'] = tmplabel

What is the appropriate way to set a value by index in Dask, or to conditionally set values in rows? .loc doesn't support item assignment in Dask, and there does not appear to be a set_value, at[], or iat[] in Dask either.

I have attempted to use map_partitions with assign, but I am not seeing any ability to perform conditional assignment at the row-level.

MRocklin · Accepted Answer

Dask dataframe does not support efficient iteration or row assignment. In general these workflows rarely scale well. They are also quite slow in Pandas itself.

Instead, you might consider using the Series.where method. Here is a minimal example:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 2, 1]})

In [3]: import dask.dataframe as dd

In [4]: ddf = dd.from_pandas(df, npartitions=2)

In [5]: ddf['z'] = ddf.x.where(ddf.x > ddf.y, ddf.y)

In [6]: ddf.compute()
Out[6]:
   x  y  z
0  1  3  3
1  2  2  2
2  3  1  3
