Reputation: 1147
Similar unanswered question: Row by row processing of a Dask DataFrame
I'm working with dataframes that are millions of rows long, so I'm trying to have all dataframe operations performed in parallel. One such operation that I need to convert to Dask is:
    for row in df.itertuples():
        ratio = row.ratio
        tmpratio = row.tmpratio
        tmplabel = row.tmplabel
        if tmpratio > ratio:
            df.loc[row.Index, 'ratio'] = tmpratio
            df.loc[row.Index, 'label'] = tmplabel
What is the appropriate way to set a value by index in Dask, or to conditionally set values in rows? .loc doesn't support item assignment in Dask, and there does not appear to be a set_value, at[], or iat[] in Dask either.
I have attempted to use map_partitions with assign, but I am not seeing any ability to perform conditional assignment at the row-level.
Upvotes: 3
Views: 6417
Reputation: 57251
Dask dataframe does not support efficient iteration or row assignment. In general these workflows rarely scale well. They are also quite slow in Pandas itself.
Instead, you might consider using the Series.where method. Here is a minimal example:
    In [1]: import pandas as pd
    In [2]: df = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 2, 1]})
    In [3]: import dask.dataframe as dd
    In [4]: ddf = dd.from_pandas(df, npartitions=2)
    In [5]: ddf['z'] = ddf.x.where(ddf.x > ddf.y, ddf.y)
    In [6]: ddf.compute()
    Out[6]:
       x  y  z
    0  1  3  3
    1  2  2  2
    2  3  1  3
Upvotes: 6