Reputation: 422
I need to create a column based on a condition in a dask dataframe. In pandas this is fairly straightforward:
ddf['TEST_VAR'] = ['THIS' if x == 200607 else
                   'NOT THIS' if x == 200608 else
                   'THAT' if x == 200609 else 'NONE'
                   for x in ddf['shop_week']]
In dask, however, I have to do the same thing like this:
def f(x):
    if x == 200607:
        y = 'THIS'
    elif x == 200608:
        y = 'THAT'
    else:
        y = 1
    return y
ddf1 = ddf.assign(col1 = list(ddf.shop_week.apply(f).compute()))
ddf1.compute()
Questions:
Upvotes: 12
Views: 4300
Reputation: 40628
A better approach might be to pull out the column as a dask array and then perform some nested where operations before adding it back to the dataframe:
import dask.array as da
x = ddf['shop_week'].to_dask_array()
# Assign back to ddf, the dataframe the column was taken from
ddf['TEST_VAR'] = \
    da.where(x == 200607, 'THIS',
    da.where(x == 200608, 'NOT THIS',
    da.where(x == 200609, 'THAT', 'NONE')))
ddf['TEST_VAR'].compute()
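To see how the nesting resolves, here is a small plain-Python sketch. The `where` helper below is a hypothetical list-based stand-in written for illustration, not dask's implementation; it only mimics the element-wise "pick the first value where the condition holds, otherwise the second" behaviour. The innermost call supplies the fallback, so the first matching condition wins.

```python
def where(cond, if_true, if_false):
    """Pick if_true[i] where cond[i] is truthy, else if_false[i].

    Scalars (here: strings) are broadcast to the length of cond.
    Hypothetical helper for illustration only, not dask's da.where.
    """
    n = len(cond)
    t = if_true if isinstance(if_true, list) else [if_true] * n
    f = if_false if isinstance(if_false, list) else [if_false] * n
    return [a if c else b for c, a, b in zip(cond, t, f)]

# Sample week codes; the last one matches none of the conditions.
weeks = [200607, 200608, 200609, 200610]

labels = where([w == 200607 for w in weeks], 'THIS',
         where([w == 200608 for w in weeks], 'NOT THIS',
         where([w == 200609 for w in weeks], 'THAT', 'NONE')))
```

Each week code ends up with the label of the first condition it satisfies, and anything unmatched falls through to 'NONE'.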
Upvotes: 0
Reputation: 57261
Answers:
What you're doing now is almost ok. You don't need to call compute
until you're ready for your final answer.
# ddf1 = ddf.assign(col1 = list(ddf.shop_week.apply(f).compute()))
ddf1 = ddf.assign(col1 = ddf.shop_week.apply(f))
For some cases dd.Series.where might be a good fit:
ddf1 = ddf.assign(col1 = ddf.shop_week.where(cond=ddf.balance > 0, other=0))
As of version 0.10.2 you can now insert columns directly into dask.dataframes:
ddf['col'] = ddf.shop_week.apply(f)
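If the chain of conditions grows, the function passed to apply could be a dictionary lookup instead of an if/elif ladder. This is only a sketch: the names `WEEK_LABELS` and `label_week` are made up here, and the codes and labels are the ones from the pandas version in the question.

```python
# Hypothetical lookup table; codes and labels come from the question.
WEEK_LABELS = {
    200607: 'THIS',
    200608: 'NOT THIS',
    200609: 'THAT',
}

def label_week(x):
    """Return the label for a shop_week code, or 'NONE' if it is not listed."""
    return WEEK_LABELS.get(x, 'NONE')

# Plain-Python check of the mapping, no dask needed.
sample = [200607, 200608, 200609, 200610]
labels = [label_week(x) for x in sample]
```

It should drop in wherever f is used, e.g. ddf['col'] = ddf.shop_week.apply(label_week, meta=('col', 'object')). The meta hint is an addition on my part, assuming you want to tell dask the output dtype up front rather than letting it infer it.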
Upvotes: 7
Reputation: 2086
You could just use:
f = lambda x: 'THIS' if x == 200607 else 'NOT THIS' if x == 200608 else 'THAT' if x == 200609 else 'NONE'
And then:
ddf1 = ddf.assign(col1 = list(ddf.shop_week.apply(f).compute()))
Unfortunately I don't have an answer to the second question or I don't understand it...
Upvotes: 1