When I call the apply method of a dask DataFrame inside a for loop, using the loop variable as an argument to apply, I get unexpected results when the computation is evaluated later. This example shows the behavior:
import dask.dataframe as dd
import pandas as pd
import random
import numpy as np

df = pd.DataFrame({'col_1': random.sample(range(10000), 10000),
                   'col_2': random.sample(range(10000), 10000)})
ddf = dd.from_pandas(df, npartitions=8)

def myfunc(x, channel):
    return channel

for ch in ['ch1', 'ch2']:
    ddf[f'df_apply_{ch}'] = ddf.apply(lambda row: myfunc(row, ch), axis=1,
                                      meta=(f'df_apply_{ch}', np.unicode_))

print(ddf.head(5))
From the row-wise application of myfunc I expect to see two additional columns, one filled with "ch1" and one with "ch2" on every row. However, this is the output of the script:
col_1 col_2 df_apply_ch1 df_apply_ch2
0 5485 2234 ch2 ch2
1 6338 6802 ch2 ch2
2 9408 5760 ch2 ch2
3 8447 1451 ch2 ch2
4 1230 3838 ch2 ch2
Apparently, the final iteration of the loop overwrote the argument passed to apply in the earlier iteration. In fact, any later change to ch between the loop and the call to head affects the result the same way, overwriting what I expect to see in both columns.
This is not what one sees doing the same exercise with pure pandas. And I found a work-around for dask as well:
def myapply(ddf, ch):
    ddf[f'myapply_{ch}'] = ddf.apply(lambda row: myfunc(row, ch), axis=1,
                                     meta=(f'myapply_{ch}', np.unicode_))

for ch in ['ch1', 'ch2']:
    myapply(ddf, ch)

print(ddf.head(10))
gives:
col_1 col_2 myapply_ch1 myapply_ch2
0 7394 3528 ch1 ch2
1 2181 6681 ch1 ch2
2 7945 1063 ch1 ch2
3 5164 8091 ch1 ch2
4 3569 2889 ch1 ch2
So I see that this has to do with the scope of the variable used as an argument to apply, but I don't understand why exactly this happens only with dask. Is this the intended/expected behavior?
Any insights would be appreciated! :)
This turns out to be a duplicate after all; see this question on Stack Overflow, which includes another work-around. A more detailed explanation of the behavior can be found in the corresponding issue on the dask tracker:
This isn't a bug, this is just how python works. Closures evaluate based on the defining scope; if you change the value of trig in that scope then the closure will evaluate differently. The issue here is that this code would run fine in pandas, since there is an evaluation in each loop, but in dask all the evaluations are delayed until later, and thus all use the same value for trig.
Here trig is the loop variable used in that discussion.
So this is not a bug, but a feature of Python that is triggered by dask's delayed evaluation and not by pandas' eager evaluation.
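The late-binding behavior can be reproduced in plain Python, without dask: a lambda captures the name ch, not its value, so every lambda sees whatever ch holds at the moment it is actually called. A minimal sketch (the default-argument trick in the second loop is the standard way to bind the current value, analogous to the function-wrapper work-around above):

```python
# Late binding: all lambdas share the same variable ch,
# which holds 'ch2' once the loop has finished.
funcs = []
for ch in ['ch1', 'ch2']:
    funcs.append(lambda: ch)

print([f() for f in funcs])   # ['ch2', 'ch2']

# Fix: capture the current value via a default argument,
# which is evaluated when the lambda is defined.
fixed = []
for ch in ['ch1', 'ch2']:
    fixed.append(lambda ch=ch: ch)

print([f() for f in fixed])   # ['ch1', 'ch2']
```

In pandas the lambda inside the loop is executed immediately, before ch changes, which is why the problem never surfaces there.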