When I call the apply method of a dask DataFrame inside a for loop, using the loop variable as an argument to apply, I get unexpected results when the computation is evaluated later. This example shows the behavior:
import dask.dataframe as dd
import pandas as pd
import random
import numpy as np

df = pd.DataFrame({'col_1': random.sample(range(10000), 10000),
                   'col_2': random.sample(range(10000), 10000)})
ddf = dd.from_pandas(df, npartitions=8)

def myfunc(x, channel):
    return channel

for ch in ['ch1', 'ch2']:
    ddf[f'df_apply_{ch}'] = ddf.apply(lambda row: myfunc(row, ch), axis=1,
                                      meta=(f'df_apply_{ch}', np.unicode_))

print(ddf.head(5))
From the row-wise application of myfunc I expect to see two additional columns, one filled with "ch1" and one with "ch2" on every row. However, this is the output of the script:
col_1 col_2 df_apply_ch1 df_apply_ch2
0 5485 2234 ch2 ch2
1 6338 6802 ch2 ch2
2 9408 5760 ch2 ch2
3 8447 1451 ch2 ch2
4 1230 3838 ch2 ch2
Apparently, the final iteration of the loop overwrote the argument passed to apply in the earlier iteration. In fact, any later change to ch between the loop and the call to head affects the result the same way, overwriting what I expect to see in both columns.
This is not what one sees doing the same exercise with pure pandas. And I found a work-around for dask as well:
def myapply(ddf, ch):
    ddf[f'myapply_{ch}'] = ddf.apply(lambda row: myfunc(row, ch), axis=1,
                                     meta=(f'myapply_{ch}', np.unicode_))

for ch in ['ch1', 'ch2']:
    myapply(ddf, ch)

print(ddf.head(10))
gives:
col_1 col_2 myapply_ch1 myapply_ch2
0 7394 3528 ch1 ch2
1 2181 6681 ch1 ch2
2 7945 1063 ch1 ch2
3 5164 8091 ch1 ch2
4 3569 2889 ch1 ch2
So I see that this has to do with the scope of the variable used as an argument to apply, but I don't understand why exactly this happens only with dask. Is this the intended/expected behavior?
Any insights would be appreciated! :)
This turns out to be a duplicate after all; see this question on Stack Overflow, which includes another work-around. A more detailed explanation of the behavior can be found in the corresponding issue on the dask tracker:
This isn't a bug, this is just how python works. Closures evaluate based on the defining scope; if you change the value of trig in that scope then the closure will evaluate differently. The issue here is that this code would run fine in pandas, since there is an evaluation in each loop, but in dask all the evaluations are delayed until later, and thus all use the same value for trig.
Here trig is the loop variable used in that discussion.
So this is not a bug, but a feature of Python that is triggered by dask's delayed evaluation and not by pandas' eager evaluation.
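The late-binding behavior can be reproduced in plain Python, without dask: a lambda captures the name ch, not its value, so every lambda sees whatever ch holds at the moment it is actually called. A minimal sketch (the default-argument trick in the second loop is the standard way to bind the current value, analogous to the function-wrapper work-around above):

```python
# Late binding: all lambdas share the same variable ch,
# which holds 'ch2' once the loop has finished.
funcs = []
for ch in ['ch1', 'ch2']:
    funcs.append(lambda: ch)

print([f() for f in funcs])   # ['ch2', 'ch2']

# Fix: capture the current value via a default argument,
# which is evaluated when the lambda is defined.
fixed = []
for ch in ['ch1', 'ch2']:
    fixed.append(lambda ch=ch: ch)

print([f() for f in fixed])   # ['ch1', 'ch2']
```

In pandas the lambda inside the loop is executed immediately, before ch changes, which is why the problem never surfaces there.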