lightyagami96

Reputation: 336

column filter and multiplication in dask dataframe

I am trying to replicate the following operation on a dask dataframe: filter the rows by the value of one column, then set a new column to a multiple of another column, with the multiplier depending on the filter.

The following is the pandas equivalent:

import dask.dataframe as dd

df['adjusted_revenue'] =  0
df.loc[(df.tracked ==1), 'adjusted_revenue'] = 0.7*df['gross_revenue']
df.loc[(df.tracked ==0), 'adjusted_revenue'] = 0.3*df['gross_revenue']

I am trying to do this on a dask dataframe, but dask doesn't support assignment through .loc:

TypeError: '_LocIndexer' object does not support item assignment

This works for me:

df['adjusted_revenue'] =  0
df1 = df.loc[df['tracked'] ==1]
df1['adjusted_revenue'] = 0.7*df1['gross_revenue']
df2 = df.loc[df['tracked'] ==0]
df2['adjusted_revenue'] = 0.3*df2['gross_revenue']
df = dd.concat([df1, df2])

However, I was hoping there is a simpler way to do this.

Thanks!

Upvotes: 1

Views: 756

Answers (1)

mdurant

Reputation: 28673

You should use .apply, which is probably the right thing to do with Pandas too, or perhaps .where. However, to keep things as close as possible to your original code, here it is with map_partitions, which lets you act on each piece of the dataframe independently; those pieces really are Pandas dataframes, so .loc assignment works inside the function.

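def make_col(df):
    # df is a single partition, i.e. an ordinary Pandas dataframe
    df = df.copy()  # avoid modifying the partition that was passed in
    df['adjusted_revenue'] = 0.0  # float default for rows matching neither filter
    df.loc[(df.tracked == 1), 'adjusted_revenue'] = 0.7*df['gross_revenue']
    df.loc[(df.tracked == 0), 'adjusted_revenue'] = 0.3*df['gross_revenue']
    return df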

new_df = df.map_partitions(make_col)
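For the .where route mentioned above, here is a minimal sketch rather than a definitive implementation. The function name add_adjusted_revenue and the sample data are made up for illustration. It relies only on Series.where and DataFrame.assign, which exist in both Pandas and dask.dataframe, so the demo below runs on a small Pandas frame; the assumption is that the same function can then be handed a dask dataframe unchanged.

```python
import pandas as pd


def add_adjusted_revenue(df):
    """Return a new dataframe with an adjusted_revenue column added.

    Hypothetical helper: uses only .where and .assign, so no item
    assignment through .loc is needed.
    """
    gross = df["gross_revenue"]
    # 0.3 * gross where tracked == 0, and 0.0 for every other row.
    adjusted = (0.3 * gross).where(df["tracked"] == 0, 0.0)
    # 0.7 * gross where tracked == 1, otherwise keep the value from above.
    adjusted = (0.7 * gross).where(df["tracked"] == 1, adjusted)
    return df.assign(adjusted_revenue=adjusted)


# Made-up sample data; the last row has tracked == 2 to show the 0.0 default.
sample = pd.DataFrame(
    {"tracked": [1, 0, 1, 2], "gross_revenue": [100.0, 200.0, 50.0, 10.0]}
)
result = add_adjusted_revenue(sample)
print(result)
```

On a dask dataframe this would be called as new_df = add_adjusted_revenue(df), or as df.map_partitions(add_adjusted_revenue) if you prefer to keep the per-partition style. Unlike the filter-and-concat approach, it keeps the original row order and does not drop rows where tracked is neither 0 nor 1.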

Upvotes: 1
