safex

Reputation: 2514

Multiply many columns by one column in dask

I want to multiply roughly 50,000 columns by one other column in a large dask dataframe (6_500_000 x 50_002). The solution using a for loop works but is painfully slow. Below are two other approaches I tried that failed. Any advice is appreciated.

Pandas

import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6], 'c':[7,8,9]})
df[['a','b']].multiply(df['c'], axis="index")

Dask

import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=1)

# works but very slow for large datasets: 
for column in ['a', 'b']:
    ddf[column] = ddf[column] * ddf['c']

# these do not work:
ddf[['a','b']].multiply(ddf['c'], axis="index") 
ddf[['a', 'b']].map_partitions(pd.DataFrame.mul, other=ddf['c'] ).compute()

Upvotes: 1

Views: 549

Answers (2)

David Erickson

Reputation: 16683

Use .mul on the dask dataframe and assign the result back to the same columns:

import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6], 'c':[7,8,9]})
ddf = dd.from_pandas(df, npartitions=1)

ddf[['a','b']] = ddf[['a','b']].mul(ddf['c'], axis='index') # or axis=0

ddf.compute()
Out[1]: 
    a   b  c
0   7  28  7
1  16  40  8
2  27  54  9

Upvotes: 1

noah

Reputation: 2776

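You basically had it for pandas; the only problem is that multiply() does not operate in place, so the result has to be assigned back. I also switched to .loc with a boolean column mask to select every column except 'c', so you don't have to type 50,000 column names :)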

import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6], 'c':[7,8,9]})
df.loc[:, df.columns != 'c']=df.loc[:, df.columns != 'c'].multiply(df['c'], axis="index")

Output:

    a   b  c
0   7  28  7
1  16  40  8
2  27  54  9

NOTE: I am not familiar with Dask, but I imagine that it is the same issue for that attempt.
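
To make the "not in place" point concrete, here is a small pandas-only sketch. It first calls multiply() without assigning the result, which leaves df as it was, and then assigns the product back. The variable names product, untouched and cols are just illustrative, and df.columns.drop('c') is used here as another way to say "all columns except c".

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# multiply() returns a new DataFrame; df itself is not modified.
product = df[['a', 'b']].multiply(df['c'], axis='index')
untouched = df['a'].tolist()  # still the original values

# Assign the product back to update the columns.
cols = df.columns.drop('c')  # every column label except 'c'
df[cols] = df[cols].multiply(df['c'], axis='index')
```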

Upvotes: 1
