Reputation: 2514
I want to multiply roughly 50,000 columns by one other column in a large Dask dataframe (6_500_000 x 50_002). A solution using a for loop works, but is painfully slow. Below I tried two other approaches that failed. Any advice is appreciated.
Pandas
import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6], 'c':[7,8,9]})
df[['a','b']].multiply(df['c'], axis="index")
Dask
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=1)
# works, but very slow for large datasets:
for column in ['a', 'b']:
    ddf[column] = ddf[column] * ddf['c']
# these don't work:
ddf[['a', 'b']].multiply(ddf['c'], axis="index")
ddf[['a', 'b']].map_partitions(pd.DataFrame.mul, other=ddf['c']).compute()
Upvotes: 1
Views: 549
Reputation: 16683
Use .mul. For Dask:
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6], 'c':[7,8,9]})
ddf = dd.from_pandas(df, npartitions=1)
ddf[['a','b']] = ddf[['a','b']].mul(ddf['c'], axis='index') # or axis=0
ddf.compute()
Out[1]:
a b c
0 7 28 7
1 16 40 8
2 27 54 9
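With roughly 50,000 columns you wouldn't list the names by hand; here is a minimal sketch of the same .mul call applied to every column except the multiplier, assuming that column is named 'c' as in the example:
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
ddf = dd.from_pandas(df, npartitions=1)
# build the column list programmatically instead of typing 50,000 names
value_cols = [col for col in ddf.columns if col != 'c']
# one vectorised multiply instead of a Python loop over every column
ddf[value_cols] = ddf[value_cols].mul(ddf['c'], axis=0)
ddf.compute()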
Upvotes: 1
Reputation: 2776
You basically had it for pandas; multiply() just isn't in-place. I also changed to selecting all but one column with .loc, so you don't have to type out 50,000 column names :)
import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6], 'c':[7,8,9]})
df.loc[:, df.columns != 'c'] = df.loc[:, df.columns != 'c'].multiply(df['c'], axis="index")
Output:
a b c
0 7 28 7
1 16 40 8
2 27 54 9
NOTE: I am not familiar with Dask, but I imagine that it is the same issue for that attempt.
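For the Dask attempt, a possible fix (just a sketch, I haven't benchmarked it) is to push the same pandas multiply into each partition with map_partitions; this assumes every partition carries the 'c' column, which it does since Dask partitions by rows:
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
ddf = dd.from_pandas(df, npartitions=1)
def scale_by_c(part):
    part = part.copy()  # avoid mutating the partition Dask hands in
    cols = part.columns != 'c'
    # plain pandas inside one partition: multiply every column except 'c' by 'c'
    part.loc[:, cols] = part.loc[:, cols].multiply(part['c'], axis="index")
    return part
ddf = ddf.map_partitions(scale_by_c)
ddf.compute()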
Upvotes: 1