Reputation: 1137
I have a dask dataframe as below:
Column1 Column2 Column3 Column4 Column5
0 a 1 2 3 4
1 a 3 4 5
2 b 6 7 8
3 c 7 7
I want to merge all of the columns into a single one efficiently. And I want each row to be a single string. Like below:
Merged_Column
0 a,1,2,3,4
1 a,3,4,5
2 b,6,7,8
3 c,7,7,7
I've seen this question but it doesn't seem efficient since it is using the apply function. How can I achieve this as efficient as possible? (Speed + memory usage) Or is apply isn't as problematic as I believe since this is dask, not pandas.
This is what I tried. It seems like it is working but I am worried about the speed of it with the big dataframe.
cols= df.columns
df['combined'] = df[cols].apply(func=(lambda row: ' '.join(row.values.astype(str))), axis=1, meta=('str'))
df = df.drop(cols, axis=1)
I also need to get rid of the column header.
Upvotes: 2
Views: 3298
Reputation: 13437
When you have to join string @saravanan saminathan methods wins hands down. Here there are some timing with dask
import dask.dataframe as dd
import numpy as np
import pandas as pd
N = int(1e6)
df = pd.DataFrame(np.random.randint(0,100,[N,10]))
df = dd.from_pandas(df, npartitions=4)
df = df.astype("str")
df_bk = df.copy()
%%time
df["comb"] = df.apply(lambda x:",".join(x), axis=1,meta=("str"))
df = df.compute()
CPU times: user 44.4 s, sys: 925 ms, total: 45.3 s
Wall time: 44.6 s
df = df_bk.copy()
%%time
df["comb"] = df[0]+","+df[1]+","+df[2]+","+df[3]+","+df[4]+","+\
df[5]+","+df[6]+","+df[7]+","+df[8]+","+df[9]
df = df.compute()
CPU times: user 8.95 s, sys: 860 ms, total: 9.81 s
Wall time: 9.56 s
In case you have many columns and you don't want to write down all of them
df = df_bk.copy()
%%time
df["comb"] = ''
for col in df.columns:
df["comb"]+=df[col]+","
df = df.compute()
CPU times: user 11.6 s, sys: 1.32 s, total: 12.9 s
Wall time: 12.3 s
Upvotes: 3
Reputation: 702
Lambda function will take each row and so it may affect the speed of the process
So If we apply the operation by columns wise it will be fast
df = df.astype(str)
df["Column1"]+","+df["Column2"]+","+df["Column3"]+","+df["Column4"]+","+df["Column5"]
Upvotes: 2