MehmedB

Reputation: 1137

Dask - Merge multiple columns into a single column

I have a dask dataframe as below:

   Column1  Column2  Column3  Column4  Column5
 0    a        1        2        3        4
 1    a        3        4        5
 2    b        6        7        8
 3    c        7        7        

I want to merge all of the columns into a single one efficiently, with each row becoming a single string, like below:

   Merged_Column
 0    a,1,2,3,4   
 1    a,3,4,5  
 2    b,6,7,8  
 3    c,7,7,7        

I've seen this question, but it doesn't seem efficient since it uses the apply function. How can I achieve this as efficiently as possible (speed + memory usage)? Or is apply not as problematic as I believe, since this is Dask, not pandas?

This is what I tried. It seems to work, but I am worried about its speed on a big dataframe.

cols = df.columns
# join each row's values into one comma-separated string (the desired output uses commas)
df['combined'] = df[cols].apply(lambda row: ','.join(row.values.astype(str)), axis=1, meta=('combined', 'object'))
df = df.drop(cols, axis=1)

I also need to get rid of the column header.
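
For the header, a minimal sketch of what I had in mind, assuming the merged frame is simply written out with to_csv (the file name pattern is just illustrative):

# Assumption: df now holds only the 'combined' column from the snippet above.
# header=False suppresses the column name; one file is written per partition.
df.to_csv('merged-*.csv', index=False, header=False)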

Upvotes: 2

Views: 3298

Answers (2)

rpanai

Reputation: 13437

When you have to join strings, @saravanan saminathan's method wins hands down. Here are some timings with Dask:

import dask.dataframe as dd
import numpy as np
import pandas as pd

N = int(1e6)

df = pd.DataFrame(np.random.randint(0,100,[N,10]))
df = dd.from_pandas(df, npartitions=4)
df = df.astype("str")
df_bk = df.copy()

Apply

%%time
df["comb"] = df.apply(lambda x:",".join(x), axis=1,meta=("str"))
df = df.compute()

CPU times: user 44.4 s, sys: 925 ms, total: 45.3 s
Wall time: 44.6 s

Add (explicit)

df = df_bk.copy()

%%time
df["comb"] = df[0]+","+df[1]+","+df[2]+","+df[3]+","+df[4]+","+\
             df[5]+","+df[6]+","+df[7]+","+df[8]+","+df[9]

df = df.compute()

CPU times: user 8.95 s, sys: 860 ms, total: 9.81 s
Wall time: 9.56 s

Add (loop)

In case you have many columns and you don't want to write them all out explicitly:

df = df_bk.copy()

%%time
df["comb"] = ''
for col in df.columns:
    df["comb"]+=df[col]+","

df = df.compute()

CPU times: user 11.6 s, sys: 1.32 s, total: 12.9 s
Wall time: 12.3 s
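
Note that the loop above leaves a trailing comma at the end of every row. A minimal variant that avoids it (same df_bk as above; starting from the first column and adding the separator only for the rest is just one way to do it):

df = df_bk.copy()

cols = list(df.columns)
# start from the first column, then prepend a comma only for the remaining ones,
# so no trailing separator is left on each row
df["comb"] = df[cols[0]]
for col in cols[1:]:
    df["comb"] = df["comb"] + "," + df[col]

df = df.compute()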

Upvotes: 3

saravanan saminathan

Reputation: 702

The lambda function is applied row by row, which can slow the process down.

If we instead apply the operation column-wise, it will be fast:

df = df.astype(str)
df["Merged_Column"] = df["Column1"]+","+df["Column2"]+","+df["Column3"]+","+df["Column4"]+","+df["Column5"]

Upvotes: 2
