Reputation: 511
I need to merge multiple columns of a dataframe into a single column of comma-separated values.
For example, take this dataframe:
import numpy as np, pandas as pd
df = pd.DataFrame([['C1',2,np.nan,2,3,2],['C2',4,5,np.nan,6,1],['C3',9,2,4,np.nan,np.nan]], columns=list('ABCDEF'))
A B C D E F
0 C1 2 NaN 2.0 3.0 2.0
1 C2 4 5.0 NaN 6.0 1.0
2 C3 9 2.0 4.0 NaN NaN
I need to convert it into the following:
A Result
0 C1 B=2, D=2, E=3, F=2
1 C2 B=4, C=5, E=6, F=1
2 C3 B=9, C=2, D=4
Is there a proper Pythonic way to do this?
I have a huge number of rows, so iterating over them will be problematic. I also tried df.apply,
but couldn't find a proper way to do this while ignoring NaN values.
Upvotes: 1
Views: 120
Reputation: 863531
The idea is to convert column A
to the index first and then process the data in a custom function, removing missing values with the NaN != NaN
trick:
f = lambda x : ', '.join(f'{k}={int(v)}' for k, v in x.items() if not v != v)
df = df.set_index('A').apply(f, axis=1).reset_index(name='Result')
print (df)
A Result
0 C1 B=2, D=2, E=3, F=2
1 C2 B=4, C=5, E=6, F=1
2 C3 B=9, C=2, D=4
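The trick works because NaN is the only value that compares unequal to itself, so not v != v is True exactly for non-missing values. A quick check:
import numpy as np

v = np.nan
print(v != v)      # True  -> NaN compares unequal to itself, so it is skipped
print(2.0 != 2.0)  # False -> ordinary values pass the `not v != v` filter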
If you don't want to use the NaN != NaN
trick, test for missing values (and None)
with notna
:
f = lambda x : ', '.join(f'{k}={int(v)}' for k, v in x.items() if pd.notna(v))
df = df.set_index('A').apply(f, axis=1).reset_index(name='Result')
print (df)
A Result
0 C1 B=2, D=2, E=3, F=2
1 C2 B=4, C=5, E=6, F=1
2 C3 B=9, C=2, D=4
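For reference, notna treats both NaN and None as missing, so it is the more general test:
import numpy as np
import pandas as pd

print(pd.notna(np.nan))  # False -> NaN is treated as missing
print(pd.notna(None))    # False -> None is treated as missing too
print(pd.notna(2))       # True  -> regular values are kept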
EDIT:
Some other solutions, tested on 30k rows:
df = pd.DataFrame([['C1',2,np.nan,2,3,2],['C2',4,5,np.nan,6,1],['C3',9,2,4,np.nan,np.nan]], columns=list('ABCDEF'))
df = pd.concat([df] * 10000, ignore_index=True)
#pure pandas solution, slow
In [247]: %%timeit
...: df1 = df.set_index('A', append=True).stack().astype(int).astype(str).reset_index(level=-1)
...: df1.columns = ['a', 'b']
...: df1 = df1.a + '=' + df1.b
...: df1.groupby(level=[0, 'A']).agg(', '.join).reset_index(level=[1], name='Result')
...:
1.66 s ± 14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [248]: %%timeit
...: f = lambda x : ', '.join(f'{k}={int(v)}' for k, v in x.items() if not v != v)
...: df.set_index('A').apply(f, axis=1).reset_index(name='Result')
...:
344 ms ± 9.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [249]: %%timeit
...: f = lambda x : ', '.join(f'{k}={int(v)}' for k, v in x.items() if pd.notna(v))
...: df.set_index('A').apply(f, axis=1).reset_index(name='Result')
...:
431 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
List comprehension solutions:
In [258]: %%timeit
...: L = [', '.join(f'{k}={int(v)}' for k, v in x.items() if not v != v)
...: for x in df.drop('A', axis=1).to_dict('records')]
...:
...: df[['A']].assign(Result = L)
...:
250 ms ± 7.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [259]: %%timeit
...: L = [', '.join(f'{k}={int(v)}' for k, v in x.items() if pd.notna(v))
...: for x in df.drop('A', axis=1).to_dict('records')]
...:
...: df[['A']].assign(Result = L)
...:
336 ms ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
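For convenience, the fastest variant from the benchmark as a plain, copy-pasteable snippet (same logic as In [258], using the benchmark df built above):
# Build the Result strings with a list comprehension over plain dicts,
# skipping NaN via the NaN != NaN trick, then attach them to column A.
L = [', '.join(f'{k}={int(v)}' for k, v in x.items() if not v != v)
     for x in df.drop('A', axis=1).to_dict('records')]
out = df[['A']].assign(Result=L)
print(out.head(3))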
Upvotes: 3
Reputation: 7676
You can also solve it like this:
to_string = lambda x: ', '.join(['%s=%i' % (k,v) for k,v in x.dropna().items()])
test['Result'] = test[list('BCDEF')].T.apply(to_string).T
The idea is to work with the transposed dataframe.
I just learned from the other answer that apply
can take axis
as an argument, so you don't actually need to transpose the dataframe (which could be very expensive):
to_string = lambda x: ', '.join(['%s=%i' % (k,v) for k,v in x.dropna().items()])
test['Result'] = test[list('BCDEF')].apply(to_string,axis=1)
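For completeness, here it is applied to the example dataframe from the question (I'm assuming test refers to that dataframe):
import numpy as np
import pandas as pd

test = pd.DataFrame([['C1',2,np.nan,2,3,2],
                     ['C2',4,5,np.nan,6,1],
                     ['C3',9,2,4,np.nan,np.nan]], columns=list('ABCDEF'))

# Drop NaN per row with dropna(), then format the remaining label=value pairs.
to_string = lambda x: ', '.join(['%s=%i' % (k, v) for k, v in x.dropna().items()])
test['Result'] = test[list('BCDEF')].apply(to_string, axis=1)
print(test[['A', 'Result']])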
Upvotes: 1