Reputation: 511
I need to merge multiple columns of a dataframe into a single column of comma-separated values.
For example, take this dataframe:
import numpy as np, pandas as pd
df = pd.DataFrame([['C1',2,np.nan,2,3,2],['C2',4,5,np.nan,6,1],['C3',9,2,4,np.nan,np.nan]], columns=list('ABCDEF'))
A B C D E F
0 C1 2 NaN 2.0 3.0 2.0
1 C2 4 5.0 NaN 6.0 1.0
2 C3 9 2.0 4.0 NaN NaN
I need to convert it into the following:
A Result
0 C1 B=2, D=2, E=3, F=2
1 C2 B=4, C=5, E=6, F=1
2 C3 B=9, C=2, D=4
Is there a proper Pythonic way to do this?
I have a huge number of rows, so iterating over them will be problematic. I also tried df.apply,
but couldn't find a proper way to do this while ignoring NaN values.
Upvotes: 1
Views: 120
Reputation: 863531
The idea is to convert column A
to the index first and then process the data in a custom function, removing missing values with the NaN != NaN
trick:
f = lambda x : ', '.join(f'{k}={int(v)}' for k, v in x.items() if not v != v)
df = df.set_index('A').apply(f, axis=1).reset_index(name='Result')
print (df)
A Result
0 C1 B=2, D=2, E=3, F=2
1 C2 B=4, C=5, E=6, F=1
2 C3 B=9, C=2, D=4
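The trick works because NaN is the only value that compares unequal to itself, so not v != v is True exactly for non-missing values. A quick check:
import numpy as np

v = np.nan
print(v != v)      # True  -> NaN compares unequal to itself, so it is skipped
print(2.0 != 2.0)  # False -> ordinary values pass the `not v != v` filter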
If you don't want to use the NaN != NaN
trick, test for missing values (and None)
with notna
:
f = lambda x : ', '.join(f'{k}={int(v)}' for k, v in x.items() if pd.notna(v))
df = df.set_index('A').apply(f, axis=1).reset_index(name='Result')
print (df)
A Result
0 C1 B=2, D=2, E=3, F=2
1 C2 B=4, C=5, E=6, F=1
2 C3 B=9, C=2, D=4
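For reference, notna treats both NaN and None as missing, so it is the more general test:
import numpy as np
import pandas as pd

print(pd.notna(np.nan))  # False -> NaN is treated as missing
print(pd.notna(None))    # False -> None is treated as missing too
print(pd.notna(2))       # True  -> regular values are kept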
EDIT:
Some other solutions, tested on 30k rows:
df = pd.DataFrame([['C1',2,np.nan,2,3,2],['C2',4,5,np.nan,6,1],['C3',9,2,4,np.nan,np.nan]], columns=list('ABCDEF'))
df = pd.concat([df] * 10000, ignore_index=True)
#pure pandas solution, slow
In [247]: %%timeit
...: df1 = df.set_index('A', append=True).stack().astype(int).astype(str).reset_index(level=-1)
...: df1.columns = ['a', 'b']
...: df1 = df1.a + '=' + df1.b
...: df1.groupby(level=[0, 'A']).agg(', '.join).reset_index(level=[1], name='Result')
...:
1.66 s ± 14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [248]: %%timeit
...: f = lambda x : ', '.join(f'{k}={int(v)}' for k, v in x.items() if not v != v)
...: df.set_index('A').apply(f, axis=1).reset_index(name='Result')
...:
344 ms ± 9.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [249]: %%timeit
...: f = lambda x : ', '.join(f'{k}={int(v)}' for k, v in x.items() if pd.notna(v))
...: df.set_index('A').apply(f, axis=1).reset_index(name='Result')
...:
431 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
List comprehension solutions:
In [258]: %%timeit
...: L = [', '.join(f'{k}={int(v)}' for k, v in x.items() if not v != v)
...: for x in df.drop('A', axis=1).to_dict('records')]
...:
...: df[['A']].assign(Result = L)
...:
250 ms ± 7.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [259]: %%timeit
...: L = [', '.join(f'{k}={int(v)}' for k, v in x.items() if pd.notna(v))
...: for x in df.drop('A', axis=1).to_dict('records')]
...:
...: df[['A']].assign(Result = L)
...:
336 ms ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
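For convenience, the fastest variant from the benchmark as a plain, copy-pasteable snippet (same logic as In [258], using the benchmark df built above):
# Build the Result strings with a list comprehension over plain dicts,
# skipping NaN via the NaN != NaN trick, then attach them to column A.
L = [', '.join(f'{k}={int(v)}' for k, v in x.items() if not v != v)
     for x in df.drop('A', axis=1).to_dict('records')]
out = df[['A']].assign(Result=L)
print(out.head(3))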
Upvotes: 3
Reputation: 7676
You can also solve it like this:
to_string = lambda x: ', '.join(['%s=%i' % (k,v) for k,v in x.dropna().items()])
test['Result'] = test[list('BCDEF')].T.apply(to_string).T
The idea is to work with the transposed dataframe.
I just learned from the other answer that apply
can take axis
as an argument, so you don't actually need to transpose the dataframe (which could be very expensive):
to_string = lambda x: ', '.join(['%s=%i' % (k,v) for k,v in x.dropna().items()])
test['Result'] = test[list('BCDEF')].apply(to_string,axis=1)
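For completeness, here it is applied to the example dataframe from the question (I'm assuming test refers to that dataframe):
import numpy as np
import pandas as pd

test = pd.DataFrame([['C1',2,np.nan,2,3,2],
                     ['C2',4,5,np.nan,6,1],
                     ['C3',9,2,4,np.nan,np.nan]], columns=list('ABCDEF'))

# Drop NaN per row with dropna(), then format the remaining label=value pairs.
to_string = lambda x: ', '.join(['%s=%i' % (k, v) for k, v in x.dropna().items()])
test['Result'] = test[list('BCDEF')].apply(to_string, axis=1)
print(test[['A', 'Result']])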
Upvotes: 1