Reputation: 347
I want to do something similar to this, but with row aggregation that still works when nulls are present, without including the nulls in the result.
import pandas as pd
import numpy as np

df = pd.DataFrame(data={'Subject': ['X', 'G', 'H', 'M'],
                        'Col1': ['cat', 'dog', np.nan, 'horse'],
                        'Col2': [np.nan, 'black', 'brown', 'grey'],
                        'Col3': ['small', 'medium', 'large', 'large']})
df['Col4'] = df['Col1'] + ', ' + df['Col2'] + ', ' + df['Col3']
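With this code, a missing value in any column propagates through the string concatenation, so Col4 comes out as NaN for those rows:
  Subject   Col1   Col2    Col3                Col4
0       X    cat    NaN   small                 NaN
1       G    dog  black  medium  dog, black, medium
2       H    NaN  brown   large                 NaN
3       M  horse   grey   large  horse, grey, large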
For clarification, this is the resulting dataframe I am looking for:
  Subject   Col1   Col2    Col3                Col4
0       X    cat    NaN   small          cat, small
1       G    dog  black  medium  dog, black, medium
2       H    NaN  brown   large        brown, large
3       M  horse   grey   large  horse, grey, large
Upvotes: 5
Views: 11747
Reputation: 137
The following version of yatu's answer addresses @annena's questions:
why was "Subject" selected for set_index, and what if I only want to join Col1 and Col3 but not Col2?
def join_columns(df: pd.DataFrame, cols: list[str], join_str: str = '; ') -> pd.Series:
    df_cp = df.copy()
    # Rows where every selected column is null produce no group after stack(),
    # so only assign to the rows that have at least one populated column.
    at_least_one_col_populated = df_cp[cols].notnull().any(axis=1)
    df_cp.loc[at_least_one_col_populated, 'return_col'] = (
        df_cp[cols].stack().groupby(level=0, sort=False).agg(join_str.join).values
    )
    return df_cp['return_col']
df = pd.DataFrame({
    'col1': ['1', '1', None, '1', None],
    'col2': [None, None, None, None, None],
    'col3': ['2', '2', '2', None, None],
})
df['joined'] = join_columns(df, ['col1', 'col3'])
df
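For reference, this should leave df looking roughly like this (row 4 stays NaN because both selected columns are blank):
   col1  col2  col3  joined
0     1  None     2    1; 2
1     1  None     2    1; 2
2  None  None     2       2
3     1  None  None       1
4  None  None  None     NaN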
To answer with words instead of code:
"Subject" was passed to set_index() because it is unique, I think. That is not necessary in my function, because I filter out the rows that are entirely blank.
And you could have restricted the join to Col1 and Col3 by slicing the dataframe, which is what my function does via its cols argument.
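A minimal sketch of why that filter matters (using a hypothetical two-row frame, not the question's data): stack() silently drops NaN, so an all-blank row produces no group at all, and the grouped result ends up shorter than the dataframe.

import pandas as pd

df_demo = pd.DataFrame({'col1': ['1', None], 'col3': ['2', None]})

# The all-blank second row disappears after stack(), leaving only one group:
joined = df_demo.stack().groupby(level=0, sort=False).agg('; '.join)
print(len(joined), len(df_demo))  # 1 vs 2 -- df_demo['joined'] = joined.values would raise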
Upvotes: 0
Reputation: 148975
You could use apply along the column axis (axis=1) with dropna and join:
df['Col4'] = df[['Col1', 'Col2', 'Col3']].apply(lambda x: ','.join(x.dropna()), axis=1)
It gives the expected result:
  Subject   Col1   Col2    Col3              Col4
0       X    cat    NaN   small         cat,small
1       G    dog  black  medium  dog,black,medium
2       H    NaN  brown   large       brown,large
3       M  horse   grey   large  horse,grey,large
It should be roughly 30% faster than @yatu's approach for small dataframes like this one, but the stack-based approach is better for larger ones.
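One edge case worth noting: since str.join over an empty sequence returns an empty string, a row that is NaN in every column comes out as '' rather than NaN with this approach. A quick sketch (hypothetical all-NaN row, not from the question's data):

import numpy as np
import pandas as pd

all_nan_row = pd.Series([np.nan, np.nan, np.nan])
print(repr(','.join(all_nan_row.dropna())))  # -> '' (empty string, no error)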
Upvotes: 14
Reputation: 88246
One approach is to set_index and stack (which will remove missing values), then groupby on the first level and aggregate with str.join:
df['Col4'] = (df.set_index('Subject')
                .stack()
                .groupby(level=0, sort=False)
                .agg(', '.join)
                .values)
print(df)
  Subject   Col1   Col2    Col3                Col4
0       X    cat    NaN   small          cat, small
1       G    dog  black  medium  dog, black, medium
2       H    NaN  brown   large        brown, large
3       M  horse   grey   large  horse, grey, large
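To see the intermediate step (a sketch, assuming the three-column df from the question before Col4 is added), stacking the Subject-indexed frame yields one entry per populated cell, with the missing values already dropped:

print(df.set_index('Subject').stack())

Approximate output:
Subject
X        Col1       cat
         Col3     small
G        Col1       dog
         Col2     black
         Col3    medium
H        Col2     brown
         Col3     large
M        Col1     horse
         Col2      grey
         Col3     large
dtype: object

Note that newer pandas versions rework stack() so that it may keep NaN values; if so, an explicit .dropna() after .stack() restores this behavior.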
Timings -
df_ = pd.concat([df]*1000, axis=0).reset_index(drop=True)
%timeit df_[['Col1', 'Col2', 'Col3']].apply(lambda x: ','.join(x.dropna()), axis=1)
# 743 ms ± 17.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit (df_.set_index('Subject').stack().groupby(level=0, sort=False).agg(', '.join).values)
# 5.73 ms ± 168 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Upvotes: 6