Trace R.

Reputation: 347

Concatenate strings across columns that are not null

Wanting to do something similar to this, but complete row aggregation even when nulls are present without including them.

import pandas as pd
import numpy as np

df = pd.DataFrame(data= {'Subject': ['X', 'G', 'H', 'M'],
                         'Col1': ['cat', 'dog', np.nan, 'horse'],
                         'Col2': [np.nan, 'black', 'brown', 'grey'],
                         'Col3': ['small', 'medium', 'large', 'large']})

df['Col4'] = df['Col1'] + ', ' + df['Col2'] + ', ' + df['Col3']

For clarification, this is the resulting dataframe I am looking for

  Subject   Col1   Col2    Col3                Col4
0       X    cat    NaN   small          cat, small
1       G    dog  black  medium  dog, black, medium
2       H    NaN  brown   large        brown, large
3       M  horse   grey   large  horse, grey, large

Upvotes: 5

Views: 11747

Answers (3)

Weston A. Greene

Reputation: 137

The following version of yatu's answer addresses @annena's question:

why was "Subject" selected for set_index, and what if I only want to join Col1 and Col3 but not Col2?

def join_columns(df: pd.DataFrame, cols: list[str], join_str: str = '; ') -> pd.Series:
    df_cp = df.copy()
    # Mask of rows with at least one populated column: stack() drops
    # all-NaN rows, so this mask keeps the assignment lengths aligned.
    at_least_one_col_populated = df_cp[cols].notnull().any(axis=1)
    df_cp.loc[at_least_one_col_populated, 'return_col'] = (
        df_cp[cols].stack().groupby(level=0, sort=False).agg(join_str.join).values
    )
    return df_cp['return_col']

df = pd.DataFrame({
    'col1': ['1', '1', None, '1', None],
    'col2': [None, None, None, None, None],
    'col3': ['2', '2', '2', None, None],
})

df['joined'] = join_columns(df, ['col1', 'col3'])
df


To answer with words instead of code: "Subject" was passed to set_index() because it is unique, I think. That step is not necessary in my function, because I filter out the rows that are entirely blank. And you could have restricted the join to Col1 and Col3 by slicing the dataframe, which is what my function does.
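As a sketch of that point, using the question's dataframe: slicing the columns first means no set_index is needed, and assigning the result back relies on index alignment to handle rows that are entirely blank in the sliced columns.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Subject': ['X', 'G', 'H', 'M'],
                   'Col1': ['cat', 'dog', np.nan, 'horse'],
                   'Col2': [np.nan, 'black', 'brown', 'grey'],
                   'Col3': ['small', 'medium', 'large', 'large']})

# Slice only the columns you want to join; stack() drops the NaNs,
# and grouping on the row index (level=0) rebuilds one string per row.
joined = (df[['Col1', 'Col3']]
          .stack()
          .groupby(level=0, sort=False)
          .agg(', '.join))

# Assigning a Series aligns on the index, so any row that was entirely
# blank in the sliced columns simply comes back as NaN.
df['joined'] = joined
```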

Upvotes: 0

Serge Ballesta

Reputation: 148975

You could use apply with dropna and str.join along the column axis:

df['Col4'] = df[['Col1', 'Col2', 'Col3']].apply(lambda x: ','.join(x.dropna()), axis=1)

It gives as expected:

  Subject   Col1   Col2    Col3              Col4
0       X    cat    NaN   small         cat,small
1       G    dog  black  medium  dog,black,medium
2       H    NaN  brown   large       brown,large
3       M  horse   grey   large  horse,grey,large

It should be roughly 30% faster than @yatu's approach for a small dataframe like this one, but the stack-based approach scales better on larger ones.

Upvotes: 14

yatu

Reputation: 88246

One approach is to set_index and stack (which will remove missing values), groupby on the first level, and aggregate with str.join:

df['Col4'] = (df.set_index('Subject')
                .stack()
                .groupby(level=0, sort=False)
                .agg(', '.join)
                .values)

print(df)

  Subject   Col1   Col2    Col3                Col4
0       X    cat    NaN   small          cat, small
1       G    dog  black  medium  dog, black, medium
2       H    NaN  brown   large        brown, large
3       M  horse   grey   large  horse, grey, large

Timings -

df_ = pd.concat([df]*1000, axis=0).reset_index(drop=True)

%timeit df_[['Col1', 'Col2', 'Col3']].apply(lambda x: ','.join(x.dropna()), axis=1)
# 743 ms ± 17.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit (df_.set_index('Subject').stack().groupby(level=0, sort=False).agg(', '.join).values)
# 5.73 ms ± 168 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Upvotes: 6
