Trace R.

Reputation: 347

Concatenate strings across columns that are not null

Wanting to do something similar to this, but complete row aggregation even when nulls are present without including them.

import pandas as pd
import numpy as np

df = pd.DataFrame(data= {'Subject': ['X', 'G', 'H', 'M'],
                         'Col1': ['cat', 'dog', np.nan, 'horse'],
                         'Col2': [np.nan, 'black', 'brown', 'grey'],
                         'Col3': ['small', 'medium', 'large', 'large']})

df['Col4'] = df['Col1'] + ', ' + df['Col2'] + ', ' + df['Col3']

For clarification, this is the resulting dataframe I am looking for

  Subject   Col1   Col2    Col3                Col4
0       X    cat    NaN   small          cat, small
1       G    dog  black  medium  dog, black, medium
2       H    NaN  brown   large        brown, large
3       M  horse   grey   large  horse, grey, large

Upvotes: 5

Views: 11747

Answers (3)

Weston A. Greene

Reputation: 137

The following version of yatu's answer addresses @annena's question:

why was "Subject" selected for set_index, and what if I only want to join Col1 and Col3 but not Col2?

def join_columns(df: pd.DataFrame, cols: list[str], join_str: str = '; ') -> pd.Series:
    df_cp = df.copy()
    # Mask of rows with at least one populated column: stack() drops
    # all-NaN rows, so this mask keeps the assignment lengths aligned.
    at_least_one_col_populated = df_cp[cols].notnull().any(axis=1)
    df_cp.loc[at_least_one_col_populated, 'return_col'] = (
        df_cp[cols].stack().groupby(level=0, sort=False).agg(join_str.join).values
    )
    return df_cp['return_col']

df = pd.DataFrame({
    'col1': ['1', '1', None, '1', None],
    'col2': [None, None, None, None, None],
    'col3': ['2', '2', '2', None, None],
})

df['joined'] = join_columns(df, ['col1', 'col3'])
df


To answer with words instead of code: "Subject" was passed to set_index() because it is unique, I think. That step is not necessary in my function, because I filter out the rows that are entirely blank. And you could have restricted the join to Col1 and Col3 by slicing the dataframe, which is what my function does.
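As a sketch of that point, using the question's dataframe: slicing the columns first means no set_index is needed, and assigning the result back relies on index alignment to handle rows that are entirely blank in the sliced columns.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Subject': ['X', 'G', 'H', 'M'],
                   'Col1': ['cat', 'dog', np.nan, 'horse'],
                   'Col2': [np.nan, 'black', 'brown', 'grey'],
                   'Col3': ['small', 'medium', 'large', 'large']})

# Slice only the columns you want to join; stack() drops the NaNs,
# and grouping on the row index (level=0) rebuilds one string per row.
joined = (df[['Col1', 'Col3']]
          .stack()
          .groupby(level=0, sort=False)
          .agg(', '.join))

# Assigning a Series aligns on the index, so any row that was entirely
# blank in the sliced columns simply comes back as NaN.
df['joined'] = joined
```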

Upvotes: 0

Serge Ballesta

Reputation: 148975

You could use apply with dropna and str.join along the column axis:

df['Col4'] = df[['Col1', 'Col2', 'Col3']].apply(lambda x: ','.join(x.dropna()), axis=1)

It gives as expected:

  Subject   Col1   Col2    Col3              Col4
0       X    cat    NaN   small         cat,small
1       G    dog  black  medium  dog,black,medium
2       H    NaN  brown   large       brown,large
3       M  horse   grey   large  horse,grey,large

It should be roughly 30% faster than @yatu's approach for a small dataframe like this one, but the stack-based approach scales better on larger ones.

Upvotes: 14

yatu

Reputation: 88246

One approach is to set_index and stack (which will remove missing values), groupby on the first level, and aggregate with str.join:

df['Col4'] = (df.set_index('Subject')
                .stack()
                .groupby(level=0, sort=False)
                .agg(', '.join)
                .values)

print(df)

  Subject   Col1   Col2    Col3                Col4
0       X    cat    NaN   small          cat, small
1       G    dog  black  medium  dog, black, medium
2       H    NaN  brown   large        brown, large
3       M  horse   grey   large  horse, grey, large

Timings -

df_ = pd.concat([df]*1000, axis=0).reset_index(drop=True)

%timeit df_[['Col1', 'Col2', 'Col3']].apply(lambda x: ','.join(x.dropna()), axis=1)
# 743 ms ± 17.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit (df_.set_index('Subject').stack().groupby(level=0, sort=False).agg(', '.join).values)
# 5.73 ms ± 168 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Upvotes: 6
