m13op22

Reputation: 2337

Grouping by duplicate columns to merge them into one column with the same name

I'm trying to combine duplicate columns in my dataframe. My dataframe has a MultiIndex and looks like

                   sex_COPYL    sex_COPYR    age_COPYL    age_COPYR
ID    Date
 A    2010-01-01        NaN           F            NaN          230
 B    2010-01-01        NaN           F            NaN          487
      2010-02-01        NaN           M            NaN          488
 C    2010-01-01        NaN           M            NaN          534
 D    2012-09-08          M         NaN            432          NaN

I expect the dataframe to look like

                       sex           age
ID    Date
 A    2010-01-01        F            230
 B    2010-01-01        F            487
      2010-02-01        M            488
 C    2010-01-01        M            534
 D    2012-09-08        M            432

I'm trying to achieve this by

df.groupby(df.columns.map(lambda x: x.split('_COPY')[0], 1)).apply(lambda x: x.mode(1)[0])

but I get the error

ValueError: Grouper and axis must be same length

Other posts about this error show cases where the user didn't have the column they were trying to group by. Is my error a result of the duplicate stripped names, and if so, how do I fix it?
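
For completeness, a minimal snippet that rebuilds the frame above (the values are copied from the tables shown; dates are kept as plain strings for brevity):

import pandas as pd
import numpy as np

idx = pd.MultiIndex.from_tuples(
    [('A', '2010-01-01'), ('B', '2010-01-01'), ('B', '2010-02-01'),
     ('C', '2010-01-01'), ('D', '2012-09-08')],
    names=['ID', 'Date'])
df = pd.DataFrame(
    {'sex_COPYL': [np.nan, np.nan, np.nan, np.nan, 'M'],
     'sex_COPYR': ['F', 'F', 'M', 'M', np.nan],
     'age_COPYL': [np.nan, np.nan, np.nan, np.nan, 432],
     'age_COPYR': [230, 487, 488, 534, np.nan]},
    index=idx)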

Upvotes: 0

Views: 373

Answers (3)

perl

Reputation: 9941

We can convert the column names first:

df.columns = [c.split('_COPY')[0] for c in df.columns]
df

Out:
                 sex  sex  age  age
ID  Date
A   2010-01-01   NaN    F  NaN  230
B   2010-01-01   NaN    F  NaN  487
    2010-02-01   NaN    M  NaN  488
C   2010-01-01   NaN    M  NaN  534
D   2012-09-08   NaN    M  NaN  432

Then group by column names and use mode:

df.groupby(axis=1, level=0).agg(lambda x: x.mode(axis=1)[0])

Out:
                   age  sex
ID  Date
A   2010-01-01   230.0    F
B   2010-01-01   487.0    F
    2010-02-01   488.0    M
C   2010-01-01   534.0    M
D   2012-09-08   432.0    M

Update: the original one-liner also works with axis=1 added to the list of groupby parameters (since we're grouping by columns instead of rows):

df.groupby(df.columns.map(lambda x: x.split('_COPY')[0], 1), axis=1).apply(lambda x: x.mode(1)[0])
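
Note for readers on newer pandas versions, where passing axis=1 to groupby is deprecated: the same grouping can be done on the transposed frame instead. A minimal sketch, assuming the df from the question:

# grouping the transposed frame avoids axis=1; GroupBy.first() keeps the
# first non-null value from each _COPYL/_COPYR pair
out = (df.T
         .groupby(lambda c: c.split('_COPY')[0])
         .first()
         .T)
# the transpose mixes strings and numbers, so the result is object dtype;
# cast the numeric columns back with astype if needed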

Upvotes: 1

liran

Reputation: 81

You are missing axis=1 in the groupby:

df.groupby(df.columns.map(lambda x: x.split('_COPY')[0], 1), axis=1).apply(lambda x: x.mode(1)[0])

An alternative solution (no groupby, but a similar idea using stack and unstack):

df.rename(columns=lambda x: x.split('_COPY')[0]).stack().unstack()

The stack method drops NaN values by default, which is what removes the empty half of each duplicate column pair.
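
To see why this works, it helps to look at the intermediate step (same expression as above, just split in two, assuming the df from the question). Note it would raise a duplicate-index error on unstack if both copies of a pair were ever non-null for the same row.

stacked = df.rename(columns=lambda x: x.split('_COPY')[0]).stack()
stacked            # Series indexed by (ID, Date, column name); NaNs are gone
stacked.unstack()  # pivot the column names back out into 'sex' and 'age'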

Upvotes: 1

Peter Leimbigler

Reputation: 11105

An alternate solution:

# use both bfill and ffill to handle NaNs on both
# left and right of valid values
df['sex'] = (df.filter(like='sex')
               .bfill(axis=1)
               .ffill(axis=1)
               .iloc[:, 0])

df['age'] = (df.filter(like='age')
               .bfill(axis=1)
               .ffill(axis=1)
               .iloc[:, 0]
               .astype(int))

df = df[['sex', 'age']]
df
                      sex  age
ID         Date
A          2010-01-01   F  230
B          2010-01-01   F  487
           2010-02-01   M  488
C          2010-01-01   M  534
D          2012-09-08   M  432
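
A hedged generalization of the same idea for more than two column stems, assuming df still has the original _COPY columns (the names stems and merged are illustrative, not from the answer above):

import pandas as pd

# collapse every *_COPYL/*_COPYR pair without naming the columns by hand;
# bfill(axis=1) pulls the right-hand value over whenever the left is NaN
stems = sorted({c.split('_COPY')[0] for c in df.columns})
merged = pd.concat(
    {stem: df.filter(like=stem).bfill(axis=1).iloc[:, 0] for stem in stems},
    axis=1)
# filter(like=...) matches substrings; switch to filter(regex=...) if one
# stem could be contained in another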

Upvotes: 1
