Reputation: 2337
I'm trying to combine duplicate columns in my dataframe. My dataframe is Multiindexed and looks like
sex_COPYL sex_COPYR age_COPYL age_COPYR
ID Date
A 2010-01-01 NaN F NaN 230
B 2010-01-01 NaN F NaN 487
2010-02-01 NaN M NaN 488
C 2010-01-01 NaN M NaN 534
D 2012-09-08 M NaN 432 NaN
I expect the dataframe to look like
sex age
ID Date
A 2010-01-01 F 230
B 2010-01-01 F 487
2010-02-01 M 488
C 2010-01-01 M 534
D 2012-09-08 M 432
I'm trying to achieve this by
df.groupby(df.columns.map(lambda x: x.split('_COPY')[0], 1)).apply(lambda x: x.mode(1)[0])
but I get the error
ValueError: Grouper and axis must be same length
Other posts with regards to that error show that the user didn't have the column they were trying to group by. Is my error a result of the duplicate stripped names and if so, how do I fix it?
Upvotes: 0
Views: 373
Reputation: 9941
We can convert the column names first:
df.columns = [c.split('_COPY')[0] for c in df.columns]
df
Out:
sex sex age age
ID Date
A 2010-01-01 NaN F NaN 230
B 2010-01-01 NaN F NaN 487
NaN 2010-02-01 NaN M NaN 488
C 2010-01-01 NaN M NaN 534
D 2012-09-08 NaN M NaN 432
Then group by column names and use mode:
df.groupby(axis=1, level=0).agg(lambda x: x.mode(axis=1)[0])
Out:
age sex
ID Date
A 2010-01-01 230.0 F
B 2010-01-01 487.0 F
NaN 2010-02-01 488.0 M
C 2010-01-01 534.0 M
D 2012-09-08 432.0 M
Update: the original one-liner also works with axis=1
added to the list of groupby
parameters (since we're grouping by columns instead of rows):
df.groupby(df.columns.map(lambda x: x.split('_COPY')[0], 1), axis=1).apply(lambda x: x.mode(1)[0])
Upvotes: 1
Reputation: 81
you are missing axis=1 in the groupby:
df.groupby(df.columns.map(lambda x: x.split('_COPY')[0], 1), axis=1).apply(lambda x: x.mode(1)[0])
alternative solution (without groupby, but similar with stack and unstack):
df.rename(columns=lambda x: x.split('_COPY')[0]).stack().unstack()
the stack method drop na values by default
Upvotes: 1
Reputation: 11105
An alternate solution:
# use both bfill and ffill to handle NaNs on both
# left and right of valid values
df['sex'] = (df.filter(like='sex')
.bfill(axis=1)
.ffill(axis=1)
.iloc[:, 0])
df['age'] = (df.filter(like='age')
.bfill(axis=1)
.ffill(axis=1)
.iloc[:, 0]
.astype(int))
df = df[['sex', 'age']]
df
sex age
ID Date
A 2010-01-01 F 230
B 2010-01-01 F 487
2010-02-01 M 488
C 2010-01-01 M 534
D 2012-09-08 M 432
Upvotes: 1