Reputation: 69
I have a dataframe that has dummy variables; however, I want to group multiple columns(100+) into a few amount of columns. Any suggestion(s) would be much appreciated. Thanks! For example:
A_1 | A_2 | A_3| B_1| B_2| B_3| C_1| C_2| C_2|
0 | 0| 1| 0| 0| 0| 0| 0| 0| 0|
1 | 0| 0| 0| 0| 0| 1| 0| 1| 0|
2 | 0| 0| 0| 0| 0| 0| 1| 0| 0|
3 | 0| 0| 0| 0| 1| 0| 0| 0| 0|
4 | 1| 0| 0| 0| 0| 0| 0| 0| 0|
5 | 0| 0| 1| 0| 0| 0| 0| 1| 0|
6 | 0| 0| 0| 1| 0| 0| 0| 0| 0|
Desired Output:
A| B| C|
0| 1| 0| 0|
1| 0| 1| 1|
2| 0| 0| 1|
3| 0| 1| 0|
4| 1| 0| 0|
5| 1| 0| 1|
6| 0| 1| 0|
I've tried using this code; however, I keep running into error messages saying that the column name is missing.
categories = {'A':'A','B': 'B','C': 'C'}
def correct_categories(cols1):
return [categories[cat] for col1 in cols1 for cat in categories.keys() if col1.startswith(cat)]
rslt = df3.groupby(correct_categories(df3.columns),axis=1).sum()
print(rslt)
Error Message: KeyError: 'A'
Upvotes: 1
Views: 74
Reputation: 150785
Try extracting the first part with .str.split()
or .str.extract
, then groupby
on axis=1
:
# also groupby on
# df.columns.str.extract('^([^_]+)', expand=False)
df.groupby(df.columns.str.split('_').str[0], axis=1).sum()
Output:
A B C
0 1 0 0
1 0 1 1
2 0 0 1
3 0 1 0
4 1 0 0
5 1 0 1
6 0 1 0
Upvotes: 1