Reputation: 75
I have a dataframe with several categorical columns, and I want to aggregate all these into a single categorical column, preferably using Pandas.
For an example, if I have two columns, named category1 (c1) and category2 (c2), both with data that range from 0 to 2, I want to aggregate them in some other column category (c), which can range from 0 to 5, representing all the possible categorical values combinations.
I would go from this:
d1 d2 c1 c2
1 1 NA 0
2 1 1 1
3 1 0 2
4 2 2 NA
5 1 NA NA
6 2 2 2
7 2 0 NA
8 2 0 2
To this:
d1 d2 c
1 1 0
2 1 1
3 1 2
4 2 3
5 1 4
6 2 5
7 2 6
8 2 2
I tried following this, but it didn't seem to work and threw some errors, namely ValueError: cannot reindex from a duplicate axis.
I appreciate in advance any help.
Upvotes: 2
Views: 155
Reputation: 59274
IIUC, you can use ngroup
with groupby
.
df['c'] = df.fillna(-1).groupby(['c1', 'c2']).ngroup()
The order might be arbitrary (i.e. not same as yours), but hopefully that's not important.
d1 d2 c
0 1 1 1
1 2 1 4
2 3 1 3
3 4 2 5
4 5 1 0
5 6 2 6
6 7 2 2
7 8 2 3
Upvotes: 2