Reputation: 79
I have a pandas dataframe looking like this:
id | attr1 | attr2 | attr3 | label |
---|---|---|---|---|
123 | 2 or less | category1 | other_cat1 | 2 |
45 | between 3 and 4 | category3 | other_cat3 | 7 |
123 | 2 or less | category1 | other_cat1 | 2 |
76 | 5 or more | category4 | other_cat4 | 9 |
45 | between 3 and 4 | category3 | other_cat3 | 7 |
123 | 2 or less | category1 | other_cat1 | 1 |
What I want to have is a resulting dataframe with the same columns but the labels to be grouped by the id counts. I have done this:
df.groupby(['label'])['id'].count().reset_index()
which results of course in the following :
label | id |
---|---|
1 | 1 |
2 | 2 |
7 | 2 |
9 | 1 |
The goal is to have the following:
id | attr1 | attr2 | attr3 | label | counts |
---|---|---|---|---|---|
123 | 2 or less | category1 | other_cat1 | 2 | 2 |
45 | between 3 and 4 | category3 | other_cat3 | 7 | 2 |
76 | 5 or more | category4 | other_cat4 | 9 | 1 |
123 | 2 or less | category1 | other_cat1 | 1 | 1 |
The attr1, attr2, attr3 are always the same per id. Only the label value may differ per id. Thus, my problem can be summarized in the existence of multiple identical rows and I thought that I could group them like that in order to make it more concise.
I can't use the pandas transform method because I don't want the returning dataframe to have the same shape as the initial one. I can't use the pandas aggregate method because I need to have all the columns as the initial dataframe. I also can't think of a situation where I can use the pandas apply method.
I know probably that the answer somehow lies behind the use of one the above methods or some combination of them but I am lacking the experience to think of it. I would appreciate some guidance/help. Thank you.
Upvotes: 0
Views: 38
Reputation: 2128
First use .groupby.transform
to add count column in the dataframe. Then drop the duplicate rows.
Use:
df['count'] = df.groupby('label')['id'].transform('count')
out = df.drop_duplicates(keep = 'first')
Output:
>>> out
id attr1 attr2 attr3 label count
0 123 2 or less category1 other_cat1 2 2
1 45 between 3 and 4 category3 other_cat3 7 2
3 76 5 or more category4 other_cat4 9 1
5 123 2 or less category1 other_cat1 1 1
Upvotes: 1