Reputation: 1075
Following is a sample of my dataframe :
revisionId | itemId | wikidataType
290397666 | 23 | Q5
307190482 | 23 | Q5
292002833 | 80 | Q5
292428272 | 80 | Q5
305018745 | 80 | Q5
305018964 | 80 | Q5
305019084 | 80 | Q5
301731639 | 181 | Q5
303692414 | 181 | Q5
306600439 | 192 | Q5
294596767 | 206 | Q5
294597048 | 206 | Q5
I want to group the elements using itemId column and return only one row for each item which will have maximum revisionId. I have already sorted them based on wikidataType,itemId and revisionId. Following is how I want my dataframe to be look like:
revisionId | itemId | wikidataType
307190482 | 23 | Q5
305019084 | 80 | Q5
303692414 | 181 | Q5
306600439 | 192 | Q5
294597048 | 206 | Q5
Moreover, here this wikidataType column contains only one value 'Q5'. But in whole dataframe, there are 100 different types. But that might not be an issue because I can iterate over these 100 types since I have already fixed their values. Does anyone has idea about this? Thank in advance.
Upvotes: 1
Views: 81
Reputation: 862511
Need sort_values
+ drop_duplicates
with select columns for check dupes
and keep only last value:
df = (df.sort_values(by=['wikidataType', 'itemId', 'revisionId'])
.drop_duplicates(['itemId','wikidataType'], keep='last'))
print (df)
revisionId itemId wikidataType
1 307190482 23 Q5
6 305019084 80 Q5
8 303692414 181 Q5
9 306600439 192 Q5
11 294597048 206 Q5
Upvotes: 1