Nilakshi Naphade
Nilakshi Naphade

Reputation: 1075

Group the values using one column and return the one having max value in other column using pandas dataframe

Following is a sample of my dataframe :

revisionId  | itemId    | wikidataType
290397666   | 23        | Q5
307190482   | 23        | Q5
292002833   | 80        | Q5
292428272   | 80        | Q5
305018745   | 80        | Q5
305018964   | 80        | Q5
305019084   | 80        | Q5
301731639   | 181       | Q5
303692414   | 181       | Q5
306600439   | 192       | Q5
294596767   | 206       | Q5
294597048   | 206       | Q5

I want to group the elements using itemId column and return only one row for each item which will have maximum revisionId. I have already sorted them based on wikidataType,itemId and revisionId. Following is how I want my dataframe to be look like:

revisionId  | itemId    | wikidataType
307190482   | 23        | Q5
305019084   | 80        | Q5
303692414   | 181       | Q5
306600439   | 192       | Q5
294597048   | 206       | Q5

Moreover, here this wikidataType column contains only one value 'Q5'. But in whole dataframe, there are 100 different types. But that might not be an issue because I can iterate over these 100 types since I have already fixed their values. Does anyone has idea about this? Thank in advance.

Upvotes: 1

Views: 81

Answers (1)

jezrael
jezrael

Reputation: 862511

Need sort_values + drop_duplicates with select columns for check dupes and keep only last value:

df = (df.sort_values(by=['wikidataType', 'itemId', 'revisionId']) 
        .drop_duplicates(['itemId','wikidataType'], keep='last'))
print (df)
    revisionId  itemId wikidataType
1    307190482      23           Q5
6    305019084      80           Q5
8    303692414     181           Q5
9    306600439     192           Q5
11   294597048     206           Q5

Upvotes: 1

Related Questions