aidanmelen

Reputation: 6604

pandas data frame: efficiently remove duplicates and keep the record with the largest int value

I have a data frame with two columns, KEY and VALUE, where KEY contains duplicates and VALUE contains ints. I would like to efficiently drop duplicate records of column KEY while keeping the record with the largest VALUE. I figured out how to do it with two steps, sort and drop duplicates, but I am new to pandas and am curious if there is a more efficient way to achieve this, perhaps with the query function?

import pandas
import io
import json

input = """
KEY      VALUE
apple    0
apple    1
apple    2
bannana  0
bannana  1
bannana  2
pear     0
pear     1
pear     2
pear     3
orange   0
orange   1
orange   2
orange   3
orange   4
"""

df = pandas.read_csv(io.StringIO(input), delim_whitespace=True, header=0)
# assign the result back, otherwise the sort/dedup is discarded
df = df[['KEY', 'VALUE']].sort_values(by=['VALUE']).drop_duplicates(subset='KEY', keep='last')
dicty = dict(zip(df['KEY'], df['VALUE']))
print(json.dumps(dicty, indent=4))

Running this yields the expected output:

{
    "apple": 2,
    "bannana": 2,
    "pear": 3,
    "orange": 4
}

Is there a more efficient way to achieve this transformation with pandas?

Upvotes: 0

Views: 75

Answers (2)

Vincent

Reputation: 1614

df = pandas.read_csv(io.StringIO(input), delim_whitespace=True, header=0)
df.groupby('KEY')['VALUE'].max()

If your output needs to be a dictionary, just add .to_dict():

df.groupby('KEY')['VALUE'].max().to_dict()
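For completeness, here is that answer run end-to-end on the question's sample data (using sep=r"\s+", the non-deprecated equivalent of delim_whitespace=True):

```python
import io
import pandas

data = """
KEY      VALUE
apple    0
apple    1
apple    2
bannana  0
bannana  1
bannana  2
pear     0
pear     1
pear     2
pear     3
orange   0
orange   1
orange   2
orange   3
orange   4
"""

df = pandas.read_csv(io.StringIO(data), sep=r"\s+")

# Single pass: group rows by KEY and keep each group's maximum VALUE
result = df.groupby("KEY")["VALUE"].max().to_dict()
print(result)
# {'apple': 2, 'bannana': 2, 'orange': 4, 'pear': 3}
```

Note that groupby sorts group keys alphabetically by default, which is why "orange" appears before "pear" here; pass sort=False to preserve first-seen order instead.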

Upvotes: 2

anky

Reputation: 75100

You can also try:

[*df.groupby('KEY',sort=False).last().to_dict().values()][0]

{'apple': 2, 'bannana': 2, 'pear': 3, 'orange': 4}
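One caveat worth noting: .last() keeps whichever row appears last within each group, so this matches the maximum only because VALUE happens to be ascending within each KEY in the sample data. A small sketch (with made-up keys "a" and "b") showing how .last() and .max() diverge on unsorted input:

```python
import pandas

# VALUE is deliberately not sorted within group "a"
df = pandas.DataFrame({"KEY": ["a", "a", "b"], "VALUE": [2, 1, 5]})

# .last() takes the final row per group: 1 for "a", not the maximum
last = df.groupby("KEY", sort=False)["VALUE"].last().to_dict()
print(last)  # {'a': 1, 'b': 5}

# .max() is independent of row order
mx = df.groupby("KEY")["VALUE"].max().to_dict()
print(mx)    # {'a': 2, 'b': 5}
```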

Upvotes: 1
