Reputation: 6604
I have a data frame with two columns, KEY and VALUE, where KEY contains duplicates and VALUE contains ints. I would like to efficiently drop duplicate records of KEY while keeping the record with the largest VALUE. I figured out how to do it in two steps, sort and drop duplicates, but I am new to pandas and am curious whether there is a more efficient way to achieve this, for example with the query function?
import pandas
import io
import json
input = """
KEY VALUE
apple 0
apple 1
apple 2
bannana 0
bannana 1
bannana 2
pear 0
pear 1
pear 2
pear 3
orange 0
orange 1
orange 2
orange 3
orange 4
"""
df = pandas.read_csv(io.StringIO(input), delim_whitespace=True, header=0)
# sort so the largest VALUE per KEY comes last, then keep only that row
df = df[['KEY', 'VALUE']].sort_values(by=['VALUE']).drop_duplicates(subset='KEY', keep='last')
dicty = dict(zip(df['KEY'], df['VALUE']))
print(json.dumps(dicty, indent=4))
Running this yields the expected output:
{
    "apple": 2,
    "bannana": 2,
    "pear": 3,
    "orange": 4
}
Is there a more efficient way to achieve this transformation with pandas?
Upvotes: 0
Views: 75
Reputation: 1614
df = pandas.read_csv(io.StringIO(input), delim_whitespace=True, header=0)
df.groupby('KEY')['VALUE'].max()
If you need the result as a dictionary, just add to_dict():
df.groupby('KEY')['VALUE'].max().to_dict()
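One thing to be aware of: groupby sorts the group keys by default, so the dictionary comes back in alphabetical key order. If you want to preserve first-appearance order, as in your expected output, you can pass sort=False:
df.groupby('KEY', sort=False)['VALUE'].max().to_dict()
# {'apple': 2, 'bannana': 2, 'pear': 3, 'orange': 4}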
Upvotes: 2
Reputation: 75100
You can also try:
[*df.groupby('KEY', sort=False).last().to_dict().values()][0]
{'apple': 2, 'bannana': 2, 'pear': 3, 'orange': 4}
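A word of caution: last() simply takes the final row in each group, so this matches the maximum only because the sample data is already sorted in ascending VALUE order within each KEY. Under that same assumption, a slightly tidier variant that skips the to_dict().values() unwrapping:
df.groupby('KEY', sort=False)['VALUE'].last().to_dict()
# {'apple': 2, 'bannana': 2, 'pear': 3, 'orange': 4}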
Upvotes: 1