Reputation: 2439
I have a following DataFrame in which I need to filter rows given a pair of IDs from two different columns by selecting a pair that has the highest numeric value from a third column.
import pandas as pd
data = [
['11x', '12x', 5.5, 'other_1'],
['11x', '12x', 3.5, 'other_2'],
['10x', '9x', 1.5, 'other_1'],
['10x', '9x', 3.5, 'other_2'],
['1x', '1x', 3.5, 'other_x'],
]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['id1', 'id2', 'to_filter_on', 'other_data'])
df.head()
# output of head
"""
id1 id2 to_filter_on other_data
0 11x 12x 5.5 other_1
1 11x 12x 3.5 other_2
2 10x 9x 1.5 other_1
3 10x 9x 3.5 other_2
4 1x 2x 3.5 other_x
"""
Given the pairs of ID fields (id1 and id2) I want to only select a pair where the threshold from the column to_filter_on
is the highest. In a sense, I need a following DataFrame given the one above:
"""
id1 id2 to_filter_on other_data
0 11x 12x 5.5 other_1
1 10x 9x 3.5 other_2
2 1x 2x 3.5 other_x
"""
Notice that the ID pair "11x and 12x" with lower value in to_filter_on
was removed, same for pair of "10x and 9x".
Any pointers and help are appreciated.
Upvotes: 3
Views: 71
Reputation: 150735
In case you want to retain all rows with the same (max) values of to_filter_on
:
s = df.groupby(['id1','id2'])['to_filter_on'].transform('max')
df[df.to_filter_on.eq(s)]
gives:
id1 id2 to_filter_on other_data
0 11x 12x 5.5 other_1
3 10x 9x 3.5 other_2
4 1x 1x 3.5 other_x
Upvotes: 1
Reputation: 402363
Use groupby
and idxmax
to get the indices of the highest "to_filter_on" value, then use that to index df
:
df.iloc[df.groupby(['id1', 'id2'], sort=False)['to_filter_on'].idxmax()]
id1 id2 to_filter_on other_data
0 11x 12x 5.5 other_1
3 10x 9x 3.5 other_2
4 1x 1x 3.5 other_x
Alternatively, avoid the groupby
by sorting and dropping duplicates:
(df.sort_values(['id1', 'id2', 'to_filter_on'])
.drop_duplicates(['id1', 'id2'], keep='last')
.sort_index())
id1 id2 to_filter_on other_data
0 11x 12x 5.5 other_1
3 10x 9x 3.5 other_2
4 1x 1x 3.5 other_x
Upvotes: 3