Ivan Bilan
Ivan Bilan

Reputation: 2439

Filter column pairs given a numeric threshold from another column

I have a following DataFrame in which I need to filter rows given a pair of IDs from two different columns by selecting a pair that has the highest numeric value from a third column.

import pandas as pd

data = [
    ['11x', '12x', 5.5, 'other_1'], 
    ['11x', '12x', 3.5, 'other_2'],
    ['10x', '9x', 1.5, 'other_1'],
    ['10x', '9x', 3.5, 'other_2'],
    ['1x', '1x', 3.5, 'other_x'],
]

# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['id1', 'id2', 'to_filter_on', 'other_data']) 

df.head()


# output of head
"""
    id1     id2     to_filter_on    other_data
0   11x     12x     5.5     other_1
1   11x     12x     3.5     other_2
2   10x     9x      1.5     other_1
3   10x     9x      3.5     other_2
4   1x      2x      3.5     other_x
"""

Given the pairs of ID fields (id1 and id2) I want to only select a pair where the threshold from the column to_filter_on is the highest. In a sense, I need a following DataFrame given the one above:

"""
    id1     id2     to_filter_on    other_data
0   11x     12x     5.5     other_1
1   10x     9x      3.5     other_2
2   1x      2x      3.5     other_x
"""

Notice that the ID pair "11x and 12x" with lower value in to_filter_on was removed, same for pair of "10x and 9x".

Any pointers and help are appreciated.

Upvotes: 3

Views: 71

Answers (2)

Quang Hoang
Quang Hoang

Reputation: 150735

In case you want to retain all rows with the same (max) values of to_filter_on:

s = df.groupby(['id1','id2'])['to_filter_on'].transform('max')
df[df.to_filter_on.eq(s)]

gives:

    id1     id2     to_filter_on    other_data
0   11x     12x     5.5             other_1
3   10x     9x      3.5             other_2
4   1x      1x      3.5             other_x

Upvotes: 1

cs95
cs95

Reputation: 402363

Use groupby and idxmax to get the indices of the highest "to_filter_on" value, then use that to index df:

df.iloc[df.groupby(['id1', 'id2'], sort=False)['to_filter_on'].idxmax()]

   id1  id2  to_filter_on other_data
0  11x  12x           5.5    other_1
3  10x   9x           3.5    other_2
4   1x   1x           3.5    other_x

Alternatively, avoid the groupby by sorting and dropping duplicates:

(df.sort_values(['id1', 'id2', 'to_filter_on'])
   .drop_duplicates(['id1', 'id2'], keep='last')
   .sort_index())

   id1  id2  to_filter_on other_data
0  11x  12x           5.5    other_1
3  10x   9x           3.5    other_2
4   1x   1x           3.5    other_x

Upvotes: 3

Related Questions