RaduS
RaduS

Reputation: 2565

Pandas: Get highest n rows based on multiple columns and they are matching each other

Suppose I have pandas DataFrame like this. Those red values in column C and E are the highest 10 numbers in each column accordingly.

enter image description here

How can i get a data frame like this. Where it only returns the rows which are in the highest 10 on both columns? If the value is in the highest 10 but not in both then the row would be ignored.

enter image description here

At the moment i do this with looping where i loop first through each column separately and if the value is in the highest 10 then i save the row index, and then i loop a third time where i exclude indexes which are not in both, This is very inefficient since i work with a table of a over 100000 rows. Is there a better way to do it?

Upvotes: 1

Views: 990

Answers (1)

piRSquared
piRSquared

Reputation: 294488

Consider the example dataframe df

np.random.seed([3,1415])
rng = np.arange(10)
df = pd.DataFrame(
    dict(
        A=rng,
        B=list('abcdefghij'),
        C=np.random.permutation(rng),
        D=np.random.permutation(rng)
    )
)

print(df)

   A  B  C  D
0  0  a  9  1
1  1  b  4  3
2  2  c  5  5
3  3  d  1  9
4  4  e  7  4
5  5  f  6  6
6  6  g  8  0
7  7  h  3  2
8  8  i  2  7
9  9  j  0  8

Use nlargest to identify lists. Then use query to filter dataframe

n = 5
c_lrgst = df.C.nlargest(n)
d_lrgst = df.D.nlargest(n)

df.query('C in @c_lrgst & D in @d_lrgst')

   A  B  C  D
2  2  c  5  5
5  5  f  6  6

Upvotes: 1

Related Questions