Pandas: Get highest n rows based on multiple columns and they are matching each other

Question

Suppose I have pandas DataFrame like this. Those red values in column C and E are the highest 10 numbers in each column accordingly.

How can i get a data frame like this. Where it only returns the rows which are in the highest 10 on both columns? If the value is in the highest 10 but not in both then the row would be ignored.

At the moment i do this with looping where i loop first through each column separately and if the value is in the highest 10 then i save the row index, and then i loop a third time where i exclude indexes which are not in both, This is very inefficient since i work with a table of a over 100000 rows. Is there a better way to do it?

piRSquared · Accepted Answer

Consider the example dataframe df

np.random.seed([3,1415])
rng = np.arange(10)
df = pd.DataFrame(
    dict(
        A=rng,
        B=list('abcdefghij'),
        C=np.random.permutation(rng),
        D=np.random.permutation(rng)
    )
)

print(df)

   A  B  C  D
0  0  a  9  1
1  1  b  4  3
2  2  c  5  5
3  3  d  1  9
4  4  e  7  4
5  5  f  6  6
6  6  g  8  0
7  7  h  3  2
8  8  i  2  7
9  9  j  0  8

Use nlargest to identify lists. Then use query to filter dataframe

n = 5
c_lrgst = df.C.nlargest(n)
d_lrgst = df.D.nlargest(n)

df.query('C in @c_lrgst & D in @d_lrgst')

   A  B  C  D
2  2  c  5  5
5  5  f  6  6

Pandas: Get highest n rows based on multiple columns and they are matching each other

Answers (1)

Related Questions