Reputation: 2565
Suppose I have pandas DataFrame like this. Those red values in column C and E are the highest 10 numbers in each column accordingly.
How can i get a data frame like this. Where it only returns the rows which are in the highest 10 on both columns? If the value is in the highest 10 but not in both then the row would be ignored.
At the moment i do this with looping where i loop first through each column separately and if the value is in the highest 10 then i save the row index, and then i loop a third time where i exclude indexes which are not in both, This is very inefficient since i work with a table of a over 100000 rows. Is there a better way to do it?
Upvotes: 1
Views: 990
Reputation: 294488
Consider the example dataframe df
np.random.seed([3,1415])
rng = np.arange(10)
df = pd.DataFrame(
dict(
A=rng,
B=list('abcdefghij'),
C=np.random.permutation(rng),
D=np.random.permutation(rng)
)
)
print(df)
A B C D
0 0 a 9 1
1 1 b 4 3
2 2 c 5 5
3 3 d 1 9
4 4 e 7 4
5 5 f 6 6
6 6 g 8 0
7 7 h 3 2
8 8 i 2 7
9 9 j 0 8
Use nlargest
to identify lists. Then use query
to filter dataframe
n = 5
c_lrgst = df.C.nlargest(n)
d_lrgst = df.D.nlargest(n)
df.query('C in @c_lrgst & D in @d_lrgst')
A B C D
2 2 c 5 5
5 5 f 6 6
Upvotes: 1