Reputation: 51
I have created a pandas data frame and would like to filter the data based on certain boolean logic. Essentially what I'd like to do is closer to excels' index match function than to simple filtering. I have researched a lot of other threads.
When I apply my filter, the data frame returns zero true values. Why are zero true values being returned when I have been flexible with my logic? and;
If I introduced a 5th column, say column 'D'
, with random.randomint(100-1000,100)
, what logic would I use to conditionally find the maximum values only for column D
? I.e. Can I force a data frame to return the highest true values only from a certain column, in the event that multiple true values are returned?
Advice much appreciated. Thank you in advance.
import pandas as pd
df = pd.DataFrame({
'Step': [1,1,1,1,1,1,2,2,2,2,2,2],
'A': [4,5,6,7,4,5,6,7,4,5,6,7],
'B': [10,20,30,40,10,20,30,40,10,20,30,40],
'C': [0,0.5,1,1.5,2,2.5,0,0.5,1,1.5,2.0,2.5]
})
columns = ['Step','A','B','C']
df=df[columns]
new_df=df[(df.Step == 1) & (df.A == 4|5|6|7) & (df.B == 10|20|30|40)]
new_df
Upvotes: 3
Views: 285
Reputation: 862681
You can use boolean indexing
with isin
:
new_df=df[(df.Step == 1) & (df.A.isin([4,5,6,7])) & (df.B.isin([10,20,30,40]))]
It seems for second question need DataFrame.nlargest
:
np.random.seed(789)
df = pd.DataFrame({
'Step': [1,1,1,1,1,1,2,2,2,2,2,2],
'A': [4,5,6,7,4,5,6,7,4,5,6,7],
'B': [10,20,30,40,10,20,30,40,10,20,30,40],
'C': [0,0.5,1,1.5,2,2.5,0,0.5,1,1.5,2.0,2.5],
'D':np.random.choice(np.arange(100,1000,100), size=12)
})
print (df)
A B C D Step
0 4 10 0.0 400 1
1 5 20 0.5 300 1
2 6 30 1.0 200 1
3 7 40 1.5 400 1
4 4 10 2.0 500 1
5 5 20 2.5 900 1
6 6 30 0.0 500 2
7 7 40 0.5 200 2
8 4 10 1.0 900 2
9 5 20 1.5 100 2
10 6 30 2.0 200 2
11 7 40 2.5 200 2
new_df= df[(df.Step == 1)&(df.A.isin([4,5,6,7]))&(df.B.isin([10,20,30,40]))].nlargest(1,'D')
print (new_df)
A B C D Step
5 5 20 2.5 900 1
Upvotes: 4
Reputation: 210842
Using DataFrame.query() method:
In [7]: new_df = df.query("Step==1 and A in [4,5,6,7] and B in [10,20,30,40]")
In [8]: new_df
Out[8]:
Step A B C
0 1 4 10 0.0
1 1 5 20 0.5
2 1 6 30 1.0
3 1 7 40 1.5
4 1 4 10 2.0
5 1 5 20 2.5
Upvotes: 4