I have created a pandas data frame and would like to filter the data based on certain boolean logic. Essentially what I'd like to do is closer to excels' index match function than to simple filtering. I have researched a lot of other threads. When I apply my filter, the data frame returns zero true values. Why are zero true values being returned when I have been flexible with my logic? and; If I introduced a 5th column, say column 'D' , with random.randomint(100-1000,100) , what logic would I use to conditionally find the maximum values only for column D ? I.e. Can I force a data frame to return the highest true values only from a certain column, in the event that multiple true values are returned? Advice much appreciated. Thank you in advance. import pandas as pd df = pd.DataFrame({ 'Step': [1,1,1,1,1,1,2,2,2,2,2,2], 'A': [4,5,6,7,4,5,6,7,4,5,6,7], 'B': [10,20,30,40,10,20,30,40,10,20,30,40], 'C': [0,0.5,1,1.5,2,2.5,0,0.5,1,1.5,2.0,2.5] }) columns = ['Step','A','B','C'] df=df[columns] new_df=df[(df.Step == 1) & (df.A == 4|5|6|7) & (df.B == 10|20|30|40)] new_df

Reputation: 51

Dataframe Boolean Logic Index Match

I have created a pandas data frame and would like to filter the data based on certain boolean logic. Essentially what I'd like to do is closer to excels' index match function than to simple filtering. I have researched a lot of other threads.

When I apply my filter, the data frame returns zero true values. Why are zero true values being returned when I have been flexible with my logic? and;
If I introduced a 5th column, say column 'D', with random.randomint(100-1000,100), what logic would I use to conditionally find the maximum values only for column D ? I.e. Can I force a data frame to return the highest true values only from a certain column, in the event that multiple true values are returned?

Advice much appreciated. Thank you in advance.

import pandas as pd

df = pd.DataFrame({
    'Step': [1,1,1,1,1,1,2,2,2,2,2,2],
    'A': [4,5,6,7,4,5,6,7,4,5,6,7],
    'B': [10,20,30,40,10,20,30,40,10,20,30,40],
    'C': [0,0.5,1,1.5,2,2.5,0,0.5,1,1.5,2.0,2.5]
})

columns = ['Step','A','B','C']

df=df[columns]

new_df=df[(df.Step == 1) & (df.A == 4|5|6|7) & (df.B == 10|20|30|40)]
new_df

Upvotes: 3

Answers (2)

jezrael

Reputation: 862681

You can use boolean indexing with isin:

new_df=df[(df.Step == 1) & (df.A.isin([4,5,6,7])) & (df.B.isin([10,20,30,40]))]

It seems for second question need DataFrame.nlargest:

np.random.seed(789)
df = pd.DataFrame({
    'Step': [1,1,1,1,1,1,2,2,2,2,2,2],
    'A': [4,5,6,7,4,5,6,7,4,5,6,7],
    'B': [10,20,30,40,10,20,30,40,10,20,30,40],
    'C': [0,0.5,1,1.5,2,2.5,0,0.5,1,1.5,2.0,2.5],
    'D':np.random.choice(np.arange(100,1000,100), size=12)
})
print (df)
    A   B    C    D  Step
0   4  10  0.0  400     1
1   5  20  0.5  300     1
2   6  30  1.0  200     1
3   7  40  1.5  400     1
4   4  10  2.0  500     1
5   5  20  2.5  900     1
6   6  30  0.0  500     2
7   7  40  0.5  200     2
8   4  10  1.0  900     2
9   5  20  1.5  100     2
10  6  30  2.0  200     2
11  7  40  2.5  200     2

new_df= df[(df.Step == 1)&(df.A.isin([4,5,6,7]))&(df.B.isin([10,20,30,40]))].nlargest(1,'D')
print (new_df)
   A   B    C    D  Step
5  5  20  2.5  900     1

Upvotes: 4

MaxU - stand with Ukraine

Reputation: 210842

Using DataFrame.query() method:

In [7]: new_df = df.query("Step==1 and A in [4,5,6,7] and B in [10,20,30,40]")

In [8]: new_df
Out[8]:
   Step  A   B    C
0     1  4  10  0.0
1     1  5  20  0.5
2     1  6  30  1.0
3     1  7  40  1.5
4     1  4  10  2.0
5     1  5  20  2.5

Upvotes: 4

Dataframe Boolean Logic Index Match

Answers (2)

Related Questions