This is a continuation of the method used in this question.
Say we have a dataframe:
   Make       Model  Year     HP  Cylinders  Transmission  MPG-H  MPG-C  Price
0   BMW  1 Series M  2011  335.0        6.0        MANUAL     26     19  46135
1   BMW    1 Series  2011  300.0        6.0        MANUAL     28     19  40650
2   BMW    1 Series  2011  300.0        6.0        MANUAL     28     20  36350
3   BMW    1 Series  2011  230.0        6.0        MANUAL     28     18  29450
4   BMW    1 Series  2011  230.0        6.0        MANUAL     28     18  34500
...
Using the interquartile range (IQR), i.e. the middle 50%, I created 2 variables, `upper` and `lower`. The specific calculation isn't important in this discussion, but to give an example of `upper`:
Year 2029.50
HP 498.00
Cylinders 9.00
MPG-H 42.00
MPG-C 31.00
Price 75291.25
As expected, it only calculates values for the numeric columns (both int64 and float64); text columns such as Make, Model and Transmission are skipped.
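For reference, a minimal sketch of how `lower` and `upper` might be computed. The question does not show the actual calculation, so the common 1.5 × IQR fences are an assumption here, and the small frame is made up for illustration:

```python
import pandas as pd

# Made-up sample; the real dataframe has more rows and columns.
df = pd.DataFrame({
    "Make": ["BMW", "BMW", "BMW", "BMW", "BMW"],
    "HP": [335.0, 300.0, 300.0, 230.0, 230.0],
    "Price": [46135, 40650, 36350, 29450, 34500],
})

# Quartiles of the numeric columns only; text columns are skipped.
q1 = df.quantile(0.25, numeric_only=True)
q3 = df.quantile(0.75, numeric_only=True)
iqr = q3 - q1

# Assumed fences: 1.5 * IQR beyond each quartile.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# The index of each bound holds only the numeric column names.
print(list(upper.index))
```

Because `lower` and `upper` are Series indexed by the numeric column names only, they carry no entry for columns like Make.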
When I want to filter out values that lie outside of the IQR,
correct_df = df[~((df < lower) |(df > upper)).any(axis=1)]
it gives me the right answer. However, when I invert the logic to use `&` instead of `|`, I get an empty dataframe. Here is the code:
another_df = df[((df >= lower) & (df <= upper)).all(axis=1)]
Which gives this empty result:
Make Model Year HP Cylinders Transmission Drive Mode MPG-H MPG-C Price
----------------------------------------------------------------------------------------------
This can be fixed by converting the index of `upper`/`lower` into a list (`lst`) and comparing only those columns:
another_df = df[((df[lst] >= lower) & (df[lst] <= upper)).all(axis=1)]
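The question does not show how `lst` is built. One plausible way, sketched here with made-up data and assumed bounds, is to take the labels from `lower.index`, so the comparison only touches columns that actually have bounds:

```python
import pandas as pd

# Made-up sample; the third row has an out-of-range HP on purpose.
df = pd.DataFrame({
    "Make": ["BMW", "BMW", "BMW"],
    "HP": [335.0, 300.0, 900.0],
    "Price": [46135, 40650, 36350],
})

# Assumed bounds, for illustration only.
lower = pd.Series({"HP": 100.0, "Price": 20000.0})
upper = pd.Series({"HP": 500.0, "Price": 75000.0})

# Column labels that have a bound (assumed construction of `lst`).
lst = lower.index.tolist()

# Compare only the bounded columns, then keep rows where every one is in range.
in_range = (df[lst] >= lower) & (df[lst] <= upper)
another_df = df[in_range.all(axis=1)]
```

The mask is built from `df[lst]`, but it indexes the full `df`, so the text columns are still present in `another_df`.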
It seems like `&` and `|` behave differently for non-numerical columns? Why does that happen?
Upvotes: 2
`&` and `|` behave just as you'd expect; they're not the problem. The problem is that you're using `all` in the code that doesn't work, but in the code that does work, you're using `any`.
In the first example you say "drop all rows where any column of the row is less than `lower` OR is greater than `upper`".
In the second example you say "select all rows where ALL columns of the row are greater than or equal to `lower` AND are less than or equal to `upper`".
Change `all` to `any` and you should be fine.
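A small sketch may help when comparing the two reductions. It uses made-up data and assumed bounds, and a hypothetical numeric column `Doors` stands in for a column without bounds, so the sketch does not depend on how a given pandas version compares strings with NaN. The alignment is done explicitly with `DataFrame.align`:

```python
import pandas as pd

df = pd.DataFrame({
    "HP": [335.0, 300.0, 900.0],
    "Price": [46135, 40650, 36350],
    "Doors": [2, 2, 4],  # hypothetical column with no bound
})

# Assumed bounds, for illustration only.
lower = pd.Series({"HP": 100.0, "Price": 20000.0})
upper = pd.Series({"HP": 500.0, "Price": 75000.0})
cols = lower.index.tolist()

outside = (df[cols] < lower) | (df[cols] > upper)
inside = (df[cols] >= lower) & (df[cols] <= upper)

keep_not_any_outside = ~outside.any(axis=1)  # first filter from the question
keep_all_inside = inside.all(axis=1)         # same rows, by De Morgan's law
keep_any_inside = inside.any(axis=1)         # looser: one in-range column is enough

# Align the bound to every column of the frame; Doors gets NaN as its bound.
df_aligned, lower_aligned = df.align(lower, axis=1)
ge_lower = df_aligned >= lower_aligned       # comparisons against NaN are False
```

On the bounded columns, `~outside.any(axis=1)` and `inside.all(axis=1)` keep the same rows, while `inside.any(axis=1)` also keeps the row whose HP is out of range. A column with no bound compares as False everywhere: that is harmless under `any`, but it makes `all` fail for every row, which is likely why the unrestricted `all` version came back empty.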
Upvotes: 1