David
David

Reputation: 500

Filtering pandas dataframe rows based on a multiple column condition

import pandas as pd

data = {'side':['a', 'a', 'a', 'b', 'a', 'a', 'a', 'b', 'b', 'a'],
        'price':[10400, 10400, 10400, 10380, 1041, 10400, 1041, 10400, 10399, 10399],
        'b_d100_v':[1, 1, 1, 0.3, 10, 10, 10, 10, 9, 9],
        'b_d100_p':[10390, 10391, 10390, 10390, 10390.5, 10385, 10385, 10386, 10387, 10387],
        'a_d052_v':[11, 11, 11, 9.3, 0.1, 0.1, 0.1, 0.1, 0.2, 0.3],
        'a_d052_p':[10399, 10403, 10400, 10401, 1041, 1041, 10400, 10400, 10402, 10404]
        }
df = pd.DataFrame(data, index=[101, 102, 102, 104, 105, 106, 107, 107, 107, 107])

print(df)
    side  price  b_d100_v  b_d100_p  a_d052_v  a_d052_p
101    a  10400       1.0   10390.0      11.0     10399
102    a  10400       1.0   10391.0      11.0     10403
102    a  10400       1.0   10390.0      11.0     10400
104    b  10380       0.3   10390.0       9.3     10401
105    a   1041      10.0   10390.5       0.1      1041  # Row to delete (outlier 1041 on 'price' and 'a_d052_p' columns)
106    a  10400      10.0   10385.0       0.1      1041  # Row to delete (outlier 1041 on 'a_d052_p' column)
107    a   1041      10.0   10385.0       0.1     10400  # Row to delete (outlier 1041 on 'price' column)
107    b  10400      10.0   10386.0       0.1     10400
107    b  10399       9.0   10387.0       0.2     10402
107    a  10399       9.0   10387.0       0.3     10404

I want to delete the rows containing outliers only on the 'price', 'b_d100_p' and 'a_d052_p' columns. To do this, I chose to use a condition based on the standard deviation. Here is the code I tried.

row_with_potential_outliers = ['price', 'b_d100_p', 'a_d052_p']
df = df[abs((df[row_with_potential_outliers] - df[row_with_potential_outliers].mean()) / df[row_with_potential_outliers].std()) < 1.5]  # The value '1.5' here is arbitrary, and does not matter too much

print(df)
    side    price  b_d100_v  b_d100_p  a_d052_v  a_d052_p
101  NaN  10400.0       NaN   10390.0       NaN   10399.0
102  NaN  10400.0       NaN   10391.0       NaN   10403.0
102  NaN  10400.0       NaN   10390.0       NaN   10400.0
104  NaN  10380.0       NaN   10390.0       NaN   10401.0
105  NaN      NaN       NaN   10390.5       NaN       NaN  # Row to delete (outlier 1041 on 'price' and 'a_d052_p' columns)
106  NaN  10400.0       NaN   10385.0       NaN       NaN  # Row to delete (outlier 1041 on 'a_d052_p' column)
107  NaN      NaN       NaN   10385.0       NaN   10400.0  # Row to delete (outlier 1041 on 'price' column)
107  NaN  10400.0       NaN   10386.0       NaN   10400.0
107  NaN  10399.0       NaN   10387.0       NaN   10402.0
107  NaN  10399.0       NaN   10387.0       NaN   10404.0

How to keep the original values of the columns 'side', 'b_d100_v' and 'a_d052_v'? This will allow me to then apply a 'dropna()' to achieve my ends... Or maybe there is a better solution ? I work with dataframes with hundreds of columns and thousands of rows, so I want to avoid iterations as much as possible for performance reasons. Thanks in advance for your help.

Upvotes: 0

Views: 57

Answers (2)

Valdi_Bo
Valdi_Bo

Reputation: 30971

First, to use proper names, set the name of your list with columns to check just as cols_to_check:

cols_to_check = ['price', 'b_d100_p', 'a_d052_p']

Note that rows with outliers are not found yet. You specify only which columns to check for outliers.

Then define a function to look for outliers in particular column, with the given threshold:

def isOutlier(col, thr):
    return ((col - col.mean()) / col.std()).abs() > thr

And to drop rows with outliers, run:

df = df[~df[cols_to_check].apply(isOutlier, thr=1.5).any(axis=1)]

The result is:

    side  price  b_d100_v  b_d100_p  a_d052_v  a_d052_p
101    a  10400       1.0   10390.0      11.0     10399
102    a  10400       1.0   10391.0      11.0     10403
102    a  10400       1.0   10390.0      11.0     10400
104    b  10380       0.3   10390.0       9.3     10401
107    b  10400      10.0   10386.0       0.1     10400
107    b  10399       9.0   10387.0       0.2     10402
107    a  10399       9.0   10387.0       0.3     10404

To fully comprehend how this code works:

  1. To see the full result of check whether any element is an outlier, within its column, run:

     df[cols_to_check].apply(isOutlier, thr=1.5)
    

    You apply isOutlier function to each column from cols_to_check list, passing the chosen value of the threshold. The result is:

          price  b_d100_p  a_d052_p
     101  False     False     False
     102  False     False     False
     102  False     False     False
     104  False     False     False
     105   True     False      True
     106  False     False      True
     107   True     False     False
     107  False     False     False
     107  False     False     False
     107  False     False     False
    
  2. To get the cumulative result for each row (whether any element in a row is True), run:

     df[cols_to_check].apply(isOutlier, thr=1.5).any(axis=1)
    

    The result is:

     101    False
     102    False
     102    False
     104    False
     105     True
     106     True
     107     True
     107    False
     107    False
     107    False
     dtype: bool
    

So if you run df[~...] you will get rows where the above condition is False and just this result should be saved back in df.

Note that you want to delete the rows containing al least one outlier, not change the outlier value to NaN (as it was the result of your code).

Upvotes: 2

Can Cakiroglu
Can Cakiroglu

Reputation: 56

To make your method give the desired output, you just need to apply the outlier removal to the specified columns of the dataframe df

df[row_with_potential_outliers] = df[row_with_potential_outliers][abs((df[row_with_potential_outliers] - df[row_with_potential_outliers].mean()) / df[row_with_potential_outliers].std()) < 1.5]

    side    price  b_d100_v  b_d100_p  a_d052_v  a_d052_p
101    a  10400.0       1.0   10390.0      11.0   10399.0
102    a  10400.0       1.0   10391.0      11.0   10403.0
102    a  10400.0       1.0   10390.0      11.0   10400.0
104    b  10380.0       0.3   10390.0       9.3   10401.0
105    a      NaN      10.0   10390.5       0.1       NaN
106    a  10400.0      10.0   10385.0       0.1       NaN
107    a      NaN      10.0   10385.0       0.1   10400.0
107    b  10400.0      10.0   10386.0       0.1   10400.0
107    b  10399.0       9.0   10387.0       0.2   10402.0
107    a  10399.0       9.0   10387.0       0.3   10404.0

Upvotes: 1

Related Questions