Reputation: 500
import pandas as pd
data = {'side':['a', 'a', 'a', 'b', 'a', 'a', 'a', 'b', 'b', 'a'],
'price':[10400, 10400, 10400, 10380, 1041, 10400, 1041, 10400, 10399, 10399],
'b_d100_v':[1, 1, 1, 0.3, 10, 10, 10, 10, 9, 9],
'b_d100_p':[10390, 10391, 10390, 10390, 10390.5, 10385, 10385, 10386, 10387, 10387],
'a_d052_v':[11, 11, 11, 9.3, 0.1, 0.1, 0.1, 0.1, 0.2, 0.3],
'a_d052_p':[10399, 10403, 10400, 10401, 1041, 1041, 10400, 10400, 10402, 10404]
}
df = pd.DataFrame(data, index=[101, 102, 102, 104, 105, 106, 107, 107, 107, 107])
print(df)
side price b_d100_v b_d100_p a_d052_v a_d052_p
101 a 10400 1.0 10390.0 11.0 10399
102 a 10400 1.0 10391.0 11.0 10403
102 a 10400 1.0 10390.0 11.0 10400
104 b 10380 0.3 10390.0 9.3 10401
105 a 1041 10.0 10390.5 0.1 1041 # Row to delete (outlier 1041 on 'price' and 'a_d052_p' columns)
106 a 10400 10.0 10385.0 0.1 1041 # Row to delete (outlier 1041 on 'a_d052_p' column)
107 a 1041 10.0 10385.0 0.1 10400 # Row to delete (outlier 1041 on 'price' column)
107 b 10400 10.0 10386.0 0.1 10400
107 b 10399 9.0 10387.0 0.2 10402
107 a 10399 9.0 10387.0 0.3 10404
I want to delete the rows containing outliers only on the 'price', 'b_d100_p' and 'a_d052_p' columns. To do this, I chose to use a condition based on the standard deviation. Here is the code I tried.
row_with_potential_outliers = ['price', 'b_d100_p', 'a_d052_p']
df = df[abs((df[row_with_potential_outliers] - df[row_with_potential_outliers].mean()) / df[row_with_potential_outliers].std()) < 1.5] # The value '1.5' here is arbitrary, and does not matter too much
print(df)
side price b_d100_v b_d100_p a_d052_v a_d052_p
101 NaN 10400.0 NaN 10390.0 NaN 10399.0
102 NaN 10400.0 NaN 10391.0 NaN 10403.0
102 NaN 10400.0 NaN 10390.0 NaN 10400.0
104 NaN 10380.0 NaN 10390.0 NaN 10401.0
105 NaN NaN NaN 10390.5 NaN NaN # Row to delete (outlier 1041 on 'price' and 'a_d052_p' columns)
106 NaN 10400.0 NaN 10385.0 NaN NaN # Row to delete (outlier 1041 on 'a_d052_p' column)
107 NaN NaN NaN 10385.0 NaN 10400.0 # Row to delete (outlier 1041 on 'price' column)
107 NaN 10400.0 NaN 10386.0 NaN 10400.0
107 NaN 10399.0 NaN 10387.0 NaN 10402.0
107 NaN 10399.0 NaN 10387.0 NaN 10404.0
How to keep the original values of the columns 'side', 'b_d100_v' and 'a_d052_v'? This will allow me to then apply a 'dropna()' to achieve my ends... Or maybe there is a better solution ? I work with dataframes with hundreds of columns and thousands of rows, so I want to avoid iterations as much as possible for performance reasons. Thanks in advance for your help.
Upvotes: 0
Views: 57
Reputation: 30971
First, to use proper names, set the name of your list with columns to check just as cols_to_check:
cols_to_check = ['price', 'b_d100_p', 'a_d052_p']
Note that rows with outliers are not found yet. You specify only which columns to check for outliers.
Then define a function to look for outliers in particular column, with the given threshold:
def isOutlier(col, thr):
return ((col - col.mean()) / col.std()).abs() > thr
And to drop rows with outliers, run:
df = df[~df[cols_to_check].apply(isOutlier, thr=1.5).any(axis=1)]
The result is:
side price b_d100_v b_d100_p a_d052_v a_d052_p
101 a 10400 1.0 10390.0 11.0 10399
102 a 10400 1.0 10391.0 11.0 10403
102 a 10400 1.0 10390.0 11.0 10400
104 b 10380 0.3 10390.0 9.3 10401
107 b 10400 10.0 10386.0 0.1 10400
107 b 10399 9.0 10387.0 0.2 10402
107 a 10399 9.0 10387.0 0.3 10404
To fully comprehend how this code works:
To see the full result of check whether any element is an outlier, within its column, run:
df[cols_to_check].apply(isOutlier, thr=1.5)
You apply isOutlier function to each column from cols_to_check list, passing the chosen value of the threshold. The result is:
price b_d100_p a_d052_p
101 False False False
102 False False False
102 False False False
104 False False False
105 True False True
106 False False True
107 True False False
107 False False False
107 False False False
107 False False False
To get the cumulative result for each row (whether any element in a row is True), run:
df[cols_to_check].apply(isOutlier, thr=1.5).any(axis=1)
The result is:
101 False
102 False
102 False
104 False
105 True
106 True
107 True
107 False
107 False
107 False
dtype: bool
So if you run df[~...]
you will get rows where the above condition
is False and just this result should be saved back in df.
Note that you want to delete the rows containing al least one outlier, not change the outlier value to NaN (as it was the result of your code).
Upvotes: 2
Reputation: 56
To make your method give the desired output, you just need to apply the outlier removal to the specified columns of the dataframe df
df[row_with_potential_outliers] = df[row_with_potential_outliers][abs((df[row_with_potential_outliers] - df[row_with_potential_outliers].mean()) / df[row_with_potential_outliers].std()) < 1.5]
side price b_d100_v b_d100_p a_d052_v a_d052_p
101 a 10400.0 1.0 10390.0 11.0 10399.0
102 a 10400.0 1.0 10391.0 11.0 10403.0
102 a 10400.0 1.0 10390.0 11.0 10400.0
104 b 10380.0 0.3 10390.0 9.3 10401.0
105 a NaN 10.0 10390.5 0.1 NaN
106 a 10400.0 10.0 10385.0 0.1 NaN
107 a NaN 10.0 10385.0 0.1 10400.0
107 b 10400.0 10.0 10386.0 0.1 10400.0
107 b 10399.0 9.0 10387.0 0.2 10402.0
107 a 10399.0 9.0 10387.0 0.3 10404.0
Upvotes: 1