Reputation: 575
I am removing outliers from a dataset.
I decided to remove outliers from each column one by one. My columns have different numbers of missing values.
I used this code, but it removes the whole row containing the outlier, and because my data has many NaN values, the number of rows dropped drastically.
def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out
Then I decided to handle each column separately and replace the outliers in that column with NaN instead of dropping rows. I wrote this code:
def remove_outlier(df_in, col_name, thres=1.5):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # interquartile range
    fence_low = q1 - thres * iqr
    fence_high = q3 + thres * iqr
    mask = (df_in[col_name] > fence_high) & (df_in[col_name] < fence_low)
    df_in.loc[mask, col_name] = np.nan
    return df_in
But this code doesn't filter the outliers; the data comes back unchanged.
What is wrong in this code? How can I correct it?
Is there a more elegant method to filter outliers?
Upvotes: 1
Views: 1373
Reputation: 316
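Check the condition in your mask. How can that be &? A value cannot be above fence_high and below fence_low at the same time, so the mask is always False and nothing gets replaced. It should be |.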
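A minimal sketch of the function with the condition changed to |, assuming pandas and NumPy are imported as pd and np. The name mask_outliers, the copy of the input frame, and the small sample frame are illustrative choices, not part of the question:

```python
import numpy as np
import pandas as pd


def mask_outliers(df_in, col_name, thres=1.5):
    """Return a copy of df_in with IQR outliers in col_name set to NaN."""
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # interquartile range
    fence_low = q1 - thres * iqr
    fence_high = q3 + thres * iqr
    # An outlier lies above the upper fence OR below the lower fence.
    mask = (df_in[col_name] > fence_high) | (df_in[col_name] < fence_low)
    df_out = df_in.copy()  # assumption: leave the caller's frame untouched
    df_out.loc[mask, col_name] = np.nan
    return df_out


# Hypothetical sample data: 100.0 is the only outlier in column "a".
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [1.0, np.nan, 3.0, 4.0, 5.0],
})
cleaned = mask_outliers(df, "a")
```

Because only the cell is overwritten, the number of rows stays the same and the other columns are not touched.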
Upvotes: 1
Reputation: 521
df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
In this snippet, you keep only the rows where both df_in[col_name] > fence_low and df_in[col_name] < fence_high hold, so every row that fails either condition is removed. That includes rows where the column is NaN, because a comparison with NaN evaluates to False.
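A small sketch of that effect, on a made-up frame (the column names and values are only for illustration): the row filter drops the NaN row along with the outlier row, unless NaN is explicitly allowed through.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one NaN and one outlier (100.0) in column "a".
df_in = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 100.0],
    "b": [10, 20, 30, 40, 50],
})

q1 = df_in["a"].quantile(0.25)  # quantile() skips NaN by default
q3 = df_in["a"].quantile(0.75)
iqr = q3 - q1
fence_low = q1 - 1.5 * iqr
fence_high = q3 + 1.5 * iqr

# Comparisons with NaN are False, so the NaN row fails the filter too.
inside = (df_in["a"] > fence_low) & (df_in["a"] < fence_high)
strict = df_in.loc[inside]

# Keeping NaN rows explicitly removes only the true outlier row.
lenient = df_in.loc[inside | df_in["a"].isna()]

print(len(df_in), len(strict), len(lenient))
```

With many NaN values spread over many columns, applying the strict filter column after column is what shrinks the dataset so quickly.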
As a general rule, if a column has 30% outliers, 30% of your dataset will disappear, so you have two options:
1. Fill the missing values (ffill, mean, a constant value, ...)
2. Or drop the feature if it is not mandatory, because sometimes it is better to drop a feature than to shrink your dataset too much
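The two options could look roughly like this in pandas. The frame and column names are invented for the example, and which fill strategy makes sense depends on your data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame where outliers in "a" were already replaced by NaN.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0],
    "b": [10, 20, 30, 40, 50],
})

# Option 1: fill the gaps.
a_ffill = df["a"].ffill()                  # carry the last valid value forward
a_mean = df["a"].fillna(df["a"].mean())    # replace with the column mean
a_const = df["a"].fillna(0.0)              # replace with a constant

# Option 2: drop the feature entirely and keep every row.
df_without_a = df.drop(columns=["a"])
```

Note that ffill leaves a leading NaN in place if the first value of the column is missing.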
Hope it helps
Upvotes: 1