shashank2806

Reputation: 575

Removing outliers from a single column

I am removing outliers from a dataset.

I decided to remove outliers from each column one by one. My columns have different numbers of missing values.

I used this code, but it removed the whole row containing the outlier, and because my data has many NaN values, the number of rows dropped drastically.

def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3-q1 #Interquartile range
    fence_low  = q1-1.5*iqr
    fence_high = q3+1.5*iqr
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out

Then I decided to handle each column separately and replace the outliers with NaN instead of dropping rows, so I wrote this code:

def remove_outlier(df_in, col_name, thres=1.5):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3-q1 #Interquartile range
    fence_low  = q1-thres*iqr
    fence_high = q3+thres*iqr
    mask = (df_in[col_name] > fence_high) & (df_in[col_name] < fence_low)
    df_in.loc[mask, col_name] = np.nan
    return df_in

But this code doesn't filter the outliers; it gives the same result.

What is wrong in this code? How can I correct it?

Is there a more elegant method to filter outliers?

Upvotes: 1

Views: 1373

Answers (2)

Venkatesh Garnepudi

Reputation: 316

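Check the condition in your mask. A value cannot be greater than fence_high and less than fence_low at the same time, so with & the mask is always False and nothing gets replaced. It should be |.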
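
A minimal sketch of the asker's second function with that one operator changed. The `.copy()` call and the toy DataFrame are my own additions for illustration, not part of the original code; drop the copy if you do want to modify the frame in place.

```python
import numpy as np
import pandas as pd


def remove_outlier(df_in, col_name, thres=1.5):
    """Return a copy of df_in with IQR outliers in col_name set to NaN."""
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # interquartile range
    fence_low = q1 - thres * iqr
    fence_high = q3 + thres * iqr
    # An outlier lies above the upper fence OR below the lower fence
    mask = (df_in[col_name] > fence_high) | (df_in[col_name] < fence_low)
    df_out = df_in.copy()  # assumption: leave the caller's frame untouched
    df_out.loc[mask, col_name] = np.nan
    return df_out


# Hypothetical toy data: column "a" has one extreme value, "b" one missing value
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [np.nan, 1.0, 2.0, 3.0, 4.0],
})
cleaned = remove_outlier(df, "a")
```

Only the outlying cell in "a" becomes NaN; every row is kept and column "b" is not touched.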

Upvotes: 1

thetradingdogdj

Reputation: 521

df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]

In this snippet, you select rows where df_in[col_name] > fence_low and df_in[col_name] < fence_high, so whenever either condition is not met, the whole row is removed.
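
This also likely explains why the NaN values made things worse: a comparison against NaN evaluates to False, so rows with a missing value in that column fail the filter too. A small sketch on a made-up Series (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: one missing value and one extreme value
s = pd.Series([1.0, 2.0, np.nan, 3.0, 4.0, 100.0])

q1, q3 = s.quantile(0.25), s.quantile(0.75)  # quantile() skips NaN
iqr = q3 - q1
fence_low, fence_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

keep = (s > fence_low) & (s < fence_high)
# NaN compares as False, so the missing value is dropped along with the outlier
kept = s[keep]
```

Here both the NaN entry and the value 100.0 are filtered out, leaving four of the six rows.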

As a general rule, if a column has 30% outliers, 30% of your dataset will disappear. You have two options:
1. Fill the missing values (ffill, mean, a constant value, ...).
2. Or drop the feature if it is not mandatory, because sometimes it is better to drop a feature than to shrink your dataset too much.
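
A rough sketch of both options in pandas, assuming the outliers have already been replaced by NaN. The column names and values are made up for illustration, and which fill strategy suits your data is a judgment call.

```python
import numpy as np
import pandas as pd

# Hypothetical column in which an outlier has already been replaced by NaN
s = pd.Series([1.0, 2.0, np.nan, 4.0])

# Option 1: fill the gap
filled_ffill = s.ffill()          # carry the previous valid value forward
filled_mean = s.fillna(s.mean())  # mean of the remaining values
filled_const = s.fillna(0.0)      # a fixed constant

# Option 2: drop the feature entirely (column name "a" is illustrative)
df = pd.DataFrame({"a": s, "b": [10, 20, 30, 40]})
df_without_a = df.drop(columns=["a"])
```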

Hope it helps

Upvotes: 1
