marton mar suri

Reputation: 107

How to remove the outliers using Python

I am working on a binary classification problem and am struggling with removing outliers and improving accuracy.

Ratings is one of my features and looks like this:

0        0.027465
1        0.027465
2        0.027465
3        0.027465
4        0.027465
           ...   
26043    0.027465
26044    0.027465
26045    0.102234
26046    0.027465
26047    0.027465

mean value of the data:

train.ratings.mean()
0.03871552285960927 

std of the data:

train.ratings.std()
0.07585168664836195

I tried a log transformation, but the accuracy did not increase:

train['ratings']=np.log(train.ratings+1)

My goal is to classify the data as True or False:

train.netgain
0        False
1        False
2        False
3        False
4         True
         ...  
26043     True
26044    False
26045     True
26046    False
26047    False

Upvotes: 1

Views: 2507

Answers (2)

Ravi

Reputation: 3217

  • Assume that the ratings feature is normally distributed and standardize it (subtract the mean and divide by the standard deviation).

  • For a normal distribution, about 99.7% of values fall within 3 standard deviations of the mean, so we can remove the values that are more than 3 standard deviations away from the mean.


See below for the Python code.

ratings_mean = train['ratings'].mean()  # mean of the ratings column

ratings_std = train['ratings'].std()    # standard deviation of the ratings column

# Standardize: subtract the mean and divide by the standard deviation
train['ratings'] = (train['ratings'] - ratings_mean) / ratings_std

OK, the ratings column is now standardized: its mean should be 0 and its standard deviation should be 1. Note that standardizing only rescales the data; it does not make it normal. Values greater than 3 or less than -3 are more than 3 standard deviations from the mean, so we can remove those rows from the dataset.

train = train[np.abs(train['ratings']) < 3]

The train DataFrame now contains only the rows whose rating is within 3 standard deviations of the mean.
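To see what that cutoff means in the original units, the mean and standard deviation reported in the question can be plugged in directly. This is only a rough sketch; the variable names `k`, `lower`, and `upper` are introduced here for illustration.

```python
# Mean and standard deviation as reported in the question.
ratings_mean = 0.03871552285960927
ratings_std = 0.07585168664836195

k = 3  # number of standard deviations to keep (an assumed cutoff)

# Bounds of the interval mean +/- k * std in the original units.
lower = ratings_mean - k * ratings_std
upper = ratings_mean + k * ratings_std

print(f"keep ratings in [{lower:.4f}, {upper:.4f}]")
# keep ratings in [-0.1888, 0.2663]
```

The lower bound is negative, so if ratings can never be negative (an assumption about this data), only the upper bound of roughly 0.266 actually removes anything.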

Note: You can use 2 standard deviations instead, because about 95% of normally distributed data falls within 2 standard deviations of the mean. It all depends on domain knowledge and your data.
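As a variation on the same idea, the z-scores can be kept in a temporary Series so that the ratings column keeps its original values and the cutoff becomes a parameter. This is a minimal sketch: the function name `remove_outliers_zscore`, the parameter `k`, and the synthetic data are all assumptions made for illustration, not part of the question.

```python
import numpy as np
import pandas as pd

def remove_outliers_zscore(df, column, k=3.0):
    """Return the rows of df whose `column` is within k standard deviations of its mean."""
    mean = df[column].mean()
    std = df[column].std()
    # A zero (or undefined) standard deviation means there is nothing to scale by.
    if std == 0 or np.isnan(std):
        return df.copy()
    z = (df[column] - mean) / std  # z-scores, kept separate from the data
    return df[z.abs() < k]

# Synthetic stand-in for the real data: 1000 ordinary values plus two extreme ones.
rng = np.random.default_rng(0)
values = np.append(rng.normal(0.03, 0.01, 1000), [0.9, 1.0])
train = pd.DataFrame({"ratings": values})

cleaned = remove_outliers_zscore(train, "ratings", k=3)
print(len(train), len(cleaned))
```

Passing `k=2` gives the stricter 2-standard-deviation rule. Keep in mind that extreme values inflate the mean and standard deviation themselves, so this rule can miss outliers when there are many of them.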

Upvotes: 0

TZof

Reputation: 151

One method I have used is to calculate the MAD (median absolute deviation) and then tag each outlier with a boolean flag, which makes it easy to select all the outliers.

Sample of MAD calculation:

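import numpy as np

def mad(x):
    # Median absolute deviation: median distance of the values from their median
    return np.median(np.abs(x - np.median(x)))

def mad_ratio(x):
    mad_value = mad(x)
    # Avoid dividing by zero when at least half of the values equal the median
    if mad_value == 0:
        return 0
    # Distance of each value from the median, in units of MAD
    x_mad = np.abs(x - np.median(x)) / mad_value
    return x_mad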
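The tagging step could then look something like the sketch below. The function name `mad_outlier_mask`, the `threshold` of 3, the `is_outlier` column name, and the sample values are assumptions for illustration; the right threshold depends on the data.

```python
import numpy as np
import pandas as pd

def mad(x):
    """Median absolute deviation of a 1-D array-like."""
    x = np.asarray(x, dtype=float)
    return np.median(np.abs(x - np.median(x)))

def mad_outlier_mask(x, threshold=3.0):
    """Return a boolean array: True where a value is more than `threshold` MADs from the median."""
    x = np.asarray(x, dtype=float)
    mad_value = mad(x)
    # With MAD == 0 the ratio is undefined, so flag nothing.
    if mad_value == 0:
        return np.zeros(x.shape, dtype=bool)
    return np.abs(x - np.median(x)) / mad_value > threshold

# Small made-up sample; the last value is far from the rest.
train = pd.DataFrame({"ratings": [0.02, 0.03, 0.025, 0.035, 0.03, 0.9]})
train["is_outlier"] = mad_outlier_mask(train["ratings"])

outliers = train[train["is_outlier"]]   # rows tagged as outliers
cleaned = train[~train["is_outlier"]]   # rows kept
print(outliers)
```

One caveat: if more than half of the values are identical, which the sample shown in the question suggests may be the case for ratings, the MAD is 0 and this approach flags nothing, so another rule would be needed.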

Upvotes: 1
