Reputation: 107
I am working on a binary classification problem and am struggling with removing outliers and improving accuracy.
ratings is one of my features; it looks like this:
0 0.027465
1 0.027465
2 0.027465
3 0.027465
4 0.027465
...
26043 0.027465
26044 0.027465
26045 0.102234
26046 0.027465
26047 0.027465
Mean of the data:
train.ratings.mean()
0.03871552285960927
Standard deviation of the data:
train.ratings.std()
0.07585168664836195
I tried a log transformation, but accuracy did not improve:
train['ratings']=np.log(train.ratings+1)
My goal is to classify each row as True or False:
train.netgain
0 False
1 False
2 False
3 False
4 True
...
26043 True
26044 False
26045 True
26046 False
26047 False
Upvotes: 1
Views: 2507
Reputation: 3217
Assume that the ratings feature is normally distributed and standardize it, i.e. convert each value to a z-score.
For a normal distribution, about 99.7% of values lie within 3 standard deviations of the mean, so we can remove the values that are more than 3 standard deviations away from the mean.
See the Python code below.
ratings_mean = train['ratings'].mean()  # mean of the ratings column
ratings_std = train['ratings'].std()  # standard deviation of the ratings column
train['ratings'] = train['ratings'].map(lambda x: (x - ratings_mean) / ratings_std)  # z-score of each value
The column now holds z-scores, so its mean should be 0 and its standard deviation 1. From this we can find the rows whose value is greater than 3 or less than -3 and remove them from the dataset.
train = train[np.abs(train['ratings']) < 3]
The train DataFrame now no longer contains the outlier rows.
**Note: You can use 2 standard deviations instead, since about 95% of normally distributed data lies within 2 standard deviations of the mean. The right threshold depends on domain knowledge and your data.**
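The steps above can be put together as a small self-contained sketch. The sample data and the helper name `remove_zscore_outliers` are made up for illustration and are not from the question. Unlike the snippet above, this version leaves the ratings column unchanged and only uses the z-scores to pick rows.

```python
import numpy as np
import pandas as pd

def remove_zscore_outliers(df, column, threshold=3.0):
    """Keep only rows whose z-score in `column` is below `threshold` in absolute value."""
    mean = df[column].mean()
    std = df[column].std()  # pandas default is the sample std (ddof=1)
    z = (df[column] - mean) / std
    return df[np.abs(z) < threshold]

# Hypothetical data: 19 identical ratings plus one extreme value
train = pd.DataFrame({"ratings": [0.027465] * 19 + [0.9]})
cleaned = remove_zscore_outliers(train, "ratings")
print(len(train), len(cleaned))
```

Keeping the original values can be handy if the feature is fed to the model afterwards. Note that if the standard deviation is 0, every z-score is NaN and this sketch would drop all rows, so that case may need separate handling.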
Upvotes: 0
Reputation: 151
One method I have used is to calculate the MAD (median absolute deviation) and then tag each outlier with a boolean flag, which makes it easy to select all outliers.
Sample MAD calculation:
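import numpy as np

def mad(x):
    # Median absolute deviation: median of |x - median(x)|
    return np.median(np.abs(x - np.median(x)))

def mad_ratio(x):
    # Distance of each value from the median, in units of the MAD
    mad_value = mad(x)
    if mad_value == 0:
        # The ratio is undefined when the MAD is 0; return a scalar 0 in that case
        return 0
    x_mad = np.abs(x - np.median(x)) / mad_value
    return x_mad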
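The tagging step described above is not shown, so here is a minimal sketch of how it might look. The threshold of 3.0, the column name `is_outlier`, and the sample values are assumptions chosen for illustration. This variant returns an array of zeros instead of a scalar when the MAD is 0, so the result can always be used as a row mask.

```python
import numpy as np
import pandas as pd

def mad_ratio(x):
    """Absolute deviation from the median, in units of the MAD."""
    x = np.asarray(x, dtype=float)
    deviation = np.abs(x - np.median(x))
    mad_value = np.median(deviation)
    if mad_value == 0:
        # The ratio is undefined here, so report zeros (nothing gets flagged)
        return np.zeros_like(x)
    return deviation / mad_value

# Hypothetical ratings with one extreme value
train = pd.DataFrame({"ratings": [0.02, 0.03, 0.025, 0.035, 0.03, 0.9]})
train["is_outlier"] = mad_ratio(train["ratings"]) > 3.0  # assumed threshold
outliers = train[train["is_outlier"]]
print(outliers)
```

The MAD is 0 whenever more than half of the values are identical. The sample in the question is dominated by the value 0.027465, so that case may well apply there, and then this approach flags nothing.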
Upvotes: 1