Reputation: 586
I have the above distribution with a mean of -0.02, a standard deviation of 0.09, and a sample size of 13905.
I am just not sure why the distribution is left-skewed given the large sample size. The bin [-2.0, -0.5] contains only 10 samples (outliers), which explains the shape.
I am just wondering whether it is possible to normalize the data to make the distribution smoother and closer to normal. The purpose is to feed it into a model while reducing the standard error of the predictor.
Upvotes: 7
Views: 5404
Reputation: 402
I agree with the top answer, except for the last 2 paragraphs, because the interpretation of normaltest's output is flipped. Those paragraphs should instead read:
"The test returns two values k2 and p. The value of p is of our interest here. If p is less than some threshold (ex 0.001 or so), we reject the null hypothesis that data comes from a normal distribution.
In the example above, you'll see that p is less than 0.001 while transformed_p is greater than this threshold, indicating that we are moving in the right direction."
Source: normaltest documentation.
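A quick, self-contained way to see this interpretation in action, using synthetic data (not the questioner's data): an obviously skewed sample should fail the test (tiny p), while a genuinely normal sample should not.

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=1000)       # drawn from a normal distribution
skewed_sample = rng.exponential(size=1000)  # strongly right-skewed

_, p_normal = normaltest(normal_sample)
_, p_skewed = normaltest(skewed_sample)

# Small p rejects the null hypothesis that the data is normal:
# p_skewed falls far below 0.001, while p_normal typically stays well above it.
```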
Upvotes: 0
Reputation: 7967
You have two options here: a Box-Cox transform or a Yeo-Johnson transform. The issue with the Box-Cox transform is that it applies only to positive numbers, so to use it you'd have to take an exponential, perform the Box-Cox transform, and then take the log to get the data back in the original scale. The Box-Cox transform is available in scipy.stats.
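A minimal sketch of the exponentiate-then-Box-Cox route described above (the short array here is illustrative, not the questioner's data):

```python
import numpy as np
from scipy.stats import boxcox

data = np.array([-0.357, -0.286, -0.003, 0.0, 0.014, 0.071])

# scipy.stats.boxcox raises ValueError for non-positive input,
# so exponentiate first to make every value strictly positive.
positive = np.exp(data)
transformed, fitted_lambda = boxcox(positive)
```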
You can avoid those steps and simply use the Yeo-Johnson transform, for which sklearn provides an API:
from matplotlib import pyplot as plt
from scipy.stats import normaltest
import numpy as np
from sklearn.preprocessing import PowerTransformer
data = np.array([
    -0.35714286, -0.28571429, -0.00257143, -0.00271429, -0.00142857,
    0., 0., 0., 0.00142857, 0.00285714, 0.00714286, 0.00714286,
    0.01, 0.01428571, 0.01428571, 0.01428571, 0.01428571, 0.01428571,
    0.01428571, 0.02142857, 0.07142857
])
pt = PowerTransformer(method='yeo-johnson')
data = data.reshape(-1, 1)  # sklearn expects a 2D array of shape (n_samples, n_features)
pt.fit(data)
transformed_data = pt.transform(data)
We have transformed our data, but we need a way to measure whether we have moved in the right direction. Since our goal is to move towards a normal distribution, we will use a normality test.
k2, p = normaltest(data)
transformed_k2, transformed_p = normaltest(transformed_data)
The test returns two values k2 and p. The value of p is of our interest here. If p is greater than some threshold (ex 0.001 or so), we can say reject the hypothesis that data comes from a normal distribution.
In the example above, you'll see that p is greater than 0.001 while transformed_p is less than this threshold, indicating that we are moving in the right direction.
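One practical follow-up, since the question mentions feeding the transformed values into a model: PowerTransformer keeps the fitted parameters, so values on the transformed scale can be mapped back with inverse_transform. A minimal sketch (the short array is illustrative, not the questioner's data):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

data = np.array([-0.357, -0.286, 0.0, 0.014, 0.071, 0.021]).reshape(-1, 1)

pt = PowerTransformer(method='yeo-johnson')
transformed = pt.fit_transform(data)

# Map values on the transformed scale back to the original scale
recovered = pt.inverse_transform(transformed)
```

This round-trips to the original data (up to floating-point precision), which is useful if model predictions need to be reported on the original scale.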
Upvotes: 10