Reputation: 862
I am learning statistics, and I want to check whether a data set comes from a normal distribution.
I found that a KS test can do this. My code is listed below:
In [1]: from scipy import stats
In [2]: from read_cj import read
In [3]: df = read()
[read] cost 10.066437721252441
In [4]: stats.kstest(df['XH.self_rank(30)'],'norm')
Out[4]: KstestResult(statistic=0.3203690716401366, pvalue=0.0)
This result seems to mean that my column XH.self_rank(30) is normally distributed.
But the histogram looks like this:
I don't think it comes from a normal distribution.
So I tried some more:
In [9]: stats.kstest([1,2,3,4], 'norm')
Out[9]: KstestResult(statistic=0.8413447460685429, pvalue=0.0012672077773713667)
In [10]: stats.kstest([1]*10000, 'norm')
Out[10]: KstestResult(statistic=0.8413447460685429, pvalue=0.0)
As you can see, [1]*10000 is still considered to come from a normal distribution, and [1]*10000 has the same statistic value as [1, 2, 3, 4], but a different p-value. This confuses me.
I think this kind of histogram is a normal distribution:
Did I miss anything? Can you help with this?
Upvotes: 0
Views: 445
Reputation: 1
The p-value in this case (and honestly, just about every time this phrase is used) is the probability that you would have seen the test statistic you saw, or a more extreme one, if the null hypothesis were true. Thus, if you get a very small value, it means that the data you saw were very unlikely, given the hypothesized distribution. Armed with that information, you should be willing to reject the null hypothesis, with the ever-present risk that it might have been true and you might just have observed some really unlikely data.
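To make this concrete for the two samples in the question, here is a minimal sketch (not a definitive explanation of scipy's internals). It relies on the standard definition of the KS statistic as the largest gap between the sample's empirical CDF and the hypothesized CDF. The variable names small, large and gap_at_one are only illustrative.

```python
from scipy import stats

# The two samples from the question, both tested against the standard normal.
small = stats.kstest([1, 2, 3, 4], 'norm')
large = stats.kstest([1] * 10000, 'norm')

# In both samples the smallest value is 1. Just below x = 1 the empirical CDF
# is still 0, while the standard normal CDF has already reached about 0.841,
# and that is the largest gap anywhere for either sample.
gap_at_one = stats.norm.cdf(1)

print(small.statistic, large.statistic, gap_at_one)
# The statistic is the same, but the p-value also depends on the sample size:
# the same gap is far less plausible under the null with 10000 points than with 4.
print(small.pvalue, large.pvalue)
```

So an identical statistic with a different p-value is expected here: the statistic measures how far the sample is from the hypothesized distribution, and the p-value says how surprising that distance is for a sample of that size.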
Upvotes: 0
Reputation: 13929
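The null hypothesis of the Kolmogorov-Smirnov test is that the sample comes from the given distribution. With 'norm' and no extra parameters, that is the standard normal distribution (mean 0, standard deviation 1). So a p-value near zero rejects normality.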
from scipy import stats
import random
print(stats.kstest([1] * 1000, 'norm').pvalue) # 0.0
print(stats.kstest([random.gauss(0, 1) for _ in range(1000)], 'norm').pvalue) # e.g. 0.7275173462861986 (varies from run to run)
You can see that the constant sample leads to a p-value of zero, strongly suggesting this is not normal. On the other hand, the normal sample leads to a large p-value, so the test finds no evidence against normality, which is the correct outcome here.
The same applies to your case. All the suspected samples show p-values near zero, indicating that they are not from the standard normal distribution that 'norm' tests against by default. So stats.kstest is not broken, in my opinion.
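One thing to watch for: because 'norm' without parameters means the standard normal, even a perfectly normal sample with a different mean or spread will be rejected. The sketch below is only an illustration on synthetic data (the seed, loc=5.0 and scale=2.0 are made-up values, not taken from the question). It passes the location and scale through the args parameter of stats.kstest.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
x = rng.normal(loc=5.0, scale=2.0, size=1000)  # normal, but not standard normal

# Default call: compares against N(0, 1), so this normal sample is rejected.
against_standard = stats.kstest(x, 'norm')

# args=(loc, scale) compares against a normal with mean 5 and std 2 instead.
against_true = stats.kstest(x, 'norm', args=(5.0, 2.0))

# Usually the parameters are unknown and get estimated from the same sample.
# The resulting p-value is then too optimistic (the Lilliefors problem),
# so treat it as a rough guide only.
against_fitted = stats.kstest(x, 'norm', args=(x.mean(), x.std(ddof=1)))

print(against_standard.statistic, against_standard.pvalue)
print(against_true.statistic, against_true.pvalue)
print(against_fitted.statistic, against_fitted.pvalue)
```

If the column in the question might be normal with some other mean and standard deviation, a call like the last one is closer to what was intended, with the caveat noted in the comment.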
Upvotes: 1