nick

Reputation: 862

Python KS test failed to identify a normal distribution?

I am learning statistics, and I want to check the distribution of some data to find out whether it comes from a normal distribution.

I found that a KS test can do this. My code is listed below:

In [1]: from scipy import stats

In [2]: from read_cj import read

In [3]: df = read()
[read] cost 10.066437721252441

In [4]: stats.kstest(df['XH.self_rank(30)'],'norm')
Out[4]: KstestResult(statistic=0.3203690716401366, pvalue=0.0)

This result seems to mean that my column XH.self_rank(30) is normally distributed.

But the histogram looks like this:

[image: histogram of the column]

I don't think it comes from a normal distribution.

And I tried some more:

In [9]: stats.kstest([1,2,3,4], 'norm')
Out[9]: KstestResult(statistic=0.8413447460685429, pvalue=0.0012672077773713667)

In [10]: stats.kstest([1]*10000, 'norm')
Out[10]: KstestResult(statistic=0.8413447460685429, pvalue=0.0)

As you can see, [1]*10000 still seems to be considered as coming from a normal distribution, and [1]*10000 has the same statistic value as [1, 2, 3, 4] but a different p-value. This confuses me.

I think a histogram like this one shows a normal distribution: [image: histogram]

Did I miss anything? Can you help with this?

Upvotes: 0

Views: 445

Answers (2)

David Lovell

Reputation: 1

The p-value in this case (and honestly, just about every time this phrase is used) is the probability that you would have seen the test statistic you saw, or a more extreme one, if the null hypothesis were true. Thus, a very small value means that the data you saw were very unlikely under the hypothesized distribution. Armed with that information, you should be willing to reject the null hypothesis, with the ever-present risk that it might have been true and you just happened to observe some really unlikely data.
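As a rough illustration of how this plays out with the two toy samples from the question, the sketch below prints the statistic and p-value for each. The variable names are only for illustration. Both samples have their smallest value at 1, where the standard normal CDF is already about 0.8413 while the empirical CDF just below 1 is still 0, so the largest gap (the KS statistic) is the same. The p-value also depends on the sample size, which is why it differs.

```python
from scipy import stats

small = [1, 2, 3, 4]
large = [1] * 10000

# Standard normal CDF at 1; the empirical CDF of both samples is 0 just below 1,
# so this value is the largest gap between the two curves.
gap_at_one = stats.norm.cdf(1)
print(gap_at_one)

res_small = stats.kstest(small, 'norm')
res_large = stats.kstest(large, 'norm')

# Same statistic for both samples, but the larger sample gives a smaller p-value.
print(res_small.statistic, res_small.pvalue)
print(res_large.statistic, res_large.pvalue)
```

In both cases the p-value is small, so the hypothesis of a standard normal distribution is rejected; with 10000 points the same gap is simply far stronger evidence than with 4 points.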

Upvotes: 0

j1-lee

Reputation: 13929

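The null hypothesis of the Kolmogorov-Smirnov test, when called with 'norm' and no extra arguments, is that the sample comes from the standard normal distribution (mean 0, standard deviation 1). So a p-value near zero rejects normality; it does not confirm it.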

from scipy import stats
import random

print(stats.kstest([1] * 1000, 'norm').pvalue) # 0.0
print(stats.kstest([random.gauss(0, 1) for _ in range(1000)], 'norm').pvalue) # e.g. 0.7275173462861986 (varies from run to run)

You can see that the constant sample leads to a p-value of zero, strongly suggesting that it is not normal. On the other hand, the normal sample leads to a large p-value, so the test (correctly) fails to reject the hypothesis that the sample is from a normal distribution.

The same applies to your case. All the samples in question show p-values near zero, indicating that they are not from the standard normal distribution. So stats.kstest is not broken, in my opinion.
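One caveat that may matter for real data: since 'norm' without arguments means mean 0 and standard deviation 1, a sample that is normal with a different mean or spread is also rejected. The sketch below uses made-up data (mean 50, standard deviation 10, chosen only for illustration) and passes estimated parameters through the args parameter of stats.kstest. Note that estimating the parameters from the same sample makes the resulting p-value only approximate (it tends to be too large), so treat this as a rough check rather than a definitive test.

```python
import numpy as np
from scipy import stats

# Made-up normal data with a non-standard mean and spread (illustrative only).
rng = np.random.default_rng(0)
sample = rng.normal(loc=50.0, scale=10.0, size=1000)

# Compared against the standard normal N(0, 1): rejected, although the data are normal.
p_standard = stats.kstest(sample, 'norm').pvalue
print(p_standard)

# Compared against a normal with the sample's own mean and standard deviation.
mean, std = sample.mean(), sample.std(ddof=1)
p_fitted = stats.kstest(sample, 'norm', args=(mean, std)).pvalue
print(p_fitted)
```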

Upvotes: 1
