Gabriel

Reputation: 160

p_value is 0 when I use scipy.stats.kstest() for large dataset

I have a series of unique values with their frequencies and want to know whether they come from a normal distribution, so I ran a Kolmogorov–Smirnov test using scipy.stats.kstest. Since, to my knowledge, the function only takes a list, I expand the frequencies into a list before passing it to the function. However, the result looks wrong: pvalue=0.0.

The histogram of the original data and my code are shown below.

[Figure: histogram of my dataset]

[In]: frequencies = mp[['c','v']]

[In]: print frequencies
         c      v
31  3475.8   18.0
30  3475.6   12.0
29  3475.4   13.0
28  3475.2    8.0
20  3475.0   49.0
14  3474.8   69.0
13  3474.6   79.0
12  3474.4   78.0
11  3474.2   78.0
7   3474.0  151.0
6   3473.8  157.0
5   3473.6  129.0
2   3473.4  149.0
1   3473.2  162.0
0   3473.0  179.0
3   3472.8  145.0
4   3472.6  139.0
8   3472.4   95.0
9   3472.2  103.0
10  3472.0  125.0
15  3471.8   56.0
16  3471.6   75.0
17  3471.4   70.0
18  3471.2   70.0
19  3471.0   57.0
21  3470.8   36.0
22  3470.6   22.0
23  3470.4   20.0
24  3470.2   12.0
25  3470.0   23.0
26  3469.8   13.0
27  3469.6   17.0
32  3469.4    6.0

[In]: testData = map(lambda x: np.repeat(x[0], int(x[1])), frequencies.values)

[In]: testData = list(itertools.chain.from_iterable(testData))

[In]: print len(testData)
2415

[In]: print np.unique(testData)
[ 3469.4  3469.6  3469.8  3470.   3470.2  3470.4  3470.6  3470.8  3471.
  3471.2  3471.4  3471.6  3471.8  3472.   3472.2  3472.4  3472.6  3472.8
  3473.   3473.2  3473.4  3473.6  3473.8  3474.   3474.2  3474.4  3474.6
  3474.8  3475.   3475.2  3475.4  3475.6  3475.8]

[In]: scs.kstest(testData, 'norm')
KstestResult(statistic=1.0, pvalue=0.0)

Thanks in advance, everyone.

Upvotes: 6

Views: 11485

Answers (1)

James

Reputation: 36608

Passing 'norm' as the second argument tests whether your data follow scipy.stats.norm with its default parameters, loc=0 and scale=1, i.e. the standard normal distribution.
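To see why the statistic comes out as exactly 1.0, here is a small sketch that uses only the smallest and largest values from the question. The standard normal cdf is already 1.0 (in double precision) at every data point, while the empirical cdf is still 0 just below the smallest point, so the largest gap between the two is 1.

```python
from scipy.stats import norm

# Smallest and largest values in the question's data
smallest, largest = 3469.4, 3475.8

# Standard normal cdf (loc=0, scale=1) evaluated at the extremes of the data
cdf_at_smallest = norm.cdf(smallest)
cdf_at_largest = norm.cdf(largest)

print(cdf_at_smallest, cdf_at_largest)  # both are 1.0 in double precision
```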

Instead, you will need to fit a normal distribution to your data and then check if the data and the distribution are the same using the Kolmogorov–Smirnov test.

import numpy as np
from scipy.stats import norm, kstest
import matplotlib.pyplot as plt

freqs = [[3475.8, 18.0], [3475.6, 12.0], [3475.4, 13.0], [3475.2, 8.0], [3475.0, 49.0],
    [3474.8, 69.0], [3474.6, 79.0], [3474.4, 78.0], [3474.2, 78.0], [3474.0, 151.0],
    [3473.8, 157.0], [3473.6, 129.0], [3473.4, 149.0], [3473.2, 162.0], [3473.0, 179.0],
    [3472.8, 145.0], [3472.6, 139.0], [3472.4, 95.0], [3472.2, 103.0], [3472.0, 125.0],
    [3471.8, 56.0], [3471.6, 75.0], [3471.4, 70.0], [3471.2, 70.0], [3471.0, 57.0],
    [3470.8, 36.0], [3470.6, 22.0], [3470.4, 20.0], [3470.2, 12.0], [3470.0, 23.0],
    [3469.8, 13.0], [3469.6, 17.0], [3469.4, 6.0]]

data = np.hstack([np.repeat(x,int(f)) for x,f in freqs])
loc, scale = norm.fit(data)
# create a normal distribution with loc and scale
n = norm(loc=loc, scale=scale)
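As an aside, kstest also accepts the distribution name together with an args tuple, which should be equivalent to passing the cdf of the frozen distribution. A minimal sketch, assuming a synthetic sample (the seed, sample size, and parameters are made up for illustration and are not the data from the question):

```python
import numpy as np
from scipy.stats import norm, kstest

# Hypothetical sample standing in for the expanded data
rng = np.random.default_rng(0)
sample = rng.normal(loc=3473.0, scale=1.5, size=500)

# Maximum-likelihood estimates of the mean and standard deviation
loc, scale = norm.fit(sample)

# Option 1: pass the cdf of a frozen distribution
res_frozen = kstest(sample, norm(loc=loc, scale=scale).cdf)

# Option 2: pass the distribution name plus its parameters via args
res_named = kstest(sample, 'norm', args=(loc, scale))

print(res_frozen.statistic, res_named.statistic)
```

Either form avoids the comparison against the default loc=0, scale=1.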

Plot the fitted normal distribution against the data:

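# histogram with one bin per 0.2-wide step in the data
plt.hist(data, bins=np.arange(data.min(), data.max()+0.2, 0.2), rwidth=0.5)
x = np.arange(data.min(), data.max()+0.2, 0.2)
# 350 is a scale factor chosen to bring the pdf (a density) onto the histogram's count axis
plt.plot(x, 350*n.pdf(x))
plt.show()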


This is not a terribly good fit, mostly due to the long tail on the left. However, you can now run a proper Kolmogorov–Smirnov test using the cdf of the fitted normal distribution:

kstest(data, n.cdf)
# returns:
KstestResult(statistic=0.071276854859734784, pvalue=4.0967451653273201e-11)

So we still reject the null hypothesis that the data were drawn from the fitted distribution.
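For intuition about what the reported statistic measures, the following sketch recomputes it by hand as the largest vertical gap between the empirical cdf and the fitted cdf. It assumes a synthetic sample (seed, size, and parameters are hypothetical, not the question's data), and the variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm, kstest

# Hypothetical sample, sorted so the empirical cdf steps are in order
rng = np.random.default_rng(0)
sample = np.sort(rng.normal(loc=3473.0, scale=1.5, size=500))
n_obs = len(sample)

# Fitted normal cdf evaluated at each sorted observation
loc, scale = norm.fit(sample)
cdf_vals = norm.cdf(sample, loc=loc, scale=scale)

# Gap just after each step of the empirical cdf (ECDF above the fitted cdf)
d_plus = np.max(np.arange(1, n_obs + 1) / n_obs - cdf_vals)
# Gap just before each step of the empirical cdf (fitted cdf above the ECDF)
d_minus = np.max(cdf_vals - np.arange(0, n_obs) / n_obs)

d_manual = max(d_plus, d_minus)
d_scipy = kstest(sample, norm(loc=loc, scale=scale).cdf).statistic

print(d_manual, d_scipy)
```

The two printed values should agree, which shows that the statistic is a maximum distance between two cumulative curves and does not depend on how the histogram is binned.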

Upvotes: 9
