user2626148

Using Scipy's stats.kstest module for goodness-of-fit testing

I've read through existing posts about this module (and the Scipy docs), but it's still not clear to me how to use Scipy's kstest module to do a goodness-of-fit test when you have a data set and a callable function.

The PDF I want to test my data against isn't one of the standard scipy.stats distributions, so I can't just call it using something like:

kstest(mydata,'norm')

where mydata is a Numpy array. Instead, I want to do something like:

kstest(mydata,myfunc)

where 'myfunc' is the callable function. This doesn't work, which is unsurprising, since there's no way for kstest to know what the abscissa for the 'mydata' array is in order to generate the corresponding theoretical frequencies using 'myfunc'. Suppose the frequencies in 'mydata' correspond to the values of the random variable in the array 'abscissa'. Then I thought maybe I could use stats.ks_2samp:

ks_2samp(mydata,myfunc(abscissa))

but I don't know if that's statistically valid. (Sidenote: do kstest and ks_2samp expect frequency arrays to be normalized to one, or do they want the absolute frequencies?)

In any case, since the one-sample KS test is supposed to be used for goodness-of-fit testing, I have to assume there's some way to do it with kstest directly. How do you do this?

Upvotes: 12

Views: 18463

Answers (2)

Jaime

Reputation: 67507

Some examples may shed some light on how to use scipy.stats.kstest. Let's first set up some test data, e.g. normally distributed with mean 5 and standard deviation 10:

>>> import scipy.stats
>>> data = scipy.stats.norm.rvs(loc=5, scale=10, size=(1000,))

To run kstest on these data we need a function f(x) that takes an array of quantiles, and returns the corresponding value of the cumulative distribution function. If we reuse the cdf function of scipy.stats.norm we could do:

>>> scipy.stats.kstest(data, lambda x: scipy.stats.norm.cdf(x, loc=5, scale=10))
(0.019340993719575206, 0.84853828416694665)

The above would normally be run with the more convenient form:

>>> scipy.stats.kstest(data, 'norm', args=(5, 10))
(0.019340993719575206, 0.84853828416694665)
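To check that the two forms really are interchangeable, here is a minimal sketch that runs both on the same sample and compares the results. The seed and the variable names are assumptions made for illustration, so the numbers will differ from those printed above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
data = rng.normal(loc=5, scale=10, size=1000)

# Callable form: kstest evaluates the function at the sample values.
res_callable = stats.kstest(data, lambda x: stats.norm.cdf(x, loc=5, scale=10))

# String form: args is forwarded to scipy.stats.norm.cdf as (loc, scale).
res_named = stats.kstest(data, 'norm', args=(5, 10))

# The result unpacks like a (statistic, p-value) tuple.
statistic, pvalue = res_callable
print(statistic, pvalue)
print(np.isclose(statistic, res_named[0]), np.isclose(pvalue, res_named[1]))
```

Both calls should agree, since the string form just looks up the named distribution's cdf and passes args to it.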

If we have uniformly distributed data, it is easy to build the cdf by hand:

>>> import numpy as np
>>> data = np.random.rand(1000)
>>> scipy.stats.kstest(data, lambda x: x)
(0.019145675289412523, 0.85699937276355065)
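For a distribution that is only available as a PDF, one option is to integrate the PDF numerically to get the CDF and pass that to kstest. Note that kstest expects the raw observations, not binned frequencies. The sketch below is only an illustration: my_pdf and my_cdf are hypothetical names, and the density f(x) = 2x on [0, 1] is an assumed stand-in for whatever PDF you actually have.

```python
import numpy as np
from scipy import integrate, stats

def my_pdf(x):
    """Hypothetical density for illustration: f(x) = 2x on [0, 1], 0 elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

def my_cdf(x):
    """CDF obtained by integrating my_pdf from the lower support bound up to x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    upper = np.clip(x, 0.0, 1.0)  # the density is zero outside [0, 1]
    return np.array([integrate.quad(my_pdf, 0.0, u)[0] for u in upper])

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
# Inverse-transform sampling: the exact CDF is x**2, so sqrt(U) follows my_pdf.
data = np.sqrt(rng.random(500))

# kstest receives the raw sample and a callable that returns CDF values.
result = stats.kstest(data, my_cdf)
print(result)
```

Calling quad once per data point is slow for large samples. If the CDF has a closed form (here x**2), passing that directly is simpler and faster.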

Upvotes: 19

kiriloff

Reputation: 26333

As for ks_2samp, it tests the null hypothesis that both samples are drawn from the same probability distribution.

You can do, for example:

>>> from scipy.stats import ks_2samp
>>> import numpy as np

where x and y are two NumPy arrays of observations:

>>> ks_2samp(x, y)
(0.022999999999999909, 0.95189016804849658)

The first value is the test statistic and the second value is the p-value. If the p-value is greater than 0.05 (for a significance level of 5%), you cannot reject the null hypothesis that the two samples come from the same distribution. Here the p-value is about 0.95, so the null hypothesis is not rejected.
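A minimal runnable sketch of the decision rule at a 5% significance level follows. The samples x, y and z are made up for illustration (normal draws with an arbitrary seed), so the printed values will not match the output above.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
x = rng.normal(loc=0.0, scale=1.0, size=1000)
y = rng.normal(loc=0.0, scale=1.0, size=1000)  # same distribution as x
z = rng.normal(loc=1.0, scale=1.0, size=1000)  # shifted by one standard deviation

alpha = 0.05  # significance level
for name, other in (("y", y), ("z", z)):
    statistic, pvalue = ks_2samp(x, other)
    # Reject H0 (same distribution) only when the p-value falls below alpha.
    decision = "reject" if pvalue < alpha else "cannot reject"
    print(f"x vs {name}: D={statistic:.3f}, p={pvalue:.3g} -> {decision} H0")
```

Both arguments are raw samples, so no normalization of frequencies is involved.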

Upvotes: 4
