Using Scipy's stats.kstest module for goodness-of-fit testing

Question

I've read through existing posts about this module (and the Scipy docs), but it's still not clear to me how to use Scipy's kstest module to do a goodness-of-fit test when you have a data set and a callable function.

The PDF I want to test my data against isn't one of the standard scipy.stats distributions, so I can't just call it using something like:

kstest(mydata,'norm')

where mydata is a Numpy array. Instead, I want to do something like:

kstest(mydata,myfunc)

where 'myfunc' is the callable function. This doesn't work, which is unsurprising, since there's no way for kstest to know what the abscissa for the 'mydata' array is in order to generate the corresponding theoretical frequencies using 'myfunc'. Suppose the frequencies in 'mydata' correspond to the values of the random variable in the array 'abscissa'. Then I thought maybe I could use stats.ks_2samp:

ks_2samp(mydata,myfunc(abscissa))

but I don't know if that's statistically valid. (Sidenote: do kstest and ks_2samp expect frequency arrays to be normalized to one, or do they want the absolute frequencies?)

In any case, since the one-sample KS test is supposed to be used for goodness-of-fit testing, I have to assume there's some way to do it with kstest directly. How do you do this?

Jaime · Accepted Answer

Some examples may shed some light on how to use scipy.stats.kstest. Let's first set up some test data, e.g. normally distributed with mean 5 and standard deviation 10:

>>> import scipy.stats
>>> data = scipy.stats.norm.rvs(loc=5, scale=10, size=(1000,))

To run kstest on these data we need a function f(x) that takes an array of quantiles, and returns the corresponding value of the cumulative distribution function. If we reuse the cdf function of scipy.stats.norm we could do:

>>> scipy.stats.kstest(data, lambda x: scipy.stats.norm.cdf(x, loc=5, scale=10))
(0.019340993719575206, 0.84853828416694665)

The above would normally be run with the more convenient form:

>>> scipy.stats.kstest(data, 'norm', args=(5, 10))
(0.019340993719575206, 0.84853828416694665)
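
As a self-contained sketch of the two calls above, the following script runs both forms on the same sample and unpacks the result into the test statistic and the p-value. The seeded generator and the variable names are my own choices for illustration; the exact numbers will differ from the session above because the sample is different.

```python
import numpy as np
import scipy.stats

# Seeded only so that the sketch is repeatable; any sample would do.
rng = np.random.default_rng(0)
data = rng.normal(loc=5, scale=10, size=1000)

# Callable form: any function mapping an array of x values to CDF values.
d_callable, p_callable = scipy.stats.kstest(
    data, lambda x: scipy.stats.norm.cdf(x, loc=5, scale=10)
)

# String form: the distribution name, with its parameters passed in args.
d_named, p_named = scipy.stats.kstest(data, "norm", args=(5, 10))

print(d_callable, p_callable)
print(d_named, p_named)
```

Both forms evaluate the same CDF at the same points, so they should report the same statistic and p-value. Note that `data` holds the raw observations, not binned frequencies.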

If we have uniformly distributed data, it is easy to build the cdf by hand:

>>> import numpy as np
>>> data = np.random.rand(1000)
>>> scipy.stats.kstest(data, lambda x: x)
(0.019145675289412523, 0.85699937276355065)
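
If all you have is a PDF, one possible approach is to integrate it numerically to get the CDF that kstest needs. The sketch below assumes a made-up density, f(x) = 2x on [0, 1], standing in for your own function; `my_pdf`, `my_cdf` and the sampling step are hypothetical names and choices for illustration, not part of SciPy.

```python
import numpy as np
import scipy.integrate
import scipy.stats


def my_pdf(x):
    """Hypothetical custom density for a scalar x: 2x on [0, 1], else 0."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0


def my_cdf(x):
    """CDF built by integrating my_pdf from the lower end of its support.

    kstest passes an array of sample values, so integrate once per element.
    """
    x = np.atleast_1d(np.clip(x, 0.0, 1.0))
    return np.array([scipy.integrate.quad(my_pdf, 0.0, xi)[0] for xi in x])


# Draw raw samples from the same density by inverse-transform sampling:
# the exact CDF is x**2, so x = sqrt(u) with u uniform on [0, 1).
rng = np.random.default_rng(0)
samples = np.sqrt(rng.random(500))

statistic, pvalue = scipy.stats.kstest(samples, my_cdf)
print(statistic, pvalue)
```

The first argument is still the array of raw observations; kstest builds the empirical CDF from them itself, so no abscissa or frequency array is involved. For a large sample, calling quad once per point gets slow, and a closed-form CDF (where one exists) is the better choice.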
