Reputation: 2051
I'm using a certain StatsModels distribution (Azzalini's Skew Student-t) and I'd like to perform a (one-sample) Kolmogorov-Smirnov test with it.
Is it possible to use Scipy's kstest with a StatsModels distribution? Scipy's documentation (rather vaguely) says that the cdf argument may be a string or a callable, with no further details or examples about the latter. On the other hand, the StatsModels distribution I'm using has many of the same methods that Scipy distributions do; I'm therefore supposing there is some way to pass it as the callable argument to kstest. Am I wrong?
Here is what I have so far. What I'd like to achieve is commented out in the last line:
import statsmodels.sandbox.distributions.extras as azt
import scipy.stats as stats
x = [-0.2833379 , -3.05224565, 0.13236267, -0.24549146, -1.75106484,
     0.95375723, 0.28628686, 0. , -3.82529261, -0.26714159,
     1.07142857, 2.56183746, -1.89491817, -0.3414301 , 1.11589663,
     -0.74540174, -0.60470106, -1.93307821, 1.56093656, 1.28078818]
# This is how kstest works.
print(stats.kstest(x, stats.norm.cdf))  # (0.21003262911224113, 0.29814145956367311)
# This is the StatsModels distribution I'm using. It has a cdf method as well.
ast = azt.ACSkewT_gen()
# This is what I'd want. Executing this raises a TypeError because ast.cdf
# needs some shape parameters etc.
# print(stats.kstest(x, ast.cdf))
Note: I'll happily use two-sample KS test if what I'm expecting is not possible. Just wanted to know if this is possible.
Upvotes: 1
Views: 683
Reputation: 22897
Those functions were written a long time ago with scipy compatibility in mind, but there have been several changes in scipy in the meantime.
kstest has an args keyword for the distribution parameters. To get the distribution parameters, we can try to estimate them with the fit method of the scipy.stats distributions. However, estimating all parameters prints some warnings and the estimated df parameter is large. If we fix df at specific values, we get estimates without warnings that we can use in the call of kstest.
>>> ast.fit(x)
C:\programs\WinPython-64bit-3.4.3.1\python-3.4.3.amd64\lib\site-packages\scipy\integrate\quadpack.py:352: IntegrationWarning: The maximum number of subdivisions (50) has been achieved.
If increasing the limit yields no improvement it is advised to analyze
the integrand in order to determine the difficulties. If the position of a
local difficulty can be determined (singularity, discontinuity) one will
probably gain from splitting up the interval and calling the integrator
on the subranges. Perhaps a special-purpose integrator should be used.
warnings.warn(msg, IntegrationWarning)
C:\programs\WinPython-64bit-3.4.3.1\python-3.4.3.amd64\lib\site-packages\scipy\integrate\quadpack.py:352: IntegrationWarning: The integral is probably divergent, or slowly convergent.
warnings.warn(msg, IntegrationWarning)
(31834.800527154337, -2.3475921468088172, 1.3720725621594987, 2.2766515091760722)
>>> p = ast.fit(x, f0=100)
>>> print(stats.kstest(x, ast.cdf, args=p))
(0.13897385693057401, 0.83458552699682509)
>>> p = ast.fit(x, f0=5)
>>> print(stats.kstest(x, ast.cdf, args=p))
(0.097960232618178544, 0.990756154198281)
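Since kstest also accepts an arbitrary callable for the cdf argument (the point of the original question), an equivalent way to pass the estimated parameters is to wrap ast.cdf in a lambda instead of using args; this gives the same result as the call above:
>>> print(stats.kstest(x, lambda v: ast.cdf(v, *p)))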
However, the distribution for the Kolmogorov-Smirnov test assumes that the distribution parameters are fixed and not estimated. If we estimate the parameters as above, then the p-value will not be correct since it is not based on the correct distribution.
For some distributions we can use tables for the kstest with an estimated mean and scale parameter, e.g. the Lilliefors test (kstest_normal in statsmodels). If we have estimated shape parameters, then the distribution of the KS test statistic depends on the parameters of the model, and we could get the p-value from bootstrapping.
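For illustration, here is a minimal sketch of such a parametric bootstrap (not part of the original answer): fit the distribution with df fixed as above, simulate samples of the same size from the fitted distribution, refit each simulated sample, and compare the observed KS statistic with the bootstrap distribution of the statistic. The helper name ks_pvalue_bootstrap and the defaults (n_boot, f0=5) are my assumptions, and refitting the SkewT in every replication can be slow.
import numpy as np
import scipy.stats as stats

def ks_pvalue_bootstrap(data, dist, n_boot=500, f0=5, seed=0):
    # Parametric bootstrap p-value for a KS test with estimated parameters.
    # Sketch only: df is fixed at f0 when fitting, as in the example above.
    np.random.seed(seed)
    data = np.asarray(data)
    n = len(data)

    # Fit on the observed data and compute the observed KS statistic.
    p_hat = dist.fit(data, f0=f0)
    d_obs = stats.kstest(data, dist.cdf, args=p_hat)[0]

    d_boot = np.empty(n_boot)
    for i in range(n_boot):
        # Simulate from the fitted distribution, refit, recompute the statistic.
        xb = dist.rvs(*p_hat, size=n)
        p_b = dist.fit(xb, f0=f0)
        d_boot[i] = stats.kstest(xb, dist.cdf, args=p_b)[0]

    # p-value: fraction of bootstrap statistics at least as large as the observed one.
    return (d_boot >= d_obs).mean()

# e.g. ks_pvalue_bootstrap(x, ast, n_boot=200)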
(I don't remember anything about estimating the parameters of the SkewT distribution and whether maximum likelihood estimation has any specific problems.)
Upvotes: 0