newbie

Reputation: 21

How to implement a KS-Test in Python

scipy.stats.kstest(rvs, cdf, N) can perform a KS test on a dataset rvs. It tests whether the dataset follows a probability distribution whose CDF is specified in the parameters of this method.

Consider now a dataset of N=4800 samples. I have performed a KDE on this data and therefore have an estimated PDF. This PDF looks an awful lot like a bimodal distribution. When I plot the estimated PDF and curve_fit a bimodal distribution to it, the two plots are nearly identical. The parameters of the fitted bimodal distribution are (scale1, mean1, stdv1, scale2, mean2, stdv2): [0.6, 0.036, 0.52, 0.23, 1.25, 0.4]

How can I apply scipy.stats.kstest to test whether my estimated PDF follows this bimodal distribution? As my null hypothesis, I state that the estimated PDF equals the following PDF:

hypoDist = 0.6*norm(loc=0, scale=0.2).pdf(x_grid) + 0.3*norm(loc=1, scale=0.2).pdf(x_grid)
hypoCdf = np.cumsum(hypoDist)/len(x_grid)

x_grid is just a vector that contains the x-values at which I evaluate my estimated PDF, so each entry of pdf has a corresponding value in x_grid. It might be that my computation of hypoCdf is incorrect: maybe instead of dividing by len(x_grid), I should divide by np.sum(hypoDist)?
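
As an aside, a minimal sketch of the normalization in question (assuming x_grid is evenly spaced; the linspace grid below is only illustrative):

import numpy as np
from scipy.stats import norm

x_grid = np.linspace(-2, 3, 1000)   # hypothetical grid
dx = x_grid[1] - x_grid[0]          # grid spacing

hypoDist = 0.6*norm(loc=0, scale=0.2).pdf(x_grid) + 0.3*norm(loc=1, scale=0.2).pdf(x_grid)

# Riemann-sum approximation of the integral of the PDF:
hypoCdf = np.cumsum(hypoDist)*dx

# The weights 0.6 and 0.3 do not sum to 1, so the curve tops out near 0.9;
# dividing by the last value forces a proper CDF that ends at 1:
hypoCdf = hypoCdf/hypoCdf[-1]

Since hypoCdf[-1] equals np.sum(hypoDist)*dx, this is equivalent to dividing the cumulative sum by np.sum(hypoDist); dividing by len(x_grid) is not correct in general.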

Challenge: the cdf parameter of kstest cannot be specified as "bimodal". Nor can I pass it hypoDist.

If I wanted to test whether my dataset was Gaussian distributed, I would write:

KS_result = kstest(measurementError, norm(loc=np.mean(pdf), scale=np.std(pdf)).cdf)
print(KS_result)

measurementError is the dataset on which I performed the KDE. This returns: statistic=0.459, pvalue=0.0. To me, it is a little irritating that the p-value is 0.0.

Upvotes: 2

Views: 5810

Answers (1)

Warren Weckesser

Reputation: 114976

The cdf argument to kstest can be a callable that implements the cumulative distribution function of the distribution against which you want to test your data. To use it, you have to implement the CDF of your bimodal distribution. You want the distribution to be a mixture of two normal distributions. You can implement the CDF for this distribution by computing the weighted sum of the CDFs of the two normal distributions that make up the mixture.

Here's a script that shows how you can do this. To demonstrate how kstest is used, the script runs kstest twice. First it uses a sample that is not from the distribution. As expected, kstest computes a very small p-value for this first sample. It then generates a sample that is drawn from the mixture. For this sample, the p-value is not small.

import numpy as np
from scipy import stats


def bimodal_cdf(x, weight1, mean1, stdv1, mean2, stdv2):
    """
    CDF of a mixture of two normal distributions.
    """
    return (weight1*stats.norm.cdf(x, mean1, stdv1) +
            (1 - weight1)*stats.norm.cdf(x, mean2, stdv2))


# We only need weight1, since weight2 = 1 - weight1.
weight1 = 0.6
mean1 = 0.036
stdv1 = 0.52
mean2 = 1.25
stdv2 = 0.4

n = 200

# Create a sample from a regular normal distribution that has parameters
# similar to the bimodal distribution.
sample1 = stats.norm.rvs(0.5*(mean1 + mean2), 0.5, size=n)

# The result of kstest should show that sample1 is not from the bimodal
# distribution (i.e. the p-value should be very small).
stat1, pvalue1 = stats.kstest(sample1, cdf=bimodal_cdf,
                              args=(weight1, mean1, stdv1, mean2, stdv2))
print("sample1 p-value =", pvalue1)

# Create a sample from the bimodal distribution.  This sample is the
# concatenation of samples from the two normal distributions that make
# up the bimodal distribution.  The number of samples to take from the
# first distribution is determined by a binomial distribution of n
# samples with probability weight1.
n1 = np.random.binomial(n, p=weight1)
sample2 = np.concatenate((stats.norm.rvs(mean1, stdv1, size=n1),
                          stats.norm.rvs(mean2, stdv2, size=n - n1)))

# Most of the time, the p-value returned by kstest with sample2 will not
# be small.  We expect the value to be uniformly distributed in the interval
# [0, 1], so in general it will not be very small.
stat2, pvalue2 = stats.kstest(sample2, cdf=bimodal_cdf,
                              args=(weight1, mean1, stdv1, mean2, stdv2))
print("sample2 p-value =", pvalue2)

Typical output (the numbers will be different each time the script is run):

sample1 p-value = 2.8395166853884146e-11
sample2 p-value = 0.3289374831186403

You might find that this test does not work well for your problem. You have 4800 samples, but the parameters in your code are given to only one or two significant digits. Unless you have good reason to believe that your sample is drawn from a distribution with exactly those parameters, it is likely that kstest will return a very small p-value.
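
If those parameters were in fact estimated from the same 4800 samples, the plain KS p-value is no longer exact even when the model family is right. One common workaround, shown here only as a sketch (it is not part of the answer above, and sample_mixture and params are illustrative names), is a parametric bootstrap: compare the observed KS statistic with statistics from samples drawn from the fitted mixture.

import numpy as np
from scipy import stats


def bimodal_cdf(x, weight1, mean1, stdv1, mean2, stdv2):
    # Same mixture CDF as in the script above.
    return (weight1*stats.norm.cdf(x, mean1, stdv1) +
            (1 - weight1)*stats.norm.cdf(x, mean2, stdv2))


def sample_mixture(n, weight1, mean1, stdv1, mean2, stdv2, rng):
    # The number of draws from the first component is binomial(n, weight1).
    n1 = rng.binomial(n, p=weight1)
    return np.concatenate((rng.normal(mean1, stdv1, size=n1),
                           rng.normal(mean2, stdv2, size=n - n1)))


rng = np.random.default_rng()
params = (0.6, 0.036, 0.52, 1.25, 0.4)
n = 200

# Stand-in for the observed data; in practice use the real sample.
data = sample_mixture(n, *params, rng)
obs_stat, _ = stats.kstest(data, cdf=bimodal_cdf, args=params)

# Null distribution of the KS statistic under the fitted mixture.
# (A fuller treatment would re-fit the parameters to each resample.)
boot_stats = [stats.kstest(sample_mixture(n, *params, rng),
                           cdf=bimodal_cdf, args=params)[0]
              for _ in range(500)]
p_boot = np.mean([s >= obs_stat for s in boot_stats])
print("bootstrap p-value =", p_boot)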

Upvotes: 3
