Lunar Cultist

Reputation: 81

Why does the KS test give a p-value of nearly 1 if the distributions are different?

Let's take two sets:

a = [5,5,5,5,5,4,4,4,4,3,3,3,2,2,1]
b = [5,4,3,2,1]

We perform the KS-Test using Python:

from scipy import stats
stats.ks_2samp(b,a)
KstestResult(statistic=0.2, pvalue=0.9979360165118678, statistic_location=2, statistic_sign=1)

Why is the p-value 0.9979? As I read it, this means the values in the two sets are distributed almost identically. But they're not! What am I misunderstanding?

Kind regards.

Upvotes: 3

Views: 1387

Answers (2)

Robert Dodier

Reputation: 17585

The observed value of the KS test statistic, namely 0.2, is actually relatively small, considering the distribution of the test statistic for a reasonable null hypothesis; I think this is where the surprise is coming from.
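To see where 0.2 comes from, here is a minimal Python sketch that rebuilds the statistic by hand from the two empirical CDFs. The helper name ecdf is my own, and exact fractions are used only to sidestep rounding; this is an illustration, not how scipy computes it internally.

```python
from fractions import Fraction

a = [5, 5, 5, 5, 5, 4, 4, 4, 4, 3, 3, 3, 2, 2, 1]
b = [5, 4, 3, 2, 1]

def ecdf(sample, x):
    """Fraction of observations in `sample` that are <= x (hypothetical helper)."""
    return Fraction(sum(v <= x for v in sample), len(sample))

# Evaluate both ECDFs on the pooled set of distinct values.
grid = sorted(set(a) | set(b))
gaps = {x: ecdf(b, x) - ecdf(a, x) for x in grid}

# The two-sided KS statistic is the largest absolute gap between the ECDFs.
ks_statistic = max(abs(g) for g in gaps.values())

for x in grid:
    print(x, ecdf(a, x), ecdf(b, x), gaps[x])
print("KS statistic:", ks_statistic, "=", float(ks_statistic))
```

The largest gap is 1/5, reached at x = 2 and again at x = 3, which agrees with the statistic=0.2 and statistic_location=2 in the question's output.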

As mentioned, the usual KS test assumes there are no ties, so we'll have to compute the p-value ourselves. We can make progress by assuming the null hypothesis is that both samples come from a uniform distribution and estimating the p-value by random sampling. (This null hypothesis is more restrictive than the conventional one, which just assumes both samples come from the same distribution, not necessarily a uniform one.)

Here are a few lines of R code to estimate the p-value. In the interest of brevity, it's specific to the problem as stated: a sample of size 5 and a sample of size 15, each one from a uniform distribution on the set { 1, 2, 3, 4, 5 }, and the observed KS test statistic is 0.2.

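# Empirical CDF of x, evaluated at the support points 1..5.
my.ecdf <- function (x) cumsum (sapply (1:5, function (k) sum (x == k))/length(x))

# Draw n values uniformly from { 1, ..., 5 } with replacement.
R <- function (n) sample.int (5, size = n, replace = TRUE)

# Simulate n KS statistics for sample sizes 15 and 5 under the uniform null.
generate.ks.test.statistic <- function (n) sapply (1:n, function (k) max (abs (my.ecdf (R (15)) - my.ecdf (R (5)))))

ks <- generate.ks.test.statistic (10000)

# Estimated p-value: share of simulated statistics at least as large as the observed 0.2.
sum (ks >= 0.2)/10000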

For this last input, I get 0.8243. That's not as extreme as the value you reported (more than 0.99), but still enough to show that 0.2 is actually relatively small. You can look at hist(ks) to see what the distribution looks like.
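Since the question uses Python, here is a rough NumPy sketch of the same simulation. The seed, the variable names, and the choice to compare in integer units of 1/75 (to avoid floating-point ties around 0.2) are my own assumptions, so the estimate will not match the R figure digit for digit.

```python
import numpy as np

N_SIMS = 10_000
N_A, N_B = 15, 5            # sample sizes from the question
SUPPORT = np.arange(1, 6)   # the values 1..5

rng = np.random.default_rng(12345)  # arbitrary seed, for reproducibility only

def cumulative_counts(samples):
    """For each row, count the observations <= k for k = 1..5."""
    return (samples[:, :, None] <= SUPPORT).sum(axis=1)

# Both samples are drawn from the uniform distribution on {1, ..., 5}.
sim_a = rng.integers(1, 6, size=(N_SIMS, N_A))
sim_b = rng.integers(1, 6, size=(N_SIMS, N_B))

# |F_a - F_b| multiplied by N_A * N_B, so every value is an exact integer.
scaled_gap = np.abs(cumulative_counts(sim_a) * N_B - cumulative_counts(sim_b) * N_A)
scaled_ks = scaled_gap.max(axis=1)

# The observed statistic 0.2 = 1/5 equals 15 in units of 1 / (N_A * N_B) = 1/75.
observed_scaled = (N_A * N_B) // 5
p_estimate = np.mean(scaled_ks >= observed_scaled)
print(p_estimate)
```

Dividing scaled_ks by 75 recovers the statistics on the usual scale, if you want to plot a histogram of them.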

Upvotes: 2

Timur Shtatland

Reputation: 12425

Do not use the Kolmogorov-Smirnov test for data with ties (which you have lots of). Use permutation- or bootstrap-based tests instead.

References:

Instead of using the KS test you could simply use a permutation or resampling procedure as implemented in the oneway_test function of the coin package [in R].

Is there an alternative to the Kolmogorov-Smirnov test for tied data with correction? - Cross Validated


The distribution of the test statistic is based on the assumption that the distributions are continuous (so ties are impossible). The distribution is impacted when there are ties, but in such a way that it depends on the particular pattern of ties. Exact answers aren't generally practical and approximations are required.

(If there aren't a large proportion of ties, it won't make a great deal of difference if they're ignored. If there are, it will heavily affect the significance level.)

Different implementations of Kolmogorov-Smirnov test and ties - Cross Validated
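As a minimal sketch of the permutation idea in Python (the references above point to R's coin package; this is not that implementation), the group labels can be shuffled and the KS statistic recomputed each time. The function name scaled_ks, the seed, and the number of permutations are assumptions of mine, and the KS statistic is only one possible choice of test statistic.

```python
import numpy as np

a = np.array([5, 5, 5, 5, 5, 4, 4, 4, 4, 3, 3, 3, 2, 2, 1])
b = np.array([5, 4, 3, 2, 1])

def scaled_ks(x, y):
    """max |F_x - F_y| times len(x) * len(y), kept as an exact integer."""
    grid = np.union1d(x, y)
    cx = (x[:, None] <= grid).sum(axis=0)  # cumulative counts of x on the grid
    cy = (y[:, None] <= grid).sum(axis=0)  # cumulative counts of y on the grid
    return np.abs(cx * len(y) - cy * len(x)).max()

rng = np.random.default_rng(2024)  # arbitrary seed, for reproducibility only
pooled = np.concatenate([a, b])
observed = scaled_ks(a, b)

n_perm = 10_000
count = 0
for _ in range(n_perm):
    # Reassign the pooled values to two groups of the original sizes.
    shuffled = rng.permutation(pooled)
    if scaled_ks(shuffled[:len(a)], shuffled[len(a):]) >= observed:
        count += 1

# Add-one correction so the estimated p-value is never exactly zero.
p_value = (count + 1) / (n_perm + 1)
print(observed / (len(a) * len(b)), p_value)
```

Because the reference distribution is built from the pooled data itself, the ties are accounted for automatically rather than assumed away.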

Upvotes: 1
