Reputation: 81
Let's take two samples:
a = [5,5,5,5,5,4,4,4,4,3,3,3,2,2,1]
b = [5,4,3,2,1]
We perform the KS-Test using Python:
from scipy import stats
stats.ks_2samp(b,a)
KstestResult(statistic=0.2, pvalue=0.9979360165118678, statistic_location=2, statistic_sign=1)
Why is the result a p-value of 0.9979? That would mean the distributions of the values in the two samples are almost identical. But they're not! What do I misunderstand?
Kind regards.
Upvotes: 3
Views: 1387
Reputation: 17585
The observed value of the KS test statistic, namely 0.2, is actually relatively small, considering the distribution of the test statistic for a reasonable null hypothesis; I think this is where the surprise is coming from.
As mentioned, the usual KS test assumes there are no ties, so we'll have to compute the p-value ourselves. We can make progress by assuming the null hypothesis is that both samples come from a uniform distribution and estimating the p-value by random sampling. (This null hypothesis is more restrictive than the conventional one, which just assumes the same distribution, not necessarily uniform.)
Here are a few lines of R code to estimate the p-value. In the interest of brevity, it's specific to the problem as stated: a sample of size 5 and a sample of size 15, each one from a uniform distribution on the set { 1, 2, 3, 4, 5 }, and the observed KS test statistic is 0.2.
# Empirical CDF of x evaluated at the support points 1..5
my.ecdf <- function (x) cumsum (sapply (1:5, function (k) sum (x == k))/length(x))
# Draw n values uniformly at random from { 1, 2, 3, 4, 5 }
R <- function (n) sample.int (5, size = n, replace = T)
# Simulate n KS statistics for samples of size 15 and 5 under the null
generate.ks.test.statistic <- function (n) sapply (1:n, function (k) max (abs (my.ecdf (R (15)) - my.ecdf (R (5)))))
ks <- generate.ks.test.statistic (10000)
# Proportion of simulated statistics at least as large as the observed 0.2
sum (ks >= 0.2)/10000
For this last input, I get 0.8243. That's not as extreme as the value you reported (more than 0.99), but still large enough to show that 0.2 is actually relatively small. You can look at hist(ks) to see what the distribution looks like.
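If you'd rather stay in Python (as in the question), here is a minimal sketch of the same simulation; it assumes the same null hypothesis of a discrete uniform distribution on { 1, 2, 3, 4, 5 }, sample sizes 15 and 5, and the observed statistic 0.2. The helper names ecdf_on_support and ks_statistic_under_null are just illustrative.
import numpy as np

rng = np.random.default_rng(0)

def ecdf_on_support(x):
    # Empirical CDF of x evaluated at the support points 1..5
    x = np.asarray(x)
    return np.array([(x <= k).mean() for k in range(1, 6)])

def ks_statistic_under_null():
    # One simulated KS statistic for samples of size 15 and 5 drawn uniformly from {1,...,5}
    x = rng.integers(1, 6, size=15)
    y = rng.integers(1, 6, size=5)
    return np.max(np.abs(ecdf_on_support(x) - ecdf_on_support(y)))

ks = np.array([ks_statistic_under_null() for _ in range(10000)])
print((ks >= 0.2).mean())  # estimated p-value
With 10,000 replications this should land in the same ballpark as the R estimate above.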
Upvotes: 2
Reputation: 12425
Do not use the Kolmogorov-Smirnov test on data with ties (and your data contain many). Use permutation- or bootstrap-based tests instead.
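To make that concrete, here is a minimal sketch (not part of the references below) of a permutation test built directly on the KS statistic, using the samples from the question. Note that the statistic permuted here is the KS statistic itself, not the one used by coin::oneway_test, and the helper name ks_stat and the replication count are just illustrative choices.
import numpy as np
from scipy import stats

a = [5, 5, 5, 5, 5, 4, 4, 4, 4, 3, 3, 3, 2, 2, 1]
b = [5, 4, 3, 2, 1]

def ks_stat(x, y):
    # Two-sample KS statistic; the p-value returned by ks_2samp is ignored here.
    return stats.ks_2samp(x, y).statistic

rng = np.random.default_rng(0)
pooled = np.array(a + b)
observed = ks_stat(a, b)

n_perm = 10000
exceed = 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    # Randomly reassign group labels while keeping the original sample sizes 15 and 5.
    if ks_stat(pooled[:len(a)], pooled[len(a):]) >= observed:
        exceed += 1

print("permutation p-value:", exceed / n_perm)
Because the permutation distribution is built from the pooled data themselves, ties cause no problem and no continuity assumption is needed; recent SciPy versions also provide stats.permutation_test, which wraps this pattern.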
References:
Instead of using the KS test you could simply use a permutation or resampling procedure as implemented in the oneway_test function of the coin package [in R].
The distribution of the test statistic is based on the assumption that the distributions are continuous (so ties are impossible). The distribution is impacted when there are ties, but in such a way that it depends on the particular pattern of ties. Exact answers aren't generally practical and approximations are required.
(If there aren't a large proportion of ties, it won't make a great deal of difference if they're ignored. If there are, it will heavily affect the significance level.)
Different implementations of Kolmogorov-Smirnov test and ties - Cross Validated
Upvotes: 1