Akavall

Reputation: 86356

Two-sample Kolmogorov-Smirnov Test in Python Scipy

I can't figure out how to do a Two-sample KS test in Scipy.

After reading the documentation for scipy's kstest, I can see how to test whether a sample is drawn from the standard normal distribution:

from scipy.stats import kstest
import numpy as np

x = np.random.normal(0,1,1000)
test_stat = kstest(x, 'norm')
#>>> test_stat
#(0.021080234718821145, 0.76584491300591395)

Which means that, at a p-value of 0.77, we cannot reject the null hypothesis that the sample is drawn from the standard normal distribution.

However, I want to compare two distributions and see if I can reject the null hypothesis that they are identical, something like:

from scipy.stats import kstest
import numpy as np

x = np.random.normal(0,1,1000)
z = np.random.normal(1.1,0.9, 1000)

and test whether x and z are identical.

I tried the naive:

test_stat = kstest(x, z)

and got the following error:

TypeError: 'numpy.ndarray' object is not callable

Is there a way to do a two-sample KS test in Python? If so, how should I do it?

Upvotes: 109

Views: 117444

Answers (3)

cottontail

Reputation: 23459

Since SciPy 1.5.0, scipy.stats.kstest performs a two-sample Kolmogorov-Smirnov test if its second argument is array-like (a NumPy array, Python list, pandas Series, etc.); in other words, the OP's code now produces the expected result. Internally, if a distribution in scipy.stats can be parsed from the second argument, ks_1samp is called; otherwise ks_2samp is called. So kstest encompasses both.

import numpy as np
from scipy import stats

x = np.random.normal(0, 1, 1000)
z = np.random.normal(1.1, 0.9, 1000)

stats.kstest(x, stats.norm.cdf) == stats.ks_1samp(x, stats.norm.cdf)  # True
stats.kstest(x, z) == stats.ks_2samp(x, z)                            # True

How to interpret the result?

The two-sample K-S test asks how likely it would be to see two samples like these if they were drawn from the same distribution. The null hypothesis, then, is that the samples are drawn from the same distribution.

A small p-value means we reject the null hypothesis. In the following example, the p-value is very small, so we reject the null hypothesis and conclude that the two samples x and z are not drawn from the same distribution.

stats.kstest(x, z)
# KstestResult(statistic=0.445, pvalue=1.8083688851037378e-89, statistic_location=0.6493521689945357, statistic_sign=1)

Indeed, if we plot the empirical cumulative distribution functions of x and z, they clearly do not overlap, which supports the result of the K-S test performed above.

[plot: empirical CDFs of x and z]


On the other hand, if the two samples were as follows (both drawn from normal distributions with the same mean and very close standard deviations), the p-value of the K-S test is very large, so we cannot reject the null hypothesis that they are drawn from the same distribution.

x = np.random.normal(0, 1, 1000)
z = np.random.normal(0, 0.9, 1000)

stats.kstest(x, z)
# KstestResult(statistic=0.04, pvalue=0.4006338815832625, statistic_location=0.07673397024028321, statistic_sign=1)

Indeed, if we plot the empirical cumulative distribution functions of these two samples, there's a great deal of overlap, which supports the conclusion of the K-S test.

[plot: empirical CDFs of the second pair of samples]

Note that not rejecting the null does not mean we conclude that the two samples are drawn from the same distribution; it simply means we cannot reject the null given these two samples.


Code to produce the cumulative distribution plots:

import seaborn as sns
import pandas as pd
sns.ecdfplot(pd.DataFrame({'x': x, 'z': z}))

Upvotes: 5

DSM

Reputation: 353569

You are using the one-sample KS test. (In older versions of SciPy, kstest expected a distribution name or callable as its second argument, hence the TypeError when you passed an array.) You probably want the two-sample test ks_2samp:

>>> from scipy.stats import ks_2samp
>>> import numpy as np
>>> 
>>> np.random.seed(12345678)
>>> x = np.random.normal(0, 1, 1000)
>>> y = np.random.normal(0, 1, 1000)
>>> z = np.random.normal(1.1, 0.9, 1000)
>>> 
>>> ks_2samp(x, y)
Ks_2sampResult(statistic=0.022999999999999909, pvalue=0.95189016804849647)
>>> ks_2samp(x, z)
Ks_2sampResult(statistic=0.41800000000000004, pvalue=3.7081494119242173e-77)

The results can be interpreted as follows:

  1. You can compare the statistic reported by Python to the K-S critical value for your sample sizes (from a critical-value table). When the statistic exceeds the critical value, the two distributions are different at that significance level.

  2. Or you can compare the p-value to a significance level α, usually α = 0.05 or 0.01 (you decide; the lower α is, the stricter the test). If the p-value is lower than α, we reject the null hypothesis and conclude that the two samples are drawn from different distributions.
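As a sketch of option 1: for the two-sample test, the asymptotic critical value can be computed directly as D_crit = c(α)·sqrt((n + m)/(n·m)), where c(0.05) ≈ 1.358 and c(0.01) ≈ 1.628, instead of looking it up in a table. (The seed and sample parameters below are illustrative choices, not from the answers above.)

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)  # illustrative seed
x = rng.normal(0, 1, 1000)
z = rng.normal(1.1, 0.9, 1000)

stat = ks_2samp(x, z).statistic

# Asymptotic critical value for the two-sample K-S test:
# D_crit = c(alpha) * sqrt((n + m) / (n * m)),
# with c(0.05) ~= 1.358 and c(0.01) ~= 1.628.
n, m = len(x), len(z)
d_crit = 1.358 * np.sqrt((n + m) / (n * m))  # alpha = 0.05

print(stat > d_crit)  # True: the statistic far exceeds the critical
                      # value, so we reject the null at the 5% level
```

For n = m = 1000, D_crit ≈ 0.061, so a statistic around 0.44 (as in the answers above) rejects decisively.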

Upvotes: 157

David Lijun Yu

Reputation: 79

This is what the scipy docs say:

If the K-S statistic is small or the p-value is high, then we cannot reject the hypothesis that the distributions of the two samples are the same.

Cannot reject doesn't mean we confirm.
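To make that concrete, here is an illustrative sketch (the seed, sample size, and effect size are my own choices): with only 15 points per sample and a modest 0.4-sigma mean shift, the two-sample K-S test usually fails to reject even though the underlying distributions genuinely differ.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)  # illustrative seed

# Draw many small-sample pairs from two genuinely different
# distributions and count how often the test rejects at alpha = 0.05.
rejections = 0
trials = 500
for _ in range(trials):
    a = rng.normal(0.0, 1, 15)
    b = rng.normal(0.4, 1, 15)
    if ks_2samp(a, b).pvalue < 0.05:
        rejections += 1

print(rejections / trials)  # a small fraction: most trials cannot
                            # reject, yet the distributions differ
```

So a large p-value here reflects low power at this sample size, not evidence that the distributions are the same.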

Upvotes: 7
