Misterrik

Reputation: 9

How can the two-sample K-S test be implemented in Python?

I have a dataset of experimental data (y_observed, t), where y_observed represents the measured values and t the time in seconds since the start of the measurement. I fit a Gaussian to this data, but want to assess whether fitting a Gaussian makes sense or whether the observed experimental data is non-Gaussian.

I've tried to implement the two-sample Kolmogorov-Smirnov goodness-of-fit test in Python. However, I am wondering whether I am implementing and interpreting this test correctly. A small example can be found below, where y_observed does not look Gaussian (see attached figure), but according to the two-sample K-S test y_observed does follow a Gaussian distribution. Can someone please explain to me what I'm doing wrong? Thanks in advance!

[figure: plot of y_observed against t]

from scipy.stats import ks_2samp
import numpy as np

y_observed = [
    1.93291, 9.84869, 29.5713, 20.6346, -1.51537,
    0.396292, 50.8895, 68.855, 63.9291, 55.6863,
    37.6503, 46.87, 33.6637, 25.0395, 18.3027,
    38.0947, 27.9305, 10.2012, -0.704519, 18.6656,
    20.2873, 8.78955, 4.39672
]


t = [
    519, 522, 525, 528, 531,
    534, 537, 540, 543, 546,
    549, 552, 555, 558, 561,
    564, 567, 570, 573, 576,
    579, 582, 585
]


y_fit = [
    5.60417, 8.74951, 12.9907, 18.3426, 24.63,
    31.4518, 38.1947, 44.1102, 48.4454, 50.5991,
    50.2586, 47.4739, 42.6458, 36.4314, 29.5973,
    22.8668, 16.8011, 11.7394, 7.80065, 4.92939,
    2.96233, 1.69298, 0.920122
] 


count, bins_count = np.histogram(y_observed, bins=5)
pdf = count / sum(count)
cdf = np.cumsum(pdf)   
count_2, bins_count_2 = np.histogram(y_fit, bins=5)
pdf_2 = count_2 / sum(count_2)
cdf_2 = np.cumsum(pdf_2)
KS_stat, P_val = ks_2samp(cdf, cdf_2) 
n_1, n_2 = len(y_observed), len(y_fit)
critical_val = np.sqrt((n_1+n_2)/(n_1*n_2))*1.36
if P_val<0.05 and KS_stat>critical_val:
      print("Observed data does not follow a Gaussian distribution")
else:
      print("Observed data follows a Gaussian distribution")              

Upvotes: 0

Views: 166

Answers (1)

Matt Haberland

Reputation: 3873

Can someone please explain to me what I'm doing wrong?

stats.ks_2samp accepts two samples and performs a test of the null hypothesis that the two samples were drawn from the same (unknown) continuous distribution. So the first problem is that the code is not passing in two samples - it appears to be passing in coarse-grained empirical CDFs of two arrays, one of which is your sample. (I don't know what y_fit is.) If you wanted to assess whether y_observed and y_fit were drawn from the same distribution (which doesn't look meaningful if you're after whether y_observed is drawn from a normal distribution), you would simply call

ks_2samp(y_observed, y_fit)
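If that two-sample comparison were what you wanted, the usage is just a direct call on the two raw samples. A minimal runnable sketch (the data here are synthetic, drawn from two normals with different means purely for illustration):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
sample_a = rng.normal(loc=0.0, scale=1.0, size=200)  # synthetic sample 1
sample_b = rng.normal(loc=1.0, scale=1.0, size=200)  # synthetic sample 2, shifted

# Pass the raw samples, not histograms or CDFs; ks_2samp builds the
# empirical CDFs internally.
res = ks_2samp(sample_a, sample_b)
print(res.statistic, res.pvalue)  # small p-value: the samples differ
```

Note there is no binning step: handing ks_2samp pre-computed CDF arrays, as in the question, makes it compare the wrong things.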

The second problem is that this would not be the appropriate test to assess whether the data is non-Gaussian.

If it must be a KS test, you would want the one-sample KS test:

from scipy import stats
stats.ks_1samp(y_observed, stats.norm().cdf)

The problem is that, as written, this will test the null hypothesis that the data were drawn from the standard normal. If you know the parameters loc and scale of the hypothesized normal distribution, you can pass them in:

from scipy import stats
stats.ks_1samp(y_observed, stats.norm(loc=loc, scale=scale).cdf)

If you don't know the parameters (even if you can fit them to the data), then this test is not appropriate.
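To make the issue concrete, here is a sketch with synthetic data (stats.norm.fit returns the maximum-likelihood loc and scale): if you estimate the parameters from the same sample and then run the one-sample KS test against that fitted normal, the data are used twice, so the standard KS null distribution no longer holds and the reported p-value is systematically too large.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=10.0, scale=3.0, size=50)  # synthetic sample

# Fit loc/scale to the very data we are about to test...
loc, scale = stats.norm.fit(data)

# ...then test against the fitted normal. The p-value printed here is
# not trustworthy: the KS null distribution assumes the hypothesized
# CDF was fixed in advance, not fitted to the data.
res = stats.ks_1samp(data, stats.norm(loc=loc, scale=scale).cdf)
print(res.statistic, res.pvalue)
```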

There are several other normality tests in SciPy that you could use instead. All you need to pass them is the sample to be tested.

stats.shapiro(y_observed)
stats.jarque_bera(y_observed)
stats.normaltest(y_observed)
stats.kurtosistest(y_observed)
stats.skewtest(y_observed)
stats.anderson(y_observed)

shapiro is often preferred. anderson would be another good option, but scipy.stats.anderson does not produce a p-value (it only returns a table of critical values), so it's not as easy to use. scipy.stats.goodness_of_fit is the most flexible (and could produce a p-value for the Anderson-Darling test), but you don't need it for a normality test.
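For instance, a sketch of goodness_of_fit (available in SciPy >= 1.10; the data below are synthetic) performing an Anderson-Darling normality test with a Monte Carlo p-value; it fits loc and scale internally and accounts for that fitting when computing the p-value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.normal(loc=5.0, scale=2.0, size=100)  # synthetic sample

# statistic='ad' selects the Anderson-Darling statistic; loc/scale of the
# normal are fit to the data, and the Monte Carlo null accounts for that.
res = stats.goodness_of_fit(stats.norm, data, statistic='ad',
                            n_mc_samples=999, random_state=rng)
print(res.statistic, res.pvalue)
```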

Note that whatever you choose, a frequentist null hypothesis test cannot possibly conclude that the data were drawn from a Gaussian distribution. It can only provide evidence that the data were not drawn from a Gaussian distribution; otherwise it is inconclusive.

Upvotes: 0
