Reputation: 2012
I have a distribution
This one looks pretty gaussian, and we also can't reject the idea with such a high p-value from the KS test.
BUT, the test distribution is actually also a generated one with a finite sample size and not the CDF itself, as you'll notice in the code. So that's kind of cheating, compared to using the CDF for a smooth gaussian function.
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
d1 = np.random.normal(loc = 3, scale = 2, size = 1000)
d2 = np.random.normal(loc = 3, scale = 0.5, size = 250) # Vary this to test
data = np.concatenate((d1,d2))
xmin, xmax = min(data), max(data)
lnspc = np.linspace(xmin, xmax, len(data))
# lets try the normal distribution first
m, s = stats.norm.fit(data) # get mean and standard deviation from fit
pdf_g = stats.norm.pdf(lnspc, m, s) # now get theoretical values in our interval
plt.hist(data, color = "lightgrey", normed = True, bins = 50)
plt.plot(lnspc, pdf_g, color = "black", label="Gaussian") # plot it
# Test how not-gaussian our distribution is by generating a distribution from the fit
test_dist = np.random.normal(m, s, len(data))
KS_D, KS_p = stats.ks_2samp(data, test_dist)
plt.title("D = {0:.2f}, p = {1:.2f}".format(KS_D, KS_p))
plt.show()
But I can't figure out how to use the default KS test for, that is
KS_D, KS_p = stats.kstest(data, "norm")
,
as it always returns a p-value of 0, i.e. my gaussian data must be in the wrong format.
How should I normalize my data to properly use the KS test? And is simulating the comparison distribution a valid usage, or more incorrect than testing against the continuous CDF for the distribution?
Upvotes: 1
Views: 290
Reputation: 158
"norm"
uses a normal distribution that defaults to be zero-mean, with standard deviation 1 [ref]. Your data have values m
and s
for that, which are quite different. It is telling you they are very different from this standard reference distribution.
You could still use this test to check if the data look Gaussian if you first normalize (haha) your data appropriately:
data_n = (data - m) / s
KS_D, KS_p = stats.kstest(data_n, "norm")
Upvotes: 1