Reputation: 1
I'm confused on how to use the scipy stat function ks_2samp. I have 2 samples of discrete data and want to see if they are significantly different. I have tried 3 things:
I'm getting different values each time and I do not know which is the right one and the right method. From the scipy.stats.ks_2samp documentation, it looks like data1 and data2 are raw data samples. But the Wikipedia page says that I need to input empirical data. I'm working with a huge dataset. Also, please suggest if there's any other method to achieve the same.
Following is my code snippet:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as ss
from scipy.stats import norm
import pandas as pd
# x0 is a dataframe column
mean0 = np.mean(x0)
std0 = np.std(x0)
pdf0 = ss.norm.pdf(x0.sort_values(), mean0, std0)
mean1 = np.mean(x1)
std1 = np.std(x1)
pdf1 = ss.norm.pdf(x1.sort_values(), mean1, std1)
def ecdf(x):
xs = np.sort(x)
ys = np.arange(1, len(xs)+1)/float(len(xs))
return xs, ys
xs0, ys0 = ecdf(x0) xs1, ys1 = ecdf(x0)
rslt1=ss.ks_2samp(x0,x1)
rslt2=ss.ks_2samp(ys0,ys1)
rslt3=ss.ks_2samp(pdf0,pdf1)
print(rslt1)
print(rslt2)
print(rslt3)
Upvotes: 0
Views: 269
Reputation: 3873
scipy.stats.ks_2samp
accepts raw samples. You do not need to do any fitting, parametric or non-parametric (e.g. ecdf), beforehand. So #1 of the things you tried and rslt1
of your code look valid.
One way you can confirm this is by repeatedly generating samples under the null hypothesis and checking the distribution of p-values. Roughly speaking, they should be uniformly distributed (although there are some footnotes in the case of this test due to the discrete nature of the statistic).
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
rng = np.random.default_rng(459324692354693456)
pvals = []
for i in range(10000):
x = rng.normal(size=500)
y = rng.normal(size=500)
res = stats.ks_2samp(x, y, method='asymp')
pvals.append(res.pvalue)
res = stats.ecdf(pvals)
res.cdf.plot()
plt.plot([0, 1], [0, 1])
plt.title("ECDF of p-values under H0")
plt.xlabel("p-value")
plt.ylabel("Empirical CDF")
plt.legend(["ECDF", "Expected"])
If you were to pre-process the data according to ideas #2 or #3, the distribution of p-values would not be as expected.
Other tests of the null hypothesis that the two samples were drawn from the same distribution are scipy.stats.cramervonmises_2samp
, scipy.stats.anderson_ksamp
, and scipy.stats.epps_singleton_2samp
. The last two do not assume the underlying distribution is continuous.
There are many other independent sample tests that may be more sensitive to specific deviations, but these are the ones that are most similar in usage to Kolmogorov-Smirnov.
Note that there is also a scipy.stats.ecdf
as of SciPy 1.11.0, and if you know what distribution you want to fit, you can use the distribution's fit
method or scipy.stats.fit
.
Upvotes: 0