raven

Reputation: 529

ks_2samp returns p-value of 1.0

I have two pandas DataFrames, data1 and data2, and both have an integer column h with values ranging from 1 to 50.

data1 has a sample size of roughly 55,000, whereas data2 has a sample size of roughly 8,000. I can't upload the exact data because of its size, but below are the histograms I created of data1['h'] vs. data2['h']:

[histogram of data1['h']]

(I applied matplotlib's yscale('log') for easier viewing.)

[histogram of data2['h']]
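For reference, the plots came from something roughly like this (a sketch; the bin choice here is illustrative, one bin per integer h value):

import matplotlib.pyplot as plt

# one figure per column; bin edges give one bin per integer value of h (1..50)
bins = range(1, 52)
for title, col in [("data1['h']", data1['h']), ("data2['h']", data2['h'])]:
    plt.figure()
    plt.hist(col, bins=bins)
    plt.yscale('log')  # log y-scale for easier observation
    plt.title(title)
    plt.show()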

To compare the distributions, I used ks_2samp from scipy.stats. I ran one two-sided test and two one-sided tests, to check both directions of a possible shift:

from scipy.stats import ks_2samp

# h indices are significantly different
print(ks_2samp(data1['h'], data2['h']))

# data1 h indices are greater
print(ks_2samp(data1['h'], data2['h'], alternative='greater'))

# data2 h indices are greater
print(ks_2samp(data1['h'], data2['h'], alternative='less'))

The results were as follows:

Ks_2sampResult(statistic=0.1293719140156916, pvalue=3.448839769104661e-105)
Ks_2sampResult(statistic=0.0, pvalue=1.0)
Ks_2sampResult(statistic=0.1293719140156916, pvalue=1.5636837258561576e-105)

I have used ks_2samp before on other projects, but such extreme p-values are new to me. The second result in particular makes me wonder whether I'm performing the test incorrectly, since a p-value of exactly 1.0 seems absurd.

I've researched similar issues, including the following StackOverflow question (scipy p-value returns 0.0), but unfortunately my issue is not identical to any reported so far.

I'd appreciate any insight into interpreting these results, or into fixing my approach.

Upvotes: 2

Views: 1638

Answers (1)

amquack

Reputation: 887

The problem does not seem to be with your code, but with your interpretation. Your histograms show that data1 is shifted to the right, so below I construct two normal samples with the same kind of shift, plot their histograms, and run the Kolmogorov-Smirnov test to show that the results you got are exactly what we should expect.

Setup:

from scipy.stats import ks_2samp
from numpy import random
import pandas as pd
from matplotlib import pyplot

random.seed(1)

n = 4000
# sample 1 ~ N(1, 1), sample 2 ~ N(0, 1): same shape, sample 1 shifted right
l1 = [random.normal(1) for x in range(n)]
l2 = [random.normal() for x in range(n)]

df = pd.DataFrame(list(zip(l1, l2)), columns=['1', '2'])

Tests:

print(ks_2samp(df['1'], df['2']))
print(ks_2samp(df['1'], df['2'], alternative='greater'))
print(ks_2samp(df['1'], df['2'], alternative='less'))

Returns:

KstestResult(statistic=0.3965, pvalue=3.8418108959960396e-281)
KstestResult(statistic=0.0, pvalue=1.0)
KstestResult(statistic=0.3965, pvalue=1.9209054479980054e-281)

Graphical representation:

bins = 50
pyplot.hist(l1, bins, alpha=.5, label='Sample 1')
pyplot.hist(l2, bins, alpha=.5, label='Sample 2')
pyplot.legend()
pyplot.show()

[overlaid histograms of Sample 1 and Sample 2]

So what's going on here?

The first KS test rejects the null hypothesis that the two distributions are equivalent, and it does so with high confidence (the p-value is essentially zero).

The second tells us that we cannot reject the null hypothesis that sample 1 is greater than sample 2. This is obvious from what we built: sample 1 is pulled from the same population as sample 2, but shifted to the right, so sample 1's empirical CDF never rises above sample 2's, the one-sided statistic is 0, and the p-value is 1.0.

The third again rejects a null hypothesis, but this H0 is that sample 1 is smaller than sample 2. Notice that this p-value is the smallest of the three (in fact exactly half the two-sided one): the evidence against "sample 1 is smaller" is even stronger than the evidence against "the distributions are equal". This is again as expected.
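If it helps to see the machinery, here is a minimal sketch (reusing l1 and l2 from the setup above) that evaluates both empirical CDFs on the pooled sample; the one-sided statistics correspond to the largest signed gaps between them:

import numpy as np

# evaluate each sample's empirical CDF at every pooled data point
pooled = np.sort(np.concatenate([l1, l2]))
ecdf1 = np.searchsorted(np.sort(l1), pooled, side='right') / len(l1)
ecdf2 = np.searchsorted(np.sort(l2), pooled, side='right') / len(l2)

print(np.max(ecdf1 - ecdf2))          # ~0: sample 1's ECDF never exceeds sample 2's ('greater')
print(np.max(ecdf2 - ecdf1))          # ~0.3965: the 'less' statistic
print(np.max(np.abs(ecdf1 - ecdf2)))  # ~0.3965: the two-sided statistic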

Also notice, in this example, that both distributions are normal and very similar in shape. But the KS test is sensitive to more than shape: "the populations may differ in median, variability or the shape of the distribution" (reference). Here they differ in median but not in shape, and the test detects it.
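As a hypothetical illustration of the "variability" part (following the same setup conventions as above), two samples with the same median but different spread are also told apart:

from scipy.stats import ks_2samp
from numpy import random

random.seed(1)
n = 4000

# same median (0), same family, different spread: sigma = 1 vs sigma = 3
narrow = [random.normal(0, 1) for x in range(n)]
wide = [random.normal(0, 3) for x in range(n)]

# the two-sided test rejects equality, driven by variability alone
print(ks_2samp(narrow, wide))

With samples this large, the p-value should again come out essentially zero even though the two medians coincide.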

Upvotes: 3
