Is the scipy.stats.ks_2samp function supposed to take in raw data arrays or the ECDF of the data?

Question

I'm confused about how to use the SciPy function ks_2samp. I have two samples of discrete data and want to test whether they differ significantly. I have tried three things:

1. passing the raw data directly
2. passing the ECDFs of the data
3. passing a PDF fit of the data

I get a different result each time, and I don't know which approach is correct. From the scipy.stats.ks_2samp documentation, it looks like data1 and data2 are raw data samples, but the Wikipedia page says that I need to input empirical data. I'm working with a huge dataset. Also, please suggest any other method that achieves the same thing.

Following is my code snippet:

    import numpy as np 
    import matplotlib.pyplot as plt 
    import scipy.stats as ss 
    from scipy.stats import norm 
    import pandas as pd
        
    # x0 is a dataframe column
    
    mean0 = np.mean(x0) 
    std0 = np.std(x0) 
    pdf0 = ss.norm.pdf(x0.sort_values(), mean0, std0)
        
    mean1 = np.mean(x1) 
    std1 = np.std(x1) 
    pdf1 = ss.norm.pdf(x1.sort_values(), mean1, std1)
        
    def ecdf(x):
        # Return the sorted values and the ECDF height at each of them
        xs = np.sort(x)
        ys = np.arange(1, len(xs) + 1) / float(len(xs))
        return xs, ys

    xs0, ys0 = ecdf(x0)
    xs1, ys1 = ecdf(x1)
        
    rslt1=ss.ks_2samp(x0,x1) 
    rslt2=ss.ks_2samp(ys0,ys1) 
    rslt3=ss.ks_2samp(pdf0,pdf1)
        
        
    print(rslt1) 
    print(rslt2) 
    print(rslt3)


Answers (1)
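
As the documentation quoted in the question indicates, data1 and data2 are the raw observations; ks_2samp builds the empirical CDFs internally. The sketch below is one way to check this. It uses made-up Poisson samples as stand-ins for x0 and x1 (an assumption, since the real data is not shown), and ks_statistic_from_ecdfs is a hypothetical helper written for this illustration, not part of SciPy. It computes the largest gap between the two ECDFs by hand and compares it with the statistic returned for the raw arrays.

```python
import numpy as np
from scipy import stats


def ks_statistic_from_ecdfs(a, b):
    """Largest vertical gap between the empirical CDFs of a and b.

    Hypothetical helper for illustration only; not a SciPy function.
    """
    a = np.sort(np.asarray(a))
    b = np.sort(np.asarray(b))
    # The gap can only change at observed values, so evaluate both ECDFs there.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return np.max(np.abs(cdf_a - cdf_b))


# Made-up discrete samples standing in for x0 and x1 (assumption).
rng = np.random.default_rng(0)
x0 = rng.poisson(5.0, size=500)
x1 = rng.poisson(6.0, size=400)

# Raw observations go in directly.
result = stats.ks_2samp(x0, x1)
manual = ks_statistic_from_ecdfs(x0, x1)
print("ks_2samp on raw data:", result.statistic, result.pvalue)
print("manual ECDF gap:     ", manual)

# Passing only the ECDF heights (ys0, ys1 in the question) compares two
# evenly spaced grids on (0, 1], which says nothing about the data itself.
ys0 = np.arange(1, x0.size + 1) / x0.size
ys1 = np.arange(1, x1.size + 1) / x1.size
heights_only = stats.ks_2samp(ys0, ys1)
print("ks_2samp on ECDF heights:", heights_only.statistic)
```

The first two numbers should agree, while the third is close to zero regardless of the data. Passing fitted PDF values has a similar problem: the test would then compare the distribution of density values rather than the observations. Note also that the KS test assumes continuous distributions, so with heavily tied discrete data the reported p-value should be treated as approximate.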
