How to use a proper normalization to have the right p_values and ks_values from Kolmogorov-Smirnov test (KS test)?

I am working on a finance problem and need to implement a function that runs the Kolmogorov-Smirnov test (KS test) on each stock's signal returns against a normal distribution. The test has to be performed with scipy.stats.kstest.

I was advised to iterate over the result of groupby.
I am required to use only pandas, numpy and scipy.

My function is as below:

import pandas as pd
from scipy.stats import kstest


def calculate_kstest(long_short_signal_returns):
    """
    Calculate the KS-Test against the signal returns with a long or short signal.

    Parameters
    ----------
    long_short_signal_returns : DataFrame
        The signal returns which have a signal.
        This DataFrame contains two columns, "ticker" and "signal_return"

    Returns
    -------
    ks_values : Pandas Series
        KS statistic for all the tickers
    p_values : Pandas Series
        P value for all the tickers
    """
    ks_v = []
    p_v = []
    column = []
    df = long_short_signal_returns.copy()

    # Earlier normalization attempt, left disabled:
    # df['signal_return'] = (df['signal_return'] - df['signal_return'].mean()) / (df['signal_return'].max() - df['signal_return'].min())

    # Run one KS test per ticker on that ticker's signal returns
    for name, group in df.groupby('ticker'):
        sub_group = group['signal_return'].values
        ks, p = kstest(sub_group, 'norm')

        ks_v.append(ks)
        p_v.append(p)
        column.append(name)

    # Index both result Series by ticker name
    ks_values = pd.Series(ks_v, column)
    p_values = pd.Series(p_v, column)

    return ks_values, p_values

However, my answer doesn't match the expected output.

The input is:

INPUT long_short_signal_returns:
    signal_return ticker
0      0.12000000   DNTM
1     -0.83000000    EHX
2      0.37000000   VWER
3      0.83000000   DNTM
4     -0.34000000    EHX
5      0.27000000   VWER
6     -0.68000000   DNTM
7      0.29000000    EHX
8      0.69000000   VWER
9      0.57000000   DNTM
10     0.39000000    EHX
11     0.56000000   VWER
12    -0.97000000   DNTM
13    -0.72000000    EHX
14     0.26000000   VWER

My output is:

OUTPUT ks_values:
DNTM   0.20326939
EHX    0.34826827
VWER   0.60256811
dtype: float64

OUTPUT p_values:
DNTM   0.98593727
EHX    0.48009144
VWER   0.02898631
dtype: float64

The expected output is:

EXPECTED OUTPUT FOR ks_values:
DNTM   0.28999582
EHX    0.34484969
VWER   0.63466098
dtype: float64

EXPECTED OUTPUT FOR p_values:
DNTM   0.73186935
EHX    0.49345487
VWER   0.01775987
dtype: float64

I was told that I need a proper normalization to get the right p_values and ks_values, but I don't understand what this proper normalization means or how to apply it. Can anyone help?

Upvotes: 3

Answers (4)

Qi Zhao

Reputation: 46

normal_args = (np.mean(long_short_signal_returns['signal_return']), np.std(long_short_signal_returns['signal_return'], ddof=1))

Try adding ddof=1. The data are a sample, not a population, so the standard deviation should be computed with (n-1) in the denominator rather than n. np.std uses n by default (ddof=0).
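
To get a feel for how much the ddof choice matters here, the following sketch computes both standard deviations over the 15 signal returns listed in the question. The variable names are illustrative and do not come from the original code.

```python
import numpy as np

# The 15 signal returns from the question's input, in row order.
signal_returns = np.array([
    0.12, -0.83, 0.37, 0.83, -0.34, 0.27, -0.68, 0.29,
    0.69, 0.57, 0.39, 0.56, -0.97, -0.72, 0.26,
])

# np.std defaults to ddof=0: divide by n (population formula).
population_std = np.std(signal_returns)
# ddof=1 divides by n - 1 (the usual sample estimate).
sample_std = np.std(signal_returns, ddof=1)

print(f"ddof=0: {population_std:.5f}")
print(f"ddof=1: {sample_std:.5f}")
```

With 15 observations the two values differ by a factor of sqrt(15/14), roughly 3.5%, which is enough to move the KS statistic in the third decimal place. Note that pandas' Series.std() uses ddof=1 by default, unlike np.std.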

Upvotes: 1

Eyong Kevin Enowanyo

Reputation: 1032

I found the solution to my problem. The mean and standard deviation have to be taken from the signal_return column of the entire dataset, not from each ticker's group, and then passed to kstest:

k, p = kstest(rvs=subgroup, 
              cdf='norm', 
              args=(np.mean(df['signal_return']), np.std(df['signal_return'])))

where df = long_short_signal_returns and subgroup holds the signal returns of a single ticker.
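
For reference, here is a minimal sketch of the whole function built around that call, run on the sample input from the question. The ddof parameter is an addition for illustration, following the ddof=1 suggestion in the other answer; it is not part of the original function signature.

```python
import numpy as np
import pandas as pd
from scipy.stats import kstest


def calculate_kstest(long_short_signal_returns, ddof=1):
    """Run a KS test per ticker against one normal fitted to all returns."""
    returns = long_short_signal_returns['signal_return'].values
    # Mean and std come from the whole column, not from each ticker's group.
    # ddof is an assumed extra parameter; np.std itself defaults to ddof=0.
    normal_args = (np.mean(returns), np.std(returns, ddof=ddof))

    ks_values = {}
    p_values = {}
    for ticker, group in long_short_signal_returns.groupby('ticker'):
        result = kstest(group['signal_return'].values, 'norm', args=normal_args)
        ks_values[ticker] = result.statistic
        p_values[ticker] = result.pvalue

    return pd.Series(ks_values), pd.Series(p_values)


# Sample input copied from the question (tickers cycle DNTM, EHX, VWER).
df = pd.DataFrame({
    'signal_return': [0.12, -0.83, 0.37, 0.83, -0.34, 0.27, -0.68, 0.29,
                      0.69, 0.57, 0.39, 0.56, -0.97, -0.72, 0.26],
    'ticker': ['DNTM', 'EHX', 'VWER'] * 5,
})

ks_values, p_values = calculate_kstest(df)
print(ks_values)
print(p_values)
```

On this input, ddof=1 should give KS statistics that agree with the expected output in the question, while ddof=0 gives the slightly different values reported elsewhere in this thread (for example about 0.639 instead of 0.635 for VWER). The p-values may additionally depend on the SciPy version, since the way kstest computes them by default has changed between releases.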

Upvotes: 2

der_flori

Reputation: 81

I faced the same issue and also tried normalizing the distribution to mean = 0 and std = 1.

However, the expected values in the test are still (slightly) different. Is there an error in the test or did you get the exact same values? How did you pass the tests?

groups = long_short_signal_returns.groupby('ticker')

normal_args = (np.mean(long_short_signal_returns['signal_return']),np.std(long_short_signal_returns['signal_return']))

for name, group in groups:
    ks_value, p_value = kstest(group['signal_return'].values, 'norm', normal_args)

Test results are slightly off (within ±0.05):

OUTPUT ks_values:
AVYK   0.63919407
JUWZ   0.29787827
VXIK   0.35221525
dtype: float64

OUTPUT p_values:
AVYK   0.01650327
JUWZ   0.69536353
VXIK   0.46493498
dtype: float64

EXPECTED OUTPUT FOR ks_values:
JUWZ   0.28999582
VXIK   0.34484969
AVYK   0.63466098
dtype: float64

EXPECTED OUTPUT FOR p_values:
JUWZ   0.73186935
VXIK   0.49345487
AVYK   0.01775987
dtype: float64

Upvotes: 1

Joos Korstanje

Reputation: 194

The KS test without any additional parameters tests your data against a standard normal distribution (a normal distribution with mean 0 and standard deviation 1). If your data are normally distributed but with a different mean or standard deviation, the KS test will still tell you that the data's distribution is significantly different (you'll get a small p-value).

What you want to test is the 'shape' of your distribution, not its mean and standard deviation. There are two options: standardize your data before passing them to kstest (subtract the mean, then divide by the standard deviation), or pass the parameters in the call to the KS test: scipy.stats.kstest(data, 'norm', args=(mean, standard_deviation)).
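
A small sketch of the two options, using synthetic data rather than the returns from the question. The seed, sample size and distribution parameters are arbitrary choices made for illustration.

```python
import numpy as np
from scipy.stats import kstest

# Synthetic data for illustration only: normal with mean 5 and std 2.
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)

mean = data.mean()
std = data.std(ddof=1)

# No args: the data are compared with N(0, 1) and clearly rejected.
raw = kstest(data, 'norm')

# Option 1: standardize first, then compare with N(0, 1).
standardized = kstest((data - mean) / std, 'norm')

# Option 2: leave the data alone and pass the parameters to the normal CDF.
with_args = kstest(data, 'norm', args=(mean, std))

print("p-value without normalization:", raw.pvalue)
print("statistic, option 1:", standardized.statistic)
print("statistic, option 2:", with_args.statistic)
```

Both options evaluate the same normal CDF at the same points, so they should return the same statistic up to floating-point rounding. Which one to use is mostly a matter of whether you want to keep the data in their original units.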

Upvotes: 1
