Reputation: 1032
I am working on a finance problem where I have to implement a function that runs a Kolmogorov-Smirnov test (KS test) comparing each stock's signal returns against a normal distribution. I am required to use scipy.stats.kstest to perform the test.
My function is as below:
import pandas as pd
from scipy.stats import kstest

def calculate_kstest(long_short_signal_returns):
    """
    Calculate the KS-Test against the signal returns with a long or short signal.
    Parameters
    ----------
    long_short_signal_returns : DataFrame
        The signal returns which have a signal.
        This DataFrame contains two columns, "ticker" and "signal_return"
    Returns
    -------
    ks_values : Pandas Series
        KS statistic for all the tickers
    p_values : Pandas Series
        P value for all the tickers
    """
    #TODO: Implement function
    ks_v = []
    p_v = []
    #print(long_short_signal_returns)
    column = []
    df = long_short_signal_returns.copy()
    print(df)
    #df['signal_return'] = (df['signal_return'] - df['signal_return'].mean()) / (df['signal_return'].max() - df['signal_return'].min())
    for name, group in df.groupby('ticker'):
        sub_group = group['signal_return'].values
        # No args are passed, so each group is compared against the standard normal N(0, 1)
        ks, p = kstest(sub_group, 'norm')
        ks_v.append(ks)
        p_v.append(p)
        column.append(name)
    ks_values = pd.Series(ks_v, column)
    p_values = pd.Series(p_v, column)
    return ks_values, p_values
However, my answer doesn't match the expected output.
The input is:
INPUT long_short_signal_returns:
signal_return ticker
0 0.12000000 DNTM
1 -0.83000000 EHX
2 0.37000000 VWER
3 0.83000000 DNTM
4 -0.34000000 EHX
5 0.27000000 VWER
6 -0.68000000 DNTM
7 0.29000000 EHX
8 0.69000000 VWER
9 0.57000000 DNTM
10 0.39000000 EHX
11 0.56000000 VWER
12 -0.97000000 DNTM
13 -0.72000000 EHX
14 0.26000000 VWER
My output is:
OUTPUT ks_values:
DNTM 0.20326939
EHX 0.34826827
VWER 0.60256811
dtype: float64
OUTPUT p_values:
DNTM 0.98593727
EHX 0.48009144
VWER 0.02898631
dtype: float64
The expected output is:
EXPECTED OUTPUT FOR ks_values:
DNTM 0.28999582
EHX 0.34484969
VWER 0.63466098
dtype: float64
EXPECTED OUTPUT FOR p_values:
DNTM 0.73186935
EHX 0.49345487
VWER 0.01775987
dtype: float64
I was told that I need to apply a proper normalization to get the right p_values and ks_values, but I don't understand what "proper normalization" means here or how to apply it. Can anyone help?
Upvotes: 3
Views: 2390
Reputation: 46
normal_args = (np.mean(long_short_signal_returns['signal_return']), np.std(long_short_signal_returns['signal_return'], ddof=1))
Try adding ddof=1. This is a sample, not a population, so when you calculate the standard deviation you need to divide by (n-1) rather than n.
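To make the difference concrete, here is a small sketch using the 15 signal returns from the question's input, pooled across tickers. It only compares the two estimators; the variable names are mine. Note that np.std defaults to ddof=0, while pandas' Series.std defaults to ddof=1.

```python
import numpy as np

# Pooled signal returns copied from the question's input table.
returns = np.array([0.12, -0.83, 0.37, 0.83, -0.34, 0.27, -0.68, 0.29,
                    0.69, 0.57, 0.39, 0.56, -0.97, -0.72, 0.26])

population_std = np.std(returns)       # ddof=0: divides by n
sample_std = np.std(returns, ddof=1)   # ddof=1: divides by n - 1

# The two estimates differ by a factor of sqrt(n / (n - 1)).
n = returns.size
ratio = sample_std / population_std

print(population_std, sample_std, ratio)
```

With only 15 observations the gap is a few percent, which is enough to shift the KS statistic in the second decimal place.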
Upvotes: 1
Reputation: 1032
I found the solution to my problem. I had to pass kstest the mean and standard deviation of the signal_return column over the entire dataset, not per ticker. In this case, I had to do
k, p = kstest(rvs=sub_group,
              cdf='norm',
              args=(np.mean(df['signal_return']), np.std(df['signal_return'])))
where df = long_short_signal_returns.
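For anyone landing here later, this is a minimal sketch of the whole function under one assumption: that the grader uses the sample standard deviation (ddof=1, as suggested in the other answer) of the pooled signal_return column. On the question's input that assumption appears to reproduce the expected ks_values. I have not checked the p-values, and they may depend on the SciPy version.

```python
import pandas as pd
from scipy.stats import kstest

def calculate_kstest(long_short_signal_returns):
    returns = long_short_signal_returns['signal_return']
    # Mean and sample std (ddof=1, an assumption) over ALL tickers pooled together.
    normal_args = (returns.mean(), returns.std(ddof=1))

    ks_values, p_values = {}, {}
    for ticker, group in long_short_signal_returns.groupby('ticker'):
        # Compare this ticker's returns against N(pooled mean, pooled std).
        ks, p = kstest(group['signal_return'].values, 'norm', args=normal_args)
        ks_values[ticker] = ks
        p_values[ticker] = p
    return pd.Series(ks_values), pd.Series(p_values)

# The input from the question: tickers cycle DNTM, EHX, VWER.
df = pd.DataFrame({
    'signal_return': [0.12, -0.83, 0.37, 0.83, -0.34, 0.27, -0.68, 0.29,
                      0.69, 0.57, 0.39, 0.56, -0.97, -0.72, 0.26],
    'ticker': ['DNTM', 'EHX', 'VWER'] * 5,
})

ks_values, p_values = calculate_kstest(df)
print(ks_values)
print(p_values)
```

Using np.std without ddof=1 instead gives slightly different statistics, for example about 0.639 rather than 0.635 for VWER.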
Upvotes: 2
Reputation: 81
I faced the same issue and also tried normalizing the distribution to mean = 0 and std = 1.
However, the expected values in the test are still (slightly) different. Is there an error in the test, or did you get exactly the same values? How did you pass the tests?
groups = long_short_signal_returns.groupby('ticker')
normal_args = (np.mean(long_short_signal_returns['signal_return']), np.std(long_short_signal_returns['signal_return']))
for name, group in groups:
    ks_value, p_value = kstest(group['signal_return'].values, 'norm', normal_args)
The test results are slightly off (by less than 0.05):
OUTPUT ks_values:
AVYK 0.63919407
JUWZ 0.29787827
VXIK 0.35221525
dtype: float64
OUTPUT p_values:
AVYK 0.01650327
JUWZ 0.69536353
VXIK 0.46493498
dtype: float64
EXPECTED OUTPUT FOR ks_values:
JUWZ 0.28999582
VXIK 0.34484969
AVYK 0.63466098
dtype: float64
EXPECTED OUTPUT FOR p_values:
JUWZ 0.73186935
VXIK 0.49345487
AVYK 0.01775987
dtype: float64
Upvotes: 1
Reputation: 194
The KS test without any additional parameters will test your data against a standard normal distribution (a normal distribution with mean 0 and standard deviation 1). If your data are normally distributed with a different mean and standard deviation, KS test will tell you that the data's distribution is significantly different (you'll get a small p-value).
What you want to test is the 'shape' of your distribution, not its mean and standard deviation. There are two options: standardize your data before passing them to kstest (subtract the mean, then divide by the standard deviation), or pass the parameters in the call to the KS test: scipy.stats.kstest(data, 'norm', args=(mean, standard_deviation)).
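A quick sketch showing that the two options give the same statistic, using synthetic data (the seed, loc, scale and sample size are arbitrary choices for illustration, not taken from the question):

```python
import numpy as np
from scipy.stats import kstest

# Synthetic sample that is normal, but not standard normal.
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)
mean, std = data.mean(), data.std(ddof=1)

# No parameters: tested against N(0, 1), so the fit looks terrible.
raw = kstest(data, 'norm')

# Option 1: standardize first, then test against N(0, 1).
stat_standardized = kstest((data - mean) / std, 'norm').statistic

# Option 2: leave the data alone and pass the parameters to kstest.
stat_args = kstest(data, 'norm', args=(mean, std)).statistic

print(raw.statistic, stat_standardized, stat_args)
```

One caveat: when the mean and standard deviation are estimated from the same data being tested, the p-value reported by kstest tends to be too optimistic, so treat it as a rough guide.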
Upvotes: 1