oliver

Reputation: 147

How to get the Monte Carlo simulation of power for Kolmogorov-Smirnov test?

I am trying to reproduce the simulation results in Razali & Wah's 2011 article "Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests", which appeared in the Journal of Statistical Modeling and Analytics. My question is how to reproduce the following figure in R (or Python), only for the Kolmogorov-Smirnov test and the Lilliefors test.

Our hypothesis testing problem is: given $X_1,\dots, X_n \sim$ unknown CDF $F$, let $F_n$ be the empirical CDF. We want to test $$H_0: F=F_0 \quad \text{vs.} \quad H_1: F\neq F_0,$$ where $F_0$ is the CDF of the normal distribution $N(0,1)$. The test statistic is $$ T=\sup_{x\in \mathbb{R}} |F_n(x)-F_0(x)|.$$

If $T$ exceeds the $1-\alpha$ quantile of its null distribution, we reject $H_0$ at significance level $\alpha$.
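
For concreteness, the statistic $T$ can be computed directly from its definition; the sketch below (illustrative, not from the paper) does this and cross-checks against SciPy's `kstest`:

```python
import numpy as np
from scipy import stats

# Sketch: compute T = sup_x |F_n(x) - F_0(x)| directly.  For the one-sample
# KS statistic the supremum is attained at the order statistics, so it is
# enough to check max(i/n - F0(x_(i)), F0(x_(i)) - (i-1)/n) for i = 1, ..., n.
rng = np.random.default_rng(0)
x = np.sort(rng.normal(size=100))
n = len(x)
cdf = stats.norm.cdf(x)                      # F_0 at the order statistics
i = np.arange(1, n + 1)
T = max(np.max(i / n - cdf), np.max(cdf - (i - 1) / n))

# Cross-check against SciPy's implementation
T_scipy = stats.kstest(x, stats.norm.cdf).statistic
print(np.isclose(T, T_scipy))  # True
```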

In our Monte Carlo simulation, we take $\alpha = 5\%$, the sample size is 100, and $F$ is $\mathrm{Beta}(2,2)$.

The paper says that they used scipy.stats.ksone.
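
As an aside (my reading, not necessarily what the paper did): scipy.stats.ksone is the distribution of the one-sided statistic $D_n^+$, while scipy.stats.kstwo is the exact distribution of the two-sided statistic $T$ above. A sketch of getting the critical value, and of why a fully specified $N(0,1)$ null is essentially always rejected against $\mathrm{Beta}(2,2)$:

```python
import numpy as np
from scipy import stats

n, alpha = 100, 0.05
crit = stats.kstwo.ppf(1 - alpha, n)             # exact two-sided critical value
crit_approx = stats.ksone.ppf(1 - alpha / 2, n)  # common ksone-based approximation
print(crit, crit_approx)  # both near 1.358 / sqrt(n) ~ 0.136

# With F0 fixed at N(0, 1) (no parameter estimation), Beta(2, 2) data lie in
# [0, 1] while F0(0) = 0.5, so the KS distance is at least ~0.5 and H0 is
# essentially always rejected.
rng = np.random.default_rng(1)
rejections = [
    stats.kstest(stats.beta(2, 2).rvs(size=n, random_state=rng),
                 stats.norm.cdf).statistic > crit
    for _ in range(200)
]
print(np.mean(rejections))  # simulated power; ~1.0 here
```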

[Figure from the paper: simulated power of the normality tests vs. sample size]

Upvotes: 1

Views: 275

Answers (1)

Matt Haberland

Reputation: 3873

This can be done in Python along the following lines:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
rng = np.random.default_rng(7489746759637963126)

alpha = 0.05  # threshold for significance
dist = stats.beta(2, 2)  # power is against Beta(2, 2)
n_replications = 100  # power simulated with 100 replications
n_points = 15  # 15 different sample sizes
ns = np.logspace(np.log10(10), np.log10(3000), n_points).astype(int)

# Allocate arrays to store p-values
pvalues_lilliefors = np.zeros((n_points, n_replications))
pvalues_ks = np.zeros((n_points, n_replications))
pvalues_shapiro = np.zeros((n_points, n_replications))

for i, n in enumerate(ns):
  rvs = dist.rvs(size=n, random_state=rng)

  # perform the Lilliefors test once per sample size just to get the null distribution of the test statistic
  res = stats.goodness_of_fit(stats.norm, rvs, statistic='ks', random_state=rng)
  null_distribution = res.null_distribution

  for j in range(n_replications):
    rvs = dist.rvs(size=n, random_state=rng)

    # use goodness_of_fit just to evaluate Lilliefors statistic
    res = stats.goodness_of_fit(stats.norm, rvs, statistic='ks', 
                                random_state=rng, n_mc_samples=1)
    # get p-value of Lilliefors test based on null distribution
    p = ((np.sum(null_distribution >= res.statistic) + 1)
         / (len(null_distribution) + 1))
    pvalues_lilliefors[i, j] = p

    # Compute KS p-value
    # (assuming null hypothesized distribution is normal fitted to data)
    loc = np.mean(rvs)
    scale = np.std(rvs)
    p = stats.kstest(rvs, stats.norm(loc=loc, scale=scale).cdf).pvalue
    pvalues_ks[i, j] = p

    # Compute Shapiro-Wilk p-value
    p = stats.shapiro(rvs).pvalue
    pvalues_shapiro[i, j] = p

# percentage of replications in which H0 was rejected
rejections_lilliefors = (pvalues_lilliefors < alpha).mean(axis=1)
rejections_ks = (pvalues_ks < alpha).mean(axis=1)
rejections_shapiro = (pvalues_shapiro < alpha).mean(axis=1)

# plot the results
plt.semilogx(ns, rejections_shapiro, label='SW')
plt.semilogx(ns, rejections_lilliefors, label='Lilliefors')
plt.semilogx(ns, rejections_ks, label='KS')
plt.legend()
plt.xlabel('Sample size `n`')
plt.ylabel('Simulated Power')
plt.title('Simulated Power of Normality Tests for Beta(2, 2), 100 replications')

[Figure: simulated power curves produced by the code above]

A few notes:

  • The KS test compares the data against a specific null-hypothesized normal distribution, and I'm not certain what parameters (location and scale) of that null-hypothesized distribution were used in the original paper. Your description says N(0, 1), but this is quite different from Beta(2, 2), so that null hypothesis would always be rejected. The paper says "In this case, the parameters need to be estimated based on the sample data.", so I've used the MLE parameters. (This is basically what Lilliefors argued against doing because it generally results in very low power.)
  • I've included the Shapiro-Wilk curve, too, since it's easy to do, and it makes the plot easier to compare against the reference.
  • I've simulated the power at 15 logspaced points, but you can use specific points if you prefer.
  • I'm also simulating the power by computing the p-value of the tests for 100 random samples from the beta distribution; you can perform more replications.
  • Finally, the Lilliefors test p-value is computed by simulating the null distribution of the test statistic using scipy.stats.goodness_of_fit with 9999 Monte Carlo samples, the default; you may wish to use more.
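
The empirical p-value convention used above, p = (#{T* >= T} + 1) / (M + 1), can also be sketched in isolation with a hand-rolled Lilliefors Monte Carlo (a simplified version of what goodness_of_fit does internally; `lilliefors_statistic` is a helper name of my own, and M = 500 null samples is chosen only for speed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def lilliefors_statistic(x):
    # KS statistic against a normal fitted to the data (Lilliefors' setup)
    return stats.kstest(x, stats.norm(loc=np.mean(x), scale=np.std(x)).cdf).statistic

n, M = 100, 500  # sample size; M null samples (use more in practice)
# The statistic is location/scale-invariant under the fitted-normal null,
# so simulating from N(0, 1) suffices
null_distribution = np.array(
    [lilliefors_statistic(rng.normal(size=n)) for _ in range(M)]
)

observed = lilliefors_statistic(stats.beta(2, 2).rvs(size=n, random_state=rng))
# Biased-upward empirical p-value: counts the observed statistic as one of
# the Monte Carlo samples, so p can never be exactly zero
p = (np.sum(null_distribution >= observed) + 1) / (M + 1)
print(p)
```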

Upvotes: 0
