Reputation: 45
I want to get the P-value of two randomly distributed observations x and y, for example :
> set.seed(0)
> x <- rnorm(1000, 3, 2)
> y <- rnorm(2000, 4, 3)
or:
> set.seed(0)
> x <- rexp(50, 10)
> y <- rexp(100, 11)
let's say that T is my test-statistic defined as mean(x) - mean(y) = 0 (this is H0), the P-value is then defined as : p-value = P[T>T_observed | H0 holds].
I tried doing this :
> z <- c(x,y) # if H0 holds then x and y are distributed with the same distribution
> f <- function(x) ecdf(z) # this will get the distribution of z (x and y)
then to calculate the p-value i tried this:
> T <- replicate(10000, mean(sample(z,1000,TRUE))-mean(sample(z,2000,TRUE))) # this is
supposed to get the null distribution of mean(x) - mean(y)
> f(quantile(T,0.05)) # calculating the p-value for a significance of 5%
obviously this doesn't seem to work, what am i missing ?
Upvotes: 1
Views: 2200
Reputation: 567
Your intention is very good -- to calculate statistical significance via bootstrap sampling (aka bootstrapping). However, the mean(sample(x,1000,TRUE))-mean(sample(z,2000,TRUE)) can't work because this is taking an average of 1000 samples of z - an average of 2000 samples of z. This will most certainly be quite close to 0 regardless of the true means of x and y.
I would suggest the following:
diff <- (sample(x, size = 2000, replace = TRUE) - sample(y, size = 2000, replace = TRUE))
2000 samples (with replacement) of both x and y are taken and the difference is calculated. Of course you can increase confidence too by adding replications as you suggested. As opposed to pvalue, I prefer confidence intervals (CI) as I think they are more informative (and equivalent in statistical accuracy to p-values). The CIs can then be calculated as follows using the means and standard errors:
stderror <- sd(diff)/sqrt(length(x))
upperCI <- mean(diff)+stderror
lowerCI <- mean(diff)-stderror
cat(lowerCI, upperCI)
Since the CI does not include 0, the null hypothesis is rejected. Notice that the result will be close to t-test (for your normal example) CI results in R:
t <- t.test(x, y)
cat(t$conf.int)
Upvotes: 1