SL42

Reputation: 221

How to perform a bootstrap and find confidence interval in R

I want to write a custom bootstrap function because I want to better understand what the bootstrap is doing, and the existing bootstrap libraries do not seem to solve my issue.

The problem: I would like to write my own Wald confidence interval function that takes in the bootstrap data, outputs the confidence interval, tests whether the true value lies within the interval, and computes the coverage.

Right now, I am getting this type of error:

Error in bootresults[i, ] <- waldCI(y = bootdata[i], n = numTrials) : 
  number of items to replace is not a multiple of replacement length

The goal: I want bootresults to return four columns (the value of p, the lower bound, the upper bound, and whether or not p falls inside the interval), and to produce a graph similar to this one:

[Figure: Wald interval chart]

Code:

set.seed(42)
samples10 <- list()
i <- 1
while(i < 100) {
  samples10[[i]] <- rbinom(1500, size=10, prob=i*.01)  ## rows=1500 ;columns=10
  i <- i + 1
}
sample10 <- data.frame(samples10)
colnames(sample10) <- c(seq(.01, .99, .01)) ## p-values

waldCI <- function(y, n, alpha=0.05) {
  p <- colSums(y)/(n*200)
  sd <- sqrt(p*((1 - p)/(n*200)))
  z <- qnorm(c(alpha/2, 1 - alpha/2))
  ci <- p + z*sd
  return(ci)
}

B <- 200
numTrials <- 10
bootresults <- matrix(ncol=length(sample10), nrow=B)  ## rows=200, cols=99
                                                      ## empty matrix in the beginning
set.seed(42)

for(i in seq_len(B)) {
  bootdata <- sample10[sample(B, replace=T), ]
  bootresults[i, ] <- waldCI(y=bootdata[i], n=numTrials)
  ## Pseudocode:
  # boot_test_data$in_interval <- 
  #   ifelse(boot_test_data$lower1 < i/100 & i/100 < boot_test_data$upper1, 1, 0)
  # coverage[i] <- sum(boot_test_data$in_interval) / length(boot_test_data$in_interval)
}

Any help is greatly appreciated since I am fairly new to R.

Upvotes: 1

Views: 286

Answers (1)

jay.sf

Reputation: 73397

It looks like you want to initialize a three-dimensional array bootresults rather than a two-dimensional matrix. In your waldCI() you may use colMeans.

waldCI <- function(y, alpha=0.05) {
  p <- colMeans(y)
  se <- sqrt(p*(1 - p)/nrow(y))
  z <- qnorm(1 - alpha/2)
  ci <- p + z*se %*% cbind(lower=-1, upper=1)
  return(ci)
}
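
To see why a single row of a two-dimensional matrix cannot hold the result, it may help to call waldCI() on a small made-up matrix. The sketch below repeats the function so that it runs on its own; the matrix `toy` and its column names are invented purely for illustration.

```r
## Wald interval per column, repeated from above
waldCI <- function(y, alpha=0.05) {
  p <- colMeans(y)
  se <- sqrt(p*(1 - p)/nrow(y))
  z <- qnorm(1 - alpha/2)
  ci <- p + z*se %*% cbind(lower=-1, upper=1)
  return(ci)
}

## hypothetical 0/1 data: 4 observations (rows), 3 columns
toy <- matrix(c(0, 1, 1, 0,
                1, 1, 1, 0,
                0, 0, 1, 0),
              nrow=4, dimnames=list(NULL, c("a", "b", "c")))

ci <- waldCI(toy)
dim(ci)  ## 3 2: one row per column of toy, one column per bound
ci       ## lower and upper bound for each column of toy
```

Each call returns a whole matrix of bounds (one row per probability), so storing one such matrix per replicate calls for a third dimension.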

B <- 200
numTrials <- 10
## initialize array (samples10 and p.values are created in the "Data" section below)
bootresults1 <- array(dim=c(ncol(samples10), 4, B), 
                      dimnames=list(c(), c("p.values", "lower", "upper", "in.int"), c()))

set.seed(42)
for(i in seq_len(B)) {
  samp <- samples10[sample(nrow(samples10), numTrials, replace=F), ]
  ci <- waldCI(samp)
  bootresults1[,,i] <- cbind(p.values, ci, in.int=ci[, 1] < p.values & p.values < ci[, 2])
}

coverage <- rowMeans(bootresults1[,4,])
plot(p.values, coverage, type="l", main="My Plot")
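
The coverage line takes the fourth column (the in-interval indicator) across all replicates and averages it per probability. A minimal sketch of that indexing on a tiny made-up array, where `arr` and its indicator values are hypothetical:

```r
## hypothetical array: 2 probabilities x 4 columns x 3 replicates
arr <- array(NA_real_, dim=c(2, 4, 3),
             dimnames=list(NULL, c("p.values", "lower", "upper", "in.int"), NULL))

## made-up indicators: one column per replicate, one row per probability
arr[, "in.int", ] <- cbind(c(1, 0), c(1, 1), c(0, 1))

arr[, 4, ]                   ## 2 x 3 matrix: probabilities by replicates
cov.toy <- rowMeans(arr[, 4, ])  ## share of replicates whose interval contains p
cov.toy
```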

A similar, though more idiomatic R, approach:

p.values <- seq(.01, .99, .01)
set.seed(42)
samples10 <- `colnames<-`(sapply(p.values, function(pr) rbinom(1.5e3, 1, pr)), p.values)

BOOT <- function(numTrials, ...) {
  samp <- samples10[sample(nrow(samples10), numTrials, replace=F), ]
  ci <- waldCI(samp, ...)
  cbind(p.values, ci, in.int=ci[, 1] < p.values & p.values < ci[, 2])
}

B <- 200
numTrials <- 10

set.seed(42)
bootresults2 <- replicate(B, BOOT(numTrials=numTrials))

stopifnot(all.equal(bootresults1, bootresults2))

Data:

Note that I used rbinom(..., size=1, ...) to create your sample data. The use of "p" as an object name suggested that the data should be binomial.
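
A short sketch of what the size argument changes, using made-up numbers: with size=1 each draw is a single 0/1 outcome, so the mean estimates the probability directly, whereas with size=10 each draw is a count of successes out of 10 and has to be divided by 10 first. The names `pr`, `x1`, and `x10` are illustrative only.

```r
set.seed(42)
pr <- 0.3                              ## hypothetical success probability
x1 <- rbinom(1500, size=1, prob=pr)    ## 0/1 outcomes
x10 <- rbinom(1500, size=10, prob=pr)  ## counts between 0 and 10

range(x1)     ## only zeros and ones
range(x10)    ## counts, so values can exceed 1
mean(x1)      ## estimates pr directly
mean(x10)/10  ## counts must be rescaled by size to estimate pr
```

With 0/1 data, colMeans() in waldCI() gives the estimated proportion per column without any further rescaling.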

set.seed(42)
samples10 <- matrix(nrow=1500, ncol=99, dimnames=list(c(), c(seq(.01, .99, .01))))
i <- 1
while (i < 100) {
  samples10[, i] <- rbinom(1500, size=1, prob=i*.01)  ## fills column i; 1500 rows, 99 columns in total
  i <- i + 1
}

Without a while loop, you could vectorize this:

p.values <- seq(.01, .99, .01)
set.seed(42)
samples10 <- `colnames<-`(sapply(p.values, function(pr) rbinom(1.5e3, 1, pr)), p.values)

Upvotes: 0
