Reputation: 207
I am calculating the coefficients of multiple linear models using resampling. Previously I used the boot function, but in the future I need to include new statistics in the analysis, so I think this approach is better. A reproducible example:
iris <- iris[,1:4]
nboots <- 100
ncol = ncol(iris)
boot.r.squared <- numeric(nboots)
boot.p_model <- numeric(nboots)
boot.coef_p <- numeric(nboots)
boot.coef_estimate <- matrix(nrow= nboots,ncol=ncol)
boot.coef_error <- matrix(nrow= nboots,ncol=ncol)
boot.coef_t <- matrix(nrow= nboots,ncol=ncol)
boot.coef_p <- matrix(nrow= nboots,ncol=ncol)
for(i in 1:nboots){
  boot.samp <- iris[sample(nrow(iris), size = 100, replace = TRUE), ]
  model <- lm(boot.samp$Sepal.Length ~ ., boot.samp)
  model.sum <- summary(model)
  boot.r.squared[i] <- model.sum$r.squared
  stat <- model.sum$fstatistic
  boot.p_model[i] <- pf(stat[1], stat[2], stat[3], lower.tail = FALSE)
  boot.coef_estimate[i, 1:length(model$coefficients)] <- model$coefficients[1]
  boot.coef_error[i, 1:length(model$coefficients)] <- model$coefficients[2]
  boot.coef_t[i, 1:length(model$coefficients)] <- model$coefficients[3]
  boot.coef_p[i, 1:length(model$coefficients)] <- model$coefficients[4]
}
But I can't correctly save the coefficients in the form of a matrix. I would like to save four matrices containing, respectively, the parameter estimate, the standard error, the t statistic and the p value. With this code all columns come out the same. I tried putting [, 1] to save the first column, but then this error happens. How can I fix it?
Error in model$coefficients[, 1] : incorrect number of dimensions
Upvotes: 2
Views: 209
Reputation: 46898
Regarding your original code, you mixed up model and model.sum. model$coefficients is just a named vector of estimates (which is why indexing it with [, 1] gives "incorrect number of dimensions"), whereas summary(model)$coefficients is the matrix that holds the estimates, standard errors, t values and p values. So you need to pull the coefficients from model.sum when you fill in the estimate, error, etc.
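A quick look at the two objects shows why indexing model$coefficients with [, 1] fails (a minimal sketch, assuming the iris subset from your question):
fit <- lm(Sepal.Length ~ ., data = iris[, 1:4])
str(fit$coefficients)            # named numeric vector of estimates: nothing to index with [, 1]
head(summary(fit)$coefficients)  # matrix with columns Estimate, Std. Error, t value, Pr(>|t|)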
See below:
for(i in 1:nboots){
  boot.samp <- iris[sample(nrow(iris), size = 100, replace = TRUE), ]
  model <- lm(boot.samp$Sepal.Length ~ ., boot.samp)
  model.sum <- summary(model)
  boot.r.squared[i] <- model.sum$r.squared
  stat <- model.sum$fstatistic
  boot.p_model[i] <- pf(stat[1], stat[2], stat[3], lower.tail = FALSE)
  # this is the part where you need to use model.sum
  boot.coef_estimate[i, 1:length(model$coefficients)] <- model.sum$coefficients[, 1]
  boot.coef_error[i, 1:length(model$coefficients)] <- model.sum$coefficients[, 2]
  boot.coef_t[i, 1:length(model$coefficients)] <- model.sum$coefficients[, 3]
  boot.coef_p[i, 1:length(model$coefficients)] <- model.sum$coefficients[, 4]
}
As a side comment: what you really want from the bootstrap is to see how variable the coefficient estimates are. For the other statistics, like R-squared or the model p-value, interpreting them after bootstrapping is not very useful, so think about what you actually want to learn from the bootstrap.
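For example, once the corrected loop has run, the spread of the bootstrapped estimates is what carries the information (a sketch using the boot.coef_estimate matrix filled above):
# bootstrap standard error of each coefficient (one column per coefficient)
apply(boot.coef_estimate, 2, sd)
# simple percentile intervals per coefficient
apply(boot.coef_estimate, 2, quantile, probs = c(0.025, 0.975))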
Also check out @jay.sf's answer; it's a much cleaner way to wrap your calculation into a function.
Upvotes: 1
Reputation: 72813
Your code works for me without errors. You probably experimented with different numbers of repetitions and forgot to update the sizes of the empty objects you define before the loop, since I could reproduce your error when I changed to nboots <- 1000.
You could do that without a for loop and use replicate() instead (which is basically a wrapper for sapply()). The advantage is that it might be a little more straightforward and you won't need to define the empty objects beforehand.
The procedure is, first, to define a bootstrap function, say bootFun(); second, to run the actual bootstrap using replicate(), which yields an array; and third, to process the data in the array into the desired matrices.
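To see why replicate() yields an array here: with its default simplify = "array", a function that returns a matrix, repeated n times, gets stacked along a third dimension, as in this minimal sketch:
# a 2 x 2 matrix replicated 3 times gives a 2 x 2 x 3 array
dim(replicate(3, matrix(rnorm(4), nrow = 2)))
# [1] 2 2 3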
# Step 1 bootstrap function
bootFun <- function(X) {
  fit <- lm(Sepal.Length ~ ., data=X)
  s <- summary(fit)
  r2 <- s$r.squared
  f <- s$fstatistic
  p.f <- pf(f[1], f[2], f[3], lower.tail = FALSE)
  ms <- s$coefficients
  return(cbind(ms, r2, p.f))
}
# Step 2 bootstrap
nboots <- 1e2
set.seed(42) # for sake of reproducibility
A <- replicate(nboots, bootFun(iris[sample(1:nrow(iris), size=nrow(iris), replace=TRUE), ]))
# Note: here I used `size=nrow(iris)`, but you could also use size=100 if that is your method
# Step 3 process array ("de-transpose" and transform to list)
Ap <- aperm(A, c(3, 1, 2)) # aperm() is similar to t() for matrices
Aps <- asplit(Ap, 3) # split the array into a handy list
# or in one step
Aps <- asplit(aperm(A, c(3, 1, 2)), 3)
Now you have a list containing all the matrices; look at the names of the list objects:
names(Aps)
# [1] "Estimate" "Std. Error" "t value" "Pr(>|t|)" "r2" "p.f"
So if we want, e.g., the matrix of our estimates, we simply do:
estimates <- Aps$Estimate
estimates[1:3, ]
# (Intercept) Sepal.Width Petal.Length Petal.Width
# [1,] 1.353531 0.7655760 0.8322749 -0.7775090
# [2,] 1.777431 0.6653308 0.7353491 -0.6024095
# [3,] 2.029428 0.5825554 0.6941457 -0.4795787
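From that matrix you can then summarize the bootstrap distribution of each coefficient, e.g. with percentile intervals (a sketch using the estimates object from above):
# 2.5%, 50% and 97.5% bootstrap percentiles, one row per coefficient
t(apply(estimates, 2, quantile, probs = c(0.025, 0.5, 0.975)))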
Note that the F-statistic p value (p.f) and the R2 (r2) are also matrices, with the same value duplicated across the coefficients in each row. Since you only need one value per replication, just extract e.g. the first column:
Aps$p.f[1:3, ]
# (Intercept) Sepal.Width Petal.Length Petal.Width
# [1,] 2.759019e-65 2.759019e-65 2.759019e-65 2.759019e-65
# [2,] 5.451912e-66 5.451912e-66 5.451912e-66 5.451912e-66
# [3,] 3.288712e-54 3.288712e-54 3.288712e-54 3.288712e-54
Aps$p.f[1:3, 1]
# [1] 2.759019e-65 5.451912e-66 3.288712e-54
The speed of both approaches is practically the same, with a small advantage for replicate(). Here is a benchmark comparing both methods, using nboots <- 1000 replications.
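A benchmark of this kind can be set up with the microbenchmark package, roughly as below (a sketch; forBoot() and repBoot() are hypothetical wrappers that run the for loop and the replicate() approach, respectively, each with nboots <- 1000):
library(microbenchmark)
microbenchmark(
  forloop   = forBoot(),   # hypothetical wrapper around the for-loop version
  replicate = repBoot(),   # hypothetical wrapper around the replicate() version
  times = 100
)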
# Unit: seconds
# expr min lq mean median uq max neval cld
# forloop 2.215259 2.235797 2.332234 2.381035 2.401933 2.700622 100 a
# replicate 2.218291 2.240570 2.313526 2.257905 2.400532 2.601958 100 a
Upvotes: 2