j_3265
j_3265

Reputation: 207

Get multiple regression coefficient values calculated by resampling

I am calculating the coefficients of multiple linear models using resampling. Before I was using the boot function, but in the future need to include new statistics in the analysis so I think this way is better. A replicable example:

iris <- iris[,1:4]

nboots <- 100
ncol = ncol(iris)
boot.r.squared <- numeric(nboots)      
boot.p_model <- numeric(nboots)
boot.coef_p <- numeric(nboots)
boot.coef_estimate <- matrix(nrow= nboots,ncol=ncol)
boot.coef_error <- matrix(nrow= nboots,ncol=ncol)
boot.coef_t <- matrix(nrow= nboots,ncol=ncol)
boot.coef_p <- matrix(nrow= nboots,ncol=ncol)

for(i in 1:nboots){
  boot.samp <- iris[sample(nrow(iris),size = 100, replace = TRUE,), ] 
  model <- lm(boot.samp$Sepal.Length ~ .,boot.samp)
  model.sum <- summary(model)

  boot.r.squared[i] <- model.sum$r.squared
  stat <- model.sum$fstatistic
  boot.p_model[i] <- pf(stat[1], stat[2], stat[3], lower.tail = FALSE)

  boot.coef_estimate[i, 1:length(model$coefficients)] <- model$coefficients[1]
  boot.coef_error[i, 1:length(model$coefficients)] <- model$coefficients[2]
  boot.coef_t[i, 1:length(model$coefficients)] <- model$coefficients[3]
  boot.coef_p[i, 1:length(model$coefficients)] <- model$coefficients[4] 
}

But I can't correctly save the coefficients in the form of a matrix. I would like to save 4 matrices containing in each: parameter, the error, statistic t and p value.

With this code all columns are the same. I tried put [,1] to save the first column, but this error happens. How can i fix this error?

Error in model $ coefficients [, 1]: incorrect number of dimensions

Upvotes: 2

Views: 209

Answers (2)

StupidWolf
StupidWolf

Reputation: 46898

Regarding your original code, you mixed up model and model.sum. You need to get the coefficients of summary(model) when you slot in coefficients, error, etc.

See below:

for(i in 1:nboots){
  boot.samp <- iris[sample(nrow(iris),size = 100, replace = TRUE,), ] 
  model <- lm(boot.samp$Sepal.Length ~ .,boot.samp)
  model.sum <- summary(model)

  boot.r.squared[i] <- model.sum$r.squared
  stat <- model.sum$fstatistic
  boot.p_model[i] <- pf(stat[1], stat[2], stat[3], lower.tail = FALSE)

  # this is the part where you need to use model.sum
  boot.coef_estimate[i, 1:length(model$coefficients)] <- model.sum$coefficients[,1]
  boot.coef_error[i, 1:length(model$coefficients)] <- model.sum$coefficients[,2]
  boot.coef_t[i, 1:length(model$coefficients)] <- model.sum$coefficients[,3]
  boot.coef_p[i, 1:length(model$coefficients)] <- model.sum$coefficients[,4] 
}

As a side comment,you want to know how variable is the estimate of the coefficient using boot. For the other statistics like r-square, p-value, interpreting these after bootstrap, is not so useful. So think about what you want to know from the bootstrap.

Also check out @jay.sf's answer, it's a much cleaner way to wrap your calculation functions..

Upvotes: 1

jay.sf
jay.sf

Reputation: 72813

Your code works for me without errors. You probably experimented with different numbers of repetitions and forgot to update the sizes of the empty objects you define before the loop, since I could reprocude your error when I changed to nboots <- 1000.

You could do that without a for loop and use replicate() instead (which is basically a wrapper for lapply()). The advantage is that it might be a little more straightforward and you won't need to define the empty objects beforehand.

The procedure is, first, to define a bootstrap function, say bootFun(), second, the actual bootstrap using replicate() which yields an array, and finally third, to process the data in the array into the desired matrices.

# Step 1 bootstrap function
bootFun <- function(X) {
  fit <- lm(Sepal.Length ~ ., data=X)
  s <- summary(fit)
  r2 <- s$r.squared
  f <- s$fstatistic
  p.f <- pf(f[1], f[2], f[3], lower.tail = FALSE)
  ms <- s$coefficients
  return(cbind(ms, r2, p.f))
}

# Step 2 bootstrap
nboots <- 1e2
set.seed(42)  # for sake of reproducibility
A <- replicate(nboots, bootFun(iris[sample(1:nrow(iris), size=nrow(iris), replace=TRUE), ]))
# Note: I used here `size=nrow(iris)`, but you also could use size=100 if this is your method

# Step 3 process array ("de-transpose" and transform to list)
Ap <- aperm(A, c(3, 1, 2))  # aperm() is similar to t() for matrices
Aps <- asplit(Ap, 3)  # split the array into a handy list

# or in one step
Aps <- asplit(aperm(A, c(3, 1, 2)), 3)

Result

Now you have a list containing all the matrices, look at the names of the list objects.

names(Aps)
# [1] "Estimate"   "Std. Error" "t value"    "Pr(>|t|)"   "r2"         "p.f"

So, if we e.g. want the matrix of our estimates, we simply do:

estimates <- Aps$Estimate

estimates[1:3, ]
#      (Intercept) Sepal.Width Petal.Length Petal.Width
# [1,]    1.353531   0.7655760    0.8322749  -0.7775090
# [2,]    1.777431   0.6653308    0.7353491  -0.6024095
# [3,]    2.029428   0.5825554    0.6941457  -0.4795787

Note, that the F-statistics and the R2 are also matrices and duplicated in each row for every coefficient. Since you only need one, just extract e.g. the first column:

Aps$p.f[1:3, ]
#       (Intercept)  Sepal.Width Petal.Length  Petal.Width
# [1,] 2.759019e-65 2.759019e-65 2.759019e-65 2.759019e-65
# [2,] 5.451912e-66 5.451912e-66 5.451912e-66 5.451912e-66
# [3,] 3.288712e-54 3.288712e-54 3.288712e-54 3.288712e-54

Aps$p.f[1:3, 1]
# [1] 2.759019e-65 5.451912e-66 3.288712e-54

Benchmark

The speed of both approaches is practically the same with a small advantage for replicate(). Here a benchmark comparing both methods where I used nboot=1,000 replications.

# Unit: seconds
#      expr      min       lq     mean   median       uq      max neval cld
#   forloop 2.215259 2.235797 2.332234 2.381035 2.401933 2.700622   100   a
# replicate 2.218291 2.240570 2.313526 2.257905 2.400532 2.601958   100   a

Upvotes: 2

Related Questions