Reputation: 207
I am calculating the coefficients of multiple linear models using resampling. Previously I used the boot function, but in the future I need to include new statistics in the analysis, so I think this approach is better. A reproducible example:
iris <- iris[,1:4]
nboots <- 100
ncol = ncol(iris)
boot.r.squared <- numeric(nboots)
boot.p_model <- numeric(nboots)
boot.coef_p <- numeric(nboots)
boot.coef_estimate <- matrix(nrow= nboots,ncol=ncol)
boot.coef_error <- matrix(nrow= nboots,ncol=ncol)
boot.coef_t <- matrix(nrow= nboots,ncol=ncol)
boot.coef_p <- matrix(nrow= nboots,ncol=ncol)
for(i in 1:nboots){
  boot.samp <- iris[sample(nrow(iris), size = 100, replace = TRUE), ]
  model <- lm(boot.samp$Sepal.Length ~ ., boot.samp)
  model.sum <- summary(model)
  boot.r.squared[i] <- model.sum$r.squared
  stat <- model.sum$fstatistic
  boot.p_model[i] <- pf(stat[1], stat[2], stat[3], lower.tail = FALSE)
  boot.coef_estimate[i, 1:length(model$coefficients)] <- model$coefficients[1]
  boot.coef_error[i, 1:length(model$coefficients)] <- model$coefficients[2]
  boot.coef_t[i, 1:length(model$coefficients)] <- model$coefficients[3]
  boot.coef_p[i, 1:length(model$coefficients)] <- model$coefficients[4]
}
But I can't correctly save the coefficients in the form of a matrix. I would like to save four matrices containing, respectively, the parameter estimate, the standard error, the t statistic and the p value. With this code all columns come out the same. I tried putting [, 1] to save the first column, but then this error happens. How can I fix it?
Error in model$coefficients[, 1] : incorrect number of dimensions
Upvotes: 2
Views: 209
Reputation: 46898
Regarding your original code, you mixed up model and model.sum. model$coefficients is just a named vector of estimates (which is why indexing it with [, 1] gives "incorrect number of dimensions"), whereas summary(model)$coefficients is the matrix that holds the estimates, standard errors, t values and p values. So you need to pull the coefficients from model.sum when you fill in the estimate, error, etc.
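A quick look at the two objects shows why indexing model$coefficients with [, 1] fails (a minimal sketch, assuming the iris subset from your question):
fit <- lm(Sepal.Length ~ ., data = iris[, 1:4])
str(fit$coefficients)            # named numeric vector of estimates: nothing to index with [, 1]
head(summary(fit)$coefficients)  # matrix with columns Estimate, Std. Error, t value, Pr(>|t|)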
See below:
for(i in 1:nboots){
  boot.samp <- iris[sample(nrow(iris), size = 100, replace = TRUE), ]
  model <- lm(boot.samp$Sepal.Length ~ ., boot.samp)
  model.sum <- summary(model)
  boot.r.squared[i] <- model.sum$r.squared
  stat <- model.sum$fstatistic
  boot.p_model[i] <- pf(stat[1], stat[2], stat[3], lower.tail = FALSE)
  # this is the part where you need to use model.sum
  boot.coef_estimate[i, 1:length(model$coefficients)] <- model.sum$coefficients[, 1]
  boot.coef_error[i, 1:length(model$coefficients)] <- model.sum$coefficients[, 2]
  boot.coef_t[i, 1:length(model$coefficients)] <- model.sum$coefficients[, 3]
  boot.coef_p[i, 1:length(model$coefficients)] <- model.sum$coefficients[, 4]
}
As a side comment: what you really want from the bootstrap is to see how variable the coefficient estimates are. For the other statistics, like R-squared or the model p-value, interpreting them after bootstrapping is not very useful, so think about what you actually want to learn from the bootstrap.
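For example, once the corrected loop has run, the spread of the bootstrapped estimates is what carries the information (a sketch using the boot.coef_estimate matrix filled above):
# bootstrap standard error of each coefficient (one column per coefficient)
apply(boot.coef_estimate, 2, sd)
# simple percentile intervals per coefficient
apply(boot.coef_estimate, 2, quantile, probs = c(0.025, 0.975))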
Also check out @jay.sf's answer; it's a much cleaner way to wrap your calculation into a function.
Upvotes: 1
Reputation: 72813
Your code works for me without errors. You probably experimented with different numbers of repetitions and forgot to update the sizes of the empty objects you define before the loop, since I could reproduce your error when I changed to nboots <- 1000.
You could do that without a for loop and use replicate() instead (which is basically a wrapper for sapply()). The advantage is that it might be a little more straightforward and you won't need to define the empty objects beforehand.
The procedure is, first, to define a bootstrap function, say bootFun(); second, to run the actual bootstrap using replicate(), which yields an array; and third, to process the data in the array into the desired matrices.
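To see why replicate() yields an array here: with its default simplify = "array", a function that returns a matrix, repeated n times, gets stacked along a third dimension, as in this minimal sketch:
# a 2 x 2 matrix replicated 3 times gives a 2 x 2 x 3 array
dim(replicate(3, matrix(rnorm(4), nrow = 2)))
# [1] 2 2 3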
# Step 1 bootstrap function
bootFun <- function(X) {
  fit <- lm(Sepal.Length ~ ., data=X)
  s <- summary(fit)
  r2 <- s$r.squared
  f <- s$fstatistic
  p.f <- pf(f[1], f[2], f[3], lower.tail = FALSE)
  ms <- s$coefficients
  return(cbind(ms, r2, p.f))
}
# Step 2 bootstrap
nboots <- 1e2
set.seed(42) # for sake of reproducibility
A <- replicate(nboots, bootFun(iris[sample(1:nrow(iris), size=nrow(iris), replace=TRUE), ]))
# Note: here I used `size=nrow(iris)`, but you could also use size=100 if that is your method
# Step 3 process array ("de-transpose" and transform to list)
Ap <- aperm(A, c(3, 1, 2)) # aperm() is similar to t() for matrices
Aps <- asplit(Ap, 3) # split the array into a handy list
# or in one step
Aps <- asplit(aperm(A, c(3, 1, 2)), 3)
Now you have a list containing all the matrices; look at the names of the list objects:
names(Aps)
# [1] "Estimate" "Std. Error" "t value" "Pr(>|t|)" "r2" "p.f"
So if we want, e.g., the matrix of our estimates, we simply do:
estimates <- Aps$Estimate
estimates[1:3, ]
# (Intercept) Sepal.Width Petal.Length Petal.Width
# [1,] 1.353531 0.7655760 0.8322749 -0.7775090
# [2,] 1.777431 0.6653308 0.7353491 -0.6024095
# [3,] 2.029428 0.5825554 0.6941457 -0.4795787
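From that matrix you can then summarize the bootstrap distribution of each coefficient, e.g. with percentile intervals (a sketch using the estimates object from above):
# 2.5%, 50% and 97.5% bootstrap percentiles, one row per coefficient
t(apply(estimates, 2, quantile, probs = c(0.025, 0.5, 0.975)))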
Note that the F-statistic p value (p.f) and the R2 (r2) are also matrices, with the same value duplicated across the coefficients in each row. Since you only need one value per replication, just extract e.g. the first column:
Aps$p.f[1:3, ]
# (Intercept) Sepal.Width Petal.Length Petal.Width
# [1,] 2.759019e-65 2.759019e-65 2.759019e-65 2.759019e-65
# [2,] 5.451912e-66 5.451912e-66 5.451912e-66 5.451912e-66
# [3,] 3.288712e-54 3.288712e-54 3.288712e-54 3.288712e-54
Aps$p.f[1:3, 1]
# [1] 2.759019e-65 5.451912e-66 3.288712e-54
The speed of both approaches is practically the same, with a small advantage for replicate(). Here is a benchmark comparing both methods, using nboots <- 1000 replications.
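A benchmark of this kind can be set up with the microbenchmark package, roughly as below (a sketch; forBoot() and repBoot() are hypothetical wrappers that run the for loop and the replicate() approach, respectively, each with nboots <- 1000):
library(microbenchmark)
microbenchmark(
  forloop   = forBoot(),   # hypothetical wrapper around the for-loop version
  replicate = repBoot(),   # hypothetical wrapper around the replicate() version
  times = 100
)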
# Unit: seconds
# expr min lq mean median uq max neval cld
# forloop 2.215259 2.235797 2.332234 2.381035 2.401933 2.700622 100 a
# replicate 2.218291 2.240570 2.313526 2.257905 2.400532 2.601958 100 a
Upvotes: 2