Reputation: 65
I would like to perform a bootstrap linear regression, because of concerns about normally distributed error terms. I have a data set that is big enough to ignore this fact, but I just want to double check if a linear regression model relying on analytically computed standard errors yields the same results as results obtained from a bootstrapped linear regression.
So far, I have used the follwing code:
Rel01 <- subset(
Relevante_Variablen2,
select = c(intr, sale_py_at_py, R_at_py,
inflr, dt, re, txt)
)
x = as.data.frame(sapply(Rel01, as.numeric))
boxplot(x)
library(boot)
i<-nrow(x) #count number of rows for resampling
g<-ncol(x) #count number of columns to step through with bootstrapping
boot.mean<-function(x,i){boot.mean<-mean(x[i])} #bootstrapping function to get the mean
boot_tabl <- apply(x,2,function(y){
b<-boot(y,boot.mean,R=50000);
c(mean(b$t),boot.ci(b,type="perc", conf=0.95)$percent[4:5])
})
View(boot_tabl)
Everything seems to work (output is in the table shown below), but I do not know how to interpret the output from my approach.
^ | sale_py_at_py | intr | inflr | dt | re | txt |
---|---|---|---|---|---|---|
1 | 0.0984438 | 0.01223 | 0.04243 | 200 | 400 | 1200 |
2 | 0.0974530 | 0.01191 | 0.04122 | 190 | 300 | 1000 |
3 | 0.0993230 | 0.01291 | 0.04256 | 210 | 405 | 1210 |
My data looks something like this:
Name | Segment | Sale | Year | Asset | Another header |
---|---|---|---|---|---|
A | 3401 | 10000 | 2000 | 200000 | x |
A | 3401 | 20000 | 2001 | 250000 | x |
B | 2201 | 15000 | 2004 | 280000 | x |
B | 2201 | 23000 | 2009 | 320000 | x |
B | 2201 | 28000 | 2010 | 390000 | x |
C | 2201 | 30000 | 2000 | 210000 | x |
C | 2201 | 18000 | 2004 | 200000 | x |
D | 1 | 28000 | 2000 | 400000 | x |
D | 1 | 38000 | 2001 | 521000 | x |
Could anyone provide some directions on how I could bootstrap my linear regression analysis? And also tell me what the output of my regression is telling me exactly?
Edit:
original_data = Relevante_V03
original_model = lm01
set.seed(123) # fix random number generator for reproducibility
boot_lm <- function(original_data, original_model,
type = c('ordinary', 'param'),
B = 1000L, seed = 123) {
set.seed(seed)
betas_original_model <- coef(original_model)
len_coef <- length(betas_original_model)
mat <- matrix(rep(0L, B * len_coef), ncol = len_coef)
if (type %in% 'ordinary') {
n_rows <- length(residuals(original_model))
for (i in seq_len(B)) {
boot_dat <- original_data[sample(seq_len(n_rows), replace = TRUE), ]
mat[i, ] <- coef(lm(terms(original_model), data = boot_dat))
}
}
if (type %in% 'param') {
X <- model.matrix(delete.response(terms(original_model)),
data = original_data)[, -1L]
for (i in seq_len(B)) {
mat[i, ] <- coef(lm(unlist(simulate(original_model)) ~ X,
data = original_data))
}
}
confints <- matrix(rep(0L, 2L * len_coef), ncol = 2L)
pvals <- numeric(len_coef)
for (i in seq_len(len_coef)) {
pvals[i] <- mean(abs(mat[, i] - mean(mat[, i])) > abs(betas_original_model[i]))
confints[i, ] <- quantile(mat[, i], c(.025, 0.975))
}
names(pvals) <- names(betas_original_model)
out <- data.frame(estimate = betas_original_model,
'lwr' = confints[, 1], 'upr' = confints[, 2],
p_value = pvals)
return(out)
}
# linear model to be bootstrapped
my_split <- split(Relevante_V03, Relevante_V03$sic & Relevante_V03$fdyear) # split Relevante_03 by sic(Segment) & (fd)year
out <- lapply(my_split, function(x) {
lm(marketingspending ~ intr + sale_py_at_py + R_at_py, data = x) # perform linear regression on each company separately
})
ordinary <- lapply(out, function(x) coef(summary(x))) # obtain summary from linear models
# run bootstrap function on each of the levels of Name (company)
## this may take a while, as we have 50 Names (companies)...
for (i in seq_along(out)) {
ordinary[[i]] <- boot_lm(original_data = my_split[[i]], original_model = out[[i]],
type = 'ordinary', B = 10000) # B is number of bootstrap samples
}
# output
ordinary
The output looks like this:
$TRUE
row. | estimate | lwr | upr | p_value |
---|---|---|---|---|
(Intercept) | 15647649 | 14131807 | 17268286 | 0 |
intr | -64880946 | -92974339 | -36369417 | 0 |
sale_py_at_py | -4520320 | -5741252 | -3359742 | 0 |
R_at_py | -1904298824 | -23372044347 | -1564292455 | 0 |
My questions to this are:
-Does everything looks to you fine?
-Why are all the p-Values 0 and how can I get to see a more detailed value of p? Because they should not be 0
Thanks to the comment (@Dion Groothof) I tried the approach. Could someone tell me if I am doing everything right in this approach?
Upvotes: 0
Views: 1063
Reputation: 1456
As requested by the OP, this should be the appropriate way to proceed with in his specific case.
It is not necessary (and I would even advise against it) to change the arguments within the function itself. Instead, specify appropriate character stings and integers as arguments in a call to the function. This should be done as follows.
First we specify our linear model.
# linear model
fm0 <- lm(marketingspending ~ intr + inflr + sale_py_at_py+ R_at_py +
+ dt + re + txt , data = Relevante_V03)
Then, we run the function as is. For more information on the function's arguments, refer to my recent answer.
boot_lm <- function(original_data, original_model,
type = c('ordinary', 'param'),
B = 1000L, seed = 1) {
set.seed(seed)
betas_original_model <- coef(original_model)
len_coef <- length(betas_original_model)
mat <- matrix(rep(0L, B * len_coef), ncol = len_coef)
if (type %in% 'ordinary') {
n_rows <- length(residuals(original_model))
for (i in seq_len(B)) {
boot_dat <- original_data[sample(seq_len(n_rows), replace = TRUE), ]
mat[i, ] <- coef(lm(terms(original_model), data = boot_dat))
}
}
if (type %in% 'param') {
X <- model.matrix(delete.response(terms(original_model)),
data = original_data)[, -1L]
for (i in seq_len(B)) {
mat[i, ] <- coef(lm(unlist(simulate(original_model)) ~ X,
data = original_data))
}
}
confints <- matrix(rep(0L, 2L * len_coef), ncol = 2L)
pvals <- numeric(len_coef)
for (i in seq_len(len_coef)) {
pvals[i] <- mean(abs(mat[, i] - mean(mat[, i])) > abs(betas_original_model[i]))
confints[i, ] <- quantile(mat[, i], c(.025, 0.975))
}
names(pvals) <- names(betas_original_model)
out <- data.frame(estimate = betas_original_model,
'lwr' = confints[, 1], 'upr' = confints[, 2],
p_value = pvals)
return(out)
}
Finally, we specify character strings and integers as arguments in a call to boot_lm()
to have it tailor-made for your specific case.
# non-parametric bootstrap
boot_lm(original_data = Relevante_V03, original_model = fm0,
type = 'ordinary', B = 1e4, seed = 59385)
# parametric bootstrap
boot_lm(original_data = Relevante_V03, original_model = fm0,
type = 'param', B = 1e4, seed = 59385)
Upvotes: 1