Bootstrapping with glm model

Question

I have a negative binomial regression model where I predict Twitter messages' retweet count based on their use of certain word types (ME words, Moral words, and Emotional words):

M1 <- glm.nb(retweetCount ~ ME_words + Moral_words + Emo_words, data = Tweets)

I now want to sample with bootstrapping (for instance, samples of 1,000 with replacement from the dataframe's original 500,000 messages) from the large dataset, Tweets, to run iterations of the model and analyse the variance of the coefficients. What is the best approach to doing this? I am assuming I need to use the boot package, but I am a bit lost with where to start.

Ideally, I would like to create a for loop that can run a number of iterations, and then store the coefficients of each model iteration in a new dataframe. This would be extremely useful for future analyses.

Here is some reproducible data from the much much large dataframe Tweets:

>dput((head(Tweets, 100)))

structure(list(retweetCount = c(1388, 762, 748, 436, 342, 320, 
312, 295, 264, 251, 196, 190, 175, 167, 165, 163, 149, 148, 148, 
146, 133, 132, 126, 124, 122, 122, 121, 120, 118, 118, 114, 113, 
112, 110, 108, 107, 104, 101, 100, 96, 95, 94, 93, 92, 90, 90, 
89, 89, 87, 86, 84, 83, 83, 83, 82, 82, 82, 82, 78, 78, 78, 76, 
76, 76, 76, 74, 74, 73, 73, 72, 72, 71, 70, 70, 70, 70, 69, 69, 
69, 68, 68, 67, 65, 65, 65, 65, 63, 62, 62, 61, 61, 61, 61, 60, 
60, 59, 59, 59, 59, 58), ME_words = c(2, 2, 2, 0, 0, 1, 1, 0, 
1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 
0, 3, 0, 1, 0, 1, 1, 4, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 
0, 0, 2, 2, 0, 0, 1, 0, 1, 0, 0, 2, 0, 0, 0, 1, 0, 1, 1, 0, 0, 
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 1, 0, 0, 0, 1, 0, 0), Moral_words = c(0, 0, 1, 1, 1, 2, 0, 
0, 0, 1, 0, 1, 2, 0, 1, 1, 1, 2, 0, 1, 0, 1, 1, 0, 2, 0, 1, 1, 
1, 0, 1, 1, 1, 1, 0, 2, 0, 1, 1, 1, 2, 0, 1, 1, 1, 1, 0, 1, 0, 
0, 5, 1, 1, 1, 1, 2, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 
1, 1, 2, 0, 0, 0, 0, 0, 1, 3, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 
1, 1, 0, 0, 2, 2, 1, 0, 0), Emo_words = c(0, 0, 1, 1, 0, 0, 2, 
0, 1, 0, 2, 0, 1, 0, 1, 2, 0, 3, 1, 1, 2, 0, 0, 0, 0, 0, 1, 1, 
1, 2, 0, 1, 0, 0, 0, 1, 0, 1, 0, 2, 0, 0, 1, 0, 1, 1, 2, 0, 0, 
1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 3, 0, 0, 2, 0, 0, 1, 0, 
1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 2, 2, 1, 0, 0, 0, 0, 2, 1, 0, 0, 
1, 0, 0, 1, 2, 2, 0, 0, 0)), row.names = c(NA, -100L), class = c("tbl_df", 
"tbl", "data.frame"))

Ben Bolker · Accepted Answer

You can use the boot package, but for simple versions of the bootstrap it's almost simpler to roll your own.

fit initial model

library(MASS)
M1 <- glm.nb(retweetCount ~ ME_words + Moral_words + 
                 Emo_words, data = Tweets)

set up data structure for results

nboot <- 1000
bres <- matrix(NA,nrow=nboot,
                  ncol=length(coef(M1)),
                  dimnames=list(rep=seq(nboot),
                                coef=names(coef(M1))))

bootstrap

set.seed(101)
bootsize <- 200
for (i in seq(nboot)) {
  bdat <- Tweets[sample(nrow(Tweets),size=bootsize,replace=TRUE),]
  bfit <- update(M1, data=bdat)  ## refit with new data
  bres[i,] <- coef(bfit)
}

structure output

data.frame(mean_est=colMeans(bres),
      t(apply(bres,2,quantile,c(0.025,0.975))))

Bootstrapping with glm model

Answers (1)

fit initial model

set up data structure for results

bootstrap

structure output

Related Questions