user72716

Reputation: 273

Bootstrapping with glm model

I have a negative binomial regression model where I predict Twitter messages' retweet count based on their use of certain word types (ME words, Moral words, and Emotional words):

M1 <- glm.nb(retweetCount ~ ME_words + Moral_words + Emo_words, data = Tweets)

I now want to draw bootstrap samples from the large dataset Tweets (for instance, samples of 1,000 messages drawn with replacement from the original 500,000), refit the model on each sample, and analyse the variance of the coefficients. What is the best approach? I am assuming I need to use the boot package, but I am a bit lost as to where to start.

Ideally, I would like to create a for loop that runs a number of iterations and stores the coefficients from each iteration in a new data frame. This would be extremely useful for future analyses.


Here is some reproducible data from the much larger dataframe Tweets:

dput(head(Tweets, 100))

structure(list(retweetCount = c(1388, 762, 748, 436, 342, 320, 
312, 295, 264, 251, 196, 190, 175, 167, 165, 163, 149, 148, 148, 
146, 133, 132, 126, 124, 122, 122, 121, 120, 118, 118, 114, 113, 
112, 110, 108, 107, 104, 101, 100, 96, 95, 94, 93, 92, 90, 90, 
89, 89, 87, 86, 84, 83, 83, 83, 82, 82, 82, 82, 78, 78, 78, 76, 
76, 76, 76, 74, 74, 73, 73, 72, 72, 71, 70, 70, 70, 70, 69, 69, 
69, 68, 68, 67, 65, 65, 65, 65, 63, 62, 62, 61, 61, 61, 61, 60, 
60, 59, 59, 59, 59, 58), ME_words = c(2, 2, 2, 0, 0, 1, 1, 0, 
1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 
0, 3, 0, 1, 0, 1, 1, 4, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 
0, 0, 2, 2, 0, 0, 1, 0, 1, 0, 0, 2, 0, 0, 0, 1, 0, 1, 1, 0, 0, 
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 1, 0, 0, 0, 1, 0, 0), Moral_words = c(0, 0, 1, 1, 1, 2, 0, 
0, 0, 1, 0, 1, 2, 0, 1, 1, 1, 2, 0, 1, 0, 1, 1, 0, 2, 0, 1, 1, 
1, 0, 1, 1, 1, 1, 0, 2, 0, 1, 1, 1, 2, 0, 1, 1, 1, 1, 0, 1, 0, 
0, 5, 1, 1, 1, 1, 2, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 
1, 1, 2, 0, 0, 0, 0, 0, 1, 3, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 
1, 1, 0, 0, 2, 2, 1, 0, 0), Emo_words = c(0, 0, 1, 1, 0, 0, 2, 
0, 1, 0, 2, 0, 1, 0, 1, 2, 0, 3, 1, 1, 2, 0, 0, 0, 0, 0, 1, 1, 
1, 2, 0, 1, 0, 0, 0, 1, 0, 1, 0, 2, 0, 0, 1, 0, 1, 1, 2, 0, 0, 
1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 3, 0, 0, 2, 0, 0, 1, 0, 
1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 2, 2, 1, 0, 0, 0, 0, 2, 1, 0, 0, 
1, 0, 0, 1, 2, 2, 0, 0, 0)), row.names = c(NA, -100L), class = c("tbl_df", 
"tbl", "data.frame"))

Upvotes: 2

Views: 4689

Answers (1)

Ben Bolker

Reputation: 226007

You can use the boot package, but for simple versions of the bootstrap it's almost simpler to roll your own.

fit initial model

library(MASS)
M1 <- glm.nb(retweetCount ~ ME_words + Moral_words + 
                 Emo_words, data = Tweets)

set up data structure for results

nboot <- 1000   ## number of bootstrap replicates
bres <- matrix(NA,nrow=nboot,
                  ncol=length(coef(M1)),
                  dimnames=list(rep=seq(nboot),
                                coef=names(coef(M1))))

bootstrap

set.seed(101)
bootsize <- 200   ## rows per bootstrap sample (e.g. 1000 for the full data set)
for (i in seq(nboot)) {
  bdat <- Tweets[sample(nrow(Tweets),size=bootsize,replace=TRUE),]  ## resample rows with replacement
  bfit <- update(M1, data=bdat)  ## refit with new data
  bres[i,] <- coef(bfit)         ## store coefficients for this replicate
}
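
(If, as in the question, you want to keep the raw per-iteration coefficients in a data frame rather than a matrix, the result converts directly; bres_df is just an illustrative name.)

bres_df <- as.data.frame(bres)   ## one row of coefficients per bootstrap replicate
head(bres_df)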

structure output

data.frame(mean_est=colMeans(bres),
           t(apply(bres,2,quantile,c(0.025,0.975))))  ## bootstrap means and 95% percentile intervals
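
If you do want to use the boot package mentioned above, the same model refit can be wrapped in a statistic function. This is only a sketch (coef_fun and b are illustrative names); note that boot's ordinary bootstrap resamples rows with replacement at the full size of the data, so the 1,000-row subsamples described in the question are easier to get with the loop above.

library(boot)
## statistic: refit the model on the resampled rows, return its coefficients
coef_fun <- function(data, idx) {
  coef(glm.nb(retweetCount ~ ME_words + Moral_words + Emo_words,
              data = data[idx, ]))
}
set.seed(101)
b <- boot(Tweets, statistic = coef_fun, R = 500)   ## e.g. 500 replicates
head(b$t)                              ## one row of coefficients per replicate
boot.ci(b, type = "perc", index = 1)   ## percentile CI for the intercept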

Upvotes: 3
