Reputation: 273
I have a negative binomial regression model where I predict Twitter messages' retweet count based on their use of certain word types (ME words, Moral words, and Emotional words):
M1 <- glm.nb(retweetCount ~ ME_words + Moral_words + Emo_words, data = Tweets)
I now want to sample with bootstrapping (for instance, samples of 1,000 with replacement from the dataframe's original 500,000 messages) from the large dataset, Tweets
, to run iterations of the model and analyse the variance of the coefficients. What is the best approach to doing this? I am assuming I need to use the boot
package, but I am a bit lost with where to start.
Ideally, I would like to create a for loop that can run a number of iterations, and then store the coefficients of each model iteration in a new dataframe. This would be extremely useful for future analyses.
Here is some reproducible data from the much much large dataframe Tweets
:
>dput((head(Tweets, 100)))
structure(list(retweetCount = c(1388, 762, 748, 436, 342, 320,
312, 295, 264, 251, 196, 190, 175, 167, 165, 163, 149, 148, 148,
146, 133, 132, 126, 124, 122, 122, 121, 120, 118, 118, 114, 113,
112, 110, 108, 107, 104, 101, 100, 96, 95, 94, 93, 92, 90, 90,
89, 89, 87, 86, 84, 83, 83, 83, 82, 82, 82, 82, 78, 78, 78, 76,
76, 76, 76, 74, 74, 73, 73, 72, 72, 71, 70, 70, 70, 70, 69, 69,
69, 68, 68, 67, 65, 65, 65, 65, 63, 62, 62, 61, 61, 61, 61, 60,
60, 59, 59, 59, 59, 58), ME_words = c(2, 2, 2, 0, 0, 1, 1, 0,
1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1,
0, 3, 0, 1, 0, 1, 1, 4, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0,
0, 0, 2, 2, 0, 0, 1, 0, 1, 0, 0, 2, 0, 0, 0, 1, 0, 1, 1, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 1, 0, 0), Moral_words = c(0, 0, 1, 1, 1, 2, 0,
0, 0, 1, 0, 1, 2, 0, 1, 1, 1, 2, 0, 1, 0, 1, 1, 0, 2, 0, 1, 1,
1, 0, 1, 1, 1, 1, 0, 2, 0, 1, 1, 1, 2, 0, 1, 1, 1, 1, 0, 1, 0,
0, 5, 1, 1, 1, 1, 2, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 2, 0, 0, 0,
1, 1, 2, 0, 0, 0, 0, 0, 1, 3, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
1, 1, 0, 0, 2, 2, 1, 0, 0), Emo_words = c(0, 0, 1, 1, 0, 0, 2,
0, 1, 0, 2, 0, 1, 0, 1, 2, 0, 3, 1, 1, 2, 0, 0, 0, 0, 0, 1, 1,
1, 2, 0, 1, 0, 0, 0, 1, 0, 1, 0, 2, 0, 0, 1, 0, 1, 1, 2, 0, 0,
1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 3, 0, 0, 2, 0, 0, 1, 0,
1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 2, 2, 1, 0, 0, 0, 0, 2, 1, 0, 0,
1, 0, 0, 1, 2, 2, 0, 0, 0)), row.names = c(NA, -100L), class = c("tbl_df",
"tbl", "data.frame"))
Upvotes: 2
Views: 4689
Reputation: 226007
You can use the boot
package, but for simple versions of the bootstrap it's almost simpler to roll your own.
library(MASS)
M1 <- glm.nb(retweetCount ~ ME_words + Moral_words +
Emo_words, data = Tweets)
nboot <- 1000
bres <- matrix(NA,nrow=nboot,
ncol=length(coef(M1)),
dimnames=list(rep=seq(nboot),
coef=names(coef(M1))))
set.seed(101)
bootsize <- 200
for (i in seq(nboot)) {
bdat <- Tweets[sample(nrow(Tweets),size=bootsize,replace=TRUE),]
bfit <- update(M1, data=bdat) ## refit with new data
bres[i,] <- coef(bfit)
}
data.frame(mean_est=colMeans(bres),
t(apply(bres,2,quantile,c(0.025,0.975))))
Upvotes: 3