Reputation: 1716

how to compare zero-inflated negative binomial regression outputs side-by-side, with and without clustered errors

There are many ways to compare regression models in a side-by-side table in R, including the packages stargazer, huxtable, and gtsummary. I'm struggling to do this with two zero-inflated negative binomial models when one has clustered standard errors and one doesn't.

The trouble lies in the different object classes. The clustered error model is in class "coeftest" from the pscl package while the unclustered model is class "zeroinfl" from the lmtest package.

Here is a simple example.

library(tidyverse)
library(pscl)
library(sandwich)
library(lmtest)

## data
data("bioChemists", package = "pscl")

set.seed(3.14)
zinb <- read.csv("https://stats.idre.ucla.edu/stat/data/fish.csv") %>%
  # add some random data representing lakes
  mutate(lake = sample(c("lake1", "lake2", "lake3"), n(), replace = TRUE))

# zero inflated model without clusters
no.clusters <- zeroinfl(count ~ child + camper,
                        data = zinb, dist = "negbin")

# cluster by lake
with.clusters <- zeroinfl(count ~ child + camper + factor(lake),
                          data = zinb, dist = "negbin")
v_lake = vcovCL(with.clusters, type = "HC1", cluster = ~lake)
with.clusters.final <- coeftest(with.clusters, v_lake)

How can I generate a table comparing the objects no.cluster and with.clusters.final?

Upvotes: 1

Answers (3)

Vincent

Reputation: 17715

The modelsummary package accepts a vcov argument which can be a list of strings (e.g., "robust" or "classical") or a one-sided formula with the cluster variable(s). Under the hood, it will then automagically call functions from the sandwich package to compute the uncertainty estimates. (Disclaimer: I am the package maintainer).

Note that a note is automatically appended to the bottom of the table to indicate the kind of standard error.

library(pscl)
library(modelsummary)

set.seed(3.14)
zinb <- read.csv("https://stats.idre.ucla.edu/stat/data/fish.csv")
zinb$lake <- sample(c("lake1", "lake2", "lake3"), nrow(zinb), replace = TRUE)

mod1 <- zeroinfl(count ~ child + camper,
                 data = zinb, dist = "negbin")

mod2 <- zeroinfl(count ~ child + camper + factor(lake),
                 data = zinb, dist = "negbin")

modelsummary(list(mod1, mod2),
             vcov = list("classical", ~lake),
             stars = TRUE)

Upvotes: 1

dash2

Reputation: 2262

Here's a simpler approach: use coeftest for the no-cluster object, but without clusters. You can then use e.g. huxtable::huxreg() to print out your coefficients:

with.clusters.final <- coeftest(with.clusters, v_lake, save = TRUE)
no.clusters.coeftest <- coeftest(no.clusters, save = TRUE)
huxreg(no.clusters.coeftest, with.clusters.final)
                        ───────────────────────────────────────────────────────
                                                            (1)           (2)  
                                                  ─────────────────────────────
                          count_(Intercept)           1.052 ***     1.087 ***  
                                                     (0.270)       (0.116)     
                          count_child                -0.911 **     -0.821 *    
                                                     (0.285)       (0.336)     
                          count_camper                0.798 **      0.806 ***  
                                                     (0.305)       (0.159)     
                          zero_(Intercept)          -11.499       -21.326 ***  
                                                    (55.699)       (5.362)     
                          zero_child                 10.483        11.734 ***  
                                                    (55.659)       (2.420)     
                          zero_camper                -9.501        -3.071 ***  
                                                    (55.663)       (0.852)     
                          count_factor(lake)lake2                   0.182      
                                                                   (0.195)     
                          count_factor(lake)lake3                  -0.595 ***  
                                                                   (0.108)     
                          zero_factor(lake)lake2                   10.878 ***  
                                                                   (2.609)     
                          zero_factor(lake)lake3                    1.353 ***  
                                                                   (0.223)     
                                                  ─────────────────────────────
                          nobs                                                 
                        ───────────────────────────────────────────────────────
                          *** p < 0.001; ** p < 0.01; * p < 0.05.

Selecting rows, and giving them nice names, can be done with the coefs argument.

Upvotes: 0

John J.

Reputation: 1716

I realize I was thinking about this problem incorrectly. Instead of trying to directly compare no.cluster with the object generated using coeftest, I just need to replace the standard errors in with.clusters with the robust values generated by coeftest.

This is a little more complicated than it sounds, since I'm using a zero-inflated regression.

Zero-inflated regression consists of two models, both of which are included in the output of coeftest. I only care about the values beginning with count_.

> with.clusters.final

t test of coefficients:

                         Estimate Std. Error t value              Pr(>|t|)    
count_(Intercept)         1.08748    0.11608  9.3686 < 0.00000000000000022 ***
count_child              -0.82116    0.33572 -2.4460             0.0151678 *  
count_camper              0.80562    0.15933  5.0563        0.000000850945 ***
count_factor(lake)lake2   0.18206    0.19467  0.9352             0.3506175    
count_factor(lake)lake3  -0.59536    0.10789 -5.5184        0.000000088810 ***
zero_(Intercept)        -21.32602    5.36229 -3.9770        0.000092516485 ***
zero_child               11.73360    2.42002  4.8486        0.000002241063 ***
zero_camper              -3.07137    0.85151 -3.6070             0.0003772 ***
zero_factor(lake)lake2   10.87828    2.60947  4.1688        0.000042846049 ***
zero_factor(lake)lake3    1.35318    0.22321  6.0624        0.000000005192 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

I need to drop the rows beginning with zero_, then remove the "count_" prefix from the remaining names.

> clustered.se <- with.clusters.final[,2]
> clustered.se <- clustered.se[str_sub(names(clustered.se), 1, 5) == "count"]
> names(clustered.se) <- str_remove(names(clustered.se), "count_")
> clustered.se
      (Intercept)             child            camper factor(lake)lake2 factor(lake)lake3 
        0.1160763         0.3357177         0.1593300         0.1946651         0.1078864

We can do the same thing for test statistics and p-values. Here is a function so simplify the process.

clustered_stat <- function(coeftest_object, statistic){
  statistic.row <- if(statistic == "se"){
    2
  } else if(statistic == "t") {
    3
  } else if(statistic == "p") {
    4
  }
  
  # pull the appropraite row out of the coeftest object
  new.values <- coeftest_object[,statistic.row]
  # just keep the values from the count model, not the logit model
  new.values <- new.values[str_sub(names(new.values), 1, 5) == "count"]
  # remove the "count_" prefix from the names of the values
  names(new.values) <- str_remove(names(new.values), "count_")
  new.values
}

Now I can supply the original clustered zeroinfl object to stargazer, replacing the statistics with their robust versions created by coeftest.

> stargazer::stargazer(no.clusters, with.clusters,
+                      se = list(NULL, clustered_stat(with.clusters.final, "se")),
+                      t = list(NULL, clustered_stat(with.clusters.final, "t")),
+                      p = list(NULL, clustered_stat(with.clusters.final, "p")),
+                      type = "text",
+                      column.labels = c("no clusters", "with clusters"))

==============================================
                      Dependent variable:     
                  ----------------------------
                             count            
                   no clusters   with clusters
                       (1)            (2)     
----------------------------------------------
child               -0.911***      -0.821**   
                     (0.285)        (0.336)   
                                              
camper               0.798***      0.806***   
                     (0.305)        (0.159)   
                                              
factor(lake)lake2                    0.182    
                                    (0.195)   
                                              
factor(lake)lake3                  -0.595***  
                                    (0.108)   
                                              
Constant             1.052***      1.087***   
                     (0.270)        (0.116)   
                                              
----------------------------------------------
Observations           250            250     
Log Likelihood       -434.906      -430.784   
==============================================
Note:              *p<0.1; **p<0.05; ***p<0.01

Upvotes: 0

how to compare zero-inflated negative binomial regression outputs side-by-side, with and without clustered errors

Answers (3)

Related Questions