Florian
Florian

Reputation: 61

Cluster standard errors for Heckman selection model in R

I am trying to obtain clustered standard errors for a Heckman selection model given the output from the "sampleSelection" package (selection command).

For replication, I am using the examples given in the STATA documentation (see Examples 1 & 2 on pages 7 & 9 - http://www.stata.com/manuals13/rheckman.pdf).

In R, I obtain the results from Example 1 as follows:

install.packages("readstata13")
library(readstata13)

install.packages("sampleSelection")
library(sampleSelection)

## Read STATA data
dat <- data.table(read.dta13("http://www.stata-press.com/data/r13/womenwk.dta"))

## Summary statistics
summary(dat[,.(age, education, married, children, wage, county)])

## Define indicator whether wage variable is defined
dat[, lfp := !is.na(wage)]

## STATA command Example 1: heckman wage educ age, select(married children educ age)
heckmanML <- selection(selection = lfp ~ married + children + education + age, outcome = wage ~ education + age, data = dat)

## Results Example 1
summary(heckmanML)

## STATA command Example 2: heckman wage educ age, select(married children educ age) vce(cluster county)
## <<stuck here>>

Any ideas how I can replicate the last command using the vce(cluster) option? I tried to play around with cluster.vcov from the multiwayvcov package but got stuck with the following error:

cluster.vcov(heckmanML, eval(heckmanML$call$data)[,county])
Error in `[<-.data.frame`(`*tmp*`, i, "K", value = numeric(0)) : replacement has length zero

Upvotes: 1

Views: 1073

Answers (1)

Ian Gow
Ian Gow

Reputation: 3535

I adapted code from Mahmoud Arai. I fiddled with the degrees-of-freedom to match the output from the Stata manual and I replace the variance-covariance matrix stored with the fitted model so that the summary output matches Stata's output (see second set of output below).

library(haven)
library(dplyr, warn.conflicts = FALSE)
library(sampleSelection)
#> Loading required package: maxLik
#> Loading required package: miscTools
#> 
#> Please cite the 'maxLik' package as:
#> Henningsen, Arne and Toomet, Ott (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26(3), 443-458. DOI 10.1007/s00180-010-0217-1.
#> 
#> If you have questions, suggestions, or comments regarding the 'maxLik' package, please use a forum or 'tracker' at maxLik's R-Forge site:
#> https://r-forge.r-project.org/projects/maxlik/
library(sandwich)

## Read STATA data
dat <- 
  read_stata("http://www.stata-press.com/data/r13/womenwk.dta") %>%
  ## Define indicator whether wage variable is defined
  mutate(lfp = !is.na(wage))

## STATA command Example 1: heckman wage educ age, select(married children educ age)
heckmanML <- selection(selection = lfp ~ married + children + education + age, 
                       outcome = wage ~ education + age, data = dat)

## Results Example 1
summary(heckmanML)
#> --------------------------------------------
#> Tobit 2 model (sample selection model)
#> Maximum Likelihood estimation
#> Newton-Raphson maximisation, 3 iterations
#> Return code 8: successive function values within relative tolerance limit (reltol)
#> Log-Likelihood: -5178.304 
#> 2000 observations (657 censored and 1343 observed)
#> 10 free parameters (df = 1990)
#> Probit selection equation:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -2.491015   0.189340 -13.156  < 2e-16 ***
#> married      0.445171   0.067395   6.605 5.07e-11 ***
#> children     0.438707   0.027783  15.791  < 2e-16 ***
#> education    0.055732   0.010735   5.192 2.30e-07 ***
#> age          0.036510   0.004153   8.790  < 2e-16 ***
#> Outcome equation:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  0.48579    1.07704   0.451    0.652    
#> education    0.98995    0.05326  18.588   <2e-16 ***
#> age          0.21313    0.02060  10.345   <2e-16 ***
#>    Error terms:
#>       Estimate Std. Error t value Pr(>|t|)    
#> sigma  6.00479    0.16572   36.23   <2e-16 ***
#> rho    0.70350    0.05123   13.73   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> --------------------------------------------

vcovCL <- function(fm, cluster) {
  
  M <- length(unique(cluster))
  N <- length(cluster)
  dfc <- M/(M-1)
  
  u  <- apply(estfun(fm),2, function(x) tapply(x, cluster, sum))
  dfc * sandwich(fm, meat=crossprod(u)/N)
}

heckmanML[["vcovAll"]] <- vcovCL(heckmanML, dat$county)
summary(heckmanML)
#> --------------------------------------------
#> Tobit 2 model (sample selection model)
#> Maximum Likelihood estimation
#> Newton-Raphson maximisation, 3 iterations
#> Return code 8: successive function values within relative tolerance limit (reltol)
#> Log-Likelihood: -5178.304 
#> 2000 observations (657 censored and 1343 observed)
#> 10 free parameters (df = 1990)
#> Probit selection equation:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -2.491015   0.115330 -21.599  < 2e-16 ***
#> married      0.445171   0.073147   6.086 1.39e-09 ***
#> children     0.438707   0.031239  14.044  < 2e-16 ***
#> education    0.055732   0.011004   5.065 4.47e-07 ***
#> age          0.036510   0.004038   9.042  < 2e-16 ***
#> Outcome equation:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  0.48579    1.30210   0.373    0.709    
#> education    0.98995    0.06001  16.498   <2e-16 ***
#> age          0.21313    0.02099  10.151   <2e-16 ***
#>    Error terms:
#>       Estimate Std. Error t value Pr(>|t|)    
#> sigma  6.00479    0.15520  38.691   <2e-16 ***
#> rho    0.70350    0.07088   9.925   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> --------------------------------------------

Created on 2021-05-08 by the reprex package (v2.0.0)

Upvotes: 0

Related Questions