Wrong VCOV matrix from sandwich::vcovHC with observation counts in R

I am trying to estimate a weighted LM with repeated observations. Naturally, one can exploit this redundancy information to save space and memory. I am working on an application with 400k observations, many of which are duplicated, and said LMs (and more complicated methods) are invoked many times, which is why weighted OLS is equivalent to OLS with duplicated observations. Without this memory-saving trick, the analysis cannot proceed because this robust VCOV is invoked hundreds of times. The idea of using observation counts to reduce computational burden is not new; a good example where it shines is Owen, 2017 (‘A weighted self-concordant optimization for empirical likelihood’).

The point estimates in the two variants of the model, unweighted (m) and weighted (m.w), are identical. However, sandwich::vcovHC does not handle the weights as if they were observations weights.

MWE:

rm(list = ls())
x <- c(1, 1, 1, 1, 2:10) # Observation 1 has a count of 4
y <- c(15, 15, 15, 15, 13, 36, 31, 37, 63, 66, 80, 91, 94)
x.uniq <- x[-(1:3)]
y.uniq <- y[-(1:3)]
w <- c(4, rep(1, 9))
m <- lm(y ~ x)
m.w <- lm(y.uniq ~ x.uniq, weights = w)
coef(m) - coef(m.w) # Indentical
round(sandwich::vcovHC(m, type = "HC0"), 2)
#            (Intercept)     x
#(Intercept)        5.06 -0.53
#x                 -0.53  0.11
round(sandwich::vcovHC(m.w, type = "HC0"), 2) # Very different

This is how the correct VCOV should be computed in theory:

# White's asymptotic sandwich formula manually:
br <- solve(crossprod(model.matrix(m)))
me <- crossprod(model.matrix(m) * resid(m))
V <- br %*% me %*% br
round(V, 2)
#            (Intercept) x.uniq
#(Intercept)        5.06  -0.53
#x.uniq            -0.53   0.11
# Same formula, with observation counts
br.w <- solve(crossprod(model.matrix(m.w) * sqrt(weights(m.w))))
me.w <- crossprod(model.matrix(m.w) * resid(m.w) * sqrt(weights(m.w)))
V.w <- br.w %*% me.w %*% br.w
round(V.w, 2)
#            (Intercept) x.uniq
#(Intercept)        5.06  -0.53
#x.uniq            -0.53   0.11

The problem can be seen in the sandwich::meatHC function. The following code computes the meat manually (simplest case, corresponding to type = "HC0")

X <- model.matrix(m)
X.w <- model.matrix(m.w)
res <- rowMeans(estfun(m)/X)
res.w <- rowMeans(estfun(m.w)/X.w)
omega <- res^2
omega.w <- res.w^2
meat <- crossprod(sqrt(omega) * X) / nrow(X)
meat.w <- crossprod(sqrt(omega.w) * X.w) / nrow(X.w)
all.equal(meat, meatHC(m, type = "HC0")) # TRUE
all.equal(meat.w, meatHC(m.w, type = "HC0")) # TRUE

Am I being silly and not seeing some obvious arguments that should be passed as ... to estfun, as well as the need for a custom omega vector?

Yours sincerely, Andreï.

Upvotes: 0

Views: 151

Answers (1)

Case 1: no observation weighting, only counts

V.good <- sandwich::vcovHC(m.w, type = "HC0", omega = resid(m.w)^2 * weights(m.w))
round(V.good, 2)
#            (Intercept) x.uniq
#(Intercept)        5.06  -0.53
#x.uniq            -0.53   0.11

The only necessary change here is supplying the correct omega vector. In this application, it suffices to provide weighted squared residuals. I could not find any similar examples online; maybe this could be added as an example to the vignette because Stata users have been enjoying their [f/a/p]weights for too long, and OLS with observation weights is very common in applied statistics.

Case 2: counts AND observation weights (e.g. IPW)

Now suppose that there are observation weights (sampling weights, inverse-variance weights, propensity score etc.), and in addition to that, some observations are duplicated, which is why the final weights, used as the argument of lm(), are their product:

x <- c(1, 1, 1, 1, 2:10) # Observation 1 has a count of 4
y <- c(15, 15, 15, 15, 13, 36, 31, 37, 63, 66, 80, 91, 94)
iv <- 1 / x # If Var(U | X) ~ X, then, these weights are optimal
x.uniq <- x[-(1:3)]
y.uniq <- y[-(1:3)]
iv.uniq <- iv[-(1:3)]
w <- c(4, rep(1, 9))
mw   <- lm(y ~ x, weights = iv)
mwc <- lm(y.uniq ~ x.uniq, weights = iv.uniq * w)
coef(mw) - coef(mwc) # Indentical
round(sandwich::vcovHC(mw, type = "HC0"), 3)
#             (Intercept)      x
# (Intercept)       1.344 -0.189
# x                -0.189  0.151
round(sandwich::vcovHC(mwc, type = "HC0"), 3) # Very different
#             (Intercept) x.uniq
# (Intercept)       2.621 -0.389
# x.uniq           -0.389  0.183

The old solution will not work because now the weights should enter the bread and the meat matrices diffrently:

round(sandwich::vcovHC(mwc, type = "HC0", omega = resid(mwc)^2 * weights(mwc)), 3)
#             (Intercept) x.uniq
# (Intercept)       3.148 -0.933
# x.uniq           -0.933  0.912

Now, the bread matrix should take into account only the observation counts, and the meat matrix, both counts and weights (in this example, inverse-variance):

n <- sum(w) # True number of observations, 13
mm <- model.matrix(mwc)
ef <- mm * resid(mwc) * iv.uniq * sqrt(w) # estfun
me <- crossprod(ef) / n # Meat matrix
max(abs(me - meat(mw))) # 7.1e-15, good
br <- solve(crossprod(mm * sqrt(weights(mwc))) / n) # Bread matrix
max(abs(br - bread(mw))) # 2.2e-16, good
V <- br %*% me %*% br / n
round(V, 3)
#             (Intercept) x.uniq
# (Intercept)       1.344 -0.189
# x.uniq           -0.189  0.151

Upvotes: 2

Related Questions