Jack
Jack

Reputation: 857

How to run the predicted probabilities (or average marginal effects) for individuals fixed effects in panel data using R?

These are three different ways to run an individual fixed effect method which gives more or less the same results (see below). My main question is how to get predictive probabilities or average marginal effects using the second model (model_plm) or the third model(model_felm). I know how to do it using the first model (model_lm) and show an example below using ggeffects, but that only works when i have a small sample.

As i have over a million individual, my model only works using model_plm and model_felm. If i use model_lm, it takes a lot of time to run with one million individuals since they are controlled for in the model. I also get the following error: Error: vector memory exhausted (limit reached?). I checked many threads on StackOverflow to work around that error but nothing seems to solve it.

I was wondering whether there is an efficient way to work around this issue. My main interest is to extract the predicted probabilities of the interaction residence*union. I usually extract predictive probabilities or average marginal effects using one of these packages: ggeffects,emmeans or margins.

library(lfe)
library(plm)
library(ggeffects)
data("Males")  

model_lm = lm(wage ~ exper + residence+health + residence*union +factor(nr)-1, data=Males)
model_plm = plm(wage ~ exper + residence + health + residence*union,model = "within", index=c("nr", "year"), data=Males)
model_felm = felm(wage ~ exper + residence + health + residence*union | nr, data= Males)

pred_ggeffects <- ggpredict(model_lm, c("residence","union"), 
                    vcov.fun = "vcovCL", 
                    vcov.type = "HC1",
                    vcov.args = list(cluster = Males$nr))

Upvotes: 8

Views: 1102

Answers (3)

Russ Lenth
Russ Lenth

Reputation: 6810

The problem seems to be that when we add -1 to the formula, that creates an extra column in the model matrix that is not included in the regression coefficients. (This is a byproduct of the way that R creates factor codings.) So I can work around this by adding a strategically placed coefficient of zero. We also have to fix up the covariance matrix the same way:

library(emmeans)
library(plm)
data("Males")

mod <- plm(wage ~ exper + residence + health + residence*union,
           model = "within", 
           index = "nr", 
           data = Males)

BB <- c(coef(mod)[1], 0, coef(mod)[-1])
k <- length(BB)
VV <- matrix(0, nrow = k, ncol = k)
VV[c(1, 3:k), c(1, 3:k)] <- vcov(mod)

RG <- qdrg(~ -1 + exper + residence + health + residence*union, 
           data = Males, coef = BB, vcov = VV, df = df.residual(mod))

Verify that things line up:

> names(RG@bhat)
 [1] "exper"                             ""                                 
 [3] "residencenorth_east"               "residencenothern_central"         
 [5] "residencesouth"                    "healthyes"                        
 [7] "unionyes"                          "residencenorth_east:unionyes"     
 [9] "residencenothern_central:unionyes" "residencesouth:unionyes"
> colnames(RG@linfct)
 [1] "exper"                             "residencerural_area"              
 [3] "residencenorth_east"               "residencenothern_central"         
 [5] "residencesouth"                    "healthyes"                        
 [7] "unionyes"                          "residencenorth_east:unionyes"     
 [9] "residencenothern_central:unionyes" "residencesouth:unionyes"

They do line up, so we can get the results we need:

(EMM <- emmeans(RG, ~ residence * union))
 residence       union emmean     SE   df lower.CL upper.CL
 rural_area      no     0.378 0.0335 2677  0.31201    0.443
 north_east      no     0.330 0.1636 2677  0.00929    0.651
 nothern_central no     0.192 0.1483 2677 -0.09834    0.483
 south           no     0.260 0.1514 2677 -0.03732    0.557
 rural_area      yes    0.287 0.1473 2677 -0.00144    0.576
 north_east      yes    0.385 0.1647 2677  0.06155    0.708
 nothern_central yes    0.333 0.1539 2677  0.03091    0.634
 south           yes    0.341 0.1534 2677  0.04024    0.642

Results are averaged over the levels of: health 
Confidence level used: 0.95 

In general, the key is to identify where the added column occurs. It's going to be the position of the first level of the first factor in the model formula. You can check it by looking at names(coef(mod)) and colnames(model.matrix(formula), data = data) where formula is the model formula with intercept removed.

Update: a general function

Here's a function that may be used to create a reference grid for any plm object. It turns out that sometimes these objects do have an intercept (e.g., random-effects models) so we have to check. For models lacking an intercept, you really should use this only for contrasts.

plmrg = function(object, ...) {
    form = formula(formula(object))
    if (!("(Intercept)" %in% names(coef(object))))
        form = update(form, ~ . - 1)
    data = eval(object$call$data, environment(form))
    mmat = model.matrix(form, data)
    sel = which(colnames(mmat) %in% names(coef(object)))
    k = ncol(mmat)
    b = rep(0, k)
    b[sel] = coef(object)
    v = matrix(0, nrow = k, ncol = k)
    v[sel, sel] = vcov(object)
    emmeans::qdrg(formula = form, data = data, 
        coef = b, vcov = v, df = df.residual(object), ...)
}

Test run:

> (rg = plmrg(mod, at = list(exper = c(3,6,9))))
'emmGrid' object with variables:
    exper = 3, 6, 9
    residence = rural_area, north_east, nothern_central, south
    health = no, yes
    union = no, yes

> emmeans(rg, "residence")
NOTE: Results may be misleading due to involvement in interactions
 residence       emmean     SE   df lower.CL upper.CL
 rural_area       0.313 0.0791 2677   0.1579    0.468
 north_east       0.338 0.1625 2677   0.0190    0.656
 nothern_central  0.243 0.1494 2677  -0.0501    0.536
 south            0.281 0.1514 2677  -0.0161    0.578

Results are averaged over the levels of: exper, health, union 
Confidence level used: 0.95 

Upvotes: 2

swihart
swihart

Reputation: 2738

I tried adjusting formula/datasets to get emmeans and plm to play nice. Let me know if there's something here. I realized the biglm answer wasn't going to cut it for a million individuals after some testing.

library(emmeans)
library(plm)
data("Males")  

## this runs but we need to get an equivalent result with expanded formula
## and expanded dataset
model_plm = plm(wage ~ exper + residence + health + residence*union,model = "within", index=c("nr"), data=Males)

## expanded dataset
Males2 <- data.frame(wage=Males[complete.cases(Males),"wage"],
                     model.matrix(wage ~ exper + residence + health + residence*union, Males), 
                     nr=Males[complete.cases(Males),"nr"])


(fmla2 <- as.formula(paste("wage ~ ", paste(names(coef(model_plm)), collapse= "+"))))

## expanded formula
model_plm2 <- plm(fmla2,
                  model = "within",
                  index=c("nr"), 
                  data=Males2)

(fmla2_rg <- as.formula(paste("wage ~ -1 +", paste(names(coef(model_plm)), collapse= "+"))))

plm2_rg <- qdrg(fmla2_rg,
                data = Males2,
                coef = coef(model_plm2),
                vcov = vcov(model_plm2),
                df = model_plm2$df.residual)

plm2_rg

### when all 3 residences are 0, that's `rural area`
### then just pick the rows when one of the residences are 1
emmeans(plm2_rg, c("residencenorth_east","residencenothern_central","residencesouth", "unionyes"))

Which gives, after some row-deletion:

> ### when all 3 residences are 0, that's `rural area`
> ### then just pick the rows when one of the residences are 1
> emmeans(plm2_rg, c("residencenorth_east","residencenothern_central","residencesouth", "unionyes"))
 residencenorth_east residencenothern_central residencesouth unionyes emmean     SE   df lower.CL upper.CL
                   0                        0              0        0 0.3777 0.0335 2677  0.31201    0.443
                   1                        0              0        0 0.3301 0.1636 2677  0.00929    0.651
                   0                        1              0        0 0.1924 0.1483 2677 -0.09834    0.483
                   0                        0              1        0 0.2596 0.1514 2677 -0.03732    0.557
                   0                        0              0        1 0.2875 0.1473 2677 -0.00144    0.576
                   1                        0              0        1 0.3845 0.1647 2677  0.06155    0.708
                   0                        1              0        1 0.3326 0.1539 2677  0.03091    0.634
                   0                        0              1        1 0.3411 0.1534 2677  0.04024    0.642

Results are averaged over the levels of: healthyes 
Confidence level used: 0.95

Upvotes: 2

swihart
swihart

Reputation: 2738

This potential solution uses biglm::biglm() to fit the lm model and then uses emmeans::qdrg() with a nuisance specified. Does this approach help in your situation?

library(biglm)
library(emmeans)
## the biglm coefficients using factor() with all the `nr` levels has NAs.
## so restrict data to complete cases in the `biglm()` call
model_biglm <- biglm(wage ~ -1 +exper + residence+health + residence*union + factor(nr), data=Males[!is.na(Males$residence),])
summary(model_biglm)

## double check that biglm and lm give same/similar model
## summary(model_biglm)
## summary(model_lm)
summary(model_biglm)$rsq
summary(model_lm)$r.squared
identical(coef(model_biglm), coef(model_lm)) ## not identical!  but plot the coefficients...
head(cbind(coef(model_biglm), coef(model_lm)))
tail(cbind(coef(model_biglm), coef(model_lm)))
plot(cbind(coef(model_biglm), coef(model_lm))); abline(0,1,col="blue")


## do a "[q]uick and [d]irty [r]eference [g]rid and follow examples 
### from ?qdrg and https://cran.r-project.org/web/packages/emmeans/vignettes/FAQs.html 
  rg1 <- qdrg(wage ~ -1 + exper + residence+health + residence*union + factor(nr), 
              data = Males,
              coef = coef(model_biglm),
              vcov = vcov(model_biglm), 
              df = model_biglm$df.resid,
              nuisance="nr")
  
## Since we already specified nuisance in qdrg() we don't in emmeans():
  emmeans(rg1, c("residence","union"))

Which gives:

>   emmeans(rg1, c("residence","union"))
 residence       union emmean     SE   df lower.CL upper.CL
 rural_area      no      1.72 0.1417 2677     1.44     2.00
 north_east      no      1.67 0.0616 2677     1.55     1.79
 nothern_central no      1.53 0.0397 2677     1.45     1.61
 south           no      1.60 0.0386 2677     1.52     1.68
 rural_area      yes     1.63 0.2011 2677     1.23     2.02
 north_east      yes     1.72 0.0651 2677     1.60     1.85
 nothern_central yes     1.67 0.0503 2677     1.57     1.77
 south           yes     1.68 0.0460 2677     1.59     1.77

Results are averaged over the levels of: 1 nuisance factors, health 
Confidence level used: 0.95 

Upvotes: 1

Related Questions