user3631369

Reputation: 331

Multiple linear regression with missing covariates

Imagine I have a dataset like

df <- data.frame(y=c(11:16), x1=c(23,NA,27,20,20,21), x2=c(NA,9,2,9,7,8))

df
   y x1 x2
1 11 23 NA
2 12 NA  9
3 13 27  2
4 14 20  9
5 15 20  7
6 16 21  8

If I perform a multiple linear regression, I get

m <- lm(y~x1+x2, data=df)
summary(m)

Call:
lm(formula = y ~ x1 + x2, data = df)

Residuals:
         3          4          5          6 
-1.744e-01 -1.047e+00 -4.233e-16  1.221e+00 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.72093   27.06244   0.729    0.599
x1          -0.24419    0.93927  -0.260    0.838
x2           0.02326    1.01703   0.023    0.985

Residual standard error: 1.617 on 1 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.4767,    Adjusted R-squared:  -0.5698 
F-statistic: 0.4556 on 2 and 1 DF,  p-value: 0.7234

Here we have 2 observations (1 and 2) deleted due to missingness.

To reduce the effects of missing data, would it be wise to fit two separate simple linear regressions instead?

I.e.

m1 <- lm(y~x1, data=df)
m2 <- lm(y~x2, data=df)

In this case, for each model we will have only 1 observation deleted due to missingness.

Upvotes: 1

Views: 517

Answers (1)

horseoftheyear

Reputation: 915

No, that would probably not be wise, because you run into the issue of omitted variable bias. You can see how this affects your estimates, for instance for x1, whose coefficient is inflated:

summary(lm(y~x1, data=df))
Call:
lm(formula = y ~ x1, data = df)

Residuals:
      1       3       4       5       6 
-2.5287  0.8276 -0.5460  0.4540  1.7931 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  21.3276     7.1901   2.966   0.0592 .
x1           -0.3391     0.3216  -1.054   0.3692  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.897 on 3 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.2703,    Adjusted R-squared:  0.02713 
F-statistic: 1.112 on 1 and 3 DF,  p-value: 0.3692

Note that your relation of interest is y~x1+x2, i.e. the effect of x1 on y accounting for the effect of x2, and vice versa. That is of course not the same as estimating y~x1 and y~x2 separately, where you omit the effect of the other explanatory variable.
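
To see the difference concretely, you can compare the x1 coefficient from the joint model and from the simple model side by side (a minimal sketch reusing the df from the question; keep in mind the two fits also use different subsets of rows because of the NAs):

# Coefficient of x1 when x2 is included vs. omitted
coef(lm(y ~ x1 + x2, data = df))["x1"]
coef(lm(y ~ x1, data = df))["x1"]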

Now there are of course strategies to deal with missing values. One option is estimating a Bayesian model, for instance with JAGS, where you can model the missing values explicitly. An example would be the following, where I use the mean and standard deviation of each observed covariate to model its missing values:

model{
  for(i in 1:N){
    # Likelihood
    y[i] ~ dnorm(yhat[i], tau)
    yhat[i] <- a + b1*x1[i] + b2*x2[i]

    # Accounting for missing data: giving each covariate a distribution
    # lets JAGS impute the NAs. Note that dnorm() in JAGS is parameterised
    # by a precision (1/sd^2), so the observed sd is converted here.
    x1[i] ~ dnorm(22, pow(3, -2))
    x2[i] ~ dnorm(7, pow(1.3, -2))
    }
  # Priors (the intercept a needs one too, otherwise JAGS cannot resolve it)
  a ~ dnorm(0, .01)
  b1 ~ dnorm(0, .01)
  b2 ~ dnorm(0, .01)

  # Hyperpriors
  tau <- pow(sd, -2)
  sd ~ dunif(0, 20)
}
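
For completeness, here is a minimal sketch of how this model could be run from R, assuming JAGS and the rjags package are installed (the burn-in and iteration counts are only illustrative):

library(rjags)

# The JAGS model from above, stored as a string
model_string <- "
model{
  for(i in 1:N){
    y[i] ~ dnorm(yhat[i], tau)
    yhat[i] <- a + b1*x1[i] + b2*x2[i]
    x1[i] ~ dnorm(22, pow(3, -2))
    x2[i] ~ dnorm(7, pow(1.3, -2))
  }
  a ~ dnorm(0, .01)
  b1 ~ dnorm(0, .01)
  b2 ~ dnorm(0, .01)
  tau <- pow(sd, -2)
  sd ~ dunif(0, 20)
}"

# Pass the data including the NAs; JAGS treats the missing x1/x2 entries
# as unobserved nodes and samples them from their distributions
jags_data <- list(y = df$y, x1 = df$x1, x2 = df$x2, N = nrow(df))

jm <- jags.model(textConnection(model_string), data = jags_data, n.chains = 3)
update(jm, 1000)  # burn-in
samples <- coda.samples(jm, variable.names = c("a", "b1", "b2", "sd"),
                        n.iter = 5000)
summary(samples)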

This is just off the top of my head. For better and more insightful advice on how to deal with missing values, I would recommend paying a visit to stats.stackexchange.

Upvotes: 1
