Imagine I have a dataset like
df <- data.frame(y=c(11:16), x1=c(23,NA,27,20,20,21), x2=c(NA,9,2,9,7,8))
df
   y x1 x2
1 11 23 NA
2 12 NA  9
3 13 27  2
4 14 20  9
5 15 20  7
6 16 21  8
If I perform a multiple linear regression, I get
m <- lm(y~x1+x2, data=df)
summary(m)
Call:
lm(formula = y ~ x1 + x2, data = df)
Residuals:
         3          4          5          6 
-1.744e-01 -1.047e+00 -4.233e-16  1.221e+00 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.72093   27.06244   0.729    0.599
x1          -0.24419    0.93927  -0.260    0.838
x2           0.02326    1.01703   0.023    0.985
Residual standard error: 1.617 on 1 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.4767, Adjusted R-squared: -0.5698
F-statistic: 0.4556 on 2 and 1 DF, p-value: 0.7234
Here we have 2 observations (1 and 2) deleted due to missingness.
To reduce the effects of missing data, would it be wise to fit two separate simple linear regressions instead?
I.e.
m1 <- lm(y~x1, data=df)
m2 <- lm(y~x2, data=df)
In this case, for each model we will have only 1 observation deleted due to missingness.
No, that would probably not be wise, because you run into the issue of omitted variable bias.
You can see how this affects your estimates, for instance for x1, whose coefficient estimate is inflated in magnitude:
summary(lm(y~x1, data=df))
Call:
lm(formula = y ~ x1, data = df)
Residuals:
      1       3       4       5       6 
-2.5287  0.8276 -0.5460  0.4540  1.7931 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.3276     7.1901   2.966   0.0592 .
x1           -0.3391     0.3216  -1.054   0.3692
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.897 on 3 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.2703, Adjusted R-squared: 0.02713
F-statistic: 1.112 on 1 and 3 DF, p-value: 0.3692
Note that your relation of interest is y~x1+x2, i.e. the effect of x1 on y accounting for the effect of x2, and vice versa. That is of course not the same as estimating y~x1 and y~x2 separately, where you omit the effect of the other explanatory variable.
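To see this mechanism in isolation, here is a small simulated sketch (the coefficients and variable names are made up for illustration): when x1 and x2 are correlated and both affect y, dropping x2 pulls the estimate for x1 away from its true value:

set.seed(1)
n  <- 1000
x2 <- rnorm(n)
x1 <- 0.7 * x2 + rnorm(n)              # x1 and x2 are correlated
y  <- 2 + 1 * x1 + 1.5 * x2 + rnorm(n) # true coefficient on x1 is 1

coef(lm(y ~ x1 + x2)) # b1 estimated close to its true value of 1
coef(lm(y ~ x1))      # b1 absorbs part of x2's effect (roughly 1.7 here)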
Now there are of course strategies to deal with missing values. One option is estimating a Bayesian model, using JAGS for instance, where you can model the missing values explicitly. An example would be the following, where I use the mean and standard deviation of each variable to model its missing values:
model{
  for(i in 1:N){
    y[i] ~ dnorm(yhat[i], tau)
    yhat[i] <- a + b1*x1[i] + b2*x2[i]
    # Covariate models: observed entries act as data, missing
    # entries are imputed from these distributions. Note that
    # dnorm() in JAGS takes a precision (1/sd^2), not an sd.
    x1[i] ~ dnorm(22, pow(3, -2))
    x2[i] ~ dnorm(7, pow(1.3, -2))
  }
  # Priors (the intercept a needs one too, or JAGS will not compile)
  a ~ dnorm(0, .01)
  b1 ~ dnorm(0, .01)
  b2 ~ dnorm(0, .01)
  # Hyperpriors
  tau <- pow(sd, -2)
  sd ~ dunif(0, 20)
}
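If you want to run this from R, a minimal sketch using the rjags package could look like this (assuming JAGS and rjags are installed, and that the model code above is stored as a string in model_string; jags_data is a name of my own):

library(rjags)

# NAs in x1 and x2 are passed in as-is; JAGS treats them as
# unobserved nodes and imputes them from the covariate models
jags_data <- list(y = df$y, x1 = df$x1, x2 = df$x2, N = nrow(df))

jm <- jags.model(textConnection(model_string), data = jags_data, n.chains = 3)
update(jm, 1000)                                   # burn-in
post <- coda.samples(jm, c("a", "b1", "b2"), n.iter = 5000)
summary(post)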
The model above is just off the top of my head. For better and more insightful advice on how to deal with missing values, I would recommend paying a visit to stats.stackexchange.