epsilone

Reputation: 765

Missing data behaviour in lm: complete cases used even with predictors without missing data

My question: what is the most efficient way to remove a predictor that has NAs and fit the model on the cases that are complete for the remaining variables only?

The question arises from the following regression on the airquality data, which has missing values in Ozone (37) and Solar.R (7).

data(airquality)
summary(airquality)
#     Ozone           Solar.R           Wind             Temp           Month      
# Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000  
# 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000  
# Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000  
# Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88   Mean   :6.993  
# 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000  
# Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000  
# NA's   :37       NA's   :7                                                       
#      Day      
# Min.   : 1.0  
# 1st Qu.: 8.0  
# Median :16.0  
# Mean   :15.8  
# 3rd Qu.:23.0  
# Max.   :31.0  

Regressing Wind on all the remaining variables uses only the complete cases (42 observations are deleted).

summary(lm(Wind ~ ., data = airquality))
# 
# Call:
# lm(formula = Wind ~ ., data = airquality)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -4.3908 -2.2800 -0.3078  1.4132  9.6501 
# 
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 15.519460   2.918393   5.318 5.96e-07 ***
# Ozone       -0.060746   0.011798  -5.149 1.23e-06 ***
# Solar.R      0.003791   0.003216   1.179    0.241    
# Temp        -0.036604   0.044576  -0.821    0.413    
# Month       -0.159671   0.208082  -0.767    0.445    
# Day          0.017353   0.031238   0.556    0.580    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 2.822 on 105 degrees of freedom
#   (42 observations deleted due to missingness)
# Multiple R-squared:  0.3994,  Adjusted R-squared:  0.3708 
# F-statistic: 13.96 on 5 and 105 DF,  p-value: 1.857e-10

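If Ozone is removed from the formula with . - Ozone, lm still uses only the cases that are complete with Ozone included (again 42 observations deleted). This differs from listing the predictors manually without Ozone, which deletes only the 7 observations with missing Solar.R.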

summary(lm(Wind ~ . - Ozone, data = airquality))
# 
# Call:
# lm(formula = Wind ~ . - Ozone, data = airquality)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
# -6.012 -2.323 -0.361  1.493  9.605 
# 
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 24.3159074  2.6354288   9.227 3.09e-15 ***
# Solar.R      0.0009228  0.0035281   0.262    0.794    
# Temp        -0.1900820  0.0369159  -5.149 1.21e-06 ***
# Month        0.0313046  0.2280600   0.137    0.891    
# Day          0.0008969  0.0346116   0.026    0.979    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 3.143 on 106 degrees of freedom
#   (42 observations deleted due to missingness)
# Multiple R-squared:  0.2477,  Adjusted R-squared:  0.2193 
# F-statistic: 8.727 on 4 and 106 DF,  p-value: 3.961e-06

summary(lm(Wind ~ Solar.R + Temp + Month + Day, data = airquality))
# 
# Call:
# lm(formula = Wind ~ Solar.R + Temp + Month + Day, data = airquality)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -8.1779 -2.2063 -0.2757  1.9448  9.3510 
# 
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 23.660271   2.416766   9.790  < 2e-16 ***
# Solar.R      0.002980   0.003113   0.957    0.340    
# Temp        -0.186386   0.032725  -5.695 6.89e-08 ***
# Month        0.074952   0.206334   0.363    0.717    
# Day         -0.011028   0.030304  -0.364    0.716    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 3.158 on 141 degrees of freedom
#   (7 observations deleted due to missingness)
# Multiple R-squared:  0.2125,  Adjusted R-squared:  0.1901 
# F-statistic: 9.511 on 4 and 141 DF,  p-value: 7.761e-07
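
One way to confirm where the extra rows go is to look at the model frame directly. This is only a sketch using base R; the variable names (na_counts, mf, rows_kept, rows_complete) are mine and purely illustrative. It counts the NAs per column and compares the number of rows kept for Wind ~ . - Ozone with the number of fully complete rows in airquality.

```r
data(airquality)

# Number of missing values in each column.
na_counts <- colSums(is.na(airquality))

# Model frame that lm() would build for the formula with Ozone subtracted.
# The default na.action (na.omit) is applied while the frame is built.
mf <- model.frame(Wind ~ . - Ozone, data = airquality)
rows_kept <- nrow(mf)

# Rows of the full data set with no NA in any column, Ozone included.
rows_complete <- sum(complete.cases(airquality))

print(na_counts)
print(c(rows_kept = rows_kept, rows_complete = rows_complete))
```

If the two counts agree, the rows with missing Ozone were dropped even though Ozone is not a term in the fitted model.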

Upvotes: 3

Views: 1205

Answers (1)

Ben Bolker

Reputation: 226097

It is indeed unfortunate and surprising that Wind ~ . - Ozone still considers Ozone when finding complete cases. This seems worth discussing on the [email protected] mailing list, if you want to pursue it. In the meantime, how about dropping the column from the data before fitting:

 summary(lm(Wind ~ ., data = subset(airquality, select = -Ozone)))

?
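
As a rough check that this does what is wanted, the number of observations used by each fit can be compared with nobs(). The sketch below is illustrative only: the object names (fit_minus, fit_subset, fit_index) are made up here, and the setdiff() indexing is just an alternative to subset() for non-interactive code, not something the workaround requires.

```r
data(airquality)

# Formula-level removal: rows with missing Ozone are still dropped.
fit_minus <- lm(Wind ~ . - Ozone, data = airquality)

# Data-level removal: Ozone never reaches lm(), so only Solar.R NAs matter.
fit_subset <- lm(Wind ~ ., data = subset(airquality, select = -Ozone))

# Same idea with plain column indexing instead of subset().
keep <- setdiff(names(airquality), "Ozone")
fit_index <- lm(Wind ~ ., data = airquality[keep])

# Observations actually used by each fit.
n_used <- c(minus  = nobs(fit_minus),
            subset = nobs(fit_subset),
            index  = nobs(fit_index))
print(n_used)
```

The two data-level fits should use the same rows as the manually written formula in the question, that is, all rows except those with missing Solar.R.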

Upvotes: 4
