predict.lm after regression with missing data in Y

Question

I don't understand how to generate predicted values from a linear regression with predict.lm when some values of the dependent variable Y are missing, even though no independent X observation is missing. Algebraically this isn't a problem, but I don't know an efficient way to do it in R. Take, for example, this fake data frame and regression model. I try to assign the predictions to a column in the source data frame, but because of the one missing Y value I get an error.

# Create a fake dataframe
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(100,200,300,400,NA,600,700,800,900,100)
df <- as.data.frame(cbind(x,y))

# Regress Y on X
model <- lm(y ~ x + 1)
summary(model)

# Attempt to store the predictions in the source data frame (this fails)
df$y_ip <- predict.lm(model)

Error in `$<-.data.frame`(`*tmp*`, y_ip, value = c(221.............
  replacement has 9 rows, data has 10

I got around this problem by generating the predictions algebraically, df$y <- B0 + B1*df$x, or by pulling the coefficients out of the model, df$y <- (summary(model)$coefficients[1, 1]) + (summary(model)$coefficients[2, 1] * df$x); however, I am now working with a big data model with hundreds of coefficients, and these methods are no longer practical. I'd like to know how to do this with the predict function.

Thank you in advance for your assistance!

Ben Bolker · Accepted Answer

R has built-in functionality for this, although it is not necessarily obvious: the na.action argument and the na.exclude function (see ?na.exclude). With na.action=na.exclude set, predict() (and similar downstream processing functions) will automatically fill in NA values in the relevant spots.

Set up data:

df <- data.frame(x=1:10,y=100*(1:10))
df$y[5] <- NA

Fit model: default na.action is na.omit, which simply removes non-complete cases.

mod1 <- lm(y~x+1,data=df)
predict(mod1)
##    1    2    3    4    6    7    8    9   10 
##  100  200  300  400  600  700  800  900 1000
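
To see why the assignment in the question fails, here is a small sketch on the same data. With na.omit, the prediction vector has one element fewer than the data frame has rows, and the fitted object records which row was dropped. The names d and fit_omit are illustrative only, and na.action is passed explicitly so the result does not depend on the session's options.

```r
# Illustrative copy of the example data (the name d is hypothetical)
d <- data.frame(x = 1:10, y = 100 * (1:10))
d$y[5] <- NA

# Pass na.omit explicitly rather than relying on the session default
fit_omit <- lm(y ~ x + 1, data = d, na.action = na.omit)

n_pred  <- length(predict(fit_omit))  # number of predictions returned
n_rows  <- nrow(d)                    # number of rows in the data
dropped <- na.action(fit_omit)        # row(s) removed before fitting

print(c(predictions = n_pred, rows = n_rows))
print(dropped)
```

The length mismatch between the two counts is what triggers the "replacement has 9 rows, data has 10" error.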

na.exclude removes non-complete cases before fitting, but then restores them (filled with NA) in predicted vectors:

mod2 <- update(mod1,na.action=na.exclude)
predict(mod2)
##    1    2    3    4    5    6    7    8    9   10 
##  100  200  300  400   NA  600  700  800  900 1000
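
Coming back to the original goal, the padded vector can be assigned straight into the data frame. If a predicted value is also wanted for the row whose y is missing (possible here because x is observed), one option is to pass the data as newdata. This is a minimal sketch assuming the same example data; the column names y_fit and y_pred are made up for illustration.

```r
df <- data.frame(x = 1:10, y = 100 * (1:10))
df$y[5] <- NA

mod2 <- lm(y ~ x + 1, data = df, na.action = na.exclude)

# Same length as nrow(df); NA in the row where y was missing
df$y_fit <- predict(mod2)

# With newdata, predict() only needs the predictors, so row 5 gets a value too
df$y_pred <- predict(mod2, newdata = df)

print(df)
```

Neither call touches individual coefficients, so the same pattern should carry over to a model with many predictors.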
