Reputation: 63
I don't understand how to generate predicted values from a linear regression using the predict.lm command when some values of the dependent variable Y are missing, even though no independent X observation is missing. Algebraically, this isn't a problem, but I don't know an efficient way to do it in R. Take, for example, this fake dataframe and regression model. I try to assign the predictions to the source dataframe, but a single missing Y value causes an error:
# Create a fake dataframe
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(100,200,300,400,NA,600,700,800,900,100)
df <- as.data.frame(cbind(x,y))
# Regress X and Y
model<-lm(y~x+1)
summary(model)
# Attempt to generate predictions in source dataframe but am unable to.
df$y_ip<-predict.lm(testy)
Error in `$<-.data.frame`(`*tmp*`, y_ip, value = c(221.............
replacement has 9 rows, data has 10
I got around this problem by generating the predictions algebraically, df$y<-B0+ B1*df$x, or by pulling the coefficients out of the model, df$y<-((summary(model)$coefficients[1, 1]) + (summary(model)$coefficients[2, 1]*(df$x))); however, I am now working with a big data model with hundreds of coefficients, and these methods are no longer practical. I'd like to know how to do it using the predict function.
Thank you in advance for your assistance!
Upvotes: 4
Views: 5272
Reputation: 226162
R has built-in (though not necessarily obvious) functionality for this: the na.action argument combined with the na.exclude function (see ?na.exclude). With na.action=na.exclude set, predict() (and similar downstream processing functions) will automatically fill in NA values in the relevant spots.
Set up data:
df <- data.frame(x=1:10,y=100*(1:10))
df$y[5] <- NA
Fit the model. The default na.action is na.omit, which simply removes non-complete cases:
mod1 <- lm(y~x+1,data=df)
predict(mod1)
## 1 2 3 4 6 7 8 9 10
## 100 200 300 400 600 700 800 900 1000
na.exclude also removes non-complete cases before fitting, but then restores them (filled with NA) in predicted vectors:
mod2 <- update(mod1,na.action=na.exclude)
predict(mod2)
## 1 2 3 4 5 6 7 8 9 10
## 100 200 300 400 NA 600 700 800 900 1000
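To tie this back to the question, here is a minimal sketch using the same data as above. Because the padded vectors have one entry per row, they can be assigned straight into the data frame. The column names y_fit and y_res are illustrative only and do not come from the original post.

```r
# Rebuild the example data used in this answer
df <- data.frame(x = 1:10, y = 100 * (1:10))
df$y[5] <- NA

# Fit with na.exclude so that downstream vectors are padded with NA
mod2 <- lm(y ~ x + 1, data = df, na.action = na.exclude)

# Both vectors have length nrow(df), so the assignments succeed
df$y_fit <- predict(mod2)    # hypothetical column name
df$y_res <- residuals(mod2)  # hypothetical column name

# Inspect the rows around the missing observation
df[4:6, ]
```

Note that row 5 stays NA here: na.exclude keeps the positions aligned, but it does not predict a value for the incomplete case.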
Upvotes: 6
Reputation: 8377
Actually, you are not using the predict.lm function correctly. Either way, you have to pass the model itself as its first argument, here model, with or without new data. Without new data, it only predicts on the training data, which excludes your NA row, so you need this workaround to fit the original data.frame:
df$y_ip[!is.na(df$y)] <- predict.lm(model)
Or specify the new data explicitly. Since the new x has one more row than the training x, the missing row is filled in with a new prediction:
df$y_ip <- predict.lm(model, newdata = df)
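As a self-contained sketch of the newdata approach, assuming the data from the question: the only changes are that the model is fitted with data = df and that the generic predict() is called, which dispatches to predict.lm for lm objects.

```r
# Data from the question (y is missing in row 5)
df <- data.frame(x = 1:10,
                 y = c(100, 200, 300, 400, NA, 600, 700, 800, 900, 100))

# Fit on the complete cases; the NA row is dropped by the default na.action
model <- lm(y ~ x + 1, data = df)

# newdata contains all 10 x values, so every row gets a prediction
df$y_ip <- predict(model, newdata = df)

length(df$y_ip) == nrow(df)  # checks there is one prediction per row
anyNA(df$y_ip)               # checks whether any prediction is missing
```

Unlike na.exclude, this gives row 5 an actual predicted value, because the prediction only needs x.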
Upvotes: 2