Steph Locke

Reputation: 6146

r predict glm score based on only partial records

I have a glm based on data A and I'd like to score data B to do validation, but some records in B have missing data.

Instead of these records coming back as NA or being dropped (e.g. via na.omit or na.exclude), I'd like each of them to get a prediction that the model computes from only the fields that have values.

A reproducible example...

data(mtcars)
model <- glm(mpg ~ ., data = mtcars)

# Replace a random proportion `prop` of the cells in df with NA
NAins <- function(df, prop = .1) {
  n <- nrow(df)
  m <- ncol(df)
  num.to.na <- ceiling(prop * n * m)
  id <- sample(0:(m * n - 1), num.to.na, replace = FALSE)
  rows <- id %/% m + 1
  cols <- id %% m + 1
  for (k in seq_len(num.to.na)) {
    df[rows[k], cols[k]] <- NA
  }
  df
}

mtcarsNA <- NAins(mtcars, .4)
mtcarsNA$mpg <- mtcars$mpg  # restore the response so only predictors have NAs
predict(model, newdata = mtcarsNA, type = "response")

I need the last line to return a non-NA result for every record. Can you point me in the direction of the code needed?

Upvotes: 0

Views: 538

Answers (1)

Ben Bolker

Reputation: 226557

Based on the conversation in the comments, you want to replace NA values with zero before predicting. This seems dangerous/dubious to me -- use at your own risk.

naZero <- function(x) { x[is.na(x)] <- 0; x }
mtcarszero <- mtcarsNA
mtcarszero[] <- lapply(mtcarsNA, naZero)  # [] keeps the data frame and its row names
predict(model, newdata = mtcarszero, type = "response")

should be what you want.
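
To see what the zero-fill actually does: with the identity link used here (the default gaussian family), a prediction is the intercept plus the sum of coefficient times value, so a zero makes that predictor's term drop out. The following is a minimal sketch that checks this by hand in base R. The seed, the simplified way of inserting NAs, and the names `predictors`, `pred` and `manual` are illustrative assumptions, not part of the original post.

```r
# Illustrative sketch: zero-filling an NA removes that predictor's term.
set.seed(42)                          # assumed seed, only for reproducibility
model <- glm(mpg ~ ., data = mtcars)  # gaussian family, identity link

# Blank out 12 random rows in every predictor column (mpg is left intact).
predictors <- setdiff(names(mtcars), "mpg")
mtcarsNA <- mtcars
for (p in predictors) {
  mtcarsNA[sample(nrow(mtcarsNA), 12), p] <- NA
}

naZero <- function(x) { x[is.na(x)] <- 0; x }
mtcarszero <- mtcarsNA
mtcarszero[] <- lapply(mtcarsNA, naZero)
pred <- predict(model, newdata = mtcarszero, type = "response")

# Manual version: intercept plus coefficient * value over non-missing cells only.
b <- coef(model)
X <- as.matrix(mtcarsNA[predictors])
manual <- b["(Intercept)"] +
  rowSums(sweep(X, 2, b[predictors], `*`), na.rm = TRUE)

# Compare the two sets of predictions.
all.equal(unname(pred), unname(manual))
```

Note that this only matches "use the data with values" in the sense of dropping terms; the remaining coefficients were still estimated with all predictors in the model, which is part of why the approach is risky.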

For categorical variables, if you are using default treatment contrasts, then I think the consistent thing to do is to replace NA with the first (baseline) level, something like this:

naZero <- function(x) {
  if (is.numeric(x)) {
    repVal <- 0
  } else if (is.factor(x)) {
    repVal <- levels(x)[1]
  } else {
    stop("uh-oh")
  }
  x[is.na(x)] <- repVal
  x
}
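
A short usage sketch for the factor case, assuming a smaller model with `cyl` converted to a factor (the names `dat`, `fit`, `newd` and `filled` are made up for illustration). Under treatment contrasts the first level has no dummy column, so falling back to it contributes nothing beyond the intercept, which mirrors the zero-fill for numeric columns.

```r
# Illustrative sketch: type-aware NA replacement with a factor predictor.
naZero <- function(x) {
  if (is.numeric(x)) {
    repVal <- 0
  } else if (is.factor(x)) {
    repVal <- levels(x)[1]   # baseline level under treatment contrasts
  } else {
    stop("unsupported column type")
  }
  x[is.na(x)] <- repVal
  x
}

dat <- mtcars[c("mpg", "wt", "cyl")]
dat$cyl <- factor(dat$cyl)           # levels "4", "6", "8"; "4" is the baseline
fit <- glm(mpg ~ wt + cyl, data = dat)

newd <- dat[1:3, ]
newd$cyl[1] <- NA                    # missing factor value
newd$wt[2]  <- NA                    # missing numeric value

filled <- newd
filled[] <- lapply(newd, naZero)     # [] keeps the data frame structure
pred <- predict(fit, newdata = filled, type = "response")

b <- coef(fit)
# Row 1: cyl falls back to the baseline, so only intercept and wt contribute.
all.equal(unname(pred[1]), unname(b["(Intercept)"] + b["wt"] * newd$wt[1]))
# Row 2: wt is zero-filled, so only intercept and the cyl6 effect contribute.
all.equal(unname(pred[2]), unname(b["(Intercept)"] + b["cyl6"]))
```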

Upvotes: 2
