Reputation: 437
Forgive me if my title is unclear, but I couldn't think of a clear way to summarize what I'm after.
I'm working with the Titanic dataset to learn logistic regression. The idea is to develop a model to predict survival. The data includes passenger Age. Using that attribute, I converted Age into a factor like this:

Age_labels <- c('0-10', '11-17', '18-29', '30-39', '40-49', '50-59', '60-69', '70-79')
train_data$AgeGroup <- cut(train_data$Age, c(0, 11, 18, 30, 40, 50, 60, 70, 80), include.lowest = TRUE, labels = Age_labels)
With the model fitted, I'm ready to use it to predict survival on the test data, but I get an error when I try
test_data_predictions <- predict(my_model, newdata = test_data, type = "response")
Here's the error:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : factor AgeGroup has new levels 60-69, 70-79
Why? The message seems to say the problem is that the test data includes passengers in the 60-69 and 70-79 AgeGroups, whereas the train data did not include passengers in those age ranges. Or does the error actually mean something else?
Obviously I want to use this model to predict the survival of any passenger, regardless of age.
Here is a potential clue: str() tells me that AgeGroup in my test_data is a factor with 8 levels, whereas in train_data it's a factor with 6 levels. Also, there are no NAs in either train_data or test_data.
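In case it helps, here is a quick way to see exactly which levels differ (just base R, using the same column names as above):

## which AgeGroup levels appear in test_data but not in train_data?
setdiff(levels(test_data$AgeGroup), levels(train_data$AgeGroup))
## and confirm there are no NAs in the AgeGroup columns
sum(is.na(train_data$AgeGroup))
sum(is.na(test_data$AgeGroup))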
How do I correct the error so I can move on to actual predictions? Thanks
Note: I haven't included data, as this question does not seem to require a reproducible example to answer.
UPDATE
Per the suggestion by @sjp, I went back and treated AgeGroup as a continuous variable (as numeric). Doing so has adverse effects: AIC goes up, the binned residual plot now looks rather poor (too many points fall outside the bins), and Hosmer-Lemeshow now says "Summary: model does not fit well". So, passing AgeGroup as numeric does make it possible to use the model to make predictions on the test data, but I worry the price is too high.
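For context, the numeric refit looks roughly like this. This is a simplified sketch with AgeGroup as the only predictor; Survived is assumed to be the outcome column, and my_model / my_model_num are just illustrative names:

## original fit, with AgeGroup as a factor
my_model <- glm(Survived ~ AgeGroup, family = binomial, data = train_data)
## refit treating AgeGroup as numeric (its integer codes), per @sjp's suggestion
my_model_num <- glm(Survived ~ as.numeric(AgeGroup), family = binomial, data = train_data)
## compare the two fits; in my case AIC is higher (worse) for the numeric version
AIC(my_model, my_model_num)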
Upvotes: 0
Views: 42
Reputation: 226182
tl;dr Because there are age groups present in the test set that are not in the training set (due to random sampling of small categories), R can't make predictions for those test cases. You can use caret::createDataPartition(age_group) to create a train/test split that is balanced on the age-group variable (and hence is not missing any categories). The help page (?createDataPartition) warns you that "for ... very small class sizes (<= 3) the classes may not show up in both the training and test data", but it seems to work OK here (the smallest group has n=6).
tt <- transform(carData::TitanicSurvival,
                age_group = cut(age,
                                breaks = c(0, 11, 18, 30, 40, 50, 60, 70, 80)))
set.seed(101)
## allocate a small fraction (10%) to the training set to make
## it easier to get missing classes in the training set
split1 <- sample(c("train", "test"),
                 size = nrow(tt),
                 replace = TRUE,
                 prob = c(0.1, 0.9))
m1 <- glm(survived ~ age_group, family = binomial,
          data = tt[split1 == "train", ])
try(predict(m1, newdata = tt[split1 == "test",]))
This gives
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : factor age_group has new levels (70,80]
as in the original example.
library(caret)
set.seed(101)
w <- createDataPartition(tt$age_group, p = 0.1)$Resample1
table(tt$age_group[w])
table(tt$age_group[-w])
m2 <- glm(survived ~ age_group, family = binomial, data = tt[w,])
predict(m2, newdata = tt[-w,])
This works OK. Using table(tt$age_group[w]) and table(tt$age_group[-w]) confirms that every age class is present in both the training and the test set, although it doesn't cause any problems if classes are missing from the test set only ...
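If you'd rather not eyeball the tables, a minimal programmatic check (reusing the w index from the code above) would be something like:

## every age_group level should have at least one observation in the
## training rows (w); levels that are empty only in the test rows are harmless
stopifnot(all(table(tt$age_group[w]) > 0))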
Upvotes: 1