Reputation: 437
Forgive me if my title is unclear, but I couldn't think of a clear way to summarize what I'm after.
I'm working with the Titanic dataset to learn logistic regression. The idea is to develop a model to predict survival. The data includes passenger Age. Using that attribute, I converted Age into a factor like this:

Age_labels <- c('0-10', '11-17', '18-29', '30-39', '40-49', '50-59', '60-69', '70-79')
train_data$AgeGroup <- cut(train_data$Age, c(0, 11, 18, 30, 40, 50, 60, 70, 80), include.lowest = TRUE, labels = Age_labels)
With the model fitted, I'm ready to use it to predict survival on the test data, but I get an error when I try
test_data_predictions <- predict(my_model, newdata = test_data, type = "response")
Here's the error:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : factor AgeGroup has new levels 60-69, 70-79
Why? The message seems to say the problem is that the test data includes passengers in the 60-69 and 70-79 AgeGroups, whereas the train data did not include passengers in those age ranges. Or does the error actually mean something else?
Obviously I want to use this model to predict the survival of any passenger, regardless of age.
Here is a potential clue: str() tells me that AgeGroup in my test_data is a factor with 8 levels, whereas in train_data it's a factor with 6 levels. Also, there are no NAs in either train_data or test_data.
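In case it helps, here is a quick way to see exactly which levels differ (just base R, using the same column names as above):

## which AgeGroup levels appear in test_data but not in train_data?
setdiff(levels(test_data$AgeGroup), levels(train_data$AgeGroup))
## and confirm there are no NAs in the AgeGroup columns
sum(is.na(train_data$AgeGroup))
sum(is.na(test_data$AgeGroup))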
How do I correct the error so I can move on to actual predictions? Thanks
Note: I haven't included data, as this question does not seem to require a reproducible example to answer.
UPDATE
Per the suggestion by @sjp, I went back and treated AgeGroup as a continuous variable (as numeric). Doing so has adverse effects: AIC goes up, the binned residual plot now looks rather poor (too many points fall outside the bins), and Hosmer-Lemeshow now says "Summary: model does not fit well". So, passing AgeGroup as numeric does make it possible to use the model to make predictions on the test data, but I worry the price is too high.
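For context, the numeric refit looks roughly like this. This is a simplified sketch with AgeGroup as the only predictor; Survived is assumed to be the outcome column, and my_model / my_model_num are just illustrative names:

## original fit, with AgeGroup as a factor
my_model <- glm(Survived ~ AgeGroup, family = binomial, data = train_data)
## refit treating AgeGroup as numeric (its integer codes), per @sjp's suggestion
my_model_num <- glm(Survived ~ as.numeric(AgeGroup), family = binomial, data = train_data)
## compare the two fits; in my case AIC is higher (worse) for the numeric version
AIC(my_model, my_model_num)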
Upvotes: 0
Views: 42
Reputation: 226182
tl;dr Because there are age groups present in the test set that are not in the training set (due to random sampling of small categories), R can't make predictions for those test cases. You can use caret::createDataPartition(age_group) to create a train/test split that is balanced on the age-group variable (and hence is not missing any categories). The help page (?createDataPartition) warns you that "for ... very small class sizes (<= 3) the classes may not show up in both the training and test data", but it seems to work OK here (the smallest group has n=6).
tt <- transform(carData::TitanicSurvival,
                age_group = cut(age,
                                breaks = c(0, 11, 18, 30, 40, 50, 60, 70, 80)))
set.seed(101)
## allocate a small fraction (10%) to the training set to make
## it easier to get missing classes in the training set
split1 <- sample(c("train", "test"),
                 size = nrow(tt),
                 replace = TRUE,
                 prob = c(0.1, 0.9))
m1 <- glm(survived ~ age_group, family = binomial,
          data = tt[split1 == "train", ])
try(predict(m1, newdata = tt[split1 == "test",]))
This gives
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : factor age_group has new levels (70,80]
as in the original example.
library(caret)
set.seed(101)
w <- createDataPartition(tt$age_group, p = 0.1)$Resample1
table(tt$age_group[w])
table(tt$age_group[-w])
m2 <- glm(survived ~ age_group, family = binomial, data = tt[w,])
predict(m2, newdata = tt[-w,])
This works OK. Using table(tt$age_group[w]) and table(tt$age_group[-w]) confirms that every age class is present in both the training and the test set, although it doesn't cause any problems if classes are missing from the test set only ...
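If you'd rather not eyeball the tables, a minimal programmatic check (reusing the w index from the code above) would be something like:

## every age_group level should have at least one observation in the
## training rows (w); levels that are empty only in the test rows are harmless
stopifnot(all(table(tt$age_group[w]) > 0))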
Upvotes: 1