Reputation: 44340
Consider a simple dataset, split into a training and testing set:
dat <- data.frame(x=1:5, y=c("a", "b", "c", "d", "e"), z=c(0, 0, 1, 0, 1))
train <- dat[1:4,]
train
# x y z
# 1 1 a 0
# 2 2 b 0
# 3 3 c 1
# 4 4 d 0
test <- dat[5,]
test
# x y z
# 5 5 e 1
When I train a logistic regression model to predict z using x and obtain test-set predictions, all is well:
mod <- glm(z~x, data=train, family="binomial")
predict(mod, newdata=test, type="response")
# 5
# 0.5546394
However, this fails on an equivalent-looking logistic regression model with a "Factor has new levels" error:
mod2 <- glm(z~.-y, data=train, family="binomial")
predict(mod2, newdata=test, type="response")
# Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
# factor y has new level e
Since I removed y from my model equation, I'm surprised to see this error message. In my application, dat is very wide, so z~.-y is the most convenient model specification. The simplest workaround I can think of is removing the y variable from my data frame and then training the model with the z~. syntax, but I was hoping for a way to use the original dataset without the need to remove columns.
Upvotes: 43
Views: 64533
Reputation: 11024
If you are using the tidymodels framework, recipes has a way to exclude variables from modeling by changing their "role":
Now we can add roles to this recipe. We can use the update_role() function to let recipes know that flight and time_hour are variables with a custom role that we called "ID" (a role can have any character value). Whereas our formula included all variables in the training set other than arr_delay as predictors, this tells the recipe to keep these two variables but not use them as either outcomes or predictors.
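library(recipes)  # provides recipe(), update_role() and %>%
# train_data is the training split used in the quoted tutorial (not defined here)
flights_rec <-
  recipe(arr_delay ~ ., data = train_data) %>%
  update_role(flight, time_hour, new_role = "ID")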
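Applied to the data from the question, the same idea might look like the sketch below. It assumes the recipes package is installed, and it only builds the recipe and inspects the roles; actually fitting a model would still need the rest of a tidymodels workflow.

```r
library(recipes)  # assumed to be installed; provides recipe(), update_role() and %>%

dat <- data.frame(x = 1:5, y = c("a", "b", "c", "d", "e"), z = c(0, 0, 1, 0, 1))
train <- dat[1:4, ]

# Keep y in the data, but give it a non-predictor role ("ID" is an arbitrary label)
rec <- recipe(z ~ ., data = train) %>%
  update_role(y, new_role = "ID")

# summary() lists each variable together with its role
roles <- summary(rec)
roles[, c("variable", "role")]
```

Here y stays in the data frame but is no longer treated as a predictor, which is what the question asked for.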
Upvotes: 1
Reputation: 73782
We may generalize @matt_k's great solution to apply it to high-dimensional data where there are multiple factors with different levels in the training and test sets, like these:
dat2
# x y1 y2 z
# 1 1 a A 0
# 2 2 b B 0
# 3 3 c C 1
# 4 4 d D 0
# 5 5 e E 1
When we divide into test and training as before,
train <- dat2[1:4, ]
test <- dat2[5, ]
the levels of both y1 and y2 in test will differ from those in train, and we get the error.
mod <- glm(z ~ ., data=train, family="binomial")
predict(mod, newdata=test, type="response")
# Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
# factor y1 has new level e
With high-dimensional data, it's tedious to correct every failing factor by hand, so we might want to loop over them. The offending columns are either of class "factor" or of class "character" (as in our case). Since these are the ones recorded in the model's xlevels, we use a small helper that identifies them,
is.prone <- function(x) is.factor(x) | is.character(x)
and use it to select the columns whose levels we merge into the model's xlevels with Map.
id <- sapply(dat2, is.prone)
mod$xlevels <- Map(union, mod$xlevels, lapply(dat2[id], unique))
Then it should work.
predict(mod, newdata=test, type="response")
# 5
# 5.826215e-11
# Warning message:
# In predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
# prediction from a rank-deficient fit may be misleading
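The Map call above pairs mod$xlevels with the selected columns by position. A slightly more defensive variant, sketched below, matches them by name instead. The helper name add_new_levels is hypothetical, not part of any package. The fitted model has no coefficient for a level it never saw, so predictions for such rows should still be treated with caution.

```r
dat2 <- data.frame(x = 1:5, y1 = letters[1:5], y2 = LETTERS[1:5],
                   z = c(0, 0, 1, 0, 1))
train <- dat2[1:4, ]
test <- dat2[5, ]

# The fit may warn because the tiny training set is separable and rank-deficient
mod <- suppressWarnings(glm(z ~ ., data = train, family = "binomial"))

# Hypothetical helper: extend the stored levels of every variable that
# appears both in model$xlevels and in newdata, matching by name
add_new_levels <- function(model, newdata) {
  for (nm in intersect(names(model$xlevels), names(newdata))) {
    model$xlevels[[nm]] <- union(model$xlevels[[nm]],
                                 unique(as.character(newdata[[nm]])))
  }
  model
}

mod_fixed <- add_new_levels(mod, test)
# predict() no longer stops on the new levels; the rank-deficiency warning remains
pred <- suppressWarnings(predict(mod_fixed, newdata = test, type = "response"))
```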
Data:
dat2 <- structure(list(x = 1:5, y1 = c("a", "b", "c", "d", "e"), y2 = c("A",
"B", "C", "D", "E"), z = c(0, 0, 1, 0, 1)), class = "data.frame", row.names = c(NA,
-5L))
Upvotes: 4
Reputation: 9
I was confused about this issue for a long time, but the solution turned out to be simple. One of the variables, "traffic type", had 20 levels, and one of those levels (17) appeared in only one row. That row could therefore end up in either the training data or the test data. In my case it landed in the test data, hence the error: factor "traffic type" has new level 17, because no row with level 17 exists in the training data. I deleted this row from the data set, and the model now runs fine.
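A minimal sketch of that check, using made-up data and a hypothetical column name traffic_type: flag the test rows whose level never occurs in the training data, then drop them before predicting.

```r
# Toy data for illustration only; level "17" occurs just once, in the test set
train <- data.frame(traffic_type = c("1", "2", "2", "3"), y = c(0, 1, 0, 1))
test <- data.frame(traffic_type = c("2", "17"), y = c(1, 0))

# TRUE for test rows whose level was never seen during training
unseen <- !(test$traffic_type %in% train$traffic_type)

# Keep only the rows the model can score
test_seen <- test[!unseen, , drop = FALSE]
```

Dropping rows changes the evaluation set, so it is worth recording how many rows were removed.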
Upvotes: 0
Reputation: 4489
You could try updating mod2$xlevels[["y"]] in the model object:
mod2 <- glm(z~.-y, data=train, family="binomial")
# as.character() covers both cases: levels() returns NULL if y is a character column
mod2$xlevels[["y"]] <- union(mod2$xlevels[["y"]], unique(as.character(test$y)))
predict(mod2, newdata=test, type="response")
# 5
#0.5546394
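To see why the error appears even though y was subtracted in the formula, it can help to inspect the fitted object. The sketch below rebuilds the data from the question and compares the terms that enter the model with the variables whose levels were stored. y should be missing from the former but present in the latter, which is consistent with the error message.

```r
dat <- data.frame(x = 1:5, y = c("a", "b", "c", "d", "e"), z = c(0, 0, 1, 0, 1))
train <- dat[1:4, ]
mod2 <- glm(z ~ . - y, data = train, family = "binomial")

# Terms that actually enter the linear predictor
term_labels <- attr(terms(mod2), "term.labels")

# Variables whose levels were stored at fit time; predict() checks newdata against these
stored <- names(mod2$xlevels)

print(term_labels)
print(stored)
```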
Another option would be to exclude (but not remove) "y" from the training data:
mod2 <- glm(z~., data=train[,!colnames(train) %in% c("y")], family="binomial")
predict(mod2, newdata=test, type="response")
# 5
#0.5546394
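For a very wide data frame, a third possibility is to build the formula from the column names, so that y never enters the model frame and the original data set can be passed unchanged. This is a sketch using base R's reformulate(); the vector of excluded names is just the example from the question.

```r
dat <- data.frame(x = 1:5, y = c("a", "b", "c", "d", "e"), z = c(0, 0, 1, 0, 1))
train <- dat[1:4, ]
test <- dat[5, ]

# All columns except the response and the variables to exclude
predictors <- setdiff(names(train), c("y", "z"))

# Builds z ~ x here; with more columns, every remaining one becomes a term
f <- reformulate(predictors, response = "z")

mod3 <- glm(f, data = train, family = "binomial")
p3 <- predict(mod3, newdata = test, type = "response")
```

Because y is not mentioned in the formula at all, no levels are stored for it, and predict() has nothing to check it against.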
Upvotes: 50