josliber

Reputation: 44340

"Factor has new levels" error for variable I'm not using

Consider a simple dataset, split into a training and testing set:

dat <- data.frame(x=1:5, y=c("a", "b", "c", "d", "e"), z=c(0, 0, 1, 0, 1))
train <- dat[1:4,]
train
#   x y z
# 1 1 a 0
# 2 2 b 0
# 3 3 c 1
# 4 4 d 0
test <- dat[5,]
test
#   x y z
# 5 5 e 1

When I train a logistic regression model to predict z using x and obtain test-set predictions, all is well:

mod <- glm(z~x, data=train, family="binomial")
predict(mod, newdata=test, type="response")
#         5 
# 0.5546394 

However, this fails on an equivalent-looking logistic regression model with a "Factor has new levels" error:

mod2 <- glm(z~.-y, data=train, family="binomial")
predict(mod2, newdata=test, type="response")
# Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
#   factor y has new level e

Since I removed y from my model equation, I'm surprised to see this error message. In my application, dat is very wide, so z~.-y is the most convenient model specification. The simplest workaround I can think of is removing the y variable from my data frame and then training the model with the z~. syntax, but I was hoping for a way to use the original dataset without the need to remove columns.

Upvotes: 43

Views: 64533

Answers (4)

qwr

Reputation: 11024

If you are using the tidymodels framework, recipes has a way to exclude variables from modeling by changing their "role":

Now we can add roles to this recipe. We can use the update_role() function to let recipes know that flight and time_hour are variables with a custom role that we called "ID" (a role can have any character value). Whereas our formula included all variables in the training set other than arr_delay as predictors, this tells the recipe to keep these two variables but not use them as either outcomes or predictors.

flights_rec <- 
  recipe(arr_delay ~ ., data = train_data) %>% 
  update_role(flight, time_hour, new_role = "ID") 

Upvotes: 1

jay.sf

Reputation: 73782

We may generalize @matt_k's great solution to apply it to high-dimensional data where there are multiple factors with different levels in the training and test sets, like these:

dat2
#   x y1 y2 z
# 1 1  a  A 0
# 2 2  b  B 0
# 3 3  c  C 1
# 4 4  d  D 0
# 5 5  e  E 1

When we divide into test and training as before,

train <- dat2[1:4, ]
test <- dat2[5, ]

both y1 and y2 test levels will differ from those of train and we get the error.

mod <- glm(z ~ ., data=train, family="binomial")
predict(mod, newdata=test, type="response")
# Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
#   factor y1 has new level e

With high-dimensional data, it's tedious to correct every failing factor by hand, so we might want to loop over them.

The problematic columns are either of class "factor" or of class "character" (as in our case). Since these are the ones recorded in the model's 'xlevels', we use a small helper that identifies them,

is.prone <- function(x) is.factor(x) | is.character(x)

and use it to select the columns whose values are merged into 'xlevels' with Map.

id <- sapply(dat2, is.prone)
mod$xlevels <- Map(union, mod$xlevels, lapply(dat2[id], unique))

Then it should work.

predict(mod, newdata=test, type="response")
#            5 
# 5.826215e-11 
# Warning message:
# In predict.lm(object, newdata, se.fit, scale = 1, type = if (type ==  :
#   prediction from a rank-deficient fit may be misleading

Data:

dat2 <- structure(list(x = 1:5, y1 = c("a", "b", "c", "d", "e"), y2 = c("A", 
"B", "C", "D", "E"), z = c(0, 0, 1, 0, 1)), class = "data.frame", row.names = c(NA, 
-5L))
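The loop above matches columns to 'xlevels' by position. A minimal sketch of a name-based variant follows; extend_xlevels is a hypothetical helper (not part of base R), and the example assumes the factors were dropped from the formula, as in the question, so that no coefficient is needed for the new levels.

```r
# Hypothetical helper (an assumption, not a base R function): extend a fitted
# model's xlevels with the values found in newdata, matching columns by name.
extend_xlevels <- function(model, newdata) {
  for (nm in intersect(names(model$xlevels), names(newdata))) {
    seen <- unique(as.character(newdata[[nm]]))
    seen <- seen[!is.na(seen)]  # do not turn NA into a level
    model$xlevels[[nm]] <- union(model$xlevels[[nm]], seen)
  }
  model
}

dat2 <- data.frame(x = 1:5,
                   y1 = c("a", "b", "c", "d", "e"),
                   y2 = c("A", "B", "C", "D", "E"),
                   z = c(0, 0, 1, 0, 1))
train <- dat2[1:4, ]
test <- dat2[5, ]

# y1 and y2 are removed from the formula but are still recorded in xlevels.
mod <- glm(z ~ . - y1 - y2, data = train, family = "binomial")
mod <- extend_xlevels(mod, test)
pred <- predict(mod, newdata = test, type = "response")
pred
```

If the factors are actually used as predictors, a level unseen in training has no fitted coefficient, so predictions for such rows should be treated with caution, as the rank-deficiency warning above suggests.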

Upvotes: 4

I was confused by this issue for a long time, but the solution turned out to be simple. One of my variables, "traffic type", had 20 levels, and one level (17) appeared in only one row. That row could therefore end up in either the training data or the test data. In my case it was in the test data, hence the error: factor "traffic type" has new level 17, because no row with level 17 exists in the training data. I deleted this row from the dataset and the model now runs fine.
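Such rows can be located before fitting rather than discovered through the error. This is a small sketch using the question's dat as stand-in data; new_levels and safe_test are illustrative names, not part of any package.

```r
dat <- data.frame(x = 1:5, y = c("a", "b", "c", "d", "e"), z = c(0, 0, 1, 0, 1))
train <- dat[1:4, ]
test <- dat[5, ]

# Values of y that occur in the test set but never in the training set.
new_levels <- setdiff(as.character(test$y), as.character(train$y))
new_levels

# Test rows whose y value was also seen during training.
safe_test <- test[!(test$y %in% new_levels), ]
nrow(safe_test)
```

Whether to drop, recode, or otherwise handle the offending rows depends on the application; the sketch only identifies them.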

Upvotes: 0

matt_k

Reputation: 4489

You could try updating mod2$xlevels[["y"]] in the model object

mod2 <- glm(z~.-y, data=train, family="binomial")
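# as.character() also covers character columns (the data.frame() default since R 4.0.0),
# for which levels() returns NULL
mod2$xlevels[["y"]] <- union(mod2$xlevels[["y"]], unique(as.character(test$y)))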

predict(mod2, newdata=test, type="response")
#        5 
#0.5546394 
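To see why this works, it may help to inspect the fitted object. The sketch below uses the question's data, with stringsAsFactors = TRUE set explicitly so that it behaves the same across R versions, and checks that y is absent from the model terms yet still recorded in xlevels, which is what predict() passes to model.frame() when it validates newdata.

```r
dat <- data.frame(x = 1:5, y = c("a", "b", "c", "d", "e"), z = c(0, 0, 1, 0, 1),
                  stringsAsFactors = TRUE)
train <- dat[1:4, ]

mod2 <- glm(z ~ . - y, data = train, family = "binomial")

# y is not among the terms that enter the model matrix ...
term_labels <- attr(terms(mod2), "term.labels")
term_labels

# ... but its training levels are still stored on the model object.
xlev_names <- names(mod2$xlevels)
xlev_names
mod2$xlevels[["y"]]
```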

Another option would be to exclude (but not remove) "y" from the training data

mod2 <- glm(z~., data=train[,!colnames(train) %in% c("y")], family="binomial")
predict(mod2, newdata=test, type="response")
#        5 
#0.5546394 
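A further possibility, offered as a sketch rather than as part of the options above, is to build the formula from the column names with reformulate(), so the excluded column never enters the model frame and the data frame itself stays untouched. The names drop_cols, predictors, and mod3 are illustrative.

```r
dat <- data.frame(x = 1:5, y = c("a", "b", "c", "d", "e"), z = c(0, 0, 1, 0, 1))
train <- dat[1:4, ]
test <- dat[5, ]

# Columns to leave out of the model (illustrative; extend as needed for wide data).
drop_cols <- c("y")
predictors <- setdiff(names(train), c("z", drop_cols))

# Builds z ~ x here; y is not referenced at all.
f <- reformulate(predictors, response = "z")
mod3 <- glm(f, data = train, family = "binomial")

pred3 <- predict(mod3, newdata = test, type = "response")
pred3
```

Because y never appears in the formula, it is not recorded in mod3$xlevels and predict() does not check its levels in the test set.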

Upvotes: 50
