Reputation: 163
I preprocessed a training data set (A) and now want to reproduce these steps for a test set (B) using the R recipes package.
The problem is that the test set contains new factor levels, which I want to ignore:
library(recipes)
(A <- data.frame(a = c(1:19, NA), b = factor(c(rep("l1",18), "l2", NA))))
(B <- data.frame(a = c(1:3, NA), b = factor(c("l1", "l2", NA, "l3"))))
rec.task <-
  recipe(~ ., data = A) %>%
  step_unknown(all_predictors(), -all_numeric()) %>%
  step_medianimpute(all_numeric()) %>%
  step_other(all_predictors(), -all_numeric(), threshold = 0.1, other = ".merged") %>%
  step_dummy(all_predictors(), -all_numeric())
tr.recipe <- prep(rec.task, training = A)
(AA <- juice(tr.recipe))
Now the problem is the NA in the following table:
(BB <- bake(tr.recipe, B))
      a b_.merged
  <dbl>     <dbl>
1     1         0
2     2         1
3     3         1
4    10        NA
Warning message:
There are new levels in a factor: NA
Can I avoid this somehow within these steps? Can I impute zero for the NAs inside the recipes procedure? (I am not interested in a base R or dplyr solution.)
Upvotes: 1
Views: 1152
Reputation: 163
As topepo explained, step_novel() is a possible solution. It is placed first so that levels unseen in training are relabeled before the remaining factor steps run. Change the assignment of rec.task as follows:
rec.task <-
  recipe(~ ., data = A) %>%
  step_novel(all_predictors(), -all_numeric()) %>%
  step_unknown(all_predictors(), -all_numeric()) %>%
  step_medianimpute(all_numeric()) %>%
  step_other(all_predictors(), -all_numeric(), threshold = 0.1, other = ".merged") %>%
  step_dummy(all_predictors(), -all_numeric()) %>%
  step_zv(all_predictors())
Then the output will be:
# A tibble: 4 x 2
      a b_.merged
  <dbl>     <dbl>
1     1         0
2     2         1
3     3         1
4    10         1
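To see why the extra step removes the NA, the idea behind step_novel can be sketched in plain R. This only illustrates the mechanism and is not how recipes implements it. The helper assign_novel and the variable names are hypothetical, and "new" is assumed as the label for the reserved level.

```r
# Levels seen in the training column A$b (NA is not a level)
train_levels <- c("l1", "l2")

# Raw test values from B$b: one missing value and one unseen level "l3"
b_test <- c("l1", "l2", NA, "l3")

# Without a reserved level, enforcing the training levels turns "l3" into NA
naive <- factor(b_test, levels = train_levels)

# Hypothetical helper: send unseen, non-missing values to a reserved level
assign_novel <- function(x, known, novel = "new") {
  x[!is.na(x) & !(x %in% known)] <- novel
  factor(x, levels = c(known, novel))
}

with_novel <- assign_novel(b_test, train_levels)

print(is.na(naive))              # the unseen "l3" and the real NA are both NA
print(as.character(with_novel))  # "l3" is now "new"; only the real NA remains
```

In the recipe above, the remaining genuine NA is then handled by step_unknown, and step_other lumps the rare levels into ".merged", which is why row 4 ends up as 1 instead of NA.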
Upvotes: 1