ghlavin

Reputation: 163

How to handle NAs due to novel factor levels using R recipes?

I preprocessed a training data set (A) and now want to reproduce these steps for a test set (B) using the R recipes package.

The problem is that the test set contains new factor levels, which I want to ignore:

library(recipes)

(A <- data.frame(a = c(1:19, NA), b = factor(c(rep("l1",18), "l2", NA))))

(B <- data.frame(a = c(1:3, NA), b = factor(c("l1", "l2", NA, "l3"))))

rec.task <- 
  recipe(~ ., data = A) %>% 
  step_unknown(all_predictors(), -all_numeric()) %>% 
  step_medianimpute(all_numeric()) %>%  
  step_other(all_predictors(), -all_numeric(), threshold = 0.1, other=".merged") %>% 
  step_dummy(all_predictors(), -all_numeric()) 

tr.recipe <- prep(rec.task, training = A)
(AA <- juice(tr.recipe))

Now the problem is the NA in the following table:

(BB <- bake(tr.recipe, B))

      a b_.merged
  <dbl>     <dbl>
1     1         0
2     2         1
3     3         1
4    10        NA
Warning message:
There are new levels in a factor: NA 

Can I avoid this somehow within these steps? Can I impute zero for the NAs inside the recipes procedure (I am not interested in a base R or dplyr solution)?

Upvotes: 1

Views: 1152

Answers (2)

ghlavin

Reputation: 163

As topepo explained, step_novel() is a possible solution. Change the assignment of rec.task as follows, adding step_novel() before step_unknown() and step_zv() at the end:

rec.task <- 
  recipe(~ ., data = A) %>% 
  step_novel(all_predictors(), -all_numeric()) %>% 
  step_unknown(all_predictors(), -all_numeric()) %>% 
  step_medianimpute(all_numeric()) %>%  
  step_other(all_predictors(), -all_numeric(), threshold = 0.1, other=".merged") %>% 
  step_dummy(all_predictors(), -all_numeric()) %>% 
  step_zv(all_predictors())

Then the output will be:

# A tibble: 4 x 2
      a b_.merged
  <dbl>     <dbl>
1     1         0
2     2         1
3     3         1
4    10         1

Upvotes: 1

topepo

Reputation: 14331

step_novel() is the solution. See the dummy variables vignette.
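
To see what step_novel() does on its own, here is a minimal sketch on toy data shaped like the question's. The data frames and the names rec and baked are made up for illustration. It assumes a recipes version in which bake() takes new_data and step_novel() uses its default new_level of "new".

```r
library(recipes)

# Toy data: "l3" occurs only in the test set, as in the question.
A <- data.frame(b = factor(c(rep("l1", 18), "l2", "l2")))
B <- data.frame(b = factor(c("l1", "l2", "l3")))

# step_novel() reserves an extra factor level at prep() time;
# any value not seen in training is mapped to it at bake() time.
rec <- recipe(~ b, data = A) %>%
  step_novel(b) %>%
  prep(training = A)

baked <- bake(rec, new_data = B)

# Inspect the result: the unseen "l3" is assigned to the reserved level
# instead of becoming NA.
print(levels(baked$b))
print(as.character(baked$b))
```

Because the reserved level already exists when the recipe is prepped, later steps such as step_other() and step_dummy() know about it, which is presumably why step_novel() should come first in the pipeline.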

Upvotes: 0
