thebilly

Reputation: 57

recipes::step_num2factor() leaves last level as NA when baking despite enough levels supplied (MWE supplied)

step_num2factor() converts every value to the correct level except the last one, which it fills in with NA.

MWE

library(recipes)
test <- tibble(pred = c(0, 1, 2, 3, 4, 5, 8), target = c(0, 1, 0, 1, 1, 1, 0))

It looks like this when printed:

# A tibble: 7 x 2
   pred target
  <dbl>  <dbl>
1     0      0
2     1      1
3     2      0
4     3      1
5     4      1
6     5      1
7     8      0

Doing the recipe steps and comparing results

test <- tibble(pred = c(0, 1, 2, 3, 4, 5, 8), target = c(0,1,0,1,1,1,0))

my_levels <- c("zero", "one", "two", "three", "four", "five", "eight")

recipe(target ~ pred, data = test) %>%
  step_num2factor(pred, levels = my_levels, transform = function(x) x + 1) %>%
  prep(training = test) %>%
  bake(new_data = test)

Remark: the transform is needed because pred contains 0, and a factor cannot have a level at position 0. (source)

Transformed dataset after prepping and baking

# A tibble: 7 x 2
  pred  target
  <fct>  <dbl>
1 zero       0
2 one        1
3 two        0
4 three      1
5 four       1
6 five       1
7 NA         0

The NA is not supposed to be there. It is supposed to be the category "eight". What am I doing wrong?

Remark: I also tried "six" as the last level, in case the function only accepts number words rather than arbitrarily named levels, but that made no difference.

Upvotes: 0

Views: 83

Answers (1)

EmilHvitfeldt

Reputation: 3185

You need to make sure that your input, levels, and transform line up exactly. You were right to use transform = function(x) x + 1, since you are trying to capture 0. With that transform, an input of n selects the (n+1)th value of levels.

When your input is 8, step_num2factor() returns the 8+1=9th value of levels, which does not exist because levels only has length 7, resulting in the NA you see. The code below illustrates this: with an input of 6 instead of 8, the 6+1=7th value of levels, "eight", is selected.

library(recipes)

my_levels <- c("zero", "one", "two", "three", "four", "five", "eight")

test <- tibble(pred = c(0, 1, 2, 3, 4, 5, 6), target = c(0,1,0,1,1,1,0))

recipe(target ~ pred, data = test) %>% 
  step_num2factor(pred, levels = my_levels, transform = function(x) x + 1) %>% 
  prep() %>% 
  bake(new_data = NULL)
#> # A tibble: 7 x 2
#>   pred  target
#>   <fct>  <dbl>
#> 1 zero       0
#> 2 one        1
#> 3 two        0
#> 4 three      1
#> 5 four       1
#> 6 five       1
#> 7 eight      0
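
The positional lookup described above can be reproduced in base R without recipes. This is only a sketch of the behaviour as explained in this answer, not the package's actual source; the names idx and looked_up are made up for illustration.

```r
# Base R sketch of the positional lookup (an approximation of the behaviour
# described above, not the recipes implementation)
my_levels <- c("zero", "one", "two", "three", "four", "five", "eight")
pred <- c(0, 1, 2, 3, 4, 5, 8)

# Same shift as transform = function(x) x + 1
idx <- pred + 1

# Indexing a character vector past its end yields NA rather than an error,
# so the value 8 (position 9) has no level to pick from
looked_up <- my_levels[idx]
print(looked_up)

# Only the last element is missing
print(which(is.na(looked_up)))
```

The first six values resolve to positions 1 through 6, while position 9 lies beyond the seven supplied levels, which is why only the last row turns into NA.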

To fix your problem, make sure there are no gaps in my_levels, so that the 9th position actually holds "eight":

test <- tibble(pred = c(0, 1, 2, 3, 4, 5, 8), target = c(0,1,0,1,1,1,0))

my_levels <- c("zero", "one", "two", "three", "four", "five", 
               "six", "seven", "eight", "nine", "ten")

recipe(target ~ pred, data = test) %>% 
  step_num2factor(pred, levels = my_levels, transform = function(x) x + 1) %>% 
  prep() %>% 
  bake(new_data = NULL)
#> # A tibble: 7 x 2
#>   pred  target
#>   <fct>  <dbl>
#> 1 zero       0
#> 2 one        1
#> 3 two        0
#> 4 three      1
#> 5 four       1
#> 6 five       1
#> 7 eight      0

Created on 2021-03-27 by the reprex package (v0.3.0)
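
If padding my_levels with unused levels is not wanted, one possible alternative is to map each value to its position among the values that actually occur. The sketch below shows the idea in base R only; observed and to_position are hypothetical names, and passing to_position as the transform argument of step_num2factor() is an assumption that has not been verified here.

```r
# Alternative sketch: map each value to its rank among the observed values
# (observed and to_position are hypothetical names used for illustration)
observed <- c(0, 1, 2, 3, 4, 5, 8)
my_levels <- c("zero", "one", "two", "three", "four", "five", "eight")
pred <- c(0, 1, 2, 3, 4, 5, 8)

# match() returns the integer position of each value within observed,
# so 8 maps to position 7 instead of position 9
to_position <- function(x) match(x, observed)

# Build the factor from the looked-up labels, keeping only the seven levels
pred_factor <- factor(my_levels[to_position(pred)], levels = my_levels)
print(pred_factor)
```

With this mapping the factor keeps exactly seven levels and the value 8 lands on "eight". A value that is not in observed would still become NA, so the list of observed values has to cover everything that can appear in new data.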

Upvotes: 2
