WTB

Reputation: 25

How to handle one categorical variable in multiple columns using tidymodels?

I have a dataset in which one categorical variable is spread across multiple columns, like this:

ID  Pet_1  Pet_2  Pet_3  Siblings  Income  Result
1   dog    horse  cat    0         90000   0
2   cat    bird   NA     1         50000   1
3   NA     NA     NA     3         75000   1
4   horse  dog    snake  1         120000  0

There's an ID column, a set of columns that are really one variable (Pet_1 to Pet_3) in which order doesn't matter and values can be missing, other predictor columns, and the response.

How can I handle the set of columns that belong together using tidymodels? For example, dog in Pet_1 should have the same effect as dog in Pet_3. I was thinking of pulling those columns out, pivoting longer, running an encoding step, and aggregating the result back to one row per ID, but I don't think it's possible to aggregate in a recipe step.

Upvotes: 1

Views: 234

Answers (1)

EmilHvitfeldt

Reputation: 3185

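You are correct that there isn't a good way to do aggregation inside recipes. We do have one step that works well with the data you have here: step_dummy_multi_choice() takes several columns and creates one shared set of dummy variables from all the labels found across them. In the output below the new columns all carry the Pet_1 prefix, but each one indicates whether that pet appears in any of Pet_1, Pet_2, or Pet_3 for that row, and a row with only missing values gets 0 everywhere.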

library(recipes)
library(tibble)

example_data <- tibble(
  ID = c(1, 2, 3, 4), 
  Pet_1 = c("dog", "cat", NA, "horse"),
  Pet_2 = c("horse", "bird", NA, "dog"),
  Pet_3 = c("cat", NA, NA, "snake"), 
  Siblings = c(0, 1, 3, 1),
  Income = c(90000,  50000, 75000, 120000), 
  Result = c(0, 1, 1, 0)
)

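# Collapse Pet_1..Pet_3 into one shared set of indicator columns
rec_spec <- recipe(Result ~ ., example_data) %>%
  step_dummy_multi_choice(starts_with("Pet_"))

# Estimate the step on the training data and return the processed data
rec_spec %>%
  prep() %>%
  bake(new_data = NULL)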
#> # A tibble: 4 × 9
#>      ID Siblings Income Result Pet_1_bird Pet_1_cat Pet_1_dog Pet_1_horse
#>   <dbl>    <dbl>  <dbl>  <dbl>      <int>     <int>     <int>       <int>
#> 1     1        0  90000      0          0         1         1           1
#> 2     2        1  50000      1          1         1         0           0
#> 3     3        3  75000      1          0         0         0           0
#> 4     4        1 120000      0          0         0         1           1
#> # … with 1 more variable: Pet_1_snake <int>

Created on 2022-06-24 by the reprex package (v2.0.1)
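
For comparison, the manual route described in the question (pivot longer, encode, collapse back to one row per ID) can be done with dplyr and tidyr before the data reaches a recipe. This is only a sketch under some assumptions: the names `pet_indicators` and `wide` and the `Pet_` column prefix are made up for illustration, and because it runs outside the recipe, the set of indicator columns is whatever happens to be in the data you pass in rather than something learned from the training set.

```r
library(dplyr)
library(tidyr)
library(tibble)

example_data <- tibble(
  ID = c(1, 2, 3, 4),
  Pet_1 = c("dog", "cat", NA, "horse"),
  Pet_2 = c("horse", "bird", NA, "dog"),
  Pet_3 = c("cat", NA, NA, "snake"),
  Siblings = c(0, 1, 3, 1),
  Income = c(90000, 50000, 75000, 120000),
  Result = c(0, 1, 1, 0)
)

# One row per (ID, pet) pair; missing pets are dropped
pet_indicators <- example_data %>%
  select(ID, starts_with("Pet_")) %>%
  pivot_longer(starts_with("Pet_"), values_to = "pet", values_drop_na = TRUE) %>%
  distinct(ID, pet) %>%
  mutate(present = 1L) %>%
  # Back to one row per ID, one 0/1 column per pet
  pivot_wider(
    names_from = pet,
    values_from = present,
    names_prefix = "Pet_",
    values_fill = 0L
  )

# Swap the original Pet_1..Pet_3 columns for the indicators.
# IDs with no pets at all are absent from pet_indicators, so the join
# leaves NA for them; those are filled with 0.
wide <- example_data %>%
  select(-starts_with("Pet_")) %>%
  left_join(pet_indicators, by = "ID") %>%
  mutate(across(starts_with("Pet_"), ~ replace_na(.x, 0L)))

wide
```

The recipe step is usually the better choice when the data will be resampled or used for prediction, since the labels are then fixed when the recipe is prepped.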

Upvotes: 3
