Reputation: 25
I have a dataset with one categorical variable spread across multiple columns, like this:
ID | Pet_1 | Pet_2 | Pet_3 | Siblings | Income | Result |
---|---|---|---|---|---|---|
1 | dog | horse | cat | 0 | 90000 | 0 |
2 | cat | bird | NA | 1 | 50000 | 1 |
3 | NA | NA | NA | 3 | 75000 | 1 |
4 | horse | dog | snake | 1 | 120000 | 0 |
There's an ID column, a set of columns that are really one variable (Pet_1 - Pet_3) where order doesn't matter and values can be missing, other predictor columns, and the response.
How can I handle the set of columns that go together using `tidymodels`? For example, dog in Pet_1 should have the same effect as dog in Pet_3. I was thinking about pulling those columns out, pivoting long, running an encoding step, and aggregating that result back to one row per ID. But I don't think it's possible to aggregate in a `recipe` step.
Upvotes: 1
Views: 234
Reputation: 3185
You are correct that there isn't a good way to do aggregation inside `recipes`. We do have one step that would work well with the data you have here: `step_dummy_multi_choice()` will create a set of dummy variables from the labels of multiple variables.
library(recipes)
library(tibble)
example_data <- tibble(
ID = c(1, 2, 3, 4),
Pet_1 = c("dog", "cat", NA, "horse"),
Pet_2 = c("horse", "bird", NA, "dog"),
Pet_3 = c("cat", NA, NA, "snake"),
Siblings = c(0, 1, 3, 1),
Income = c(90000, 50000, 75000, 120000),
Result = c(0, 1, 1, 0)
)
rec_spec <- recipe(Result ~ ., example_data) %>%
step_dummy_multi_choice(starts_with("Pet_"))
rec_spec %>%
prep() %>%
bake(new_data = NULL)
#> # A tibble: 4 × 9
#> ID Siblings Income Result Pet_1_bird Pet_1_cat Pet_1_dog Pet_1_horse
#> <dbl> <dbl> <dbl> <dbl> <int> <int> <int> <int>
#> 1 1 0 90000 0 0 1 1 1
#> 2 2 1 50000 1 1 1 0 0
#> 3 3 3 75000 1 0 0 0 0
#> 4 4 1 120000 0 0 0 1 1
#> # … with 1 more variable: Pet_1_snake <int>
Created on 2022-06-24 by the reprex package (v2.0.1)
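For comparison, the pivot-long-then-aggregate idea from the question can also be done outside the recipe, before the data reaches it. This is only a minimal sketch, assuming `dplyr` and `tidyr` are installed; the names `pet_indicators`, `wide_data`, and the `Pet_` column prefix are my own choices for illustration, not anything `recipes` produces.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Same toy data as in the question
example_data <- tibble(
  ID = c(1, 2, 3, 4),
  Pet_1 = c("dog", "cat", NA, "horse"),
  Pet_2 = c("horse", "bird", NA, "dog"),
  Pet_3 = c("cat", NA, NA, "snake"),
  Siblings = c(0, 1, 3, 1),
  Income = c(90000, 50000, 75000, 120000),
  Result = c(0, 1, 1, 0)
)

# One row per (ID, pet), ignoring which Pet_* column the value came from
pet_indicators <- example_data %>%
  select(ID, starts_with("Pet_")) %>%
  pivot_longer(starts_with("Pet_"), values_to = "pet", values_drop_na = TRUE) %>%
  distinct(ID, pet) %>%
  mutate(has = 1L) %>%
  # Back to one row per ID with a 0/1 column per pet
  pivot_wider(
    names_from = pet,
    values_from = has,
    names_prefix = "Pet_",
    values_fill = 0L
  )

# Rows with no pets at all (ID 3) drop out above, so join back and fill with 0
wide_data <- example_data %>%
  select(-starts_with("Pet_")) %>%
  left_join(pet_indicators, by = "ID") %>%
  mutate(across(starts_with("Pet_"), ~ coalesce(.x, 0L)))
```

The trade-off is that the set of pet columns is fixed by whatever data you run this on, so new data with an unseen pet would produce different columns. Keeping the encoding inside the recipe avoids that, since the levels are learned at `prep()` time.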
Upvotes: 3