Reputation: 78917
In the iris dataset, Species is a factor variable with 3 levels ("setosa", "versicolor", "virginica"). I would like to create 3 additional columns, named "setosa", "versicolor" and "virginica", each holding TRUE or FALSE as a logical variable. In short: I would like to dichotomize the levels of Species into 3 new logical columns. My code works, but I wonder if there is a more straightforward way:
library(dplyr)

# Build a 0/1 indicator column for each level of Species
df <- iris %>%
  select(Species) %>%
  mutate(setosa = case_when(Species == "setosa" ~ 1,
                            TRUE ~ 0),
         versicolor = case_when(Species == "versicolor" ~ 1,
                                TRUE ~ 0),
         virginica = case_when(Species == "virginica" ~ 1,
                               TRUE ~ 0)
  )
# Convert the 0/1 indicators to logical
df$setosa <- as.logical(df$setosa)
df$versicolor <- as.logical(df$versicolor)
df$virginica <- as.logical(df$virginica)
Upvotes: 2
Views: 1759
Reputation: 269451
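Use any of these (the pipe %>% comes from dplyr or magrittr):
library(dplyr)
# 1. Compare each level with Species; columns are named after the levels
iris %>% cbind(sapply(levels(.$Species), `==`, .$Species))
# 2. Design matrix without intercept; columns are named Speciessetosa, Speciesversicolor, Speciesvirginica
iris %>% cbind(model.matrix(~ Species + 0, .) == 1)
# 3. Outer comparison; the names of the second argument become the column names
iris %>% cbind(outer(.$Species, setNames(levels(.$Species), levels(.$Species)), "=="))
# 4. Fill a 0/1 indicator matrix by (row, level code) index
expand_factor <- function(f) {
  m <- matrix(0, length(f), nlevels(f), dimnames = list(NULL, levels(f)))
  replace(m, cbind(seq_along(f), f), 1)
}
iris %>% cbind(expand_factor(.$Species) == 1)
# 5. class.ind() from nnet builds the same kind of indicator matrix
library(nnet)
iris %>% cbind(class.ind(.$Species) == 1)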
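For comparison, the same comparison-per-level idea can be written without a pipe. This is only a minimal base R sketch, not one of the variants listed above; the names df and lvls are chosen here for illustration.

```r
# Work on a copy so the built-in dataset stays untouched.
df <- iris

# One logical column per level, named after the level.
lvls <- levels(df$Species)
df[lvls] <- lapply(lvls, function(l) df$Species == l)

# Inspect the three new columns.
str(df[lvls])
```

Assigning a list to df[lvls] creates the columns in one step, so no conversion with as.logical() is needed afterwards.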
Upvotes: 5
Reputation:
Here is another tidyverse way. I find it tedious and would not use it for anything as simple as your example, but it can be useful for more complex applications. For example, if you are one-hot encoding several variables, it can be convenient to keep each variable's full encoding in a single column. You can then extract it without having to select a different number of columns for each variable.
This makes use of the ability to store a list() inside a tibble, and then unnests it into columns.
library(purrr)
library(dplyr)
library(tidyr)
iris %>%
  mutate(species_one_hot = map(Species, ~ set_names(levels(Species) == .x, levels(Species)))) %>%
  unnest_wider(species_one_hot)
Here is how you could stop one step earlier and just store the encoding for later use.
iris2 <- iris %>%
  mutate(species_one_hot = map(Species, ~ set_names(levels(Species) == .x, levels(Species))))
# now you can grab a single column and get the full encoding
bind_rows(iris2$species_one_hot)
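To illustrate the point about several variables, here is a rough sketch that applies the same list-column idea to two factors. The helper one_hot_list() and the second factor size are hypothetical and invented purely for this example; they are not part of the answer above.

```r
library(dplyr)
library(purrr)

# Hypothetical helper: encode a factor as a list of named logical vectors,
# one vector per row, with one element per level.
one_hot_list <- function(f) {
  lvls <- levels(f)
  map(as.character(f), ~ set_names(lvls == .x, lvls))
}

# `size` is a made-up second factor, added only for illustration.
iris3 <- iris %>%
  mutate(
    size = factor(if_else(Sepal.Length > 5.8, "large", "small")),
    species_one_hot = one_hot_list(Species),
    size_one_hot = one_hot_list(size)
  )

# Each encoding comes out of a single column, however many levels it has.
species_wide <- bind_rows(iris3$species_one_hot)
size_wide <- bind_rows(iris3$size_one_hot)
```

Here species_wide has three columns and size_wide has two, yet both are retrieved the same way, by naming one list-column.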
Upvotes: 1
Reputation: 39595
Try creating a logical variable directly for Species, along with a copy of it, and then reshape to wide format using tidyverse functions. You will also need an id variable for your rows. Here is the code:
library(dplyr)
library(tidyr)
# Data
data(iris)
# Code
df <- iris %>%
  mutate(id = row_number(), Species2 = Species) %>%  # row id plus a copy of Species to spread
  select(c(id, Species, Species2)) %>%
  mutate(Value = TRUE) %>%
  pivot_wider(names_from = Species2, values_from = Value, values_fill = FALSE) %>%
  select(-id)
Output:
# A tibble: 150 x 4
   Species setosa versicolor virginica
   <fct>   <lgl>  <lgl>      <lgl>
 1 setosa  TRUE   FALSE      FALSE
 2 setosa  TRUE   FALSE      FALSE
 3 setosa  TRUE   FALSE      FALSE
 4 setosa  TRUE   FALSE      FALSE
 5 setosa  TRUE   FALSE      FALSE
 6 setosa  TRUE   FALSE      FALSE
 7 setosa  TRUE   FALSE      FALSE
 8 setosa  TRUE   FALSE      FALSE
 9 setosa  TRUE   FALSE      FALSE
10 setosa  TRUE   FALSE      FALSE
# ... with 140 more rows
Upvotes: 1