Reputation: 101
I have the following R data frame:
id color
001 blue
001 yellow
001 red
002 blue
003 blue
003 yellow
What's the general method to one-hot encode such a data frame into the following:
id blue yellow red
001 1 1 1
002 1 0 0
003 1 1 0
Thank you very much.
Upvotes: 0
Views: 362
Reputation: 39613
Try this. Create an indicator variable equal to 1 for every observation present in the data, then use pivot_wider() to reshape it into one column per color. Colors that an id does not have come out as NA, which you can replace with zero using replace(). Here is the code using tidyverse functions:
library(dplyr)
library(tidyr)
#Code
dfnew <- df %>%
  mutate(val = 1) %>%
  pivot_wider(names_from = color, values_from = val) %>%
  replace(is.na(.), 0)
Output:
# A tibble: 3 x 4
id blue yellow red
<int> <dbl> <dbl> <dbl>
1 1 1 1 1
2 2 1 0 0
3 3 1 1 0
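One caveat with this approach: if the same id/color pair appears more than once, pivot_wider() can no longer identify a single value per cell and warns while returning list-columns. The following is a minimal sketch of one way around that, assuming duplicates should simply count as "present". The data frame df_dup and the result name onehot_dup are hypothetical and not part of the original data.

```r
library(dplyr)
library(tidyr)

# Hypothetical input: the pair (1, "blue") appears twice
df_dup <- data.frame(
  id = c(1L, 1L, 1L, 2L),
  color = c("blue", "blue", "red", "blue")
)

onehot_dup <- df_dup %>%
  distinct(id, color) %>%   # keep one row per id/color pair
  mutate(val = 1) %>%       # indicator for "this pair is present"
  pivot_wider(names_from = color, values_from = val, values_fill = 0)
```

Dropping duplicates before reshaping keeps every cell at 0 or 1, which is what one-hot encoding expects.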
Some data used:
#Data
df <- structure(list(id = c(1L, 1L, 1L, 2L, 3L, 3L), color = c("blue",
"yellow", "red", "blue", "blue", "yellow")), class = "data.frame", row.names = c(NA,-6L))
Upvotes: 1
Reputation: 33548
With data.table:
library(data.table)
dcast(setDT(df), id ~ color, fun.aggregate = length)
# id blue red yellow
# 1: 001 1 1 1
# 2: 002 1 0 0
# 3: 003 1 0 1
Same logic with tidyr:
library(tidyr)
pivot_wider(df, names_from=color, values_from=color, values_fn=length, values_fill=0)
# id blue yellow red
# <chr> <int> <int> <int>
# 1 001 1 1 1
# 2 002 1 0 0
# 3 003 1 1 0
Base R:
out <- as.data.frame.matrix(pmin(with(df, table(id, color)), 1))
out$id <- rownames(out)
out
# blue red yellow id
# 001 1 1 1 001
# 002 1 0 0 002
# 003 1 0 1 003
Reproducible data
df <- data.frame(
id = c("001", "001", "001", "002", "003", "003"),
color = c("blue", "yellow", "red", "blue", "blue", "yellow")
)
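The base R result above keeps the ids as row names and puts the id column last. If a layout closer to the one in the question is wanted (id first, no row names), a possible variation is sketched below. The names counts and onehot are illustrative only, and the columns still come out in alphabetical order because table() sorts its levels.

```r
df <- data.frame(
  id = c("001", "001", "001", "002", "003", "003"),
  color = c("blue", "yellow", "red", "blue", "blue", "yellow"),
  stringsAsFactors = FALSE
)

# Wide table of counts, with the ids as row names
counts <- as.data.frame.matrix(table(df$id, df$color))

# Turn counts into 0/1 indicators and move the ids into a regular column
onehot <- data.frame(
  id = rownames(counts),
  (counts > 0) * 1L,
  row.names = NULL,
  stringsAsFactors = FALSE
)
```

Testing counts > 0 rather than using the counts directly keeps the result at 0/1 even when an id/color pair is repeated.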
Upvotes: 1
Reputation: 2021
There are many ways to do this in R, and the best choice depends on which packages you are already using. Most modeling packages, such as caret and tidymodels, have functions that do this for you.
However, if you aren't using a modeling package, the tidyverse has an easy way to do this:
library(dplyr)
library(tidyr)
df <- tribble(
~id, ~color,
'001', 'blue',
'001', 'yellow',
'001', 'red',
'002', 'blue',
'003', 'blue',
'003', 'yellow')
df_onehot <- df %>%
  mutate(value = 1) %>%
  pivot_wider(names_from = color, values_from = value, values_fill = 0)
# A tibble: 3 x 4
# id blue yellow red
# <chr> <dbl> <dbl> <dbl>
# 1 001 1 1 1
# 2 002 1 0 0
# 3 003 1 1 0
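For comparison with the modeling-oriented route mentioned at the top of this answer, here is a rough sketch that uses base R's model.matrix() instead of caret or tidymodels (it is a stand-in, not the API of either package). model.matrix() produces one indicator row per input row, so the rows are then summed per id with rowsum() and capped at 1. The names ind and onehot_mm are illustrative only.

```r
df <- data.frame(
  id = c("001", "001", "001", "002", "003", "003"),
  color = c("blue", "yellow", "red", "blue", "blue", "yellow")
)

# One indicator column per color, one row per input row ("- 1" drops the intercept)
ind <- model.matrix(~ color - 1, data = df)
colnames(ind) <- sub("^color", "", colnames(ind))  # "colorblue" -> "blue"

# Collapse to one row per id, then cap at 1 in case a pair is repeated
onehot_mm <- pmin(rowsum(ind, group = df$id), 1)
```

The result is a numeric matrix with the ids as row names, which is often the shape modeling functions expect. Wrap it in as.data.frame() if a data frame is needed.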
Upvotes: 1