Louis GRIMALDI

Reputation: 101

One-hot encoding an R list of characters

I have the following R data frame:

id    color
001   blue
001   yellow
001   red
002   blue
003   blue
003   yellow

What's the general method to one-hot encode such a data frame into the following:

id    blue    yellow    red
001   1       1         1
002   1       0         0
003   1       1         0

Thank you very much.

Upvotes: 0

Views: 362

Answers (3)

Duck

Reputation: 39613

Try this. Add an indicator variable equal to 1 for every observation present in the data, then use pivot_wider() to reshape the values. Colors that an id does not have come out as NA, which you can replace with zero using replace(). Here is the code using tidyverse functions:

library(dplyr)
library(tidyr)
#Code
dfnew <- df %>% mutate(val=1) %>%
  pivot_wider(names_from = color,values_from=val) %>%
  replace(is.na(.),0)

Output:

# A tibble: 3 x 4
     id  blue yellow   red
  <int> <dbl>  <dbl> <dbl>
1     1     1      1     1
2     2     1      0     0
3     3     1      1     0

Some data used:

#Data
df <- structure(list(id = c(1L, 1L, 1L, 2L, 3L, 3L), color = c("blue", 
"yellow", "red", "blue", "blue", "yellow")), class = "data.frame", row.names = c(NA,-6L))
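
One caveat with the indicator approach: if the same id/color pair occurs more than once, pivot_wider() warns that the values are not uniquely identified and returns list-columns. A minimal sketch of one way to guard against that, assuming duplicates should simply count as "present" (df_dup and onehot are hypothetical names, and the extra row is made up for illustration):

```r
library(dplyr)
library(tidyr)

# Hypothetical data: the (1, "blue") pair appears twice
df_dup <- data.frame(
  id = c(1L, 1L, 1L, 2L, 3L, 3L, 1L),
  color = c("blue", "yellow", "red", "blue", "blue", "yellow", "blue")
)

onehot <- df_dup %>%
  distinct(id, color) %>%   # keep one row per id/color pair
  mutate(val = 1) %>%       # indicator: this color is present for this id
  pivot_wider(names_from = color, values_from = val, values_fill = 0)
onehot
```

Using values_fill = 0 here does the same job as the replace() step above; it should work with tidyr 1.1.0 or later.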

Upvotes: 1

s_baldur

Reputation: 33548

With data.table:

library(data.table)
dcast(setDT(df), id ~ color, fun.aggregate = length)

#     id blue red yellow
# 1: 001    1   1      1
# 2: 002    1   0      0
# 3: 003    1   0      1

Same logic with tidyr:

library(tidyr)
pivot_wider(df, names_from=color, values_from=color, values_fn=length, values_fill=0)

#   id     blue yellow   red
#   <chr> <int>  <int> <int>
# 1 001       1      1     1
# 2 002       1      0     0
# 3 003       1      1     0

Base R:

out <- as.data.frame.matrix(pmin(with(df, table(id, color)), 1))
out$id <- rownames(out)
out
#     blue red yellow  id
# 001    1   1      1 001
# 002    1   0      0 002
# 003    1   0      1 003

Reproducible data

df <- data.frame(
  id = c("001", "001", "001", "002", "003", "003"), 
  color = c("blue", "yellow", "red", "blue", "blue", "yellow")
)

Upvotes: 1

Adam Sampson

Reputation: 2021

There are many ways to do this in R, depending on which packages you are using. Most modeling packages, such as caret and tidymodels, have functions to do this for you.
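
As a rough illustration of the modeling-style route, here is a minimal sketch that uses base R's model.matrix() in place of caret or tidymodels (a swapped-in technique, not those packages' own functions). The names mm, counts and onehot are made up for the example, and the data is the same as in the question.

```r
# Example data, same shape as in the question
df <- data.frame(
  id = c("001", "001", "001", "002", "003", "003"),
  color = c("blue", "yellow", "red", "blue", "blue", "yellow")
)

# One indicator column per color, one row per input row
# ("- 1" drops the intercept so every color gets its own column)
mm <- model.matrix(~ color - 1, data = df)
colnames(mm) <- sub("^color", "", colnames(mm))

# Collapse to one row per id by summing, then clamp the counts to 0/1
counts <- rowsum(mm, group = df$id)
onehot <- data.frame(id = rownames(counts), (counts > 0) * 1, row.names = NULL)
onehot
```

Because model.matrix() encodes each input row separately, the rowsum() step is what turns the result into one row per id.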

However, if you aren't using a modeling package, the tidyverse has an easy way to do this.

library(dplyr)
library(tidyr)

df <- tribble(
  ~id,    ~color,
  '001',   'blue',
  '001',   'yellow',
  '001',   'red',
  '002',   'blue',
  '003',   'blue',
  '003',   'yellow')

df_onehot <- df %>%
  mutate(value = 1) %>%
  pivot_wider(names_from = color,values_from = value,values_fill = 0)
df_onehot
# A tibble: 3 x 4
#    id     blue yellow   red
#   <chr> <dbl>  <dbl> <dbl>
# 1 001       1      1     1
# 2 002       1      0     0
# 3 003       1      1     0

Upvotes: 1
