Ant
Ant

Reputation: 9

Split each Row in a column with Multiple Unique words into Multiple Columns

I would like to create a function to create that checks the unique words in a single row going down the column and creating dummy columns based on this. For example:

ID Letters
1  A, B, C
2  C, D
3  A
4  B, D
5  Z
6  A

The expected outcome would be:

ID Letters  Letter_A Letter_B Letter_C Letter_D Letter_Z
1  A, B, C   1          1        1        0       0
2  C, D      0          0        1        1       0
3  A         1          0        0        0       0
4  B, D      0          1        0        1       0
5  Z         0          0        0        0       1
6  A         1          0        0        0       0

I've found this block of code

uniq <- unique(unlist(strsplit(as.character(df$values),', ')))
m <- matrix(0, nrow(df), length(uniq), dimnames = list(NULL, paste0("Letter_", uniq)))

for (i in seq_along(df$values)) {
  k <- match(df$values[i], uniq, 0)
  m[i,k] <- 1
}

Where uniq will create a new hold each unique word separated by the comma and create a new column Letter_A etc. However, the forloop will only check the first letter in the column. So the current outcome looks like this where the other letters are not being changed to 1

ID Letters  Letter_A Letter_B Letter_C Letter_D Letter_Z
1  A, B, C   1          0        0        0       0
2  C, D      0          0        1        0       0
3  A         1          0        0        0       0
4  B, D      0          1        0        0       0
5  Z         0          0        0        0       1
6  A         1          0        0        0       0

Upvotes: 1

Views: 115

Answers (4)

chinsoon12
chinsoon12

Reputation: 25225

An option using stats::xtabs and DF is from Cole's solution:

l <- strsplit(DF$Letters, ", ") 
tab <- data.frame(ID=rep(seq_along(l), lengths(l)), Letters=unlist(l), V=1L)
cbind(DF, as.data.frame.matrix(xtabs(V ~ ID + Letters, tab)))

output:

  ID Letters A B C D Z
1  1 A, B, C 1 1 1 0 0
2  2    C, D 0 0 1 1 0
3  3       A 1 0 0 0 0
4  4    B, D 0 1 0 1 0
5  5       Z 0 0 0 0 1
6  6       A 1 0 0 0 0

Upvotes: 1

James
James

Reputation: 111

You can use the mtabulate function from the qdapTools library.

library(qdapTools)
library(dplyr)

x <- "
ID Letters
1  'A, B, C'
2  'C, D'
3  A
4  'B, D'
5  Z
6  A
"

df <- read.table(text = x, header = TRUE, stringsAsFactors = FALSE)

encoded_df <- cbind(df, mtabulate(strsplit(df$Letters, ", "))) %>% 
              rename_at(vars(!colnames(df)), ~paste0("Letter_", .))

This will apply the one hot encoding to the letters and then add the Letter_ prefix to all the new columns that are created.

Upvotes: 1

Sathish
Sathish

Reputation: 12713

Code:

library(data.table)
setDT(df)
dcast(data = df[, strsplit(Letters, split = ","), by = .(ID, Letters)][, V1 := trimws(V1)][],
      formula = ID + Letters ~ V1, 
      fun.aggregate = length, 
      value.var = "V1")
#    ID Letters A B C D Z
# 1:  1 A, B, C 1 1 1 0 0
# 2:  2    C, D 0 0 1 1 0
# 3:  3       A 1 0 0 0 0
# 4:  4    B, D 0 1 0 1 0
# 5:  5       Z 0 0 0 0 1
# 6:  6       A 1 0 0 0 0

Data:

df <- read.table(text='ID Letters
1  "A, B, C"
                 2  "C, D"
                 3  "A"
                 4  "B, D"
                 5  "Z"
                 6  "A"', header = TRUE, stringsAsFactors = FALSE)

Upvotes: 1

Cole
Cole

Reputation: 11255

Here's one way to do it:

DF = data.frame(ID = seq_len(6L),
                Letters = c("A, B, C", "C, D", "A", "B, D", "Z", "A"))

spl_letters = strsplit(as.character(DF[["Letters"]]), ", ", fixed = TRUE)
uniq = unique(unlist(spl_letters), use.names = FALSE)

data.frame(DF,
           setNames(data.frame(t(vapply(spl_letters, function(x) +(uniq %in% x), seq_along(uniq)))), paste0("Letter_", uniq))
)

  ID Letters Letter_A Letter_B Letter_C Letter_D Letter_Z
1  1 A, B, C        1        1        1        0        0
2  2    C, D        0        0        1        1        0
3  3       A        1        0        0        0        0
4  4    B, D        0        1        0        1        0
5  5       Z        0        0        0        0        1
6  6       A        1        0        0        0        0

Basically, the for loop was changed into vapply and instead of unlist, the original strsplit result was kept in order to match what was uniq.

Upvotes: 1

Related Questions