Reputation: 9
I would like to create a function to create that checks the unique words in a single row going down the column and creating dummy columns based on this. For example:
ID Letters
1 A, B, C
2 C, D
3 A
4 B, D
5 Z
6 A
The expected outcome would be:
ID Letters Letter_A Letter_B Letter_C Letter_D Letter_Z
1 A, B, C 1 1 1 0 0
2 C, D 0 0 1 1 0
3 A 1 0 0 0 0
4 B, D 0 1 0 1 0
5 Z 0 0 0 0 1
6 A 1 0 0 0 0
I've found this block of code
uniq <- unique(unlist(strsplit(as.character(df$values),', ')))
m <- matrix(0, nrow(df), length(uniq), dimnames = list(NULL, paste0("Letter_", uniq)))
for (i in seq_along(df$values)) {
k <- match(df$values[i], uniq, 0)
m[i,k] <- 1
}
Where uniq will create a new hold each unique word separated by the comma and create a new column Letter_A etc. However, the forloop will only check the first letter in the column. So the current outcome looks like this where the other letters are not being changed to 1
ID Letters Letter_A Letter_B Letter_C Letter_D Letter_Z
1 A, B, C 1 0 0 0 0
2 C, D 0 0 1 0 0
3 A 1 0 0 0 0
4 B, D 0 1 0 0 0
5 Z 0 0 0 0 1
6 A 1 0 0 0 0
Upvotes: 1
Views: 115
Reputation: 25225
An option using stats::xtabs
and DF
is from Cole's solution:
l <- strsplit(DF$Letters, ", ")
tab <- data.frame(ID=rep(seq_along(l), lengths(l)), Letters=unlist(l), V=1L)
cbind(DF, as.data.frame.matrix(xtabs(V ~ ID + Letters, tab)))
output:
ID Letters A B C D Z
1 1 A, B, C 1 1 1 0 0
2 2 C, D 0 0 1 1 0
3 3 A 1 0 0 0 0
4 4 B, D 0 1 0 1 0
5 5 Z 0 0 0 0 1
6 6 A 1 0 0 0 0
Upvotes: 1
Reputation: 111
You can use the mtabulate
function from the qdapTools
library.
library(qdapTools)
library(dplyr)
x <- "
ID Letters
1 'A, B, C'
2 'C, D'
3 A
4 'B, D'
5 Z
6 A
"
df <- read.table(text = x, header = TRUE, stringsAsFactors = FALSE)
encoded_df <- cbind(df, mtabulate(strsplit(df$Letters, ", "))) %>%
rename_at(vars(!colnames(df)), ~paste0("Letter_", .))
This will apply the one hot encoding to the letters and then add the Letter_
prefix to all the new columns that are created.
Upvotes: 1
Reputation: 12713
Code:
library(data.table)
setDT(df)
dcast(data = df[, strsplit(Letters, split = ","), by = .(ID, Letters)][, V1 := trimws(V1)][],
formula = ID + Letters ~ V1,
fun.aggregate = length,
value.var = "V1")
# ID Letters A B C D Z
# 1: 1 A, B, C 1 1 1 0 0
# 2: 2 C, D 0 0 1 1 0
# 3: 3 A 1 0 0 0 0
# 4: 4 B, D 0 1 0 1 0
# 5: 5 Z 0 0 0 0 1
# 6: 6 A 1 0 0 0 0
Data:
df <- read.table(text='ID Letters
1 "A, B, C"
2 "C, D"
3 "A"
4 "B, D"
5 "Z"
6 "A"', header = TRUE, stringsAsFactors = FALSE)
Upvotes: 1
Reputation: 11255
Here's one way to do it:
DF = data.frame(ID = seq_len(6L),
Letters = c("A, B, C", "C, D", "A", "B, D", "Z", "A"))
spl_letters = strsplit(as.character(DF[["Letters"]]), ", ", fixed = TRUE)
uniq = unique(unlist(spl_letters), use.names = FALSE)
data.frame(DF,
setNames(data.frame(t(vapply(spl_letters, function(x) +(uniq %in% x), seq_along(uniq)))), paste0("Letter_", uniq))
)
ID Letters Letter_A Letter_B Letter_C Letter_D Letter_Z
1 1 A, B, C 1 1 1 0 0
2 2 C, D 0 0 1 1 0
3 3 A 1 0 0 0 0
4 4 B, D 0 1 0 1 0
5 5 Z 0 0 0 0 1
6 6 A 1 0 0 0 0
Basically, the for
loop was changed into vapply
and instead of unlist
, the original strsplit
result was kept in order to match what was uniq
.
Upvotes: 1