Reputation: 35
I have a dataframe structured as:
ID X1 X2 X3 X4 X5
1 1 grn gerp hrn asn bln
2 2 asn bln hgv mpl zwl
3 3 zwl mpl lwd <NA> <NA>
4 4 bln asn hrn gerp grn
5 5 lwd mpl zwl <NA> <NA>
I'm currently using an inefficient method to check whether each row contains the words in the following wordlist:
wordlist <- c("asn", "bln", "gerp", "grn", "hgv", "hrn", "lwd", "mpl", "zwl")
With the method below, I get TRUE or FALSE for each row ID, depending on whether the row contains the word:
newDF <- as.data.frame(DF[,1])
newDF[, wordlist] <- NA
newDF[2] <- apply(DF, 1, function(r) any(r %in% as.character(wordlist[1])))
newDF[3] <- apply(DF, 1, function(r) any(r %in% as.character(wordlist[2])))
newDF[4] <- apply(DF, 1, function(r) any(r %in% as.character(wordlist[3])))
newDF[5] <- apply(DF, 1, function(r) any(r %in% as.character(wordlist[4])))
newDF[6] <- apply(DF, 1, function(r) any(r %in% as.character(wordlist[5])))
newDF[7] <- apply(DF, 1, function(r) any(r %in% as.character(wordlist[6])))
newDF[8] <- apply(DF, 1, function(r) any(r %in% as.character(wordlist[7])))
newDF[9] <- apply(DF, 1, function(r) any(r %in% as.character(wordlist[8])))
newDF[10] <- apply(DF, 1, function(r) any(r %in% as.character(wordlist[9])))
Resulting in the following dataframe:
DF[, 1] asn bln gerp grn hgv hrn lwd mpl zwl
1 1 TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE
2 2 FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
3 3 FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
4 4 TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE
5 5 FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
As you can see, this method is quite inefficient, especially since I have to apply it to a much bigger DF and a wordlist of 400 words.
Main question: (EDIT: SOLVED)
Sub question:
The dataframe to try:
> dput(DF)
structure(list(ID = 1:5, X1 = structure(c(3L, 1L, 5L, 2L, 4L), .Label = c("asn ", "bln", "grn", "lwd", "zwl"), class = "factor"), X2 = structure(c(3L, 2L, 4L, 1L, 4L), .Label = c("asn", "bln", "gerp", "mpl"), class = "factor"), X3 = structure(c(2L, 1L, 3L, 2L, 4L), .Label = c("hgv", "hrn",
"lwd", "zwl"), class = "factor"), X4 = structure(c(1L, 3L,
NA, 2L, NA), .Label = c("asn", "gerp", "mpl"), class = "factor"), X5 = structure(c(1L, 3L, NA, 2L, NA), .Label = c("bln", "grn",
"zwl"), class = "factor")), class = "data.frame", row.names = c(NA, -5L))
Thanks in advance!
Upvotes: 1
Views: 83
Reputation: 50678
Here is a base R option using match
t(apply(DF, 1, function(x) sapply(wordlist, function(y)
!is.na(match(y, x)))))
# asn bln gerp grn hgv hrn lwd mpl zwl
#[1,] TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE
#[2,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
#[3,] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
#[4,] TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE
#[5,] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
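Note that row 2 reports FALSE for asn even though the printed DF shows asn in X1. The dput output in the question suggests why: the X1 level is stored as "asn " with a trailing space, so an exact match fails. If that space is unintended, trimming whitespace before matching may help. The following is a minimal sketch on a reduced, hypothetical two-row copy of the data; `has_word` and `clean` are illustrative names that are not part of the original question or answer.

```r
# Reduced, hypothetical copy of the data; the trailing space in "asn "
# mirrors the factor level shown in the question's dput output
wordlist <- c("asn", "bln", "grn")
DF <- data.frame(ID = 1:2,
                 X1 = factor(c("grn", "asn ")),
                 X2 = factor(c("gerp", "bln")))

# Same matching logic as above, wrapped in a helper (illustrative name)
has_word <- function(d) {
  t(apply(d, 1, function(x) sapply(wordlist, function(y) !is.na(match(y, x)))))
}

before <- has_word(DF)

# Convert every non-ID column to character and strip surrounding whitespace
clean <- DF
clean[-1] <- lapply(clean[-1], function(col) trimws(as.character(col)))

after <- has_word(clean)

before[2, "asn"]  # exact match fails because of the trailing space
after[2, "asn"]   # matches once the space has been trimmed
```

Whether trimming is appropriate depends on the real data; if "asn " is meant to be a distinct value, it should be left alone.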
Or, to get the name of the column in DF that holds the matched word:
t(apply(DF, 1, function(x) sapply(wordlist, function(y)
ifelse(is.na(match(y, x)), NA, paste0("X", match(y, x) - 1)))))
# asn bln gerp grn hgv hrn lwd mpl zwl
#[1,] "X4" "X5" "X2" "X1" NA "X3" NA NA NA
#[2,] NA "X2" NA NA "X3" NA NA "X4" "X5"
#[3,] NA NA NA NA NA NA "X3" "X2" "X1"
#[4,] "X2" "X1" "X4" "X5" NA "X3" NA NA NA
#[5,] NA NA NA NA NA NA "X1" "X2" "X3"
Or, to get the index of the column in DF that holds the matched word:
t(apply(DF, 1, function(x) sapply(wordlist, function(y) match(y, x))))
# asn bln gerp grn hgv hrn lwd mpl zwl
#[1,] 5 6 3 2 NA 4 NA NA NA
#[2,] NA 3 NA NA 4 NA NA 5 6
#[3,] NA NA NA NA NA NA 4 3 2
#[4,] 3 2 5 6 NA 4 NA NA NA
#[5,] NA NA NA NA NA NA 2 3 4
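For a much bigger DF and a 400-word list, it may be worth avoiding the row-wise apply. One possible alternative, sketched here rather than benchmarked, compares the whole character matrix against each word at once with rowSums. The data below is rebuilt from the question's dput output as character columns (an assumption made to keep the example self-contained), and `m` and `hits` are illustrative names.

```r
# Data rebuilt from the question's dput output, as character columns
# (note the trailing space in "asn " in row 2)
DF <- data.frame(
  ID = 1:5,
  X1 = c("grn", "asn ", "zwl", "bln", "lwd"),
  X2 = c("gerp", "bln", "mpl", "asn", "mpl"),
  X3 = c("hrn", "hgv", "lwd", "hrn", "zwl"),
  X4 = c("asn", "mpl", NA, "gerp", NA),
  X5 = c("bln", "zwl", NA, "grn", NA),
  stringsAsFactors = FALSE
)
wordlist <- c("asn", "bln", "gerp", "grn", "hgv", "hrn", "lwd", "mpl", "zwl")

# Character matrix of the word columns only (ID dropped)
m <- as.matrix(DF[-1])

# One logical column per word: TRUE if any cell in the row equals the word;
# na.rm = TRUE makes NA cells count as non-matches
hits <- sapply(wordlist, function(w) rowSums(m == w, na.rm = TRUE) > 0)

# Attach the IDs to get a data frame shaped like newDF in the question
result <- data.frame(ID = DF$ID, hits)
```

This loops over the words but handles all rows of each comparison in one vectorised step, which is usually the cheaper direction when the data frame has far more rows than the wordlist has entries.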
Upvotes: 1