Reputation: 161
Suppose each row in a column(letter_strings) has a variable number of strings separated by a comma. For example:
letter_strings
abc, def, ghi, jkl
mno, pqr
stu, vw, xyz
I would like to lookup each string in a dataframe:
letter_strings code
abc YES
def NO
ghi MAYBE
jkl SURE
mno PERHAPS
pqr ALWAYS
stu NEVER
vw NOGO
xyz ABSENT
and obtain the following corresponding rows, in an additional column
YES, NO, MAYBE, SURE
PERHAPS, ALWAYS
NEVER, NOGO, ABSENT
Is this possible in R, I don't really know how to tackle this problem...
Thanks in advance!
W
Upvotes: 1
Views: 330
Reputation: 269481
1) gusbfn
gsubfn
is like gsub
except it looks up the matches to the regular expression (here defined to be "\\w+"
, i.e. a sequence of word characters) in the names of the lookup list, lookup
, replacing the names in the target string with their values in lookup
.
library(gsubfn)
lookup <- with(DF2, as.list(setNames(code, letter_strings)))
transform(DF1, codes = gsubfn("\\w+", lookup, letter_strings))
giving:
letter_strings codes
1 abc, def, ghi, jkl YES, NO, MAYBE, SURE
2 mno, pqr PERHAPS, ALWAYS
3 stu, vw, xyz NEVER, NOGO, ABSENT
2) dplyr/tidyr Convert DF1 to long form, join it with DF2 and then reshape it back to the original form:
library(dplyr)
library(tidyr)
DF1 %>%
mutate(id = 1:n()) %>%
separate_rows(letter_strings) %>%
left_join(DF2) %>%
group_by(id) %>%
summarise(letter_string = toString(letter_strings), codes = toString(code)) %>%
ungroup %>%
select(-id)
giving:
Joining, by = "letter_strings"
# A tibble: 3 x 2
letter_string codes
<chr> <chr>
1 abc, def, ghi, jkl YES, NO, MAYBE, SURE
2 mno, pqr PERHAPS, ALWAYS
3 stu, vw, xyz NEVER, NOGO, ABSENT
3) strsplit/merge/aggregate Use strsplit
to split the strings in DF1
and stack
that into long form st
. Then merge
that with DF2
and aggregate
back to the original form. No packages are used.
s <- strsplit(DF1$letter_strings, ", ")
st <- stack(setNames(s, seq_along(s)))
m <- merge(st, DF2, by = 1, all.x = TRUE, all.y = FALSE)
aggregate(. ~ ind, m, toString)[-1]
giving:
values code
1 abc, def, ghi, jkl YES, NO, MAYBE, SURE
2 mno, pqr PERHAPS, ALWAYS
3 stu, vw, xyz NEVER, NOGO, ABSENT
3a) magrittr This could be expressed using magrittr:
library(magrittr)
DF1 %>%
"$"("letter_strings") %>%
strsplit(", ") %>%
setNames(seq_along(.)) %>%
stack %>%
merge(DF2, by = 1, all.x = TRUE, all.y = FALSE) %>%
aggregate(. ~ ind, ., toString) %>%
"["(-1)
s <- stack(setNames(strsplit(DF1$letter_strings, ", "), 1:nrow(DF1)))
m <- merge(s, DF2, by = 1, all.x = TRUE, all.y = FALSE)
aggregate(. ~ ind, m, toString)[-1]
4) data.table Note that in a comment below that @Uwe has provided a data.table version of the approach in (2) and (3) which is to convert to long form, join and convert back.
Note: Inputs in reproducible form:
Lines1 <- "
letter_strings
abc, def, ghi, jkl
mno, pqr
stu, vw, xyz"
DF1 <- read.table(text = Lines1, header = TRUE, as.is = TRUE, sep = ";")
Lines2 <- "
letter_strings code
abc YES
def NO
ghi MAYBE
jkl SURE
mno PERHAPS
pqr ALWAYS
stu NEVER
vw NOGO
xyz ABSENT"
DF2 <- read.table(text = Lines2, header = TRUE, as.is = TRUE)
Upvotes: 3
Reputation: 37641
If there are not too many letter strings, you can do this with gsub
in a loop.
Temp = letter_strings
for(i in 1:nrow(df)) {
Temp = gsub(df$letter_strings[i], df$code[i], Temp) }
Temp
[1] "YES, NO, MAYBE, SURE" "PERHAPS, ALWAYS" "NEVER, NOGO, ABSENT"
Upvotes: 0