Reputation: 161

dataframe in R: lookup multiple strings in a cell that are separated by a comma

Suppose each row in a column(letter_strings) has a variable number of strings separated by a comma. For example:

letter_strings

abc, def, ghi, jkl 
mno, pqr
stu, vw, xyz

I would like to lookup each string in a dataframe:

letter_strings code

abc YES
def NO
ghi MAYBE
jkl SURE
mno PERHAPS
pqr ALWAYS
stu NEVER
vw NOGO
xyz ABSENT

and obtain the following corresponding rows, in an additional column

YES, NO, MAYBE, SURE
PERHAPS, ALWAYS
NEVER, NOGO, ABSENT

Is this possible in R, I don't really know how to tackle this problem...

Thanks in advance!

Upvotes: 1

Answers (2)

G. Grothendieck

Reputation: 269481

1) gusbfn

gsubfn is like gsub except it looks up the matches to the regular expression (here defined to be "\\w+", i.e. a sequence of word characters) in the names of the lookup list, lookup, replacing the names in the target string with their values in lookup.

library(gsubfn)

lookup <- with(DF2, as.list(setNames(code, letter_strings)))
transform(DF1, codes = gsubfn("\\w+", lookup, letter_strings))

giving:

      letter_strings                codes
1 abc, def, ghi, jkl YES, NO, MAYBE, SURE
2           mno, pqr      PERHAPS, ALWAYS
3       stu, vw, xyz  NEVER, NOGO, ABSENT

2) dplyr/tidyr Convert DF1 to long form, join it with DF2 and then reshape it back to the original form:

library(dplyr)
library(tidyr)

DF1 %>% 
    mutate(id = 1:n()) %>% 
    separate_rows(letter_strings) %>% 
    left_join(DF2) %>% 
    group_by(id) %>% 
    summarise(letter_string = toString(letter_strings), codes = toString(code)) %>% 
    ungroup %>%
    select(-id)

giving:

Joining, by = "letter_strings"
# A tibble: 3 x 2
       letter_string                codes
               <chr>                <chr>
1 abc, def, ghi, jkl YES, NO, MAYBE, SURE
2           mno, pqr      PERHAPS, ALWAYS
3       stu, vw, xyz  NEVER, NOGO, ABSENT

3) strsplit/merge/aggregate Use strsplit to split the strings in DF1 and stack that into long form st. Then merge that with DF2 and aggregate back to the original form. No packages are used.

s <- strsplit(DF1$letter_strings, ", ")
st <- stack(setNames(s, seq_along(s)))
m <- merge(st, DF2, by = 1, all.x = TRUE, all.y = FALSE)
aggregate(. ~ ind, m, toString)[-1]

giving:

              values                 code
1 abc, def, ghi, jkl YES, NO, MAYBE, SURE
2           mno, pqr      PERHAPS, ALWAYS
3       stu, vw, xyz  NEVER, NOGO, ABSENT

3a) magrittr This could be expressed using magrittr:

library(magrittr)

DF1 %>%
    "$"("letter_strings") %>%
    strsplit(", ") %>%
    setNames(seq_along(.)) %>%
    stack %>%
    merge(DF2, by = 1, all.x = TRUE, all.y = FALSE) %>%
    aggregate(. ~ ind, ., toString) %>%
    "["(-1)


s <- stack(setNames(strsplit(DF1$letter_strings, ", "), 1:nrow(DF1)))
m <- merge(s, DF2, by = 1, all.x = TRUE, all.y = FALSE)
aggregate(. ~ ind, m, toString)[-1]

4) data.table Note that in a comment below that @Uwe has provided a data.table version of the approach in (2) and (3) which is to convert to long form, join and convert back.

Note: Inputs in reproducible form:

Lines1 <- "
letter_strings
abc, def, ghi, jkl
mno, pqr
stu, vw, xyz"
DF1 <- read.table(text = Lines1, header = TRUE, as.is = TRUE, sep = ";")

Lines2 <- "
letter_strings code
abc YES
def NO
ghi MAYBE
jkl SURE
mno PERHAPS
pqr ALWAYS
stu NEVER
vw NOGO
xyz ABSENT"
DF2 <- read.table(text = Lines2, header = TRUE, as.is = TRUE)

Upvotes: 3

G5W

Reputation: 37641

If there are not too many letter strings, you can do this with gsub in a loop.

Temp = letter_strings
for(i in 1:nrow(df)) {
    Temp = gsub(df$letter_strings[i], df$code[i], Temp) }
Temp
[1] "YES, NO, MAYBE, SURE" "PERHAPS, ALWAYS"      "NEVER, NOGO, ABSENT"

Upvotes: 0

dataframe in R: lookup multiple strings in a cell that are separated by a comma

Answers (2)

Related Questions