Reputation: 106

Multiple String Matching in R

Consider A,B,C,D .... as words. I have two DFs.

df1:

ColA
A B
B C
C D
E F
G H
A M
M

df2:

ColB
A B C D X Y Z
C D M N F K L
S H A F R M T U

Operation: I want to search all element of df1 in df2 then append all the matching values in a new column OR may be create multiple rows.

Output 1:

ColB                    COlB
A B C D X Y Z           A,A B,B C,C D
C D M N F K L           C D,M
S H A F R M T U         A,A M

Output2:

ColB                   Output
A B C D X Y Z           A
A B C D X Y Z           A B
A B C D X Y Z           B C
A B C D X Y Z           C D
C D M N F K L           C D
C D M N F K L           M
S H A F R M T U         A
S H A F R M T U         A M

Upvotes: 0

Answers (2)

Ken Benoit

Reputation: 14902

I think this will do it, although it differs a bit from your expected answer, which I think is wrong.

First set up the input data frames:

# set up the data
df1 <- data.frame(ColA = c("A B", 
                           "B C", 
                           "C D",
                           "E F",
                           "G H",
                           "A M",
                           "M"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ColB = c("A B C D X Y Z",
                           "C D M N F K L",
                           "S H A F R M T"),
                  stringsAsFactors = FALSE)

Next we will form all the pairwise combinations of the things to search with the things to be searched:

# create a vector of patterns and items to search
intermediate <- as.vector(outer(df2$ColB, df1$ColA, paste, sep = "|"))
# split it into a list
intermediate <- strsplit(intermediate, "|", fixed = TRUE)

Then we can create a function to match the elements for each row of this full combination dataset The core is the foundMatch which returns a logical indicating whether all elements in ColA were present in ColB. In your examples, order does not matter, so here we split the elements and look for all of the first to be in the second.

# set up the output data.frame
Output2 <- data.frame(do.call(rbind, intermediate))
names(Output2) <- c("ColB", "Output")

# here is the core, which does the element matching
foundMatch <- apply(Output2, 1, function(x) {
    tokens <- strsplit(x, " ", fixed = TRUE)
    all(tokens[[2]] %in% tokens[[1]])
})
# filter out the ones with the match
Output2 <- Output2[foundMatch, ]

Output2
##             ColB Output
## 1  A B C D X Y Z    A B
## 2  C D M N F K L    A B
## 3  S H A F R M T    A B
## 10 A B C D X Y Z    E F
## 14 C D M N F K L    G H
## 20 C D M N F K L      M
## 21 S H A F R M T      M

Not exactly what you have above but I think it's correct.

Upvotes: 1

Roman

Reputation: 17648

It is not obvious for me how your data.frames df1 and df2 are built. But you can try to vectorise your data and match both sets.

d1 <- sort(as.character(unlist(df1)))
d2 <- sort(as.character(unlist(df2)))
# get the intersection/difference without duplicates
intersect(d1,d2)
setdiff(d1,d2)
# get all values matching with the first or with the second dataset, respectively 
d1[ d1 %in% d2 ]
d2[ d2 %in% d1 ]