Tanvi

Reputation: 69

How to match strings between two columns in R?

I want to create a new column (MATCH) based on string matches between two existing columns. For example:

st_add                     aa_add                MATCH
jai maa durga society      jai maa durga colony  MATCH
elph road highway 1        road highway 2 elph   MATCH
srinivan colony parel ist  srinivan bus depot    NOT MATCH

If three or more words match between column 1 and column 2, then column 3 (MATCH) should show "MATCH". If fewer than three words match, or there is no match at all (as in row 3), the result should be "NOT MATCH".

How can I do this in R?

Upvotes: 2

Views: 1268

Answers (2)

Zaw

Reputation: 1474

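You can try the stringdist package. It computes a character-level distance between the two strings, so you can set a distance threshold below which a pair counts as a match. Note that this measures overall string similarity rather than counting matching words, and the threshold of 12 used below would need tuning for other data. The package also offers multiple methods for computing the distance. Thanks to Ronak for the dataset code.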

library(stringdist)

df$match <- ifelse(stringdist(df$st_add, df$aa_add) < 12, "MATCH", "NOT MATCH")
df

#                      st_add               aa_add     match
# 1     jai maa durga society jai maa durga colony     MATCH
# 2       elph road highway 1  road highway 2 elph     MATCH
# 3 srinivan colony parel ist   srinivan bus depot NOT MATCH
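
To help pick a method and a threshold, it can be useful to look at the distances themselves before cutting them into MATCH / NOT MATCH. The following is only a sketch on the sample addresses: the variable names d_osa, d_jw and d_cos are made up for illustration, and which method suits real address data is an assumption that would need checking.

```r
library(stringdist)

st_add <- c("jai maa durga society", "elph road highway 1",
            "srinivan colony parel ist")
aa_add <- c("jai maa durga colony", "road highway 2 elph",
            "srinivan bus depot")

# Default method ("osa"): raw edit distance, grows with string length
d_osa <- stringdist(st_add, aa_add, method = "osa")

# Jaro distance (method "jw" with the default p = 0), between 0 and 1
d_jw <- stringdist(st_add, aa_add, method = "jw")

# Cosine distance between character bigram profiles, between 0 and 1;
# q-gram methods are less sensitive to word order than edit distances
d_cos <- stringdist(st_add, aa_add, method = "cosine", q = 2)

# Compare the three side by side before choosing a threshold
data.frame(st_add, aa_add, d_osa, d_jw, d_cos)
```

The distances that are scaled to 0..1 are easier to threshold consistently when the addresses vary a lot in length.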

Upvotes: 3

Ronak Shah

Reputation: 388817

You can split st_add and aa_add into words, count the number of words they have in common, and assign 'MATCH' when that count is greater than or equal to 3.

df$MATCH <- ifelse(mapply(function(x, y) length(intersect(x, y)), 
                strsplit(df$st_add, '\\s+'),
                strsplit(df$aa_add, '\\s+')) >= 3, 'MATCH', 'NOT MATCH')
df

#                     st_add               aa_add     MATCH
#1     jai maa durga society jai maa durga colony     MATCH
#2       elph road highway 1  road highway 2 elph     MATCH
#3 srinivan colony parel ist   srinivan bus depot NOT MATCH

data

df <- structure(list(st_add = c("jai maa durga society", "elph road highway 1", 
"srinivan colony parel ist"), aa_add = c("jai maa durga colony", 
"road highway 2 elph", "srinivan bus depot")), row.names = c(NA, 
-3L), class = "data.frame")
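
If you need this in more than one place, the same word-intersection idea can be wrapped in a small helper. This is a minimal sketch, not part of the original answer: count_common_words and the n_common column are hypothetical names, and lowercasing and trimming the input are assumptions about how the addresses should be compared.

```r
# Hypothetical helper: number of distinct words shared by each pair of strings
count_common_words <- function(a, b) {
  # Lowercase and trim so "Road" and "road " count as the same word
  words_a <- strsplit(tolower(trimws(a)), "\\s+")
  words_b <- strsplit(tolower(trimws(b)), "\\s+")
  # intersect() drops duplicates, so each distinct shared word counts once
  mapply(function(x, y) length(intersect(x, y)), words_a, words_b)
}

df <- data.frame(
  st_add = c("jai maa durga society", "elph road highway 1",
             "srinivan colony parel ist"),
  aa_add = c("jai maa durga colony", "road highway 2 elph",
             "srinivan bus depot"),
  stringsAsFactors = FALSE
)

# Keep the count as its own column so the threshold is easy to change
df$n_common <- count_common_words(df$st_add, df$aa_add)
df$MATCH <- ifelse(df$n_common >= 3, "MATCH", "NOT MATCH")
df
```

Keeping the raw count next to the label makes it easier to see why a row was classified the way it was.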

Upvotes: 2
