Reputation: 69
I want to create a new column (MATCH) based on a string match between two existing columns. For example:
st_add | aa_add | MATCH |
---|---|---|
jai maa durga society | jai maa durga colony | MATCH |
elph road highway 1 | road highway 2 elph | MATCH |
srinivan colony parel ist | srinivan bus depot | NOT MATCH |
If three or more words match between column 1 and column 2, then column 3 (MATCH) should show "MATCH". If fewer than three words match, or there is no match at all (as in row 3), the result should be "NOT MATCH".
How can I do this in R?
Upvotes: 2
Views: 1268
Reputation: 1474
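You can try stringdist. You can set a string distance threshold for a match (below, a distance of less than 12 counts as a match). Note that this measures character-level distance between the two strings rather than counting shared words. It also offers multiple methods for computing distance. Thanks Ronak for the dataset code.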
library(stringdist)
df$match <- ifelse(stringdist(df$st_add, df$aa_add) < 12, "MATCH", "NOT MATCH")
df
# st_add aa_add match
# 1 jai maa durga society jai maa durga colony MATCH
# 2 elph road highway 1 road highway 2 elph MATCH
# 3 srinivan colony parel ist srinivan bus depot NOT MATCH
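Since the threshold of 12 depends on the data and on the distance method, it may help to look at the raw distances before picking a cutoff. The following is only a sketch: it rebuilds the example data frame so it runs on its own, and the column names d_osa and d_jw are made up for illustration. It uses the method argument of stringdist() with "osa" (the default) and "jw".

```r
library(stringdist)

# Example data, rebuilt here so the snippet is self-contained
df <- data.frame(
  st_add = c("jai maa durga society", "elph road highway 1",
             "srinivan colony parel ist"),
  aa_add = c("jai maa durga colony", "road highway 2 elph",
             "srinivan bus depot"),
  stringsAsFactors = FALSE
)

# Raw edit distance per row pair (method "osa" is the default)
d_osa <- stringdist(df$st_add, df$aa_add, method = "osa")

# method = "jw" gives a distance scaled to [0, 1]; with the default p = 0
# this is the Jaro distance (the Winkler adjustment applies only when p > 0)
d_jw <- stringdist(df$st_add, df$aa_add, method = "jw")

# Inspect both distances side by side (d_osa and d_jw are illustrative names)
data.frame(df, d_osa, d_jw)
```

A scaled distance such as "jw" can be easier to threshold across strings of different lengths, but any cutoff is still an assumption to check against your own data.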
Upvotes: 3
Reputation: 388817
You can split st_add and aa_add into words, count the number of common words in each row, and assign 'MATCH' if that count is greater than or equal to 3.
df$MATCH <- ifelse(mapply(function(x, y) length(intersect(x, y)),
                          strsplit(df$st_add, '\\s+'),
                          strsplit(df$aa_add, '\\s+')) >= 3, 'MATCH', 'NOT MATCH')
df
# st_add aa_add MATCH
#1 jai maa durga society jai maa durga colony MATCH
#2 elph road highway 1 road highway 2 elph MATCH
#3 srinivan colony parel ist srinivan bus depot NOT MATCH
data
df <- structure(list(st_add = c("jai maa durga society", "elph road highway 1",
"srinivan colony parel ist"), aa_add = c("jai maa durga colony",
"road highway 2 elph", "srinivan bus depot")), row.names = c(NA,
-3L), class = "data.frame")
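If you need this in more than one place, the same idea can be wrapped in a small helper. This is a sketch only: count_common_words and min_words are hypothetical names, and the lower-casing and trimming are assumptions about how the addresses should be compared (the code above compares words exactly as written).

```r
# Example data, rebuilt here so the snippet is self-contained
df <- data.frame(
  st_add = c("jai maa durga society", "elph road highway 1",
             "srinivan colony parel ist"),
  aa_add = c("jai maa durga colony", "road highway 2 elph",
             "srinivan bus depot"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of distinct words shared by a[i] and b[i]
count_common_words <- function(a, b) {
  # as.character() guards against factor columns;
  # tolower() and trimws() are assumptions (case-insensitive, no stray spaces)
  words_a <- strsplit(tolower(trimws(as.character(a))), "\\s+")
  words_b <- strsplit(tolower(trimws(as.character(b))), "\\s+")
  # intersect() returns unique common elements, so repeats count once
  mapply(function(x, y) length(intersect(x, y)),
         words_a, words_b, USE.NAMES = FALSE)
}

min_words <- 3  # threshold taken from the question
n_common <- count_common_words(df$st_add, df$aa_add)
df$MATCH <- ifelse(n_common >= min_words, "MATCH", "NOT MATCH")
df
```

Keeping the count in its own vector makes it easy to try a different threshold without splitting the strings again.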
Upvotes: 2