In R, how to check whether a word in one entry partially matches the word in another entry

Question

Specifically, I'd like to check whether a word from the entry in one column is an exact substring of the single-word entry in another column, with the added condition that the leftover (non-matching) part of that entry is no longer than four characters.

If I have a dataframe

df <- data.frame("name"=c("Denzel Washington","Andrew Garfield Junior","Ryan G Gosling"),"check"=c("Denzelboss","Garfield","Goslin"))

then I want the results to be

True, True, False

The first is true because one of the two words, "Denzel", is a substring of the other entry (and the leftover string "boss" is not longer than 4 characters). The second is true because one of the three words, "Garfield", is contained in the other entry; it's an exact match. The third is false because none of the three words satisfies the rule against the entry in the 'check' column ("Gosling" would return true).

All entries in the second column have only one word. I don't want to use a fuzzy matching algorithm, because the word in the entry (like "Denzel") should be an exact substring of the other entry ("Denzelboss"). But I also don't want to return true when the entry is "DenzelJohnson", where the leftover "Johnson" is too long.

thelatemail · Accepted Answer

Here I am running grepl inside an mapply loop, one row at a time. For each word in name, I check that it occurs in check and that the difference in length between the two (via nchar) is no more than the limit of 4:

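# make sure both columns are character, not factor
df[] <- lapply(df, as.character)
mapply(
  # sp: the words of one name; ck: the matching single-word check entry
  function(sp,ck) any(sapply(sp, function(x) grepl(x,ck) & (nchar(ck)-nchar(x) <= 4))),
  strsplit(df$name,"\\s+"),  # split each name on whitespace (note the doubled backslash)
  df$check
)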
#[1]  TRUE  TRUE FALSE
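
The same idea can be wrapped in a reusable function. This is only a sketch, not part of the original answer: the name partial_word_match and the max_extra argument are made up for illustration, and it assumes literal matching (fixed = TRUE) is what you want, so characters such as "." or "(" in a name are not interpreted as regex syntax.

```r
# Hypothetical helper: the function name and `max_extra` are illustrative only.
partial_word_match <- function(name, check, max_extra = 4) {
  # split each name into words on runs of whitespace
  words <- strsplit(as.character(name), "\\s+")
  check <- as.character(check)
  mapply(
    function(ws, ck) {
      ws <- ws[nzchar(ws)]  # drop empty strings left by leading whitespace
      # literal (non-regex) test: does each word occur inside ck?
      contained <- vapply(ws, function(w) grepl(w, ck, fixed = TRUE), logical(1))
      # number of characters in ck that are not part of the word
      extra <- nchar(ck) - nchar(ws)
      any(contained & extra <= max_extra)
    },
    words, check,
    USE.NAMES = FALSE
  )
}

df <- data.frame(
  name  = c("Denzel Washington", "Andrew Garfield Junior", "Ryan G Gosling"),
  check = c("Denzelboss", "Garfield", "Goslin")
)
partial_word_match(df$name, df$check)
#[1]  TRUE  TRUE FALSE
```

Note that, as in the code above, a one-letter word such as "G" counts as a match whenever the check entry contains it and is at most max_extra + 1 characters long, so you may want to filter out very short words first.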