user1569897

Reputation: 437

In R, how to check whether a word in one entry partially matches the word in another entry

Specifically, I'd like to check whether a substring of the entry in one column exactly matches one of the words in the corresponding entry of another column, where the part outside the matching substring must not be too long (no more than four characters).

If I have a dataframe

df <- data.frame("name"=c("Denzel Washington","Andrew Garfield Junior","Ryan G Gosling"),"check"=c("Denzelboss","Garfield","Goslin"))

then I want the results to be

True, True, False

The first is true because one of the two words, "Denzel", is a substring of the other entry (and the deviation string "boss" is not longer than 4 characters). The second is true because one of the three words, "Garfield", is contained in the other entry; it's an exact match. The third is false because none of the three words is a substring of the entry in the 'check' column ("Gosling" would return true).

All entries in the second column have only one word. I don't want to use a fuzzy matching algorithm because the word in the entry (like "Denzel") should be an exact substring of the other entry, "Denzelboss", but I also don't want to return true when the entry is "DenzelJohnson", where the deviation "Johnson" is too long.

Upvotes: 2

Views: 311

Answers (2)

CPak

Reputation: 13591

Your data frame, created with stringsAsFactors = FALSE so that both columns are character vectors:

df <- data.frame("name"=c("Denzel Washington","Andrew Garfield Junior","Ryan G Gosling"),"check"=c("Denzelboss","Garfield","Goslin"),stringsAsFactors=FALSE)

I use iterators::iter to iterate over the rows of df, and stringr verbs to do the matching:

library(iterators)
library(stringr)

Reduce("c", lapply(iter(df, by = "row"), function(x) {
  # Split the name into words, then test each word against the check entry
  words <- as.list(unlist(str_extract_all(x$name, boundary("word"))))
  Reduce("any", mapply(function(y, z) str_detect(z, y) & nchar(str_replace(z, y, "")) < 5, words, x$check))
}))

[1]  TRUE  TRUE FALSE

Upvotes: 0

thelatemail

Reputation: 93938

Here I run grepl inside a mapply loop over the rows, checking that the difference in length between the check entry and each matching word (compared via nchar) is no more than the limit of 4:

df[] <- lapply(df, as.character)
mapply(
  function(sp,ck) any(sapply(sp, function(x) grepl(x,ck) & (nchar(ck)-nchar(x) <= 4))),
  strsplit(df$name,"\\s+"),
  df$check
)
#[1]  TRUE  TRUE FALSE
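
The same idea can be wrapped in a small function, which makes it easy to try the other cases mentioned in the question ("DenzelJohnson" and "Gosling"). This is only a sketch: the name partial_match and its max_extra argument are made up for illustration, and it assumes the words should be matched literally, so it passes fixed = TRUE to grepl to keep regex metacharacters in a name (such as "." or "+") from being interpreted as a pattern.

```r
# Hypothetical helper (not part of the original answer): TRUE when some word of
# `name` occurs literally inside `check` and at most `max_extra` characters of
# `check` are left over.
partial_match <- function(name, check, max_extra = 4) {
  words <- strsplit(name, "\\s+")  # one character vector of words per name
  mapply(
    function(ws, ck) {
      # fixed = TRUE: treat each word as a plain string, not a regex
      found <- vapply(ws, function(w) grepl(w, ck, fixed = TRUE), logical(1))
      any(found & (nchar(ck) - nchar(ws) <= max_extra))
    },
    words, check,
    USE.NAMES = FALSE
  )
}

df <- data.frame(
  name  = c("Denzel Washington", "Andrew Garfield Junior", "Ryan G Gosling"),
  check = c("Denzelboss", "Garfield", "Goslin"),
  stringsAsFactors = FALSE
)

partial_match(df$name, df$check)
#[1]  TRUE  TRUE FALSE
partial_match("Denzel Washington", "DenzelJohnson")
#[1] FALSE
partial_match("Ryan G Gosling", "Gosling")
#[1] TRUE
```

Note that in the third row the single-letter word "G" does occur in "Goslin"; the row is FALSE only because the five leftover characters exceed the limit.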

Upvotes: 4
