Reputation: 437
Specifically, I'd like to check whether a substring of the entry in one column exactly matches one of the words in the entry in another column, as long as the non-matching remainder is not too long (no more than four characters).
Suppose I have this data frame:
df <- data.frame("name"=c("Denzel Washington","Andrew Garfield Junior","Ryan G Gosling"),"check"=c("Denzelboss","Garfield","Goslin"))
then I want the results to be
True, True, False
The first is TRUE because one of the two words, "Denzel", is a substring of the other entry (and the deviation string "boss" is not longer than 4 characters). The second is TRUE because one of the three words, "Garfield", is contained in the other entry; it is an exact match. The third is FALSE because none of the three words is a substring of the entry in the 'check' column ("Gosling" would return TRUE).
All entries in the second column consist of a single word. I don't want to use a fuzzy matching algorithm, because the word in the entry (like "Denzel") should be an exact substring of the other entry ("Denzelboss"), but I also don't want to return TRUE when the entry is "DenzelJohnson", where the deviation "Johnson" is too long.
Upvotes: 2
Views: 311
Reputation: 13591
Your data frame, created with stringsAsFactors=F so that both columns are character vectors rather than factors:
df <- data.frame("name"=c("Denzel Washington","Andrew Garfield Junior","Ryan G Gosling"),"check"=c("Denzelboss","Garfield","Goslin"),stringsAsFactors=F)
I use iterators::iter to iterate over the rows of df, and stringr verbs to do the matching:
library(iterators)
library(stringr)
Reduce("c", lapply(iter(df,by="row"), function(x) Reduce("any", mapply(function(y,z) ifelse(str_detect(z, y) & nchar(str_replace(z, y, "")) < 5, TRUE, FALSE), as.list(unlist(str_extract_all(x$name, boundary("word")))), x$check))))
[1] TRUE TRUE FALSE
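The one-liner above is dense, so here is a rough unpacking of what it does for the first row only. This is a sketch rather than a replacement; the variable names (`words`, `found`, `leftover`, `row_result`) are mine and do not appear in the answer.

```r
library(stringr)

name  <- "Denzel Washington"
check <- "Denzelboss"

# Split the name into words: "Denzel" "Washington"
words <- unlist(str_extract_all(name, boundary("word")))

# Is each word found inside the check string? (each word is used as a regex pattern)
found <- str_detect(check, words)

# Length of what remains of the check string once the first match of the word is removed
leftover <- nchar(str_replace(check, words, ""))

# The row passes if any word is found and leaves fewer than 5 extra characters
row_result <- any(found & leftover < 5)
```

For this row only "Denzel" is found, and removing it leaves "boss", which is short enough, so `row_result` is TRUE.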
Upvotes: 0
Reputation: 93938
Here I am running grepl in an mapply loop for each row, and checking that the difference in length between the check string and each matched word (number of characters, via nchar) does not exceed the limit of 4:
df[] <- lapply(df, as.character)
mapply(
  function(sp,ck) any(sapply(sp, function(x) grepl(x,ck) & (nchar(ck)-nchar(x) <= 4))),
  strsplit(df$name,"\\s+"),
  df$check
)
#[1] TRUE TRUE FALSE
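If this needs to be reused, the same idea might be wrapped in a small function. The sketch below is an assumption-laden variant, not the code above: the name `name_matches` and the `max_extra` argument are hypothetical, and it passes `fixed = TRUE` to grepl so that words containing regex metacharacters (such as ".") are matched literally.

```r
# Hypothetical helper: `name_matches` and `max_extra` are illustrative names only.
name_matches <- function(name, check, max_extra = 4) {
  words <- strsplit(name, "\\s+")  # one character vector of words per row
  mapply(function(ws, ck) {
    # fixed = TRUE treats each word as a literal string, not a regular expression
    found <- vapply(ws, function(w) grepl(w, ck, fixed = TRUE), logical(1))
    # number of characters in the check string beyond each word
    short <- nchar(ck) - nchar(ws) <= max_extra
    any(found & short)
  }, words, check, USE.NAMES = FALSE)
}

res <- name_matches(
  c("Denzel Washington", "Andrew Garfield Junior", "Ryan G Gosling", "Denzel Washington"),
  c("Denzelboss", "Garfield", "Goslin", "DenzelJohnson")
)
```

The fourth pair is the "DenzelJohnson" case from the question: "Denzel" is found, but the seven leftover characters exceed the limit, so that row is rejected. Note that a very short word such as "G" can still match a short check string, so a minimum word length might be worth adding depending on the data.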
Upvotes: 4