Reputation: 1237
I have data like this:
library(dplyr)
Data <- tibble(
Name1 = c("PlaceA, PlaceB & PlaceC", "PlaceD and PlaceE", "PlaceF.", "PlaceG & PlaceH", "Place K-Place L", "Place M and Place N","PlaceP-PlaceQ"),
Name2 = c("PlaceB, PlaceA & PlaceC", "PlaceD & PlaceE", "PlaceF","PlaceG & PlaceJ", "Place L-Place K", "Place N and Place M","PlaceP-PlaceR"))
I would like to compare the two columns row by row to see if they are the same, while ignoring 1) the order of the words, 2) the characters used to separate the words, and 3) whether an '&' has been used instead of 'and'.
With an output like this:
Data %>% mutate(Match = c("TRUE","TRUE","TRUE","FALSE","TRUE","TRUE","FALSE"))
I'm sure there must be a way of using stringr to do this, but I can't find it.
Edit: @akrun noticing a typo in my dummy data made me think about typos in my real data. If two words differ by only one letter (either an additional letter or a mistyped letter), they are probably the same and should match. If a word has the same letters but in a different order, it shouldn't match. Something like this:
Mispellings <- tibble(
Name1 = c("Location","Place","Racecar"),
Name2 = c("Locatione","Pluce","Carrace"),
Match = c("TRUE", "TRUE", "FALSE"))
Can any solution for my original question also deal with this additional scenario?
Upvotes: 1
Views: 78
Reputation: 887213
One option is to split each string into a list of place names and sort them, then compare the corresponding list elements:
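# split each string on the separators (, & . - or "and"), then sort the pieces
lst1 <- lapply(strsplit(Data$Name1, "\\s*[,&.-]\\s*|\\s*and\\s*"), sort)
lst2 <- lapply(strsplit(Data$Name2, "\\s*[,&.-]\\s*|\\s*and\\s*"), sort)
# identical() also returns FALSE when the two rows contain different numbers of places
mapply(identical, lst1, lst2)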
[1] TRUE TRUE TRUE FALSE TRUE TRUE FALSE
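To get this back into the tibble as the Match column from the question, the same steps can be wrapped in a small helper. This is only a sketch: `split_places` and `same_places` are names I made up, and I am assuming a standalone "and" is always surrounded by spaces, so the pattern uses `\\s+and\\s+` to avoid splitting a name such as "Portland".

```r
library(dplyr)

Data <- tibble(
  Name1 = c("PlaceA, PlaceB & PlaceC", "PlaceD and PlaceE", "PlaceF.", "PlaceG & PlaceH",
            "Place K-Place L", "Place M and Place N", "PlaceP-PlaceQ"),
  Name2 = c("PlaceB, PlaceA & PlaceC", "PlaceD & PlaceE", "PlaceF", "PlaceG & PlaceJ",
            "Place L-Place K", "Place N and Place M", "PlaceP-PlaceR"))

# hypothetical helper: split on , & . - or a standalone "and",
# drop empty pieces, trim whitespace and sort
split_places <- function(x) {
  parts <- strsplit(x, "\\s*[,&.-]\\s*|\\s+and\\s+")
  lapply(parts, function(p) sort(trimws(p[nzchar(p)])))
}

# hypothetical helper: TRUE where both strings contain exactly the same places
same_places <- function(a, b) {
  unname(mapply(identical, split_places(a), split_places(b)))
}

Data %>% mutate(Match = same_places(Name1, Name2))
```

Note that Match is a logical column here rather than the character "TRUE"/"FALSE" in the question; wrap it in `as.character()` if you really need strings.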
Or use setequal, which ignores the order (and any duplicates) without sorting:
do.call(mapply, c(FUN = setequal, unname(lapply(Data,
function(x) strsplit(x, "\\s*[,&.-]\\s*|\\s*and\\s*")))))
[1] TRUE TRUE TRUE FALSE TRUE TRUE FALSE
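For the typo case in the edit, exact comparison is no longer enough. One possibility, sketched below under my own assumptions, is to compare the split pieces with `adist()` (generalized Levenshtein distance, from the utils package) and accept a pair when the distance is at most 1. The helper names `split_places`, `fuzzy_same_one` and `fuzzy_same` and the `max_dist` argument are hypothetical, and the pairing is greedy (each piece takes the closest unused piece on the other side), so it is not guaranteed to find a valid pairing in every awkward case.

```r
# hypothetical helper: split on , & . - or a standalone "and" and drop empty pieces
# (uses \\s+and\\s+ so that an "and" inside a name is not treated as a separator)
split_places <- function(x) {
  parts <- strsplit(x, "\\s*[,&.-]\\s*|\\s+and\\s+")
  lapply(parts, function(p) trimws(p[nzchar(p)]))
}

# compare two character vectors of places, allowing up to max_dist edits per place
fuzzy_same_one <- function(x, y, max_dist = 1) {
  if (length(x) != length(y)) return(FALSE)
  if (length(x) == 0) return(TRUE)
  d <- utils::adist(x, y)            # edit distances: rows are x, columns are y
  used <- rep(FALSE, length(y))
  for (i in seq_along(x)) {
    hit <- which(!used & d[i, ] <= max_dist)
    if (length(hit) == 0) return(FALSE)
    # greedy choice: take the closest place that has not been paired yet
    used[hit[which.min(d[i, hit])]] <- TRUE
  }
  TRUE
}

fuzzy_same <- function(a, b, max_dist = 1) {
  unname(mapply(fuzzy_same_one, split_places(a), split_places(b),
                MoreArgs = list(max_dist = max_dist)))
}

name1 <- c("Location", "Place", "Racecar")
name2 <- c("Locatione", "Pluce", "Carrace")
fuzzy_same(name1, name2)
# [1] TRUE TRUE FALSE
```

Be aware that a one-letter tolerance cannot tell a typo from a genuinely different name: on the dummy Data, "PlaceH" vs "PlaceJ" and "PlaceQ" vs "PlaceR" are each one substitution apart, so rows 4 and 7 would come out TRUE with `max_dist = 1`. Whether that is acceptable depends on how similar the real place names are.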
Upvotes: 1