Reputation: 176
I am trying to compare strings like PRABHAKAR SHARMA and SHARMA KUMAR PRABHAKAR. The intention is to check whether all the characters of the shorter string exist in the other string. If they do, I should get a 100% match; otherwise, I should get the percentage of characters that matched.
I tried using levenshteinSim from the RecordLinkage package, but it returns a similarity score based on the number of edits required to change one string into the other.
install.packages("RecordLinkage")
require(RecordLinkage)
levenshteinSim("PRABHAKAR SHARMA","SHARMA KUMAR PRABHAKAR")
#[1] 0.3636364
I want a 100% match in such a case. Also, this has to be replicated for over 1,000,000 records.
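To see why an edit-based score is low here, the following sketch rebuilds a Levenshtein similarity with base R's adist. It assumes the score is 1 minus the edit distance divided by the length of the longer string, which is my reading of the RecordLinkage documentation. lev_sim is a hypothetical helper name, not part of any package.

```r
# Hypothetical helper: Levenshtein similarity built on utils::adist.
# Assumption: similarity = 1 - edit_distance / length of the longer string.
lev_sim <- function(a, b) {
  d <- drop(adist(a, b))              # edit distance as a plain number
  1 - d / max(nchar(a), nchar(b))
}

# Word order matters to an edit distance, so reordered names score low
sim <- lev_sim("PRABHAKAR SHARMA", "SHARMA KUMAR PRABHAKAR")
sim
```

Because every moved or extra character costs an edit, a score of this kind cannot reach 1 when the words are merely reordered.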
Upvotes: 2
Views: 1219
Reputation: 18400
Here is one approach:
s1 <- "PRABHAKAR SHARMA"
s2 <- "SHARMA KUMAR PRABHAKAR"
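compare <- function(s1, s2) {
  # Distinct characters of each string (spaces included)
  c1 <- unique(strsplit(s1, "")[[1]])
  c2 <- unique(strsplit(s2, "")[[1]])
  # Share of s1's distinct characters that also occur in s2
  length(intersect(c1, c2)) / length(c1)
}
compare(s1, s2)
#[1] 1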
It may be a little slow, though, and it treats the space as a character, too. Note that it compares distinct characters and takes s1 to be the shorter string. Use Vectorize to apply it to a column:
dat <- data.frame(small=c("a", "b"), big=c("aa", "cc"), stringsAsFactors=FALSE)
vcomp <- Vectorize(compare)
dat <- transform(dat, comp=vcomp(small, big))
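If spaces should not count and it is not known in advance which string is shorter, a variant along these lines might work. This is only a sketch: compare_chars is a hypothetical name, and it assumes that repeated characters count once, as in compare above.

```r
# Hypothetical variant of compare(): ignores spaces and always measures
# the shorter string against the longer one.
compare_chars <- function(s1, s2) {
  # Drop spaces so they do not count as matching characters
  a <- gsub(" ", "", s1, fixed = TRUE)
  b <- gsub(" ", "", s2, fixed = TRUE)
  # Make `a` the shorter of the two strings
  if (nchar(a) > nchar(b)) {
    tmp <- a
    a <- b
    b <- tmp
  }
  ca <- unique(strsplit(a, "")[[1]])
  cb <- unique(strsplit(b, "")[[1]])
  # Share of the shorter string's distinct characters found in the longer one
  length(intersect(ca, cb)) / length(ca)
}

compare_chars("SHARMA KUMAR PRABHAKAR", "PRABHAKAR SHARMA")
#[1] 1

# Apply row by row to two columns with mapply
dat2 <- data.frame(small = c("a", "b"), big = c("aa", "cc"),
                   stringsAsFactors = FALSE)
dat2$comp <- mapply(compare_chars, dat2$small, dat2$big, USE.NAMES = FALSE)
```

The argument order no longer matters here, since the function works out for itself which string is shorter.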
Upvotes: 5
Reputation: 596
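If the characters to be considered are only letters, you could use:
comp <- function(s1, s2) {
  # For each of the 26 letters: does it occur in the string (case-insensitive)?
  in1 <- letters %in% strsplit(tolower(s1), "")[[1]]
  in2 <- letters %in% strsplit(tolower(s2), "")[[1]]
  # Share of the letters of s1 that also occur in s2
  sum(in1 & in2) / sum(in1)
}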
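As a quick check on the strings from the question, here is a small usage sketch. comp is repeated so that the snippet runs on its own, and the names df and match are made up for the example. Spaces, digits and punctuation are ignored automatically because only the 26 letters are tested.

```r
comp <- function(s1, s2) {
  in1 <- letters %in% strsplit(tolower(s1), "")[[1]]
  in2 <- letters %in% strsplit(tolower(s2), "")[[1]]
  sum(in1 & in2) / sum(in1)
}

comp("PRABHAKAR SHARMA", "SHARMA KUMAR PRABHAKAR")
#[1] 1

# Row-wise over two columns; the example data frame is made up
df <- data.frame(small = c("PRABHAKAR SHARMA", "abc"),
                 big   = c("SHARMA KUMAR PRABHAKAR", "abd"),
                 stringsAsFactors = FALSE)
df$match <- mapply(comp, df$small, df$big, USE.NAMES = FALSE)
```

For the second row, two of the three letters of "abc" occur in "abd", so the score is 2/3. If s1 contains no letters at all, the result is NaN (0/0), which may need handling across a million records.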
Upvotes: 3