Reputation: 640
I have some code that I would like to vectorize, but I am not sure how. The following code gives some example data, consisting of names and addresses.
name <- c("holiday inn", "geico", "zgf", "morton phillips")
address <- c("400 lafayette pl tupelo ms", "227 geico plaza chevy chase md",
"811 quincy st washington dc", "1911 1st st rockville md")
source1 <- data.frame(name, address)
name <- c("williams sonoma", "mamas bbq", "davis polk", "hop a long diner",
"joes crag shack", "mike lowry place", "holiday inn", "zummer")
name2 <- c(NA, NA, NA, NA, NA, NA, "hi express", "zummer gunsul frasca")
address <- c("2 reads way new castle de", "248 w 4th st newark de",
"1100 21st st nw washington dc", "1804 w 5th st wilmington de",
"1208 kenwood parkway holdridge nb", "4203 ocean drive miami fl",
"400 lafayette pl tupelo ms", "811 quincy st washington dc")
source2 <- data.frame(name, name2, address)
This block calculates the Levenshtein distance between two columns of text via R's native adist function and then applies the min function.
dist.name<- adist(source1$name,source2$name, partial = TRUE, ignore.case = TRUE)
dist.address <- adist(source1$address, source2$address, partial = TRUE, ignore.case = TRUE)
min.name<-apply(dist.name, 2, min)
min.address <- apply(dist.address, 2, min)
I would like to do the following:
1. Match source1$name with source2$name based on the minimum Levenshtein distance.
2. If the results of 1 yield an NA, match based on source1$address and source2$address using Levenshtein distance.
I have tried using a for loop, which works fine for 1 but not 2. Here is the code I used to try to incorporate both:
match.s1.s2<-NULL
for(i in 1:nrow(dist.name)){
for(j in 1:nrow(dist.address)){
if(is.na(match(min.name[i], dist.name[i, ]))) {
s2.i <- match(min.address[j], dist.address[j,])
s1.i <- i
match.s1.s2 <- match.s1.s2<-rbind(data.frame(s2.i=s2.i,s1.i=s1.i,s2name=source2[s2.i,]$name,
s1name=source1[s1.i,]$name, adist=min.name[j],
s1.i.address = source1[s1.i,]$address,
s2.i.address = source2[s2.i,]$address),match.s1.s2)
} else {
s2.i<-match(min.name[i],dist.name[i,])
s1.i<-i
match.s1.s2<-rbind(data.frame(s2.i=s2.i,s1.i=s1.i,s2name=source2[s2.i,]$name, s1name=source1[s1.i,]$name,
adist=min.name[i], s1.i.address = source1[s1.i,]$address,
s2.i.address = source2[s2.i,]$address),match.s1.s2)
}
}
}
My problem is that it is slow and ends up producing a data frame that is much too large. The end result, the data frame match.s1.s2, should have the same number of rows as source1. Any advice or help would be much appreciated. Thanks.
Upvotes: 0
Views: 251
Reputation: 128
It would be more efficient to use normalized scores (between 0 and 1). That way you could use a vectorized ifelse to replace only the NA entries with the corresponding address scores. With non-normalized scores you have to replace the entire row. Try this approach:
dist.mat.nm <- adist(source1$name, source2$name, partial = TRUE, ignore.case = TRUE)
dist.mat.ad <- adist(source1$address, source2$address, partial = TRUE, ignore.case = TRUE)
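# If you use non-normalized distances: any row of name distances that
# contains an NA is replaced as a whole by the address distances
dist.mat <- dist.mat.nm
for (i in seq_len(nrow(dist.mat))) {
  # if() needs a single TRUE/FALSE, so collapse the row with anyNA()
  if (anyNA(dist.mat[i, ])) dist.mat[i, ] <- dist.mat.ad[i, ]
}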
#If you use normalized distances
dist.mat <- ifelse(is.na(dist.mat.nm), dist.mat.ad, dist.mat.nm)
which.match <- function(x, nm) return(nm[which(x == min(x))[1]])
matches <- apply(dist.mat, 1, which.match, nm = source2$name)
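For the normalized variant, adist itself returns raw edit counts, so the scaling has to be done by hand. A minimal sketch, assuming you divide each distance by the length of the longer string of the pair (the helper name norm.adist and the small vectors a and b are made up for illustration, and partial matching is left off so that the bound holds):

```r
# Hypothetical helper: Levenshtein distance scaled to [0, 1].
# The plain edit distance can never exceed the length of the longer string,
# so dividing by that length keeps every score between 0 and 1.
norm.adist <- function(x, y) {
  d <- adist(x, y, ignore.case = TRUE)
  # Matrix of max(nchar(x[i]), nchar(y[j])) with the same shape as d
  max.len <- outer(nchar(x), nchar(y), pmax)
  d / max.len
}

# Illustrative input; the NA stands in for a missing name
a <- c("holiday inn", "zgf")
b <- c("holiday inn", "hi express", NA)

scores <- norm.adist(a, b)
scores
```

Identical strings score 0, and a missing value on either side gives NA, which is what the ifelse line relies on. Two empty strings would give NaN (0 / 0), so you may want to handle that case separately.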
That may improve performance and solve your problem. If you're willing to switch to a normalized distance (instead of Levenshtein), I would recommend Jaro-Winkler.
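To get a result with exactly one row per row of source1, you can pick the closest source2 row for each source1 row and build the data frame in one go instead of calling rbind inside a loop. This is only a sketch under some assumptions: the data below is a trimmed variant of yours with an NA name added to exercise the address fallback, the matched.on column is my own addition, and every row is assumed to have at least one non-NA distance after the fallback.

```r
# Illustrative data (trimmed; the NA name is added only to show the fallback)
source1 <- data.frame(
  name    = c("holiday inn", NA, "zgf"),
  address = c("400 lafayette pl tupelo ms",
              "227 geico plaza chevy chase md",
              "811 quincy st washington dc"),
  stringsAsFactors = FALSE
)
source2 <- data.frame(
  name    = c("davis polk", "holiday inn", "zummer", "geico"),
  address = c("1100 21st st nw washington dc",
              "400 lafayette pl tupelo ms",
              "811 quincy st washington dc",
              "227 geico plaza chevy chase md"),
  stringsAsFactors = FALSE
)

dist.mat.nm <- adist(source1$name, source2$name, ignore.case = TRUE)
dist.mat.ad <- adist(source1$address, source2$address, ignore.case = TRUE)

# Rows whose name distances are all NA fall back to the address distances
use.address <- rowSums(!is.na(dist.mat.nm)) == 0
dist.mat <- dist.mat.nm
dist.mat[use.address, ] <- dist.mat.ad[use.address, ]

# Index of the closest source2 row for each source1 row (first minimum wins)
s2.i <- apply(dist.mat, 1, which.min)
s1.i <- seq_len(nrow(source1))

# One row per source1 row, built in a single call
match.s1.s2 <- data.frame(
  s1.i = s1.i,
  s2.i = s2.i,
  s1name = source1$name,
  s2name = source2$name[s2.i],
  adist = dist.mat[cbind(s1.i, s2.i)],
  matched.on = ifelse(use.address, "address", "name"),
  s1.i.address = source1$address,
  s2.i.address = source2$address[s2.i],
  stringsAsFactors = FALSE
)
match.s1.s2
```

Since each row of dist.mat holds either name distances or address distances, never a mix, the row minimum stays meaningful even without normalizing.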
Upvotes: 1