jvalenti

Reputation: 640

vectorized text mining over multiple columns

I have some code that I would like to vectorize but I am not sure how. The following code builds some example data, consisting of names and addresses.

name <- c("holiday inn", "geico", "zgf", "morton phillips")
address <- c("400 lafayette pl tupelo ms", "227 geico plaza chevy chase md", 
         "811 quincy st washington dc", "1911 1st st rockville md")

source1 <- data.frame(name, address)

name <- c("williams sonoma", "mamas bbq", "davis polk", "hop a long diner",
      "joes crag shack", "mike lowry place", "holiday inn", "zummer")

name2 <- c(NA, NA, NA, NA, NA, NA, "hi express", "zummer gunsul frasca")
address <- c("2 reads way new castle de", "248 w 4th st newark de",
         "1100 21st st nw washington dc", "1804 w 5th st wilmington de",
         "1208 kenwood parkway holdridge nb", "4203 ocean drive miami fl",
         "400 lafayette pl tupelo ms", "811 quincy st washington dc")
source2 <- data.frame(name, name2, address) 

This block calculates the Levenshtein distance between two columns of text via R's native adist function and then applies the min function.

dist.name<- adist(source1$name,source2$name, partial = TRUE, ignore.case = TRUE)
dist.address <- adist(source1$address, source2$address, partial = TRUE, ignore.case = TRUE)

min.name<-apply(dist.name, 2, min)
min.address <- apply(dist.address, 2, min)
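For reference, adist(a, b) returns a length(a) x length(b) matrix, with rows indexing its first argument and columns its second, so apply with MARGIN = 2 gives one minimum per element of the second vector. A minimal sketch with made-up strings to confirm the orientation:

```r
# adist(a, b) returns a length(a) x length(b) matrix of edit distances:
# rows index the first argument, columns the second
m <- adist(c("cat", "dog"), c("cat", "cot", "dig"), partial = TRUE)
dim(m)            # [1] 2 3
apply(m, 2, min)  # one minimum per column, i.e. per element of the second vector
```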

I would like to do the following:

  1. Match source1$name with source2$name based on the minimum Levenshtein distance.
  2. If step 1 yields an NA, match based on source1$address and source2$address, again using Levenshtein distance.

I have tried using a for loop, which works fine for step 1 but not for step 2. Here is the code I used to try to incorporate both:

    match.s1.s2 <- NULL
    for(i in 1:nrow(dist.name)){
      for(j in 1:nrow(dist.address)){
        if(is.na(match(min.name[i], dist.name[i, ]))) {
          s2.i <- match(min.address[j], dist.address[j, ])
          s1.i <- i
          match.s1.s2 <- rbind(data.frame(s2.i = s2.i, s1.i = s1.i,
                                          s2name = source2[s2.i, ]$name,
                                          s1name = source1[s1.i, ]$name,
                                          adist = min.name[j],
                                          s1.i.address = source1[s1.i, ]$address,
                                          s2.i.address = source2[s2.i, ]$address),
                               match.s1.s2)
        } else {
          s2.i <- match(min.name[i], dist.name[i, ])
          s1.i <- i
          match.s1.s2 <- rbind(data.frame(s2.i = s2.i, s1.i = s1.i,
                                          s2name = source2[s2.i, ]$name,
                                          s1name = source1[s1.i, ]$name,
                                          adist = min.name[i],
                                          s1.i.address = source1[s1.i, ]$address,
                                          s2.i.address = source2[s2.i, ]$address),
                               match.s1.s2)
        }
      }
    }
    

My problem is that this is slow and it produces a data frame that is much too large. The end result, match.s1.s2, should have the same number of rows as source1. Any advice or help would be much appreciated. Thanks.

Upvotes: 0

Views: 251

Answers (1)

Anderson Neisse

Reputation: 128

It would be more efficient to use normalized scores (between 0 and 1). That way you can use a vectorized ifelse to replace only the NA entries with the corresponding address scores; with non-normalized scores you have to replace the entire row. Try this approach:

dist.mat.nm <- adist(source1$name, source2$name, partial = TRUE, ignore.case = TRUE)
dist.mat.ad <- adist(source1$address, source2$address, partial = TRUE, ignore.case = TRUE)

#If you use non-normalized distances, replace NAs a whole row at a time
dist.mat <- dist.mat.nm
for(i in 1:nrow(dist.mat)){
  if(any(is.na(dist.mat[i, ]))) dist.mat[i, ] <- dist.mat.ad[i, ]
}

#If you use normalized distances
dist.mat <- ifelse(is.na(dist.mat.nm), dist.mat.ad, dist.mat.nm)

which.match <- function(x, nm) return(nm[which(x == min(x))[1]])

matches <- apply(dist.mat, 1, which.match, nm = source2$name)

That may improve the performance and solve your problem. If you're willing to change to a normalized distance (instead of Levenshtein), I would recommend the Jaro-Winkler distance.
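If you go that route, the stringdist package (an assumption on my part; it isn't loaded anywhere in your code) computes Jaro-Winkler scores that are already in [0, 1], so the same vectorized ifelse works without any rescaling. A sketch, reusing your source1, source2, and my which.match from above:

```r
# install.packages("stringdist")  # assumption: package not in your setup
library(stringdist)

# method = "jw" with p = 0.1 is the usual Jaro-Winkler variant; stringdistmatrix
# returns a length(a) x length(b) matrix, the same shape adist() gives
jw.nm <- stringdistmatrix(source1$name, source2$name, method = "jw", p = 0.1)
jw.ad <- stringdistmatrix(source1$address, source2$address, method = "jw", p = 0.1)

# same vectorized NA replacement as above, no normalization step needed
jw.mat <- ifelse(is.na(jw.nm), jw.ad, jw.nm)
matches.jw <- apply(jw.mat, 1, which.match, nm = source2$name)
```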

Upvotes: 1
