jvalenti

Reputation: 640

Subset using grep to include Levenshtein distance?

I have two dataframes, with two character vectors of differing lengths that I would like to match, like so:

company.a <- c("heinz", "hawkings mcgill", "heinz ketchup", "heinz vinegars", "davis and smith", "dell computers", "dell", "O organics", "organics")

company.b <- c("heinz", "hawkings-mcgill", "oyster bay", "company x", "dell")

I would like to compare company.b to company.a, returning a vector with one entry per element of company.a that holds the element of company.b it was matched to. I tried the following code to subset the larger vector:

match.comp <- subset(company.a, grep(paste(company.b, collapse = "|"), company.a, value = TRUE))

However, what I get in return is an error stating 'subset' must be logical. I would like the following result:

match <- c("heinz", "hawkings mcgill", "heinz", "heinz", "FALSE", "dell", "dell", FALSE, FALSE)

Given the error, clearly I'm missing something about grep or subset. I have two questions:

  1. Is grep the best way to do this? Or is there another way? I know about exact matching using the which(A %in% B) approach, but I cannot guarantee that the strings will be exact matches.

  2. grep will return the first match, but is there a way to extract all possible matches that were considered via, say, Levenshtein distance? I'm aware of the adist function in the utils package, but I want to know if it can be combined with grep.

Any help or advice would be greatly appreciated. Thanks.

Upvotes: 1

Views: 744

Answers (2)

alistaire

Reputation: 43334

You can use adist or agrep, but choosing the edit costs and the cutoff point is a subjective call that depends on your data. In this case, you could get the desired result with:

d <- adist(company.b, company.a, partial = TRUE) 
d <- apply(d, 2, prop.table)    # working with proportions instead of costs can be useful

matches <- apply(d, 2, function(x){
    x <- setNames(x, company.b)
    names(which.min(x[x < 0.05]))    # set cutoff carefully
})
matches <- sapply(matches, function(x){ifelse(is.null(x), NA, x)})    # clean out NULLs

matches
## [1] "heinz"           "hawkings-mcgill" "heinz"           "heinz"           NA               
## [6] "dell"            "dell"            NA                NA    
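
For comparison, here is a minimal sketch of the agrep route mentioned above. It assumes the default costs and a max.distance of 0.1 (a fraction of the pattern length), both of which are illustrative choices rather than recommendations. Note that agrep looks for the pattern inside each target string, so this lists, for each element of company.b, the company.a entries that approximately contain it, which is the reverse of the mapping asked for.

```r
company.a <- c("heinz", "hawkings mcgill", "heinz ketchup", "heinz vinegars",
               "davis and smith", "dell computers", "dell", "O organics", "organics")
company.b <- c("heinz", "hawkings-mcgill", "oyster bay", "company x", "dell")

# For each pattern in company.b, collect the company.a entries that
# approximately contain it; simplify = FALSE keeps a list named by pattern.
hits <- sapply(company.b, function(p) {
    agrep(p, company.a, max.distance = 0.1, value = TRUE)    # cutoff is an assumption
}, simplify = FALSE)

hits[["heinz"]]    # candidates for "heinz"
```

Patterns without any approximate match come back as character(0), so they are easy to filter out afterwards.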

Upvotes: 2

raghu

Reputation: 416

This might partially satisfy your requirement:

library(utils)

company.a <- c("heinz", "hawkings mcgill", "heinz ketchup", "heinz vinegars", "davis and smith", "dell computers", "dell", "O organics", "organics")
company.b <- c("heinz", "hawkings-mcgill", "oyster bay", "company x", "dell")

limit <- 2

res <- sapply(company.a, function(wa) {
    d <- sapply(company.b, function(wb){
        adist(wb, wa)
    })
    d <- d[d<=limit]
    names(d)
})

The snippet above returns, for each word in the first vector (company.a), all matching words in the second vector (company.b). Two words are matched if their Levenshtein distance is at most limit.
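
The same idea can probably be written without the nested sapply, since adist accepts whole vectors and returns the full distance matrix in one call. This is only a sketch of an equivalent formulation; the name res2 is made up for illustration.

```r
company.a <- c("heinz", "hawkings mcgill", "heinz ketchup", "heinz vinegars",
               "davis and smith", "dell computers", "dell", "O organics", "organics")
company.b <- c("heinz", "hawkings-mcgill", "oyster bay", "company x", "dell")

limit <- 2

# Distance matrix: one row per element of company.b, one column per element of company.a
d <- adist(company.b, company.a)

# For each column (word in company.a), keep the company.b words within the limit
res2 <- lapply(seq_along(company.a), function(j) company.b[d[, j] <= limit])
names(res2) <- company.a

res2[["hawkings mcgill"]]
```

Words in company.a with no match within the limit get an empty character vector.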

Also note that some of the matches you listed are not this simple. For example, for "heinz" to match "heinz ketchup", the Levenshtein distance limit would have to be 8, which is too high in general. A more involved distance function would have to be constructed to deal with these cases.
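
One possible starting point, offered as a sketch rather than a complete solution, is the partial argument of adist, which lets the first string match a substring of the second instead of the whole string. The variable names below are only for illustration.

```r
# Full-string Levenshtein distance: " ketchup" has to be inserted (8 characters)
full <- adist("heinz", "heinz ketchup")

# Partial matching: "heinz" only has to match a substring of the target
part <- adist("heinz", "heinz ketchup", partial = TRUE)

full[1, 1]    # 8
part[1, 1]    # 0
```

With partial matching, short strings can match almost anything, so the cutoff still needs care.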

Upvotes: 2
