subset using grep to include levenshtein distance?

Question

I have two data frames, each with a character vector; the two vectors differ in length, and I would like to match them, like so:

company.a <- c("heinz", "hawkings mcgill", "heinz ketchup", "heinz vinegars", "davis and smith", "dell computers", "dell", "O organics", "organics")

company.b <- c("heinz", "hawkings-mcgill", "oyster bay", "company x", "dell")

I would like to compare company.b to company.a, returning a vector that holds, for each element of company.a, the element of company.b it was matched to. I tried the following code to subset the larger frame:

match.comp <- subset(company.a, grep(paste(company.b, collapse = "|"), company.a, value = TRUE))

However, what I get in return is an error stating 'subset' must be logical. I would like the following result:

match <- c("heinz", "hawkings mcgill", "heinz", "heinz", "FALSE", "dell", "dell", FALSE, FALSE)

Given the error, clearly I'm missing something about grep or subset. I have two questions:

1. Is grep the best way to do this, or is there another way? I know about exact matching using the which(A %in% B) approach, but I cannot guarantee that the strings will be exact matches.
2. grep will return the first match, but is there a way to extract all possible matches that were considered via, say, Levenshtein distance? I'm aware of the adist function in the utils package, but I want to know if it can be combined with grep.

Any help or advice would be greatly appreciated. Thanks.

alistaire · Accepted Answer

You can use adist or agrep, but choosing the costs and cutoff points is a subjective process. In this case, you could get the desired result with:

d <- adist(company.b, company.a, partial = TRUE)    # rows: company.b, columns: company.a
d <- apply(d, 2, prop.table)    # working with proportions instead of costs can be useful

matches <- apply(d, 2, function(x){
    x <- setNames(x, company.b)
    names(which.min(x[x < 0.05]))    # set cutoff carefully
})
matches <- sapply(matches, function(x){ifelse(is.null(x), NA, x)})    # clean out NULLs

matches
## [1] "heinz"           "hawkings-mcgill" "heinz"           "heinz"           NA               
## [6] "dell"            "dell"            NA                NA
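
On the two side questions, here is a rough sketch rather than a definitive recipe. The original call fails because subset() needs a logical vector as its second argument, while grep(..., value = TRUE) returns the matching strings themselves; grepl() returns the TRUE/FALSE vector instead. To list every approximate candidate per pattern rather than a single best match, agrep() can be applied to each element of company.b. The names hits, exact, and candidates, and the max.distance = 0.1 setting, are illustrative assumptions, not tuned values.

```r
company.a <- c("heinz", "hawkings mcgill", "heinz ketchup", "heinz vinegars",
               "davis and smith", "dell computers", "dell", "O organics", "organics")
company.b <- c("heinz", "hawkings-mcgill", "oyster bay", "company x", "dell")

# grepl() returns one TRUE/FALSE per element of company.a,
# which is the kind of argument subset() expects.
hits <- grepl(paste(company.b, collapse = "|"), company.a)
exact <- subset(company.a, hits)    # literal regex matches only

# For each pattern in company.b, collect every element of company.a that
# agrep() accepts as an approximate match (max.distance is an assumed cutoff).
candidates <- lapply(setNames(company.b, company.b), function(p) {
    agrep(p, company.a, max.distance = 0.1, value = TRUE)
})
```

Because the regular expression is matched literally, exact has no entry for "hawkings mcgill"; candidates[["hawkings-mcgill"]] is where an approximate hit for it shows up. Loosening or tightening max.distance changes which candidates are returned, so it needs the same care as the cutoff above.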
