Reputation: 640
I have two dataframes, with two character vectors of differing lengths that I would like to match, like so:
company.a <- c("heinz", "hawkings mcgill", "heinz ketchup", "heinz vinegars", "davis and smith", "dell computers", "dell", "O organics", "organics")
company.b <- c("heinz", "hawkings-mcgill", "oyster bay", "company x", "dell")
I would like to compare company.b to company.a, returning a vector the same length as company.a that holds, for each element, the company.b entry it was matched to. I tried using the following code to subset the larger frame:
match.comp <- subset(company.a, grep(paste(company.b, collapse = "|"), company.a, value = TRUE))
However, what I get in return is an error stating 'subset' must be logical. I would like the following result:
match <- c("heinz", "hawkings mcgill", "heinz", "heinz", "FALSE", "dell", "dell", FALSE, FALSE)
Given the error, clearly I'm missing something about grep or subset. I have two questions:
Is grep the best way to do this? Or is there another way? I know about exact matching using the which(A %in% B) approach, but I cannot guarantee that the strings will be exact matches.
Grep will return the first match, but is there a way to extract all possible matches that were considered via, say, Levenshtein distance? I'm aware of the adist function in the utils package, but I want to know if it can be combined with grep.
Any help or advice would be greatly appreciated. Thanks.
Upvotes: 1
Views: 744
Reputation: 43334
You can use adist or agrep, but it's a very subjective process in terms of what the costs and cutoff points should be. In this case, you could get the desired result with
d <- adist(company.b, company.a, partial = TRUE)
d <- apply(d, 2, prop.table) # working with proportions instead of costs can be useful
matches <- apply(d, 2, function(x){
  x <- setNames(x, company.b)
  names(which.min(x[x < 0.05])) # set cutoff carefully
})
matches <- sapply(matches, function(x){ifelse(is.null(x), NA, x)}) # clean out NULLs
matches
## [1] "heinz" "hawkings-mcgill" "heinz" "heinz" NA
## [6] "dell" "dell" NA NA
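Since agrep is mentioned above but not shown, here is a minimal sketch of how it could be used to list every candidate in company.a for each name in company.b. The name agrep.matches is made up for illustration, and max.distance = 0.1 simply restates agrep's default; treat it as a starting point to tune, not a recommendation.

```r
company.a <- c("heinz", "hawkings mcgill", "heinz ketchup", "heinz vinegars",
               "davis and smith", "dell computers", "dell", "O organics", "organics")
company.b <- c("heinz", "hawkings-mcgill", "oyster bay", "company x", "dell")

# For each name in company.b, collect the company.a entries that agrep()
# treats as an approximate (substring) match.
# max.distance is a fraction of the pattern length; 0.1 is an assumed cutoff.
agrep.matches <- lapply(setNames(company.b, company.b), function(pattern) {
  agrep(pattern, company.a, max.distance = 0.1, value = TRUE)
})

agrep.matches
```

This goes in the opposite direction from the adist code (one list element per company.b name rather than per company.a name), so it is better suited to inspecting candidates than to producing the final vector.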
Upvotes: 2
Reputation: 416
This might partially satisfy your requirement:
library(utils)
company.a <- c("heinz", "hawkings mcgill", "heinz ketchup", "heinz vinegars", "davis and smith", "dell computers", "dell", "O organics", "organics")
company.b <- c("heinz", "hawkings-mcgill", "oyster bay", "company x", "dell")
limit <- 2
res <- sapply(company.a, function(wa) {
  d <- sapply(company.b, function(wb){
    adist(wb, wa)
  })
  d <- d[d<=limit]
  names(d)
})
The above code snippet extracts, for each word in the first array, all matching words from the second array. Here, two words are considered a match if their Levenshtein distance is at most 'limit'.
Also note that some of the matches you have indicated above are not this simple. For example, for "heinz" to match "heinz ketchup", the Levenshtein distance limit would have to be 8, which is too high in general. A more involved distance function will have to be constructed to deal with these cases.
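One possible direction, offered only as a sketch: adist has a partial argument that measures the distance to the best-matching substring instead of the whole string, which handles the "heinz" vs "heinz ketchup" case. The names partial.limit and res.partial, and the cutoff of 1, are assumptions made for illustration.

```r
company.a <- c("heinz", "hawkings mcgill", "heinz ketchup", "heinz vinegars",
               "davis and smith", "dell computers", "dell", "O organics", "organics")
company.b <- c("heinz", "hawkings-mcgill", "oyster bay", "company x", "dell")

# Whole-string distance: the 8 characters of " ketchup" must be inserted.
adist("heinz", "heinz ketchup")

# Substring distance: "heinz" occurs verbatim inside the target,
# so no edits are needed.
adist("heinz", "heinz ketchup", partial = TRUE)

# Same idea applied to both vectors: rows are company.b, columns company.a.
partial.limit <- 1  # assumed cutoff, needs tuning for real data
d <- adist(company.b, company.a, partial = TRUE)

# For each company.a entry, keep the company.b names within the cutoff.
res.partial <- lapply(seq_along(company.a), function(j) {
  company.b[d[, j] <= partial.limit]
})
names(res.partial) <- company.a

res.partial
```

Substring matching is permissive for short patterns, so a cutoff that works for "heinz" may produce false positives with other names.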
Upvotes: 2