Reputation: 397
I have a bunch of company names to match. For example, I want to match the string A&A PRECISION with A&A PRECISION ENGINEERING.
However, almost every similarity measure I have tried (Hamming distance, Levenshtein distance, restricted Damerau-Levenshtein distance, full Damerau-Levenshtein distance, longest common substring distance, q-gram distance, cosine distance, Jaccard distance, Jaro distance, and Jaro-Winkler distance) matches B&B PRECISION instead.
Any idea which metric would give more emphasis to the preciseness of the substrings and its sequence matched and care less about the length of the string? I suspect the metrics keep choosing the wrong candidate because they penalize the difference in string length.
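To make the mismatch concrete, here is a minimal sketch using base R's adist(), which computes the Levenshtein distance. The variable names (query, candidates, d, best) are only for illustration; the other measures listed above are not reproduced here.

```r
# Levenshtein distance from the query to each candidate (adist() is in base R's utils)
query <- "A&A PRECISION"
candidates <- c("A&A PRECISION ENGINEERING", "B&B PRECISION")

d <- drop(adist(query, candidates))
names(d) <- candidates
print(d)

# The candidate with the smallest distance is the one the metric "matches"
best <- candidates[which.min(d)]
print(best)
```

The wrong candidate needs only two substitutions, while the right one needs twelve insertions for " ENGINEERING", so the extra length dominates the score.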
Upvotes: 3
Views: 871
Reputation: 6496
If you really want to "...give more emphasis to the preciseness of the substrings and its sequence...", then this approach could work, as it tests whether one string is a substring of another:
library(data.table)
x <- c("A&A PRECISION", "A&A PRECISION ENGINEERING", "B&B PRECISION")
y <- x
We want to expand the grid, i.e. build every pair of names. For that I'd use the CJ function in data.table. Then we check each pair to see whether x is a substring of y (the check is directional, so it doesn't work the other way round):
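# Cross join all names, flag pairs where x occurs inside y, then drop the self-pairs.
# Note that %like% treats x as a regular expression, not as a literal string.
CJ(x, y)[, similarity := apply(.SD, 1, function(p) p[2] %like% p[1]), .SDcols = c("x", "y")][x != y, ]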
                           x                         y similarity
1:             A&A PRECISION A&A PRECISION ENGINEERING       TRUE
2:             A&A PRECISION             B&B PRECISION      FALSE
3: A&A PRECISION ENGINEERING             A&A PRECISION      FALSE
4: A&A PRECISION ENGINEERING             B&B PRECISION      FALSE
5:             B&B PRECISION             A&A PRECISION      FALSE
6:             B&B PRECISION A&A PRECISION ENGINEERING      FALSE
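Because %like% interprets its pattern as a regular expression, names containing characters such as "." or "(" could give surprising results. A minimal base R sketch of the same pairwise check with literal matching might look like this; expand.grid() stands in for CJ() here only so the example has no package dependency, and the name pairs is just for illustration.

```r
x <- c("A&A PRECISION", "A&A PRECISION ENGINEERING", "B&B PRECISION")

# All ordered pairs of names, without the self-pairs
pairs <- expand.grid(x = x, y = x, stringsAsFactors = FALSE)
pairs <- pairs[pairs$x != pairs$y, ]

# fixed = TRUE makes grepl() treat x as a literal string rather than a regex
pairs$similarity <- mapply(grepl, pairs$x, pairs$y,
                           MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)
print(pairs)
```

Only the pair where x is "A&A PRECISION" and y is "A&A PRECISION ENGINEERING" should come out TRUE, as in the table above.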
Please keep in mind that you'll need to make sure that the strings are as neat as possible for this to work, and even then it might fail.
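If an exact substring test turns out to be too strict, one possible fallback (a different technique from the substring check above, not part of it) is approximate substring matching with base R's adist() and partial = TRUE, which scores the query against the best-matching substring of each candidate. A rough sketch, with a made-up misspelled candidate added for illustration:

```r
query <- "A&A PRECISION"
candidates <- c("A&A PRECISION ENGINEERING",
                "B&B PRECISION",
                "A&A PRECISON ENGINEERING LTD")  # hypothetical entry with a typo

# partial = TRUE: distance from the query to the closest substring of each candidate,
# so extra trailing words no longer count against a candidate
d <- drop(adist(query, candidates, partial = TRUE))
names(d) <- candidates
print(d)

best <- candidates[which.min(d)]
print(best)
```

With this scoring the longer, correct name is no longer penalized for its length, and a small typo costs less than the two substitutions needed for B&B PRECISION.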
There are some things I'd check to clean your strings:
You can achieve that with the stringi package.
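As a rough idea of what such cleaning could look like with stringi, here is a small sketch. The specific steps (upper-casing, collapsing repeated whitespace, trimming) are assumptions chosen for illustration, and clean_name is a hypothetical helper, not a stringi function.

```r
library(stringi)

# Hypothetical helper: normalize case and whitespace before matching
clean_name <- function(s) {
  s <- stri_trans_toupper(s)                    # make matching case-insensitive
  s <- stri_replace_all_regex(s, "\\s+", " ")   # collapse runs of whitespace
  stri_trim_both(s)                             # drop leading/trailing spaces
}

cleaned <- clean_name(c("  a&a  Precision ", "A&A PRECISION"))
print(cleaned)
```

After cleaning, both inputs should become the same string, so the substring test no longer fails on case or stray spaces.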
Upvotes: 2