Reputation: 5935
I am trying to decide if two different restaurant names are similar to be able to match them. The names might be misspelled or the parts of the title might be in the wrong order.
In some cases it is a simple as matching: "Angry Diner" with "Angry Diner Restaurant". or "Burger King" with "Burgor King"
A harder case that I have found is: "Mathias Dahlgren Matbaren" and "Restaurant Mathias Dahlgren"
I have looked at several different fuzzy string difference algorithms but have not found one for this use case.
Anyone who knows about an algorithm and/or library I can use?
Upvotes: 3
Views: 2608
Reputation: 664
First of all: you will get much better results if you have more than just the name to match on, such as the address. Then you can use a record linkage engine to consider the evidence from all the properties. Using just the name is going to give poor precision in most cases.
The first thing you need to consider is if you are likely to see reordering of substrings. That is, "Restaurant Angry Diner" vs "Angry Diner Restaurant". In that case, q-grams, longest common substring and longest common subsequence are all good candidates. For q-grams you can choose between various sub-formulas and matchings.
If you want the order to matter, affine gap would probably be good for this particular. It's similar to Smith-Waterman but doesn't penalize so much for deletions. Basically, the first deletion is expensive, but then later deletions in the same place are less expensive.
As others have suggested, removing common words like "restaurant", "matbaren" etc before matching is likely to improve accuracy.
There are piles of libraries, but since you didn't specify the programming language it's hard to recommend one. What use is Java if you use PHP? Or vice versa?
But please note carefully what I wrote above: name alone is not going to work very well. Even if the names are identical it could still be two totally different restaurants.
Upvotes: 3
Reputation: 8095
I think that the best fitting algorithm would be the algorithm for optimal local alignment: Smith-Waterman-Algorithm
penalty("Angry Diner","Angry Diner Restaurant") = 0
penalty("Burger King", "Burgor King") = 1
penalty("Mathias Dahlgren Matbaren", "Restaurant Mathias Dahlgren") = 0
It is a variant of the Levensthein-Algorithm with the difference that insertions/deletions of characters at the beginning/end are not penalized.
Upvotes: 0
Reputation: 12592
You can try the diff algorithm. It creates all possible strings and find the longest common subsequence.
Well, as mentioned above the speed is O(N^3), i've done a longest common subsequence way that is O(m.n) where m and n are the length of str1 and str2, the result is a percentage and it seems to be exactly the same as similar_text percentage but with better performance... here's the 3 functions i'm using..
<?php
function LCS_Length($s1, $s2)
{
$m = strlen($s1);
$n = strlen($s2);
//this table will be used to compute the LCS-Length, only 128 chars per string are considered
$LCS_Length_Table = array(array(128),array(128));
//reset the 2 cols in the table
for($i=1; $i < $m; $i++) $LCS_Length_Table[$i][0]=0;
for($j=0; $j < $n; $j++) $LCS_Length_Table[0][$j]=0;
for ($i=1; $i <= $m; $i++) {
for ($j=1; $j <= $n; $j++) {
if ($s1[$i-1]==$s2[$j-1])
$LCS_Length_Table[$i][$j] = $LCS_Length_Table[$i-1][$j-1] + 1;
else if ($LCS_Length_Table[$i-1][$j] >= $LCS_Length_Table[$i][$j-1])
$LCS_Length_Table[$i][$j] = $LCS_Length_Table[$i-1][$j];
else
$LCS_Length_Table[$i][$j] = $LCS_Length_Table[$i][$j-1];
}
}
return $LCS_Length_Table[$m][$n];
}
function str_lcsfix($s)
{
$s = str_replace(" ","",$s);
$s = ereg_replace("[��������]","e", $s);
$s = ereg_replace("[������������]","a", $s);
$s = ereg_replace("[��������]","i", $s);
$s = ereg_replace("[���������]","o", $s);
$s = ereg_replace("[��������]","u", $s);
$s = ereg_replace("[�]","c", $s);
return $s;
}
function get_lcs($s1, $s2)
{
//ok, now replace all spaces with nothing
$s1 = strtolower(str_lcsfix($s1));
$s2 = strtolower(str_lcsfix($s2));
$lcs = LCS_Length($s1,$s2); //longest common sub sequence
$ms = (strlen($s1) + strlen($s2)) / 2;
return (($lcs*100)/$ms);
}
?>
you can skip calling str_lcsfix if you don't worry about accentuated characters and things like that or you can add up to it or modify it for faster performance, i think ereg is not the fastest way?
hope this helps.
Georges
[1] http://php.net/manual/de/function.similar-text.php
[2] String similarity -> Levenshtein distance
Upvotes: 0