Reputation: 463
So, I'm looking at implementing Fuzzy logic matching in my company and having trouble getting good results. For starters, I'm trying to match up Company names with those on a list supplied by other companies.
My first attempt was to use soundex, but it looks like soundex only compares the first few sounds in the company name, so longer company names were too easily confused for one another.
I'm now working on my second attempt using the levenstein distance comparison. It looks promising, especially if I remove the punctuation first. However, I'm still having trouble finding duplicates without too many false positives.
One of the issues I have is companies such as widgetsco vs widgets inc. So, if I compare the substring of the length of the shorter name, I also pickup things like BBC University and CBC University campus. I suspect that a score using a combination of distance and longest common substring may be the solution.
Has anyone managed to build an algorithm that does such a matching with limited false positives?
Upvotes: 5
Views: 3032
Reputation: 109
I found success implementing a function I found here on Stack Overflow that would find the percentage of strings that match. You can then adjust tolerance till you get an appropriate amount of matches/mismatches. The function implementation will be listed below, but the gist is including something like this in your query.
DECLARE @tolerance DEC(18, 2) = 50;
WHERE dbo.GetPercentageOfTwoStringMatching(first_table.name, second_table.name) > @tolerance
Credit for the following percent matching function goes to Dragos Durlut, Dec 15 '11. The credit for the LEVENSHTEIN function was included in the code by Dragos Durlut.
T-SQL Get percentage of character match of 2 strings
CREATE FUNCTION [dbo].[GetPercentageOfTwoStringMatching]
(
@string1 NVARCHAR(100)
,@string2 NVARCHAR(100)
)
RETURNS INT
AS
BEGIN
DECLARE @levenShteinNumber INT
DECLARE @string1Length INT = LEN(@string1)
, @string2Length INT = LEN(@string2)
DECLARE @maxLengthNumber INT = CASE WHEN @string1Length > @string2Length THEN @string1Length ELSE @string2Length END
SELECT @levenShteinNumber = [dbo].[LEVENSHTEIN] ( @string1 ,@string2)
DECLARE @percentageOfBadCharacters INT = @levenShteinNumber * 100 / @maxLengthNumber
DECLARE @percentageOfGoodCharacters INT = 100 - @percentageOfBadCharacters
-- Return the result of the function
RETURN @percentageOfGoodCharacters
END
-- =============================================
-- Create date: 2011.12.14
-- Description: http://blog.sendreallybigfiles.com/2009/06/improved-t-sql-levenshtein-distance.html
-- =============================================
CREATE FUNCTION [dbo].[LEVENSHTEIN](@left VARCHAR(100),
@right VARCHAR(100))
returns INT
AS
BEGIN
DECLARE @difference INT,
@lenRight INT,
@lenLeft INT,
@leftIndex INT,
@rightIndex INT,
@left_char CHAR(1),
@right_char CHAR(1),
@compareLength INT
SET @lenLeft = LEN(@left)
SET @lenRight = LEN(@right)
SET @difference = 0
IF @lenLeft = 0
BEGIN
SET @difference = @lenRight
GOTO done
END
IF @lenRight = 0
BEGIN
SET @difference = @lenLeft
GOTO done
END
GOTO comparison
COMPARISON:
IF ( @lenLeft >= @lenRight )
SET @compareLength = @lenLeft
ELSE
SET @compareLength = @lenRight
SET @rightIndex = 1
SET @leftIndex = 1
WHILE @leftIndex <= @compareLength
BEGIN
SET @left_char = substring(@left, @leftIndex, 1)
SET @right_char = substring(@right, @rightIndex, 1)
IF @left_char <> @right_char
BEGIN -- Would an insertion make them re-align?
IF( @left_char = substring(@right, @rightIndex + 1, 1) )
SET @rightIndex = @rightIndex + 1
-- Would an deletion make them re-align?
ELSE IF( substring(@left, @leftIndex + 1, 1) = @right_char )
SET @leftIndex = @leftIndex + 1
SET @difference = @difference + 1
END
SET @leftIndex = @leftIndex + 1
SET @rightIndex = @rightIndex + 1
END
GOTO done
DONE:
RETURN @difference
END
Note: If you need to compare two or more fields (which I don't think you do) you can add another call to the function in the WHERE clause with a minimum tolerance. I also found success averaging the percentMatching and comparing it against a tolerance.
DECLARE @tolerance DEC(18, 2) = 25;
--could have multiple different tolerances for each field (weighting some fields as more important to be matching)
DECLARE @avg_tolerance DEC(18, 2) = 50;
WHERE AND dbo.GetPercentageOfTwoStringMatching(first_table.name, second_table.name) > @tolerance
AND dbo.GetPercentageOfTwoStringMatching(first_table.address, second_table.address) > @tolerance
AND (dbo.GetPercentageOfTwoStringMatching(first_table.name, second_table.name)
+ dbo.GetPercentageOfTwoStringMatching(first_table.address, second_table.address)
) / 2 > @avg_tolerance
The benefit of this solution is the tolerance variables can be specific per field (weighting the importance of certain fields matching) and the average can insure general matching across all fields.
Upvotes: 0
Reputation: 12243
We have had good results on name and address matching using a Metaphone function created by Lawrence Philips. It works in a similar way to Soundex, but creates a sound/consonant pattern for the whole value. You may find this useful in conjunction with some other techniques, especially if you can strip some of the fluff like 'co.' and 'inc.' as mentioned in other comments:
create function [dbo].[Metaphone](@str as nvarchar(70), @KeepNumeric as bit = 0)
returns nvarchar(25)
/*
Metaphone Algorithm
Created by Lawrence Philips.
Metaphone presented in article in "Computer Language" December 1990 issue.
*********** BEGIN METAPHONE RULES ***********
Lawrence Philips' RULES follow:
The 16 consonant sounds:
|--- ZERO represents "th"
|
B X S K J T F H L M N P R 0 W Y
Drop vowels
Exceptions:
Beginning of word: "ae-", "gn", "kn-", "pn-", "wr-" ----> drop first letter
Beginning of word: "wh-" ----> change to "w"
Beginning of word: "x" ----> change to "s"
Beginning of word: vowel or "H" + vowel ----> Keep it
Transformations:
B ----> B unless at the end of word after "m", as in "dumb", "McComb"
C ----> X (sh) if "-cia-" or "-ch-"
S if "-ci-", "-ce-", or "-cy-"
SILENT if "-sci-", "-sce-", or "-scy-"
K otherwise
K "-sch-"
D ----> J if in "-dge-", "-dgy-", or "-dgi-"
T otherwise
F ----> F
G ----> SILENT if "-gh-" and not at end or before a vowel
"-gn" or "-gned"
"-dge-" etc., as in above rule
J if "gi", "ge", "gy" if not double "gg"
K otherwise
H ----> SILENT if after vowel and no vowel follows
or "-ch-", "-sh-", "-ph-", "-th-", "-gh-"
H otherwise
J ----> J
K ----> SILENT if after "c"
K otherwise
L ----> L
M ----> M
N ----> N
P ----> F if before "h"
P otherwise
Q ----> K
R ----> R
S ----> X (sh) if "sh" or "-sio-" or "-sia-"
S otherwise
T ----> X (sh) if "-tia-" or "-tio-"
0 (th) if "th"
SILENT if "-tch-"
T otherwise
V ----> F
W ----> SILENT if not followed by a vowel
W if followed by a vowel
X ----> KS
Y ----> SILENT if not followed by a vowel
Y if followed by a vowel
Z ----> S
*/
as
begin
declare @Result varchar(25)
,@str3 char(3)
,@str2 char(2)
,@str1 char(1)
,@strp char(1)
,@strLen tinyint
,@cnt tinyint
set @strLen = len(@str)
set @cnt = 0
set @Result = ''
-- Preserve first 5 numeric values when required
if @KeepNumeric = 1
begin
set @Result = case when isnumeric(substring(@str,1,1)) = 1
then case when isnumeric(substring(@str,2,1)) = 1
then case when isnumeric(substring(@str,3,1)) = 1
then case when isnumeric(substring(@str,4,1)) = 1
then case when isnumeric(substring(@str,5,1)) = 1
then left(@str,5)
else left(@str,4)
end
else left(@str,3)
end
else left(@str,2)
end
else left(@str,1)
end
else ''
end
set @str = right(@str,len(@str)-len(@Result))
end
--Process beginning exceptions
set @str2 = left(@str,2)
if @str2 = 'wh'
begin
set @str = 'w' + right(@str , @strLen - 2)
set @strLen = @strLen - 1
end
else
if @str2 in('ae', 'gn', 'kn', 'pn', 'wr')
begin
set @str = right(@str , @strLen - 1)
set @strLen = @strLen - 1
end
set @str1 = left(@str,1)
if @str1 = 'x'
set @str = 's' + right(@str , @strLen - 1)
else
if @str1 in ('a','e','i','o','u')
begin
set @str = right(@str, @strLen - 1)
set @strLen = @strLen - 1
set @Result = @Result + @str1
end
while @cnt <= @strLen
begin
set @cnt = @cnt + 1
set @str1 = substring(@str,@cnt,1)
set @strp = case when @cnt <> 0
then substring(@str,(@cnt-1),1)
else ' '
end
-- Check if the current character is the same as the previous character.
-- If we are keeping numbers, only compare non-numeric characters.
if case when @KeepNumeric = 1 and @strp = @str1 and isnumeric(@str1) = 0 then 1
when @KeepNumeric = 0 and @strp = @str1 then 1
else 0
end = 1
continue -- Skip this loop
set @str2 = substring(@str,@cnt,2)
set @Result = case when @KeepNumeric = 1 and isnumeric(@str1) = 1
then @Result + @str1
when @str1 in('f','j','l','m','n','r')
then @Result + @str1
when @str1 = 'q'
then @Result + 'k'
when @str1 = 'v'
then @Result + 'f'
when @str1 = 'x'
then @Result + 'ks'
when @str1 = 'z'
then @Result + 's'
when @str1 = 'b'
then case when @cnt = @strLen
then case when substring(@str,(@cnt - 1),1) <> 'm'
then @Result + 'b'
else @Result
end
else @Result + 'b'
end
when @str1 = 'c'
then case when @str2 = 'ch' or substring(@str,@cnt,3) = 'cia'
then @Result + 'x'
else case when @str2 in('ci','ce','cy') and @strp <> 's'
then @Result + 's'
else @Result + 'k'
end
end
when @str1 = 'd'
then case when substring(@str,@cnt,3) in ('dge','dgy','dgi')
then @Result + 'j'
else @Result + 't'
end
when @str1 = 'g'
then case when substring(@str,(@cnt - 1),3) not in ('dge','dgy','dgi','dha','dhe','dhi','dho','dhu')
then case when @str2 in('gi', 'ge','gy')
then @Result + 'j'
else case when @str2 <> 'gn' or (@str2 <> 'gh' and @cnt+1 <> @strLen)
then @Result + 'k'
else @Result
end
end
else @Result
end
when @str1 = 'h'
then case when @strp not in ('a','e','i','o','u') and @str2 not in ('ha','he','hi','ho','hu')
then case when @strp not in ('c','s','p','t','g')
then @Result + 'h'
else @Result
end
else @Result
end
when @str1 = 'k'
then case when @strp <> 'c'
then @Result + 'k'
else @Result
end
when @str1 = 'p'
then case when @str2 = 'ph'
then @Result + 'f'
else @Result + 'p'
end
when @str1 = 's'
then case when substring(@str,@cnt,3) in ('sia','sio') or @str2 = 'sh'
then @Result + 'x'
else @Result + 's'
end
when @str1 = 't'
then case when substring(@str,@cnt,3) in ('tia','tio')
then @Result + 'x'
else case when @str2 = 'th'
then @Result + '0'
else case when substring(@str,@cnt,3) <> 'tch'
then @Result + 't'
else @Result
end
end
end
when @str1 = 'w'
then case when @str2 not in('wa','we','wi','wo','wu')
then @Result + 'w'
else @Result
end
when @str1 = 'y'
then case when @str2 not in('ya','ye','yi','yo','yu')
then @Result + 'y'
else @Result
end
else @Result
end
end
return @Result
end
Upvotes: 1
Reputation: 20302
Are you using Access? If so, consider the '*' character, without the quotes. If you're using SQL Server, use the '%' character. However, this really isn't fuzzy logic, it's really the Like operator. If you really need fuzzy logic, export your data-set to Excel and load the AddIn from the URL below.
https://www.microsoft.com/en-us/download/details.aspx?id=15011
Read the instructions very carefully. It definitely works, and it works great, but you need to follow the instructions, and it's not completely intuitive. The first time I tried it, I didn't follow the instructions, and I wasted a lot of time trying to get it to work. Eventually I figured it out, and it worked great!!
Upvotes: 0