Reputation: 3
I am learning SQL and need to write a code to identify potential duplicates in the column 'party_name' in the image. The matches not necessarily be 100%, I need to identify any % of matches between two or more entries in the party name. Can anyone please guide?
I found SOUNDEX() but it does not work .. or maybe I am not doing it correct.
Upvotes: 0
Views: 242
Reputation: 2853
Providing your demo data as an object is really helpful, especially when you have long strings.
DECLARE @table TABLE (PartyID NVARCHAR(10), Party_Name NVARCHAR(50))
INSERT INTO @table (PartyID, Party_Name) VALUES
('K0164', 'Bank of China (Hong Kong) Ltd.'),
('GP' , 'EFG Private Bank'),
('R0054', 'Star Magnolia Capital Limited'),
('E0054', 'Alternative Capital Investments SPC, Ltd.'),
('E0045', 'Alternativa Capital Ltd')
So it's important to note here that this will likely not be super performant, but as an exercise in "can I do that" it was kinda neat.
What we're doing here is breaking the strings down to single words (by splitting on the white space) and then using SOUNDEX to give them a value. Using that we can look for matches between the words on other PartyIDs. From this we can figure out a total of words that match and use that to determine a percentage of match.
In the case of 'Alternativa Capital Ltd'
compared to 'Alternative Capital Investments SPC, Ltd.'
we found all three close-enough matches so it's 100%.
;WITH matchEmUp AS (
SELECT *, SOUNDEX(value) AS SndEx, ROW_NUMBER() OVER (PARTITION BY PartyID ORDER BY SOUNDEX(value)) AS WordID, COUNT(PartyID) OVER (PARTITION BY PartyID ORDER BY (SELECT 1)) AS Cnt
FROM @table
CROSS APPLY STRING_SPLIT(Party_name, ' ')
), matchy AS (
SELECT a.PartyID AS ThisPartyID, b.PartyID AS OtherPartyID, COUNT(b.PartyID) OVER (PARTITION BY a.PartyID, B.PartyID) AS OtherCnt
FROM matchEmUp a
LEFT OUTER JOIN matchEmUp b
ON a.PartyID <> b.PartyID
AND a.SndEx = b.SndEx
)
SELECT ThisPartyID, OtherPartyID, MAX(Cnt+.0) AS ThisCnt, MAX(OtherCnt+.0) AS OtherCnt, MAX(OtherCnt+.0) / MAX(Cnt+.0) AS MatchPct
FROM matchy my
INNER JOIN matchEmUp meu
ON my.ThisPartyID = meu.PartyID
WHERE OtherPartyID IS NOT NULL
GROUP BY ThisPartyID, OtherPartyID
ThisPartyID OtherPartyID ThisCnt OtherCnt MatchPct
----------------------------------------------------------------
K0164 E0045 6.0 1.0 0.16666666666666
E0054 R0054 5.0 1.0 0.20000000000000
E0045 E0054 3.0 3.0 1.00000000000000
R0054 E0045 4.0 1.0 0.25000000000000
E0045 K0164 3.0 1.0 0.33333333333333
E0054 E0045 5.0 3.0 0.60000000000000
E0045 R0054 3.0 1.0 0.33333333333333
GP K0164 3.0 1.0 0.33333333333333
K0164 E0054 6.0 1.0 0.16666666666666
K0164 GP 6.0 1.0 0.16666666666666
R0054 E0054 4.0 1.0 0.25000000000000
E0054 K0164 5.0 1.0 0.20000000000000
Now you have to decide if the very fuzzy soundex matching is close enough for your needs or not.
Upvotes: 1