esinmal
esinmal

Reputation: 3

How to determine the percentage of match between rows of the same column

enter image description here

I am learning SQL and need to write a code to identify potential duplicates in the column 'party_name' in the image. The matches not necessarily be 100%, I need to identify any % of matches between two or more entries in the party name. Can anyone please guide?

I found SOUNDEX() but it does not work .. or maybe I am not doing it correct.

Upvotes: 0

Views: 242

Answers (1)

Patrick Hurst
Patrick Hurst

Reputation: 2853

Providing your demo data as an object is really helpful, especially when you have long strings.

DECLARE @table TABLE (PartyID NVARCHAR(10), Party_Name NVARCHAR(50))
INSERT INTO @table (PartyID, Party_Name) VALUES
('K0164', 'Bank of China (Hong Kong) Ltd.'),
('GP'   , 'EFG Private Bank'),
('R0054', 'Star Magnolia Capital Limited'),
('E0054', 'Alternative Capital Investments SPC, Ltd.'),
('E0045', 'Alternativa Capital Ltd')

So it's important to note here that this will likely not be super performant, but as an exercise in "can I do that" it was kinda neat.

What we're doing here is breaking the strings down to single words (by splitting on the white space) and then using SOUNDEX to give them a value. Using that we can look for matches between the words on other PartyIDs. From this we can figure out a total of words that match and use that to determine a percentage of match. In the case of 'Alternativa Capital Ltd' compared to 'Alternative Capital Investments SPC, Ltd.' we found all three close-enough matches so it's 100%.

;WITH matchEmUp AS (
SELECT *, SOUNDEX(value) AS SndEx, ROW_NUMBER() OVER (PARTITION BY PartyID ORDER BY SOUNDEX(value)) AS WordID, COUNT(PartyID) OVER (PARTITION BY PartyID ORDER BY (SELECT 1)) AS Cnt
  FROM @table 
  CROSS APPLY STRING_SPLIT(Party_name, ' ')
), matchy AS (
SELECT a.PartyID AS ThisPartyID, b.PartyID AS OtherPartyID, COUNT(b.PartyID) OVER (PARTITION BY a.PartyID, B.PartyID) AS OtherCnt
  FROM matchEmUp a
    LEFT OUTER JOIN matchEmUp b
      ON a.PartyID <> b.PartyID
      AND a.SndEx = b.SndEx
)

SELECT ThisPartyID, OtherPartyID, MAX(Cnt+.0) AS ThisCnt, MAX(OtherCnt+.0) AS OtherCnt, MAX(OtherCnt+.0) / MAX(Cnt+.0) AS MatchPct
  FROM matchy my
    INNER JOIN matchEmUp meu
      ON my.ThisPartyID = meu.PartyID
 WHERE OtherPartyID IS NOT NULL
 GROUP BY ThisPartyID, OtherPartyID
ThisPartyID OtherPartyID    ThisCnt OtherCnt    MatchPct
----------------------------------------------------------------
K0164       E0045           6.0     1.0         0.16666666666666
E0054       R0054           5.0     1.0         0.20000000000000
E0045       E0054           3.0     3.0         1.00000000000000
R0054       E0045           4.0     1.0         0.25000000000000
E0045       K0164           3.0     1.0         0.33333333333333
E0054       E0045           5.0     3.0         0.60000000000000
E0045       R0054           3.0     1.0         0.33333333333333
GP          K0164           3.0     1.0         0.33333333333333
K0164       E0054           6.0     1.0         0.16666666666666
K0164       GP              6.0     1.0         0.16666666666666
R0054       E0054           4.0     1.0         0.25000000000000
E0054       K0164           5.0     1.0         0.20000000000000

Now you have to decide if the very fuzzy soundex matching is close enough for your needs or not.

Upvotes: 1

Related Questions