Dave
Dave

Reputation: 49

Transitive matching in SQL

I'm working on a requirement where I need to match a set of records in a group (G1) on some fields, and re-group the matching records into unique new groups (NG1, NG2...). The requirement is something like below:

Sample Data

DECLARE @table TABLE ([Group] varchar(3), Member varchar(3), Address varchar(3), Phone varchar(3), Email varchar(3)) 

insert @table values
('G1', 'M1', 'A1', 'P1', 'E1'),
('G1', 'M2', 'A2', 'P2', 'E2'),
('G1', 'M3', 'A1', 'P3', 'E1'),
('G1', 'M4', 'A4', 'P3', 'E4'),
('G1', 'M5', 'A5', 'P5', 'E2'),
('G1', 'M6', 'A6', 'P6', 'E6'),
('G1', 'M7', 'A7', 'P6', 'E7'),
('G1', 'M8', 'A8', 'P8', 'E4'),
('G1', 'M9', 'A9', 'P9', 'E7'),
('G1', 'M10', 'A10', 'P10', 'E10')

In the attached sample data, M1, M3, M4, and M8 should come into same group as M1, M3 matches on Address and Email; M3 in turn matches with M4 on Phone; which in turn matches with M8 on Email. ie, they are related by one or many of the attributes.

Likewise, M6,M7, and M9 should be in another unique group ; and M2,M5 in same group (Email match).

M10 alone will be in a group as it doesn't have any matching records.

Like G1, there would be different main groups.

Can anyone please help? NOTE: I'm using MS SQL Server

Upvotes: 0

Views: 377

Answers (2)

Daniel Brughera
Daniel Brughera

Reputation: 1651

It took me 3 CTEs and a couple of cups of coffee but here you have it... My primary concern is that I read this from the comments

It's a repeatable task. There will be several groups and we will have to do it for each group. The total record count across all groups could be millions.

This cannot be a repeatable task as the resources consumption will be high, I recommend you to use it to normalize your groups once and add the logic in your application or stored procedures to store the new data with the desired groups

DECLARE @table TABLE (id int not null identity, [Group] varchar(3), Member varchar(3), Address varchar(3), Phone varchar(3), Email varchar(3)) 

insert @table values
('G1', 'M1', 'A1', 'P1', 'E1'),
('G1', 'M2', 'A2', 'P2', 'E2'),
('G1', 'M3', 'A1', 'P3', 'E1'),
('G1', 'M4', 'A4', 'P3', 'E4'),
('G1', 'M5', 'A5', 'P5', 'E2'),
('G1', 'M6', 'A6', 'P6', 'E6'),
('G1', 'M7', 'A7', 'P6', 'E7'),
('G1', 'M8', 'A8', 'P8', 'E4'),
('G1', 'M9', 'A9', 'P9', 'E7'),
('G1', 'M10', 'A10', 'P10', 'E10');

with 
/* Find all matches
id  Member  MatchWith
1   M1  M3
2   M2  M5
3   M3  M1
3   M3  M4 ...
*/
matches as (
    SELECT t.id, t.[Group], t.Member, a.member as MatchWith
    from 
    @table t
    outer apply (
        select distinct member 
        from @table 
        where member <> t.member and [group] = t.[group] and (Address = t.Address OR Phone = t.Phone OR Email = t.Email)
    ) a
)
/* Stuffing the matches per member
id  Member  AllMatches
1   M1  M1,M3
2   M2  M2,M5
3   M3  M1,M3,M4 .....
*/
, matchsummary as (
    SELECT DISTINCT id, [Group], Member, STUFF((
                SELECT ',' + Member FROM (
                SELECT m.Member
                UNION ALL
                SELECT DISTINCT MatchWith
                FROM matches
                WHERE Member = m.Member) U
                ORDER BY Member
                FOR XML PATH('')
                ), 1, 1, '') as AllMatches
    FROM matches m
)
/* Recursive CTE to find "cousins" records (M1, M3 matches on Address and Email; M3 in turn matches with M4 on Phone)
id  Member  AllMatches  gr
1   M1  M1,M3   1
2   M2  M2,M5   2
3   M3  M1,M3,M4    1
4   M4  M3,M4,M8    1
*/
, tree as (
    select *, ROW_NUMBER() over (order by id) as gr
    from matchsummary where AllMatches LIKE member+'%'
    /* The groups are created using the Members who are the first one in their matches 
    id  Member  AllMatches  gr
    1   M1  M1,M3   1
    2   M2  M2,M5   2
    6   M6  M6,M7   3
    10  M10 M10 4
    */
    union all
    select s.*, t.gr 
    from matchsummary s
    join tree t on s.Member <> t.Member and s.[Group] = t.[Group] and s.AllMatches NOT LIKE s.member+'%' and t.AllMatches like '%' + s.Member
)
select * from tree
order by id
option(maxrecursion 0)

Output:

ID Group Member NewGroup

1 G1 M1 1

2 G1 M2 2

3 G1 M3 1

4 G1 M4 1

5 G1 M5 2

6 G1 M6 3

7 G1 M7 3

8 G1 M8 1

9 G1 M9 3

10 G1 M10 4

Second Option

Given the size of your table I recommend you to use this, I am not a big fan of loops but here I think they worth it, that way you don't need to process all your data at once,

First, you need to add a new column on your table to store the new group, my first thought was that would be better to change your application's logic to calculate that group when a new record is inserted, but thinking it better, an insert can cause several groups becoming one, and you probably need fast response in your application. So, you can set a job to regroup your data as often as you need it, if you have an UpdatedDate field in your table you also can refine this solution using a Log table and reprocess only groups that were modified after their last execution.

 IF OBJECT_ID('tempdb..#table') IS NOT NULL
    DROP TABLE #table;
CREATE TABLE #table ([Group] varchar(3), Member varchar(3), Address varchar(3), Phone varchar(3), Email varchar(3)) 

INSERT #table ([Group], Member, Address, Phone, Email)
VALUES
('G1', 'M1', 'A1', 'P1', 'E1'),
('G1', 'M2', 'A2', 'P2', 'E2'),
('G1', 'M3', 'A1', 'P3', 'E1'),
('G1', 'M4', 'A4', 'P3', 'E4'),
('G1', 'M5', 'A5', 'P5', 'E2'),
('G1', 'M6', 'A6', 'P6', 'E6'),
('G1', 'M7', 'A7', 'P6', 'E7'),
('G1', 'M8', 'A8', 'P8', 'E4'),
('G1', 'M9', 'A9', 'P9', 'E7'),
('G1', 'M10', 'A10', 'P10', 'E10');

ALTER TABLE #table ADD newGroup INT

/******************************************************************
START HERE
******************************************************************/

IF OBJECT_ID('tempdb..#Groups') IS NOT NULL
    DROP TABLE #Groups;

SELECT DISTINCT [Group] INTO #Groups FROM #table

DECLARE @Group VARCHAR(3)

WHILE EXISTS (SELECT 1 FROM #Groups)
BEGIN

    SELECT TOP 1 @Group = [Group] FROM #Groups

    UPDATE #table SET newGroup = NULL 
    WHERE [Group] = @Group

    DECLARE @newGroup INT = 1
    DECLARE @member varchar(3)

    WHILE EXISTS (SELECT 1 FROM #table WHERE [Group] = @Group AND newGroup IS NULL)
    BEGIN

        SELECT TOP 1 @member = member FROM #table WHERE [group] = @group AND newGroup IS NULL
    
        UPDATE #table SET newGroup = @newGroup
        WHERE Member = @member

        WHILE @@ROWCOUNT > 0
        BEGIN
            UPDATE T
            SET newGroup = @newGroup
            FROM #table T
            WHERE [Group] = @group AND newGroup IS NULL
            AND EXISTS (
                SELECT 1 FROM #table 
                WHERE newGroup = @newGroup
                AND (Address = t.Address OR Phone = t.Phone OR Email = t.Email)
            )
        END
        SET @newGroup += 1
    END
    DELETE #Groups WHERE [Group] = @Group
END

SELECT * FROM #table

Upvotes: 0

Bart Hofland
Bart Hofland

Reputation: 3905

In Microsoft SQL Server I would do the following, assuming that the data is in the table called "DataTable":

WITH
    [Matches] AS
    (
        SELECT
            D1.[Group],
            D1.[Member],
            D2.[Member] AS [PreviousMatchingMember]
        FROM
            [DataTable] AS D1
            OUTER APPLY (SELECT TOP (1) [Member]
                         FROM [DataTable]
                         WHERE
                             [Group] = D1.[Group] AND
                             [Member] < D1.[Member] AND
                             ([Address] = D1.[Address] OR
                              [Phone] = D1.[Phone] OR
                              [Email] = D1.[Email])
                         ORDER BY
                             [Member]) AS D2
    ),
    [Groups] AS
    (
        SELECT
            [Group],
            [Member],
            [PreviousMatchingMember],
            'NG' + LTRIM(ROW_NUMBER() OVER (ORDER BY [Group], [Member])) AS [NewGroup]
        FROM
            [Matches]
        WHERE
            [PreviousMatchingMember] IS NULL
    UNION ALL
        SELECT
            M.[Group],
            M.[Member],
            M.[PreviousMatchingMember],
            G.[NewGroup]
        FROM
            [Groups] AS G
            INNER JOIN [Matches] AS M ON
                M.[Group] = G.[Group] AND
                M.[PreviousMatchingMember] = G.[Member]
    )
SELECT
    G.[NewGroup],
    G.[Member],
    D.[Address],
    D.[Phone],
    D.[Email]
FROM
    [Groups] AS G
    INNER JOIN [DataTable] AS D ON
        D.[Group] = G.[Group] AND
        D.[Member] = G.[Member]
ORDER BY
    G.[NewGroup],
    G.[Member];

Edit:

As APC pointed out in his comment to your question, you have a (huge) problem if a record refers multiple other records (using different address/phone/email fields). You might end up having records that would potentially belong to different groups. You might decide that those groups should be considered as one group, but my solution here is not fit to solve such a complex problem.

Upvotes: 1

Related Questions