Siyual
Siyual

Reputation: 16917

SQL Server Create Grouping For Related Records

I'm running into an interesting scenario trying to assign an arbitrary FamilyId to fields that are related to each other.

Here is the structure that we're currently working with:

DataId  OriginalDataId
3       1
4       1
5       1
6       1
3       2
4       2
5       2
6       2
7       10
8       10
9       10
11      15

What we're attempting to do is add a FamilyId column to all DataIds that have a relationship between each other.

In this case, Id's 3, 4, 5, and 6 have a relationship to 1. But 3, 4, 5, and 6 also have a relationship with 2. So 1, 2, 3, 4, 5, and 6 should all be considered to be in the same FamilyId.

7, 8, and 9 only have a relationship to 10, which puts this into a separate FamilyId. Same for 11 and 15.

What I am expecting as a result from this are the following results:

DataId  FamilyId
1       1
2       1
3       1
4       1
5       1
6       1
7       2
8       2
9       2
10      2
11      3
15      3

Sample data, structure, and queries:

Declare @Results_Stage Table
(
    DataId          BigInt Not Null,
    OriginalDataId  BigInt Null
)


Insert @Results_Stage
Values (3,1), (4,1), (5,1), (6,1), (3,2), (4,2), (5,2), (6,2), (7,10), (8, 10), (9, 10), (11, 15)


Select DataId, Row_Number() Over(Partition By DataId Order By OriginalDataId Asc) FamilyId
From   @Results_Stage       R
Union
Select OriginalDataId, Row_Number() Over(Partition By DataId Order By OriginalDataId Asc) FamilyId
From   @Results_Stage

I'm positive my attempt is nowhere near correct, but I'm honestly not sure where to even start on this -- or if it's even possible in SQL Server.

Does anyone have an idea on how to tackle this issue, or at least, something to point me in the right direction?

Edit Below is a query I've come up with so far to identify the other DataId records that should belong to the same FamilyId

Declare @DataId BigInt = 1

;With Children As
(
    Select      Distinct X.DataId
    From        @Results_Stage  S
    Outer Apply
    (
        Select  Distinct DataId
        From    @Results_Stage  R
        Where   R.OriginalDataId = S.DataId
        Or      R.OriginalDataId = S.OriginalDataId
    ) X
    Where   S.DataId = @DataId
    Or      S.OriginalDataId = @DataId
)
Select  Distinct O.OriginalDataId
From    Children    C
Outer Apply
(
    Select  S.OriginalDataId
    From    @Results_Stage  S
    Where   S.DataId = C.DataId
) O
Union 
Select  DataId
From    Children

Upvotes: 2

Views: 163

Answers (2)

leoinfo
leoinfo

Reputation: 8215

Check this ... it doesn't look too nice but is doing the job :)

DECLARE @T TABLE (DataId INT, OriginalDataId INT)
INSERT INTO @T(DataId , OriginalDataId)
          select 3,1
union all select 4,1
union all select 5,1
union all select 6,1
union all select 3,2
union all select 4,2
union all select 5,2
union all select 6,2
union all select 7,10
union all select 8,10
union all select 9,10
union all select 11,15


SELECT * FROM @T


;WITH f AS (
  SELECT DISTINCT OriginalDataId FROM @T
)
, m AS (
  SELECT DISTINCT 
    DataId , OriginalDataId = MIN(OriginalDataId) 
  FROM @T 
  GROUP BY DataId
)
, m2 AS (
  SELECT DISTINCT 
    x.DataId , x.OriginalDataId 
  FROM @T AS x
  LEFT OUTER JOIN  m ON x.DataId = m.DataId AND x.OriginalDataId = m.OriginalDataId
  WHERE m.DataId IS NULL
)
, m3 AS (
  SELECT DISTINCT DataId = x.OriginalDataId , m.OriginalDataId 
  FROM m2 AS x
  INNER JOIN m ON x.DataId = m.DataId
)
, m4 AS (
  SELECT  DISTINCT 
    DataId = OriginalDataId , OriginalDataId 
  FROM @T 
  WHERE OriginalDataId NOT IN(SELECT DataId FROM m3)

  UNION 
  SELECT DISTINCT 
    x.DataId  , f.OriginalDataId
  FROM f
  INNER JOIN m AS x on x.OriginalDataId = f.OriginalDataId
  WHERE x.DataId NOT IN(SELECT DataId FROM m3)

  UNION 
  SELECT DataId , OriginalDataId FROM m3
) 
, list AS (
  SELECT 
    x.DataId, FamilyId = DENSE_RANK() OVER(ORDER BY x.OriginalDataId ) 
  FROM m4 AS x
)
SELECT * FROM list


-- OUTPUT
DataId  FamilyId
1       1
2       1
3       1
4       1
5       1
6       1
7       2
8       2
9       2
10      2
11      3
15      3

Upvotes: 1

Giorgos Betsos
Giorgos Betsos

Reputation: 72195

The following query, which employs FOR XML PATH:

SELECT R.OriginalDataId,
          STUFF((
             SELECT ', ' + + CAST([DataId] AS VARCHAR(MAX)) 
             FROM #Results_Stage 
             WHERE (OriginalDataId = R.OriginalDataId) 
             FOR XML PATH(''),TYPE).value('(./text())[1]','VARCHAR(MAX)')
          ,1,2,'') AS GroupValues
   FROM #Results_Stage R
   GROUP BY R.OriginalDataId

can be used to produce this output:

OriginalDataId  GroupValues
===========================
1               3, 4, 5, 6
2               3, 4, 5, 6
10              7, 8, 9
15              11

Using the above result set, we can easily identify each group and thus have something upon which DENSE_RANK() can be applied:

;WITH GroupedData AS (
   SELECT R.OriginalDataId,
          STUFF((
             SELECT ', ' + + CAST([DataId] AS VARCHAR(MAX)) 
             FROM #Results_Stage 
             WHERE (OriginalDataId = R.OriginalDataId) 
             FOR XML PATH(''),TYPE).value('(./text())[1]','VARCHAR(MAX)')
          ,1,2,'') AS GroupValues
   FROM #Results_Stage R
   GROUP BY R.OriginalDataId
), Families AS (
   SELECT OriginalDataId, DENSE_RANK() OVER (ORDER BY GroupValues) AS FamilyId
   FROM GroupedData 
)
SELECT OriginalDataId AS DataId, FamilyId  
FROM Families

UNION 

SELECT DataId, F.FamilyId
FROM #Results_Stage R
INNER JOIN Families F ON R.OriginalDataId = F.OriginalDataId

ORDER BY FamilyId

Output from above is:

  DataId    FamilyId
   ===================
    11      1
    15      1
    1       2
    2       2
    3       2
    4       2
    5       2
    6       2
    7       3
    8       3
    9       3
    10      3

Upvotes: 1

Related Questions