user3897860
user3897860

Reputation: 21

Find overlapping sets of data in a table

I need to identify duplicate sets of data and give those sets who's data is similar a group id.

id     threshold     cost
--     ----------    ----------
1      0             9
1      100           7
1      500           6
2      0             9
2      100           7
2      500           6

I have thousands of these sets, most are the same with different id's. I need find all the like sets that have the same thresholds and cost amounts and give them a group id. I'm just not sure where to begin. Is the best way to iterate and insert each set into a table and then each iterate through each set in the table to find what already exists?

Upvotes: 1

Views: 147

Answers (2)

Gordon Linoff
Gordon Linoff

Reputation: 1269873

This is one of those cases where you can try to do something with relational operators. Or, you can just say: "let's put all the information in a string and use that as the group id". SQL Server seems to discourage this approach, but it is possible. So, let's characterize the groups using:

select d.id,
       (select cast(threshold as varchar(8000)) + '-' + cast(cost as varchar(8000)) + ';'
        from data d2
        where d2.id = d.id
        for xml path ('')
        order by threshold
       ) as groupname
from data d
group by d.id;

Oh, I think that solves your problem. The groupname can serve as the group id. If you want a numeric id (which is probably a good idea, use dense_rank():

select d.id, dense_rank() over (order by groupname) as groupid
from (select d.id,
             (select cast(threshold as varchar(8000)) + '-' + cast(cost as varchar(8000)) + ';'
              from data d2
              where d2.id = d.id
              for xml path ('')
              order by threshold
             ) as groupname
      from data d
      group by d.id
     ) d;

Upvotes: 1

Stefan Steiger
Stefan Steiger

Reputation: 82186

Here's the solution to my interpretation of the question:

IF OBJECT_ID('tempdb..#tempGrouping') IS NOT NULL DROP Table #tempGrouping;


;
WITH BaseTable AS 
(
              SELECT 1 id, 0 as threshold, 9 as cost

        UNION SELECT 1, 100, 7

        UNION SELECT 1, 500, 6

        UNION SELECT 2, 0, 9

        UNION SELECT 2, 100, 7

        UNION SELECT 2, 500, 6

        UNION SELECT 3, 1, 9

        UNION SELECT 3, 100, 7

        UNION SELECT 3, 500, 6
)

, BaseCTE AS 
(

    SELECT 
        id
        --,dense_rank() over (order by threshold, cost ) as GroupId  
        ,
        (
            SELECT CAST(TblGrouping.threshold AS varchar(8000)) + '/' + CAST(TblGrouping.cost AS varchar(8000)) + ';'
            FROM BaseTable AS TblGrouping 
            WHERE TblGrouping.id = BaseTable.id
            ORDER BY TblGrouping.threshold, TblGrouping.cost
            FOR XML PATH ('')
       ) AS MultiGroup 

    FROM BaseTable 

    GROUP BY id 
) 
,
CTE AS 
(
    SELECT 
         * 
        ,DENSE_RANK() OVER (ORDER BY MultiGroup) AS GroupId  
    FROM BaseCTE 
)
SELECT * 
INTO #tempGrouping
FROM CTE  



-- SELECT * FROM #tempGrouping; 


UPDATE BaseTable 
    SET BaseTable.GroupId = #tempGrouping.GroupId 
FROM BaseTable 

INNER JOIN #tempGrouping 
    ON BaseTable.Id = #tempGrouping.Id 


IF OBJECT_ID('tempdb..#tempGrouping') IS NOT NULL DROP Table #tempGrouping;

Where BaseTable is your table, and and you don't need the CTE "BaseTable", because you have a data table.
You may need to take extra-precautions if your threshold and cost fields can be NULL.

Upvotes: 0

Related Questions