Naux Liu
Naux Liu

Reputation: 41

How to merge overlapped groups in Snowflake

I have a many-many relationship table, and I want to find the overlapped groups and merge them into one.

In the example below, user 2 is in groups 7 and 8, so groups 7 and 8 should be merged into one that contains the records 1, 2, 4. The merged group id can be either 7 or 8, it doesn't matter.

user_id group
1 7
2 7
2 8
4 8
5 9
6 9

I wish to see output like this:

user_id group
1 7
2 7
4 7
5 9
6 9

Upvotes: 0

Views: 428

Answers (3)

Pankaj
Pankaj

Reputation: 2746

This is similar to the one asked earlier, where-in grouping needs to be done to the top level in hierarchy.

The below query aggregates user_id based on group_id into array and then compares those arrays with each other. When two arrays match they both get same group id. Once arrays match and they have been assigned their parent group id based on minimum group value, we need to get the top of the hierarchy. There could also be multiple hierarchies in the data-set, so we set starting point of each hierarchy as NULL. Lastly, we use hierarchical query to get the final grouping.

with data(user_id,group_id) as (
  select * from values
  (1,7),(2,7),(2,8),(4,9),(5,9),(5,8),
  (6,9),(70,8),(21,51),(22,51),(23,52),
  (24,51),(24,52),(25,26)
),cte_1 as
  (select group_id,array_agg(user_id) arr
  from data
  group by group_id
), cte_2 as
  (select c1.group_id g1, c2.group_id g2 ,
  c1.arr arr1, c2.arr arr2,
  case when arrays_overlap(arr1, arr2) then g1 end flag,
  min(flag) over (partition by g2) grp,
  case when g2 <> grp then grp end final_grp
  from cte_1 c1, cte_1 c2
), cte_3 as 
  (select distinct g2, connect_by_root g2 as parent from cte_2
  start with final_grp is null
  connect by final_grp = prior g2
  order by g2
), cte_4 as
  (select c3.parent, c1.arr
  from cte_1 c1 left join cte_3 c3
  where c1.group_id = c3.g2
) select distinct value, parent as final_group 
  from cte_4,
  lateral flatten(input=>arr)
  order by value;
  
VALUE FINAL_GROUP
1 7
2 7
4 7
5 7
6 7
21 51
22 51
23 51
24 51
25 26
70 7

Adding another query, that is simpler.

with data(user_id,group_id) as (
  select * from values
  (1,7),(2,7),(2,8),(4,9),(5,9),(5,8),
  (6,9),(70,8),(21,51),(22,51),(22,52),
  (22,53),(23,52),(25,26)
), cte_1 as
  (select a.group_id grp1, b.group_id grp2
  from data a, data b
  where a.user_id = b.user_id
  and a.group_id < b.group_id
), cte_2 as
  (select grp2, connect_by_root grp1 as parent 
  from cte_1
  start with grp1 not in (select grp2 from cte_1)
  connect by grp1 = prior grp2
) select a.user_id, 
  coalesce(b.parent, a.group_id) final_grp
  from data a left join cte_2 b 
  on a.group_id = b.grp2;

Upvotes: 2

Naux Liu
Naux Liu

Reputation: 41

Answering my own question here, below is the SQL I built that fits my needs. This is inspired by @pankaj 's answer.

with data(user_id,group_id) as (
  select * from values
    (1,7),(2,7),(2,8),(4,9),(5,9),(5,8),
    (6,9),(70,8),(21,51),(22,51),(23,52),
    (24,51),(24,52),(25,26)
), group_members as (
    select 
        group_id, array_agg(user_id) users
    from data
    group by group_id
), overlapped_group as (
    select 
        c1.group_id g1,
        c2.group_id g2,
        -- c1.users,
        -- c2.users,
        least(g1, coalesce(g2, g1)) as min_group,
        min(min_group) over (partition by g2) as merge_to
    from group_members c1
    left join 
        group_members c2 on arrays_overlap(c1.users, c2.users) 
        and g1 <> g2
), merge_mapping as (
    select distinct
        g1 as group_id,
        iff(g2 is null, g1, min(merge_to) over (partition by g1)) as merge_to
    from overlapped_group
)

select 
    user_id,
    m.merge_to as group_id
from data 
left join merge_mapping m using(group_id);

Upvotes: 2

psabela
psabela

Reputation: 384

One way:

select user_id, STRTOK(listagg(group, ', ') within group (ORDER BY user_id ),',',1)
from <table>
GROUP BY user_id ORDER BY user_id;

Upvotes: 0

Related Questions