dannier
dannier

Reputation: 41

Mark duplicates based on either one of two columns

Lets say my dataset looks like this:

email name
  a    f
  b    g
  a    g
  o    k

and my desired output would be:

email name  group 
  a    f      1
  b    g      1
  a    g      1
  o    k      2

Because the first three rows are the same person as they either share an email or lastname. I'm struggling to figure out how write such query to get the group column.

Upvotes: 0

Views: 68

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1269443

This requires a recursive CTE. You can assign a group by creating edges between emails (or names) and then traversing the graph:

with edges as (
      select t1.email as email1, t2.email as email2
      from t join
           t t2
           on t1.name = t2.name
     ),
     cte as (
      select email1, email2, least(email1, email2) as min_email
             array_construct(email1, email2) as visited
      from edges e
      union all
      select cte.email1, e.email2, least(cte.min_email, e.email2),
             array_append(cte.visited, e.email2)
      from cte join
           edges e
           on cte.email2 = e.email1
      where not array_contains(cte.visited, e.email2)
     )
select email1, min(min_email),
       dense_rank() over (order by min_email) as grp
from cte
group by email1;

A tweak on this assigns the grp to the original data:

with edges as (
      select t1.email as email1, t2.email as email2
      from t join
           t t2
           on t1.name = t2.name
     ),
     cte as (
      select email1, email2, least(email1, email2) as min_email
             array_construct(email1, email2) as visited
      from edges e
      union all
      select cte.email1, e.email2, least(cte.min_email, e.email2),
             array_append(cte.visited, e.email2)
      from cte join
           edges e
           on cte.email2 = e.email1
      where not array_contains(cte.visited, e.email2)
     )
select t.*, grp
from t join
     (select email1, min(min_email) as min_email,
             dense_rank() over (order by min_email) as grp
      from cte
      group by email1
     ) e
     on t.email = e.email;

Upvotes: 1

Related Questions