Reputation: 41
Lets say my dataset looks like this:
email name
a f
b g
a g
o k
and my desired output would be:
email name group
a f 1
b g 1
a g 1
o k 2
Because the first three rows are the same person as they either share an email or lastname. I'm struggling to figure out how write such query to get the group column.
Upvotes: 0
Views: 68
Reputation: 1269443
This requires a recursive CTE. You can assign a group by creating edges between emails (or names) and then traversing the graph:
with edges as (
select t1.email as email1, t2.email as email2
from t join
t t2
on t1.name = t2.name
),
cte as (
select email1, email2, least(email1, email2) as min_email
array_construct(email1, email2) as visited
from edges e
union all
select cte.email1, e.email2, least(cte.min_email, e.email2),
array_append(cte.visited, e.email2)
from cte join
edges e
on cte.email2 = e.email1
where not array_contains(cte.visited, e.email2)
)
select email1, min(min_email),
dense_rank() over (order by min_email) as grp
from cte
group by email1;
A tweak on this assigns the grp
to the original data:
with edges as (
select t1.email as email1, t2.email as email2
from t join
t t2
on t1.name = t2.name
),
cte as (
select email1, email2, least(email1, email2) as min_email
array_construct(email1, email2) as visited
from edges e
union all
select cte.email1, e.email2, least(cte.min_email, e.email2),
array_append(cte.visited, e.email2)
from cte join
edges e
on cte.email2 = e.email1
where not array_contains(cte.visited, e.email2)
)
select t.*, grp
from t join
(select email1, min(min_email) as min_email,
dense_rank() over (order by min_email) as grp
from cte
group by email1
) e
on t.email = e.email;
Upvotes: 1