Reputation: 1181
There is a fact table about shopping with below attributes.
CustomerId | ProductId |
---|---|
C1 | P1 |
C1 | P3 |
C2 | P2 |
C2 | P3 |
C3 | P4 |
C4 | P2 |
C5 | P4 |
C5 | P6 |
C6 | P6 |
It describes which Customer purchased what product.
Now, I want to build a cluster with same interest.
For e.g.
C1 bought P1 & P3.
P3 bought by C2 as well, so C1 & C2 have common interest because both bought P3.
Now C2 also bought P2 and P2 bought by C4 as well.
So C2 and C4 also have common interest because both bought P2.
Thus, C1 is connected to C2 and C2 connect to C4.
hence C1, C2 & C4 all together forms a network.
I want an output like this, Where NetworkId should be an unique id for each network.
NetworkId | CustomerId |
---|---|
N1 | C1 |
N1 | C2 |
N1 | C4 |
N2 | C3 |
N2 | C5 |
N2 | C6 |
It seems like a graph problem but I am trying to solve it using BigQuery (SQL), any suggestion would be appreciated.
Thanks in advance.
Upvotes: 0
Views: 107
Reputation: 172974
Consider below approach
with recursive init as (
select distinct least(t1.CustomerId, t2.CustomerId) id1, greatest(t1.CustomerId, t2.CustomerId) id2
from your_table t1
join your_table t2
on t1.ProductId = t2.ProductId
and t1.CustomerId != t2.CustomerId
), iterations as (
select id1 networkId, id1, [id1] net from init where id1 not in (select distinct id2 from init)
union all
select networkId, id2, net || [id2]
from iterations a
join init b
using(id1)
)
select row_number() over() networkId, array (
select distinct id
from t.net id
) CustomerId
from (
select networkId, array_concat_agg(net) net
from iterations
group by networkId
) t
if applied to sample data in your question - output is
Upvotes: 1