Reputation: 149
Background - I have a set of customer data and used a string matching algorithm to compare all records for similarity. I then need to group the results which relate to each other either directly or by association, and apply a unique ID for each group.
Problem - I can't think of a way to link records together and apply a unique ID for each group
Example
Data currently looks like this for matches which have been found (MatchScore is not relevant to the problem here but to demonstrate where the data has come from).
+-------------+-------------+------------+
| CustomerID1 | CustomerID2 | MatchScore |
+-------------+-------------+------------+
| 2021000 | 2707799 | 0.075 |
| 2021000 | 3856308 | 0.082 |
| 774062 | 774063 | 0.041 |
| 998328 | 2278386 | 0.063 |
| 998328 | 998329 | 0.058 |
| 998329 | 2278386 | 0.030 |
+-------------+-------------+------------+
The bottom 3 records are all linked, therefore I want them to have the same ID associated.
visual image of these records all being related
This is what I want the data to look like
+----+-------------+-------------+------------+
| ID | CustomerID1 | CustomerID2 | MatchScore |
+----+-------------+-------------+------------+
| 1 | 998328 | 2278386 | 0.063 |
| 1 | 998328 | 998329 | 0.058 |
| 1 | 998329 | 2278386 | 0.030 |
| 2 | 2021000 | 2707799 | 0.075 |
| 2 | 2021000 | 3856308 | 0.082 |
| 3 | 774062 | 774063 | 0.041 |
+----+-------------+-------------+------------+
or similarly
+----+------------+
| ID | CustomerID |
+----+------------+
| 1 | 2278386 |
| 1 | 998328 |
| 1 | 998329 |
| 2 | 2021000 |
| 2 | 2707799 |
| 2 | 3856308 |
| 3 | 774062 |
| 3 | 774063 |
+----+------------+
code to generate the example table
select '998328' as CustomerID1,'998329' as CustomerID2,'0.058' as MatchScore
into #tmp
union
select '998328' as CustomerID1,'2278386' as CustomerID2,'0.063' as MatchScore
union
select '998329' as CustomerID1,'2278386' as CustomerID2,'0.030' as MatchScore
union
select '2021000' as CustomerID1,'2707799' as CustomerID2,'0.075' as MatchScore
union
select '2021000' as CustomerID1,'3856308' as CustomerID2,'0.082' as MatchScore
union
select '774062' as CustomerID1,'774063' as CustomerID2,'0.041' as MatchScore
select * from #tmp
As I say I cant think how to link the records together, I have tried all sorts of joins but the eureka moment never comes. Please can you help.
Thanks
Upvotes: 5
Views: 875
Reputation: 2774
I am not sure this is the result you are expecting,
with tmp as(
select '998328' as CustomerID1,'998329' as CustomerID2,'0.058' as MatchScore
union
select '998328' as CustomerID1,'2278386' as CustomerID2,'0.063' as MatchScore
union
select '998329' as CustomerID1,'2278386' as CustomerID2,'0.030' as MatchScore
union
select '2021000' as CustomerID1,'2707799' as CustomerID2,'0.075' as MatchScore
union
select '2021000' as CustomerID1,'3856308' as CustomerID2,'0.082' as MatchScore
union
select '774062' as CustomerID1,'774063' as CustomerID2,'0.041' as MatchScore
union
select '774063' as CustomerID1,'774062' as CustomerID2,'0.041' as MatchScore
union
select '774063' as CustomerID1,'774063' as CustomerID2,'0.041' as MatchScore)
select DENSE_RANK() OVER(ORDER BY rank_value) id, t1.CustomerID1, t1.CustomerID2
from(
select
t1.*,
case
when t2.CustomerID1 IS NOT NULL
THEN t2.CustomerID1
ELSE t3.CustomerID1
end rank_value
from tmp t1
left join tmp t2
on (t1.CustomerID1 = t2.CustomerID2
and t1.CustomerID2!=t2.CustomerID1
and (t1.CustomerID1 != t1.CustomerID2 and t2.CustomerID1 != t2.CustomerID2))
or (t1.CustomerID1 = t2.CustomerID1
and t1.CustomerID2 != t2.CustomerID2
and (t1.CustomerID1 != t1.CustomerID2))
left join tmp t3
on t1.CustomerID1 = t3.CustomerID2
and t1.CustomerID2=t3.CustomerID1
)t1
I am getting the below result
Note: DENSE_RANK()
function is available from version 2012
Upvotes: 2