DataNovice
DataNovice

Reputation: 149

SQL Server Record Linkage After String Matching

Background - I have a set of customer data and used a string matching algorithm to compare all records for similarity. I then need to group the results which relate to each other either directly or by association, and apply a unique ID for each group.

Problem - I can't think of a way to link records together and apply a unique ID for each group

Example

Data currently looks like this for matches which have been found (MatchScore is not relevant to the problem here but to demonstrate where the data has come from).

+-------------+-------------+------------+
| CustomerID1 | CustomerID2 | MatchScore |
+-------------+-------------+------------+
|     2021000 |     2707799 | 0.075      |
|     2021000 |     3856308 | 0.082      |
|      774062 |      774063 | 0.041      |
|      998328 |     2278386 | 0.063      |
|      998328 |      998329 | 0.058      |
|      998329 |     2278386 | 0.030      |
+-------------+-------------+------------+

The bottom 3 records are all linked, therefore I want them to have the same ID associated.

visual image of these records all being related

This is what I want the data to look like

+----+-------------+-------------+------------+
| ID | CustomerID1 | CustomerID2 | MatchScore |
+----+-------------+-------------+------------+
|  1 |      998328 |     2278386 | 0.063      |
|  1 |      998328 |      998329 | 0.058      |
|  1 |      998329 |     2278386 | 0.030      |
|  2 |     2021000 |     2707799 | 0.075      |
|  2 |     2021000 |     3856308 | 0.082      |
|  3 |      774062 |      774063 | 0.041      |
+----+-------------+-------------+------------+

or similarly

+----+------------+
| ID | CustomerID |
+----+------------+
|  1 |    2278386 |
|  1 |     998328 |
|  1 |     998329 |
|  2 |    2021000 |
|  2 |    2707799 |
|  2 |    3856308 |
|  3 |     774062 |
|  3 |     774063 |
+----+------------+

code to generate the example table

select '998328' as CustomerID1,'998329' as CustomerID2,'0.058' as MatchScore
into #tmp
union
select '998328' as CustomerID1,'2278386' as CustomerID2,'0.063' as MatchScore
union
select '998329' as CustomerID1,'2278386' as CustomerID2,'0.030' as MatchScore
union
select '2021000' as CustomerID1,'2707799' as CustomerID2,'0.075' as MatchScore
union
select '2021000' as CustomerID1,'3856308' as CustomerID2,'0.082' as MatchScore
union
select '774062' as CustomerID1,'774063' as CustomerID2,'0.041' as MatchScore

select * from #tmp

As I say I cant think how to link the records together, I have tried all sorts of joins but the eureka moment never comes. Please can you help.

Thanks

Upvotes: 5

Views: 875

Answers (1)

Viki888
Viki888

Reputation: 2774

I am not sure this is the result you are expecting,

with tmp as(
select '998328' as CustomerID1,'998329' as CustomerID2,'0.058' as MatchScore
union
select '998328' as CustomerID1,'2278386' as CustomerID2,'0.063' as MatchScore
union
select '998329' as CustomerID1,'2278386' as CustomerID2,'0.030' as MatchScore
union
select '2021000' as CustomerID1,'2707799' as CustomerID2,'0.075' as MatchScore
union
select '2021000' as CustomerID1,'3856308' as CustomerID2,'0.082' as MatchScore
union
select '774062' as CustomerID1,'774063' as CustomerID2,'0.041' as MatchScore
union
select '774063' as CustomerID1,'774062' as CustomerID2,'0.041' as MatchScore
union
select '774063' as CustomerID1,'774063' as CustomerID2,'0.041' as MatchScore)


select DENSE_RANK() OVER(ORDER BY rank_value) id, t1.CustomerID1, t1.CustomerID2
from(
    select 
        t1.*, 
        case 
            when t2.CustomerID1 IS NOT NULL 
                THEN t2.CustomerID1 
            ELSE t3.CustomerID1 
        end rank_value

    from tmp t1
    left join tmp t2 
    on (t1.CustomerID1 = t2.CustomerID2 
            and t1.CustomerID2!=t2.CustomerID1 
            and (t1.CustomerID1 != t1.CustomerID2 and t2.CustomerID1 != t2.CustomerID2))
       or (t1.CustomerID1 = t2.CustomerID1 
             and t1.CustomerID2 != t2.CustomerID2 
             and (t1.CustomerID1 != t1.CustomerID2)) 
    left join tmp t3 
        on t1.CustomerID1 = t3.CustomerID2 
            and t1.CustomerID2=t3.CustomerID1
)t1

I am getting the below result

enter image description here

Note: DENSE_RANK() function is available from version 2012

Upvotes: 2

Related Questions