Lukasz
Lukasz

Reputation: 2616

Joining to perform a comparison match with possible swapped data

I'm trying to do a comparison match in snowflake between the first_name, middle_name, and last_name from a target table (700K rows) and from source table (1.5M rows). The values in the first_name, middle_name, and last_name can be partially, or fully swapped in the source table, that is the value in the first_name can be swapped with the value in the last_name or any other way. For a positive probable identification at least two matches need to be correct. I then do further processing to guarantee the accuracy of the match.

Example:

case 1: first name (target) = first name (source) | middle name (target) = middle name. (MATCH)

case 2: first name (target) = last name (source) | middle name (target) = middle name (source) | last name (target) = first name (source) -> first name swapped with last lame (MATCH)

Example (with table):

Target Table
id   | first_na_t    |   middle_name_t   |     last_name_t
1    |   ffa         |       mma         |         lla    
2    |   ffa         |       mmb         |         lla
3    |   ffb         |       ffa         |         llb
4    |   ffc         |       mmd         |         lla
...

Source Table
id      | first_name_s  |   middle_name_s |     last_name_s
1       |   lla         |       mmb         |         ffa    
5       |   ffa         |       mmb         |         lla
3       |   ffb         |       ffa         |         llb
4       |   mmd         |       ffc         |         lla
...

From the example tables the attempt that I used was to compare the first row in the target table with all the other rows in the source table using a CROSS JOIN with a CASE afterwards but this is proving to be inefficient and slow.

Is there any way to JOIN this cleverly where you would compare the rows in the target table against the rows in the source table and come out with at least 2 match even though any of the name columns could be swapped. Also I can not use the id column.

Upvotes: 1

Views: 61

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1270873

You can express the logic like this:

from TARGET_TABLE T join
     SOURCE_TABLE S
     on ((case when s.FIRST_NAME_s in (t.FIRST_NAME_T, t.MIDDLE_NAME_T, t.LAST_NAME_T then 1 else 0 end) +
         (case when s.MIDDLE_NAME_s in (t.FIRST_NAME_T, t.MIDDLE_NAME_T, t.LAST_NAME_T then 1 else 0 end) +
         (case when s.LAST_NAME_s in (t.FIRST_NAME_T, t.MIDDLE_NAME_T, t.LAST_NAME_T then 1 else 0 end)
        ) >= 2

However, that might take forever to run.

Instead, let's break the names into separate rows and aggregate:

select t.id, s.id
from (target_table t cross join lateral
      (values (t.first_name_t), (t.middle_name_t), (t.last_name_t)
      ) vt(name)
     ) join
     (source_table t cross join lateral
      (values (s.first_name_s), (s.middle_name_s), (s.last_name_s)
      ) vs(name)
     )
     on vt.name = vs.name
group by t.id, s.id
having count(*) >= 2;

Note: This might get into a bit of confusion if different name components are the same within a name, but you can tweak the logic to handle that if necessary.

Upvotes: 1

Related Questions