Steven
Steven

Reputation: 3294

array_agg values with jarow-winkler distance >= N

I have a table that contains many (14000+) different customer names, many of which could have multiple spellings for the same customer, e.g.,

with data(names) as (
    select * from values
        ('Another Compnay Name'), 
        ('another comp name'), 
        ('ANOTHER COMPANY name, LLC')
)

I'd like to combine similar names based on Jaro-Winkler distance. I can do this by handling each customer name and matching based on J-W distance, then collapsing into some arrays:

similars as (
    select 
    distinct a.names as custname1, b.names custname2
    from data a, data b
    where jarowinkler_similarity(regexp_replace(lower(trim(a.names)), '[^a-z0-9]'), regexp_replace(lower(trim(b.names)),'[^a-z0-9]')) >= 95
    order by 1
), 

tmp as (
    select 
    distinct array_sort(similar_cust_names) names
    from (
        select
        distinct custname1, array_distinct(array_append(array_agg(custname2) within group(order by custname2), custname1)) similar_cust_names
        from similars
        group by 1
    )
)

select
*
from tmp

This gets me three distinct arrays, one of which has all of the values I need in a single array. However, I still have two additional arrays that I no longer need, given all the relevant information is contained in a single array.

enter image description here

How can I programmatically combine all values into one array while dropping any duplicate or near-duplicate entries? Note that the JW-distance between ANOTHER COMPANY name, LLC and another comp name is < 95 but because each has a JW-distance of >=95 to Another Compnay Name, they should appear in the same array.

Upvotes: 0

Views: 25

Answers (0)

Related Questions