Reputation: 3294
I have a table that contains many (14000+) different customer names, many of which could have multiple spellings for the same customer, e.g.,
with data(names) as (
select * from values
('Another Compnay Name'),
('another comp name'),
('ANOTHER COMPANY name, LLC')
)
I'd like to combine similar names based on Jaro-Winkler distance. I can do this by handling each customer name and matching based on J-W distance, then collapsing into some arrays:
similars as (
select
distinct a.names as custname1, b.names custname2
from data a, data b
where jarowinkler_similarity(regexp_replace(lower(trim(a.names)), '[^a-z0-9]'), regexp_replace(lower(trim(b.names)),'[^a-z0-9]')) >= 95
order by 1
),
tmp as (
select
distinct array_sort(similar_cust_names) names
from (
select
distinct custname1, array_distinct(array_append(array_agg(custname2) within group(order by custname2), custname1)) similar_cust_names
from similars
group by 1
)
)
select
*
from tmp
This gets me three distinct arrays, one of which has all of the values I need in a single array. However, I still have two additional arrays that I no longer need, given all the relevant information is contained in a single array.
How can I programmatically combine all values into one array while dropping any duplicate or near-duplicate entries? Note that the JW-distance between ANOTHER COMPANY name, LLC
and another comp name
is < 95 but because each has a JW-distance of >=95 to Another Compnay Name
, they should appear in the same array.
Upvotes: 0
Views: 25