Reputation: 588
Over the years a few duplicates were added to our data tables as a result of spelling mistakes. So for example someone misspelled the name and instead of O'leary wrote Oleary. Our system thinks its a totally different name and does not complain about it, however in most cases its the same contact entered twice(I didn't build this system).
Now what I want to do is remove all of these duplicates, but I am having hard time building a query to display them all. I did experiment with UTL_MATCH and wrote a query that would return all similar names if I supply the name.
select first_name from customers
where UTL_MATCH.edit_distance_similarity(first_name,'Oleary') > 60
order by first_name;
However I would like to build a query that would automatically return all possible duplicates without having to supply the name. Could anyone point me in the correct direction please?
Upvotes: 3
Views: 139
Reputation: 231661
Something like this would technically work.
select c1.first_name, c2.first_name
from customers c1
cross join customers c2
where utl_match.edit_distance_similarity( c1.first_name, c2.first_name ) > 60
order by c1.first_name
However, it would be extremely slow unless your customers
table happens to be very (very) small since you're comparing every row in the customers
table against every other row in the table (and your edit distance similarity cut-off is pretty low). In order to make that faster, you're probably going to have to make assumptions about your data or do something else that can be a preliminary filter. For example, if you assume that any duplicates start with the same first character or the same first few characters ignoring punctuation, then you can dramatically decrease the number of pairs that need to be matched at the risk of missing the fact that "Kustin" may be a typo duplicate of "Justin" where the first character is different. Requiring that c2.customer_id > c1.customer_id
would be another reasonable filter to consider assuming that you don't need every pair duplicated (i.e. a "Kustin/ Justin" row can exist without the equivalent "Justin/ Kustin" row).
Upvotes: 3
Reputation: 1269753
You can use this for a join
:
select c1.first_name, c2.first_name
from customers c1 join
customers c2
on UTL_MATCH.edit_distance(c1.first_name, c2.first_name) <= 3
order by c1.first_name;
Notes:
edit_distance()
to edit_distance_similarity()
, because I understand the units.Upvotes: 3