newbie
newbie

Reputation: 14950

Ways to Clean-up messy records in sql

I have the following sql data:

ID              Company Name        Customer            Address 1       City                        State  Zip   Date
0108500         AAA Test            Mish~Sara           Newa Claims     Chtiana                     CO     123   06FE0046        
0108500         AAA.Test            Mish~Sara           Newa Claims     Chtiana                     CO     123   06FE0046        
1802600         AAA Test Company    Ban, Adj.~Gorge     PO Box 83       MouLaurel                      CA      153   09JS0025        
1210600         AAA Test Company    Biwel~Brce          97kehst ve      Jacn                        CA     153   04JS0190

AAA Test, AAA.Test and AAA Test Company are considered as one company.

Since their data is messy I'm thinking either to do this:

  1. Is there a way to search all the records in the DB wherein it will search the company name with almost the same name then re-name it to the longest name?

In this case, the AAA Test and AAA.Test will be AAA Test Company.

  1. OR Is there a way to filter only record with company name that are almost the same then they can have option to change it?

If there's no way to do it via sql query, what are your suggestions so that we can clean-up the records? There are almost 1 million records in the database and it's hard to clean it up manually.

Thank you in advance.

Upvotes: 1

Views: 2311

Answers (5)

Erik Douma
Erik Douma

Reputation: 9

You can use SQL query with SOUDEX of DIFFRENCE

For example: SELECT DIFFERENCE ('AAA Test','AAA Test Company')

DIFFERENCE returns 0 - 4 ( 4 = almost the same, 0 - totally diffrent)

See also: https://learn.microsoft.com/en-us/sql/t-sql/functions/difference-transact-sql?view=sql-server-2017

Upvotes: 1

roblogic
roblogic

Reputation: 1326

You can do the clean-up in several stages.

Create new columns

Convert everything to upper case, remove punctuation & whitespace, then match on the first 6 to 10 characters (using self join). Assuming your table is called "vendor": add two columns, "status", "dupstr", then update as follows

/** Populate dupstr column for fuzzy match **/
update vendor v           
set v.dupstr = left(upper(regex_replace(regex_replace(v.companyname,'.',''),' ','')),6)
;

Identify duplicate records

Add an index on the dupstr column, then do an update like this to identify "good" records:

/** Mark the good duplicates **/
update vendor v           
set v.status = 'keep'  --indicate keeper record
where 
  --dupes to clean up
  exists ( select 1 from vendor v1 where v.dupstr = v1.dupstr 
           and v.id != v1.id )
and 
  ( --keeper has longest name
    length(v.companyname) = 
      ( select max(length(v2.companyname)) from vendor v2
        where v.dupstr = v2.dupstr 
      )
    or
    --keeper has latest record (assuming ID is sequential)
    v.id =
      ( select max(v3.id) from vendor v3
        where v.dupstr = v3.dupstr 
      )
  )
group by v.dupstr
;

The above SQL can be refined to add "dupe" status to other records , or you can do a separate update.

Clean Up Stragglers

Report any remaining partial matches to be reviewed by a human (i.e. dupe records without a keeper record)

Upvotes: 1

Mohsen
Mohsen

Reputation: 79

I think most Database Servers support Full-Text search ability, and if so there are some functions related to Full-Text search that support Proximity. for example there is a Near function in SqlServer and here is its documentation https://msdn.microsoft.com/en-us/library/ms142568.aspx

Upvotes: 1

jarlh
jarlh

Reputation: 44786

Something like a self join? || is ANSI SQL concat, some products have a concat function instead.

select *
from tablename t1
  join tablename t2 on t1.companyname like '%' || t2.companyname || '%'

Depending on datatype you may have to remove blanks from the t2.companyname, use TRIM(t2.companyname) in that case.

And, as Miguel suggests, use REPLACE to remove commas and dots etc.

Use case-insensitive collation. SOUNDEX can be used etc etc.

Upvotes: 2

jradich1234
jradich1234

Reputation: 1425

You could use String matching algorithm like Jaro-Winkler. I've written an SQL version that is used daily to deduplicate People's names that have been typed in differently. It can take awhile but it does work well for the fuzzy match you're looking for.

Upvotes: 2

Related Questions