I have the following sql data: ID Company Name Customer Address 1 City State Zip Date 0108500 AAA Test Mish~Sara Newa Claims Chtiana CO 123 06FE0046 0108500 AAA.Test Mish~Sara Newa Claims Chtiana CO 123 06FE0046 1802600 AAA Test Company Ban, Adj.~Gorge PO Box 83 MouLaurel CA 153 09JS0025 1210600 AAA Test Company Biwel~Brce 97kehst ve Jacn CA 153 04JS0190 AAA Test, AAA.Test and AAA Test Company are considered as one company. Since their data is messy I'm thinking either to do this: Is there a way to search all the records in the DB wherein it will search the company name with almost the same name then re-name it to the longest name? In this case, the AAA Test and AAA.Test will be AAA Test Company. OR Is there a way to filter only record with company name that are almost the same then they can have option to change it? If there's no way to do it via sql query, what are your suggestions so that we can clean-up the records? There are almost 1 million records in the database and it's hard to clean it up manually. Thank you in advance.

Reputation: 14950

Ways to Clean-up messy records in sql

I have the following sql data:

ID              Company Name        Customer            Address 1       City                        State  Zip   Date
0108500         AAA Test            Mish~Sara           Newa Claims     Chtiana                     CO     123   06FE0046        
0108500         AAA.Test            Mish~Sara           Newa Claims     Chtiana                     CO     123   06FE0046        
1802600         AAA Test Company    Ban, Adj.~Gorge     PO Box 83       MouLaurel                      CA      153   09JS0025        
1210600         AAA Test Company    Biwel~Brce          97kehst ve      Jacn                        CA     153   04JS0190

AAA Test, AAA.Test and AAA Test Company are considered as one company.

Since their data is messy I'm thinking either to do this:

Is there a way to search all the records in the DB wherein it will search the company name with almost the same name then re-name it to the longest name?

In this case, the AAA Test and AAA.Test will be AAA Test Company.

OR Is there a way to filter only record with company name that are almost the same then they can have option to change it?

If there's no way to do it via sql query, what are your suggestions so that we can clean-up the records? There are almost 1 million records in the database and it's hard to clean it up manually.

Thank you in advance.

Upvotes: 1

Answers (5)

Erik Douma

Reputation: 9

You can use SQL query with SOUDEX of DIFFRENCE

For example: SELECT DIFFERENCE ('AAA Test','AAA Test Company')

DIFFERENCE returns 0 - 4 ( 4 = almost the same, 0 - totally diffrent)

Upvotes: 1

roblogic

Reputation: 1326

You can do the clean-up in several stages.

Create new columns

Convert everything to upper case, remove punctuation & whitespace, then match on the first 6 to 10 characters (using self join). Assuming your table is called "vendor": add two columns, "status", "dupstr", then update as follows

/** Populate dupstr column for fuzzy match **/
update vendor v           
set v.dupstr = left(upper(regex_replace(regex_replace(v.companyname,'.',''),' ','')),6)
;

Identify duplicate records

Add an index on the dupstr column, then do an update like this to identify "good" records:

/** Mark the good duplicates **/
update vendor v           
set v.status = 'keep'  --indicate keeper record
where 
  --dupes to clean up
  exists ( select 1 from vendor v1 where v.dupstr = v1.dupstr 
           and v.id != v1.id )
and 
  ( --keeper has longest name
    length(v.companyname) = 
      ( select max(length(v2.companyname)) from vendor v2
        where v.dupstr = v2.dupstr 
      )
    or
    --keeper has latest record (assuming ID is sequential)
    v.id =
      ( select max(v3.id) from vendor v3
        where v.dupstr = v3.dupstr 
      )
  )
group by v.dupstr
;

The above SQL can be refined to add "dupe" status to other records , or you can do a separate update.

Clean Up Stragglers

Report any remaining partial matches to be reviewed by a human (i.e. dupe records without a keeper record)

Upvotes: 1

Mohsen

Reputation: 79

I think most Database Servers support Full-Text search ability, and if so there are some functions related to Full-Text search that support Proximity. for example there is a Near function in SqlServer and here is its documentation https://msdn.microsoft.com/en-us/library/ms142568.aspx

Upvotes: 1

jarlh

Reputation: 44786

Something like a self join? || is ANSI SQL concat, some products have a concat function instead.

select *
from tablename t1
  join tablename t2 on t1.companyname like '%' || t2.companyname || '%'

Depending on datatype you may have to remove blanks from the t2.companyname, use TRIM(t2.companyname) in that case.

And, as Miguel suggests, use REPLACE to remove commas and dots etc.

Use case-insensitive collation. SOUNDEX can be used etc etc.

Upvotes: 2

jradich1234

Reputation: 1425

You could use String matching algorithm like Jaro-Winkler. I've written an SQL version that is used daily to deduplicate People's names that have been typed in differently. It can take awhile but it does work well for the fuzzy match you're looking for.

Upvotes: 2

Ways to Clean-up messy records in sql

Answers (5)

Create new columns

Identify duplicate records

Clean Up Stragglers

Related Questions