jalexsmith
jalexsmith

Reputation: 207

MySQL Remove/Combine Similar Rows

I've got a problem that I just can't seem to find the answer to. I've developed a very small CRM-like application in PHP that's driven by MySQL. Users of this application can import new data to the database via an uploaded CSV file. One of the issues we're working to solve right now is duplicate, or more importantly, near duplicate records. For example, if I have the following:

Record A: [1, Bob, Jones, Atlanta, GA, 30327, (404) 555-1234]

and

Record B: [2, Bobby, Jones, Atlanta, GA, 30327, Bob's Shoe Store, (404) 555-1234]

I need a way to see that these are both similar, take the record with more information (in this case record B) and remove record A.

But here's where it gets even more complicated. This must be done upon importing new data, and a function I can execute to remove duplicates from the database at any time. I have been able to put something together in PHP that gets all duplicate rows from the MySQL table and matches them up by phone number, or by using implode() on all columns in the row and then using strlen() to decide the longest record. There has got to be a better way of doing this, and one that is more accurate.

Do any of you have any brilliant suggestions that I may be able to implement or build on? It's obvious that when importing new data I'll need to open their CSV file into an array or temporary MySQL table, do the duplicate/similar search, then recompile the CSV file or add everything from the temporary table to the main table. I think. :)

I'm hoping that some of you can point out something that I may be missing that can scale somewhat decently and that's somewhat accurate. I'd rather present a list of duplicates we're 'unsure' about to a user that's 5 records long, not 5,000.

Thanks in advance! Alex

Upvotes: 3

Views: 420

Answers (2)

Savas Vedova
Savas Vedova

Reputation: 5692

If I were you I'd give a UNIQUE key to name, surname and phone number since in theory if all these three are equal then it means that it is a duplicate. I am thinking so because a phone number can have only one owner. Anyways, you should find a combination of 2-3 or maybe 4 columns and assign them a unique key. Once you have such a structure, run something like this:

// assuming that you have defined something like the following in your CREATE TABLE: 
UNIQUE(phone, name, surname)
// then you should perform something like:
INSERT INTO your_table (phone, name, surname) VALUES ($val1, $val2, $val3) 
ON DUPLICATE KEY UPDATE phone = IFNULL($val1, phone), 
                        name = IFNULL($val2, name),
                        surname = IFNULL($val3, surname);

So basically, if the inserted value is a duplicate, this code will update the row, rather than inserting a new one. The IFNULL function performs a check to see whether the first expression is null or not. If it is null, then it picks the second expression, which in this case is the column value that already exists in your table. Hence, it will update your row with as much as information possible.

Upvotes: 2

dmitry
dmitry

Reputation: 5049

I don't think there're brilliant solutions. You need to determine priority of your data fields you can rely on for detecting similarity, for example phone, some kind of IDs, of some uniform address or official name. You can save some cleaned up values (reduced to the same format like only digits in phones, concatenated full address) along with row which you would be able to use for similarity search when adding records. Then you need to decide on data completeness in any case to update existing rows with more complete fields, or delete old and add new row.

Don't know any ready solutions for such a variable task and doubt they exist.

Upvotes: 0

Related Questions