How to find duplicates in a database?

Question

There are many questions on how to find duplicates in a database, but not with the specific problem that I have.

I have a table with approximately 120000 entries, and I need to find the duplicates in it. To find them, I use a PHP script structured like this:

//get all entries from database
//loop through them
    //get entries with greater id
    //compare all of them with the original one
    //update database (delete duplicate, update information in linked tables, etc.)

It is not possible to filter out all duplicates in the initial query, because my duplicate search has to catch not only entries that are 100% alike but also entries that are about 90% alike, so I have to loop through all entries. I use similar_text() for the comparison.

I think the first loop is okay, but looping through all other entries within the loop is just too much. With 120000 entries this would be close to (120000^2)/2 iterations.
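To put a number on that estimate, here is a quick sketch (in Python rather than the PHP of my script; `pair_count` is just an illustrative name) that counts how many pairwise comparisons the nested loop performs when each entry is compared only with entries of greater id:

```python
def pair_count(n: int) -> int:
    """Number of unordered pairs among n entries: n * (n - 1) / 2."""
    return n * (n - 1) // 2


comparisons = pair_count(120_000)
print(f"{comparisons:,}")  # prints 7,199,940,000
```

That is roughly 7.2 billion calls to the comparison function.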

So instead of using a loop within the loop, there must be a better way to do it. Do you have any ideas? I thought about using in_array(), but it cannot detect something like 90% string similarity, and it also doesn't tell me at which array keys it found the duplicates. I would need those to get the entries' ids and update the database correctly.

Any ideas?

Thank you very much!

Charles

UPDATE 1

The query I am using right now is the following:

SELECT a.host_id
FROM host_webs a
JOIN host_webs b ON a.host_id != b.host_id AND a.web = b.web
GROUP BY a.host_id

It shows originals and duplicates perfectly, but I need to get rid of the originals, i.e. the first ones found with the associated data. How can I accomplish that?

DaveRandom · Accepted Answer

You can JOIN the table onto itself and do it all in SQL (I know you say you don't think you can, but I would be surprised if this is the case). All you need to do is put all the columns you use to test for duplicates into the ON clause of the JOIN.

SELECT a.id
FROM tablename a
JOIN tablename b ON a.id != b.id AND a.col1 = b.col1 AND a.col2 = b.col2
GROUP BY a.id

This will return just the ids of the rows where col1 and col2 are duplicated. You can incorporate whatever string comparisons you need into this; the ON clause can be as complicated as you need it to be. For example:
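If you want to try the basic self-join on a few rows before pointing it at the real table, here is a minimal sketch. It uses Python's built-in sqlite3 module and an in-memory database as a stand-in for your MySQL table, and the sample rows are made up for illustration:

```python
import sqlite3

# In-memory SQLite database standing in for the real MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tablename (id INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT)"
)
conn.executemany(
    "INSERT INTO tablename (id, col1, col2) VALUES (?, ?, ?)",
    [
        (1, "alice", "london"),
        (2, "bob", "paris"),
        (3, "alice", "london"),  # same col1 and col2 as row 1
        (4, "alice", "berlin"),  # same col1 only, so not a duplicate
    ],
)

# Self-join: a row qualifies when a different row has equal col1 and col2.
duplicate_ids = [
    row[0]
    for row in conn.execute(
        """
        SELECT a.id
        FROM tablename a
        JOIN tablename b
          ON a.id != b.id AND a.col1 = b.col1 AND a.col2 = b.col2
        GROUP BY a.id
        ORDER BY a.id
        """
    )
]
print(duplicate_ids)  # prints [1, 3]
```

Both the original (1) and its copy (3) come back, because the condition a.id != b.id matches the pair in both directions.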

SELECT a.id
FROM tablename a
JOIN tablename b ON a.id != b.id AND (
  (a.col1 = b.col1 AND (a.col2 = b.col2 OR a.col3 = b.col3))
  OR ((a.col1 = b.col1 OR a.col2 = b.col2) AND a.col3 = b.col3)
  OR (SOUNDEX(a.col1) = SOUNDEX(b.col1) AND SOUNDEX(a.col2) = SOUNDEX(b.col2) AND SOUNDEX(a.col3) = SOUNDEX(b.col3))
)
GROUP BY a.id
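To illustrate a fuzzy comparison inside the ON clause, here is a rough sketch using Python's sqlite3 module. Everything in it is an assumption made for the demo: SQLite replaces MySQL, the `similarity` SQL function is registered by hand, and difflib's `SequenceMatcher.ratio()` stands in for PHP's similar_text() (the two do not produce identical scores). The 0.9 threshold mirrors the 90% mentioned in the question.

```python
import sqlite3
from difflib import SequenceMatcher


def similarity(left: str, right: str) -> float:
    """Similarity in [0, 1]; a stand-in for similar_text(), not an equivalent."""
    return SequenceMatcher(None, left, right).ratio()


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the (hypothetical) name "similarity".
conn.create_function("similarity", 2, similarity)

conn.execute("CREATE TABLE tablename (id INTEGER PRIMARY KEY, col1 TEXT)")
conn.executemany(
    "INSERT INTO tablename (id, col1) VALUES (?, ?)",
    [
        (1, "example.com/index.html"),
        (2, "example.com/index.htm"),  # near-duplicate of row 1
        (3, "another-site.org"),
    ],
)

# Rows that are at least 90% similar to some other row.
fuzzy_ids = [
    row[0]
    for row in conn.execute(
        """
        SELECT a.id
        FROM tablename a
        JOIN tablename b
          ON a.id != b.id AND similarity(a.col1, b.col1) >= 0.9
        GROUP BY a.id
        ORDER BY a.id
        """
    )
]
print(fuzzy_ids)  # prints [1, 2]
```

Note that a join like this still evaluates the comparison for every pair of rows, so it moves the work into the database rather than removing it.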

EDIT

Since all you are actually doing with your query is looking for rows where the web column is identical, this would do the job of finding only the duplicates and not the original "good" records - assuming host_id is numeric and that the "good" record would be the one with the lowest host_id:

SELECT b.host_id
FROM host_webs a
INNER JOIN host_webs b ON b.web = a.web AND b.host_id > a.host_id
GROUP BY b.host_id
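A small sketch of that query on sample data, again using Python's sqlite3 module with an in-memory table as a stand-in for your MySQL host_webs table (the rows are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE host_webs (host_id INTEGER PRIMARY KEY, web TEXT)")
conn.executemany(
    "INSERT INTO host_webs (host_id, web) VALUES (?, ?)",
    [
        (1, "a.example"),  # original
        (2, "b.example"),  # original
        (3, "a.example"),  # duplicate of 1
        (4, "a.example"),  # duplicate of 1 (and of 3)
        (5, "b.example"),  # duplicate of 2
    ],
)

# b is a duplicate when some row a has the same web and a lower host_id.
duplicates_only = [
    row[0]
    for row in conn.execute(
        """
        SELECT b.host_id
        FROM host_webs a
        INNER JOIN host_webs b ON b.web = a.web AND b.host_id > a.host_id
        GROUP BY b.host_id
        ORDER BY b.host_id
        """
    )
]
print(duplicates_only)  # prints [3, 4, 5]
```

Row 4 joins against both row 1 and row 3, which is why the GROUP BY is there: without it, host_id 4 would be listed twice.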

I imagine the end game here would be to remove the duplicates, so if you are feeling brave you could actually delete them in one go:

DELETE b.*
FROM host_webs a
INNER JOIN host_webs b ON b.web = a.web AND b.host_id > a.host_id

The GROUP BY is not necessary in the DELETE statement because it doesn't matter if you try to delete the same row more than once in a single statement.
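If you would rather rehearse the delete on throwaway data first, here is a hedged sketch in Python with sqlite3. One caveat: SQLite does not support the multi-table DELETE ... JOIN form above, so this demo swaps in an equivalent DELETE ... WHERE host_id IN (subquery); the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE host_webs (host_id INTEGER PRIMARY KEY, web TEXT)")
conn.executemany(
    "INSERT INTO host_webs (host_id, web) VALUES (?, ?)",
    [
        (1, "a.example"),
        (2, "b.example"),
        (3, "a.example"),  # duplicate of 1
        (4, "a.example"),  # duplicate of 1
        (5, "b.example"),  # duplicate of 2
    ],
)

# SQLite-friendly rewrite of the join delete: remove every row that has a
# matching row with the same web and a lower host_id.
conn.execute(
    """
    DELETE FROM host_webs
    WHERE host_id IN (
        SELECT b.host_id
        FROM host_webs a
        INNER JOIN host_webs b ON b.web = a.web AND b.host_id > a.host_id
    )
    """
)
conn.commit()

remaining = conn.execute(
    "SELECT host_id, web FROM host_webs ORDER BY host_id"
).fetchall()
print(remaining)  # prints [(1, 'a.example'), (2, 'b.example')]
```

Only the row with the lowest host_id for each web value survives, which matches the assumption that the lowest id is the "good" record.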
