Reputation: 14275
There are many questions on how to find duplicates in a database, but not with the specific problem that I have.
I have a table with approx. 120000 entries. I need to find duplicates. To find them, I use a php script that is structured like the following:
//get all entries from database
//loop through them
//get entries with greater id
//compare all of them with the original one
//update database (delete duplicate, update information in linked tables, etc.)
It is not possible to sort out all duplicates already in the initial query, because I have to loop through all entries since my duplicate search is sensitive not only to entries that are 100% alike, but also entries that are 90% alike. I use similar_text() for that.
I think the first loop is okay, but looping through all other entries within the loop is just too much. With 120000 entries this would be close to (120000^2)/2 iterations.
So instead of using a loop within the loop, there must be a better way to do it. Do you have any ideas? I thought about using in_array(), but it is not sensitive to something like 90% string similarity, and also doesn't give me the array's fields it found the duplicates in - I would need those to get the entries' ids to update the database correctly.
Any ideas?
Thank you very much!
Charles
UPDATE 1
The query I am using right now is the following:
SELECT a.host_id
FROM host_webs a
JOIN host_webs b ON a.host_id != b.host_id AND a.web = b.web
GROUP BY a.host_id
It shows originals and duplicates perfectly, but I need to get rid of the originals, i.e. the first ones found with the associated data. How can I accomplish that?
Upvotes: 0
Views: 1420
Reputation: 88647
You can JOIN
the table onto itself and do it all in SQL (I know you say you don't think you can, but I would be surprised if this is the case). All you need to do is put all the columns you use to test for duplicates into the ON
clause of the JOIN
.
SELECT id
FROM tablename a
JOIN tablename b ON a.id != b.id AND a.col1 = b.col1 AND a.col2 = b.col2
GROUP BY id
This will return just the id
s of the rows where col1
and col2
are duplicated. You can incorporate whatever string comparisons you need into this, the ON
clause can be as complicated as you need it to be. For example:
SELECT id
FROM tablename a
JOIN tablename b ON a.id != b.id AND
(a.col1 = b.col1 AND (a.col2 = b.col2 OR a.col3 = b.col3))
OR ((a.col1 = b.col1 OR a.col2 = b.col2) AND a.col3 = b.col3)
OR (SOUNDEX(a.col1) = SOUNDEX(b.col1) AND SOUNDEX(a.col2) = SOUNDEX(b.col2) AND SOUNDEX(a.col3) = SOUNDEX(b.col3))
GROUP BY id
EDIT
Since all you are actually doing with your query is looking for rows where the web
column is identical, this would do the job of finding only the duplicates and not the original "good" records - assuming host_id
is numeric and that the "good" record would be the one with the lowest host_id
:
SELECT b.host_id
FROM host_webs a
INNER JOIN host_webs b ON b.web = a.web AND b.host_id > a.host_id
GROUP BY b.host_id
I imagine the end game here would be to remove the duplicates, so if you are feeling brave you could actually delete them in one go:
DELETE b.*
FROM host_webs a
INNER JOIN host_webs b ON b.web = a.web AND b.host_id > a.host_id
The GROUP BY
is not necessary in the DELETE
statement because it doesn't matter if you try and delete the same row more than once in a single statement.
Upvotes: 2
Reputation: 17598
If you're doing a 1-time removal of duplicate items, I wouldn't bother writing a php script - it's cleaner to do it in sql.
The general algorithm for removing duplicates that I find works the best is:
1. duplicate the table
2. truncate the original table
3. set a unique index on whichever columns need to be unique
4. reinsert the rows using either INSERT IGNORE INTO original_table SELECT * FROM duplicate_table
OR REPLACE INTO original_table SELECT * FROM duplicate table
5. fixed linked tables - remove orphaned rows (DELETE x FROM x LEFT JOIN original TABLE ON (...) WHERE original_table.id IS NULL
)
Upvotes: 0