Reputation: 3568
I am running a delete which removes all of the duplicates within a table. A duplicate is defined as a row where the tag_id, user_id, and is_self are all the same. My technique here is fairly standard: since the tags_users table itself needs to be referenced to know whether a duplicate exists, a temp table is created so that the delete can be performed against the same table that is being referenced. The problem is that the table has about a million rows, so this query takes about an hour to run. I believe the slowness comes from materializing this temp table and then referencing it while it is un-indexed.
DELETE FROM tags_users WHERE id IN (
    SELECT id FROM (
        SELECT A.id
        FROM tags_users AS A, tags_users AS B
        WHERE A.id > B.id
          AND A.user_id = B.user_id
          AND A.tag_id = B.tag_id
          AND A.is_self = B.is_self
        GROUP BY A.id
    ) temp_dup_delete
);
I have reviewed the EXPLAIN output for this query, listed here (please note I'm on MySQL 5.5, so I'm using EXPLAIN SELECT 1 to simulate EXPLAIN DELETE). I think the best possible solution is to define an index on the temp table, but I cannot figure out how to do this yet. The crux of my question is: is there a way to improve the speed of this query, given the way it defines a temp table? Thank you to anyone who can help.
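What I had in mind was roughly the following: materialize the duplicate ids into an explicitly indexed temporary table, then delete by joining against it. This is only an untested sketch -- the dup_ids table, its index, and the assumption that id is an integer are all my own placeholders -- so I'm not sure it is correct or any faster:

-- Sketch: build an indexed temp table of the duplicate ids
-- (dup_ids is a placeholder name; id is assumed to be an integer).
CREATE TEMPORARY TABLE dup_ids (
    id INT NOT NULL PRIMARY KEY
)
SELECT A.id
FROM tags_users AS A, tags_users AS B
WHERE A.id > B.id
  AND A.user_id = B.user_id
  AND A.tag_id = B.tag_id
  AND A.is_self = B.is_self
GROUP BY A.id;

-- Delete by joining on the indexed temp table instead of an IN subquery.
DELETE tu
FROM tags_users tu
JOIN dup_ids d ON d.id = tu.id;

DROP TEMPORARY TABLE dup_ids;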
Upvotes: 0
Views: 684
Reputation: 1270613
Here is an alternative approach. Use an aggregation query to find the minimum id for each set of key values -- this seems to be the row you want to keep.
Then, use a left outer join to match against this derived table and delete all the rows in the original data that do not match.
delete tu
from tags_users tu left outer join
     (select tag_id, user_id, is_self, min(id) as minid
      from tags_users
      group by tag_id, user_id, is_self
     ) tui
     on tui.minid = tu.id
where tui.minid is null;
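If the aggregation in the derived table is itself slow, a composite index over the grouped columns should let it be computed from the index rather than a full table scan. A sketch, assuming you are free to add an index (the index name is made up):

-- Hypothetical index covering the GROUP BY columns plus id,
-- so the min(id)-per-group subquery can be satisfied from the index.
CREATE INDEX idx_tags_users_dedup
    ON tags_users (tag_id, user_id, is_self, id);

Once the duplicates are gone, you may also want a unique index on (tag_id, user_id, is_self) so they cannot reappear.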
Upvotes: 1