Reputation: 37
I am trying to delete records with duplicate column values from a table, but it is taking forever: the query hangs with no response for hours. The table is fairly large, with over 1.3M records. Is the query inefficient? Is there any way to optimize it?
delete n1 from ids n1, ids n2 where n1.id > n2.id and n1.user_id = n2.user_id
The database is remote, and I am using PuTTY to run the queries.
Upvotes: 0
Views: 6007
Reputation: 15941
Yes, that query is very inefficient. Even if you used explicit joins, keep in mind that every row N is matched against every earlier row with the same user_id, row N-1 is matched against every row before it, and so on, so the work grows roughly quadratically with the number of duplicates.
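To get a feel for how much matching the self-join has to do, the number of row pairs can be estimated per user_id. This is only a sketch, and it assumes that id is unique (for example, a primary key); the aliases c, per_user and matched_pairs are made up for illustration.

```sql
-- Rough estimate of the row pairs the original self-join has to match.
-- A user_id that appears c times contributes c * (c - 1) / 2 pairs
-- (assumes id is unique within the table).
SELECT SUM(c * (c - 1) / 2) AS matched_pairs
FROM (
    SELECT COUNT(*) AS c
    FROM ids
    GROUP BY user_id
) AS per_user;
```

If a few user_id values account for most of the rows, this number can be far larger than the row count of the table itself.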
Try something like this:
DROP TEMPORARY TABLE IF EXISTS keeps;
CREATE TEMPORARY TABLE keeps (
user_id INT,
keepID INT,
INDEX (user_id, keepID)
);
INSERT INTO keeps (user_id, keepID)
SELECT user_id, MIN(id) As keepID
FROM ids
GROUP BY user_id;
DELETE FROM ids WHERE (user_id, id) NOT IN (SELECT user_id, keepID FROM keeps);
DROP TEMPORARY TABLE IF EXISTS keeps;
I'm also tempted to suggest something like the query below, but I can't remember whether MySQL allows a subquery on the table being deleted from inside the DELETE statement itself, which is why I suggested the temporary table first.
DELETE a
FROM ids AS a
WHERE EXISTS (
SELECT *
FROM ids AS b
WHERE b.id < a.id
AND b.user_id = a.user_id
);
Upvotes: 0
Reputation: 780818
Add an index:
ALTER TABLE ids ADD INDEX (user_id, id);
This makes it efficient to find all the rows with the same user ID and higher IDs.
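One way to check whether the new index is actually picked up is to run EXPLAIN on a SELECT with the same join condition as the original DELETE. This is a sketch rather than a guaranteed result: the optimizer's choice depends on the MySQL version and the data, so look at the key column of the output instead of assuming the index is used.

```sql
-- Inspect the plan for the same self-join the original DELETE performs.
-- The "key" column of the output shows which index, if any, is chosen
-- for each table alias.
EXPLAIN
SELECT n1.id
FROM ids AS n1
JOIN ids AS n2
  ON n1.user_id = n2.user_id
 AND n1.id > n2.id;
```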
It will also help to join with a subquery.
DELETE n1
FROM ids AS n1
JOIN (SELECT user_id, MIN(id) AS minid
FROM ids
GROUP BY user_id) AS n2
ON n1.user_id = n2.user_id AND n1.id > n2.minid;
This will still be faster with the above index.
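After the delete finishes, a quick check along these lines should confirm that no duplicates remain. The alias copies is just an illustrative name.

```sql
-- List any user_id that still appears more than once.
-- An empty result means every user_id is now unique.
SELECT user_id, COUNT(*) AS copies
FROM ids
GROUP BY user_id
HAVING COUNT(*) > 1;
```

The same query, run before the delete, shows which user_id values are duplicated and how many copies each one has.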
Upvotes: 2