Reputation: 1567
Using: MySQL 5.6 on Windows, with default configuration file settings in my.ini
Table: datatbl1
row_id | emailaddr | valid
--------------------------------
INT, PK | VARCHAR(255) | BIT
Both emailaddr and row_id columns each have an index defined.
There are 600,000 rows in the table, and the objective is to remove duplicates. The query is:
delete dt2 from datatbl1 dt1 JOIN datatbl1 dt2 on (dt1.emailaddr = dt2.emailaddr) and (dt1.row_id < dt2.row_id);
On my system, it takes about 15 minutes to complete this query, I watch the mysqld process in the Task Manager, processor usage is 100% the whole time, but memory usage never crosses about 140MB, even though there is about 3GB of memory (RAM) installed and plenty memory available.
Questions:
Keep in mind that this query later needs to be applied to other tables, ie delete matches in datatbl1 for records that match other tables (datatbl2, datatbl3, datatbl4 etc) which have the same table structure.
On my client's system, the same query is taking 2 hours. Difference is that he has a normal hard drive, whereas I have a SSD.
The application is a client server application with a Delphi front-end, and is meant to be used by ordinary users on Windows PCs, thus MySQL will be almost always be running on a end-user Windows PC.
Thanks in advance.
EDIT: The Explain output as requested is:
mysql> explain delete dt2 from datatbl1 dt1 JOIN datatbl1 dt2 on (dt1.emailaddr
= dt2.emailaddr) and (dt1.row_id < dt2.row_id);
+----+-------------+-------+-------+------------------------------+-------------
+---------+--------------------------+------+-------------+
| id | select_type | table | type | possible_keys | key
| key_len | ref | rows | Extra |
+----+-------------+-------+-------+------------------------------+-------------
+---------+--------------------------+------+-------------+
| 1 | SIMPLE | dt1 | index | PRIMARY,ixemailaddr,ixrow_id | ixemailaddr
| 257 | NULL | 1 | Using index |
| 1 | SIMPLE | dt2 | ref | PRIMARY,ixemailaddr,ixrow_id | ixemailaddr
| 257 | emailmgrdb.dt1.emailaddr | 1 | Using where |
+----+-------------+-------+-------+------------------------------+-------------
+---------+--------------------------+------+-------------+
2 rows in set (0.01 sec)
Upvotes: 0
Views: 682
Reputation: 780843
Maybe this query will be faster:
DELETE dt1.*
FROM datatbl1 dt1
JOIN (SELECT emailaddr, MIN(row_id) minrow
FROM datatbl1
GROUP BY emailaddr) dt2
USING (emailaddr)
WHERE dt1.row_id > dt2.minrow
The size of the intermediate table in your original query is O(n^2) (because it joins each row with all the duplicates that follow it), but this one is O(n) (because it only joins the first row of each set of duplicates with the ones that follow it).
It depends on whether the slowness is in finding the rows or performing all the deletions. You can find this out by doing SELECT instead of DELETE and noting the performance difference.
Upvotes: 1
Reputation: 4748
Have you tried moving the row_id
comparison to the WHERE
clause?
DELETE dt1
FROM datatbl1 dt1
INNER JOIN datatbl1 dt2 ON dt1.emailaddr = dt2.emailaddr
WHERE dt1.row_id > dt2.row_id
Upvotes: 0