Stephane Gosselin
Stephane Gosselin

Reputation: 9148

How to optimize a delete query with a subselect?

This query needs to delete over 17 million rows, from a table containing 20 million.

DELETE
FROM statements
WHERE agreement_id IN
    (SELECT id
     FROM agreements
     WHERE created < DATE_SUB(CURDATE(), INTERVAL 6 MONTH));


DELETE
FROM agreements
WHERE created < DATE_SUB(CURDATE(), INTERVAL 6 MONTH)

It takes hours to run, am I missing something that could speed things up a bit?

The subselect by itself takes a few seconds, I don't understand why the delete takes so long.

Upvotes: 1

Views: 164

Answers (3)

Rick James
Rick James

Reputation: 142540

Use a "multi-table delete" instead of the usually inefficient IN ( SELECT ... ).

Several techniques for large deletes are discussed here.

To delete 85% of the table, it is really best to build a new table with the 15% you are keeping, then swap the table into place. (More on that in the link above.)

Upvotes: 1

sticky bit
sticky bit

Reputation: 37507

Try to rewrite the first statement to use EXISTS.

DELETE FROM statements
            WHERE EXISTS (SELECT *
                                 FROM agreements
                                 WHERE agreements.id = statements.aggreement_id
                                       AND agreements.created < date_sub(curdate(), interval 6 month));

And put an index on agreements (id, created) (if not already there).

CREATE INDEX agreements_id_created
             ON agreements
                (id,
                 created);

For the second one create an index on agreements (created) (if not already there).

CREATE INDEX agreements_created
             ON agreements
                (created);

Upvotes: 1

Derviş Kayımbaşıoğlu
Derviş Kayımbaşıoğlu

Reputation: 30663

If you have this much delete to be undertaken. I suggest you to:

  1. create new temporary table with the data which will stay.
  2. Truncate your main table
  3. Move data from temporary table to your main table

or

  1. create new temporary table with the data which will stay.
  2. Drop your main table
  3. Rename your Temp table as main table (dont forget to create constraints)

Also for your query,

never use IN clause for BIG data. Instead use exists which is more performant.

Basic script:

CREATE TABLE tmp_statements as
  SELECT * FROM statements s where exists 
  (
     select 1 FROM agreements a 
     WHERE 
       created < DATE_SUB(CURDATE(), INTERVAL 6  MONTH AND
       s.agreement_id = a.agreement_id
  ));

 DROP TABLE statements;

 RENAME TABLE tmp_statements TO statements ;

 --DONT FORGET TO RECREATE  CREATE YOUR INDEXES, CONSTRAINTS;

Upvotes: 1

Related Questions