Novocaine

Reputation: 4786

Optimising "NOT IN(...)" query for millions of rows

Note: I do not have access to the source code/database to which this question pertains. The two tables in question are located on different servers.

I'm working with a 3rd-party company whose systems are integrated with our own. They have a query that runs something like this:

DELETE FROM table WHERE column NOT IN(1,2,3,4,5,.....3 000 000)

It's pretty much referencing around 3 million values in the NOT IN.

I'm trying to point out that this seems like an inefficient way to delete multiple rows while keeping all the ones listed in the query. The problem is that, since I don't have access to the source code or database myself, I'm not sure what to suggest as a solution.

I know the idea of this query is to keep a target server synced up with a source server. So if a row is deleted on the source server, the target server will reflect that change when this (and other) query is run.

With this limited knowledge, what possible suggestions could I present to them?

The first thing that comes to mind is having some kind of flag column that indicates whether a row has been deleted. When the sync script runs, it would first perform an update on the target server for all rows marked as deleted (or an insert for new rows), then run a second query to delete all rows marked for deletion.
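
A rough sketch of what I have in mind (the table name and `is_deleted` column are made up for illustration; I don't know the real schema):

-- Step 1: the sync script flags the rows that were deleted on the source
-- (the id list here is just a placeholder)
UPDATE target_table SET is_deleted = 1 WHERE id IN (1, 2, 3);

-- Step 2: a second, much shorter query removes everything flagged
DELETE FROM target_table WHERE is_deleted = 1;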

Is there a more logical way to do something like this, bearing in mind that complete overhauls of the functionality are out of the question? Only small tweaks to the current process will be possible, for a number of reasons.

Upvotes: 0

Views: 228

Answers (3)

Rick James

Reputation: 142298

Deleting a large number of rows will take a huge amount of time. It is likely to require a full table scan. As it finds rows to delete, it will stress the undo/redo logs. It will clog replication (if you are using it). Etc.

How many rows do you expect to delete?

Better would be to break the list up into chunks of 1000. (This applies whether using IN(list of constants) or JOIN.) But, since you are doing NOT, it gets stickier. Possibly the best way is to copy over what you want:

-- REAL is a reserved word in MySQL, so the names need quoting
CREATE TABLE `new` LIKE `real`;
INSERT INTO `new`
    SELECT * FROM `real` WHERE id IN (...);  -- without NOT
-- atomic swap, then drop the old copy
RENAME TABLE `real` TO `old`,
             `new` TO `real`;
DROP TABLE `old`;
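
If even shipping a 3M-value IN list is a problem, the same copy could be done with a JOIN against a staging table instead of a constant list. This is only a sketch: `keep_ids` is an illustrative name, and the keep-list would have to be bulk-loaded into it first.

-- hypothetical staging table holding the ids to keep
CREATE TABLE keep_ids (id INT NOT NULL PRIMARY KEY);
-- ...bulk-load the keep-list here (e.g. LOAD DATA INFILE)...
INSERT INTO `new`
    SELECT r.*
    FROM `real` AS r
    JOIN keep_ids AS k ON k.id = r.id;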

I go into details of chunking, partitioning, and other techniques in Big Deletes.

Upvotes: 0

Bill Karwin

Reputation: 562330

I know the idea of this query is to keep a target server synced up with a source server. So if a row is deleted on the source server, the target server will reflect that change when this (and other) query is run.

I know this is obvious, but why don't these two servers stay in sync using replication? I'm guessing it's because, aside from this one table, they don't have identical data.

If out-of-the-box replication isn't flexible enough, you could use a change-data capture tool.

The idea is that the tool monitors changes in a MySQL binary log stream, and reacts to them. The reaction is user-defined, and it can include applying the same change to another MySQL instance, which would keep them in sync.
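
As a concrete (hypothetical) example: if the binlog stream reports that row 42 was deleted on the source, the consumer's reaction could be as simple as replaying the equivalent statement on the target (the table name here is assumed):

-- hypothetical reaction to a binlog DELETE event for row id 42
DELETE FROM target_table WHERE id = 42;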

Here's a blog post that shows how to use Maxwell, one of the open-source CDC tools, released by Zendesk: https://www.percona.com/blog/2016/09/13/mysql-cdc-streaming-binary-logs-and-asynchronous-triggers/

A couple of advantages of this approach:

  • No need to re-sync the whole table. You'd only apply incremental changes as they occur.
  • No need to schedule re-syncs daily or whatever. Since incremental changes are likely to be small, you could apply the changes nearly immediately.

Upvotes: 1

juergen d

Reputation: 204766

Instead of

DELETE FROM your_table 
WHERE column NOT IN(1,2,3,4,5,.....3 000 000)

you could do

DELETE t1
FROM your_table t1
LEFT JOIN table_where_the_ids_come_from t2
       ON t1.`column` = t2.id  -- COLUMN is a reserved word in MySQL, hence the backticks
WHERE t2.id IS NULL            -- rows with no match in the id table get deleted
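
One caveat: the question notes that the two tables live on different servers, so table_where_the_ids_come_from would first have to be staged on the server where the DELETE runs. A sketch of that staging step (the load mechanism and names are illustrative):

-- staging table on the target server for the ids that exist on the source
CREATE TABLE table_where_the_ids_come_from (id INT NOT NULL PRIMARY KEY);
-- bulk-load the ~3 million ids, e.g. LOAD DATA INFILE or batched inserts:
INSERT INTO table_where_the_ids_come_from (id) VALUES (1), (2), (3);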

Upvotes: 2
