moustachio
moustachio

Reputation: 3004

Is it quicker to join with a where clause, or delete rows afterward?

Ok, so I'm loading some data into table from another a join like so (where ts is a timestamp column):

insert ignore into target_table select s.user_id,i.item_id,i.artist_id,s.ts 
from source_table s join itemlist i on s.item_url = i.item_url;

The thing is, I only want data from within in a certain time range, so my question is: Will it be faster to do this with a where clause on the join, i.e.:

insert ignore into target_table select s.user_id,i.item_id,i.artist_id,s.ts 
from source_table s join itemlist i on s.item_url = i.item_url where 
ts< '2013-01-01' and ts >= '2005-07-01';

Or, to do the first query, then afterwards do a delete, i.e.:

delete from target_table where ts>= '2013-01-01' and ts < '2005-07-01';

I know I could just test this myself, but I'd also like to understand why one method is preferred

Upvotes: 0

Views: 32

Answers (2)

Matt Gibson
Matt Gibson

Reputation: 38238

(Assuming that you meant the second method's WHERE clause to actually be the opposite of the first's—otherwise you'll end up with a different result...)

It will depend on the indexes on the tables, and the amount of the data in the tables.

If both tables have an index on ts, then I can't see it'll make a lot of difference, performance-wise, assuming not many rows, but it would be more usual to limit the initial SELECT, so you only perform one operation.

The amount of data in the tables will be relevant. Think of this example: there's a billion rows in source_table, but only one row that matches your criteria. In that case, an indexed SELECT with a WHERE clause will find the one row very quickly and copy it into the target table. However, the second method will copy a billion rows to the target table, which will take a long time (and a lot of disk space), and then you'll just delete all but one of them, which may take quite some time even if finding the row you want to keep is quite fast because of the index.

Inserting and then deleting may also lead to more "fragmented" data in the destination table, but that'll probably depend on the storage engine and I don't know enough about how MySQL handles things like that to give an opinion.

Other things to bear in mind: if you INSERT and then DELETE, there will be a period of time where the data in the table is "wrong". And if you're not wrapped in a transaction and something goes wrong in between the INSERT and the DELETE, it might stay wrong. If you just do the INSERT with a WHERE, it's a single atomic operation that will either succeed entirely or roll back.

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1270493

The first version should be faster:

insert ignore into target_table(user_id, item_id, artis_it, ts)
    select s.user_id,i.item_id,i.artist_id,s.ts 
    from source_table s join
         itemlist i
         on s.item_url = i.item_url
    where ts < '2013-01-01' and ts >= '2005-07-01';

Here are some reasons why:

  • Doing one command instead of multiple commands is faster, because of the overhead of running the command.
  • Each insert has overhead (logging, copying records onto data pages, allocating data pages). Doing unnecessary inserts takes longer than necessary.
  • Each delete has overhead (logging, deallocating pages).
  • Inserts and deletes might have triggers that would cause more overhead.
  • Between the insert and the delete, the data in the table is not accurate.

Upvotes: 1

Related Questions