Chris
Chris

Reputation: 53

MySQL most efficient method to find missing values from multiple tables

I have three tables, the first is a list of email addresses:

addresses:
id - integer, this is the primary key<br>
email - varchar(255) field holding the address

sent:
sid - integer, foreign key references id in addresses table

received:
rid - integer, foreign key references id in addresses table

Obviously the "sent" and "received" tables have other columns, but they are not important for this question. the sent and received tables are populated every time an email is sent or received and if the address is not already in the "addresses" table, it gets added. The tables can get quite large (100,000+).

Entries for the "sent" and "received" tables are purged on a regular basis and entries removed for various reasons, leaving orphaned entries in the "addresses" table.

I am looking for the most efficient method in MySQL to purge orphaned entries in the "addresses" table. The query I have so far is:

delete 
from addresses 
where id not in 
         (select rid from received) 
  and id not in 
         (select sid from sent);

This works, but it can take a looong time to run and is definitely not the most efficient way of doing this! I also tried this:

delete 
from addresses 
where not exists 
      (select 'x' from sent where sent.sid=addresses.id) 
  and not exists 
      (select 'x' from rceieved where recieved.rid=addresses.id);

This was a bit quicker, but still takes a long time, I suspect I need to use the JOIN syntax but my sql knowledge has run out on me at this point !

Upvotes: 0

Views: 858

Answers (4)

Andrew
Andrew

Reputation: 1178

Did some testing using 2 300k myisam tables which contained 2 id columns (and several other non-identical columns). The ids were identical except for 2 records in one table. Tried the 3 methods mentioned to find these ids:

WHERE NOT EXISTS

LEFT JOIN

IN ()

Making sure to use SQL_NO_CACHE and all queries performed identically, the server returned the two results in ~14.6 seconds.

The differences mentioned above must either be caching, differing versions of mysql and/or general server configuration.

Upvotes: 0

Nicolas Kramer
Nicolas Kramer

Reputation: 1

I'm sorry I can't really give a definite answer. But I had a similar problem, and after looking around it seems there are only two main choices:

  1. using WHERE x NOT IN y
  2. using LEFT JOIN x ON y WHERE z IS NULL

I tried both methods by comparing two tables, of 2822291 and 916626 records respectively.

The performance conclusions are as follows:

  • Type 1 is significantly faster than Type 2. (600 sec vs 6000 sec)
  • Indexes or keys have a reasonable impact on performance for this operation on both types.
  • Performance is almost independent of the actual DISTINCT number of values. Thus comparing 2000 distinct values or just 15 for both tables takes about the same time.

Thus, concluding, as of now (08-2013) is seems that option 1 is still the faster way to go. Using NOT EXISTS might be even faster, but performance changes there aren't dramatic compared to type 1.

I hope this helps anyone out eventually.

Upvotes: 0

dweeves
dweeves

Reputation: 5605

This should do the trick

DELETE adresses.* FROM adresses 
LEFT JOIN sent ON sent.sid=adresses.id
LEFT JOIN received ON received.rid=adresses.id
WHERE sent.sid IS NULL AND received.rid IS NULL

Upvotes: 1

Sashi Kant
Sashi Kant

Reputation: 13465

Try this: delete from adresses a left join sent s on (a.sentid=s.id) where s.id is null

Upvotes: 0

Related Questions