JOIN vs. WHERE: Why do two queries that obtain identical results exhibit 3-4 orders of magnitude performance difference?

Question

Earlier tonight, I asked this question on StackOverflow regarding how to write a SQL query to filter rows from a table by returning only rows with duplicates in one field.

Here is the question, repeated for convenience:

If I have this data:

code1 code2
  1    10       <-- Desired (1 appears more than once)
  1    11       <-- Desired (1 appears more than once)
  2    20
  3    30       <-- Desired (3 appears more than once)
  3    31       <-- Desired (3 appears more than once)
  4    40
  5    50

... And I want to write a single SQL query whose results are this:

code1 code2
  1    10       <-- This result appears because 1 appears more than once above
  1    11       <-- This result appears because 1 appears more than once above
  3    30       <-- This result appears because 3 appears more than once above
  3    31       <-- This result appears because 3 appears more than once above

(i.e, a single SQL query that returns all rows for which any data in the code1 column appears more than once)...

How do I do it?

I received an answer with two possible SQL queries, both of which work perfectly.

Successful SQL #1:

SELECT code1, code2
FROM myTable
WHERE code1 IN 
    (SELECT code1 FROM myTable GROUP BY code1 HAVING COUNT(code1) > 1)

Successful SQL #2:

SELECT t.code1, code2
FROM myTable t
  INNER JOIN
    (SELECT code1 FROM myTable GROUP BY code1 HAVING COUNT(code1) > 1)
     s on s.code1 = t.code1

As I describe in a comment beneath the answer:

myTable has ~30000 rows, with only about 400 duplicate groups, and almost always just 2 entries per duplicate group. On my MySQL instance running on a high-end workstation, SQL #1 takes ~30 minutes to execute, whereas SQL #2 requires a fraction of a second.

This is a three to four orders of magnitude difference in performance between the two queries above.

It troubles me that it is not immediately obvious to me, looking at the queries, why one should perform three orders of magnitude better than the other in my use-case.

I would like to have a better understanding of the internals of SQL execution, and this particular example is excellent to assist with that.

My question is: Why does SQL #2 exhibit performance that is about 5,000 times faster than SQL #1 in my use case?

GarethD · Accepted Answer

MySQL has known issues with optimising queries involving correlated subqueries, or subselects. Up until version 5.6.5 it does not materialise subqueries, however it will materialise a derived table used in a join.

In essence this means that when you use a join, the first time the subquery is encountered MySQL will perform the following:

SELECT code1 FROM myTable GROUP BY code1 HAVING COUNT(code1) > 1

And keep the results in a temporary table (which is hashed to make lookups faster), then for each value in myTable it will lookup against the temporary table to see if the code is there.

However, since when you use IN the subquery is not materialised and is rewritten like:

SELECT t1.code1, t1.code2
FROM myTable t1
WHERE EXISTS
    (   SELECT t2.code1 
        FROM myTable t2
        WHERE t2.Code1 = t1.Code1
        GROUP BY t2.code1 
        HAVING COUNT(t2.code1) > 1
    )

Which means that for each code in myTable, it runs the subquery again. Which when your outer query is very narrow is fine, as it is more efficient to only run the subquery a few times, than run it for all values and store the results in a temporary table, however when your outer query is wide, it results in the inner query executing many times, and this is where the performance difference kicks in.

So for your row counts, instead of running the subquery ~30,000 times, you run it once, then lookup ~30,000 rows against a hashed temporary table with only 400 rows in. This would account for such a drastic performance difference.

This article in the online docs explains subquery optimisation in much more depth.

JOIN vs. WHERE: Why do two queries that obtain identical results exhibit 3-4 orders of magnitude performance difference?

Answers (1)

Related Questions