eversor
eversor

Reputation: 3073

Oddly slow behaviour IN clause SQL

I found and odd performance while writing a query, this was the original query (BTW I am using MySQL 5.5.8)

What I need to do is to select id, a, b c,d attributes from all elements (and all ocurrences of that elements) that have new tuples inserted in a time (time) between now and 2012-03-13 10:41:34.8431 (e.g.) but also, these elements must have had a number of ocurrences HAVING COUNT(id) >= '5'

SELECT id, a, b, c,d FROM table1 WHERE
id IN (SELECT id FROM table1 WHERE id IN (SELECT id FROM 
table1 WHERE time >= '2012-03-13 10:41:34.8431' AND a = '1') HAVING COUNT(id) >= '5')

In table1 there are aprox 700 tuples. This query lasts 14.5s

Since it was not acceptable to take this long, I separated both querys just to test:

SELECT id FROM table1 WHERE id IN (SELECT id FROM 
table1 WHERE time >= '2012-03-13 10:41:34.8431' AND a = '1') HAVING COUNT(id) >= '5'

Takes 0.025s and returns one tuple with id = 6
Since i knew the result, and just for testing I executed the other part of the query as:

SELECT id, a, b, c,d FROM table1 WHERE id IN (6)

And it took 0.0012s

Now, what happens there? I have no clue, just some unfounded speculations. Why there is so much difference between them?
And if this is normal, are there any known workarrounds?

Upvotes: 2

Views: 895

Answers (2)

DRapp
DRapp

Reputation: 48139

I would reverse it... start with your list of IDs that are qualified, then use THAT to join to the original table to get the data. WHERE IN Sub-selects are always a root of performance hits

SELECT
      T1.id, 
      T1.a, 
      T1.b, 
      T1.c,
      T1.d 
   FROM
      ( SELECT id, count(*)
           FROM table1
           WHERE a = 1
             AND time >= '2012-03-13 10:41:34.8431'
           group by id
           having count(*) > 4 ) as PreQualified
      JOIN table1 T1
         ON PreQualified.ID = T1.ID
        AND T1.a = 1
        AND T1.time >= '2012-03-13 10:41:34.8431'

I re-applied same criteria to the JOIN clause since I don't know the rest of the data.. as I don't know if you could have an ID that has prior times than the time indicated and want to exclude those too, but retain the premise that the ID must AT LEAST have "a = 1" and be on/after the time period specified.

By starting with a pre-filtered/qualified list of just IDs, you don't have to join every other ID, then throw them out when their total count is not enough.

Per input from Ami, I'm using his context to clarify dependency queries...

You should view the EXPLAIN result on your queries to determine if your query is running the subquery for each row of the outer query, which will appear in your EXPLAIN result as a DEPENDENT QUERY.

You can often turn a dependent subquery into a DERIVED table by joining on it, as is being done here.

Upvotes: 4

Kevin Burton
Kevin Burton

Reputation: 11936

For each row in the outer query the inner query is executed :(

'A typical case for poor IN subquery performance is when the subquery returns a small number of rows but the outer query returns a large number of rows to be compared to the subquery result.

The problem is that, for a statement that uses an IN subquery, the optimizer rewrites it as a correlated subquery. '

from http://dev.mysql.com/doc/refman/5.0/en/subquery-restrictions.html

Upvotes: 3

Related Questions