Reputation: 2498
I wanted to run the following query:
-- Main Query
SELECT COUNT(*) FROM table_name WHERE device_id IN
(SELECT DISTINCT device_id FROM table_name WHERE NAME = 'SOME_PARA')
This following query (sub query from Main Query):
SELECT DISTINCT device_id FROM table_name WHERE NAME = 'SOME_PARA'
executes in 7 seconds, giving 2691 rows from a table of 2.1M rows.
I fired the Main Query above and it is still executing after 5 mins+ of waiting.
Finally, I executed the sub query separately, took the 2691 records from the result, executed the following query:
-- Main Query (improvised)
SELECT COUNT(*) FROM table_name WHERE device_id IN
("device_id_1", "device_id_2", ....., "device_id_2691")
Surprisingly, this gave me an answer within 40 seconds.
What gives? Why doesn't MySQL use the same technique that I used and give an answer quickly? Am I doing something wrong?
Upvotes: 11
Views: 1300
Reputation: 53870
Unfortunately, MySQL is not very good at optimizing subqueries with IN. This is from MySQL documentation:
Subquery optimization for IN is not as effective as for the = operator or for the IN(value_list) operator.
A typical case for poor IN subquery performance is when the subquery returns a small number of rows but the outer query returns a large number of rows to be compared to the subquery result.
The problem is that, for a statement that uses an IN subquery, the optimizer rewrites it as a correlated subquery. Consider the following statement that uses an uncorrelated subquery:
SELECT ... FROM t1 WHERE t1.a IN (SELECT b FROM t2);
The optimizer rewrites the statement to a correlated subquery:
SELECT ... FROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.b = t1.a);
If the inner and outer queries return M and N rows, respectively, the execution time becomes on the order of O(M×N), rather than O(M+N) as it would be for an uncorrelated subquery.
An implication is that an IN subquery can be much slower than a query written using an IN(value_list) operator that lists the same values that the subquery would return.
Try using a JOIN instead.
Because MySQL works from the inside out, sometimes you can trick MySQL by wrapping the subquery inside yet another subquery like so:
SELECT COUNT(*) FROM table_name WHERE device_id IN
(SELECT * FROM (SELECT DISTINCT device_id FROM table_name WHERE NAME = 'SOME_PARA') tmp)
Here's the JOIN solution:
SELECT COUNT(DISTINCT t2.id) FROM table_name t1
JOIN table_name t2
ON t2.device_id = t1.device_id
WHERE t1.NAME = 'SOME_PARA'
Notice that I start from the inside and go out also.
Upvotes: 5
Reputation: 1270633
I think you could rewrite the query as:
SELECT sum(NumOnDevice)
from (SELECT device_id, count(*) as NumOnDevice
FROM table_name
having sum(case when NAME = 'SOME_PARA' then 1 else 0 end) > 0
) t
I realize this doesn't answer your question, but it might help you.
In terms of optimization, there is a world of difference between giving a query a bunch of constants and giving a query a subquery (even if the results are the same). In the first case, the query optimizer has much more information for deciding on the query plan. In the second, the information is not available at compile-time.
Mysql -- more than most databases -- seems to generate a query plan based on how the query is expressed. SQL was designed as a declarative language rather than a procedural language. That means that SQL queries describe the desired result set, and the query engine is supposed to decide on the best way to achieve that result. However, there are many cases where one has to help out the database engine to get the best results.
Upvotes: 2
Reputation: 7590
Edit: I have no idea what the reason is for MySQL's stupidity in this case :), this bug report seems relevant to the case. The workaround is to use a JOIN
SELECT
COUNT(t1.device_id)
FROM table_name t1
JOIN (
SELECT DISTINCT device_id FROM table_name WHERE NAME = 'SOME_PARA'
) as t2 ON t2.device_id = t1.device_id
Upvotes: 4
Reputation: 11054
Look at what you are asking MySQL to do, it's going to have to look at every record in table_name, determine if the device_id is in a list that it gets by running a query, then decide if it's adding it to the count. So it's running the subquery 2.1M times.
This is also why when that list is manually defined it can chomp through it rather quickly.
Upvotes: 1