Reputation: 21
I'm having trouble adapting my SQL query's IN statement to a EXISTS. I understand IN is slower and this seems to be reflected in the query's performance.
SELECT
t.dt as 'Log Time',
sn.name as 'Snake Name',
sen.type as 'Sensor Type',
t.temp as Temperature
FROM
temps as t
JOIN
sensors as sen ON t.sensor = sen.sensorid
JOIN
locations as l ON sen.location = l.id
JOIN
snakes as sn ON sen.location = sn.location
WHERE
dt IN (SELECT
max(dt)
FROM
temps
GROUP BY sensor)
ORDER BY sn.name ASC , sen.type DESC
;
Any ideas of how to improve this would be appreciated.
Upvotes: 2
Views: 83
Reputation: 27864
The problem with IN (SELECT subquery)
is that MySQL "optimizes" it with what is often a very poor performance choice.
According to the documentation, a query with a template like this:
WHERE outer_expr IN (SELECT inner_expr FROM ... WHERE subquery_where)
Is automatically converted by the optimizer to this:
WHERE EXISTS (SELECT 1 FROM ... WHERE subquery_where AND outer_expr=inner_expr)
The problem is that this subquery is executed for every row in the outer SELECT
. If we are talking about hundreds, thousands, or even hundreds of thousands of rows before WHERE
, MySQL will take quite a while to digest this thing, iterating the same table over and over looking for each match individually. Not cool!
The solution is to force it to create and JOIN
with a temporary table. The idea is that despite the fact of creating a temporary table is slower and takes more memory than a simple query, it is certainly faster than thousands of queries. So this is what you are going to do:
SELECT
t.dt `Log Time`,
sn.name `Snake Name`,
sen.type `Sensor Type`,
t.temp `Temperature`
FROM
temps t
JOIN
sensors sen ON t.sensor = sen.sensorid
JOIN
locations l ON sen.location = l.id
JOIN
snakes sn ON sen.location = sn.location
JOIN
(SELECT sensor, MAX(dt) maxdt FROM temps GROUP BY sensor) m
ON m.maxdt = t.dt AND m.sensor = t.sensor
ORDER BY sn.name ASC, sen.type DESC;
When you JOIN
with a subquery SELECT
it won't make dumb assumptions. A temporary table is created behind the scene, and it will be performing a JOIN
with it just as you told it to.
However, observe that this can be optimized further. We can do this by keeping a low number of records in game as you JOIN
tables. The sooner you withdraw records you don't need from the query, the less MySQL will have to process on the subsequent JOIN
s and faster it will go. You can, for instance, filter the rows you need from temps
straight from the beginning, just by reorganizing the JOIN
sequence:
SELECT
t.dt `Log Time`,
sn.name `Snake Name`,
sen.type `Sensor Type`,
t.temp `Temperature`
FROM
(SELECT sensor, MAX(dt) maxdt FROM temps GROUP BY sensor) m
JOIN
temps t ON m.maxdt = t.dt AND m.sensor = t.sensor
JOIN
sensors sen ON t.sensor = sen.sensorid
JOIN
locations l ON sen.location = l.id
JOIN
snakes sn ON sen.location = sn.location
ORDER BY sn.name ASC, sen.type DESC;
This seemingly simple change is very significative for performance, this query should be considerably faster than the first one, specially if temps
is a big table.
You can also use SELECT STRAIGHT_JOIN ...
to enforce the order of JOIN
s in case it performs even better (it usually does).
Upvotes: 1
Reputation: 48139
Although it appears you have a solution and very good time, Havenard has a good point about not necessarilyt getting correct per sensor. I would propose the following.
On your temps table, have an index on (sensor, dt), then, your first from will be a select grouped by each sensor, so each sensor appears once with its respective date/time. Then, use that as basis to rejoin to temps by that same sensor/max(dt) and get rest of the data.
This is very close to what Havenard posted except I am moving my pre-query up front and adding "STRAIGHT_JOIN" to force the join in the order as written. Start with the very limited set via max date/time, then join the rest to get the descriptions and temp info.
SELECT STRAIGHT_JOIN
t.dt `Log Time`,
sn.name `Snake Name`,
sen.type `Sensor Type`,
t.temp `Temperature`
FROM
( select t1.sensor, max( t1.dt ) as MaxDT
from temps t1
group by t1.sensor ) PreQuery
JOIN temps t
on PreQuery.sensor = t.sensor
AND PreQuery.MaxDT = t.dt
JOIN sensors sen
ON PreQuery.sensor = sen.sensorid
JOIN locations l
ON sen.location = l.id
JOIN snakes sn
ON sen.location = sn.location
ORDER BY
sn.name,
sen.type DESC;
Upvotes: 0
Reputation: 21
It turns out both EXISTS and IN aren't the optimal solution. After having a play around I came up with the following:
SELECT distinct
t.dt as 'Log Time',
sn.name as 'Snake Name',
sen.type as 'Sensor Type',
t.temp as Temperature
FROM
(SELECT
*
FROM
temps
ORDER BY dt DESC) as t
JOIN
sensors as sen ON t.sensor = sen.sensorid
JOIN
locations as l ON sen.location = l.id
JOIN
snakes as sn ON sen.location = sn.location
WHERE
dt != '0000-00-00 00:00:00'
GROUP BY sensor
ORDER BY sn.name ASC , sen.type DESC
Which takes 0.047 sec to run as oppose to the original ~50 sec query.
Upvotes: 0
Reputation: 32290
You are right, IN
subqueries are usually slower than EXISTS
.
EXISTS
works differently because you can use previous columns directly in it. You can also improve by using LIMIT
to limit your subset to only the amount you really need. When using EXISTS
, it doesn't matter what you select because it just asks: is there at least 1 row returned or not.
Make sure to use the prefix on every column.
SELECT
t.dt as 'Log Time',
sn.name as 'Snake Name',
sen.type as 'Sensor Type',
t.temp as Temperature
FROM
temps as t
JOIN
sensors as sen ON t.sensor = sen.sensorid
JOIN
locations as l ON sen.location = l.id
JOIN
snakes as sn ON sen.location = sn.location
WHERE
EXISTS(
SELECT 'hi'
FROM temps
GROUP BY temps.sensor
HAVING max(temps.dt) = t.dt
LIMIT 1
)
ORDER BY sn.name ASC , sen.type DESC
;
Upvotes: 0