Alex Gregory
Alex Gregory

Reputation: 21

MYSQL IN Optimisation

I'm having trouble adapting my SQL query's IN statement to a EXISTS. I understand IN is slower and this seems to be reflected in the query's performance.

SELECT 
    t.dt as 'Log Time',
    sn.name as 'Snake Name',
    sen.type as 'Sensor Type',
    t.temp as Temperature
FROM
    temps as t
        JOIN
    sensors as sen ON t.sensor = sen.sensorid
        JOIN
    locations as l ON sen.location = l.id
        JOIN
    snakes as sn ON sen.location = sn.location
WHERE
    dt IN (SELECT 
            max(dt)
        FROM
            temps
        GROUP BY sensor)
ORDER BY sn.name ASC , sen.type DESC
;

Any ideas of how to improve this would be appreciated.

Upvotes: 2

Views: 83

Answers (4)

Havenard
Havenard

Reputation: 27864

The problem with IN (SELECT subquery) is that MySQL "optimizes" it with what is often a very poor performance choice.

According to the documentation, a query with a template like this:

WHERE outer_expr IN (SELECT inner_expr FROM ... WHERE subquery_where)

Is automatically converted by the optimizer to this:

WHERE EXISTS (SELECT 1 FROM ... WHERE subquery_where AND outer_expr=inner_expr)

The problem is that this subquery is executed for every row in the outer SELECT. If we are talking about hundreds, thousands, or even hundreds of thousands of rows before WHERE, MySQL will take quite a while to digest this thing, iterating the same table over and over looking for each match individually. Not cool!

The solution is to force it to create and JOIN with a temporary table. The idea is that despite the fact of creating a temporary table is slower and takes more memory than a simple query, it is certainly faster than thousands of queries. So this is what you are going to do:

SELECT 
    t.dt `Log Time`,
    sn.name `Snake Name`,
    sen.type `Sensor Type`,
    t.temp `Temperature`
FROM
    temps t
        JOIN
    sensors sen ON t.sensor = sen.sensorid
        JOIN
    locations l ON sen.location = l.id
        JOIN
    snakes sn ON sen.location = sn.location
        JOIN
    (SELECT sensor, MAX(dt) maxdt FROM temps GROUP BY sensor) m
        ON m.maxdt = t.dt AND m.sensor = t.sensor
ORDER BY sn.name ASC, sen.type DESC;

When you JOIN with a subquery SELECT it won't make dumb assumptions. A temporary table is created behind the scene, and it will be performing a JOIN with it just as you told it to.

However, observe that this can be optimized further. We can do this by keeping a low number of records in game as you JOIN tables. The sooner you withdraw records you don't need from the query, the less MySQL will have to process on the subsequent JOINs and faster it will go. You can, for instance, filter the rows you need from temps straight from the beginning, just by reorganizing the JOIN sequence:

SELECT 
    t.dt `Log Time`,
    sn.name `Snake Name`,
    sen.type `Sensor Type`,
    t.temp `Temperature`
FROM
    (SELECT sensor, MAX(dt) maxdt FROM temps GROUP BY sensor) m
        JOIN
    temps t ON m.maxdt = t.dt AND m.sensor = t.sensor
        JOIN
    sensors sen ON t.sensor = sen.sensorid
        JOIN
    locations l ON sen.location = l.id
        JOIN
    snakes sn ON sen.location = sn.location
ORDER BY sn.name ASC, sen.type DESC;

This seemingly simple change is very significative for performance, this query should be considerably faster than the first one, specially if temps is a big table.

You can also use SELECT STRAIGHT_JOIN ... to enforce the order of JOINs in case it performs even better (it usually does).

Upvotes: 1

DRapp
DRapp

Reputation: 48139

Although it appears you have a solution and very good time, Havenard has a good point about not necessarilyt getting correct per sensor. I would propose the following.

On your temps table, have an index on (sensor, dt), then, your first from will be a select grouped by each sensor, so each sensor appears once with its respective date/time. Then, use that as basis to rejoin to temps by that same sensor/max(dt) and get rest of the data.

This is very close to what Havenard posted except I am moving my pre-query up front and adding "STRAIGHT_JOIN" to force the join in the order as written. Start with the very limited set via max date/time, then join the rest to get the descriptions and temp info.

SELECT STRAIGHT_JOIN
      t.dt `Log Time`,
      sn.name `Snake Name`,
      sen.type `Sensor Type`,
      t.temp `Temperature`
   FROM
      ( select t1.sensor, max( t1.dt ) as MaxDT
           from temps t1
           group by t1.sensor ) PreQuery
         JOIN temps t
            on PreQuery.sensor = t.sensor
           AND PreQuery.MaxDT = t.dt
        JOIN sensors sen 
           ON PreQuery.sensor = sen.sensorid
           JOIN locations l 
              ON sen.location = l.id
           JOIN snakes sn 
              ON sen.location = sn.location
   ORDER BY 
      sn.name,
      sen.type DESC;

Upvotes: 0

Alex Gregory
Alex Gregory

Reputation: 21

It turns out both EXISTS and IN aren't the optimal solution. After having a play around I came up with the following:

SELECT distinct
    t.dt as 'Log Time',
    sn.name as 'Snake Name',
    sen.type as 'Sensor Type',
    t.temp as Temperature
FROM
    (SELECT 
        *
    FROM
        temps
    ORDER BY dt DESC) as t
        JOIN
    sensors as sen ON t.sensor = sen.sensorid
        JOIN
    locations as l ON sen.location = l.id
        JOIN
    snakes as sn ON sen.location = sn.location
WHERE
    dt != '0000-00-00 00:00:00'
GROUP BY sensor
ORDER BY sn.name ASC , sen.type DESC

Which takes 0.047 sec to run as oppose to the original ~50 sec query.

Upvotes: 0

Daniel W.
Daniel W.

Reputation: 32290

You are right, IN subqueries are usually slower than EXISTS.

EXISTS works differently because you can use previous columns directly in it. You can also improve by using LIMIT to limit your subset to only the amount you really need. When using EXISTS, it doesn't matter what you select because it just asks: is there at least 1 row returned or not.

Make sure to use the prefix on every column.

SELECT 
    t.dt as 'Log Time',
    sn.name as 'Snake Name',
    sen.type as 'Sensor Type',
    t.temp as Temperature
FROM
    temps as t
        JOIN
    sensors as sen ON t.sensor = sen.sensorid
        JOIN
    locations as l ON sen.location = l.id
        JOIN
    snakes as sn ON sen.location = sn.location
WHERE
    EXISTS(
        SELECT 'hi'
        FROM   temps
        GROUP  BY temps.sensor
        HAVING max(temps.dt) = t.dt
        LIMIT 1
    )
ORDER BY sn.name ASC , sen.type DESC
;

Upvotes: 0

Related Questions