3kstc
3kstc

Reputation: 1956

Optimizing SQL Query for max value with various conditions from a single MySQL table

I have the following SQL query

SELECT  *
    FROM  `sensor_data` AS `sd1`
    WHERE  (sd1.timestamp BETWEEN '2017-05-13 00:00:00'
                              AND '2017-05-14 00:00:00')
      AND  (`id` = 
            (
                SELECT  `id`
                    FROM  `sensor_data` AS `sd2`
                    WHERE  sd1.mid = sd2.mid
                      AND  sd1.sid = sd2.sid
                    ORDER BY  `value` DESC, `id` DESC
                    LIMIT  1)
           ) 

Background:

I've checked the validity of the query by changing LIMIT 1 to LIMIT 0, and the query works without any problem. However with LIMIT 1 the query doesn't complete, it just states loading until I shutdown and restart.

Breaking the Query down:

I have broken down the query with the date boundary as follows:

SELECT  *
    FROM  `sensor_data` AS `sd1`
    WHERE  (sd1.timestamp BETWEEN '2017-05-13 00:00:00'
                              AND '2017-05-14 00:00:00') 

This takes about 0.24 seconds to return the query with 8200 rows each having 5 columns.

Question:

I suspect the second half of my Query, is not correct or well optimized. The tables are as follows:

Current Table:

+------+-------+-------+-----+-----------------------+
|  id  |  mid  |  sid  |  v  |       timestamp       |
+------+-------+-------+-----+-----------------------+
|  51  |  10   |   1   |  40 |  2015-05-13 11:56:01  |
|  52  |  10   |   2   |  39 |  2015-05-13 11:56:25  |
|  53  |  10   |   2   |  40 |  2015-05-13 11:56:42  |
|  54  |  10   |   2   |  40 |  2015-05-13 11:56:45  |
|  55  |  10   |   2   |  40 |  2015-05-13 11:57:01  |
|  56  |  11   |   1   |  50 |  2015-05-13 11:57:52  |
|  57  |  11   |   2   |  18 |  2015-05-13 11:58:41  |
|  58  |  11   |   2   |  19 |  2015-05-13 11:58:59  |
|  59  |  11   |   3   |  58 |  2015-05-13 11:59:01  |
|  60  |  11   |   3   |  65 |  2015-05-13 11:59:29  |
+------+-------+-------+-----+-----------------------+

Q: How would I get the MAX(v)for each sid for each mid? NB#1: In the example above ROW 53, 54, 55 have all the same value (40), but I would like to retrieve the row with the most recent timestamp, which is ROW 55.

Expected Output:

+------+-------+-------+-----+-----------------------+
|  id  |  mid  |  sid  |  v  |       timestamp       |
+------+-------+-------+-----+-----------------------+
|  51  |  10   |   1   |  40 |  2015-05-13 11:56:01  |
|  55  |  10   |   2   |  40 |  2015-05-13 11:57:01  |
|  56  |  11   |   1   |  50 |  2015-05-13 11:57:52  |
|  58  |  11   |   2   |  19 |  2015-05-13 11:58:59  |
|  60  |  11   |   3   |  65 |  2015-05-13 11:59:29  |
+------+-------+-------+-----+-----------------------+

Structure of the table:

enter image description here

NB#2: Since this table has over 110 million entries, it is critical to have have date boundaries, which limits to ~8000 entries over a 24 hour period.

Upvotes: 1

Views: 160

Answers (3)

Rick James
Rick James

Reputation: 142278

IN ( SELECT ... ) does not optimize well. It is even worse because of being correlated.

What you are looking for is a groupwise-max .

Please provide SHOW CREATE TABLE; we need to know at least what the PRIMARY KEY is.

Suggested code

You will need:

  • With the WHERE: INDEX(timestamp, mid, sid, v, id)
  • Without the WHERE: INDEX(mid, sid, v, timestamp, id)

Code:

SELECT id, mid, sid, v, timestamp
    FROM ( SELECT @prev_mid := 99999,   -- some value not in table
                  @prev_sid := 99999,
                  @n := 0 ) AS init
    JOIN (
             SELECT  @n := if(mid != @prev_mid OR
                              sid != @prev_sid,
                              1, @n + 1) AS n,
                     @prev_mid := mid,
                     @prev_sid := sid,
                     id, mid, sid, v, timestamp
                 FROM sensor_data
                 WHERE timestamp >= '2017-05-13'
                       timestamp  < '2017-05-13' + INTERVAL 1 DAY
                 ORDER BY mid DESC, sid DESC, v DESC, timestamp DESC
         ) AS x
    WHERE n = 1
    ORDER BY mid, sid;   -- optional

Notes:

  • The index is 'composite' and 'covering'.
  • This should make one pass over the index, thereby providing 'good' performance.
  • The final ORDER BY is optional; the results may be in reverse order.
  • All the DESC in the inner ORDER BY must be in place to work correctly (unless you are using MySQL 8.0).
  • Note how the WHERE avoids including both midnights? And avoids manually computing leap-days, year-ends, etc?
  • With the WHERE (and associated INDEX), there will be filtering, but a 'sort'.
  • Without the WHERE (and the other INDEX), sort will not be needed.

You can test the performance of any competing formulations via this trick, even if you do not have enough rows (yet) to get reliable timings:

FLUSH STATUS;
SELECT ...
SHOW SESSION STATUS LIKE 'Handler%';

This can also be used to compare different versions of MySQL and MariaDB -- I have seen 3 significantly different performance characteristics in a related groupwise-max test.

Upvotes: 0

Dhruv Saxena
Dhruv Saxena

Reputation: 1346

The query can be written as follows:

SELECT t1.id, t1.mid, t1.sid, t1.v, t1.ts
FROM yourtable t1
INNER JOIN (
        SELECT mid, sid, MAX(v) as v 
            FROM yourtable
        WHERE ts BETWEEN '2015-05-13 00:00:00' AND '2015-05-14 00:00:00'
        GROUP BY mid, sid
    ) t2
    ON  t1.mid = t2.mid
    AND t1.sid = t2.sid
    AND t1.v   = t2.v
INNER JOIN (
        SELECT mid, sid, v, MAX(ts) as ts
            FROM yourtable
        WHERE ts BETWEEN '2015-05-13 00:00:00' AND '2015-05-14 00:00:00'
        GROUP BY mid, sid, v
    ) t3
    ON  t1.mid = t3.mid
    AND t1.sid = t3.sid
    AND t1.v   = t3.v
    AND t1.ts  = t3.ts; 

Edit and Explanation:

The first sub-query (first INNER JOIN) fetches MAX(v) per (mid, sid) combination. The second sub-query is to identify MAX(ts) for every (mid, sid, v). At this point, the two queries do not influence each others' results. It is also important to note that ts date range selection is done in the two sub-queries independently such that the final query has fewer rows to examine and no additional WHERE filters to apply.

Effectively, this translates into getting MAX(v) per (mid, sid) combination initially (first sub-query); and if there is more than one record with the same value MAX(v) for a given (mid, sid) combo, then the excess records get eliminated by the selection of MAX(ts) for every (mid, sid, v) combination obtained by the second sub-query. We then simply associate the output of the two queries by the two INNER JOIN conditions to get to the id of the desired records.

Demo

Upvotes: 1

G.y
G.y

Reputation: 31

select * from sensor_data s1 where s1.v in (select max(v) from sensor_data s2 group by s2.mid)

union

select * from sensor_data s1 where s1.v in (select max(v) from sensor_data s2 group by s2.sid);

Upvotes: 0

Related Questions