Reputation: 1956
I have the following SQL query
SELECT *
FROM `sensor_data` AS `sd1`
WHERE (sd1.timestamp BETWEEN '2017-05-13 00:00:00'
AND '2017-05-14 00:00:00')
AND (`id` =
(
SELECT `id`
FROM `sensor_data` AS `sd2`
WHERE sd1.mid = sd2.mid
AND sd1.sid = sd2.sid
ORDER BY `value` DESC, `id` DESC
LIMIT 1)
)
Background:
I've checked the validity of the query by changing LIMIT 1
to LIMIT 0
, and the query works without any problem. However with LIMIT 1
the query doesn't complete, it just states loading
until I shutdown and restart.
Breaking the Query down:
I have broken down the query with the date boundary as follows:
SELECT *
FROM `sensor_data` AS `sd1`
WHERE (sd1.timestamp BETWEEN '2017-05-13 00:00:00'
AND '2017-05-14 00:00:00')
This takes about 0.24 seconds to return the query with 8200 rows each having 5 columns.
Question:
I suspect the second half of my Query, is not correct or well optimized. The tables are as follows:
Current Table:
+------+-------+-------+-----+-----------------------+
| id | mid | sid | v | timestamp |
+------+-------+-------+-----+-----------------------+
| 51 | 10 | 1 | 40 | 2015-05-13 11:56:01 |
| 52 | 10 | 2 | 39 | 2015-05-13 11:56:25 |
| 53 | 10 | 2 | 40 | 2015-05-13 11:56:42 |
| 54 | 10 | 2 | 40 | 2015-05-13 11:56:45 |
| 55 | 10 | 2 | 40 | 2015-05-13 11:57:01 |
| 56 | 11 | 1 | 50 | 2015-05-13 11:57:52 |
| 57 | 11 | 2 | 18 | 2015-05-13 11:58:41 |
| 58 | 11 | 2 | 19 | 2015-05-13 11:58:59 |
| 59 | 11 | 3 | 58 | 2015-05-13 11:59:01 |
| 60 | 11 | 3 | 65 | 2015-05-13 11:59:29 |
+------+-------+-------+-----+-----------------------+
Q: How would I get the MAX(v)
for each sid
for each mid
?
NB#1: In the example above ROW
53
, 54
, 55
have all the same value (40
), but I would like to retrieve the row with the most recent timestamp, which is ROW
55
.
Expected Output:
+------+-------+-------+-----+-----------------------+
| id | mid | sid | v | timestamp |
+------+-------+-------+-----+-----------------------+
| 51 | 10 | 1 | 40 | 2015-05-13 11:56:01 |
| 55 | 10 | 2 | 40 | 2015-05-13 11:57:01 |
| 56 | 11 | 1 | 50 | 2015-05-13 11:57:52 |
| 58 | 11 | 2 | 19 | 2015-05-13 11:58:59 |
| 60 | 11 | 3 | 65 | 2015-05-13 11:59:29 |
+------+-------+-------+-----+-----------------------+
Structure of the table:
NB#2: Since this table has over 110 million entries, it is critical to have have date boundaries, which limits to ~8000 entries over a 24 hour period.
Upvotes: 1
Views: 160
Reputation: 142278
IN ( SELECT ... )
does not optimize well. It is even worse because of being correlated.
What you are looking for is a groupwise-max .
Please provide SHOW CREATE TABLE
; we need to know at least what the PRIMARY KEY
is.
Suggested code
You will need:
WHERE
: INDEX(timestamp, mid, sid, v, id)
WHERE
: INDEX(mid, sid, v, timestamp, id)
Code:
SELECT id, mid, sid, v, timestamp
FROM ( SELECT @prev_mid := 99999, -- some value not in table
@prev_sid := 99999,
@n := 0 ) AS init
JOIN (
SELECT @n := if(mid != @prev_mid OR
sid != @prev_sid,
1, @n + 1) AS n,
@prev_mid := mid,
@prev_sid := sid,
id, mid, sid, v, timestamp
FROM sensor_data
WHERE timestamp >= '2017-05-13'
timestamp < '2017-05-13' + INTERVAL 1 DAY
ORDER BY mid DESC, sid DESC, v DESC, timestamp DESC
) AS x
WHERE n = 1
ORDER BY mid, sid; -- optional
Notes:
ORDER BY
is optional; the results may be in reverse order.DESC
in the inner ORDER BY
must be in place to work correctly (unless you are using MySQL 8.0).WHERE
avoids including both midnights? And avoids manually computing leap-days, year-ends, etc?WHERE
(and associated INDEX
), there will be filtering, but a 'sort'.WHERE
(and the other INDEX
), sort will not be needed.You can test the performance of any competing formulations via this trick, even if you do not have enough rows (yet) to get reliable timings:
FLUSH STATUS;
SELECT ...
SHOW SESSION STATUS LIKE 'Handler%';
This can also be used to compare different versions of MySQL and MariaDB -- I have seen 3 significantly different performance characteristics in a related groupwise-max test.
Upvotes: 0
Reputation: 1346
The query can be written as follows:
SELECT t1.id, t1.mid, t1.sid, t1.v, t1.ts
FROM yourtable t1
INNER JOIN (
SELECT mid, sid, MAX(v) as v
FROM yourtable
WHERE ts BETWEEN '2015-05-13 00:00:00' AND '2015-05-14 00:00:00'
GROUP BY mid, sid
) t2
ON t1.mid = t2.mid
AND t1.sid = t2.sid
AND t1.v = t2.v
INNER JOIN (
SELECT mid, sid, v, MAX(ts) as ts
FROM yourtable
WHERE ts BETWEEN '2015-05-13 00:00:00' AND '2015-05-14 00:00:00'
GROUP BY mid, sid, v
) t3
ON t1.mid = t3.mid
AND t1.sid = t3.sid
AND t1.v = t3.v
AND t1.ts = t3.ts;
Edit and Explanation:
The first sub-query (first INNER JOIN
) fetches MAX(v)
per (mid, sid)
combination. The second sub-query is to identify MAX(ts)
for every (mid, sid, v)
. At this point, the two queries do not influence each others' results. It is also important to note that ts
date range selection is done in the two sub-queries independently such that the final query has fewer rows to examine and no additional WHERE
filters to apply.
Effectively, this translates into getting MAX(v)
per (mid, sid)
combination initially (first sub-query); and if there is more than one record with the same value MAX(v)
for a given (mid, sid)
combo, then the excess records get eliminated by the selection of MAX(ts)
for every (mid, sid, v)
combination obtained by the second sub-query. We then simply associate the output of the two queries by the two INNER JOIN
conditions to get to the id
of the desired records.
Upvotes: 1
Reputation: 31
select * from sensor_data s1 where s1.v in (select max(v) from sensor_data s2 group by s2.mid)
union
select * from sensor_data s1 where s1.v in (select max(v) from sensor_data s2 group by s2.sid);
Upvotes: 0