Reputation: 2241
I'm trying to optimize the following query:
SELECT a2 AS 'b_actual_pair',
a1 AS 'c_actual_date',
a3 AS 'd_actual_value',
b1 AS 'e_1m_date',
b3 AS 'f_1m_value',
c1 AS 'g_2m_date',
c3 AS 'h_2m_value',
d1 AS 'i_3m_date',
d3 AS 'j_3m_value',
e1 AS 'k_4m_date',
e3 AS 'l_4m_value',
f1 AS 'm_5m_date',
f3 AS 'n_5m_value'
FROM (SELECT crd.b_date AS 'a1',
crd.c_pair AS 'a2',
crd.d_value AS 'a3'
FROM item_raw_data crd
WHERE crd.a_unique_id > ( (SELECT crd.a_unique_id
FROM item_raw_data crd
ORDER BY crd.a_unique_id DESC
LIMIT 0, 1) - ((SELECT
Count(DISTINCT c_pair)
FROM item_raw_data)) )
ORDER BY crd.b_date DESC) a,
(SELECT crd.b_date AS 'b1',
crd.c_pair AS 'b2',
crd.d_value AS 'b3'
FROM item_raw_data crd
WHERE crd.b_date < ( (SELECT crd.b_date
FROM item_raw_data crd
ORDER BY crd.b_date DESC
LIMIT 0, 1) - INTERVAL 1 minute )
AND crd.a_unique_id > ( (SELECT Max(x.a_unique_id)
FROM (SELECT crd.a_unique_id,
crd.b_date,
crd.c_pair,
crd.d_value
FROM item_raw_data crd
WHERE crd.b_date < (
(SELECT crd.b_date
FROM item_raw_data
crd
ORDER BY crd.b_date
DESC
LIMIT 0, 1) -
INTERVAL 1 minute )
ORDER BY crd.b_date DESC) x) -
((SELECT Count(DISTINCT c_pair)
FROM item_raw_data)) )
ORDER BY crd.b_date DESC) b,
(SELECT crd.b_date AS 'c1',
crd.c_pair AS 'c2',
crd.d_value AS 'c3'
FROM item_raw_data crd
WHERE crd.b_date < ( (SELECT crd.b_date
FROM item_raw_data crd
ORDER BY crd.b_date DESC
LIMIT 0, 1) - INTERVAL 2 minute )
AND crd.a_unique_id > ( (SELECT Max(x.a_unique_id)
FROM (SELECT crd.a_unique_id,
crd.b_date,
crd.c_pair,
crd.d_value
FROM item_raw_data crd
WHERE crd.b_date < (
(SELECT crd.b_date
FROM item_raw_data
crd
ORDER BY crd.b_date
DESC
LIMIT 0, 1) -
INTERVAL 2 minute )
ORDER BY crd.b_date DESC) x) -
((SELECT Count(DISTINCT c_pair)
FROM item_raw_data)) )
ORDER BY crd.b_date DESC) c,
(SELECT crd.b_date AS 'd1',
crd.c_pair AS 'd2',
crd.d_value AS 'd3'
FROM item_raw_data crd
WHERE crd.b_date < ( (SELECT crd.b_date
FROM item_raw_data crd
ORDER BY crd.b_date DESC
LIMIT 0, 1) - INTERVAL 3 minute )
AND crd.a_unique_id > ( (SELECT Max(x.a_unique_id)
FROM (SELECT crd.a_unique_id,
crd.b_date,
crd.c_pair,
crd.d_value
FROM item_raw_data crd
WHERE crd.b_date < (
(SELECT crd.b_date
FROM item_raw_data
crd
ORDER BY crd.b_date
DESC
LIMIT 0, 1) -
INTERVAL 3 minute )
ORDER BY crd.b_date DESC) x) -
((SELECT Count(DISTINCT c_pair)
FROM item_raw_data)) )
ORDER BY crd.b_date DESC) d,
(SELECT crd.b_date AS 'e1',
crd.c_pair AS 'e2',
crd.d_value AS 'e3'
FROM item_raw_data crd
WHERE crd.b_date < ( (SELECT crd.b_date
FROM item_raw_data crd
ORDER BY crd.b_date DESC
LIMIT 0, 1) - INTERVAL 4 minute )
AND crd.a_unique_id > ( (SELECT Max(x.a_unique_id)
FROM (SELECT crd.a_unique_id,
crd.b_date,
crd.c_pair,
crd.d_value
FROM item_raw_data crd
WHERE crd.b_date < (
(SELECT crd.b_date
FROM item_raw_data
crd
ORDER BY crd.b_date
DESC
LIMIT 0, 1) -
INTERVAL 4 minute )
ORDER BY crd.b_date DESC) x) -
((SELECT Count(DISTINCT c_pair)
FROM item_raw_data)) )
ORDER BY crd.b_date DESC) e,
(SELECT crd.b_date AS 'f1',
crd.c_pair AS 'f2',
crd.d_value AS 'f3'
FROM item_raw_data crd
WHERE crd.b_date < ( (SELECT crd.b_date
FROM item_raw_data crd
ORDER BY crd.b_date DESC
LIMIT 0, 1) - INTERVAL 5 minute )
AND crd.a_unique_id > ( (SELECT Max(x.a_unique_id)
FROM (SELECT crd.a_unique_id,
crd.b_date,
crd.c_pair,
crd.d_value
FROM item_raw_data crd
WHERE crd.b_date < (
(SELECT crd.b_date
FROM item_raw_data
crd
ORDER BY crd.b_date
DESC
LIMIT 0, 1) -
INTERVAL 5 minute )
ORDER BY crd.b_date DESC) x) -
((SELECT Count(DISTINCT c_pair)
FROM item_raw_data)) )
ORDER BY crd.b_date DESC) f
WHERE
a.a2 = b.b2
and
b.b2 = c.c2
and
c.c2 = d.d2
and
d.d2 = e.e2
and
e.e2 = f.f2
The output of this query is the following:
1. The data behind items_raw_data
is the following:
-every 5 seconds 110 items are inserted in the database with their current prices
-actual_pair
or c_pair
is a reference to a main table which contains the full item description, not of much relevance
-luckily the 110 rows are inserted in 2 seconds, leaving a gap of 3 seconds. This makes it easier to build the queries.
2. The aim of this query is to use this data to do a chart with live prices but we need to further expand the query to (example) 10 minutes, 15 minutes, 1 hour, 2 hours
etc... So you will be able to see an items price with its value now and few moments back in time.
3. The problem is that this query takes 2.5 seconds to run with 450.000 (few hours data present and we need up to 1 week of data) total data rows and 6 set of data (actual, 1m, 2m, 3m, 4m, 5m).
What have we tried so far:
Using tables as Memory instead of INNO reduced the query times from 2.5 seconds to 2 seconds. The system has 64gb ECC RAM and a 12 core CPU and an NVMe drive, hardware shouldn't be a problem.
Finding all this data for each item individually yields worse results than all the items together.
Doing exactly the same in server-side language (Java) through threaded code is slower also.
Tried using Inner Join's instead of Where, similar result.
An easier to view query:
SELECT crd.b_date AS 'a1',
crd.c_pair AS 'a2',
crd.d_value AS 'a3'
FROM items_raw_data crd
WHERE crd.a_unique_id > ( (SELECT crd.a_unique_id
FROM items_raw_data crd
ORDER BY crd.a_unique_id DESC
LIMIT 0, 1) - ((SELECT Count(DISTINCT c_pair)
FROM items_raw_data)) )
ORDER BY crd.b_date DESC
And the results:
This is for the actual prices only.
Table description:
Update 1
Added explain:
Update 2
Here are the links to download the database, with the query also (own hosting server):
Item raw data SQL: https://cloud.technorah.com/index.php/s/sR3mdK2Oos2EbC3
SQL Query: https://cloud.technorah.com/index.php/s/bdndmLGAUfpduif
Update 3
Using @hunteke query gave a result of 4.7 seconds, which was very strange because the query and the advices seem logical.
Using @hunteke's tips we changed the following:
SELECT a_unique_id FROM item_raw_data ORDER BY a_unique_id DESC LIMIT 0, 1
to
SELECT MAX(a_unique_id) FROM item_raw_data
this lowered the query time from 2.8 seconds to 2.7 seconds. Adding USE INDEX(primary)
on main queries further improved the time from 2.7 seconds to 2.6 seconds.
Update 4
We failed at a basic task, using Timestamp
instead of int(11) - primary key
in the order by
. The latest inserted date is also the latest unique id inserted. So changing from ORDER BY crd.b_date
to ORDER BY crd.a_unique_id
lowered the query by more than 1 SECOND, from 2.6 to 1.3, almost half.
So the actual query looks like this, completly reworked. The execution time changed from 1.3 seconds to 0.55s
SELECT *
FROM
(SELECT sub.a_unique_id AS 'a0',
sub.b_date AS 'a1',
sub.c_pair AS 'a2',
sub.d_value AS 'a3'
FROM (SELECT *
FROM item_raw_data) sub,
(SELECT crd.a_unique_id AS 'max_id',
crd.b_date AS 'xdate'
FROM item_raw_data crd
ORDER BY crd.a_unique_id DESC
LIMIT 0, 1) aux
WHERE sub.b_date <= aux.xdate
AND sub.a_unique_id > ( aux.max_id - (SELECT
Count(DISTINCT c_pair) AS
max_rows
FROM item_raw_data)
)
ORDER BY sub.a_unique_id DESC) a,
(SELECT sub.a_unique_id AS 'b0',
sub.b_date AS 'b1',
sub.c_pair AS 'b2',
sub.d_value AS 'b3'
FROM (SELECT *
FROM item_raw_data) sub,
(SELECT crd.a_unique_id AS 'max_id',
crd.b_date AS 'xdate'
FROM item_raw_data crd
WHERE crd.b_date < (SELECT ( crdx.b_date - INTERVAL 1 minute )
FROM item_raw_data crdx
ORDER BY crdx.a_unique_id DESC
LIMIT 0, 1)
ORDER BY crd.a_unique_id DESC
LIMIT 0, 1) aux
WHERE sub.b_date <= aux.xdate
AND sub.a_unique_id > ( aux.max_id - (SELECT
Count(DISTINCT c_pair) AS
max_rows
FROM item_raw_data)
)
ORDER BY sub.a_unique_id DESC) b,
(SELECT sub.a_unique_id AS 'c0',
sub.b_date AS 'c1',
sub.c_pair AS 'c2',
sub.d_value AS 'c3'
FROM (SELECT *
FROM item_raw_data) sub,
(SELECT crd.a_unique_id AS 'max_id',
crd.b_date AS 'xdate'
FROM item_raw_data crd
WHERE crd.b_date < (SELECT ( crdx.b_date - INTERVAL 2 minute )
FROM item_raw_data crdx
ORDER BY crdx.a_unique_id DESC
LIMIT 0, 1)
ORDER BY crd.a_unique_id DESC
LIMIT 0, 1) aux
WHERE sub.b_date <= aux.xdate
AND sub.a_unique_id > ( aux.max_id - (SELECT
Count(DISTINCT c_pair) AS
max_rows
FROM item_raw_data)
)
ORDER BY sub.a_unique_id DESC) c,
(SELECT sub.a_unique_id AS 'd0',
sub.b_date AS 'd1',
sub.c_pair AS 'd2',
sub.d_value AS 'd3'
FROM (SELECT *
FROM item_raw_data) sub,
(SELECT crd.a_unique_id AS 'max_id',
crd.b_date AS 'xdate'
FROM item_raw_data crd
WHERE crd.b_date < (SELECT ( crdx.b_date - INTERVAL 3 minute )
FROM item_raw_data crdx
ORDER BY crdx.a_unique_id DESC
LIMIT 0, 1)
ORDER BY crd.a_unique_id DESC
LIMIT 0, 1) aux
WHERE sub.b_date <= aux.xdate
AND sub.a_unique_id > ( aux.max_id - (SELECT
Count(DISTINCT c_pair) AS
max_rows
FROM item_raw_data)
)
ORDER BY sub.a_unique_id DESC) d,
(SELECT sub.a_unique_id AS 'e0',
sub.b_date AS 'e1',
sub.c_pair AS 'e2',
sub.d_value AS 'e3'
FROM (SELECT *
FROM item_raw_data) sub,
(SELECT crd.a_unique_id AS 'max_id',
crd.b_date AS 'xdate'
FROM item_raw_data crd
WHERE crd.b_date < (SELECT ( crdx.b_date - INTERVAL 4 minute )
FROM item_raw_data crdx
ORDER BY crdx.a_unique_id DESC
LIMIT 0, 1)
ORDER BY crd.a_unique_id DESC
LIMIT 0, 1) aux
WHERE sub.b_date <= aux.xdate
AND sub.a_unique_id > ( aux.max_id - (SELECT
Count(DISTINCT c_pair) AS
max_rows
FROM item_raw_data)
)
ORDER BY sub.a_unique_id DESC) e,
(SELECT sub.a_unique_id AS 'f0',
sub.b_date AS 'f1',
sub.c_pair AS 'f2',
sub.d_value AS 'f3'
FROM (SELECT *
FROM item_raw_data) sub,
(SELECT crd.a_unique_id AS 'max_id',
crd.b_date AS 'xdate'
FROM item_raw_data crd
WHERE crd.b_date < (SELECT ( crdx.b_date - INTERVAL 5 minute )
FROM item_raw_data crdx
ORDER BY crdx.a_unique_id DESC
LIMIT 0, 1)
ORDER BY crd.a_unique_id DESC
LIMIT 0, 1) aux
WHERE sub.b_date <= aux.xdate
AND sub.a_unique_id > ( aux.max_id - (SELECT
Count(DISTINCT c_pair) AS
max_rows
FROM item_raw_data)
)
ORDER BY sub.a_unique_id DESC) f
WHERE
a.a2 = b.b2
AND
b.b2 = c.c2
AND
c.c2 = d.d2
AND
d.d2 = e.e2
AND
e.e2 = f.f2
While this is good, lowering from 1.3s almost to 0.55s, we can now use this but we are still looking for further improvements, not only to yield a better result, but an more in-depth understanding of optimization on such big queries and MySQL. Will keep updating the query execution time as table grows.
Upvotes: 1
Views: 85
Reputation: 3716
The EXPLAIN output says that the optimizer is choosing to ignore any indexes for 4 of the PRIMARY selections. I can't say with certainty why it's not, but I have a strong suspicion it's to do with the heavyish use of subqueries: MySQL is notoriously bad at optimizing subqueries, and the general advice is "don't use them." Not always tenable.
So the improvements that I can offer now:
When you need the maximum, use MAX
. Don't ORDER BY
and LIMIT
. The ORDER BY
requires O(n log n) operations to sort, while the MAX
function's worst case requires just O(n) to find the maximum value (scan through the entire table). If you have an index on that column with high-cardinality, then MAX
will likely take even less, at O(log n). You can see this slow down via all the filesort's in the EXPLAIN output.
-- okay
SELECT a_unique_id FROM item_raw_data ORDER BY a_unique_id DESC LIMIT 0, 1
-- better
SELECT MAX(a_unique_id) FROM item_raw_data
There should be no need to select columns that aren't utilized in subqueries. I would assume the query optimizer would get rid of them, but I have assumed wrong with MySQL before. Especially when it comes to subqueries. So, with a grain of salt:
-- okay
SELECT MAX(a_unique_id)
FROM (SELECT a_unique_id, b_date, c_pair, d_value FROM ...)
-- better
SELECT MAX(a_unique_id)
FROM (SELECT a_unique_id FROM ...)
The MySQL optimizer is ignoring indexes. You might give MySQL a strong hint to use the proper indexes with USE INDEX(<indexname>)
. Doing so results in a far simpler query plan that also makes better use of indexes:
+----+-------------+-------+------------+-------+-------------------------+-------------------------+---------+------+--------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+-------------------------+-------------------------+---------+------+--------+----------+----------------------------------------------------+
| 1 | PRIMARY | bb | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 1540 | 33.33 | Using where |
| 1 | PRIMARY | aa | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 110 | 0.90 | Using where; Using join buffer (Block Nested Loop) |
| 1 | PRIMARY | cc | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 2860 | 0.30 | Using where; Using join buffer (Block Nested Loop) |
| 1 | PRIMARY | dd | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 7568 | 0.30 | Using where; Using join buffer (Block Nested Loop) |
| 1 | PRIMARY | ee | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 10188 | 0.30 | Using where; Using join buffer (Block Nested Loop) |
| 1 | PRIMARY | ff | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 12972 | 0.30 | Using where; Using join buffer (Block Nested Loop) |
| 21 | SUBQUERY | ll | NULL | ALL | RAW_DATA_PAIR_UNIQUE_ID | NULL | NULL | NULL | 499198 | 50.00 | Using where |
| 17 | SUBQUERY | kk | NULL | ALL | RAW_DATA_PAIR_UNIQUE_ID | NULL | NULL | NULL | 499198 | 50.00 | Using where |
| 13 | SUBQUERY | jj | NULL | ALL | RAW_DATA_PAIR_UNIQUE_ID | NULL | NULL | NULL | 499198 | 50.00 | Using where |
| 9 | SUBQUERY | ii | NULL | ALL | RAW_DATA_PAIR_UNIQUE_ID | NULL | NULL | NULL | 499198 | 50.00 | Using where |
| 5 | SUBQUERY | hh | NULL | ALL | RAW_DATA_PAIR_UNIQUE_ID | NULL | NULL | NULL | 499198 | 50.00 | Using where |
| 3 | SUBQUERY | gg | NULL | index | RAW_DATA_PAIR_UNIQUE_ID | RAW_DATA_PAIR_UNIQUE_ID | 4 | NULL | 499198 | 100.00 | Using index |
+----+-------------+-------+------------+-------+-------------------------+-------------------------+---------+------+--------+----------+----------------------------------------------------+
And finally, I've reorganizing the query so it's much more obvious (if you scroll right) how the sections differ (the INTERVAL
s) and where the USE INDEX
's are. The aa
, bb
, cc
(etc.) tables names are there merely to identify where each part maps to the query plan (above):
SELECT
a2 AS 'b_actual_pair', a1 AS 'c_actual_date', a3 AS 'd_actual_value',
b1 AS 'e_1m_date', b3 AS 'f_1m_value',
c1 AS 'g_2m_date', c3 AS 'h_2m_value',
d1 AS 'i_3m_date', d3 AS 'j_3m_value',
e1 AS 'k_4m_date', e3 AS 'l_4m_value',
f1 AS 'm_5m_date', f3 AS 'n_5m_value'
FROM
(SELECT b_date AS a1, c_pair AS a2, d_value AS a3 FROM item_raw_data aa USE INDEX(PRIMARY) WHERE a_unique_id > (SELECT MAX(a_unique_id) - COUNT(DISTINCT c_pair) FROM item_raw_data gg USE INDEX(RAW_DATA_PAIR_UNIQUE_ID)) ORDER BY b_date DESC) AS a,
(SELECT b_date AS b1, c_pair AS b2, d_value AS b3 FROM item_raw_data bb USE INDEX(PRIMARY) WHERE a_unique_id > (SELECT MAX(a_unique_id) - COUNT(DISTINCT c_pair) FROM item_raw_data hh WHERE b_date < (SELECT MAX(b_date) FROM item_raw_data mm) - INTERVAL 1 minute) AND b_date < (SELECT MAX(b_date) FROM item_raw_data mm) - INTERVAL 1 minute ORDER BY b_date DESC) AS b,
(SELECT b_date AS c1, c_pair AS c2, d_value AS c3 FROM item_raw_data cc USE INDEX(PRIMARY) WHERE a_unique_id > (SELECT MAX(a_unique_id) - COUNT(DISTINCT c_pair) FROM item_raw_data ii WHERE b_date < (SELECT MAX(b_date) FROM item_raw_data mm) - INTERVAL 2 minute) AND b_date < (SELECT MAX(b_date) FROM item_raw_data mm) - INTERVAL 2 minute ORDER BY b_date DESC) AS c,
(SELECT b_date AS d1, c_pair AS d2, d_value AS d3 FROM item_raw_data dd USE INDEX(PRIMARY) WHERE a_unique_id > (SELECT MAX(a_unique_id) - COUNT(DISTINCT c_pair) FROM item_raw_data jj WHERE b_date < (SELECT MAX(b_date) FROM item_raw_data mm) - INTERVAL 3 minute) AND b_date < (SELECT MAX(b_date) FROM item_raw_data mm) - INTERVAL 3 minute ORDER BY b_date DESC) AS d,
(SELECT b_date AS e1, c_pair AS e2, d_value AS e3 FROM item_raw_data ee USE INDEX(PRIMARY) WHERE a_unique_id > (SELECT MAX(a_unique_id) - COUNT(DISTINCT c_pair) FROM item_raw_data kk WHERE b_date < (SELECT MAX(b_date) FROM item_raw_data mm) - INTERVAL 4 minute) AND b_date < (SELECT MAX(b_date) FROM item_raw_data mm) - INTERVAL 4 minute ORDER BY b_date DESC) AS e,
(SELECT b_date AS f1, c_pair AS f2, d_value AS f3 FROM item_raw_data ff USE INDEX(PRIMARY) WHERE a_unique_id > (SELECT MAX(a_unique_id) - COUNT(DISTINCT c_pair) FROM item_raw_data ll WHERE b_date < (SELECT MAX(b_date) FROM item_raw_data mm) - INTERVAL 5 minute) AND b_date < (SELECT MAX(b_date) FROM item_raw_data mm) - INTERVAL 5 minute ORDER BY b_date DESC) AS f
WHERE
a.a2 = b.b2
AND a.a2 = c.c2
AND a.a2 = d.d2
AND a.a2 = e.e2
AND a.a2 = f.f2
And sure enough, the better use of indexes improves the speed. It's not instant, but it takes 50% of the time of the original query on my laptop. Meanwhile, note that if you're trying to meet a 5 second budget for data going back a week, you may still be hosed. A warning that you may need to rethink your approach.
Upvotes: 1
Reputation: 1269693
You can definitely simplify the where
clause by doing:
WHERE crd.a_unique_id > (SELECT MAX(crd.a_unique_id) - COUNT(DISTINCT c_pair)
FROM items_raw_data crd
)
This might help a bit with performance. You can also move the subquery to the FROM
clause to ensure that it is executed only once.
Upvotes: 0