wau
wau

Reputation: 830

Optimize MySQL range query with group by

I have a table with the temperature for each day (huge table) and a table with period start and end dates (small table). Now I want to know the average temperature for each period, but the query takes a long time. Can it be improved?

NOTE: the long response times dissappear after upgrading to version 5.6.19-1~exp1ubuntu2, and may be caused by a bug in MySQL versions prior to 5.6.8 (see comment by Quassnoi)

To rebuild the day and period tables with random data:

create table days (
  day int not null auto_increment primary key,
  temperature float not null);

insert into days values(null,rand()),(null,rand()),
  (null,rand()),(null,rand()),(null,rand()),(null,rand()),
  (null,rand()),(null,rand()); # 8 rows

insert into days select null, d1.temperature
  from days d1, days d2, days d3, days d4,
  days d5, days d6, days d7; # 2M rows

create table periods(id int not null auto_increment primary key,
  first int not null,
  last int not null,
  index(first) using btree,
  index(last) using btree,
  index(first,last) using btree);

# add 10 periods of 1-11 days each
insert into periods(first,last)
  select floor(rand(day)*2000000), floor(rand(day)*2000000 + rand()*10)
  from days limit 10;

Listing all day temperatures for each periods is no problem (returns in 1ms):

select id, temperature
  from periods join days on day >= first and day <= last;

Now, with GROUP BY, it is actually quite slow (~1750ms)

# ALT1
select id, avg(temperature)
  from periods join days on day >= first and day <= last group by id;

Replacing the <= and >= with BETWEEN speeds it up slightly (~1600ms):

# ALT2
select id, avg(temperature)
  from periods join days on day between first and last group by id;

It turns out that results for a single period are returned immediately (1ms):

select id, (select avg(temperature)
  from days where day >= first and day <= last) from periods
  where id=1;

However, without the WHERE, it takes as much as 4200 ms, which averages to 420 ms per period!

# ALT3
select id,
  (select avg(temperature) from days where day >= first and day <= last)
  from periods;

What makes the queries so slow -- even (a lot) more than 10 times slower than getting results for a single period, although the periods table has only 10 rows? Is there any way to optimize this query?

EDIT: some more info:

mysql> select @@version;
+-------------------------+
| @@version               |
+-------------------------+
| 5.5.41-0ubuntu0.14.04.1 |
+-------------------------+

# ALT1
mysql> explain select id, avg(temperature) from periods join days on day >= first and day <= last group by id;
+----+-------------+---------+-------+--------------------+---------+---------+------+---------+----------------------------------------------+
| id | select_type | table   | type  | possible_keys      | key     | key_len | ref  | rows    | Extra                                        |
+----+-------------+---------+-------+--------------------+---------+---------+------+---------+----------------------------------------------+
|  1 | SIMPLE      | periods | index | first,last,first_2 | first_2 | 8       | NULL |      10 | Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | days    | ALL   | PRIMARY            | NULL    | NULL    | NULL | 2097596 | Using where; Using join buffer               |
+----+-------------+---------+-------+--------------------+---------+---------+------+---------+----------------------------------------------+

# ALT1 without GROUP BY
mysql> explain select id, temperature from periods join days on day >= first and day <= last;
+----+-------------+---------+-------+--------------------+---------+---------+------+---------+------------------------------------------------+
| id | select_type | table   | type  | possible_keys      | key     | key_len | ref  | rows    | Extra                                          |
+----+-------------+---------+-------+--------------------+---------+---------+------+---------+------------------------------------------------+
|  1 | SIMPLE      | periods | index | first,last,first_2 | first_2 | 8       | NULL |      10 | Using index                                    |
|  1 | SIMPLE      | days    | ALL   | PRIMARY            | NULL    | NULL    | NULL | 2097596 | Range checked for each record (index map: 0x1) |
+----+-------------+---------+-------+--------------------+---------+---------+------+---------+------------------------------------------------+

# ALT2
mysql> explain select id, avg(temperature) from periods join days on day between first and last group by id;
+----+-------------+---------+-------+--------------------+---------+---------+------+---------+----------------------------------------------+
| id | select_type | table   | type  | possible_keys      | key     | key_len | ref  | rows    | Extra                                        |
+----+-------------+---------+-------+--------------------+---------+---------+------+---------+----------------------------------------------+
|  1 | SIMPLE      | periods | index | first,last,first_2 | first_2 | 8       | NULL |      10 | Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | days    | ALL   | PRIMARY            | NULL    | NULL    | NULL | 2097596 | Using where; Using join buffer               |
+----+-------------+---------+-------+--------------------+---------+---------+------+---------+----------------------------------------------+

# ALT3
mysql> explain select id, (select avg(temperature) from days where day >= first and day <= last) from periods;
+----+--------------------+---------+-------+---------------+---------+---------+------+---------+-------------+
| id | select_type        | table   | type  | possible_keys | key     | key_len | ref  | rows    | Extra       |
+----+--------------------+---------+-------+---------------+---------+---------+------+---------+-------------+
|  1 | PRIMARY            | periods | index | NULL          | first_2 | 8       | NULL |      10 | Using index |
|  2 | DEPENDENT SUBQUERY | days    | ALL   | PRIMARY       | NULL    | NULL    | NULL | 2097596 | Using where |
+----+--------------------+---------+-------+---------------+---------+---------+------+---------+-------------+

# ALT3 with where
mysql> explain select id, (select avg(temperature) from days where day >= first and day <= last) from periods where id = 1;
+----+--------------------+---------+-------+---------------+---------+---------+-------+------+-------------+
| id | select_type        | table   | type  | possible_keys | key     | key_len | ref   | rows | Extra       |
+----+--------------------+---------+-------+---------------+---------+---------+-------+------+-------------+
|  1 | PRIMARY            | periods | const | PRIMARY       | PRIMARY | 4       | const |    1 |             |
|  2 | DEPENDENT SUBQUERY | days    | range | PRIMARY       | PRIMARY | 4       | NULL  |   10 | Using where |
+----+--------------------+---------+-------+---------------+---------+---------+-------+------+-------------+

EDIT2: execution plan for nested query in FROM, as suggested by Lennart (query execution time 3ms)

mysql> explain select id,avg(temperature) from (select id,temperature from periods join days on day between first and last) as t group by id;
+----+-------------+------------+-------+--------------------+---------+---------+------+----------+------------------------------------------------+
| id | select_type | table      | type  | possible_keys      | key     | key_len | ref  | rows     | Extra                                          |
+----+-------------+------------+-------+--------------------+---------+---------+------+----------+------------------------------------------------+
|  1 | PRIMARY     | <derived2> | ALL   | NULL               | NULL    | NULL    | NULL |       50 | Using temporary; Using filesort                |
|  2 | DERIVED     | periods    | index | first,last,first_2 | first_2 | 8       | NULL |       10 | Using index                                    |
|  2 | DERIVED     | days       | range | PRIMARY,day        | PRIMARY | 4       | NULL |        5 | Range checked for each record (index map: 0x3) |
+----+-------------+------------+-------+--------------------+---------+---------+------+----------+------------------------------------------------+

Upvotes: 4

Views: 999

Answers (2)

Lennart - Slava Ukraini
Lennart - Slava Ukraini

Reputation: 7171

Here's one ugly trick, since:

select id, temperature 
from periods join days 
    on day between first and last;

is fast, we could try to provoke the optimizer to evaluate this first. Using a sub-query alone is not sufficient:

select id, avg(temperature) 
from (
    select id, temperature 
    from periods 
    join days 
        on day between first and last
) as t 
group by id;
[...]
10 rows in set (1.67 sec)

However, calling a non deterministic function in the sub-query seems to do the trick:

select id, avg(temperature) 
from (
    select id, temperature, rand() 
    from periods 
    join days 
        on day between first and last
) as t 
group by id;
[...]
10 rows in set (0.00 sec)

Unless critical and necessary I would stay away from such tricks. As the optimizer gets better (maybe next fix) it might skip the call to rand(), all the sudden your old plan and performance is back in business.

If you you do use such tricks, make sure to document them carefully in the code so you can clean them up when they are no longer necessary.

MariaDB [test]> select @@version;
+-----------------+
| @@version       |
+-----------------+
| 10.0.20-MariaDB |
+-----------------+
1 row in set (0.00 sec)

explain select id, avg(temperature) from periods join days on day between first and last group by id;
| id   | select_type | table   | type  | possible_keys      | key     | key_len | ref  | rows    | Extra                                           |
|    1 | SIMPLE      | periods | index | first,last,first_2 | first_2 | 8       | NULL |      10 | Using index; Using temporary; Using filesort    |
|    1 | SIMPLE      | days    | ALL   | PRIMARY            | NULL    | NULL    | NULL | 2094315 | Using where; Using join buffer (flat, BNL join) |

explain select id, avg(temperature) from (select id, temperature from periods join days on day between first and last) as t group by id;
| id   | select_type | table   | type  | possible_keys      | key     | key_len | ref  | rows    | extra                                           |
|    1 | SIMPLE      | periods | index | first,last,first_2 | first_2 | 8       | NULL |      10 | Using index; Using temporary; Using filesort    |
|    1 | SIMPLE      | days    | ALL   | PRIMARY            | NULL    | NULL    | NULL | 2094315 | Using where; Using join buffer (flat, BNL join) |


explain select id, avg(temperature) from (select id, temperature, rand() from periods join days on day between first and last) as t group by id;
| id   | select_type | table      | type  | possible_keys      | key     | key_len | ref  | rows    | Extra                                          |
|    1 | PRIMARY     | <derived2> | ALL   | NULL               | NULL    | NULL    | NULL |       2 | Using temporary; Using filesort                |
|    2 | DERIVED     | periods    | index | first,last,first_2 | first_2 | 8       | NULL |      10 | Using index                                    |
|    2 | DERIVED     | days       | ALL   | PRIMARY            | NULL    | NULL    | NULL | 2094315 | Range checked for each record (index map: 0x1) |

Upvotes: 2

O. Jones
O. Jones

Reputation: 108686

Try creating a compound covering index on days (day,temperature). It should crank up your speed a bit.

Upvotes: 0

Related Questions