user729103
user729103

Reputation: 651

MySQL - group by and count - best query

We have a statistics database of which we would like to group some results. Every entry has a timestamp 'tstarted'.

We would like to group by every quarter of the day. For each quarter, we would like to know the day count where we have > 0 results (for that quarter).

We could resolve this by using a subquery:

select quarter, sum(q), count(quarter), sum(q) / count(quarter) as average
from (
    select SEC_TO_TIME((TIME_TO_SEC(tstarted) DIV 900) * 900) as quarter, sum(qdelivered) as q
    from statistics 
    where stat_field = 1
    group by SEC_TO_TIME((TIME_TO_SEC(tstarted) DIV 900) * 900), date(tstarted)
    order by SEC_TO_TIME((TIME_TO_SEC(tstarted) DIV 900) * 900) asc
) as sub
group by quarter

My question: is there a more efficient way to retrieve this result (e.g. join or other way)?

Upvotes: 2

Views: 248

Answers (2)

Kickstart
Kickstart

Reputation: 21513

I would be tempted to set up a table of quarters in the day. Use this table and LEFT JOIN your statistics table it.

CREATE TABLE quarters
(
    id  INT,
    start_qtr   INT,
    end_qtr INT
);

INSERT INTO quarters (id, start_qtr, end_qtr) VALUES
(1,0,899),
(2,900,1799),
(3,1800,2699),
(4,2700,3599),
(5,3600,4499),
(6,4500,5399),
(7,5400,6299),
(8,6300,7199),
etc;

Your query can then be:-

SELECT SEC_TO_TIME(quarters.start_qtr) AS quarter, 
        sum(statistics.qdelivered), 
        count(statistics.qdelivered), 
        sum(statistics.qdelivered) / count(statistics.qdelivered) as average
FROM quarters
LEFT OUTER JOIN statistics
ON TIME_TO_SEC(statistics.tstarted) BETWEEN quarters.start_qtr AND quarters.end_qtr
AND statistics.stat_field = 1
AND DATE(statistics.tstarted) = '2014-06-30'
GROUP BY quarter
ORDER BY quarter;

Advantage of this is that it will give you entries with a count of 0 (and an average of NULL) for quarters where there are no statistics, and it saves some of the calculations.

You could save more calculations by adding time columns to the quarters table:-

CREATE TABLE quarters
(
    id  INT,
    start_qtr   INT,
    end_qtr INT
    start_qtr_time  TIME,
    end_qtr_time    TIME,
);

INSERT INTO quarters (id, start_qtr, end_qtr, start_qtr_time, end_qtr_time) VALUES
(1,0,899, '00:00:00', '00:14:59'),
(2,900,1799, '00:15:00', '00:29:59'),
(3,1800,2699, '00:30:00', '00:44:59'),
(4,2700,3599, '00:45:00', '00:59:59'),
(5,3600,4499, '01:00:00', '01:14:59'),
(6,4500,5399, '01:15:00', '01:29:59'),
(7,5400,6299, '01:30:00', '01:44:59'),
(8,6300,7199, '01:45:00', '01:59:59'),
etc

Then this saves the use of a function on the JOIN:-

SELECT start_qtr_time AS quarter, 
        sum(statistics.qdelivered), 
        count(statistics.qdelivered), 
        sum(statistics.qdelivered) / count(statistics.qdelivered) as average
FROM quarters
LEFT OUTER JOIN statistics
ON TIME(statistics.tstarted) BETWEEN quarters.start_qtr_time AND quarters.end_qtr_time
AND statistics.stat_field = 1
AND DATE(statistics.tstarted) = '2014-06-30'
GROUP BY quarter
ORDER BY quarter;

These both assume you are interested in a particular day.

Upvotes: 1

spencer7593
spencer7593

Reputation: 108370

Efficiency could be improved by eliminating the inline view (derived table aliased as sub), and doing all the work in a single query. (This is because of the way that MySQL processes the inline view, creating and populating a temporary MyISAM table.)

I don't understand why the expression date(tstarted) needs to be included in the GROUP BY clause; I don't see that removing that would change the result set returned by the query.

I do now see the effect of including the date(tstarted) in the GROUP BY of the inline view query.

I think this query returns the same result as the original:

SELECT SEC_TO_TIME((TIME_TO_SEC(s.tstarted) DIV 900) * 900) AS `quarter`
     , SUM(s.qdelivered)                                    AS `q`
     , COUNT(DISTINCT DATE(s.tstarted))                     AS `day_count`
     , SUM(s.qdelivered) / COUNT(DISTINCT DATE(s.tstarted)) AS `average`
  FROM statistics s
 WHERE s.stat_field = 1 
 GROUP BY SEC_TO_TIME((TIME_TO_SEC(s.tstarted) DIV 900) * 900)

This should be more efficient since it avoids materializing an intermediate derived table.


Your question said you wanted a "day count"; that sounds like you want a count of the each day that had a row within a particular quarter hour.

To get that, you could just add an aggregate expression to the SELECT list,

     , COUNT(DISTINCT DATE(s.tstarted))                     AS `day_count`

Upvotes: 2

Related Questions