Reputation: 919
I'm trying to make a query to get some trend stats, but the benchmark is really slow: the query execution time is around 134 seconds.
I have a MySQL table called table_1.
CREATE TABLE `table_1` (
`id` bigint(11) NOT NULL AUTO_INCREMENT,
`original_id` bigint(11) DEFAULT NULL,
`invoice_num` bigint(11) DEFAULT NULL,
`registration` timestamp NULL DEFAULT NULL,
`paid_amount` decimal(10,6) DEFAULT NULL,
`cost_amount` decimal(10,6) DEFAULT NULL,
`profit_amount` decimal(10,6) DEFAULT NULL,
`net_amount` decimal(10,6) DEFAULT NULL,
`customer_id` bigint(11) DEFAULT NULL,
`recipient_id` text,
`cashier_name` text,
`sales_type` text,
`sales_status` text,
`sales_location` text,
`invoice_duration` text,
`store_id` double DEFAULT NULL,
`is_cash` int(11) DEFAULT NULL,
`is_card` int(11) DEFAULT NULL,
`brandid` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_registration_compound` (`id`,`registration`)
) ENGINE=InnoDB AUTO_INCREMENT=47420958 DEFAULT CHARSET=latin1;
I have set a compound index made of id + registration.
SELECT
store_id,
CONCAT('[',GROUP_CONCAT(tot SEPARATOR ','),']') timeline_transactions,
SUM(tot) AS total_transactions,
CONCAT('[',GROUP_CONCAT(totalRevenues SEPARATOR ','),']') timeline_revenues,
SUM(totalRevenues) AS revenues,
CONCAT('[',GROUP_CONCAT(totalProfit SEPARATOR ','),']') timeline_profit,
SUM(totalProfit) AS profit,
CONCAT('[',GROUP_CONCAT(totalCost SEPARATOR ','),']') timeline_costs,
SUM(totalCost) AS costs
FROM (select t1.md,
COALESCE(SUM(t1.amount+t2.revenues), 0) AS totalRevenues,
COALESCE(SUM(t1.amount+t2.profit), 0) AS totalProfit,
COALESCE(SUM(t1.amount+t2.costs), 0) AS totalCost,
COALESCE(SUM(t1.amount+t2.tot), 0) AS tot,
t1.store_id
from
(
SELECT a.store_id,b.md,b.amount from ( SELECT DISTINCT store_id FROM table_1) AS a
CROSS JOIN
(
SELECT
DATE_FORMAT(a.DATE, "%m") as md,
'0' as amount
from (
select curdate() - INTERVAL (a.a + (10 * b.a) + (100 * c.a)) month as Date
from (select 0 as a union all select 1 union all select 2 union all select 3 union all select 4 union all select 5 union all select 6 union all select 7 union all select 8 union all select 9) as a
cross join (select 0 as a union all select 1 union all select 2 union all select 3 union all select 4 union all select 5 union all select 6 union all select 7 union all select 8 union all select 9) as b
cross join (select 0 as a union all select 1 union all select 2 union all select 3 union all select 4 union all select 5 union all select 6 union all select 7 union all select 8 union all select 9) as c
) a
where a.Date >='2019-01-01' and a.Date <= '2019-01-14'
group by md) AS b
)t1
left join
(
SELECT
COUNT(epl.invoice_num) AS tot,
SUM(paid_amount) AS revenues,
SUM(profit_amount) AS profit,
SUM(cost_amount) AS costs,
store_id,
date_format(epl.registration, '%m') md
FROM table_1 epl
GROUP BY store_id, date_format(epl.registration, '%m')
)t2
ON t2.md=t1.md AND t2.store_id=t1.store_id
group BY t1.md, t1.store_id) AS t3 GROUP BY store_id ORDER BY total_transactions desc
Maybe I should change the registration column from TIMESTAMP to DATETIME?
Upvotes: 3
Views: 953
Reputation: 142298
With PRIMARY KEY(id), having INDEX(id, anything) is virtually useless.
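If it is indeed redundant, dropping it is straightforward (a sketch; verify first that no other query relies on it):

ALTER TABLE table_1 DROP INDEX idx_registration_compound;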
See if you can avoid nesting subqueries.
Consider building that 'date' table permanently, with a PRIMARY KEY(md) on it. Currently, neither subquery has an index on the join column (md).
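A minimal sketch of such a table (the name month_dim is my own):

CREATE TABLE month_dim (
    md VARCHAR(2) NOT NULL,   -- '01' .. '12', matching DATE_FORMAT(..., '%m')
    PRIMARY KEY (md)
);
INSERT INTO month_dim (md) VALUES
    ('01'),('02'),('03'),('04'),('05'),('06'),
    ('07'),('08'),('09'),('10'),('11'),('12');

Joining against this instead of the derived b-subquery gives the optimizer an indexed join column.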
You may have the "explode-implode" syndrome. This is where JOINs expand the number of rows, only to have the GROUP BY collapse them.
Don't use COUNT(xx) unless you need to check xx for being NULL. Simply do COUNT(*).
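Applied to your t2 subquery (assuming you do not rely on COUNT skipping rows where invoice_num is NULL), that is simply:

SELECT
    COUNT(*) AS tot,   -- instead of COUNT(epl.invoice_num)
    ...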
store_id double -- Really?
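If store_id is really an integral identifier, one sketch of a fix (verify first that no value has a fractional part, and note that rebuilding a 47-million-row table takes a while):

ALTER TABLE table_1 MODIFY store_id INT;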
TIMESTAMP vs DATETIME -- they perform about the same; don't bother changing it.
Since you are only looking at 2019-01, get rid of date_format(epl.registration, '%m'). That, alone, may speed it up a lot. (However, you lose generality.)
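For the inner t2 aggregate, that could look something like this (a sketch using the range from your query, assuming you actually want invoices from that calendar range; registration is a timestamp, so an exclusive upper bound keeps all of the 14th):

SELECT  COUNT(*) AS tot,
        SUM(paid_amount) AS revenues,
        SUM(profit_amount) AS profit,
        SUM(cost_amount) AS costs,
        store_id
    FROM table_1
    WHERE registration >= '2019-01-01'
      AND registration <  '2019-01-15'
    GROUP BY store_id

A range condition like this can also use an index starting with registration, if you add one; the outer grouping would need adjusting accordingly.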
Upvotes: 0
Reputation: 11106
About 90% of your execution time will be used to execute GROUP BY store_id, date_format(epl.registration, '%m').
Unfortunately, you cannot use an index to group by a derived value, and since this grouping is vital to your report, you need to precalculate it. You can do this by adding that value to your table, e.g. using a generated column:
alter table table_1 add md varchar(2) as (date_format(registration, '%m')) stored
I kept the varchar format you used for the month here; you could also use a number (e.g. tinyint) for the month.
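With a number instead, the generated column might look like this (a sketch; MONTH() yields 1..12 rather than '01'..'12', so literals like '01' elsewhere in the query would have to change too):

alter table table_1 add md tinyint as (month(registration)) stored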
This requires MySQL 5.7; otherwise you can use triggers to achieve the same thing:
alter table table_1 add md varchar(2) null;
create trigger tri_table_1 before insert on table_1
for each row set new.md = date_format(new.registration,'%m');
create trigger tru_table_1 before update on table_1
for each row set new.md = date_format(new.registration,'%m');
Then add an index, preferably a covering index, starting with store_id and md, e.g.
create index idx_table_1_storeid_md on table_1
(store_id, md, invoice_num, paid_amount, profit_amount, cost_amount)
If you have other, similar reports, you may want to check if they use additional columns and could profit from covering more columns. The index will require about 1.5GB of storage space (and how long it takes your drive to read 1.5GB will basically single-handedly define your execution time, short of caching).
Then change your query to group by this new indexed column, e.g.
...
SUM(cost_amount) AS costs,
store_id,
md -- instead of date_format(epl.registration, '%m') md
FROM table_1 epl
GROUP BY store_id, md -- instead of date_format(epl.registration, '%m')
)t2 ...
This index will also take care of another 9% of your execution time, the SELECT DISTINCT store_id FROM table_1, which will profit from an index starting with store_id.
Now that 99% of your query is taken care of, some further remarks:
The subquery b and your date range where a.Date >= '2019-01-01' and a.Date <= '2019-01-14' might not do what you think they do. You should run the part SELECT DATE_FORMAT(a.DATE, "%m") as md, ... group by md separately to see what it does. In its current state, it will give you one row with the tuple '01', 0, representing "January", so it is basically a complicated way of doing select '01', 0. Unless today is the 15th or later, in which case it returns nothing (which is probably unintended).
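To illustrate: run on any day from the 1st to the 14th of a month, the whole b-subquery is equivalent to

select '01' as md, '0' as amount
-- from the 15th onward it returns nothing, because curdate() minus N months
-- keeps the day-of-month and so can never fall inside 2019-01-01 .. 2019-01-14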
In particular, it will not limit the invoices to that specific date range; it will include all invoices from (the whole) January of any year. If that is what you intended, you should (additionally) add that filter directly, e.g. by using FROM table_1 epl WHERE epl.md = '01' GROUP BY ..., reducing your execution time by an additional factor of about 12. So (apart from the 15th-and-up problem), with your current range you should get the same result if you use
...
SUM(cost_amount) AS costs,
store_id,
md
FROM table_1 epl
WHERE md = '01'
GROUP BY store_id, md
)t2 ...
For different date ranges you will have to adjust that term. And to emphasize my point, this is significantly different from filtering invoices by their date, e.g.
...
SUM(cost_amount) AS costs,
store_id,
md
FROM table_1 epl
WHERE epl.registration >='2019-01-01'
and epl.registration <= '2019-01-14'
GROUP BY store_id, md
)t2 ...
which you may (or may not) have tried to do. You would need a different index in that case though (and it would be a slightly different question).
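One plausible candidate for that variant (an assumption on my part, not benchmarked) would put the range column first, e.g.

create index idx_table_1_registration on table_1
(registration, store_id, invoice_num, paid_amount, profit_amount, cost_amount)

Here the range scan on registration does the filtering, and MySQL then groups the (hopefully few) matching rows; the index can no longer help with the grouping itself.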
There might be some additional optimizations, simplifications or beautifications in the rest of your query, e.g. group BY t1.md, t1.store_id looks redundant and/or wrong (indicating you are actually not on MySQL 5.7), and the b-subquery can only give you the values 1 to 12, so generating 1000 dates and reducing them again could be simplified (see the sketch below). But since these parts operate on 100-ish rows, they will not affect execution time significantly, and I haven't checked them in detail. Some of it is probably due to getting the right output format or to generalizations (although, if you are dynamically grouping by other formats than by month, you need other indexes/columns, but that would be a different question).
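For example, something along these lines (a sketch) would generate the twelve possible md values directly instead of a thousand dates:

select lpad(m, 2, '0') as md, '0' as amount
from (select 1 as m union all select 2 union all select 3 union all select 4
      union all select 5 union all select 6 union all select 7 union all select 8
      union all select 9 union all select 10 union all select 11 union all select 12) as months

restricted to whatever months your date range actually covers.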
An alternative way to precalculate your values would be a summary table: you run your inner query (the expensive group by) e.g. once a day, store the result in a table, and then reuse it (by selecting from this table instead of doing the group by). This is especially viable for data like invoices that never change (otherwise you can use triggers to keep the summary table up to date). It also becomes more viable if you have several scenarios, e.g. if your user can decide to group by weekday, year, month or zodiac sign, since otherwise you would need to add an index for each of those. It becomes less viable if you need to dynamically limit your invoice range (to e.g. 2019-01-01 ... 2019-01-14).

If you need to include the current day in your report, you can still precalculate and then add the values for the current date from the base table (which should only involve a very limited number of rows and is fast if you have an index starting with your date column), or use triggers to update your summary table on-the-fly.
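A minimal sketch of that approach (the table name sales_summary and the widened decimal precision are my own choices; it also assumes the md column added above):

create table sales_summary (
  store_id int not null,        -- assumes store_id values are integral
  md varchar(2) not null,
  tot bigint not null,
  revenues decimal(16,6),       -- wider than decimal(10,6), since these are sums
  profit decimal(16,6),
  costs decimal(16,6),
  primary key (store_id, md)    -- one row per store and month
);

-- refresh once a day, e.g. from a cron job or a MySQL event:
replace into sales_summary (store_id, md, tot, revenues, profit, costs)
select store_id, md, count(*), sum(paid_amount), sum(profit_amount), sum(cost_amount)
from table_1
group by store_id, md;

Your outer report query then reads from sales_summary instead of computing t2.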
Upvotes: 4