Reputation: 5389
I have the following query that counts the number of vessels in each zone for each week:
SELECT zone,
DATE_FORMAT(creation_date, '%Y%u') AS date,
COUNT(DISTINCT vessel_imo) AS vessel_count
FROM vessel_position
WHERE zone IS NOT NULL
AND creation_date >= DATE_SUB(CURDATE(), INTERVAL 12 MONTH)
GROUP BY zone, date;
The table has about 40 million rows. The execution plan for this is:
+----+-------------+-----------------+------------+-------+--------------------+------+---------+------+----------+----------+------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------------+------------+-------+--------------------+------+---------+------+----------+----------+------------------------------------------+
| 1 | SIMPLE | vessel_position | NULL | range | creation_date,zone | zone | 5 | NULL | 21190904 | 50.00 | Using where; Using index; Using filesort |
+----+-------------+-----------------+------------+-------+--------------------+------+---------+------+----------+----------+------------------------------------------+
Columns vessel_imo
, zone
and creation_date
each indexed. Primary key is the composite key (vessel_imo
, creation_date
).
When I look at the query profile, I can see that a large amount of time i spent doing Creating sort index
.
Is there anything I can do to improve this query further?
Upvotes: 1
Views: 246
Reputation: 2343
SET @mystartdate = DATE_SUB(CURDATE(), INTERVAL 12 MONTH);
SELECT zone, DATE_FORMAT(creation_date, '%Y%u') AS date, COUNT(DISTINCT vessel_imo) AS vessel_count FROM vessel_position WHERE creation_date >= @mystartdate AND zone > 0 GROUP BY zone, date;
may provide results in less time, please post your comparative times of second run of each ( old and suggested )
Please post new EXPLAIN SELECT … to confirm index of creation date is now used.
Unless old data is allowed to change, why do you have to gather 12 months history, the numbers more than 1 month ago are NOT going to change.
Upvotes: 0
Reputation: 142366
Assuming the data, once inserted, does not change, then build and maintain a Summary Table.
The table would have three columns: the zone, the week, and the count-distinct for that week. At the start of each week, build only the rows for the previous week (one per zone; skip NULL
). Then build a query to work against that table -- it will be extremely fast since it will be fetching far fewer rows.
Meanwhile, the INDEX(creation_date, zone, vessel_imo)
as a secondary index, will make the weekly task reasonably efficient (~52 times as fast as your current query).
Upvotes: 1
Reputation: 48850
It depends on how selective your filtering condition is, and your table structure. Does the filtering condition selects 20% of the rows, 5%, 1%, 0.1%?
If your answer is less than 5% then the following index could help:
create index ix1_date_zone on vessel_position (creation_date, zone);
If your table has many and/or heavy columns, then this option could still be slow, depending on how selective your filtering condition is.
Otherwise, you could try using a more expensive index, to avoid using the table and do:
create index ix2_date_zone_imo on vessel_position
(creation_date, zone, vessel_imo);
This index is more expensive to maintain -- read insert
, update
, delete
rows -- but it would be faster for your select
.
Try both options and pick the best for your needs.
Upvotes: 0