kovac
kovac

Reputation: 5389

DISTINCT COUNT with GROUP BY query is too slow despite indexes

I have the following query that counts the number of vessels in each zone for each week:

SELECT zone, 
    DATE_FORMAT(creation_date, '%Y%u') AS date, 
    COUNT(DISTINCT vessel_imo) AS vessel_count 
  FROM vessel_position
  WHERE zone IS NOT NULL
   AND creation_date >= DATE_SUB(CURDATE(), INTERVAL 12 MONTH)
  GROUP BY zone, date;

The table has about 40 million rows. The execution plan for this is:

+----+-------------+-----------------+------------+-------+--------------------+------+---------+------+----------+----------+------------------------------------------+
| id | select_type | table           | partitions | type  | possible_keys      | key  | key_len | ref  | rows     | filtered | Extra                                    |
+----+-------------+-----------------+------------+-------+--------------------+------+---------+------+----------+----------+------------------------------------------+
|  1 | SIMPLE      | vessel_position | NULL       | range | creation_date,zone | zone | 5       | NULL | 21190904 |    50.00 | Using where; Using index; Using filesort |
+----+-------------+-----------------+------------+-------+--------------------+------+---------+------+----------+----------+------------------------------------------+

Columns vessel_imo, zone and creation_date each indexed. Primary key is the composite key (vessel_imo, creation_date).

When I look at the query profile, I can see that a large amount of time i spent doing Creating sort index.

Is there anything I can do to improve this query further?

Upvotes: 1

Views: 246

Answers (3)

Wilson Hauck
Wilson Hauck

Reputation: 2343

SET @mystartdate = DATE_SUB(CURDATE(), INTERVAL 12 MONTH);

SELECT zone, DATE_FORMAT(creation_date, '%Y%u') AS date, COUNT(DISTINCT vessel_imo) AS vessel_count FROM vessel_position WHERE creation_date >= @mystartdate AND zone > 0 GROUP BY zone, date;

may provide results in less time, please post your comparative times of second run of each ( old and suggested )

Please post new EXPLAIN SELECT … to confirm index of creation date is now used.

Unless old data is allowed to change, why do you have to gather 12 months history, the numbers more than 1 month ago are NOT going to change.

Upvotes: 0

Rick James
Rick James

Reputation: 142366

Assuming the data, once inserted, does not change, then build and maintain a Summary Table.

The table would have three columns: the zone, the week, and the count-distinct for that week. At the start of each week, build only the rows for the previous week (one per zone; skip NULL). Then build a query to work against that table -- it will be extremely fast since it will be fetching far fewer rows.

Meanwhile, the INDEX(creation_date, zone, vessel_imo) as a secondary index, will make the weekly task reasonably efficient (~52 times as fast as your current query).

Upvotes: 1

The Impaler
The Impaler

Reputation: 48850

It depends on how selective your filtering condition is, and your table structure. Does the filtering condition selects 20% of the rows, 5%, 1%, 0.1%?

If your answer is less than 5% then the following index could help:

create index ix1_date_zone on vessel_position (creation_date, zone);

If your table has many and/or heavy columns, then this option could still be slow, depending on how selective your filtering condition is.

Otherwise, you could try using a more expensive index, to avoid using the table and do:

create index ix2_date_zone_imo on vessel_position 
  (creation_date, zone, vessel_imo);

This index is more expensive to maintain -- read insert, update, delete rows -- but it would be faster for your select.

Try both options and pick the best for your needs.

Upvotes: 0

Related Questions