sbrattla
sbrattla

Reputation: 5386

How can I optimize a query which depends on both COUNT and GROUP BY?

I have a query which purpose is to generate statistics for how many musical work (track) has been downloaded from a site at different periods (by month, by quarter, by year etc). The query operates on the tables entityusage, entityusage_file and track.

To get the number of downloads for tracks belonging to an specific album I would do the following query :

select 
    date_format(eu.updated, '%Y-%m-%d') as p, count(eu.id) as c
from        entityusage as eu
inner join  entityusage_file as euf 
        ON  euf.entityusage_id = eu.id
inner join  track as t 
        ON t.id = euf.track_id
where
    t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7'
        and entitytype = 't'
        and action = 1
group by date_format(eu.updated, '%Y%m%d')

I need to set entitytype = 't' as the entityusage can hold downloads of other entities as well (if entitytype = 'a' then an entire album would have been downloaded, and entityusage_file would then hold all tracks which the album "translated" into at the point of download).

This query takes 40 - 50 seconds. I've been trying to optimize this query for a while, but I have the feeling that I'm approaching this the wrong way.

This is one out of 4 similar queries which must run to generate a report. The report should preferable be able to finish while a user waits for it. Right now, I'm looking at 3 - 4 minutes. That's a long time to wait.

Can this query be optimised further with indexes, or do I need to take another approach to get this job done?

CREATE TABLE `entityusage` (
  `id` char(36) NOT NULL,
  `title` varchar(255) DEFAULT NULL,
  `entitytype` varchar(5) NOT NULL,
  `entityid` char(36) NOT NULL,
  `externaluser` int(10) NOT NULL,
  `action` tinyint(1) NOT NULL,
  `updated` datetime NOT NULL,
  PRIMARY KEY (`id`),
  KEY `e` (`entityid`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

CREATE TABLE `entityusage_file` (
  `id` char(36) NOT NULL,
  `entityusage_id` char(36) NOT NULL,
  `track_id` char(36) NOT NULL,
  `file_id` char(36) NOT NULL,
  `type` varchar(3) NOT NULL,
  `quality` int(1) NOT NULL,
  `size` int(20) NOT NULL,
  `updated` datetime NOT NULL,
  PRIMARY KEY (`id`),
  KEY `file_id` (`file_id`),
  KEY `entityusage_id` (`entityusage_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `track` (
  `id` char(36) NOT NULL,
  `album_id` char(36) NOT NULL,
  `number` int(3) NOT NULL DEFAULT '0',
  `title` varchar(255) DEFAULT NULL,
  `updated` datetime NOT NULL DEFAULT '2000-01-01 00:00:00',
  PRIMARY KEY (`id`),
  KEY `album` (`album_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 CHECKSUM=1 DELAY_KEY_WRITE=1 ROW_FORMAT=DYNAMIC;

An EXPLAIN on the query gives me the following :

+------+-------------+-------+--------+----------------+----------------+---------+------------------------------+---------+----------------------------------------------+
| id   | select_type | table | type   | possible_keys  | key            | key_len | ref                          | rows    | Extra                                        |
+------+-------------+-------+--------+----------------+----------------+---------+------------------------------+---------+----------------------------------------------+
|    1 | SIMPLE      | eu    | ALL    | NULL           | NULL           | NULL    | NULL                         | 7832817 | Using where; Using temporary; Using filesort |
|    1 | SIMPLE      | euf   | ref    | entityusage_id | entityusage_id | 108     | func                         |       1 | Using index condition                        |
|    1 | SIMPLE      | t     | eq_ref | PRIMARY,album  | PRIMARY        | 108     | trackerdatabase.euf.track_id |       1 | Using where                                  |
+------+-------------+-------+--------+----------------+----------------+---------+------------------------------+---------+----------------------------------------------+

Upvotes: 0

Views: 94

Answers (4)

Rick James
Rick James

Reputation: 142218

Assuming that entityusage_file is mostly a many:many mapping table, see this for tips on improving it. Note that it calls for getting rid of the id and making a pair of 2-column indexes, one of which is the PRIMARY KEY(track_id, entityusage_id). Since your table has a few extra columns, that link does not cover everything.

The UUIDs could be shrunk from 108 bytes to 36, then then to 16 by going to BINARY(16) and using a compression function. Many exist (including a builtin pair in version 8.0); here's mine.

To explain one thing... The query execution should have started with track (on the assumption that '0054a47e-b594-407b-86df-3be078b4e7b7' is very selective). The hangup was that there was no index to get from there to the next table. Gordon's suggested indexes include such.

date_format(eu.updated, '%Y-%m-%d') and date_format(eu.updated, '%Y%m%d') can be simplified to DATE(eu.updated). (No significant performance change.)

(The other Answers and Comments cover a number of issues; I won't repeat them here.)

Upvotes: 2

Bernd Buffen
Bernd Buffen

Reputation: 15057

can you try this. i cant really test it without some sample data from you. In this case the query looks first in table track and joins then the other tables.

 SELECT 
    date_format(eu.updated, '%Y-%m-%d') AS p
    , count(eu.id) AS c
FROM track AS t
INNER JOIN entityusage_file AS euf ON t.id = euf.track_id
INNER JOIN entityusage AS eu ON euf.entityusage_id = eu.id
 WHERE
    t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7'
        AND entitytype = 't'
        AND ACTION = 1
GROUP BY date_format(eu.updated, '%Y%m%d');

Upvotes: 1

spencer7593
spencer7593

Reputation: 108370

Because the GROUP BY operation is on an expression involving a function, MySQL can't use an index to optimize that operation. It's going to require a "Using filesort" operation.

I believe the indexes that Gordon suggested are the best bets, given the current table definitions. But even with those indexes, the "tall post" is the eu table, chunking through and sorting all those rows.

To get more reasonable performance, you may need to introduce a "precomputed results" table. It's going to be expensive to generate the counts for everything... but we can pay that price ahead of time...

CREATE TABLE usage_track_by_day
( updated_dt DATE NOT NULL
, PRIMARY KEY (track_id, updated_dt)
)
AS
SELECT eu.track_id
     , DATE(eu.updated) AS updated_dt
     , SUM(IF(eu.action = 1,1,0) AS cnt
  FROM entityusage eu
 WHERE eu.track_id IS NOT NULL
   AND eu.updated IS NOT NULL
 GROUP
    BY eu.track_id
     , DATE(eu.updated)

An index ON entityusage (track_id,updated,action) may benefit performance.

Then, we could write a query against the new "precomputed results" table, with a better shot at reasonable performance.

The "precomputed results" table would get stale, and would need to be periodically refreshed.

This isn't necessarily the best solution to the issue, but it's a technique we can use in datawarehouse/datamart applications. This lets us churn through lots of detail rows to get counts one time, and then save those counts for fast access.

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1269563

This is your query:

select date_format(eu.updated, '%Y-%m-%d') as p, count(eu.id) as c
from entityusage eu join
     entityusage_file euf
     on euf.entityusage_id = eu.id join
     track t 
     on t.id = euf.track_id
where t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7' and
      eu.entitytype = 't' and
      eu.action = 1
group by date_format(eu.updated, '%Y%m%d');

I would suggest indexes on track(album_id, id), entityusage_file(track_id, entityusage_id), and entityusage(id, entitytype, action).

Upvotes: 2

Related Questions