Ben

Reputation: 1030

MySQL GROUP BY optimization

This question is a more specific version of a previous question I asked.

Table


CREATE TABLE Test4_ClusterMatches 
(
    `match_index` INT UNSIGNED,
    `cluster_index` INT UNSIGNED, 
    `id` INT NOT NULL AUTO_INCREMENT,
    `tfidf` FLOAT,
    PRIMARY KEY (`cluster_index`,`match_index`,`id`)
);

The query I want to run


mysql> explain SELECT `match_index`, SUM(`tfidf`) AS total 
FROM Test4_ClusterMatches WHERE `cluster_index` IN (1,2,3 ... 3000) 
GROUP BY `match_index`;

The Problem with the query


It uses temporary and filesort, so it's too slow.

+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
| id | select_type | table                | type  | possible_keys | key     | key_len | ref  | rows  | Extra                                                     |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
|  1 | SIMPLE      | Test4_ClusterMatches | range | PRIMARY       | PRIMARY | 4       | NULL | 51540 | Using where; Using index; Using temporary; Using filesort | 
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
With the current indexing, the query would need to sort by cluster_index first to eliminate the temporary table and filesort, but doing so gives the wrong results for SUM(tfidf). Changing the primary key to

PRIMARY KEY (`match_index`,`cluster_index`,`id`)

avoids the filesort and temporary table, but it scans 14,932,441 rows, so it is also too slow:

+----+-------------+----------------------+-------+---------------+---------+---------+------+----------+--------------------------+
| id | select_type | table                | type  | possible_keys | key     | key_len | ref  | rows     | Extra                    |
+----+-------------+----------------------+-------+---------------+---------+---------+------+----------+--------------------------+
|  1 | SIMPLE      | Test5_ClusterMatches | index | NULL          | PRIMARY | 16      | NULL | 14932441 | Using where; Using index | 
+----+-------------+----------------------+-------+---------------+---------+---------+------+----------+--------------------------+

Tight Index Scan


Using a tight index scan, by running the search for just one cluster_index value:

mysql> explain SELECT match_index, SUM(tfidf) AS total
FROM Test4_ClusterMatches WHERE cluster_index =3000 
GROUP BY match_index;
This eliminates the temporary table and filesort.
+----+-------------+----------------------+------+---------------+---------+---------+-------+------+--------------------------+
| id | select_type | table                | type | possible_keys | key     | key_len | ref   | rows | Extra                    |
+----+-------------+----------------------+------+---------------+---------+---------+-------+------+--------------------------+
|  1 | SIMPLE      | Test4_ClusterMatches | ref  | PRIMARY       | PRIMARY | 4       | const |   27 | Using where; Using index | 
+----+-------------+----------------------+------+---------------+---------+---------+-------+------+--------------------------+
I'm not sure whether this can be exploited with some magic SQL-fu that I haven't come across yet.
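To illustrate the idea, here is a rough sketch of stitching several tight index scans together with UNION ALL. Each branch is a single-cluster scan that avoids the filesort on its own, but the outer GROUP BY still needs a temporary table, so this may not be a net win:

```sql
-- Sketch only: each branch is a tight index scan for one cluster_index,
-- but the outer GROUP BY across branches still needs a temporary table.
SELECT match_index, SUM(total) AS total FROM (
    SELECT match_index, SUM(tfidf) AS total
    FROM Test4_ClusterMatches WHERE cluster_index = 1 GROUP BY match_index
    UNION ALL
    SELECT match_index, SUM(tfidf) AS total
    FROM Test4_ClusterMatches WHERE cluster_index = 2 GROUP BY match_index
    -- ... one branch per cluster_index
) AS per_cluster
GROUP BY match_index;
```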

Question


How can I change my query so that it uses 3,000 cluster_index values and avoids the temporary table and filesort, without needing to scan 14,932,441 rows?



Update


Using the table

CREATE TABLE Test6_ClusterMatches 
(
  match_index INT UNSIGNED,
  cluster_index INT UNSIGNED, 
  id INT NOT NULL AUTO_INCREMENT,
  tfidf FLOAT,
  PRIMARY KEY (id),
  UNIQUE KEY(cluster_index,match_index)
);

The query below then gives 10 rows in set (0.41 sec) :)

SELECT `match_index`, SUM(`tfidf`) AS total FROM Test6_ClusterMatches WHERE 
`cluster_index` IN (.....)
GROUP BY `match_index` ORDER BY total DESC LIMIT 0,10;

but it's using temporary and filesort:

+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+----------------------------------------------+
| id | select_type | table                | type  | possible_keys | key           | key_len | ref  | rows  | Extra                                        |
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+----------------------------------------------+
|  1 | SIMPLE      | Test6_ClusterMatches | range | cluster_index | cluster_index | 5       | NULL | 78663 | Using where; Using temporary; Using filesort | 
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+----------------------------------------------+

I'm wondering if there's any way to make it faster by eliminating Using temporary and Using filesort.
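One detail visible in the EXPLAIN above: Using index is missing from Extra, so every matched row is read back from the table. A covering index is one untested idea (the index name cluster_match_tfidf is made up); it would not remove the temporary table or filesort, but it could cut the row lookups:

```sql
-- Sketch (untested assumption): append tfidf so the range scan can be
-- satisfied entirely from the index, without visiting table rows.
ALTER TABLE Test6_ClusterMatches
  ADD KEY cluster_match_tfidf (cluster_index, match_index, tfidf);
```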

Upvotes: 0

Views: 4767

Answers (2)

Xint0

Reputation: 5399

If the cluster_index values in the WHERE condition form a contiguous range, then instead of IN use:

WHERE (cluster_index >= 1) and (cluster_index <= 3000)

If the values are not contiguous, then you can create a temporary table (with an index) to hold the cluster_index values and use an INNER JOIN against it.
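A minimal sketch of that temporary-table approach, assuming the Test4_ClusterMatches schema from the question (the temporary table name wanted_clusters is made up):

```sql
-- Hypothetical sketch: materialise the wanted cluster_index values,
-- then join instead of passing a huge IN list.
CREATE TEMPORARY TABLE wanted_clusters (
    cluster_index INT UNSIGNED NOT NULL PRIMARY KEY
);

INSERT INTO wanted_clusters VALUES (1), (2), (3); -- ... and so on

SELECT cm.match_index, SUM(cm.tfidf) AS total
FROM Test4_ClusterMatches cm
INNER JOIN wanted_clusters w ON w.cluster_index = cm.cluster_index
GROUP BY cm.match_index;
```

With the primary key starting with cluster_index, each joined value becomes a tight index scan on the big table.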

Upvotes: 0

Jon Black

Reputation: 16559

I had a quick look and this is what I came up with - hope it helps...

SQL Table

drop table if exists cluster_matches;
create table cluster_matches
(
 cluster_id int unsigned not null,
 match_id int unsigned not null,
 ...
 tfidf float not null default 0,
 primary key (cluster_id, match_id) -- if this isn't unique, add id to the end !!
)
engine=innodb;

Test Data

select count(*) from cluster_matches

count(*)
========
17974591

select count(distinct(cluster_id)) from cluster_matches;

count(distinct(cluster_id))
===========================
1000000

select count(distinct(match_id)) from cluster_matches;

count(distinct(match_id))
=========================
6000

explain select
 cm.match_id,
 sum(tfidf) as sum_tfidf,
 count(*) as count_tfidf
from
 cluster_matches cm
where
 cm.cluster_id between 5000 and 10000
group by
 cm.match_id
order by
 sum_tfidf desc limit 10;

id  select_type table   type    possible_keys   key key_len ref rows    Extra
==  =========== =====   ====    =============   === ======= === ====    =====
1   SIMPLE  cm  range   PRIMARY PRIMARY 4       290016  Using where; Using temporary; Using filesort

runtime - 0.067 seconds.

Pretty respectable runtime of 0.067 seconds, but I think we can make it better.

Stored Procedure

You will have to forgive me for not wanting to type/pass in a list of 5000+ random cluster_ids!

call sum_cluster_matches(null,1); -- for testing
call sum_cluster_matches('1,2,3,4,....5000',1);

The bulk of the sproc isn't very elegant; all it does is split a CSV string into individual cluster_ids and populate a temp table.

drop procedure if exists sum_cluster_matches;

delimiter #

create procedure sum_cluster_matches
(
in p_cluster_id_csv varchar(65535),
in p_show_explain tinyint unsigned
)
proc_main:begin

declare v_id varchar(10);
declare v_done tinyint unsigned default 0;
declare v_idx int unsigned default 1;

    create temporary table tmp(cluster_id int unsigned not null primary key);   

    -- not very elegant - split the string into tokens and put them into a temp table...

    if p_cluster_id_csv is not null then
        while not v_done do
            set v_id = trim(substring(p_cluster_id_csv, v_idx,
                if(locate(',', p_cluster_id_csv, v_idx) > 0,
                    locate(',', p_cluster_id_csv, v_idx) - v_idx, length(p_cluster_id_csv))));

            if length(v_id) > 0 then
                set v_idx = v_idx + length(v_id) + 1;
                insert ignore into tmp values(v_id);
            else
                set v_done = 1;
            end if;
        end while;
    else
        -- instead of passing in a huge comma-separated list of cluster_ids, I'm cheating here to save typing
        insert into tmp select cluster_id from clusters where cluster_id between 5000 and 10000;
        -- end cheat
    end if;

    if p_show_explain then

        select count(*) as count_of_tmp from tmp;

        explain
        select
         cm.match_id,
         sum(tfidf) as sum_tfidf,
         count(*) as count_tfidf
        from
         cluster_matches cm
        inner join tmp on tmp.cluster_id = cm.cluster_id
        group by
         cm.match_id
        order by
         sum_tfidf desc limit 10;
    end if;

    select
     cm.match_id,
     sum(tfidf) as sum_tfidf,
     count(*) as count_tfidf
    from
     cluster_matches cm
    inner join tmp on tmp.cluster_id = cm.cluster_id
    group by
     cm.match_id
    order by
     sum_tfidf desc limit 10;

    drop temporary table if exists tmp;

end proc_main #

delimiter ;

Results

call sum_cluster_matches(null,1);

count_of_tmp
============
5001

id  select_type table   type    possible_keys   key key_len ref rows    Extra
==  =========== =====   ====    =============   === ======= === ====    =====
1   SIMPLE  tmp index   PRIMARY PRIMARY 4       5001    Using index; Using temporary; Using filesort
1   SIMPLE  cm  ref PRIMARY PRIMARY 4   vldb_db.tmp.cluster_id  8   

match_id    sum_tfidf   count_tfidf
========    =========   ===========
1618        387         64
1473        387         64
3307        382         64
2495        373         64
1135        373         64
3832        372         57
3203        362         58
5464        358         67
2100        355         60
1634        354         52

runtime 0.028 seconds.

Explain plan and runtime much improved.

Upvotes: 2
