neverlastn

Reputation: 2204

MySQL GROUP BY is slower when using index

I'm running on an AWS m4.large (2 vCPUs, 8 GB RAM) and I'm seeing slightly surprising behaviour from MySQL with GROUP BY. I have this test database:

CREATE TABLE demo (
  time INT,
  word VARCHAR(30),
  count INT
);
CREATE INDEX timeword_idx ON demo(time, word);

I insert 4,000,000 records with uniformly random words ("t%s" % random.randint(0, 30000)) and times (random.randint(0, 86400)).
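For reference, the load can be sketched like this (the count value of 1 and the DB-API executemany call are my assumptions, not part of the original setup):

```python
import random

def make_rows(n, max_word=30000, max_time=86400):
    """Generate n (time, word, count) tuples matching the setup above."""
    return [
        (random.randint(0, max_time), "t%s" % random.randint(0, max_word), 1)
        for _ in range(n)
    ]

rows = make_rows(4_000_000)
# Loading with any MySQL DB-API driver (e.g. pymysql) would then be:
# cur.executemany("INSERT INTO demo (time, word, count) VALUES (%s, %s, %s)", rows)
```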

SELECT word, time, sum(count) FROM demo GROUP BY time, word;
3996922 rows in set (1 min 28.29 sec)

EXPLAIN SELECT word, time, sum(count) FROM demo GROUP BY time, word;
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------+
| id | select_type | table | type  | possible_keys | key          | key_len | ref  | rows    | Extra |
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------+
|  1 | SIMPLE      | demo  | index | NULL          | timeword_idx | 38      | NULL | 4002267 |       |
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------+

Then I run the same query with the index ignored:

SELECT word, time, sum(count) FROM demo IGNORE INDEX (timeword_idx) GROUP BY time, word;
3996922 rows in set (34.75 sec)

EXPLAIN SELECT word, time, sum(count) FROM demo IGNORE INDEX (timeword_idx) GROUP BY time, word;
+----+-------------+-------+------+---------------+------+---------+------+---------+---------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows    | Extra                           |
+----+-------------+-------+------+---------------+------+---------+------+---------+---------------------------------+
|  1 | SIMPLE      | demo  | ALL  | NULL          | NULL | NULL    | NULL | 4002267 | Using temporary; Using filesort |
+----+-------------+-------+------+---------------+------+---------+------+---------+---------------------------------+

As you can see, the query takes 3 times longer when it uses the index. I'm not that surprised: by using the index the query can avoid reading the time and word columns from the table rows, but since the index is so sparse it shouldn't gain much from that. On the contrary, it turns a sequential scan into a random access pattern when it comes to retrieving count.

I would just like to confirm that this is the reason, and I wonder whether there's a "compact rule" for when an index will actually make performance worse when used for GROUP BY.

EDIT:

I followed Gordon Linoff's answer and used:

CREATE INDEX timeword_idx ON demo(time, word, count);

The "covering index" computes the result 10 times faster than the full scan:

SELECT word, time, sum(count) FROM demo GROUP BY time, word;
3996922 rows in set (3.36 sec)

EXPLAIN SELECT word, time, sum(count) FROM demo GROUP BY time, word;
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------------+
| id | select_type | table | type  | possible_keys | key          | key_len | ref  | rows    | Extra       |
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------------+
|  1 | SIMPLE      | demo  | index | NULL          | timeword_idx | 43      | NULL | 4002267 | Using index |
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------------+

Very impressive!

Upvotes: 3

Views: 1600

Answers (2)

Drew

Reputation: 24959

From the manual page How MySQL Uses Indexes

Indexes are less important for queries on small tables, or big tables where report queries process most or all of the rows. When a query needs to access most of the rows, reading sequentially is faster than working through an index. Sequential reads minimize disk seeks, even if not all the rows are needed for the query.

As for tacking on more columns to create covering indexes (ones in which the data pages are not accessed because all the needed data is available in the index), be careful. They come at a cost. In your case, your index is getting wide anyway, but a careful balance is always needed.

As spencer alludes to, cardinality always plays a role with ranges. For cardinality information, use the SHOW INDEX FROM tblName command. It is not the driving issue for your query, but it is useful in other settings. I should rephrase that: cardinality is very high for your table, so your index is deemed a hindrance for that query.

Upvotes: 1

Gordon Linoff

Reputation: 1270011

You have a reasonably sized table, so the problem is likely the loss of sequential access to the data, or possibly thrashing. Using the index requires going through the index and then looking up the row in the data pages to get the count.

This can actually be worse than just reading the pages and doing a sort, because the pages are not read in order. Sequential reads are considerably more optimized than random reads. In the worst case, the page cache is full and the random reads require flushing pages. If this happens, a single page might need to be read multiple times. With only 4 million relatively small rows, thrashing is unlikely unless you are severely memory constrained.
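The access-order point can be illustrated with an in-memory toy (the effect is far larger on disk, where each out-of-order page read costs a seek; this sketch only shows that the two orders do the same work with different access patterns):

```python
import random
import time

N = 1_000_000
data = list(range(N))  # stand-in for the table's data pages

# Sequential scan: visit rows in storage order (what the full table scan does).
t0 = time.perf_counter()
seq_sum = sum(data)
seq_time = time.perf_counter() - t0

# Index-order scan: visit the same rows via a shuffled lookup order,
# mimicking secondary-index entries pointing at scattered data pages.
order = list(range(N))
random.shuffle(order)
t0 = time.perf_counter()
rnd_sum = sum(data[i] for i in order)
rnd_time = time.perf_counter() - t0

# Same result either way; only the access pattern differs.
assert seq_sum == rnd_sum
```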

If this interpretation is correct, then including count in the index should speed the query:

CREATE INDEX timeword_idx ON demo(time, word, count);

Upvotes: 3
