jaklaassen
jaklaassen

Reputation: 1

Bad mysql index selection

I have a large table (250M rows) with a column group_id that broadly divides the table into groups (group_id). It has the following index:

mysql> show indexes from table\G;
*************************** 13. row ***************************
       Table: table
  Non_unique: 1
    Key_name: myindex
Seq_in_index: 1
 Column_name: group_id
   Collation: A
 Cardinality: 181819
    Sub_part: NULL
      Packed: NULL
        Null: YES
  Index_type: BTREE
     Comment: 
*************************** 14. row ***************************
       Table: table
  Non_unique: 1
    Key_name: myindex
Seq_in_index: 2
 Column_name: id
   Collation: A
 Cardinality: 213456239
    Sub_part: NULL
      Packed: NULL
        Null: 
  Index_type: BTREE
     Comment: 

I want to execute the following query:

mysql> explain select * from `table` WHERE (`table`.`type_id` IN (11, 17, 12, 19) AND `table`.`group_id` = 310248) ORDER BY `table`.`id` ASC LIMIT 201\G

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: table
         type: index
possible_keys: [SOME INDEX NAMES]
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 257386914
        Extra: Using where
1 row in set (0.00 sec)

I understand the it will need to scan some rows because of the problems with indexing for WHERE ... IN (). Amazingly to me, however, it chooses to scan almost as many rows as possible by using the primary key index.

The following seems unambiguously (and obviously) superior:

mysql> explain select * from `table` USE INDEX (myindex) WHERE (`table`.`type_id` IN (11, 17, 12, 19) AND `table`.`group_id` = 310248) ORDER BY `table`.`id` ASC LIMIT 201\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: table
         type: ref
possible_keys: myindex
          key: myindex
      key_len: 5
          ref: const
         rows: 1883760
        Extra: Using where
1 row in set (0.00 sec)

Using a larger value for LIMIT (2000), using different values of group_id, removing the ORDER BY and removing the type_id filter all cause it to use the index. I have run ANALYZE TABLE.

Its worth noting that the row estimates are quite high:

mysql> select count(*) from table where group_id=310248 and type_id in (11, 17, 12, 19) ;
+----------+
| count(*) |
+----------+
|   583868 |
+----------+
1 row in set (0.61 sec)

mysql version:

Ver 5.1.57-rel12.8-log for debian-linux-gnu on x86_64 ((Percona Server (GPL), 12.8, Revision 233))

Why would mysql choose a plan that it thinks will involve scanning 257386914 rows rather than 1883760? I understand that it might value sequential reads, but why would it choose the index for 2000 rows, but not for 200 rows? Why would filtering by a different group id?

Edited: I have also tried creating the index (group_id, id, type_id) so that all sorting can be done using only an index scan, but I can't get it to ever select that index.

Upvotes: 0

Views: 923

Answers (1)

spencer7593
spencer7593

Reputation: 108400

Did you have a question?

Note that because that predicate on the type_id column has to be checked, and because your query is returning at least one column that is not in the index, MySQL will have to visit the data pages of the table, in order to access the values for those columns.

So, MySQL may be favoring the cluster key, since that's where the data pages are; the cluster key also allows MySQL to avoid a sort operation ("Using filesort"). (We do note that the execution plan that uses your index also avoids a sort operation.)

If you want MySQL to favor your index, you might consider including type_id as a third column in that index, if that is at all selective.

Alternatively, you might consider modifying your query to "ORDER BY group_id, id" to influence the optimizer.

Have you measured the performance of the query, both with the hint and without the hint?

Upvotes: 1

Related Questions