Noam

Reputation: 550

Why does SQL choose join index inconsistently?

I have a join between two tables on three columns. The join was taking hours to complete, so I added a composite index on all three columns on each table. Then, sometimes the join would be really fast and sometimes it would still be slow.

Using EXPLAIN, I noticed that it was fast when it chose to join using the composite index and slow when it just chose an index on only one of the columns. But each of these runs was using the same data.

Is there randomness involved in SQL selecting which index to use? Why would it be inconsistent?

If it helps: it is a MySQL database being queried from pandas in Python.

Upvotes: 0

Views: 111

Answers (1)

spencer7593

Reputation: 108500

Q: Is there randomness involved in SQL selecting which index to use?

There is no randomness involved, per se. The optimizer makes use of table and index statistics (the number of rows and cardinality), along with the predicates in the query, to develop estimates, e.g. the number of rows that will need to be retrieved.

MySQL also evaluates the cost for join operations, sort operations, etc. for each possible access plan (e.g. which index to use, which order to access the tables in) to come up with an estimated cost for each plan.

And then the optimizer compares the costs, and uses the plan that has the lowest cost. There are some parameters (MySQL system variables) that influence the cost estimates. (For example, tuning the expected cost for I/O operations.)
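Those estimated costs can be inspected: EXPLAIN FORMAT=JSON exposes the optimizer's cost figures alongside the chosen indexes. A sketch, assuming hypothetical tables t1 and t2 joined on columns (a, b, c); the actual table, column, and index names will differ:

```sql
-- Hypothetical schema: t1 and t2 joined on (a, b, c).
-- FORMAT=JSON includes "cost_info" / "query_cost" entries for the
-- chosen plan, so the estimates behind the plan choice are visible.
EXPLAIN FORMAT=JSON
SELECT t1.*, t2.val
  FROM t1
  JOIN t2
    ON t2.a = t1.a
   AND t2.b = t1.b
   AND t2.c = t1.c;
```

Capturing this output from a fast run and from a slow run shows which index, and which cost estimate, the optimizer settled on in each case.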

Q: Why would it be inconsistent?

For an InnoDB table, there is some randomness that comes into play when gathering statistics. InnoDB uses a sampling technique, doing a "deep dive" into a small set of "random" pages. The results from those sample pages are extrapolated into estimates for the whole table.

Some of the InnoDB tuning parameters (MySQL system variables) influence (increase/decrease) the number of pages that are sampled when gathering statistics. Sampling a smaller number of pages can be faster, but the smaller sample makes it more likely that the sample set is not entirely representative of the entire table. Sampling a larger number of pages alleviates that to a degree, but the sampling takes longer. It's a tradeoff.
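The relevant knob for persistent statistics is innodb_stats_persistent_sample_pages (default 20). A sketch, with hypothetical table names:

```sql
-- Check how many pages InnoDB samples per index when computing
-- persistent statistics (default is 20)
SHOW VARIABLES LIKE 'innodb_stats_persistent_sample_pages';

-- Increase the sample size so the estimates vary less from one
-- collection to the next, at the cost of a slower ANALYZE
SET GLOBAL innodb_stats_persistent_sample_pages = 64;

-- Re-collect statistics using the new sample size
ANALYZE TABLE t1, t2;
```

A larger sample makes the row-count and cardinality estimates more stable, which in turn makes the plan choice more consistent.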

Note that InnoDB automatically re-collects statistics when more than 10% of the rows in the table are changed by DML operations. (There are some cases where the automatic collection of statistics may not be triggered; for example, creating a new (empty) table and populating it with a LOAD DATA statement could result in no statistics being collected.)

So, the most likely explanation for the observed behavior is that at different times, there are different statistics available to the optimizer.
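If stale or unlucky statistics are the suspected cause, they can be re-collected on demand and inspected. A sketch, with hypothetical table names:

```sql
-- Re-sample pages and rebuild the persistent statistics for both tables
ANALYZE TABLE t1, t2;

-- Inspect what InnoDB currently has stored (5.7 persistent statistics)
SELECT * FROM mysql.innodb_table_stats WHERE table_name IN ('t1', 't2');
SELECT * FROM mysql.innodb_index_stats WHERE table_name IN ('t1', 't2');
```

Running ANALYZE TABLE just before the problem query, and then checking whether the EXPLAIN plan stabilizes, is a quick way to confirm (or rule out) the statistics explanation.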


Note that it is possible to influence the optimizer to opt for a plan that makes use of particular indexes, by including hints in the SQL text. We typically don't need to do that, nor do we want to. But in some cases, where the optimizer is choosing an inefficient plan, a hint can help it arrive at a better one.
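For example, the long-standing index-hint syntax can pin the join to the composite indexes. A sketch, with hypothetical table and index names:

```sql
-- FORCE INDEX tells the optimizer to treat a table scan as very
-- expensive, so it uses the named index whenever possible
SELECT t1.*, t2.val
  FROM t1 FORCE INDEX (idx_t1_a_b_c)
  JOIN t2 FORCE INDEX (idx_t2_a_b_c)
    ON t2.a = t1.a
   AND t2.b = t1.b
   AND t2.c = t1.c;
```

USE INDEX and IGNORE INDEX are gentler variants of the same mechanism; a hint like this is best kept as a last resort after the statistics have been addressed.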


A few references (from the MySQL 5.7 Reference Manual)

https://dev.mysql.com/doc/refman/5.7/en/optimizer-hints.html

https://dev.mysql.com/doc/refman/5.7/en/innodb-performance-optimizer-statistics.html

Upvotes: 1
