leemes
leemes

Reputation: 45715

Huge performance difference: "WHERE x IN (...values...)" vs. "WHERE x IN (<subquery>)"

I want to query a table T for rows identified by some column x. The values of x, for which the rows should be returned, are determined by some sub-query on table U, which carries the actual filter condition on columns a (a short string) and b (a timestamp value, looking for a range).

SELECT * FROM T
    WHERE x IN (
        SELECT x FROM U WHERE a = ? AND b BETWEEN ? AND ?
    )
    ORDER BY x, y

Both tables T and U are quite big (in the range of 1M..100M rows). Their definitions are at the end of the question. In the typical use case, about 500 to 1000 rows are in the result set for about 100 different values for x. These values for x are not consecutive, and indeed very "random".

This query takes quite some long time (in the range of 0.5 to 2 seconds), although proper indices for both tables T and U are used and indeed only a small number of rows is being "examined" (about 1000) according to the slow query log.

The indices are defined as:

The EXPLAIN for the query above is:

id  select_type  table  type   possible_keys  key    key_len  ref   rows  Extra     
1   SIMPLE       U      range  idx_u          idx_u  24       NULL  107   Using where; Using index; Using temporary; Using filesort
1   SIMPLE       T      ref    idx_t          idx_t  8        U.x   4     NULL

I also tried a JOIN instead of sub-query:

SELECT * FROM T
    JOIN U ON T.x = U.x
    WHERE a = ? AND b BETWEEN ? AND ?
    ORDER BY T.x, T.y

The EXPLAIN tells exactly the same as above.


However, when executing the sub-query separately, and listing the values for x literally, both queries together take much less time (in the range of 10 to 20 milliseconds). Also, their EXPLAIN looks different.

The first query (originally the sub-query):

SELECT x FROM U WHERE a = ? AND b BETWEEN ? AND ?

Its EXPLAIN matches the first entry from above:

id  select_type  table  type   possible_keys  key    key_len  ref   rows  Extra     
1   SIMPLE       U      range  idx_u          idx_u  24       NULL  107   Using where; Using index; Using temporary; Using filesort

Now the second query:

SELECT * FROM T
    WHERE x IN (
        3917,8525,13149,17729,22355,26908,31457,36053,40663,45250 -- ... (64 values here)
    )
    ORDER BY x, y

is explained as

id  select_type  table  type   possible_keys  key    key_len  ref   rows  Extra     
1   SIMPLE       T      range  idx_t          idx_t  8        U.x   436   Using index condition

Now I noticed that in the original (composed) query, for table T, the guess for the number of rows is only 4, which is way too low. Also, the type for T is ref, while in the separate query (with explicit values for x), it is range.

Might the misguessed row number for the lookup in T cause the query to be executed in a sub-optimal way? Is the ref type the reason for the big performance difference?

And finally, to fix this, can I force range? Note that this is not about forcing an index to be used, as it already seems to be used.


EDIT: The tables have been created like this:

CREATE TABLE `U` (
  `x` bigint(20) UNSIGNED NOT NULL AUTO_INCREMENT,
  `a` varchar(20) CHARACTER SET ascii NOT NULL,
  `b` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  -- (more columns irrelevant here)
) ENGINE=InnoDB;

ALTER TABLE `U`
  ADD PRIMARY KEY (`x`),
  ADD KEY `idx_u` (`a`,`b`) USING BTREE;
CREATE TABLE `T` (
  `id` bigint(20) UNSIGNED NOT NULL AUTO_INCREMENT,
  `x` bigint(20) UNSIGNED NOT NULL,
  `y` int(11) UNSIGNED NOT NULL,
  -- (more columns irrelevant here)
) ENGINE=InnoDB;

ALTER TABLE `T`
  ADD PRIMARY KEY (`id`),
  ADD KEY `idx_t` (`x`,`y`) USING BTREE;

Upvotes: 3

Views: 108

Answers (2)

Rahul
Rahul

Reputation: 77906

What if you change your query to use WHERE EXISTS instead of IN operator and since you already tried with JOIN conversion

SELECT * FROM T
    WHERE EXISTS (SELECT 1 FROM U WHERE T.x = U.x
                  AND a = ? 
                  AND b BETWEEN ? AND ?
    )
    ORDER BY x, y;

Strange fact is that, I see you have an index on x,y(ADD KEY idx_t (x,y) USING BTREE) but your explain shows it's performing a filesort Using where; Using index; Using filesort

Upvotes: 1

Rushabh Master
Rushabh Master

Reputation: 490

You can try with join like

SELECT T.* FROM T
         JOIN U on T.x = U.x
WHERE U.a = ? AND U.b BETWEEN ? AND ?
ORDER BY T.x, T.y

Upvotes: 0

Related Questions