Reputation: 43
I have 3 tables that look like this:
CREATE TABLE big_table_1 (
id INT(11),
col1 TINYINT(1),
col2 TINYINT(1),
col3 TINYINT(1),
PRIMARY KEY (`id`)
)
And so on for big_table_2 and big_table_3. The col1, col2, col3 values are either 0, 1 or null.
I'm looking for id's whose col1 value equals 1 in each table. I join them as follows, using the simplest method I can think of:
SELECT t1.id
FROM big_table_1 AS t1
INNER JOIN big_table_2 AS t2 ON t2.id = t1.id
INNER JOIN big_table_3 AS t3 ON t3.id = t1.id
WHERE t1.col1 = 1
AND t2.col1 = 1
AND t3.col1 = 1;
With 10 million rows per table, the query takes about 40 seconds to execute on my machine:
407231 rows in set (37.19 sec)
Explain results:
+----+-------------+-------+--------+---------------+---------+---------+--------------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+--------------+----------+-------------+
| 1 | SIMPLE | t3 | ALL | PRIMARY | NULL | NULL | NULL | 10999387 | Using where |
| 1 | SIMPLE | t1 | eq_ref | PRIMARY | PRIMARY | 4 | testDB.t3.id | 1 | Using where |
| 1 | SIMPLE | t2 | eq_ref | PRIMARY | PRIMARY | 4 | testDB.t3.id | 1 | Using where |
+----+-------------+-------+--------+---------------+---------+---------+--------------+----------+-------------+
If I declare index on col1, the result is slightly slower:
407231 rows in set (40.84 sec)
I have also tried the following query:
SELECT t1.id
FROM (SELECT distinct ta1.id FROM big_table_1 ta1 WHERE ta1.col1=1) as t1
WHERE EXISTS (SELECT ta2.id FROM big_table_2 ta2 WHERE ta2.col1=1 AND ta2.id = t1.id)
AND EXISTS (SELECT ta3.id FROM big_table_3 ta3 WHERE ta3.col1=1 AND ta3.id = t1.id);
But it's slower:
407231 rows in set (44.01 sec) [with index on col1]
407231 rows in set (1 min 36.52 sec) [without index on col1]
Is the aforementioned simple method basically the fastest way to do this in MySQL? Would it be necessary to shard the table onto multiple servers in order to get the result faster?
Addendum: EXPLAIN results for Andrew's code as requested (I trimmed the tables down to 1 million rows only, and the index is on id and col1):
+----+-------------+-------------+-------+---------------+---------+---------+------+---------+--------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+-------+---------------+---------+---------+------+---------+--------------------------------+
| 1 | PRIMARY | <derived3> | ALL | NULL | NULL | NULL | NULL | 332814 | |
| 1 | PRIMARY | <derived4> | ALL | NULL | NULL | NULL | NULL | 333237 | Using where; Using join buffer |
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 333505 | Using where; Using join buffer |
| 4 | DERIVED | big_table_3 | index | NULL | PRIMARY | 5 | NULL | 1000932 | Using where; Using index |
| 3 | DERIVED | big_table_2 | index | NULL | PRIMARY | 5 | NULL | 1000507 | Using where; Using index |
| 2 | DERIVED | big_table_1 | index | NULL | PRIMARY | 5 | NULL | 1000932 | Using where; Using index |
+----+-------------+-------------+-------+---------------+---------+---------+------+---------+--------------------------------+
Upvotes: 4
Views: 3888
Reputation: 142518
INNER JOIN
(same as JOIN
) lets the optimizer pick whether to use the table to its left or the table to its right. The simplified SELECT
you presented could start with any of the three tables.
The optimizer likes to start with the table with the WHERE
clause. Your simplified example implies that each table is equally good IF there is an INDEX
starting with col1
. (See retraction below.)
The second and subsequent tables need a different rule for indexing. In your simplified example, col1
is used for filtering and id
is used for JOINing
. INDEX(col1, id)
and INDEX(id, col1)
work equally well for getting to the second table.
I keep saying "your simplified example" because as soon as you change anything, most of the advice in these answers is up for grabs.
(The retraction) When you have a column with "low cardinality" such as your col%
, with only 0,1,NULL possibilities, INDEX(col1)
is essentially useless since it it faster to blindly scan the table rather than use the index.
On the other hand, INDEX(col1, ...)
may be useful, as mentioned for the second table.
However neither is useful for the first table. If you have such an INDEX
, it will be ignored.
Then comes "covering". Again, your example is unrealistically simplistic because there are essentially no fields touched other than id
and col1
. A "covering" index includes all the fields of a table that are touched in the query. A covering index is virtually always smaller than the data, so it takes less effort to run through a covering index, hence faster.
(Retract the retraction) INDEX(col1, id)
, in that order is a useful covering index for the first table.
Imagine how my discussion had gone if you had not mentioned that col1 had only 3 values. Quite different.
And we have not gotten to ORDER BY
, IN(...)
, BETWEEN...AND...
, engine differences, tricks with the PRIMARY KEY
, LEFT JOIN
, etc.
More insight into building indexes from Selects.
ANALYZE TABLE
should not be necessary.
Upvotes: 2
Reputation: 7788
Have you tried this:
SELECT t1.id
FROM
(SELECT id from big_table_1 where col1 = 1) AS t1
INNER JOIN (SELECT id from big_table_2 where col1 = 1) AS t2 ON t2.id = t1.id
INNER JOIN (SELECT id from big_table_3 where col1 = 1) AS t3 ON t3.id = t1.id
Upvotes: 0
Reputation: 325
For kicks try it with a covered index (a composite of id,col1) So 1 index make it primary composite. No other indexes.
Then run analyze table
xxx (3 times total, once per table)
Then fire it off hoping the mysql cbo isnt to dense to figure it out.
Second idea is to see results without a where clause. Convert it all inside of join on clause
Upvotes: 1