Reputation: 505
Say I have two MyISAM tables:
tab_big: id1, id2, id_a, ord (5 billion records)
tab_small: id1, id2, id_b (1 billion records)
CREATE TABLE IF NOT EXISTS `tab_big` (
`id_a` int(10) unsigned NOT NULL,
`id1` int(10) unsigned NOT NULL,
`id2` int(10) unsigned NOT NULL,
`ord` int(10) unsigned NOT NULL DEFAULT '1',
PRIMARY KEY (`id_a`,`id1`,`id2`),
KEY `id1` (`id1`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
CREATE TABLE IF NOT EXISTS `tab_small` (
`id_b` int(10) unsigned NOT NULL,
`id1` int(10) unsigned NOT NULL,
`id2` int(10) unsigned NOT NULL,
PRIMARY KEY (`id_b`,`id1`,`id2`),
KEY `id_b` (`id_b`),
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
All fields are INT. In both tables the combination of the three id fields (respectively id1, id2, id_a and id1, id2, id_b) values is unique, so for both I created a primary key on these three fields.
I need an efficient query that gets unique values of id_a from the first table, where:
I believe it's not very cacheable, since both tables change all the time (rows are added).
My current query is pretty straightforward:
SELECT tab_big.id_a FROM tab_big, tab_small
WHERE tab_small.id_b = '$constant'
AND tab_big.id1 = tab_small.id1 AND tab_big.id2 = tab_small.id2
AND tab_big.id_a NOT IN ({comma delimited list of 200 ids})
GROUP BY tab_big.id_a
ORDER BY SUM(tab_big.ord) DESC
LIMIT 10
It works, but not fast enough to really use it. What can be done with it?
EXPLAIN says it first gets a ranged query from tab_big, then applies that to tab_small (Edit: added below). I don't know why (EXPLAIN says the query uses primary keys), but adding a tab_big.id1 index seems to help a bit. Also, trying to make it go the other way around with STRAIGHT_JOIN, first selecting a 10k subset from (smaller) tab_small and then using it to search in (larger) tab_big gives much worse results than default (Edit: with a small dataset that I have now to test on; on production data it would apparently be the other way around and EXPLAIN would look like the second one).
+----+-------------+-----------+--------+-----------------+---------+---------+-------------------------------------------+---------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+--------+-----------------+---------+---------+-------------------------------------------+---------+----------------------------------------------+
| 1 | SIMPLE | tab_big | range | PRIMARY,id1 | PRIMARY | 4 | NULL | 1374793 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | tab_small | eq_ref | PRIMARY,id_b | PRIMARY | 12 | const,db.tab_big.id1,db.tab_big.id2 | 1 | Using index |
+----+-------------+-----------+--------+-----------------+---------+---------+-------------------------------------------+---------+----------------------------------------------+
On larger datasets EXPLAIN would probably look more like this (disregard the 'rows' values though - it's taken from a smaller dataset):
+----+-------------+-----------+------+---------------------+---------+---------+------------------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+------+---------------------+---------+---------+------------------+-------+----------------------------------------------+
| 1 | SIMPLE | tab_small | ref | PRIMARY,id_b,id1 | PRIMARY | 4 | const | 259 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | tab_big | ref | PRIMARY,id1 | id1 | 4 | db.tab_small.id1 | 25692 | Using where |
+----+-------------+-----------+------+---------------------+---------+---------+------------------+-------+----------------------------------------------+
Any thoughts?
Upvotes: 2
Views: 606
Reputation: 425251
Create the following indexes:
CREATE INDEX ix_big_1_2_a ON tab_big (id1, id2, id_a)
CREATE UNIQUE INDEX ux_small_b_2_1 ON tab_small (id_b, id2, id1)
and try this:
SELECT DISTINCT
a.id_a
FROM tab_small b
JOIN tab_big a
ON (a.id1, a.id2) = (b.id1, b.id2)
WHERE b.id_b = 2
AND a.id_a NOT IN
(
SELECT id1
FROM tab_small b1 /* FORCE INDEX (PRIMARY) */
WHERE b1.id_b = 2
)
AND a.id_a NOT IN
(
SELECT id2
FROM tab_small b2 /* FORCE INDEX (ux_small_b_2_1) */
WHERE b2.id_b = 2
)
, which produces this query plan:
1, 'PRIMARY', 'b', 'ref', 'PRIMARY,ux_small_b_2_1', 'PRIMARY', '4', 'const', 1, 100.00, 'Using index; Using temporary'
1, 'PRIMARY', 'a', 'ref', 'ix_big_1_2', 'ix_big_1_2', '8', 'test.b.id1,test.b.id2', 2, 100.00, 'Using where'
3, 'DEPENDENT SUBQUERY', 'b2', 'ref', 'ux_small_b_2_1', 'ux_small_b_2_1', '8', 'const,func', 1, 100.00, 'Using index'
2, 'DEPENDENT SUBQUERY', 'b1', 'ref', 'PRIMARY', 'PRIMARY', '8', 'const,func', 1, 100.00, 'Using index'
It is not as efficient as it could be, still I'm expecting this to be faster than your query.
I commented out the FORCE INDEX
statements, but you may need to uncomment them is the optimizer will not pick these indexes.
Everything would be much simpler if MySQL
were capable of doing FULL OUTER JOIN
using MERGE
, but it does not.
Update:
Judging by your statistics, this query will be more efficient:
SELECT id_a
FROM (
SELECT DISTINCT id_a
FROM tab_big ad
) a
WHERE id_a NOT IN
(
SELECT id1
FROM tab_small b1 FORCE INDEX (PRIMARY)
WHERE b1.id_b = 2
)
AND id_a NOT IN
(
SELECT id2
FROM tab_small b2 FORCE INDEX (ux_small_b_2_1)
WHERE b2.id_b = 2
)
AND EXISTS
(
SELECT NULL
FROM tab_small be
JOIN tab_big ae
ON (ae.id1, ae.id2) = (be.id1, be.id2)
WHERE be.id_b = 2
AND ae.id_a = a.id_a
)
It works as follows:
DISTINCT id_a
(which is 100,000
rows long)id_a
, it searches the subset for the presence of (id_a, id1, id2)
. This is done by iterating the subset. Since the probability to find this value is high, most probably the search will succeed in 10
rows or so from the beginning of the subset, and EXISTS
will return that very moment.This will most probably need to evaluate just about 1,000,000
records or so.
Make sure that the following plan is used:
1, 'PRIMARY', '<derived2>', 'ALL', '', '', '', '', 8192, 100.00, 'Using where'
5, 'DEPENDENT SUBQUERY', 'be', 'ref', 'PRIMARY,ux_small_b_2_1', 'PRIMARY', '4', 'const', 1, 100.00, 'Using index'
5, 'DEPENDENT SUBQUERY', 'ae', 'eq_ref', 'PRIMARY,ix_big_1_2', 'PRIMARY', '12', 'a.id_a,test.be.id1,test.be.id2', 1, 100.00, 'Using index'
4, 'DEPENDENT SUBQUERY', 'b2', 'ref', 'ux_small_b_2_1', 'ux_small_b_2_1', '8', 'const,func', 1, 100.00, 'Using index'
3, 'DEPENDENT SUBQUERY', 'b1', 'ref', 'PRIMARY', 'PRIMARY', '8', 'const,func', 1, 100.00, 'Using index'
2, 'DERIVED', 'ad', 'range', '', 'PRIMARY', '4', '', 10, 100.00, 'Using index for group-by'
, the most important part being Using index for group-by
in the last row.
Upvotes: 3
Reputation: 13181
I would suggest to put an index on all four columns that are part of the join (either four separate indexes on the tb.id1, tb.id2, ts.id1 and ts.id2 column, or two on tb.id1/id2 and ts.id1/id2). Then see if that gives you any better performance. (I think it does, but you never know unless trying it out.)
NOTE: The following idea does not work, but I left it in so the comments still make some sense.
Also, instead of using the PHP generated list, can't you express your restriction (3) in the join condition (or if you prefer, in the where clause) as well? (Similar to what rexem suggested)
SELECT tb.id_a
FROM TAB_BIG tb
JOIN TAB_SMALL ts ON ts.id1 = tb.id1
AND ts.id2 = tb.id2
AND tb.id1 <> ts.id_a
AND tb.id2 <> ts.id_a
WHERE ts.id_b = ?
But this is more for clarity and simplicity than performance. (Also note that the additional conditions may require another index on id_a and probably separate indexes on tb.id1 and tb.id2.)
Upvotes: 0
Reputation: 9400
Have you tried tab_small LEFT JOIN tab_big
? Also you can create indexes on the fields tab_small.id_b
and tab_big.id_a
Upvotes: 0