Mike
Mike

Reputation: 505

MySQL: optimizing a JOIN query

Say I have two MyISAM tables:

tab_big:   id1, id2, id_a, ord         (5 billion records)
tab_small: id1, id2, id_b              (1 billion records)


CREATE TABLE IF NOT EXISTS `tab_big` (
  `id_a` int(10) unsigned NOT NULL,
  `id1` int(10) unsigned NOT NULL,
  `id2` int(10) unsigned NOT NULL,
  `ord` int(10) unsigned NOT NULL DEFAULT '1',
  PRIMARY KEY (`id_a`,`id1`,`id2`),
  KEY `id1` (`id1`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;


CREATE TABLE IF NOT EXISTS `tab_small` (
  `id_b` int(10) unsigned NOT NULL,
  `id1` int(10) unsigned NOT NULL,
  `id2` int(10) unsigned NOT NULL,
  PRIMARY KEY (`id_b`,`id1`,`id2`),
  KEY `id_b` (`id_b`),
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

All fields are INT. In both tables the combination of the three id fields (respectively id1, id2, id_a and id1, id2, id_b) values is unique, so for both I created a primary key on these three fields.

I need an efficient query that gets unique values of id_a from the first table, where:

  1. id_b in the second table table is a given value (narrowing it down to ca. 10k entries)
  2. id1/id2 combination is identical in both tables
  3. id_a in the first table is not same as either of id1, id2 fields in the tab_small subset (as narrowed down by id_b field); after a bit of fiddling it seems that generating the list (around 200 ids) in php and providing it as text performs better than adding another JOIN).

I believe it's not very cacheable, since both tables change all the time (rows are added).

My current query is pretty straightforward:

SELECT tab_big.id_a FROM tab_big, tab_small
    WHERE tab_small.id_b = '$constant'
    AND tab_big.id1 = tab_small.id1 AND tab_big.id2 = tab_small.id2
    AND tab_big.id_a NOT IN ({comma delimited list of 200 ids})
    GROUP BY tab_big.id_a
    ORDER BY SUM(tab_big.ord) DESC
    LIMIT 10

It works, but not fast enough to really use it. What can be done with it?

EXPLAIN says it first gets a ranged query from tab_big, then applies that to tab_small (Edit: added below). I don't know why (EXPLAIN says the query uses primary keys), but adding a tab_big.id1 index seems to help a bit. Also, trying to make it go the other way around with STRAIGHT_JOIN, first selecting a 10k subset from (smaller) tab_small and then using it to search in (larger) tab_big gives much worse results than default (Edit: with a small dataset that I have now to test on; on production data it would apparently be the other way around and EXPLAIN would look like the second one).

+----+-------------+-----------+--------+-----------------+---------+---------+-------------------------------------------+---------+----------------------------------------------+
| id | select_type | table     | type   | possible_keys   | key     | key_len | ref                                       | rows    | Extra                                        |
+----+-------------+-----------+--------+-----------------+---------+---------+-------------------------------------------+---------+----------------------------------------------+
|  1 | SIMPLE      | tab_big   | range  | PRIMARY,id1     | PRIMARY | 4       | NULL                                      | 1374793 | Using where; Using temporary; Using filesort | 
|  1 | SIMPLE      | tab_small | eq_ref | PRIMARY,id_b    | PRIMARY | 12      | const,db.tab_big.id1,db.tab_big.id2       |       1 | Using index                                  | 
+----+-------------+-----------+--------+-----------------+---------+---------+-------------------------------------------+---------+----------------------------------------------+

On larger datasets EXPLAIN would probably look more like this (disregard the 'rows' values though - it's taken from a smaller dataset):

+----+-------------+-----------+------+---------------------+---------+---------+------------------+-------+----------------------------------------------+
| id | select_type | table     | type | possible_keys       | key     | key_len | ref              | rows  | Extra                                        |
+----+-------------+-----------+------+---------------------+---------+---------+------------------+-------+----------------------------------------------+
|  1 | SIMPLE      | tab_small | ref  | PRIMARY,id_b,id1    | PRIMARY | 4       | const            |   259 | Using index; Using temporary; Using filesort | 
|  1 | SIMPLE      | tab_big   | ref  | PRIMARY,id1         | id1     | 4       | db.tab_small.id1 | 25692 | Using where                                  | 
+----+-------------+-----------+------+---------------------+---------+---------+------------------+-------+----------------------------------------------+

Any thoughts?

Upvotes: 2

Views: 606

Answers (3)

Quassnoi
Quassnoi

Reputation: 425251

Create the following indexes:

CREATE INDEX ix_big_1_2_a ON tab_big (id1, id2, id_a)
CREATE UNIQUE INDEX ux_small_b_2_1 ON tab_small (id_b, id2, id1)

and try this:

SELECT  DISTINCT
        a.id_a
FROM    tab_small b
JOIN    tab_big a
ON      (a.id1, a.id2) = (b.id1, b.id2)
WHERE   b.id_b = 2
        AND a.id_a NOT IN
        (
        SELECT  id1
        FROM    tab_small b1 /* FORCE INDEX (PRIMARY) */
        WHERE   b1.id_b = 2
        )
        AND a.id_a NOT IN
        (
        SELECT  id2
        FROM    tab_small b2 /* FORCE INDEX (ux_small_b_2_1) */
        WHERE   b2.id_b = 2
        )

, which produces this query plan:

1, 'PRIMARY', 'b', 'ref', 'PRIMARY,ux_small_b_2_1', 'PRIMARY', '4', 'const', 1, 100.00, 'Using index; Using temporary'
1, 'PRIMARY', 'a', 'ref', 'ix_big_1_2', 'ix_big_1_2', '8', 'test.b.id1,test.b.id2', 2, 100.00, 'Using where'
3, 'DEPENDENT SUBQUERY', 'b2', 'ref', 'ux_small_b_2_1', 'ux_small_b_2_1', '8', 'const,func', 1, 100.00, 'Using index'
2, 'DEPENDENT SUBQUERY', 'b1', 'ref', 'PRIMARY', 'PRIMARY', '8', 'const,func', 1, 100.00, 'Using index'

It is not as efficient as it could be, still I'm expecting this to be faster than your query.

I commented out the FORCE INDEX statements, but you may need to uncomment them is the optimizer will not pick these indexes.

Everything would be much simpler if MySQL were capable of doing FULL OUTER JOIN using MERGE, but it does not.

Update:

Judging by your statistics, this query will be more efficient:

SELECT  id_a
FROM    (
        SELECT  DISTINCT id_a
        FROM    tab_big ad
        ) a
WHERE   id_a NOT IN
        (
        SELECT  id1
        FROM    tab_small b1 FORCE INDEX (PRIMARY)
        WHERE   b1.id_b = 2
        )
        AND id_a NOT IN
        (
        SELECT  id2
        FROM    tab_small b2 FORCE INDEX (ux_small_b_2_1)
        WHERE   b2.id_b = 2
        )
        AND EXISTS
        (
        SELECT  NULL
        FROM    tab_small be
        JOIN    tab_big ae
        ON      (ae.id1, ae.id2) = (be.id1, be.id2)
        WHERE   be.id_b = 2
                AND ae.id_a = a.id_a
        )

It works as follows:

  • Builds the list of DISTINCT id_a (which is 100,000 rows long)
  • Filters out the values present in the subset
  • For each value of id_a, it searches the subset for the presence of (id_a, id1, id2). This is done by iterating the subset. Since the probability to find this value is high, most probably the search will succeed in 10 rows or so from the beginning of the subset, and EXISTS will return that very moment.

This will most probably need to evaluate just about 1,000,000 records or so.

Make sure that the following plan is used:

1, 'PRIMARY', '<derived2>', 'ALL', '', '', '', '', 8192, 100.00, 'Using where'
5, 'DEPENDENT SUBQUERY', 'be', 'ref', 'PRIMARY,ux_small_b_2_1', 'PRIMARY', '4', 'const', 1, 100.00, 'Using index'
5, 'DEPENDENT SUBQUERY', 'ae', 'eq_ref', 'PRIMARY,ix_big_1_2', 'PRIMARY', '12', 'a.id_a,test.be.id1,test.be.id2', 1, 100.00, 'Using index'
4, 'DEPENDENT SUBQUERY', 'b2', 'ref', 'ux_small_b_2_1', 'ux_small_b_2_1', '8', 'const,func', 1, 100.00, 'Using index'
3, 'DEPENDENT SUBQUERY', 'b1', 'ref', 'PRIMARY', 'PRIMARY', '8', 'const,func', 1, 100.00, 'Using index'
2, 'DERIVED', 'ad', 'range', '', 'PRIMARY', '4', '', 10, 100.00, 'Using index for group-by'

, the most important part being Using index for group-by in the last row.

Upvotes: 3

Thorsten
Thorsten

Reputation: 13181

I would suggest to put an index on all four columns that are part of the join (either four separate indexes on the tb.id1, tb.id2, ts.id1 and ts.id2 column, or two on tb.id1/id2 and ts.id1/id2). Then see if that gives you any better performance. (I think it does, but you never know unless trying it out.)


NOTE: The following idea does not work, but I left it in so the comments still make some sense.

Also, instead of using the PHP generated list, can't you express your restriction (3) in the join condition (or if you prefer, in the where clause) as well? (Similar to what rexem suggested)

SELECT tb.id_a
  FROM TAB_BIG tb
  JOIN TAB_SMALL ts ON ts.id1 = tb.id1
                 AND ts.id2 = tb.id2
                 AND tb.id1 <> ts.id_a
                 AND tb.id2 <> ts.id_a
 WHERE ts.id_b = ?

But this is more for clarity and simplicity than performance. (Also note that the additional conditions may require another index on id_a and probably separate indexes on tb.id1 and tb.id2.)

Upvotes: 0

rubayeet
rubayeet

Reputation: 9400

Have you tried tab_small LEFT JOIN tab_big? Also you can create indexes on the fields tab_small.id_b and tab_big.id_a

Upvotes: 0

Related Questions