Reputation: 21
I have two tables containing 6M rows each. I'm trying to join the two using an inner join but the query ran for 2 days without finishing. The join is (note I've used count(*)
just to enable me to run an explain, I'm actually using the join
in a CTAS):
SELECT count(*)
FROM table1 t1,
table2 t2
WHERE t1.col1 = t2.colA
AND t1.col2 = t2.colB;
After a bit of investigation I've found the below query runs fine:
SELECT count(*)
FROM
(SELECT *
FROM table1) t1,
(SELECT *
FROM table2) t2
WHERE t1.col1 = t2.colA
AND t1.col2 = t2.colB;
The only difference between that instead of the table, I use the sub-query SELECT * FROM table
;
Running the explain plans shows that the latter query is building up an index when it selects table2. Whereas the first query is using a join buffer (Block Nested Loop).
Surely MySQL is clever enough to work out that the two queries are practically identical and do the same with both queries? I don't see why an index should be need because a full scan is required for both tables anyway. These are temporary/transitory tables so if I did put an index on, it would literally be just to perform this join.
Is there a way to fix this via MySQL configuration?
Upvotes: 2
Views: 2665
Reputation: 48129
You NEED the index on at least ONE of the tables, even such as
create index Temp1 on Table2 ( colA, colB )
So, your query from Table 1 joined to table 2, so even if a table scan is on all of table 1, you need it to quickly find the record(s) that match in table 2. If NEITHER has an index, then think of it this way. For every record in Table1, scan through ALL records in Table 2 and grab all records that match for ColA, ColB. Now, go back to table 1 for the SECOND record... go back through table 2 for ALL records until it finds a match.
Being that you have 6M records, you could practically choke a cow (so-to-speak) on performance. By having an index, even on the SECOND table, when the query is on the first record, it can immediately jump to the rows that match ColA, ColB and as soon as those A/B records are done, it goes back to the first table.
Now, for other overhead efficiencies. If you have BOTH tables indexed on respective Col1, Col2 and ColA, ColB, then the engine will have in its memory / cache a whole block of records for each common area and doesn't have to keep going back to the raw data pages for other elements repeatedly.
So, even though you think it might not be practical, it is still good to handle large table queries. Also, if you have multiple records in the first table with the same values for Col1, Col2, but have different other values for other columns in the table, and similarly in the second table for multiple ColA, ColB, you would get a Cartesian result. Consider the following scenario
Table1
Col1 Col2 OtherColumn
X Y blah1
X Y blah2
X Y blah3
Table2
ColA ColB OtherColumn
X Y second blah1
X Y second blah2
X Y second blah3
A simple query like you have
SELECT count(*)
FROM table1 t1,
table2 t2
WHERE t1.col1 = t2.colA
AND t1.col2 = t2.colB;
would result in a count of 9. You have 6M records and a possible Cartesian result? Hopefully this clarifies some problems you may be encountering.
Upvotes: 1