Reputation: 283
I have a table citations
with 5 million rows with following info:
Paperkey1 | Year1 | Paperkey2 | Year2
100 20
200 90
300 80
And another table pub_year
with around 3 million rows with following info:
Paperkey | Year
100 2001
200 2002
20 2003
90 2004
80 2005
I want to update the table citations
by taking the year value from table pub_year
. I used the following query but its been running for more than 3 hours and still its not finished yet.
update citations T2
join pub_year T1 on T2.paperkey1= T1.paperkey
set T2.year1 = T1.year;
Is there anybody who knows what are the main reason that its taking too long? Im not sure if it ever finish if i keep letting it runs. Or is there something wrong with my query? The paperkey fields are all in varchar and the year fields are all in integer. Thank you.
Here is the update after running EXPLAIN:
Upvotes: 2
Views: 7899
Reputation: 72286
The second row has value ALL
in column type
. This is the cause of the very, very slow execution. For each of the 5 million rows from citations
it needs to scan all the 3 million rows of table pub_year
in order to find the matching rows for the JOIN
clause. An index will fix that.
Add an index on column Paperkey1
of table citations
:
ALTER TABLE `citations` ADD INDEX (`Paperkey1`);
Also add an index on column Paperkey
of table pub_year
:
ALTER TABLE `pub_year` ADD INDEX (`Paperkey`);
If one of the two tables already contain an index on the aforementioned column (or it is the first column in a multi-column index) then skip that table; having identical indexes doesn't help.
After the indexes are created (they will take some time to complete, especially if there is other activity on these tables in the same time), run the EXPLAIN
again and check the result.
You should get ref
or eq_ref
in the column type
of the second row.
Now the UPDATE
will complete faster. It will still take a couple of minutes (or even more if the tables are accessed by other processes during the query) but this OK when you update 5 million records.
For performance reasons, on INNER JOIN
s it's recommended to put first the table that produces the smallest number of rows in the final result set. In this case, that table is pub_year
:
UPDATE pub_year T1
INNER JOIN citations T2 ON T2.paperkey1 = T1.paperkey
SET T2.year1 = T1.year
(As a side note, the MySQL query optimizer is smart enough to change the query and put the tables in the order that provides the best execution time. You can see that in the result of the EXPLAIN
query from the question: table T1
(pub_year
) comes first.)
Upvotes: 6