fuschia
fuschia

Reputation: 283

update join table taking a very long time

I have a table citations with 5 million rows with following info:

Paperkey1 | Year1 | Paperkey2 | Year2 
100                   20
200                   90
300                   80

And another table pub_year with around 3 million rows with following info:

Paperkey | Year
100        2001
200        2002
20         2003
90         2004
80         2005

I want to update the table citations by taking the year value from table pub_year. I used the following query but its been running for more than 3 hours and still its not finished yet.

update citations T2

join pub_year T1 on T2.paperkey1= T1.paperkey

set T2.year1 = T1.year;

Is there anybody who knows what are the main reason that its taking too long? Im not sure if it ever finish if i keep letting it runs. Or is there something wrong with my query? The paperkey fields are all in varchar and the year fields are all in integer. Thank you.

Here is the update after running EXPLAIN:

enter image description here

Upvotes: 2

Views: 7899

Answers (1)

axiac
axiac

Reputation: 72286

The second row has value ALL in column type. This is the cause of the very, very slow execution. For each of the 5 million rows from citations it needs to scan all the 3 million rows of table pub_year in order to find the matching rows for the JOIN clause. An index will fix that.

Add an index on column Paperkey1 of table citations:

ALTER TABLE `citations` ADD INDEX (`Paperkey1`);

Also add an index on column Paperkey of table pub_year:

ALTER TABLE `pub_year` ADD INDEX (`Paperkey`);

If one of the two tables already contain an index on the aforementioned column (or it is the first column in a multi-column index) then skip that table; having identical indexes doesn't help.

After the indexes are created (they will take some time to complete, especially if there is other activity on these tables in the same time), run the EXPLAIN again and check the result. You should get ref or eq_ref in the column type of the second row.

Now the UPDATE will complete faster. It will still take a couple of minutes (or even more if the tables are accessed by other processes during the query) but this OK when you update 5 million records.

For performance reasons, on INNER JOINs it's recommended to put first the table that produces the smallest number of rows in the final result set. In this case, that table is pub_year:

UPDATE pub_year T1
INNER JOIN citations T2 ON T2.paperkey1 = T1.paperkey
SET T2.year1 = T1.year

(As a side note, the MySQL query optimizer is smart enough to change the query and put the tables in the order that provides the best execution time. You can see that in the result of the EXPLAIN query from the question: table T1 (pub_year) comes first.)

Upvotes: 6

Related Questions