Jeff Meatball Yang
Jeff Meatball Yang

Reputation: 39027

How to speed up a massive update to the clustered column?

I have a pretty large table: 20+ million rows and I need to update about 5% of that - or 1 million rows.

Unfortunately, I am updating the (int) column that is being used as the clustered index.

My question is: What is the fastest way to update these rows?

I have tried updating the rows directly:

update t1
set t1.groupId = t2.groupId
from
    table t1
join newtable t2 on t1.email = t2.email

but this takes WAY too long (I stopped it after 3 hours)

I assume that this is because the entire row (which has 2 datetimes, 2 varchars, and 2 ints) is being moved around for each update.

What if I dropped the clustered index first, then did the updates, then recreated the clustered index? Would that be faster?

Note: I have a nonclustered index on email, in case anyone thinks it's the select part of the query that is slow. It's not.

Upvotes: 7

Views: 2860

Answers (3)

Jeff Meatball Yang
Jeff Meatball Yang

Reputation: 39027

Here's what I did (and it was much faster):

  1. I dropped the clustered index.
  2. I ALSO dropped foreign keys references (the two other int columns).
  3. I ran the update statement
  4. I recreated the index, which was faster than expected. (This is the original reason I asked SO first).

This brought the entire process down to a matter of seconds. Yes, ~ 1 million rows in about 15 seconds.

The second step was crucial because the foreign keys were forcing the update to do some sort of spool on the related tables, which each also have a large number of rows.

The number of physical reads were tripled because of these foreign key lookups.

I'm not sure why SQL Server needs to do that, but my guess is that it still performs the integrity check even if I'm not updating that column but I am moving the entire row (clustered column update).


As a side note, I had also tried running the update in batches:

update top(1000) t1
set t1.groupId = t2.groupId
from
    table t1
join newtable t2 on t1.email = t2.email

This was fine (and seemed to scale up to about 10K per batch) but it still was on the order of 1-2 minutes each batch.


In summary, I've learned that for bulk updates, temporarily removing indexes can be very helpful.

Upvotes: 8

nso1
nso1

Reputation: 605

I think the comment earlier is right. You have sort of answered your own question.

Because

Clustered indexes sort and store the data rows in the table based on their key values (source msdn),

you may be better just dropping the clustered index (keep the index on email). When the operation is done then recreating the clustered index. As long as groupid is not involved in any other indexes I wouldn't touch them. If group id is involved in other indexes then drop them. I would leave at least an index on email, just to make the join fast.

Upvotes: 2

Bravax
Bravax

Reputation: 10493

Why don't you do the following:

  • Make a different column (one of the varchar's) the clustered index
  • Create an index on your groupId
  • update
  • Then reverse the process.

This should be quicker.

Upvotes: -2

Related Questions