Carlo V. Dango
Carlo V. Dango

Reputation: 13832

What are the consequences of converting heap-indexes to clustered indexes on SQL Server?

I've recently got the advice, that I should convert all our tables from using heap indexes such that each table has a clustered index. What are the consequences of persuing this strategy? E.g. is it more important to regularly reorganize the database? datagrowth? danger of really slow inserts? Danger of page-defragmentation if the PK is a GUID? Noticable speed-increase of my application? What are your experiences?

To serve as inspiration for good answers, here are some of the "facts" I've picked up from other threads here on stackoverflow

  1. Almost certainly want to establish a clustered index on every table in your database. If a table does not have one. Performance of most common queries is better.
  2. Clustered indexes are not always bad on GUIDs... it all depends upon the needs of your application. The INSERT speed will suffer, but the SELECT speed will be improved.
  3. The problem with clustered indexes in a GUID field are that the GUIDs are random, so when a new record is inserted, a significant portion of the data on disk has to be moved to insert the records into the middle of the table.
  4. Clustered index on GUID is ok in situations where the GUID has a meaning and improves performance by placing related data close to each other http://randommadness.blogspot.com/2008/07/guids-and-clustered-indexes.html
  5. Clustering doesn't affect lookup speed - a unique non-clustered index should do the job.

Upvotes: 6

Views: 2070

Answers (3)

Michael
Michael

Reputation: 107

I can recommend the book "SQL Performance Explained" - it is a 200 page book about indexes.

It also mentions when clustered indexes have worse performance than normal indexs. One of the problems is that the clustered index itself is a B-tree. So when you have other indexes on the same table, they can't point to a specifik row - instead they point to a "key" in the clustered index, so "the way" to the data gets longer.

Upvotes: 1

Remus Rusanu
Remus Rusanu

Reputation: 294247

If your key is a GUID, then a non-clustered index on it is probably just as effective as a clustered index on it. This is because on GUIDs you absolutely never ever can have range scans on them (what could between 'b4e8e994-c315-49c5-bbc1-f0e1b000ad7c' and '3cd22676-dffe-4152-9aef-54a6a18d32ac' possibly mean??). With a width of 16 bytes, a GUID clustered index key is wider than a row id that you'd get from a heap, so a NC index on a PK guid is actually strategy that can be defended in a discussion.

But making the primary key the clustered index key is not the only way to build a clustered index over your heap. Do you have other frequent queries that request ranges over a certain column? Typical candidates are columns like date, state or deleted. If you do, then you should consider making those columns the clustered index key (it does not have to be unique) because doing so may help queries that request ranges, like 'all records from yesterday'.

The only scenario where heaps have significant performance benefit is inserts, specially bulk inserts. IF your load is not insert heavy, then you should definitely go for a clustered index. See Clustered Index Design Guidelines.

Going over over your points:

Almost certainly want to establish a clustered index on every table in your database. If a table does not have one. Performance of most common queries is better.

A clustered index that can satisfy range requirements for most queries will dramatically improve performance, true. A clustered index that can satisfy order requirements can be helpful too, but nowhere as helpful as one that can satisfy a range.

Clustered indexes are not always bad on GUIDs... it all depends upon the needs of your application. The INSERT speed will suffer, but the SELECT speed will be improved.

Only probe SELECTs will be improved: SELECT ... WHERE key='someguid';. Queries by object ID and Foreign key lookups will benefit from this clustered index. A NC index can server the same purpose just as well.

The problem with clustered indexes in a GUID field are that the GUIDs are random, so when a new record is inserted, a significant portion of the data on disk has to be moved to insert the records into the middle of the table.

Wrong. Insert into position in an index does not have to move data. The worst it can happen is a page-split. A Page-split is (somehow) expensive, but is not the end of the world. You comment suggest that all data (or at least a 'significant' part) has to be moved to make room for the new row, this is nowhere near true.

Clustered index on GUID is ok in situations where the GUID has a meaning and improves performance by placing related data close to each other http://randommadness.blogspot.com/2008/07/guids-and-clustered-indexes.html

I can't possibly imagine a scenario where GUID can have 'related data'. A GUID is the quintessential random structure how could two random GUIDs relate in any way? The scenario Donald gives has a better solution: Resolving PAGELATCH Contention on Highly Concurrent INSERT Workloads, which is cheaper to implement (less storage required) and works for unique keys too (the solution in linked article would not work for unique keys, only for foreign keys).

Clustering doesn't affect lookup speed - a unique non-clustered index should do the job.

For probes (lookup a specific unique key) yes. A NC index is almost as fast as the clustered index (the NC index lookup does require and additional key lookup to fetch in the rest of the columns). Where clustered index shines is range scans, as it the clustered index can cover any query, while a NC index that could potentially satisfy the same range may loose on the coverage and trigger the Index Tipping Point.

Upvotes: 10

marc_s
marc_s

Reputation: 754438

I would also recommend you read Kimberly Tripp's The Clustered Index Debate Continues... in which she details quite clearly all the benefits of having a *good clustering key over having a heap.

Pretty much all operations are faster - yes! even inserts and updates!

But this requires a good clustering key, and a GUID with its very random and unpredictable nature is not considered a good candidate for a clustering key. GUIDs as clustering key are bad - whether they have application meaning or not - just avoid those.

Your best bet is a key which is narrow, stable, unique and ever-increasing - a column of type INT IDENTITY fulfills all those requirements ideally.

For a lot more background on why a GUID doesn't make a good clustering key, and on just how bad it is, see more of Kim Tripp's blog posts:

Upvotes: 2

Related Questions