Clustered index in SQL Server : any advantage for the columns to be first in schema?

Question

I have a table in SQL Server with a three-column clustered index.

I have a table with columns (CustomerID, A, ProductID, C, OtherID) and I have a clustered key on (OtherID, CustomerID, ProductID).

Is there a performance hit for that column order (in the table, not the index?) Or is there a hidden advantage to re-ordering the key columns to the first three columns of the table: (OtherID,CustomerID,ProductID,A,C)

Seems like it shouldn't be a big problem, but implementations can have hidden performance costs.

(I was looking for the cause of a performance issue we were having, and this was just one of those "It shouldn't be a problem, but maybe it could be a problem..." kind of guesses.)

trenton-ftw · Accepted Answer

I won't assume what type of clustered index we are talking about here, so I will try to cover all the basics. I would have to say that, logically, the impact (performance or otherwise) of the ordinal position of the columns within your table in relation to their ordinal position within the clustered index is inconsequential (unless someone out there has something to prove me wrong).

Rowstore

Keep in mind that your table data and rowstore clustered indexes end up becoming separate logical structures. Per Microsoft regarding the clustered rowstore index architecture:

indexes are organized as B-Trees. Each page in an index B-tree is called an index node. The top node of the B-tree is called the root node. The bottom nodes in the index are called the leaf nodes. Any index levels between the root and the leaf nodes are collectively known as intermediate levels. In a clustered index, the leaf nodes contain the data pages of the underlying table. The root and intermediate level nodes contain index pages holding index rows.

So when we are talking about the physical storage of both the clustered index and the table data, we can think of them as separate structures. Looking at this image from the same link:

All three of these levels have at least one thing in common. They are all storing values (more or less) logically sorted by the value of your clustered index. Regardless of the ordinal position of the columns within your table structure, the leaf pages for your table data will be stored logically ordered by the columns/values within your clustered index. This is also true of your intermediate pages, which represent the storage of your clustered index values.

So all of that to say, the ordinal position of your columns within the clustered index is actually what determines how both the intermediate level and leaf pages are logically ordered, so the ordinal position of those columns within your table statement really has no impact to their storage order because of their inclusion in your clustered index.

Columnstore

Regarding clustered columnstore indexes, I would again say that it has no impact, but for a different (and simpler) reason. The columnstore index breaks up the column values in to separate logical structures, which have no relation to each other by way of their ordinal position. So regardless of the column's ordinal position within the table, when you query a value from a column you are querying the separate physical structure that represents that column's values (ignoring deltastore for simplicity here). Similarly, when you query multiple column's values, you are querying each individual logical structure that represents each column's values separately.

This is why you are not even able to specify a column list when creating a clustered columnstore index. The ordinal position of the columns within the columnstore index itself has no impact, so I'd imagine that the ordinal position of those columns within the table itself (or any relationship between the two) also has no impact.

Heap

Lastly, should anyone else ask, even with tables stored as a heap I would still argue that the ordinal position of columns within the table has no impact to any query performance. Under the hood, heaps are still stored and referenced by a sort of clustered index structure (I believe it would still be described that way).

Per Microsoft:

A rowstore is data that is logically organized as a table with rows and columns, and then physically stored in a row-wise data format. This has been the traditional way to store relational table data such as a heap or clustered B-tree index.

So heaps are still stored in an ordered fashion just like any other table created using a clustered index, but the main difference is that the value they are ordered by is simply non-business use value created in order to identify the row. As described by Microsoft:

If the table is a heap, which means it does not have a clustered index, the row locator is a pointer to the row. The pointer is built from the file identifier (ID), page number, and number of the row on the page. The whole pointer is known as a Row ID (RID).

This RID is not something you would ever normally use as a predicate to a query, which is the main disadvantage (since data is made to be queried, right?). But regardless, the ordinal position of these columns within your table still has no impact to how they are actually logically sorted/stored, so I can't imagine that it could impact your query performance.

Clustered index in SQL Server : any advantage for the columns to be first in schema?

Answers (1)

Rowstore

Columnstore

Heap

Related Questions