Reputation: 85046
I've heard that you should put columns that will be the most selective at the beginning of the index declaration. Example:
CREATE NONCLUSTERED INDEX MyINDX on Table1
(
MostSelective,
SecondMost,
Least
)
First off, is what I'm saying correct? If so, am i likely to see large differences in performance by rearranging the order of the columns in my index or is it more of a "nice to do" practice?
The reason I'm asking is because after putting a query through the DTA it recommended that I create an index that had almost all of the same columns in it as an existing index, just in a different order. I was considering just adding the missing columns to the existing index and calling it good. Thoughts?
Upvotes: 271
Views: 162117
Reputation: 382662
Let's first understand which queries composite indices speed up or not. Then we can see why the index column order matters for some of those queries.
Composite indices only fully speed up queries with WHERE
equalities on all but last queried column
And in the last queried column an inequality range is allowed.
For example given an index on columns (a, b, c)
:
CREATE INDEX myindex ON mytable (a, b ,c);
it optimally speeds up the following queries:
WHERE a = 1 AND b = 2 AND c > 10 AND c < 20
: a
and b
come before c
on the index order, and both are equalities, so inequalities in c
is fineWHERE a = 1 AND b = 2 AND c > 10
: same as aboveWHERE a = 1 AND b = 2 AND c = 10
: same as above, "equivalent to" c > 9 AND c < 11
WHERE a = 1 AND b > 10 AND b < 20
: it's OK to omit trailing columnsWHERE a > 10 AND a < 20
: same as aboveWHERE a = 10
: same as abovebut does not optimally speed up the following queries:
WHERE a = 1 AND b > 10 AND b < 20 AND c = 2
: inequality on b
(not equality), but c
also specified. b
comes before c
on the index, so this doesn't work wellWHERE a = 1 AND b > 10 AND b < 20 AND c > 10 AND c < 20
: same as above. Could be sped up with a spatial index however: What is a SPATIAL INDEX and when should I use it?WHERE a = 1 AND c = 3
: c
specified but missing equality on b
WHERE b = 1
: same as above, a
comes before b
in index but not given in query as equalityOf the above, ORDER BY
is only sped up after the last equality column
Optimally sped up:
WHERE a = 1 AND b = 1 ORDER BY c ASC
. a
and b
come before c
on the index column order and both are equalities so ORDER BY c ASC
is sped upWHERE a = 1 ORDER BY b ASC, c ASC
: same as above, and multiple ORDER BY
fine because in index order with b
then c
WHERE a = 1 ORDER BY b ASC
: same as above and it is fine to omit trailing columns from the ORDER BY
ORDER BY a ASC
: same as above, for first column works without any preceding equalitiesNot optimally sped up:
WHERE a = 1 ORDER BY c ASC
: no equality on b
, so ORDER BY c
is not sped upWHERE a = 1 ORDER BY c ASC, b ASC
: uses ORDER BY
with first c
then b
, which is unlike the index column orderWHERE a = 1 ORDER BY b ASC, c DESC
: mixed ASC
+ DESC
requires adding ASC
+ DESC
to the INDEX
itself
: SQL Server indexes - ascending or descending, what difference does it make?Conclusion: order matters if you have inequalities, want to omit columns, or are using ORDER BY
If you are just doing equalities on all columns of the index, ordering doesn't matter:
WHERE a = 1 AND b = 2 AND c = 3
is sped up by all of (a, b, c)
, (a, c, b)
, (c, b, a)
, etc.
But from the above, we can spot the following cases where ordering can matter. Considering a two column index on columns a
and b
:
inequalities:
WHERE a = 1 AND b > 10 AND b < 20
is sped up by (a, b) but not (b, a)
omitting columns:
WHERE a = 1
is sped up by (a, b) but not (b, a)
ORDER BY
:
ORDER BY a ASC, b ASC
is sped up by (a, b) but not (b, a)
Why it works like that: a query is speed up if it only does at most one bisection
Indices are normally implemented as B-trees, which are just like binary search trees but with more child nodes per node:
So what an index does it to order rows lexicographically by column order, e.g. an (a, b, c)
index might look like:
(1, 2, 3)
(1, 2, 4)
(2, 1, 5)
(2, 3, 2)
A query is optimally sped up by an index when you can:
If you have to do a bunch of bisections, it means that there is a more efficient index for your query.
For inequalities on multiple columns you need spatial indices
If you want to speed up queries such as:
WHERE a > 10 AND a < 20 AND b > 30 and b < 40
then that is exactly what spatial indices are for! See e.g. my other answer at: What is a SPATIAL INDEX and when should I use it? They are also apparently supported on Microsoft SQL Server: https://learn.microsoft.com/en-us/sql/relational-databases/spatial/spatial-indexes-overview?view=sql-server-ver16
Related: How do composite indexes work?
Upvotes: 5
Reputation: 142238
Selectivity is a very minor factor; "Leftmost" is critial
Selectivity of the individual columns in a composite index does not matter when picking the order.
Here is the simple thought process: Effectively, an index is the concatenation of the columns involved.
Giving that rationale, the only difference is comparing two 'strings' that differ earlier versus later in the string. This is a tiny part of the total cost. There is no "first pass / second pass", as mentioned in one Answer.
So, what order should be used?
=
, in any order.For example, the very-low selectivity column must come first in this:
WHERE deleted = 0 AND the_datetime > NOW() - INTERVAL 7 DAY
INDEX(deleted, the_datetime)
Swapping the order in the index would have it totally ignore deleted
.
(There are a lot more rules for ordering the columns.)
Upvotes: 16
Reputation: 453037
As Remus says it depends on your workload.
I want to address a misleading aspect of the accepted answer though.
For queries that are performing an equality search on all columns in the index there is no significant difference.
The below creates two tables and populates them with identical data. The only difference is that one has the keys ordered from most to least selective and the other the reverse.
CREATE TABLE Table1(MostSelective char(800), SecondMost TINYINT, Least CHAR(1), Filler CHAR(4000) null);
CREATE TABLE Table2(MostSelective char(800), SecondMost TINYINT, Least CHAR(1), Filler CHAR(4000) null);
CREATE NONCLUSTERED INDEX MyINDX on Table1(MostSelective,SecondMost,Least);
CREATE NONCLUSTERED INDEX MyINDX2 on Table2(Least,SecondMost,MostSelective);
INSERT INTO Table1 (MostSelective, SecondMost, Least)
output inserted.* into Table2
SELECT TOP 26 REPLICATE(CHAR(number + 65),800), number/5, '~'
FROM master..spt_values
WHERE type = 'P' AND number >= 0
ORDER BY number;
Now doing a query against both of the tables...
SELECT *
FROM Table1
WHERE MostSelective = REPLICATE('P', 800)
AND SecondMost = 3
AND Least = '~';
SELECT *
FROM Table2
WHERE MostSelective = REPLICATE('P', 800)
AND SecondMost = 3
AND Least = '~';
... Both of them use an index fine and both are given the exact same cost.
The ASCII art in the accepted answer is not in fact how indexes are structured. The index pages for Table1 are represented below (click the image to open in full size).
The index pages contain rows containing the whole key (in this case there is actually an additional key column appended for the row identifier as the index was not declared as unique but that can be disregarded further information about this can be found here).
For the query above SQL Server doesn't care about the selectivity of the columns. It does a binary search of the root page and discovers that the Key (PPP...,3,~ )
is >=(JJJ...,1,~ )
and < (SSS...,3,~ )
so it should read page 1:118
. It then does a binary search of the key entries on that page and locates the leaf page to travel down to.
Altering the index in order of selectivity doesn't affect either the expected number of key comparisons from the binary search or the number of pages that need to be navigated to do an index seek. At best it might marginally speed up the key comparison itself.
Sometimes ordering the most selective index first will make sense for other queries in your workload though.
E.g if the workload contains queries of both the following forms.
SELECT * ... WHERE MostSelective = 'P'
SELECT * ...WHERE Least = '~'
The indexes above aren't covering for either of them. MostSelective
is selective enough to make a plan with a seek and lookups worthwhile but the query against Least
isn't.
However this scenario (non covering index seek on subset of leading column(s) of a composite index) is only one possible class of query that can be helped by an index. If you never actually search by MostSelective
on its own or a combination of MostSelective, SecondMost
and always search by a combination of all three columns then this theoretical advantage is useless to you.
Conversely queries such as
SELECT MostSelective,
SecondMost,
Least
FROM Table2
WHERE Least = '~'
ORDER BY SecondMost,
MostSelective
Would be helped by having the reverse order of the commonly prescribed one - as it covers the query, can support a seek and returns rows in the desired order to boot.
So this is an often repeated piece of advice but at most it's a heuristic about the potential benefit to other queries - and it is no substitute for actually looking at your workload.
Upvotes: 83
Reputation: 294227
The order of columns is critical. Now which order is correct it depends on how you are going to query it. An index can be used to do an exact seek or an range scan. An exact seek is when values for all columns in the index are specified and the query lands exactly on the row is interested in. For seeks the order of columns is irrelevant. A range scan is when only some columns are specified, and in this case when the order becomes important. SQL Server can use an index for a range scan only if the leftmost column is specified, and then only if the next leftmost column is specified, and so on. If you have an index on (A,B,C) it can be used to range scan for A=@a
, for A=@a AND B=@b
but not for B=@b
, for C=@c
norB=@b AND C=@c
. The case A=@a AND C=@c
is mixed one, as in the A=@a
portion will use the index, but the C=@c
not (the query will scan all B values for A=@a
, will not 'skip' to C=@c
). Other database systems have the so called 'skip scan' operator that can take some advantage of inner columns in an index when the outer columns are not specified.
With that knowledge in hand you can look at the index definitions again. An index on (MostSelective, SecondMost, Least)
will be effective only when MostSelective
column is specified. But that being the most selective, the relevance of the inner columns will quickly degrade. Very often you'll find that a better index is on (MostSelective) include (SecondMost, Least)
or on (MostSelective, SecondMost) include (Least)
. Because the inner columns are less relevant, placing low selectivity columns in such right positions in the index makes them nothing but noise for a seek, so it makes sense to move them out of the intermediate pages and keep them only on the leaf pages, for query coverability purposes. In other words, move them to INCLUDE. This becomes more important as the size of Least
column increases. The idea is that this index can only benefit queries that specify MostSelective
either as an exact value or a range, and that column being the most selective it already restricts the candidate rows to great extent.
On the other hand an index on (Least, SecondMost, MostSelective)
may seem a mistake, but it actually quite a powerful index. Because it has the Least
column as its outermost query, it can be used for queries that have to aggregate results on low selectivity columns. Such queries are prevalent in OLAP and analysis data warehouses, and this is exactly where such indexes have a very good case going for them. Such indexes actually make excellent clustered indexes, exactly because they organize the physical layout on large chunks of related rows (same Least
value, which usually indicate some sort of category or type) and they facilitate analysis queries.
So, unfortunately, there is no 'correct' order. You shouldn't follow any cookie cutter recipe but instead analyze the query pattern you are going to use against those tables and decide which index column order is right.
Upvotes: 202
Reputation: 630379
Look at an index like this:
Cols
1 2 3
-------------
| | 1 | |
| A |---| |
| | 2 | |
|---|---| |
| | | |
| | 1 | 9 |
| B | | |
| |---| |
| | 2 | |
| |---| |
| | 3 | |
|---|---| |
See how restricting on A first, as your first column eliminates more results than restricting on your second column first? It's easier if you picture how the index must be traversed across, column 1, then column 2, etc...you see that lopping off most of the results in the fist pass makes the 2nd step that much faster.
Another case, if you queried on column 3, the optimizer wouldn't even use the index, because it's not helpful at all in narrowing down the result sets. Anytime you're in a query, narrowing down the number of results to deal with before the next step means better performance.
Since the index is also stored this way, there's no backtracking across the index to find the first column when you're querying on it.
In short: No, it's not for show, there are real performance benefits.
Upvotes: 269
Reputation: 332541
you should put columns that will be the most selective at the beginning of the index declaration.
Correct. Indexes can be composites - composed of multiple columns - and the order is important because of the leftmost principle. Reason is, that the database checks the list from left to right, and has to find a corresponding column reference matching the order defined. For example, having an index on an address table with columns:
Any query using the address
column can utilize the index, but if the query only has either city
and/or state
references - the index can not be used. This is because the leftmost column isn't referenced. Query performance should tell you which is optimal - individual indexes, or multiple composites with different orders. Good read: The Tipping Point, by Kimberley Tripp
Upvotes: 36