Reputation: 47
I've inherited a MS Sql database hosted in Azure. Looking for performance improvement, I've been reading a lot about indexing and covering index. (Maybe this is the most complete reading that I've found: https://www.red-gate.com/simple-talk/sql/learn-sql-server/using-covering-indexes-to-improve-query-performance/)
But one doubt still remains...
So, for example, for the billing table below (which has about 8 milion lines), I've found that the most used fields used in the queries' where clause are (within joins or not):
PAYMENT_DATE, DUE_DATE, CUSTOMER_ID, DELAY_DAYS, AMOUNT
.
CREATE TABLE [dbo].[BILLING](
[ID] [int] IDENTITY(1,1) NOT NULL,
[CHANGED_DATE] [datetime] NULL,
[INCLUDED_DATE] [datetime] NULL,
[CHANGED_USER_ID] [int] NULL,
[INCLUDED_USER_ID] [int] NULL,
[BILL_CODE] [varchar](255) NOT NULL,
[PAYMENT_DATE] [datetime] NULL,
[DUE_DATE] [datetime] NOT NULL,
[AMOUNT] [float] NOT NULL,
[AMOUNT_PAYED] [float] NULL,
[CUSTOMER_ID] [int] NOT NULL,
[OUR_NUMBER] [varchar](200) NULL,
[TYPE] [varchar](250) NULL,
[BANK_ID] [int] NULL,
[ISSUE_DATE] [datetime] NULL,
[STATE] [varchar](20) NULL,
[DUNNING_STATE_ID] [int] NULL,
[OPEN_VALUE] [float] NULL,
[ACCREDIT_VALUE] [float] NULL,
[LOWER_VALUE] [float] NULL,
[DISCCOUNT_VALUE] [float] NULL,
[INTEREST_VALUE] [float] NULL,
[FINE_VALUE] [float] NULL,
[RECEIVED_AMOUNT] [float] NULL,
[DELAY_DAYS] [int] NULL,
[BRANCH_ID] [int] NULL,
[FIELD1] [varchar](250) NULL,
[FIELD2] [varchar](250) NULL,
[FIELD3] [varchar](250) NULL,
[FIELD4] [varchar](250) NULL,
[FIELD5] [varchar](250) NULL,
[OBS1] [varchar](250) NULL,
[OBS2] [varchar](250) NULL,
[OBS3] [varchar](250) NULL,
[INTEREST_RATE] [float] NULL,
[INTEREST_CALC] [float] NULL,
[AGREEMENT_STATE] [varchar](20) NULL,
[AGREEMENT_ID] [int] NULL,
PRIMARY KEY CLUSTERED
(
[ID] ASC
)
In addition, the target queries for optimizing make calculations on select clause over:
AMOUNT, DELAY_DAYS, COUNT(ID)
.
For example:
SELECT
T.CUSTOMER_ID AS CUSTOMER_ID
, COUNT(T.ID) AS NUM_BILLS
, COUNT(
CASE
WHEN T.DELAY_DAYS <= 0 THEN 1
ELSE NULL
END
) AS DEPOSITS
, COUNT(
CASE
WHEN T.DELAY_DAYS > 0 THEN 1
ELSE NULL
END
) AS DEFAULTED
, COUNT(
CASE
WHEN T.DELAY_DAYS BETWEEN 30 AND 60 THEN 1
ELSE NULL
END
) AS DEFAULTED_30
, COUNT(
CASE
WHEN T.DELAY_DAYS BETWEEN 60 AND 90 THEN 1
ELSE NULL
END
) AS DEFAULTED_60
, COUNT(
CASE
WHEN T.DELAY_DAYS > 90 THEN 1
ELSE NULL
END
) AS DEFAULTED_90
, MAX(T.DELAY_DAYS) AS MAX_DEFAULTED_TIME
, SUM(
CASE
WHEN T.DELAY_DAYS > 0 THEN T.DELAY_DAYS
ELSE 0
END
) AS SUM_DEFAULTED_TIME
, SUM(T.AMOUNT) AS AMOUNT
, SUM(
CASE
WHEN T.DELAY_DAYS > 0 THEN T.AMOUNT
ELSE 0
END
) AS DEFAULTED_AMOUNT
FROM BILLING T
WHERE
T.DUE_DATE < GETDATE()
AND T.AMOUNT > 0
GROUP BY
T.CUSTOMER_ID
Thus, it looked obvious to me that the following index would solve all of my problems:
CREATE NONCLUSTERED INDEX [ix_Titulo_main_fields] ON [dbo].[BILLING]
(
[PAYMENT_DATE] ASC,
[DUE_DATE] DESC,
[AMOUNT] ASC,
[CUSTOMER_ID] ASC,
[STATE] ASC,
[DELAY_DAYS] ASC,
[BRANCH_ID] ASC,
[AGREEMENT_ID] ASC
)
INCLUDE ( [BILLING_CODE],
[AGREEMENT_STATE],
)
GO;
In contrast, when I ask for the query plan on Management Studio, the SQL Server doesn't use this index and suggests me creating a new one:
CREATE NONCLUSTERED INDEX [ix_billing_due_date_amount] ON [dbo].[billing]
(
[due_date] ASC,
[amount] ASC
)
INCLUDE ( [customer_id],
[delay_days])
GO
So, the doubts are:
does the covering index needs to be exactly what is the WHERE clause searching for?
If it's true, how can a covering index satisfy more than one query?
Otherwise, why don't the former index satisfy the query?
I really don't know where I've missed something...
Thanks in advance!
Upvotes: 2
Views: 328
Reputation: 611
For any specific query you can certainly create a specialized index - the optimizer can tell you this. And the particular query will be super-boosted, and other similar queries will get more or less faster. However as a rule of thumb i do not use indexes specifically for a query nor i prefer multiple column indexes and i do not use include. There can be rare exceptions, but usually i don't. Why? The optimizer will give you the hint about what index you need within the query scope - after 8-10 or so indexes you add to the same table and the optimizer will no longer recognize which index to use, not to mention insert/update delays (although proper indexing saves time even on insert/update on locking times).
For your case, i should use 8 single indexes, one for each column, unless a column is already primary key or part of PK. If a column is unique check if you can create a unique index instead of a simple one. This helps greatly.
In overall, having 4-8 single column indexes on a table is the best workaround for all sql that will eventually executed against the table. This is valid as long as those columns are selected with usage investigation as you describe.
This is because what really matters the most is the first filtering. Filtering out 10000 rows out of 8 million in 3 sec is a success - now how the 10000 are filtered to 10 final is not really important. Maybe there is an index there as well maybe not but where do you want the table scan, at 8 million or at 10000 filtered?
From my experience a nice set of single column indexes helps 99% of queries to respond fast because they get an indexed column to search for start.
Sometimes queries pick up the wrong index - some generic filter that filters out only 5% while they ignore the filter that filters out 95%. This can be poor statistics or cardinality estimator that leads to poor execution plan. You can overcome this with query hint for the index of you are sure to use it at all times or enforce 2012 cardinality estimator.
Upvotes: 1
Reputation: 5654
Order matters. Since your proposed index begins with [payment_date] but the query predicate does not include [payment_date], the index is not likely to be more advantageous than a table scan.
It is possible to have a single index cover multiple queries. The first listed field of the index will almost always need be in the predicate of all the queries. For improved results, also apply this logic to the 2nd field, 3rd field, etc.
When there is more than one choice for a position, one choice may perform better than another.
Side note: Oracle has a feature called "index skip scan" that allows an index to be used even if the leading column is not in the predicate. It is effective when the leading column has few distinct values (from learningintheopen.org).
Upvotes: 3