Reputation: 81
It appears that SQL Server does not automatically use a CHECKSUM/hash index unless the CHECKSUM column is explicitly included in the query's search arguments. This is a problem because I do not control the applications that query the table, and I must not break their performance.
Is there any way to get SQL Server to use a new CHECKSUM/hash index without modifying queries to include the new CHECKSUM/hash column?
CREATE TABLE big_table
(
id BIGINT IDENTITY CONSTRAINT pk_big_table PRIMARY KEY,
wide_col VARCHAR(50),
wide_col_checksum AS CHECKSUM(wide_col),
other_col INT
)
CREATE INDEX ix_checksum ON big_table (wide_col_checksum)
Insert some test data:
SET NOCOUNT ON
DECLARE @count INT = 0
BEGIN TRANSACTION
WHILE @count < 10000
BEGIN
SET @count = @count + 1
INSERT INTO big_table (wide_col, other_col)
VALUES (SUBSTRING(master.dbo.fn_varbintohexstr(CRYPT_GEN_RANDOM(25)), 3, 50), @count)
IF @count % 1000 = 0
BEGIN
COMMIT TRANSACTION
BEGIN TRANSACTION
END
END
COMMIT TRANSACTION
INSERT INTO big_table (wide_col, other_col)
VALUES ('ABCDEFGHIJKLMNOPQRSTUVWXYZ', 9999999)
Legacy query. Causes Clustered Index Scan (BAD):
SELECT * FROM big_table
WHERE wide_col = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
Updated query. Causes NonClustered Index Seek (good):
SELECT * FROM big_table
WHERE wide_col = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
AND wide_col_checksum = CHECKSUM('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
My table is very large (many hundreds of millions of rows), has several indexes (~ 20), all of which are required. Some of the indexed columns are a little wide (~ 50 bytes) and have few duplicate values. The columns are only searched on equality. The table is inserted into constantly.
Here is a table comparing "normal" indexes and CHECKSUM/hash indexes on the sample table above, both compressed and non-compressed. Data from freshly rebuilt indexes on tables with 1 million rows:
Page compression alone is pretty ineffective on the sample data (real data should compress a bit better). The hash index achieves a 4X index size reduction. Page compression on the hash index achieves a 6X index size reduction.
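The size comparison above can be reproduced with a query like the following sketch, which sums used pages per index from the partition-stats DMV (the table and index names are the ones from the DDL above):

```sql
-- Compare on-disk size of each index on the sample table.
-- used_page_count is in 8 KB pages, so multiply by 8 for KB.
SELECT i.name AS index_name,
       SUM(ps.used_page_count) * 8 AS used_kb
FROM sys.dm_db_partition_stats AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id  = ps.index_id
WHERE ps.object_id = OBJECT_ID('dbo.big_table')
GROUP BY i.name;
```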
My aim with hash indexes is to shrink these wide indexes and thereby reduce their memory and disk footprint.
Upvotes: 8
Views: 2068
Reputation: 5684
Under most collations, the two queries can return different results, because 'A' = 'a' while CHECKSUM('A') is not equal to CHECKSUM('a'). Even under CS_AS or BIN collations, trailing spaces can be a problem: 'abc' and 'abc ' compare equal under ANSI padding rules, yet their checksums can differ. That is why SQL Server cannot use such an index automatically.
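You can probe how string equality and CHECKSUM interact under your own collation with a quick check like this sketch (results depend on the database's default collation, so run it against your own server rather than trusting any expected output here):

```sql
-- 1 = the pair compares/checksums equal, 0 = it does not.
SELECT
    IIF('A' = 'a', 1, 0)                          AS strings_equal,
    IIF(CHECKSUM('A') = CHECKSUM('a'), 1, 0)      AS checksums_equal,
    IIF('abc' = 'abc ', 1, 0)                     AS padded_strings_equal,
    IIF(CHECKSUM('abc') = CHECKSUM('abc '), 1, 0) AS padded_checksums_equal;
```

Any row where the strings compare equal but the checksums do not is a case where a seek on the checksum column would return different results from the original predicate.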
Upvotes: 0
Reputation: 4058
I have a solution for you; it was a hard task!
You can rename your table and then create a view with the original table name, and do the trick inside the view.
The idea is to intercept calls to the table with the view: inside the view, return every record if there is no direct filter on wide_col, or only the matching record, located via the ix_checksum index.
I use sys.dm_exec_requests and sys.dm_exec_sql_text to get the text of the query the user is running; then, with a little parsing, I extract the parameter supplied for the wide_col column and compute its CHECKSUM() (or NULL if no parameter is found).
After that I look up the id of the record with that checksum (if one exists).
With the UNION ALL operator I add every record to the result set when no filter was found in the query.
It's tricky but it works!
WARNING!
I have done only minimal parsing to extract the parameter from the query; check it against your own queries and adjust it if needed.
-- rename the table
exec sp_rename big_table, _big_table;
go
drop view big_table
go
-- create the view with the name of the table
create view big_table
as
with
q as ( -- extract the query text
SELECT SUBSTRING(dest.text, (dem.statement_start_offset+2)/2, CASE WHEN dem.statement_end_offset=-1 THEN 8000 ELSE (dem.statement_end_offset-dem.statement_start_offset+4)/2 END) current_statement
FROM sys.dm_exec_requests dem CROSS APPLY sys.dm_exec_sql_text(dem.sql_handle) dest WHERE session_id = @@SPID
),
f as ( -- do some parsing to get WHERE condition
select
REPLACE(REPLACE(REPLACE(REPLACE(
SUBSTRING(current_statement, nullif(patindex('%WHERE%wide_col%=%''%''%', current_statement), 0)+5, 8000)
, CHAR(9), ' '), CHAR(10), ' '), CHAR(13), ' '), ' ', '') par
from q
where current_statement like '%WHERE%wide_col%=%''%''%'
),
r as ( -- some more parsing to get the wide_col filter
select SUBSTRING(par, 1, charindex('''', par)-1) par
from (
select SUBSTRING(par, patindex('%wide_col=''%''%', par)+LEN('wide_col')+2, 8000) par
from f
where par like '%wide_col=''%''%'
) r
),
p as ( -- calc the checksum of the parameter
select par, iif(par is null, null, CHECKSUM(par)) chk
from r
),
x as ( -- lookup the id of the searched record
select m.id
from _big_table m
where wide_col_checksum = (select chk from p)),
z as ( -- test if a parameter was found (flag for normal operation)
select COUNT(*) n
from p
where chk is not null
)
-- this is the fast output for searched record
select m.*
from _big_table m, x
where (m.id = x.id) --OR (x.id is null)
union all
-- this is the normal output for all other conditions
select m.*
from _big_table m, z
where z.n = 0
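With the view in place, the legacy query from the question runs unchanged; it now targets the view, which parses the literal out of the statement text and seeks ix_checksum on the renamed base table:

```sql
-- Application query text is untouched; the view does the checksum lookup.
SELECT * FROM big_table
WHERE wide_col = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
```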
enjoy
Upvotes: 1
Reputation: 96
SQL Server does not automatically start using the CHECKSUM/hash index. The query needs to reference the hashed column for SQL Server to consider the index, so I do not see how you can achieve your objective without making changes to the queries. It is an interesting question, though; it could make a good feature request for SQL Server.
Upvotes: 1
Reputation: 238116
If your application queries:
SELECT * FROM big_table WHERE wide_col = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
You need an index on wide_col, not on wide_col_checksum.
SQL Server stores indexes as a B-tree. As @MartinSmith suggests, reducing the size of columns in an index does indeed decrease the memory and disk footprint.
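A minimal sketch of that suggestion, using the table from the question: index the searched column directly, optionally with page compression to offset the wider key, so the unmodified legacy query can seek.

```sql
-- Index the column the application actually filters on;
-- PAGE compression helps offset the ~50-byte key width.
CREATE INDEX ix_wide_col
    ON dbo.big_table (wide_col)
    WITH (DATA_COMPRESSION = PAGE);
```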
Upvotes: 1