sll

Reputation: 62544

Performance of varbinary comparison in T-SQL

One of my table's columns is of VARBINARY type, and I need to write a query that evaluates rows against a particular byte pattern. The following approach dramatically decreases performance:

declare @pattern varbinary

-- 19-byte constant
set @pattern = 0x00.....

-- r.payload is of VARBINARY type as well
SELECT .... FROM ...
WHERE substring(r.payload, 0, 19) <> @pattern

Is there any alternative to the given approach?

Upvotes: 10

Views: 7409

Answers (9)

miniscule

Reputation: 730

More input please; otherwise this is a textbook question.

Please provide some sample data and tell us about your application. Sometimes there are application behaviours which suggest broader solutions.

I noticed you categorized this as SQL Server 2008, so you should:

  • turn page compression on
  • disable automatic statistics generation
  • look at your sys.dm_os_wait_stats
  • check for high fragmentation
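
For example, the top waits since the last restart can be inspected with a query like this (a generic starting point, not specific to this problem):

SELECT TOP 10
       wait_type,
       waiting_tasks_count,
       wait_time_ms,
       signal_wait_time_ms
FROM   sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;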

What sort of records are you reading that require the <> comparison?
Is anything being assigned to a second memory stream for processing?
Can you make any changes to prevent data from being written to disk?

How long is it taking to process this table?

Upvotes: 0

Jodrell

Reputation: 35726

Here you go: use a computed column to store the first 19 bytes, PERSISTED so that you can index it.

A demonstration can be found here on SQLFiddle.

When I perform the linked test, the persisted and indexed approach is about 5 times faster. This might increase significantly if the average [Payload] is very large.

CREATE TABLE [dbo].[YourTable]
(
    [Id] INT CONSTRAINT [PK_YourTable] PRIMARY KEY,
    [Payload] VARBINARY(MAX),
    [Prefix] AS CAST([Payload] AS BINARY(19)) PERSISTED
);

CREATE NONCLUSTERED INDEX [IX_YourTable_Prefix] ON [YourTable]([Prefix]);

allowing,

SELECT
            [Id]
    FROM
            [YourTable]
    WHERE
            [Prefix] <> @pattern 

or some such.


It is not entirely clear to me what you are trying to achieve, but the use of HASHBYTES might potentially be useful.
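
For example, if the goal is to find rows whose entire payload differs from a given value (rather than just a prefix), a persisted hash column can be indexed and compared cheaply. A sketch with hypothetical names; note that HASHBYTES input is limited to 8000 bytes before SQL Server 2016:

ALTER TABLE [dbo].[YourTable]
    ADD [PayloadHash] AS HASHBYTES('SHA1', [Payload]) PERSISTED;

CREATE NONCLUSTERED INDEX [IX_YourTable_PayloadHash]
    ON [dbo].[YourTable]([PayloadHash]);

-- compare hashes instead of the payloads themselves
SELECT [Id]
FROM   [dbo].[YourTable]
WHERE  [PayloadHash] <> HASHBYTES('SHA1', @pattern);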

Upvotes: 9

user4622594

Reputation:

I tried the following, and for 500K rows in my test it works very fast!

What's important here: you need a really good clustered index! A clustered index should always be NUSE:

  • Narrow as possible, in terms of the number of bytes it stores
  • Unique – to avoid the need for SQL Server to add a "uniqueifier" to duplicate key values
  • Static – ideally, never updated
  • Ever-increasing – to avoid fragmentation and improve write performance

I created the table like this:

CREATE TABLE [dbo].[YourTable]
(
    [Id] INT IDENTITY(1,1),
    [Payload] VARBINARY(MAX),
    Prefix AS CAST(LEFT([Payload],19) AS BINARY(19)) PERSISTED 
)

CREATE UNIQUE CLUSTERED INDEX idx1 ON dbo.YourTable(Id)
GO
CREATE NONCLUSTERED INDEX idex ON dbo.YourTable (Prefix)
GO

So the Prefix computed column is BINARY(19) and always contains the first 19 bytes. The only operation SQL Server has to do is a clustered index scan.

Upvotes: 2

Solomon Rutzky

Reputation: 48864

More than likely, the question being asked here is misleading in terms of finding the true cause of the query's slowness. There have been some good suggestions here, but far too little testing. I have done fairly thorough testing, posted here for those who want to try it themselves:

http://pastebin.com/0RFUL7W5

I tested with 50k rows as opposed to the OP's stated 5k rows. The tests include tables both with and without the indexed, persisted computed column, and variations on those two tables for the Payload field to be VARBINARY(100) and VARBINARY(MAX). The VARBINARY(MAX) tests first measure just the difference in datatype, as the initial dataset is identical to what is in the VARBINARY(100) tables (the data in this case is still on the data page); the MAX tables are then updated to push the data size to 14k bytes per row (the data in this case is moved off to LOB pages).

These tests need to be run individually so that one test does not skew the results of another test. And each test should be run multiple times.

Running these tests on my laptop shows that, between the variations of VARBINARY(100), VARBINARY(MAX) with <= 100 bytes, and VARBINARY(MAX) with 14k bytes:

  • the methods of SUBSTRING, CONVERT, and "< @Pattern OR > @Pattern + 0xFF" are greatly impacted by the datatype and data size, hence each has potential to be the better choice depending on the actual schema and data
  • the indexed persisted computed column is generally better than the 3 methods mentioned above, but never by more than 110 or so milliseconds.

What does all of this mean? It means:

  • This question cannot be answered without first getting more information from the OP, specifically the full table schema (including all indexes) AND some sample data. The problem is that the timing differences between the assumed current situation and any of the possible fixes are so small (110 milliseconds on 50k rows, while the OP only has 5k rows) that it is doubtful that filtering on the VARBINARY is the true cause of any query slowness. Even the slowest method would return too quickly on 5000 rows to be noticeable.
  • While there have been some good suggestions for improving performance in the stated scenario, without proper testing some suggestions that look good might not show any improvement at all. Assumptions should not be made about how SQL Server will react, nor do they need to be made, since any situation can be tested (though structuring the tests properly can be a bit tricky).

Bonus lesson:

Be very careful when copying and pasting code posted in a question (or even in an answer for that matter). Why not? Because you might be copying a bug. A bug such as using 0 as the starting position for a SUBSTRING function in SQL. This is fine for most languages, but the starting index for strings in SQL is 1. Why does this matter? Because the number of characters below 1 that you specify for the start position is deducted from the length. Meaning, using SUBSTRING([Payload], 0, 19) actually performs SUBSTRING([Payload], 1, 18). Likewise, using SUBSTRING(field, -1, 19) would actually perform SUBSTRING([Payload], 1, 17).

Also, the code shown in the question uses DECLARE @Pattern VARBINARY, which actually creates a VARBINARY(1) variable instead of what I assume was the intended VARBINARY(30). For VARCHAR / NVARCHAR / VARBINARY, the default length is 30 in some cases (such as with CAST / CONVERT) and 1 in other cases (such as table columns and local variables).
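
Both pitfalls are easy to verify:

-- A start position of 0 silently shortens the result:
SELECT SUBSTRING('abcdef', 0, 3);  -- 'ab' (same as SUBSTRING('abcdef', 1, 2))
SELECT SUBSTRING('abcdef', 1, 3);  -- 'abc'

-- DECLARE without a length creates VARBINARY(1), truncating on assignment:
DECLARE @Pattern VARBINARY;
SET @Pattern = 0x1234;
SELECT @Pattern, DATALENGTH(@Pattern);  -- 0x12, 1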

Upvotes: 3

UnhandledExcepSean

Reputation: 12804

If a computed column isn't an option, this is also worth considering.

DECLARE @t TABLE (
    val VARBINARY(MAX)
)

INSERT INTO @t SELECT 0x00000100000000000000000000000000000000000000000000000000
INSERT INTO @t SELECT 0x00001000000000000000000000000000000000000000000000000000
INSERT INTO @t SELECT 0x00010000000000000000000000000000000000000000000000000000
INSERT INTO @t SELECT 0x00100000000000000000000000000000000000000000000000000000
INSERT INTO @t SELECT 0x00000f00000000000000000000000000000000000000000000000000

declare @pattern varbinary(19)
declare @pattern2 varbinary(19)
set @pattern = 0x0001
set @pattern2 = @pattern+0xFF

select @pattern,@pattern2

SELECT
    *
FROM @t
WHERE val<@pattern
OR val>@pattern2

Upvotes: 1

Gordon Rudman

Reputation: 438

I have a few observations/suggestions:

  1. You are probably getting similar benchmark results for the different queries because of SQL Server optimizations. To compare the performance of the queries properly, you will need to avoid those optimizations. Before you do that, please make sure that you only try this in a test environment. I repeat: do not do this in a production environment!
  2. Querying against a prefix column will definitely improve the performance of your queries.
  3. Since the prefix you are querying for is a fixed length (19), you should use the BINARY(19) datatype for that column. Some of the other suggestions here will create a VARBINARY column.
  4. I would personally avoid putting Triggers on tables if possible. See C# example for one possible alternative.
  5. In my tests, querying using equality was faster than SUBSTRING, which in turn was faster than LEFT.

How I got to that conclusion: I loaded about 15000 files into a test database and executed the different types of queries (multiple times), with the following results:

  1. Equality check using a prefix column -> 191 ms
  2. SUBSTRING -> 20448 ms = 107x slower
  3. LEFT -> 34091 ms = 178x slower

Here is a SQL script to create the test table:

CREATE TABLE [dbo].[Files]
(
    [Id]            [int] IDENTITY(1,1) NOT NULL,
    [Path]          [varchar](260)      NOT NULL,
    [Prefix]        [binary](19)        NOT NULL,
    [AllBytes]      [varbinary](MAX)    NOT NULL,
 CONSTRAINT [PK_Files] PRIMARY KEY CLUSTERED ([Id] ASC)
)

And now the queries (Run these in different Tabs):

1.

--1.    Equality check using a prefix  column -> 191 ms
--WARNING!! Do not run this on a production server
--Clear SQL optimizations
CHECKPOINT;
GO
DBCC DROPCLEANBUFFERS;
GO

DECLARE @pattern BINARY(19) --NOTE use BINARY not VARBINARY to match the column type exactly 
SET @pattern = 0x4D5A90000300000004000000FFFF0000B80000 --Start of EXE / DLL files

SELECT  [Path]
FROM    [dbo].[Files]
WHERE   Prefix <> @pattern

2.

--2.    SUBSTRING  -> 20448 ms  = 107x slower
--WARNING!! Do not run this on a production server
--Clear Cache
CHECKPOINT;
GO
DBCC DROPCLEANBUFFERS;
GO


DECLARE @pattern BINARY(19) 
SET @pattern = 0x4D5A90000300000004000000FFFF0000B80000 --Start of EXE / DLL files
SELECT  [Path]
FROM    [dbo].[Files]
WHERE   SUBSTRING(AllBytes, 0, 19) <> @pattern 

3.

-- 3.   LEFT  ->  34091 ms   = 178x slower
--WARNING!! Do not run this on a production server
--Clear Cache
CHECKPOINT;
GO
DBCC DROPCLEANBUFFERS;
GO


DECLARE @pattern VARBINARY(19)
SET @pattern = 0x4D5A90000300000004000000FFFF0000B80000 --Start of EXE / DLL files
SELECT  [Path]
FROM    [dbo].[Files]
WHERE   LEFT(AllBytes, 19) <> @pattern

And now for some code to create the test data:

using System;
using System.Data;
using System.Data.SqlClient;
using System.IO;

namespace SO
{
    class Program
    {
        static void Main(string[] args)
        {
            //!! To keep this EXAMPLE as simple as possible I'm leaving off all error handling and taking other shortcuts I wouldn't normally advise.

            const string sqlConnectionString = @"Data Source=yourServer;Initial Catalog=databaseName;Integrated Security=True";
            const string folderToProcess = @"C:\Program Files";  //choose folder with your test files
            const int prefixLength = 19;

            using (var connection = new SqlConnection(sqlConnectionString))
            {
                connection.Open();

                foreach (var filePath in Directory.GetFiles(folderToProcess, "*.*", SearchOption.AllDirectories))
                {
                    Console.WriteLine(filePath);

                    //I'm only using ReadAllBytes here to keep this example simple. 
                    //This isn't appropriate for large files because it'll load the entire file into memory
                    byte[] allBytes = File.ReadAllBytes(filePath);

                    //To keep things simple I'm assuming we're only working with binary values that are >= prefix length
                    if (allBytes.Length < prefixLength) { continue; }

                    //This can also be accomplished using SQL 
                    byte[] prefix = new byte[prefixLength];
                    Array.Copy(allBytes, prefix, prefixLength);

                    //Rather use a stored procedure here
                    using (var command = new SqlCommand("INSERT INTO dbo.Files (Path, Prefix, AllBytes) Values(@path, @prefix, @allBytes)", connection))
                    {
                        command.Parameters.Add("@path", SqlDbType.NVarChar, 260).Value = filePath;
                        command.Parameters.Add("@prefix", SqlDbType.Binary, prefixLength).Value = prefix;             //DataType is Binary!!
                        command.Parameters.Add("@allBytes", SqlDbType.VarBinary, allBytes.Length).Value = allBytes;   //DataType is VarBinary
                        command.ExecuteNonQuery();
                    }
                }

                connection.Close();
            }

            Console.WriteLine("All Done !");
            Console.ReadLine();
        }
    }
}

Upvotes: 1

Michael B

Reputation: 7587

It sounds like you want to do a LIKE on the column, but you can't, because LIKE only works on varchar or nvarchar columns.

Let's pretend for a moment it's an nvarchar(4000) column, and you want to look for anything that starts with this 19-byte pattern. A LIKE that ends with a % gets optimized into a range query by the plan.

For instance consider this query:

select data FROM #a
where data like N'006%'

The query plan optimizer has turned it into the following condition:

Start: [tempdb].[dbo].[#a].data >= Scalar Operator(N'006'), End: [tempdb].[dbo].[#a].data < Scalar Operator(N'007')

Thus we can replace your search with:

select data 
from my_table
where data >= @searchPattern and data < @searchPattern + 0xFF

If the varbinary column is indexed then this search should be very well optimized.
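
If an exact prefix range is needed, the exclusive upper bound is the pattern with its last byte incremented, mirroring the N'006' to N'007' rewrite above. A sketch, assuming the pattern's last byte is below 0xFF:

DECLARE @searchPattern VARBINARY(19) = 0x4D5A90; -- example 3-byte prefix

-- same bytes, last byte incremented by one
DECLARE @upperBound VARBINARY(19) =
    SUBSTRING(@searchPattern, 1, DATALENGTH(@searchPattern) - 1)
    + CAST(CAST(SUBSTRING(@searchPattern, DATALENGTH(@searchPattern), 1) AS TINYINT) + 1 AS BINARY(1));

SELECT @upperBound; -- 0x4D5A91

SELECT data
FROM   my_table
WHERE  data >= @searchPattern AND data < @upperBound;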

Upvotes: 1

Rory

Reputation: 969

Not strictly an alternative ... more of an enhancement...

Create a columnstore index on your payload field

CREATE COLUMNSTORE INDEX payloadcsindex 
    ON yourdb.yourschema.yourtable ( payload )

You will end up with an in-memory Columnstore Index Scan that could give 10x performance improvement with your existing query.

Upvotes: 1

Ed B

Reputation: 796

Is it possible to create an additional binary column that contains these first 19 bytes?

If so, you could either create it as a persisted computed column, or as a 'real' column, and create an INSERT trigger that populates it when a record is added (and an UPDATE trigger too if the records are updated).

You could then index the new column, and rewrite the query to compare to that column instead and you would lose the expensive substring operation.
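
A minimal sketch of the trigger variant, borrowing the hypothetical table and column names used in the other answers:

ALTER TABLE [dbo].[YourTable] ADD [Prefix] BINARY(19) NULL;
GO
CREATE TRIGGER [trg_YourTable_SetPrefix]
ON [dbo].[YourTable]
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- keep Prefix in sync with the first 19 bytes of Payload
    UPDATE t
    SET    t.[Prefix] = CAST(i.[Payload] AS BINARY(19))
    FROM   [dbo].[YourTable] AS t
    JOIN   inserted AS i ON i.[Id] = t.[Id];
END;
GO
CREATE NONCLUSTERED INDEX [IX_YourTable_Prefix] ON [dbo].[YourTable]([Prefix]);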

Upvotes: 5
