Reputation: 62544
One of my table's columns is of VARBINARY type, and I need a query that evaluates rows against a particular byte pattern. The following approach dramatically decreases performance:
declare @pattern varbinary
-- 19-byte constant
set @pattern = 0x00.....
-- r.payload is of VARBINARY type as well
SELECT .... FROM ...
WHERE substring(r.payload, 0, 19) <> @pattern
Is there any alternative to the given approach?
Upvotes: 10
Views: 7409
Reputation: 730
More input please, otherwise this is a textbook question.
Please provide us with some sample data and let us know about your application. Sometimes there are application behaviours which have broader solutions.
I noticed you categorized this as SQL Server 2008, so you should:
turn page compression on, disable automatic statistics generation, look at your sys.dm_os_wait_stats, and check for high fragmentation.
What sort of records are you reading that require the <> comparison?
Is anything being assigned to a second memory stream for processing?
Can you make any changes to prevent data from being written to disk?
How long is it taking to process this table?
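As a sketch of those first two checks (the table name here is hypothetical):

```sql
-- rebuild the table with page compression turned on (hypothetical table name)
ALTER TABLE dbo.YourTable REBUILD WITH (DATA_COMPRESSION = PAGE);

-- look at the top accumulated waits on the instance
SELECT TOP (10) wait_type, wait_time_ms, waiting_tasks_count
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;
```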
Upvotes: 0
Reputation: 35726
Here you go: use a computed column to store the first 19 bytes, PERSISTED so you can index it.
A demonstration can be found here on SQLFiddle.
When I perform the linked test, the persisted and indexed approach is about 5 times faster. The advantage might increase significantly if the average [Payload] is very large.
CREATE TABLE [dbo].[YourTable]
(
[Id] INT CONSTRAINT [PK_YourTable] PRIMARY KEY,
[Payload] VARBINARY(MAX),
[Prefix] AS CAST([Payload] AS BINARY(19)) PERSISTED
);
CREATE NONCLUSTERED INDEX [IX_YourTable_Prefix] ON [YourTable]([Prefix]);
allowing,
SELECT
[Id]
FROM
[YourTable]
WHERE
[Prefix] <> @pattern
or some such.
It is not entirely clear to me what you are trying to achieve, but potentially the use of HASHBYTES might be useful.
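As a sketch of the HASHBYTES idea (the names below are assumptions; note that before SQL Server 2016 HASHBYTES input is capped at 8000 bytes, so this only hashes the first 8000 bytes of the payload):

```sql
-- persisted hash of the payload, usable for fast equality/inequality checks
ALTER TABLE [dbo].[YourTable]
    ADD [PayloadHash] AS HASHBYTES('SHA1', CAST([Payload] AS VARBINARY(8000))) PERSISTED;

CREATE NONCLUSTERED INDEX [IX_YourTable_PayloadHash]
    ON [YourTable]([PayloadHash]);
```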
Upvotes: 9
Reputation:
I tried the following, and for 500K rows in my test it works very fast!
What's important here: you need a really good clustered index! A clustered index should always be NUSE: Narrow, Unique, Static, Ever-increasing.
I created the table like this:
CREATE TABLE [dbo].[YourTable]
(
[Id] INT IDENTITY(1,1),
[Payload] VARBINARY(MAX),
Prefix AS CAST(LEFT([Payload],19) AS BINARY(19)) PERSISTED
)
CREATE UNIQUE CLUSTERED INDEX idx1 ON dbo.YourTable(Id)
GO
CREATE NONCLUSTERED INDEX idex ON dbo.YourTable (Prefix)
GO
So the Prefix computed column is BINARY(19) and always contains the first 19 bytes. The only operation SQL Server has to do is a Clustered Index Scan.
Upvotes: 2
Reputation: 48864
More than likely, the question being asked here is misleading in terms of finding the true cause of the query slowdown. There have been some good suggestions here, but far too little testing. I have done fairly thorough testing, posted here for those who want to try it themselves:
I tested with 50k rows as opposed to the OP's stated 5k rows. The tests include tables both with and without the indexed, persisted computed column, and variations on those two tables for the Payload field to be VARBINARY(100) and VARBINARY(MAX). The VARBINARY(MAX) tests first measure just the difference in datatype, as the initial dataset is identical to what is in the VARBINARY(100) tables (the data in this case is still on the data pages); the MAX tables are then updated to push the data size to 14k bytes per row (the data in this case is moved off to LOB pages).
These tests need to be run individually so that one test does not skew the results of another test. And each test should be run multiple times.
Running these tests on my laptop shows that, between the variations of VARBINARY(100), VARBINARY(MAX) with <= 100 bytes, and VARBINARY(MAX) with 14k bytes:
What does all of this mean? It means that the size of the data, and whether the VARBINARY(MAX) values are stored off-row on LOB pages, is the true cause of any query slowness, not the comparison method. Even the slowest method would return too quickly on 5000 rows to even be noticeable.
Bonus lesson:
Be very careful when copying and pasting code posted in a question (or even in an answer, for that matter). Why? Because you might be copying a bug, such as using 0 as the starting position for the SUBSTRING function in SQL. This is fine in most languages, but the starting index for strings in SQL is 1. Why does this matter? Because the number of positions below 1 that you specify for the start position is deducted from the length. Meaning, using SUBSTRING([Payload], 0, 19) actually performs SUBSTRING([Payload], 1, 18). Likewise, using SUBSTRING(field, -1, 19) would actually perform SUBSTRING([Payload], 1, 17).
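This off-by-one behaviour is easy to verify with an arbitrary 19-byte constant:

```sql
DECLARE @Payload VARBINARY(30);
SET @Payload = 0x0102030405060708090A0B0C0D0E0F10111213; -- 19 bytes
SELECT DATALENGTH(SUBSTRING(@Payload, 0, 19)) AS [StartAt0], -- 18: one byte short
       DATALENGTH(SUBSTRING(@Payload, 1, 19)) AS [StartAt1]; -- 19: the full prefix
```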
Also, the code shown in the question uses DECLARE @Pattern VARBINARY, which actually creates a VARBINARY(1) variable instead of what I assume was the intended VARBINARY(30). For VARCHAR / NVARCHAR / VARBINARY, when no length is specified, the default is 30 in some cases (such as in CAST and CONVERT) and 1 in other cases (such as variable declarations and column definitions).
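The difference in defaults can be seen directly:

```sql
DECLARE @v VARBINARY;          -- no length specified: defaults to VARBINARY(1)
SET @v = 0x0102030405;
SELECT DATALENGTH(@v);         -- returns 1: the 5-byte value was silently truncated

-- CAST/CONVERT without a length defaults to 30 instead
SELECT DATALENGTH(CAST('0123456789012345678901234567890123456789' AS VARBINARY)); -- returns 30
```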
Upvotes: 3
Reputation: 12804
If a computed column isn't an option, a range comparison is another possibility.
DECLARE @t TABLE (
val VARBINARY(MAX)
)
INSERT INTO @t SELECT 0x00000100000000000000000000000000000000000000000000000000
INSERT INTO @t SELECT 0x00001000000000000000000000000000000000000000000000000000
INSERT INTO @t SELECT 0x00010000000000000000000000000000000000000000000000000000
INSERT INTO @t SELECT 0x00100000000000000000000000000000000000000000000000000000
INSERT INTO @t SELECT 0x00000f00000000000000000000000000000000000000000000000000
declare @pattern varbinary(19)
declare @pattern2 varbinary(19)
set @pattern = 0x0001
set @pattern2 = @pattern+0xFF
select @pattern,@pattern2
SELECT
*
FROM @t
WHERE val<@pattern
OR val>@pattern2
Upvotes: 1
Reputation: 438
I have a few observations/suggestions.
I came to my conclusions by loading about 15,000 files into a test database and executing the different types of queries (multiple times), with the following results:
Here is a SQL script to create the test table:
CREATE TABLE [dbo].[Files]
(
[Id] [int] IDENTITY(1,1) NOT NULL,
[Path] [varchar](260) NOT NULL,
[Prefix] [binary](19) NOT NULL,
[AllBytes] [varbinary](MAX) NOT NULL,
CONSTRAINT [PK_Files] PRIMARY KEY CLUSTERED ([Id] ASC)
)
And now the queries (Run these in different Tabs):
1.
--1. Equality check using a prefix column -> 191 ms
--WARNING!! Do not run this on a production server
--Clear SQL optimizations
CHECKPOINT;
GO
DBCC DROPCLEANBUFFERS;
GO
DECLARE @pattern BINARY(19) --NOTE use BINARY not VARBINARY to match the column type exactly
SET @pattern = 0x4D5A90000300000004000000FFFF0000B80000 --Start of EXE / DLL files
SELECT [Path]
FROM [dbo].[Files]
WHERE Prefix <> @pattern
2.
--2. SUBSTRING -> 20448 ms = 107x slower
--WARNING!! Do not run this on a production server
--Clear Cache
CHECKPOINT;
GO
DBCC DROPCLEANBUFFERS;
GO
DECLARE @pattern BINARY(19)
SET @pattern = 0x4D5A90000300000004000000FFFF0000B80000 --Start of EXE / DLL files
SELECT [Path]
FROM [dbo].[Files]
WHERE SUBSTRING(AllBytes, 0, 19) <> @pattern
3.
-- 3. LEFT -> 34091 ms = 178x slower
--WARNING!! Do not run this on a production server
--Clear Cache
CHECKPOINT;
GO
DBCC DROPCLEANBUFFERS;
GO
DECLARE @pattern VARBINARY(19)
SET @pattern = 0x4D5A90000300000004000000FFFF0000B80000 --Start of EXE / DLL files
SELECT [Path]
FROM [dbo].[Files]
WHERE LEFT(AllBytes, 19) <> @pattern
And now for some code to create the test data:
using System;
using System.Data;
using System.Data.SqlClient;
using System.IO;
namespace SO
{
class Program
{
static void Main(string[] args)
{
//!! To keep this EXAMPLE as simple as possible I'm leaving off all error handling and taking other shortcuts I wouldn't normally advise.
const string sqlConnectionString = @"Data Source=yourServer;Initial Catalog=databaseName;Integrated Security=True";
const string folderToProcess = @"C:\Program Files"; //choose folder with your test files
const int prefixLength = 19;
using (var connection = new SqlConnection(sqlConnectionString))
{
connection.Open();
foreach (var filePath in Directory.GetFiles(folderToProcess, "*.*", SearchOption.AllDirectories))
{
Console.WriteLine(filePath);
//I'm only using ReadAllBytes here to keep this example simple.
//This isn't appropriate for large files because it'll load the entire file into memory
byte[] allBytes = File.ReadAllBytes(filePath);
//To keep things simple I'm assuming we're only working with binary values that are >= prefix length
if (allBytes.Length < prefixLength) { continue; }
//This can also be accomplished using SQL
byte[] prefix = new byte[prefixLength];
Array.Copy(allBytes, prefix, prefixLength);
//Rather use a stored procedure here
using (var command = new SqlCommand("INSERT INTO dbo.Files (Path, Prefix, AllBytes) Values(@path, @prefix, @allBytes)", connection))
{
command.Parameters.Add("@path", SqlDbType.NVarChar, 260).Value = filePath;
command.Parameters.Add("@prefix", SqlDbType.Binary, prefixLength).Value = prefix; //DataType is Binary!!
command.Parameters.Add("@allBytes", SqlDbType.VarBinary, allBytes.Length).Value = allBytes; //DataType is VarBinary
command.ExecuteNonQuery();
}
}
connection.Close();
}
Console.WriteLine("All Done !");
Console.ReadLine();
}
}
}
Upvotes: 1
Reputation: 7587
It sounds like you want to do a LIKE on the column. But you can't, because LIKE only works on varchar or nvarchar columns.
Let's pretend for a moment it's an nvarchar(4000) column, and you want to look for anything that starts with this 19-byte pattern. A LIKE pattern that ends with a % gets optimized into a range predicate in the plan.
For instance consider this query:
select data FROM #a
where data like N'006%'
The query plan optimizer has turned it into the following condition:
Start: [tempdb].[dbo].[#a].data >= Scalar Operator(N'006'), End: [tempdb].[dbo].[#a].data < Scalar Operator(N'007')
Thus we can replace your search with a range predicate. The exclusive upper bound is the pattern with its last byte incremented (analogous to '006' becoming '007'):
declare @upperBound varbinary(19)
-- note: this simple increment fails if the last byte of the pattern is 0xFF
set @upperBound = substring(@searchPattern, 1, 18)
    + convert(binary(1), convert(tinyint, substring(@searchPattern, 19, 1)) + 1)

select data
from my_table
where data >= @searchPattern and data < @upperBound
If the varbinary column is indexed, this search should be very well optimized.
Upvotes: 1
Reputation: 969
Not strictly an alternative ... more of an enhancement...
Create a columnstore index on your payload field
CREATE COLUMNSTORE INDEX payloadcsindex
ON yourdb.yourschema.yourtable ( payload )
You will end up with an in-memory Columnstore Index Scan that could give a 10x performance improvement over your existing query.
Upvotes: 1
Reputation: 796
Is it possible to create an additional binary column that contains these first 19 bytes?
If so, you could either create it as a persisted computed column, or as a 'real' column, and create an INSERT
trigger that populates it when a record is added (and an UPDATE
trigger too if the records are updated).
You could then index the new column, and rewrite the query to compare to that column instead and you would lose the expensive substring operation.
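A minimal sketch of the trigger variant, using hypothetical table/column names:

```sql
-- 'real' prefix column kept in sync by a trigger
ALTER TABLE dbo.YourTable ADD Prefix BINARY(19) NULL;
GO
CREATE TRIGGER trg_YourTable_SetPrefix
ON dbo.YourTable
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE t
    SET t.Prefix = CAST(i.Payload AS BINARY(19))
    FROM dbo.YourTable AS t
    JOIN inserted AS i ON i.Id = t.Id;
END;
GO
CREATE NONCLUSTERED INDEX IX_YourTable_Prefix ON dbo.YourTable(Prefix);
```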
Upvotes: 5