Reputation: 699
I'm currently wondering about some performance differences between VARCHAR and NVARCHAR, especially when using complex LIKE queries (patterns that start with _ or %).
I have a test setup on Microsoft SQL Server 2014 with 2 tables. Both have an Id field (IDENTITY(1, 1)) and a Value field (either VARCHAR(450) or NVARCHAR(450)). Both contain the identical 1'000'000 randomly generated entries.
The tables are named tblVarCharNoIndex and tblNVarCharNoIndex (so there are no indexes; the behaviour is nearly the same if I use indexes).
Now I execute the following queries to test the duration (once on VARCHAR, once on NVARCHAR):
SELECT * FROM tblVarcharNoIndex WHERE Value LIKE '%ab%'
SELECT * FROM tblNVarcharNoIndex WHERE Value LIKE '%ab%'
The execution times are wildly different: 1540 ms on the VARCHAR table versus 8630 ms on the NVARCHAR table, so it takes over 5x longer with NVARCHAR.
I understand that NVARCHAR has performance implications, since it needs 2 bytes per character; that much makes sense. But I can't explain a performance degradation of over 500%.
As requested, here is some more data.
Query for table creation
CREATE TABLE [dbo].[tblVarcharNoIndex](
[Id] [int] IDENTITY(1,1) NOT NULL,
[Value] [varchar](450) NOT NULL,
CONSTRAINT [PK_tblVarcharNoIndex] PRIMARY KEY CLUSTERED
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
CREATE TABLE [dbo].[tblNVarcharNoIndex](
[Id] [int] IDENTITY(1,1) NOT NULL,
[Value] [nvarchar](450) NOT NULL,
CONSTRAINT [PK_tblNVarcharNoIndex] PRIMARY KEY CLUSTERED
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
Query for value generation
DECLARE @cnt INT = 0;
DECLARE @entries INT = 1000000 --1'000'000;
DECLARE @maxLength INT = 450;
DECLARE @minLength INT = 50;
DECLARE @value VARCHAR(450)
DECLARE @length INT
WHILE @cnt < @entries
BEGIN
SELECT @value = ''
SET @length = @minLength + CAST(RAND() * (@maxLength - @minLength) as INT)
WHILE @length <> 0
BEGIN
SELECT @value = @value + CHAR(CAST(RAND() * 96 + 32 as INT))
SET @length = @length - 1
END
INSERT INTO tblBase(Value, NValue) VALUES (@value, @value)
SET @cnt = @cnt + 1;
END;
(Values are copied later from tblBase into the two test tables, roughly along these lines:)
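-- Sketch of the copy step; tblBase is assumed to hold the generated strings
-- in Value (VARCHAR) and NValue (NVARCHAR), as in the loop above.
INSERT INTO tblVarcharNoIndex (Value)
SELECT Value FROM tblBase;

INSERT INTO tblNVarcharNoIndex (Value)
SELECT NValue FROM tblBase;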
LIKE Query in question
DECLARE @start DATETIME
DECLARE @end DATETIME
DECLARE @testname NVARCHAR(100) = N'Complex LIKE'
--VARCHAR No Index
PRINT 'starting ''' + @testname + ''' on VARCHAR (No Index)'
SET @start = GETDATE()
SELECT * FROM tblVarcharNoIndex WHERE Value LIKE '%ab%' --This takes 1540ms
SET @end = GETDATE()
PRINT '-- finished ''' + @testname + ''' on VARCHAR (No Index)'
PRINT '-- Duration ' + CAST(DATEDIFF(mcs, @start, @end) AS VARCHAR(100)) + ' microseconds'
--NVARCHAR No Index
PRINT 'starting ''' + @testname + ''' on NVARCHAR (No Index)'
SET @start = GETDATE()
SELECT * FROM tblNVarcharNoIndex WHERE Value LIKE '%ab%' --This takes 8630ms
SET @end = GETDATE()
PRINT '-- finished ''' + @testname + ''' on NVARCHAR (No Index)'
PRINT '-- Duration ' + CAST(DATEDIFF(mcs, @start, @end) AS VARCHAR(100)) + ' microseconds'
Execution Plan
The execution plan for both queries looks exactly the same (I have nowhere to upload the image right now, but it's really simple):
SELECT (0%) <--- Parallelism (Gather Streams) (3%) <--- Clustered Index Scan ON Primary Key (97%)
Upvotes: 13
Views: 2374
Reputation: 1298
The theory, though, is sound. LIKE is an operator that compares each value against a portion of a string. If the comparison is implemented naively, and SQL Server could not take advantage of knowing anything about one part of the value over another, it would necessarily have to run an algorithm like the following (example in C#):
// Naive scan: slide a two-character window across x, looking for compareValue.
bool foundValue = false;
string searchValue;
for (int Start = 0; foundValue == false && Start <= (x.Length - 2); Start += 1)
{
    searchValue = x.Substring(Start, 2);
    if (searchValue == compareValue)
        foundValue = true;
}
only with NVARCHAR there are twice as many bytes per character to work through.
From my own testing, I note the following:
Table 'tblVarcharNoIndex'. Scan count 1, logical reads 97, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'tblNVarcharNoIndex'. Scan count 1, logical reads 189, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
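Those numbers come from SQL Server's I/O statistics; if you want to reproduce them, a minimal sketch against the two tables from the question looks like this:
SET STATISTICS IO ON;
SELECT * FROM tblVarcharNoIndex  WHERE Value LIKE '%ab%';
SELECT * FROM tblNVarcharNoIndex WHERE Value LIKE '%ab%';
SET STATISTICS IO OFF;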
Logical reads indicate how many pages SQL Server had to read for the comparisons, and we notice this is a little over 2x. I think the answer can be seen when looking at the Actual Execution Plan and noting the estimated number of rows was 56 vs 73, even though the same number of rows was ultimately returned.
A look at the Client Statistics, however, shows what you probably noticed:
NVAR VAR AVERAGE
Query Profile Statistics
Number of INSERT, DELETE and UPDATE statements 0 0 0.0000
Rows affected by INSERT, DELETE, or UPDATE statements 0 0 0.0000
Number of SELECT statements 2 2 2.0000
Rows returned by SELECT statements 306 306 306.0000
Number of transactions 0 0 0.0000
Network Statistics
Number of server roundtrips 1 1 1.0000
TDS packets sent from client 1 1 1.0000
TDS packets received from server 45 23 34.0000
Bytes sent from client 146 144 145.0000
Bytes received from server 180799 91692 136245.5000
Time Statistics
Client processing time 286 94 190.0000
Total execution time 317 156 236.5000
Wait time on server replies 31 62 46.5000
Notice that the number of TDS packets received from the server was different (recall the row estimates were disparate), which not only takes up more bytes but also more time to process. Execution time was about 2x, yet the client processing time was about 3x.
How much of this is related to your processor vs SQL Server's protocols? Probably some or a lot of it (this query was run on an ancient EDU Lenovo laptop with Windows 10, a 1.64 GHz dual-core CPU and 16GB DDR3), though I am unqualified to answer the specifics.
Still, we can conclude one thing: SQL Server's estimation of rows has an impact on the client and data sent/received.
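If you want to see those row estimates next to the actual counts yourself, one option (a sketch, not necessarily how the numbers above were captured) is to request the actual plan from T-SQL:
SET STATISTICS XML ON;
SELECT * FROM tblVarcharNoIndex  WHERE Value LIKE '%ab%';
SELECT * FROM tblNVarcharNoIndex WHERE Value LIKE '%ab%';
SET STATISTICS XML OFF;
-- The returned showplan XML includes both the estimated and the actual row counts.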
Upvotes: 3
Reputation: 1202
A query that uses a varchar parameter against a varchar column can do an index seek, because the parameter matches the column's collation and data type.
A query that uses an nvarchar parameter against that same column does an index scan, because the column has to be implicitly converted to nvarchar first.
The basic rule to follow is Scans are bad, Seeks are good.
Index Scan
When SQL Server does a scan it loads the object which it wants to read from disk into memory, then reads through that object from top to bottom looking for the records that it needs.
Index Seek
When SQL Server does a seek it knows where in the index that the data is going to be, so it loads up the index from disk, goes directly to the part of the index that it needs and reads to where the data that it needs ends. This is obviously a much more efficient operation than a scan, as SQL already knows where the data it is looking for is located.
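To make the difference concrete with the tables from the question (the index here is hypothetical, since the question's tables have no index on Value): a pattern with a fixed prefix can seek, while a leading wildcard forces a scan.
-- Hypothetical index for illustration only.
CREATE INDEX IX_tblVarcharNoIndex_Value ON tblVarcharNoIndex (Value);

-- Sargable: 'ab%' has a fixed prefix, so SQL Server can seek straight to the 'ab' range.
SELECT Id, Value FROM tblVarcharNoIndex WHERE Value LIKE 'ab%';

-- Not sargable: '%ab%' gives no starting point, so the whole index (or table) is scanned.
SELECT Id, Value FROM tblVarcharNoIndex WHERE Value LIKE '%ab%';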
How can I modify an Execution Plan to use a Seek instead of a Scan?
When SQL Server is looking for your data, probably one of the largest things which will make SQL Server switch from a seek to a scan is when some of the columns you are looking for are not included in the index you want it to use. Most often this will have SQL Server fall back to doing a clustered index scan, since the clustered index contains all the columns in the table. This is one of the biggest reasons (in my opinion at least) that we now have the ability to INCLUDE columns in an index, without adding those columns to the indexed columns of the index. By including the additional columns in the index we increase the size of the index, but we allow SQL Server to read the index without having to go back to the clustered index, or to the table itself, to get these values.
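As a sketch of what INCLUDE looks like, using a hypothetical dbo.Customers table and columns (none of this exists in the question's schema):
-- Key column LastName supports the seek; INCLUDE stores FirstName and Email at the
-- leaf level, so the query below is covered and never has to go back to the
-- clustered index or the table itself.
CREATE NONCLUSTERED INDEX IX_Customers_LastName
    ON dbo.Customers (LastName)
    INCLUDE (FirstName, Email);

SELECT LastName, FirstName, Email
FROM dbo.Customers
WHERE LastName = 'Smith';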
References
For information regarding the specifics of each of these operators within a SQL Server Execution plan see....
https://msdn.microsoft.com/en-us/library/ms175184.aspx
https://technet.microsoft.com/en-us/library/ms190400.aspx
Upvotes: 0
Reputation: 1739
As suggested in some of the other posts, the biggest impact on your performance in this scenario comes from the Unicode comparison rules. You can work around the problem for LIKE comparison queries by adding a non-persisted computed column with a binary collation to your table:
ALTER TABLE tblNVarcharNoIndex
ADD ValueColBin AS UPPER(Value) COLLATE Latin1_General_100_Bin2;
Instead of querying the persisted data column, you can now query the computed column. Note that a binary collation is case sensitive, so in order to perform case-insensitive searches you will have to convert your search strings to upper case (which is why the column definition uses UPPER). The following example shows what the updated query looks like:
DECLARE @compare NVARCHAR(10) = N'%AB%'
SELECT [Id]
,[Value]
FROM tblNVarcharNoIndex
WHERE [ValueColBin] LIKE @compare collate Latin1_General_100_Bin2
There will still be a performance hit; however, it should be within the expected range of roughly 1.5 to 2 times slower (in theory at least). Note that this method will incur a higher CPU cost.
Upvotes: 0
Reputation: 2102
Unicode comparison rules are much more complicated than ASCII rules.
The effect of Unicode data on performance is complicated by a variety of factors that include the following:
- The difference between Unicode sorting rules and non-Unicode sorting rules
- The difference between sorting double-byte and single-byte characters
- Code page conversion between client and server
Ref: https://msdn.microsoft.com/en-us/library/ms189617.aspx
You can confirm this by changing the collation of the column to binary.
SELECT *
FROM #temp2
where col1 COLLATE Latin1_General_Bin2 like '%str%'
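Applied to the table from the question, that check would look roughly like this:
-- Force a binary collation for the comparison so the Unicode linguistic rules are bypassed.
-- Note: binary collations are case sensitive, so '%ab%' will no longer match 'AB' or 'Ab'.
SELECT *
FROM tblNVarcharNoIndex
WHERE Value COLLATE Latin1_General_Bin2 LIKE '%ab%'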
Lastly, there are some considerations to keep in mind if you have to use NVARCHAR and want to improve performance.
Upvotes: 1
Reputation: 3993
It will not be possible to give you specifics without more data, starting with an execution plan for both queries.
Some general reasons:
- As you stated, there are twice as many bytes to read when doing a scan (a quick check is sketched after this list)
- The number of page loads will increase
- The amount of memory needed will increase, which can cause spills to disk
- CPU usage can increase, which might be capped by OS or SQL Server settings and cause CPU waits
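A quick sanity check of the first point against the tables from the question is to compare the average stored byte counts directly:
-- DATALENGTH returns the number of bytes actually stored; for the same strings the
-- NVARCHAR column should come out at roughly twice the VARCHAR column.
SELECT
    (SELECT AVG(DATALENGTH(Value)) FROM tblVarcharNoIndex)  AS AvgVarcharBytes,
    (SELECT AVG(DATALENGTH(Value)) FROM tblNVarcharNoIndex) AS AvgNVarcharBytes;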
Upvotes: 4