Reputation: 699
I'm currently wondering about some performance differences between VARCHAR and NVARCHAR, especially when using complex LIKE queries (patterns that start with _ or %).
I have a test setup on Microsoft SQL Server 2014 with 2 tables. Both have an Id field (IDENTITY(1, 1)) and a Value field (either VARCHAR(450) or NVARCHAR(450)). Both contain the identical 1'000'000 randomly generated entries.
The tables are named tblVarCharNoIndex and tblNVarCharNoIndex (so there are no indexes; the behaviour is nearly the same if I use indexes).
Now I execute the following queries to test the duration (once on VARCHAR, once on NVARCHAR):
SELECT * FROM tblVarcharNoIndex WHERE Value LIKE '%ab%'
SELECT * FROM tblNVarcharNoIndex WHERE Value LIKE '%ab%'
The execution times are wildly different: 1540 ms on the VARCHAR table versus 8630 ms on the NVARCHAR table, so it takes over 5x longer with NVARCHAR.
I understand that NVARCHAR has performance implications, since it needs 2 bytes per character; that much makes sense. But I can't explain a performance degradation of over 500%.
As requested, here is some more data.
Query for table creation
CREATE TABLE [dbo].[tblVarcharNoIndex](
[Id] [int] IDENTITY(1,1) NOT NULL,
[Value] [varchar](450) NOT NULL,
CONSTRAINT [PK_tblVarcharNoIndex] PRIMARY KEY CLUSTERED
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
CREATE TABLE [dbo].[tblNVarcharNoIndex](
[Id] [int] IDENTITY(1,1) NOT NULL,
[Value] [nvarchar](450) NOT NULL,
CONSTRAINT [PK_tblNVarcharNoIndex] PRIMARY KEY CLUSTERED
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
Query for value generation
DECLARE @cnt INT = 0;
DECLARE @entries INT = 1000000 --1'000'000;
DECLARE @maxLength INT = 450;
DECLARE @minLength INT = 50;
DECLARE @value VARCHAR(450)
DECLARE @length INT
WHILE @cnt < @entries
BEGIN
SELECT @value = ''
SET @length = @minLength + CAST(RAND() * (@maxLength - @minLength) as INT)
WHILE @length <> 0
BEGIN
SELECT @value = @value + CHAR(CAST(RAND() * 96 + 32 as INT))
SET @length = @length - 1
END
INSERT INTO tblBase(Value, NValue) VALUES (@value, @value)
SET @cnt = @cnt + 1;
END;
(Values are copied later from tblBase into the two test tables, roughly along these lines:)
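-- Sketch of the copy step; tblBase is assumed to hold the generated strings
-- in Value (VARCHAR) and NValue (NVARCHAR), as in the loop above.
INSERT INTO tblVarcharNoIndex (Value)
SELECT Value FROM tblBase;

INSERT INTO tblNVarcharNoIndex (Value)
SELECT NValue FROM tblBase;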
LIKE Query in question
DECLARE @start DATETIME
DECLARE @end DATETIME
DECLARE @testname NVARCHAR(100) = N'Complex LIKE'
--VARCHAR No Index
PRINT 'starting ''' + @testname + ''' on VARCHAR (No Index)'
SET @start = GETDATE()
SELECT * FROM tblVarcharNoIndex WHERE Value LIKE '%ab%' --This takes 1540ms
SET @end = GETDATE()
PRINT '-- finished ''' + @testname + ''' on VARCHAR (No Index)'
PRINT '-- Duration ' + CAST(DATEDIFF(mcs, @start, @end) AS VARCHAR(100)) + ' microseconds'
--NVARCHAR No Index
PRINT 'starting ''' + @testname + ''' on NVARCHAR (No Index)'
SET @start = GETDATE()
SELECT * FROM tblNVarcharNoIndex WHERE Value LIKE '%ab%' --This takes 8630ms
SET @end = GETDATE()
PRINT '-- finished ''' + @testname + ''' on NVARCHAR (No Index)'
PRINT '-- Duration ' + CAST(DATEDIFF(mcs, @start, @end) AS VARCHAR(100)) + ' microseconds'
Execution Plan
The execution plan for both queries looks exactly the same (I have nowhere to upload the image right now, but it's really simple):
SELECT (0%) <--- Parallelism (Gather Streams) (3%) <--- Clustered Index Scan ON Primary Key (97%)
Upvotes: 13
Views: 2374
Reputation: 1298
The theory, though, is sound. LIKE is an operator that compares each value against a portion of a string. If the comparison is implemented naively, and SQL Server could not take advantage of knowing anything about one part of the value over another, it would necessarily have to run an algorithm like the following (example in C#):
// Naive scan: slide a two-character window across x, looking for compareValue.
bool foundValue = false;
string searchValue;
for (int Start = 0; foundValue == false && Start <= (x.Length - 2); Start += 1)
{
    searchValue = x.Substring(Start, 2);
    if (searchValue == compareValue)
        foundValue = true;
}
only with NVARCHAR there are twice as many bytes per character to work through.
From my own testing, I note the following:
Table 'tblVarcharNoIndex'. Scan count 1, logical reads 97, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'tblNVarcharNoIndex'. Scan count 1, logical reads 189, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
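Those numbers come from SQL Server's I/O statistics; if you want to reproduce them, a minimal sketch against the two tables from the question looks like this:
SET STATISTICS IO ON;
SELECT * FROM tblVarcharNoIndex  WHERE Value LIKE '%ab%';
SELECT * FROM tblNVarcharNoIndex WHERE Value LIKE '%ab%';
SET STATISTICS IO OFF;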
Logical reads indicate how many pages SQL Server had to read for the comparisons, and we notice this is a little over 2x. I think the answer can be seen when looking at the Actual Execution Plan and noting the estimated number of rows was 56 vs 73, even though the same number of rows was ultimately returned.
A look at the Client Statistics, however, shows what you probably noticed:
NVAR VAR AVERAGE
Query Profile Statistics
Number of INSERT, DELETE and UPDATE statements 0 0 0.0000
Rows affected by INSERT, DELETE, or UPDATE statements 0 0 0.0000
Number of SELECT statements 2 2 2.0000
Rows returned by SELECT statements 306 306 306.0000
Number of transactions 0 0 0.0000
Network Statistics
Number of server roundtrips 1 1 1.0000
TDS packets sent from client 1 1 1.0000
TDS packets received from server 45 23 34.0000
Bytes sent from client 146 144 145.0000
Bytes received from server 180799 91692 136245.5000
Time Statistics
Client processing time 286 94 190.0000
Total execution time 317 156 236.5000
Wait time on server replies 31 62 46.5000
Notice that the number of TDS packets received from the server was different (recall the row estimates were disparate), which not only takes up more bytes but also more time to process. Execution time was about 2x, yet the client processing time was about 3x.
How much of this is related to your processor vs SQL Server's protocols? Probably some or a lot of it (this query was run on an ancient EDU Lenovo laptop with Windows 10, a 1.64 GHz dual-core CPU and 16GB DDR3), though I am unqualified to answer the specifics.
Still, we can conclude one thing: SQL Server's estimation of rows has an impact on the client and data sent/received.
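If you want to see those row estimates next to the actual counts yourself, one option (a sketch, not necessarily how the numbers above were captured) is to request the actual plan from T-SQL:
SET STATISTICS XML ON;
SELECT * FROM tblVarcharNoIndex  WHERE Value LIKE '%ab%';
SELECT * FROM tblNVarcharNoIndex WHERE Value LIKE '%ab%';
SET STATISTICS XML OFF;
-- The returned showplan XML includes both the estimated and the actual row counts.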
Upvotes: 3
Reputation: 1202
A query that uses a varchar parameter against a varchar column can do an index seek, because the parameter matches the column's collation and data type.
A query that uses an nvarchar parameter against that same column does an index scan, because the column has to be implicitly converted to nvarchar first.
The basic rule to follow is Scans are bad, Seeks are good.
Index Scan
When SQL Server does a scan it loads the object which it wants to read from disk into memory, then reads through that object from top to bottom looking for the records that it needs.
Index Seek
When SQL Server does a seek it knows where in the index that the data is going to be, so it loads up the index from disk, goes directly to the part of the index that it needs and reads to where the data that it needs ends. This is obviously a much more efficient operation than a scan, as SQL already knows where the data it is looking for is located.
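To make the difference concrete with the tables from the question (the index here is hypothetical, since the question's tables have no index on Value): a pattern with a fixed prefix can seek, while a leading wildcard forces a scan.
-- Hypothetical index for illustration only.
CREATE INDEX IX_tblVarcharNoIndex_Value ON tblVarcharNoIndex (Value);

-- Sargable: 'ab%' has a fixed prefix, so SQL Server can seek straight to the 'ab' range.
SELECT Id, Value FROM tblVarcharNoIndex WHERE Value LIKE 'ab%';

-- Not sargable: '%ab%' gives no starting point, so the whole index (or table) is scanned.
SELECT Id, Value FROM tblVarcharNoIndex WHERE Value LIKE '%ab%';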
How can I modify an Execution Plan to use a Seek instead of a Scan?
When SQL Server is looking for your data, probably one of the largest things which will make SQL Server switch from a seek to a scan is when some of the columns you are looking for are not included in the index you want it to use. Most often this will have SQL Server fall back to doing a clustered index scan, since the clustered index contains all the columns in the table. This is one of the biggest reasons (in my opinion at least) that we now have the ability to INCLUDE columns in an index, without adding those columns to the indexed columns of the index. By including the additional columns in the index we increase the size of the index, but we allow SQL Server to read the index without having to go back to the clustered index, or to the table itself, to get these values.
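As a sketch of what INCLUDE looks like, using a hypothetical dbo.Customers table and columns (none of this exists in the question's schema):
-- Key column LastName supports the seek; INCLUDE stores FirstName and Email at the
-- leaf level, so the query below is covered and never has to go back to the
-- clustered index or the table itself.
CREATE NONCLUSTERED INDEX IX_Customers_LastName
    ON dbo.Customers (LastName)
    INCLUDE (FirstName, Email);

SELECT LastName, FirstName, Email
FROM dbo.Customers
WHERE LastName = 'Smith';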
References
For information regarding the specifics of each of these operators within a SQL Server Execution plan see....
https://msdn.microsoft.com/en-us/library/ms175184.aspx
https://technet.microsoft.com/en-us/library/ms190400.aspx
Upvotes: 0
Reputation: 1739
As suggested in some of the other posts, the biggest impact on your performance in this scenario comes from the Unicode comparison rules. You can work around the problem for LIKE comparison queries by adding a non-persisted computed column with a binary collation to your table:
ALTER TABLE tblNVarcharNoIndex
ADD ValueColBin AS UPPER(Value) COLLATE Latin1_General_100_Bin2;
Instead of querying the persisted data column, you can now query the computed column. Note that a binary collation is case sensitive, so in order to perform case-insensitive searches you will have to convert your search strings to upper case (which is why the column definition uses UPPER). The following example shows what the updated query looks like:
DECLARE @compare NVARCHAR(10) = N'%AB%'
SELECT [Id]
,[Value]
FROM tblNVarcharNoIndex
WHERE [ValueColBin] LIKE @compare collate Latin1_General_100_Bin2
There will still be a performance hit; however, it should be within the expected range of roughly 1.5 to 2 times slower (in theory at least). Note that this method will incur a higher CPU cost.
Upvotes: 0
Reputation: 2102
Unicode comparison rules are much more complicated than ASCII rules.
The effect of Unicode data on performance is complicated by a variety of factors that include the following:
- The difference between Unicode sorting rules and non-Unicode sorting rules
- The difference between sorting double-byte and single-byte characters
- Code page conversion between client and server
Ref: https://msdn.microsoft.com/en-us/library/ms189617.aspx
You can confirm this by changing the collation of the column to binary.
SELECT *
FROM #temp2
where col1 COLLATE Latin1_General_Bin2 like '%str%'
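Applied to the table from the question, that check would look roughly like this:
-- Force a binary collation for the comparison so the Unicode linguistic rules are bypassed.
-- Note: binary collations are case sensitive, so '%ab%' will no longer match 'AB' or 'Ab'.
SELECT *
FROM tblNVarcharNoIndex
WHERE Value COLLATE Latin1_General_Bin2 LIKE '%ab%'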
Lastly, there are some considerations to keep in mind if you have to use NVARCHAR and want to improve performance.
Upvotes: 1
Reputation: 3993
It will not be possible to give you specifics without more data, starting with an execution plan for both queries.
Some general reasons:
- As you stated, there are twice as many bytes to read when doing a scan (a quick check is sketched after this list)
- The number of page loads will increase
- The amount of memory needed will increase, which can cause spills to disk
- CPU usage can increase, which might be capped by OS or SQL Server settings and cause CPU waits
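A quick sanity check of the first point against the tables from the question is to compare the average stored byte counts directly:
-- DATALENGTH returns the number of bytes actually stored; for the same strings the
-- NVARCHAR column should come out at roughly twice the VARCHAR column.
SELECT
    (SELECT AVG(DATALENGTH(Value)) FROM tblVarcharNoIndex)  AS AvgVarcharBytes,
    (SELECT AVG(DATALENGTH(Value)) FROM tblNVarcharNoIndex) AS AvgNVarcharBytes;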
Upvotes: 4