John
John

Reputation: 173

SQL Server PATINDEX sometimes returns false found index

I have noticed that PATINDEX on SQL Server (I'm using 2016) is giving odd results. I suspect it has something to do with collations and character sets.

I'm trying to get the index of the first space or hyphen using PATINDEX.

On a database with the default collation SQL_Latin1_General_CP1_CI_AS I get the expected results of 0 (not found):

select PATINDEX('%[ -]%',  'ABC') -- returns 0
select PATINDEX('%[ -]%', N'ABC') -- returns 0
select PATINDEX('%[- ]%', N'ABC') -- returns 0

However, on a database with collation Latin1_General_100_BIN I get an unexpected and wrong result indicating that either a space or hyphen was found at index 1:

select PATINDEX('%[ -]%',  'ABC') -- returns 0
select PATINDEX('%[ -]%', N'ABC') -- returns 1 (WRONG!)
select PATINDEX('%[- ]%', N'ABC') -- returns 0

In summary, I note this apparently wrong result when:

I have seen other similar questions but they don't address quite the same situation, especially why the patteren works on one collation and not other, and OK on non-unicode string and not on unicode string. I have seen patindex t-sql special characters which says that the '-' character in any position other than first is a range spec for LIKE and PATINDEX (although I don't see it in the SQL Server PATINDEX or Wildcard docs). Still doesn't explain why it works in some configurations and not others as shown.

Why such different PATINDEX and apparently wrong result?

Upvotes: 4

Views: 1250

Answers (2)

HoneyBadger
HoneyBadger

Reputation: 15140

You already mentioned yourself that the '-' character in any position other than first is (or rather can be) a range spec. The issue with ' -' is that no end of the range is given. So lets find out what the end of the range is:

SELECT  SV.number, NCHAR(SV.number) TestChar
FROM    master..spt_values AS SV
WHERE   SV.type = 'p'
    AND NCHAR(SV.number) LIKE N'%[ -]%'

This returns:

+--------+----------+
| number | TestChar |
+--------+----------+
|     32 |          |
|     45 |        - |
+--------+----------+

So, in a non-binary collation (I'm using Latin1_General_CI_AS) the - is not interpreted as a range specifier at all, but as a literal character. Otherwise, the characters between 32 and 45 would also be returned. So only space and dash will return a patindex <> 0. Lets try a binary collation:

SELECT  SV.number, NCHAR(SV.number) TestChar 
FROM    master..spt_values AS SV
WHERE   SV.type = 'p'
    AND NCHAR(SV.number) LIKE N'%[ -]%' COLLATE Latin1_General_100_BIN2

This returns:

+--------+----------+
| number | TestChar |
+--------+----------+
|     32 |          |
|     33 | !        |
|     34 | "        |
|     35 | #        |
|     36 | $        |
|     37 | %        |
|     38 | &        |
|     39 | '        |
|     40 | (        |
|     41 | )        |
|     42 | *        |
|     43 | +        |
|     44 | ,        |
|     45 | -        |
|     46 | .        |
|     47 | /        |
|     48 | 0        |
|     49 | 1        |
|     50 | 2        |
|     51 | 3        |
|     52 | 4        |
|     53 | 5        |
|     54 | 6        |
|     55 | 7        |
|     56 | 8        |
|     57 | 9        |
|     58 | :        |
|     59 | ;        |
|     60 | <        |
|     61 | =        |
|     62 | >        |
|     63 | ?        |
|     64 | @        |
|     65 | A        |
|     66 | B        |
|     67 | C        |
|     68 | D        |
|     69 | E        |
|     70 | F        |
|     71 | G        |
|     72 | H        |
|     73 | I        |
|     74 | J        |
|     75 | K        |
|     76 | L        |
|     77 | M        |
|     78 | N        |
|     79 | O        |
|     80 | P        |
|     81 | Q        |
|     82 | R        |
|     83 | S        |
|     84 | T        |
|     85 | U        |
|     86 | V        |
|     87 | W        |
|     88 | X        |
|     89 | Y        |
|     90 | Z        |
|     91 | [        |
|     92 | \        |
|     93 | ]        |
+--------+----------+

So now it is interpreted as a range, and the range includes A-Z. Note it does not contain a-z! The lower case letters would be included when we use LIKE N'%[ -z]%'. In binary, the end of the range (when none is specified) is always ], no matter what the beginning of the range is.

Now, lets have a look at what non-unicode values do:

SELECT  SV.number, CHAR(SV.number) TestChar
FROM    master..spt_values AS SV
WHERE   SV.type = 'p'
    AND CHAR(SV.number) LIKE '%[ -]%' COLLATE Latin1_General_100_BIN2

This returns:

+--------+----------+
| number | TestChar |
+--------+----------+
|     32 |          |
|     45 |        - |
+--------+----------+

So, as ASCII the dash is again not interpreted as a range operator. Weird, huh?

By the way, if you really want to search for space dash, you can also use PATINDEX(N'% [-]%', N'ABC' COLLATE Latin1_General_BIN2).

Another by the way: If we check out Larnu's solution:

SELECT  SV.number, NCHAR(SV.number) TestChar
FROM    master..spt_values AS SV
WHERE   SV.type = 'p'
    AND CHAR(SV.number) LIKE '%[ --]%' COLLATE Latin1_General_100_BIN2

You'll get:

+--------+----------+
| number | TestChar |
+--------+----------+
|     32 |          |
|     33 | !        |
|     34 | "        |
|     35 | #        |
|     36 | $        |
|     37 | %        |
|     38 | &        |
|     39 | '        |
|     40 | (        |
|     41 | )        |
|     42 | *        |
|     43 | +        |
|     44 | ,        |
|     45 | -        |
+--------+----------+

So you're still evaluating a range. Not sure if that's what you want, but something to be aware of.

Upvotes: 3

Thom A
Thom A

Reputation: 95554

Double up the hyphen as it seems it's sometimes being used as a between operator.

SELECT PATINDEX(N'%[ --]%', 'ABC' COLLATE Latin1_General_100_BIN); --Returns 0
SELECT PATINDEX(N'%[ --]%', N'ABC' COLLATE Latin1_General_100_BIN); --Returns 0
SELECT PATINDEX(N'%[-- ]%', N'ABC' COLLATE Latin1_General_100_BIN); --Returns 0

SELECT PATINDEX(N'%[ --]%', '-ABC' COLLATE Latin1_General_100_BIN); --Returns 1
SELECT PATINDEX(N'%[ --]%', N'ABC-' COLLATE Latin1_General_100_BIN); --Returns 4
SELECT PATINDEX(N'%[-- ]%', N'-ABC' COLLATE Latin1_General_100_BIN); --Returns 0, as the hyphen is at the start, so doesn't need escaping.

Upvotes: 0

Related Questions