AlphaNERD
AlphaNERD

Reputation: 385

Why is SQL sorting our string apparently illogically?

While working with SQL today we found out that SQL does not sort our strings in a column as we would expect.

This is our list of datasets: (The URLs are shortened to prevent abuse)

http://10.10.14
http://192.168.
https://m.hanno
https://online.
https://online-
https://owi-000
https://owi2.su
https://owi2-00
https://owi2-71
https://owi-700
https://owi-702
https://owi-703
https://owi-704
https://owi-707
https://owi-708
https://owi-710
https://owi-711
https://owi-712
https://owi-713
https://owi-714
https://owi-715
https://owi-716
https://owi-717
https://owigo.n
https://owigosm
https://owigow.

The owi2-URLs are directly behind owi-000. However it should be behind owi-717 or before owi-000, but definitely not in between.

...
https://owi-000
https://owi2.su
https://owi2-00
https://owi2-71
https://owi-700
...

This is how we'd expect the results. The minus comes before the 2 in ASCII and many other character encodings. Therefore all URLs with owi2 should follow after owi-.

...
https://owi-717
https://owi2.su
https://owi2-00
https://owi2-71
https://owigo.n
...

We copied all the URLs in MS Excel and to our surprise we got the same results. So we checked the strings to make sure no non-printing-character is causing the issue. And in fact, we found no non-printing-character using Notepad++.

Our question is: Why is SQL sorting our URLs the way it does?

Upvotes: 1

Views: 619

Answers (2)

DhruvJoshi
DhruvJoshi

Reputation: 17146

TLDR: Make sure you have VARCHAR and not NVARCHAR. NVARCHAR strings will ignore hyphen. In case you can not make changes to data type, try using collation Latin1_General_CI_AS

As has been mentioned elsewhere before like here, on SO here and here, there's a difference in how hyphen is used in sorting of alphanumeric characters.

Typical collations ignore hyphen when sorting. Quoting from MSDN article:

A SQL collation's rules for sorting non-Unicode data are incompatible with any sort routine that is provided by the Microsoft Windows operating system; however, the sorting of Unicode data is compatible with a particular version of the Windows sorting rules. Because the comparison rules for non-Unicode and Unicode data are different, when you use a SQL collation you might see different results for comparisons of the same characters, depending on the underlying data type. For example, if you are using the SQL collation "SQL_Latin1_General_CP1_CI_AS", the non-Unicode string 'a-c' is less than the string 'ab' because the hyphen ("-") is sorted as a separate character that comes before "b". However, if you convert these strings to Unicode and you perform the same comparison, the Unicode string N'a-c' is considered to be greater than N'ab' because the Unicode sorting rules use a "word sort" that ignores the hyphen.

Upvotes: 4

Thom A
Thom A

Reputation: 95913

As an alternative, if you do need to use an nvarchar, then you can use a different collation. You can change the collation of the column within the table, however, that may have undesired effects, as then any comparison you do will need to be explicitly collated if they don't match; which will effect SARGability. Considering you only want this for ordering, then you could simply use the COLLATE keyword. For example:

SELECT ROW_NUMBER() OVER (ORDER BY URL COLLATE SQL_Latin1_General_CP1_CI_AS),
       ROW_NUMBER() OVER (ORDER BY CONVERT(nvarchar(20),URL) COLLATE SQL_Latin1_General_CP1_CI_AS),
       ROW_NUMBER() OVER (ORDER BY URL COLLATE Latin1_General_CI_AS),
       ROW_NUMBER() OVER (ORDER BY CONVERT(nvarchar(20),URL) COLLATE Latin1_General_CI_AS),
       ROW_NUMBER() OVER (ORDER BY URL COLLATE Latin1_General_100_BIN2),
       ROW_NUMBER() OVER (ORDER BY CONVERT(nvarchar(20),URL) COLLATE Latin1_General_100_BIN2),
       URL
FROM (VALUES('http://10.10.14'),
            ('http://192.168.'),
            ('https://m.hanno'),
            ('https://online.'),
            ('https://online-'),
            ('https://owi-000'),
            ('https://owi2.su'),
            ('https://owi2-00'),
            ('https://owi2-71'),
            ('https://owi-700'),
            ('https://owi-702'),
            ('https://owi-703'),
            ('https://owi-704'),
            ('https://owi-707'),
            ('https://owi-708'),
            ('https://owi-710'),
            ('https://owi-711'),
            ('https://owi-712'),
            ('https://owi-713'),
            ('https://owi-714'),
            ('https://owi-715'),
            ('https://owi-716'),
            ('https://owi-717'),
            ('https://owigo.n'),
            ('https://owigosm'),
            ('https://owigow.'))V(URL)
ORDER BY URL COLLATE Latin1_General_CI_AS;

db<>fiddle

Here you can see that the collation SQL_Latin1_General_CP1_CI_AS has a different order for varchar to nvarchar. On the other hand Latin1_General_CI_AS is consistent for both (copying the order that nvarchar with the collation SQL_Latin1_General_CP1_CI_AS had). Latin1_General_100_BIN2 is also consistent, but follows the same order for the varchar.

Upvotes: 1

Related Questions