Gili
Gili

Reputation: 89983

How to find out whether collation uses word sort or string sort?

https://stackoverflow.com/a/361059/14731 discusses the differences between "word sort" and "string sort".

How does one query programmatically when an SQL Collation will use "word sort" vs "string sort"?

Corollary: Do all collations use "word sort" for Unicode strings and "string sort" for non-Unicode strings?

SELECT * from sys.fn_HelpCollations()
WHERE name = 'SQL_Latin1_General_CP1_CI_AS'

provides a lot of details about the collation, but notice that it makes no mention of "word sort".

Upvotes: 4

Views: 1723

Answers (2)

Gili
Gili

Reputation: 89983

  • srutzky's excellent answer reveals that, with the exception of non-Unicode types processed by SQL_ collators, all other data is sorted according to "Unicode Collation" rules.
  • Confusingly, Microsoft does not use the Unicode standard's sorting rules.
  • According to https://support.microsoft.com/en-us/kb/322112

    SQL Server 2000 supports two types of collations:

    • SQL collations
    • Windows collations

    [...]

    For a Windows collation, a comparison of non-Unicode data is implemented by using the same algorithm as Unicode data.

    [...]

    A SQL collation's rules for sorting non-Unicode data are incompatible with any sort routine that is provided by the Microsoft Windows operating system; however, the sorting of Unicode data is compatible with a particular version of the Windows sorting rules.

  • I interpret this as meaning that:

    • SQL_ collators are "SQL collations"
    • All other collators are "Windows collators".
    • With the exception of non-Unicode types processed by SQL_ collators, all other data is sorted according to "Windows collations".

So, let's dig into "Windows collations".

//  Sorting Flags.
//
//    WORD Sort:    culturally correct sort
//                  hyphen and apostrophe are special cased
//                  example: “coop” and “co-op” will sort together in a list
//
//                        co_op     <——-  underscore (symbol)
//                        coat
//                        comb
//                        coop
//                        co-op     <——-  hyphen (punctuation)
//                        cork
//                        went
//                        were
//                        we’re     <——-  apostrophe (punctuation)
//
//
//    STRING Sort:  hyphen and apostrophe will sort with all other symbols
//
//                        co-op     <——-  hyphen (punctuation)
//                        co_op     <——-  underscore (symbol)
//                        coat
//                        comb
//                        coop
//                        cork
//                        we’re     <——-  apostrophe (punctuation)
//                        went
//                        were
  • And finally, according to https://msdn.microsoft.com/en-us/library/windows/desktop/dd318144(v=vs.85).aspx

    [...] all punctuation marks and other nonalphanumeric characters, except for the hyphen and the apostrophe, come before any alphanumeric character. The hyphen and the apostrophe are treated differently from the other nonalphanumeric characters to ensure that words such as "coop" and "co-op" stay together in a sorted list.

Upvotes: 1

Solomon Rutzky
Solomon Rutzky

Reputation: 48776

Let's start with the definition of these types of sorts as given by Microsoft (taken from the "Remarks" section of the CompareOptions Enumeration MSDN page):

The .NET Framework uses three distinct ways of sorting: word sort, string sort, and ordinal sort. Word sort performs a culture-sensitive comparison of strings. Certain nonalphanumeric characters might have special weights assigned to them. For example, the hyphen ("-") might have a very small weight assigned to it so that "coop" and "co-op" appear next to each other in a sorted list. String sort is similar to word sort, except that there are no special cases. Therefore, all nonalphanumeric symbols come before all alphanumeric characters. Ordinal sort compares strings based on the Unicode values of each element of the string.

Unicode is culturally-sensitive and weighted, and the XML and N-prefixed types are Unicode, so they could be saying that data in Unicode types use "word sort" while data in the non-Unicode types uses "string sort". Ordinal refers to the BIN and BIN2 collations, though the BIN collations aren't 100% ordinal due to how they treat the first character.

But let's see what SQL Server says it is doing. Run the following:

DECLARE @SampleData TABLE (ANSI VARCHAR(50), UTF16 NVARCHAR(50));
INSERT INTO @SampleData (ANSI, UTF16) VALUES 
    ('a-b-c', N'a-b-c'),
    ('ac', N'ac'),
    ('aba', N'aba'),
    ('a-b', N'a-b'),
    ('ab', N'ab');

SELECT sd.ANSI AS [ANSI-Latin1_General_100_CI_AS]
FROM   @SampleData sd
ORDER BY sd.ANSI COLLATE Latin1_General_100_CI_AS ASC;

SELECT sd.UTF16 AS [UTF16-Latin1_General_100_CI_AS]
FROM   @SampleData sd
ORDER BY sd.UTF16 COLLATE Latin1_General_100_CI_AS ASC;

SELECT sd.ANSI AS [ANSI-SQL_Latin1_General_CP1_CI_AS]
FROM   @SampleData sd
ORDER BY sd.ANSI COLLATE SQL_Latin1_General_CP1_CI_AS ASC;

SELECT sd.UTF16 AS [UTF16-SQL_Latin1_General_CP1_CI_AS]
FROM   @SampleData sd
ORDER BY sd.UTF16 COLLATE SQL_Latin1_General_CP1_CI_AS ASC;

Results:

ANSI-Latin1_General_100_CI_AS
-------------------------------------
ab
a-b
aba
a-b-c
ac

UTF16-Latin1_General_100_CI_AS
-------------------------------------
ab
a-b
aba
a-b-c
ac

ANSI-SQL_Latin1_General_CP1_CI_AS
-------------------------------------
a-b
a-b-c
ab
aba
ac

UTF16-SQL_Latin1_General_CP1_CI_AS
-------------------------------------
ab
a-b
aba
a-b-c
ac

Hmm. Only the SQL_ collation in combination with the VARCHAR field appears to be doing what could be considered "string sort". It makes sense that the SQL_ collation in combination with the NVARCHAR field would do "word sort" it is the same Unicode handling as the non-SQL_ collations. But is there something besides being a SQL Server collation (i.e. starting with SQL_ ) that determines "string" vs "word" sort? Let's look at the only properties of the collations that we can extract:

SELECT N'Latin1_General_100_CI_AS' AS [CollationName],
       COLLATIONPROPERTY('Latin1_General_100_CI_AS', 'CodePage') AS [CodePage],
       COLLATIONPROPERTY('Latin1_General_100_CI_AS', 'LCID') AS [LCID],
      COLLATIONPROPERTY('Latin1_General_100_CI_AS', 'ComparisonStyle') AS [ComparisonStyle]
UNION ALL
SELECT N'SQL_Latin1_General_CP1_CI_AS' AS [CollationName],
       COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'CodePage'),
       COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'LCID'),
       COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'ComparisonStyle');

Results:

CollationName                  CodePage   LCID    ComparisonStyle
----------------------------   --------   ----    ---------------
Latin1_General_100_CI_AS       1252       1033    196609
SQL_Latin1_General_CP1_CI_AS   1252       1033    196609

So, there are no discernible differences there. Which seems to leave us with this:

String sort is done when:

  • the collation name starts with SQL_, AND
  • the data (field, variable, string literal) is non-Unicode (i.e. CHAR / VARCHAR / TEXT)

For more information on Unicode sorting in general, check out the following resources:

Upvotes: 2

Related Questions