Martin Smith
Martin Smith

Reputation: 453328

SQL Server Code Pages and Collations

Is there any way in SQL Server of determining what a character in a code page would represent without actually creating a test database of that collation?

Example. If I create a test database with collation SQL_Ukrainian_CP1251_CS_AS and then do CHAR(255) it returns я.

If I try the following on a database with SQL_Latin1_General_CP1_CS_AS collation however

SELECT CHAR(255) COLLATE SQL_Ukrainian_CP1251_CS_AS

It returns y

SELECT CHAR(255)

Returns ÿ so it is obviously going first via the database's default collation then trying to find the closest equivalent to that in the explicit collation. Can this be avoided?

Upvotes: 6

Views: 19504

Answers (2)

Martin Smith
Martin Smith

Reputation: 453328

There is actually a reasonably straight forward way of doing this

SELECT CONCAT(CAST(255 AS BINARY(1)), '' COLLATE SQL_Ukrainian_CP1251_CS_AS) /* я */

So based on this method a SQL Server 2022+ compatible script that returns the 255 values from 0x01 to 0xFF for all 16 non utf8 code pages in use (DB Fiddle).

DECLARE @DynSQL NVARCHAR(MAX);

WITH DistinctCodePage AS
(
SELECT CodePage = CAST(COLLATIONPROPERTY(name, 'CodePage') AS INT), 
       ExampleCollation = MAX(name)
FROM sys.fn_helpcollations()
WHERE COLLATIONPROPERTY(name, 'CodePage') NOT IN (0 /*Exclude Unicode only collations*/, 65001 /*Exclude utf8 collations*/)
GROUP BY COLLATIONPROPERTY(name, 'CodePage')
)
SELECT  @DynSQL = 'SELECT  value,' + 
STRING_AGG(CONVERT(NVARCHAR(MAX), quotename(CodePage) + ' = CONCAT(CAST(value AS binary(1)), SPACE(0) COLLATE ' + ExampleCollation + ')' ), ',') + 
'
FROM generate_series(1,255)
ORDER BY value desc'
FROM   DistinctCodePage;

EXEC (@DynSQL);

Upvotes: 8

Robb
Robb

Reputation: 3851

While MS SQL supports both code pages and Unicode unhelpfully it doesn't provide any functions to convert between the two so figuring out what character is represented by a value in a different code page is a pig.

There are two potential methods I've seen to handle conversions, one is detailed here http://www.codeguru.com/cpp/data/data-misc/values/article.php/c4571 and involves bolting a custom conversion program onto the database and using that for conversions.

The other is to construct a db table consisting of

[CodePage], [ANSI Value], [UnicodeValue]

with the unicode value stored as either the int representing the unicode character to be converted using nchar()or the nchar itself

Your using the collation SQL_Ukrainian_CP1251_CS_AS which is code page 1251 (CP1251 from the centre of the string). You can grab its translation table here http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1251.TXT

Its a TSV so after trimming the top off the raw data should import fairly cleanly.

Personally I'd lean more towards the latter than the former especially for a production server as the former may introduce instability.

Upvotes: 2

Related Questions