Reputation: 154
I'm doing some tests with SQL Server 2017.
I'm trying to store arbitrary Unicode code points in an NVARCHAR
column.
I've tried different collations.
I have no problem with common characters in the BMP plane of Unicode.
For more exotic symbols, for example if I try to store the "𝌹" character (U+1D33), the following happens:
My understanding is, for non-supplementary character collations, characters outside the UCS-2 subset shouldn't be interpreted correctly because NCHAR
fields are limited to 2 bytes.
But, I tried with Latin1_General_100_CS_AS_KS_WS_SC
, both at the DB level and column level, and it doesn't seem to work either.
Any ideas? Thanks
Upvotes: 2
Views: 1132
Reputation: 48826
I'm trying to store arbitrary unicode points in an nvarchar column. I've tried different collations. I have no problem with common characters in the PBS plane of Unicode.
Collations have nothing to do with what code points you can store in an NVARCHAR
/ NCHAR
/ NTEXT
(deprecated) column, variable, or literal. Those datatypes can store all 1,114,112 Unicode code points (even though most haven't been mapped to a character yet).
if I try to store 𝌹 character(U+1D33), ... within Management Studio, i only see the infamous square symbol. But management studio has the proper font since i can paste it in the query editor.
As others have explained already: this is merely a font issue. Fonts can hold a max of 65k characters, so you might need multiple fonts to cover all of the characters you are trying to use. I prefer Code2003 which you can find on FontSpace.com.
If i send the text from Visual Studio, the value i see in management studio is '??'
This should be due to forgetting to prefix the string literal with an upper-case "N" ;-).
SELECT '𝌹' AS [Oops], N'𝌹' AS [No Oops];
-- ?? 𝌹
My understanding is, for non supplementary character collations, characters outside the UCS-2 subset shouldn't be interpreted correctly because nchar fields are limited to 2 bytes.
The Supplementary Character-Aware (SCA) collations — those ending with _SC
or with _140_
in their names — do support supplementary characters. BUT, "support" only means that the built-in functions handle the surrogate pair as a single, supplementary code point instead a pair of surrogate code points. But, support for sorting and comparison of supplementary characters actually started in SQL Server 2005 with the introduction of the version 90 collations.
All code units in UCS-2 and UTF-16 are 16 bits / 2 bytes. Supplementary characters are merely two of those 2-byte code units. Hence, being able to store supplementary characters should have been available back in SQL Server 7.0 when NVARCHAR
was introduced. Even though no supplementary characters were defined until years later (after SQL Server 2000 was released), the NVARCHAR
types were still capable of storing and retrieving them. I don't have SQL Server 7.0 to test with, but I have confirmed this on SQL Server 2000.
For more info, please see:
Upvotes: 0
Reputation: 131393
I can't reproduce any data loss or encoding issue. I can reproduce a squares that becomes 𝌹
when copied. It's probably caused by the font used to display results in the SSMS grid or the Visual Studio debugger windows.
SQL Server and Windows use UTF16 for some time now, not UCS-2. Few fonts support the full UTF16 range though.
When I tried this in SSMS :
create table #tc(name nvarchar(20));
insert into #tc values (N'𝌹');
select name,len(name),DATALENGTH(name) from #tc;
I saw a square, 2
and 4
in the grid. This means the character was stored properly and took 4 bytes. When I tried to copy those results to SO though I saw :
name (No column name) (No column name)
𝌹 2 4
When I used Result to Text
I got the actual character :
name
-------------------- ----------- -----------
𝌹 2 4
The correct character is there but the SSMS grid's font can't display it
Update
As Dan Guzman noted,the font can be changed from Tools-->Options-->Environment-->Fonts and Colors-->Show settings for:-->Grid Results. The default font is Microsoft Sans Serif, a small font (855KB) used as the default font on Windows. It contains "only" 3000 glyphs. Chinese characters aren't included, which is why squares are displayed.
Chinese computers use SimShun as the default though, whose file is 17.1MB. They wouldn't have any problem displaying chinese characters.
Upvotes: 1