Reputation: 57169
Updated question ¹
With regard to character classes, comparison, sorting, normalization and collations, which Unicode version or versions are supported by which .NET platforms?
Original question
I remember somewhat vaguely having read that .NET supported Unicode version 3.0 and that the internal UTF-16 encoding is not really UTF-16 but actually uses UCS-2, which is not the same. It seems, for instance, that characters above U+FFFF are not possible, i.e. consider:
string s = "\u1D7D9"; // ("Mathematical double-struck digit one")
and it stores the string "ᵽ9".
I'm basically looking for definitive references of answers to the following:
¹) I updated the question because, with the passing of time, the new wording fits the answers and the larger community better. I left the original question in place; parts of it have been answered in the comments. Also, the old UCS-2 (no surrogates) was used in now-ancient 32-bit Windows versions, whereas .NET has always used UTF-16 (with surrogates) internally.
Upvotes: 28
Views: 5305
Reputation: 2782
.NET Framework 4.6, 4.5, 4, 3.5 and 3.0 - The Unicode Standard, Version 5.0
.NET Framework 2.0 and 1.1 - The Unicode Standard, Version 3.1
The complete answers can be found here under the section Remarks.
Upvotes: 0
Reputation: 7449
Internally, .NET is UTF-16. In some cases, e.g. when ASP.NET writes to a response, by default it uses UTF-8. Both of them can handle higher planes.
The reason people sometimes refer to .NET as UCS-2 is (I think, because I see few other reasons) that Char is strictly 16 bits wide, so a single Char can't represent a code point in the upper planes. Char does, however, have static method overloads (e.g. Char.IsLetter) that take a string and an index and can operate on upper-plane characters stored as surrogate pairs inside that string. Strings are stored as true UTF-16.
You can address high Unicode code points directly using the uppercase \U escape (e.g. "\U0001D7D9"), but again, only inside strings, not chars.
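To make the \u versus \U difference concrete, here is a minimal sketch. The class name SurrogateDemo and the helper CodePointAt are made up for illustration; the calls themselves (char.IsSurrogatePair, char.ConvertToUtf32, char.IsDigit) are standard System.Char members.

```csharp
using System;

static class SurrogateDemo
{
    // Hypothetical helper: returns the code point starting at index,
    // combining a surrogate pair into one value when present.
    public static int CodePointAt(string s, int index)
    {
        return char.ConvertToUtf32(s, index);
    }

    static void Main()
    {
        string wrong = "\u1D7D9";     // \u takes exactly four hex digits: U+1D7D followed by '9'
        string right = "\U0001D7D9";  // \U takes eight hex digits: U+1D7D9

        Console.WriteLine(wrong.Length);                         // 2 (two separate characters)
        Console.WriteLine(right.Length);                         // 2 (one surrogate pair)
        Console.WriteLine(char.IsSurrogatePair(right, 0));       // True
        Console.WriteLine(CodePointAt(right, 0).ToString("X"));  // 1D7D9

        // String-and-index overloads classify the whole code point,
        // not just the first 16-bit unit.
        Console.WriteLine(char.IsDigit(right, 0));
    }
}
```

Both strings have Length 2, which is why the mistake in the question is easy to miss: Length counts 16-bit code units, not code points.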
As for Unicode version, from the MSDN documentation:
"In the .NET Framework 4, sorting, casing, normalization, and Unicode character information is synchronized with Windows 7 and conforms to the Unicode 5.1 standard."
Update 1: It's worth noting, however, that this does not imply that the entirety of Unicode 5.1 is supported, neither in Windows 7 nor in .NET 4.0.
Windows 8 targets Unicode 6.0 - I'm guessing that .NET Framework 4.5 might synchronize with that, but have found no sources confirming it. And once again, that doesn't mean the entire standard is implemented.
Update 2: This note on Roslyn confirms that the underlying platform defines the Unicode support for the compiler, and in the link to the code it explains that C# 6.0 supports Unicode 6.0 and up (with a breaking change for C# identifiers as a result).
Update 3: Since .NET 4.5, a new class SortVersion is available to get the supported Unicode version, via the FullVersion property of the SortVersion object that CompareInfo.Version returns. On the same page, Microsoft explains that .NET 4.0 supports Unicode 5.0 on all platforms and .NET 4.5 supports Unicode 5.0 on Windows 7 and Unicode 6.0 on Windows 8. This differs slightly from the official "what is new" statement here, which talks of version 5.x and 6.0 respectively. From my own (editor: Abel) experience, in most cases it seems that in .NET 4.0, Unicode 5.1 is supported at least for character classes, but I didn't test sorting, normalization and collations. This seems in line with what is said in MSDN as quoted above.
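A minimal sketch of querying the sort version at run time, assuming .NET 4.5 or later. The class name SortVersionDemo and the helper GetSortVersion are made up for illustration. CompareInfo.Version, SortVersion.FullVersion and SortVersion.SortId are real members, but how the reported number maps to a Unicode version depends on the platform, so the sketch only prints it.

```csharp
using System;
using System.Globalization;

static class SortVersionDemo
{
    // Hypothetical helper: returns the sort version information for a culture.
    public static SortVersion GetSortVersion(CultureInfo culture)
    {
        return culture.CompareInfo.Version;
    }

    static void Main()
    {
        SortVersion version = GetSortVersion(CultureInfo.InvariantCulture);

        // FullVersion is an int whose value is defined by the underlying platform.
        Console.WriteLine("FullVersion: 0x{0:X8}", version.FullVersion);
        Console.WriteLine("SortId: {0}", version.SortId);
    }
}
```

Comparing two SortVersion values, for example one stored next to a sorted index and the current one, is a way to detect that sort keys may need rebuilding after a platform upgrade.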
Upvotes: 20
Reputation: 1234
That character is supported. One thing to note is that for Unicode code points above U+FFFF (which need more than four hex digits), you must declare them with an uppercase '\U' followed by eight hex digits, like this:
string text = "\U0001D7D9";
If you create a WPF app with that character in a text block, it should render the double-struck digit one perfectly.
Upvotes: 5
Reputation: 19305
MSDN covers it briefly here: http://msdn.microsoft.com/en-us/library/9b1s4yhz(v=vs.90).aspx
I tried this:
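using System;
using System.IO;
using System.Text;

class Program {
    static void Main(string[] args) {
        // U+1D7D9 lies outside the BMP, so this string holds a surrogate pair.
        string someText = char.ConvertFromUtf32(0x1D7D9);
        using (var stream = new MemoryStream()) {
            // Encoding.UTF32 is little-endian and emits a BOM.
            using (var writer = new StreamWriter(stream, Encoding.UTF32)) {
                writer.Write(someText);
                writer.Flush();
            }
            // ToArray still works after the writer has closed the stream.
            var bytes = stream.ToArray();
            foreach (var oneByte in bytes) {
                Console.WriteLine(oneByte.ToString("x"));
            }
        }
    }
}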
And got a dump of a byte array containing a correct BOM and the correct representation of the U+1D7D9 code point, for these encodings:
So my guess is that higher planes are supported, and that UTF-16 is really UTF-16 (and not UCS-2).
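The same check can be done without a stream. This is a sketch only: the three encodings below are my own choice, not necessarily the ones tested above, and the class name EncodingDemo and the helper Hex are made up for illustration. Encoding.GetBytes does not emit a BOM, so only the encoded code point appears.

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    // Hypothetical helper: encodes text and returns the bytes as dash-separated hex.
    public static string Hex(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    static void Main()
    {
        string text = char.ConvertFromUtf32(0x1D7D9);

        Console.WriteLine(Hex(Encoding.UTF8, text));     // F0-9D-9F-99
        Console.WriteLine(Hex(Encoding.Unicode, text));  // 35-D8-D9-DF (UTF-16LE: D835 DFD9)
        Console.WriteLine(Hex(Encoding.UTF32, text));    // D9-D7-01-00 (UTF-32LE)
    }
}
```

The UTF-16 output is two 16-bit units, a high surrogate D835 followed by a low surrogate DFD9, which UCS-2 could not produce.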
Upvotes: 4