Reputation: 3921
Is it possible that a .NET String object will contain an invalid Unicode code point?
If yes, how could this happen (and how can I determine whether a string contains such invalid characters)?
Upvotes: 4
Views: 3333
Reputation: 3921
Although the response given by @DPenner is excellent (and I used it as a starting point), I want to give some other details.
Besides the orphaned surrogates, which I think are a clear sign of an invalid string, there is always the possibility that a string contains unassigned code points. This case cannot be treated as an error by the .NET Framework, since new characters are added to the Unicode standard all the time (see the version history at http://en.wikipedia.org/wiki/Unicode#Versions). To make this clear: the call
Char.GetUnicodeCategory(Char.ConvertFromUtf32(0x1F01C), 0);
returns UnicodeCategory.OtherNotAssigned when using .NET 2.0, but it returns UnicodeCategory.OtherSymbol when using .NET 4.0.
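A consequence is that any "unassigned code point" check is a moving target. Here is a minimal sketch of such a check, assuming the input is otherwise well-formed UTF-16; the helper name FindUnassignedCodePointIndex is mine, not from the original answer, and what it flags depends on the Unicode tables shipped with the runtime:
using System.Globalization;

/// <summary>
/// Returns the index of the first unassigned code point, or -1 if none.
/// Assumes the string contains no orphaned surrogates.
/// </summary>
static int FindUnassignedCodePointIndex(string aString)
{
    for (int i = 0; i < aString.Length; i++)
    {
        // The (string, index) overload evaluates the full code point,
        // including surrogate pairs.
        if (CharUnicodeInfo.GetUnicodeCategory(aString, i) == UnicodeCategory.OtherNotAssigned)
        {
            return i;
        }
        if (char.IsHighSurrogate(aString[i]))
        {
            i++; // skip the low half of the pair
        }
    }
    return -1;
}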
Besides this, there is another interesting point: not even the .NET class library methods agree on how to handle Unicode non-characters and unpaired surrogate characters. For example:
System.Text.Encoding.Unicode.GetBytes("\uDDDD");
- returns { 0xFD, 0xFF }, the encoding of the Replacement Character (U+FFFD), that is, the data is considered invalid.
"\uDDDD".Normalize();
- throws an exception with the message "Invalid Unicode code point found at index 0.", that is, the data is considered invalid.
System.Text.Encoding.Unicode.GetBytes("\uFFFF");
- returns { 0xFF, 0xFF }, that is, the data is considered valid.
"\uFFFF".Normalize();
- throws an exception with the message "Invalid Unicode code point found at index 0.", that is, the data is considered invalid.
Below is a method which searches a string for invalid characters:
/// <summary>
/// Searches for invalid characters (non-characters defined in the Unicode
/// standard and unpaired surrogates) in a string.
/// </summary>
/// <param name="aString">the string to search for invalid characters</param>
/// <returns>the index of the first bad char, or -1 if no bad char is found</returns>
static int FindInvalidCharIndex(string aString)
{
    int ch;
    int chlow;
    for (int i = 0; i < aString.Length; i++)
    {
        ch = aString[i];
        if (ch < 0xD800) // char is below the surrogate range
        {
            continue;
        }
        if (ch >= 0xD800 && ch <= 0xDBFF)
        {
            // found a high surrogate -> check the surrogate pair
            i++;
            if (i == aString.Length)
            {
                // the last char is a high surrogate, so it is missing its pair
                return i - 1;
            }
            chlow = aString[i];
            if (!(chlow >= 0xDC00 && chlow <= 0xDFFF))
            {
                // did not find a low surrogate after the high surrogate
                return i - 1;
            }
            // convert to UTF-32 - like Char.ConvertToUtf32(highSurrogate, lowSurrogate)
            ch = (ch - 0xD800) * 0x400 + (chlow - 0xDC00) + 0x10000;
            if (ch > 0x10FFFF)
            {
                // invalid Unicode code point - maximum exceeded
                // (cannot actually be reached for a valid surrogate pair;
                // kept as a safety net)
                return i;
            }
            if ((ch & 0xFFFE) == 0xFFFE)
            {
                // code point ends in FFFE or FFFF -> non-character
                return i;
            }
            // found a good surrogate pair
            continue;
        }
        if (ch >= 0xDC00 && ch <= 0xDFFF)
        {
            // unexpected low surrogate
            return i;
        }
        if (ch >= 0xFDD0 && ch <= 0xFDEF)
        {
            // non-characters U+FDD0..U+FDEF are considered invalid by
            // System.Text.Encoding.GetBytes() and String.Normalize()
            return i;
        }
        if ((ch & 0xFFFE) == 0xFFFE)
        {
            // U+FFFE or U+FFFF -> non-character
            return i;
        }
    }
    return -1;
}
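For illustration, a minimal usage sketch, assuming the method above is in scope (the test strings are my own examples):
static void Main()
{
    System.Console.WriteLine(FindInvalidCharIndex("abc"));          // -1: nothing wrong
    System.Console.WriteLine(FindInvalidCharIndex("a\uDDDDb"));     // 1: unexpected low surrogate
    System.Console.WriteLine(FindInvalidCharIndex("a\uD800z"));     // 1: high surrogate missing its pair
    System.Console.WriteLine(FindInvalidCharIndex("a\uFFFF"));      // 1: non-character U+FFFF
    System.Console.WriteLine(FindInvalidCharIndex("\uD83D\uDE00")); // -1: valid surrogate pair (U+1F600)
}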
Upvotes: 10
Reputation: 10452
Yes, it is possible. According to Microsoft's documentation, a .NET String is simply
A String object is a sequential collection of System.Char objects that represent a string.
while a .NET Char
Represents a character as a UTF-16 code unit.
Taken together, this means that a .NET String is just a sequence of UTF-16 code units, whether or not they form valid Unicode text according to the standard. There are many ways this can occur: escape sequences in string literals, string manipulation that splits a surrogate pair, decoding arbitrary bytes as UTF-16, and so on. As a result, the following C# code is completely legal and will compile:
class Test
{
    static void Main()
    {
        string s =
            "\uEEEE" + // a private use character
            "\uDDDD" + // an unpaired surrogate character
            "\uFFFF" + // a Unicode non-character
            "\u0888";  // a (currently) unassigned character
        System.Console.WriteLine(s); // output is highly console dependent
    }
}
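And escape sequences are not required: ordinary string operations such as Substring can split a surrogate pair, leaving an orphaned half behind. A minimal sketch (my own example):
class SplitDemo
{
    static void Main()
    {
        string smiley = "\uD83D\uDE00";         // U+1F600, one code point, two chars
        string broken = smiley.Substring(0, 1); // keeps only the high surrogate

        // A perfectly legal .NET string, but not valid Unicode text.
        System.Console.WriteLine(broken.Length);                   // 1
        System.Console.WriteLine(char.IsHighSurrogate(broken[0])); // True
    }
}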
Upvotes: 6
Reputation: 66
All strings in .NET and C# are encoded using UTF-16, with one exception (taken from Jon Skeet's blog):
...there are two different representations: most of the time, UTF-16 is used, but attribute constructor arguments use UTF-8...
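To see what that means in practice, here is a minimal sketch; NoteAttribute is a hypothetical attribute invented for this illustration, and how an unpaired surrogate survives the UTF-8 round trip through metadata is implementation dependent:
using System;

// Hypothetical attribute, defined only for this illustration.
[AttributeUsage(AttributeTargets.Class)]
class NoteAttribute : Attribute
{
    public readonly string Text;
    public NoteAttribute(string text) { Text = text; }
}

// The string argument below is stored in assembly metadata, where string
// blobs are UTF-8 rather than UTF-16.
[Note("\uDDDD")] // an unpaired surrogate has no valid UTF-8 encoding
class Target { }

class Program
{
    static void Main()
    {
        var note = (NoteAttribute)Attribute.GetCustomAttribute(
            typeof(Target), typeof(NoteAttribute));
        // Depending on the compiler/runtime, the code unit may come back
        // changed (for example as U+FFFD) rather than as 0xDDDD.
        Console.WriteLine(((int)note.Text[0]).ToString("X4"));
    }
}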
Upvotes: 1
Reputation: 402
Well i think invalid codepoints within a .NET String can only occur if someone sets an individual element to a hi- or lo-surrogate. It can also happen that someone deletes a hi- or lo-surrogate from a valid surrogate pair, the latter can not just happen by deletion of an element but also by changing the value of an element. In my opinion, the answer is "yes", it can happen and the only reason can be that there is an orphaned hi- or lo-surrogate within the string. Do you have a real example string? Post it here and i can check what's wrong.
B.t.w. this is true for UTF-16 files as well. It can happen. For an utf-16LE file with 0xFFEE BOM be sure that your first character isn't a 0, because then your first 4 Bytes are 0xFFFE0000 which sure will be interpreted as a utf-32LE BOM instead of a utf-16LE BOM!
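A minimal sketch of that ambiguity (my own illustration): the same four bytes decode differently depending on which encoding the reader assumes.
using System;
using System.Text;

class BomDemo
{
    static void Main()
    {
        // FF FE 00 00: a UTF-16LE BOM followed by U+0000,
        // or equally a UTF-32LE BOM on its own.
        byte[] bytes = { 0xFF, 0xFE, 0x00, 0x00 };

        string asUtf16 = Encoding.Unicode.GetString(bytes); // "\uFEFF\u0000"
        string asUtf32 = Encoding.UTF32.GetString(bytes);   // "\uFEFF"

        Console.WriteLine(asUtf16.Length); // 2
        Console.WriteLine(asUtf32.Length); // 1
    }
}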
Upvotes: 0