Andrei Bozantan

Reputation: 3921

.NET String object and invalid Unicode code points

Is it possible that a .NET String object will contain an invalid Unicode code point?

If yes, how could this happen, and how can I determine whether a string contains such invalid chars?

Upvotes: 4

Views: 3333

Answers (4)

Andrei Bozantan

Reputation: 3921

Although the answer given by @DPenner is excellent (and I used it as a starting point), I want to add some details. Besides orphaned surrogates, which I think are a clear sign of an invalid string, a string may also contain unassigned code points. The .NET Framework cannot treat this case as an error, because new characters are continually added to the Unicode standard; see, for example, the list of Unicode versions at http://en.wikipedia.org/wiki/Unicode#Versions. To make this concrete, the call Char.GetUnicodeCategory(Char.ConvertFromUtf32(0x1F01C), 0); returns UnicodeCategory.OtherNotAssigned on .NET 2.0, but UnicodeCategory.OtherSymbol on .NET 4.0.

Besides this, there is another interesting point: not even the .NET class library methods agree on how to handle the Unicode non-characters and the unpaired surrogate characters. For example:

  • unpaired surrogate char
    • System.Text.Encoding.Unicode.GetBytes("\uDDDD"); - returns { 0xfd, 0xff }, the UTF-16LE encoding of the replacement character U+FFFD; that is, the data is considered invalid.
    • "\uDDDD".Normalize(); - throws an exception with the message "Invalid Unicode code point found at index 0."; that is, the data is considered invalid.
  • noncharacter code points
    • System.Text.Encoding.Unicode.GetBytes("\uFFFF"); - returns { 0xff, 0xff }; that is, the data is considered valid.
    • "\uFFFF".Normalize(); - throws an exception with the message "Invalid Unicode code point found at index 0."; that is, the data is considered invalid.
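The disagreement listed above can be observed with a small sketch like the following. The class and helper names (EncodingVsNormalize, TryNormalize, Hex) are made up for illustration, and the outcome of Normalize() may depend on the runtime and platform, so no particular result is assumed for it here.

```csharp
using System;
using System.Text;

static class EncodingVsNormalize
{
    // Hypothetical helper: reports whether Normalize() accepts the string.
    public static bool TryNormalize(string s)
    {
        try
        {
            s.Normalize();
            return true;
        }
        catch (ArgumentException)
        {
            // Normalize() signals rejected input with an ArgumentException.
            return false;
        }
    }

    // Hypothetical helper: formats bytes as "FD-FF" style hex.
    public static string Hex(byte[] bytes)
    {
        return BitConverter.ToString(bytes);
    }

    public static void Main()
    {
        // An unpaired low surrogate and a noncharacter, as in the list above.
        foreach (string s in new[] { "\uDDDD", "\uFFFF" })
        {
            byte[] bytes = Encoding.Unicode.GetBytes(s);
            Console.WriteLine("U+{0:X4}: GetBytes -> {1}, Normalize accepted: {2}",
                (int)s[0], Hex(bytes), TryNormalize(s));
        }
    }
}
```

Comparing the two lines of output shows whether the encoder and the normalizer treat the same input consistently on a given runtime.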

Below is a method which will search for invalid chars in a string:

/// <summary>
/// Searches for invalid characters (non-chars defined in the Unicode standard and invalid surrogate pairs) in a string
/// </summary>
/// <param name="aString"> the string to search for invalid chars </param>
/// <returns>the index of the first bad char or -1 if no bad char is found</returns>
static int FindInvalidCharIndex(string aString)
{
    int ch;
    int chlow;

    for (int i = 0; i < aString.Length; i++)
    {
        ch = aString[i];
        if (ch < 0xD800) // char is up to first high surrogate
        {
            continue;
        }
        if (ch >= 0xD800 && ch <= 0xDBFF)
        {
            // found high surrogate -> check surrogate pair
            i++;
            if (i == aString.Length)
            {
                // last char is high surrogate, so it is missing its pair
                return i - 1;
            }

            chlow = aString[i];
            if (!(chlow >= 0xDC00 && chlow <= 0xDFFF))
            {
                // did not find a low surrogate after the high surrogate
                return i - 1;
            }

            // convert to UTF32 - like in Char.ConvertToUtf32(highSurrogate, lowSurrogate)
            ch = (ch - 0xD800) * 0x400 + (chlow - 0xDC00) + 0x10000;
            if (ch > 0x10FFFF)
            {
                // invalid Unicode code point - maximum exceeded
                return i;
            }
            if ((ch & 0xFFFE) == 0xFFFE)
            {
                // other non-char found
                return i;
            }
            // found a good surrogate pair
            continue;
        }

        if (ch >= 0xDC00 && ch <= 0xDFFF)
        {
            // unexpected low surrogate
            return i;
        }

        if (ch >= 0xFDD0 && ch <= 0xFDEF)
        {
            // non-chars are considered invalid by System.Text.Encoding.GetBytes() and String.Normalize()
            return i;
        }

        if ((ch & 0xFFFE) == 0xFFFE)
        {
            // other non-char found
            return i;
        }
    }

    return -1;
}
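If only surrogate pairing matters, a narrower check can lean on the framework's own char.IsHighSurrogate and char.IsLowSurrogate. This is a minimal sketch, not a replacement for the method above: it does not flag noncharacters, and the names SurrogateCheck and FindUnpairedSurrogate are made up for illustration.

```csharp
using System;

static class SurrogateCheck
{
    // Returns the index of the first unpaired surrogate, or -1 if every
    // surrogate is part of a well-formed pair. Noncharacters are NOT checked.
    public static int FindUnpairedSurrogate(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsHighSurrogate(s[i]))
            {
                if (i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
                {
                    i++; // skip the low half of a well-formed pair
                    continue;
                }
                return i; // high surrogate without a following low surrogate
            }
            if (char.IsLowSurrogate(s[i]))
            {
                return i; // low surrogate without a preceding high surrogate
            }
        }
        return -1;
    }

    public static void Main()
    {
        Console.WriteLine(FindUnpairedSurrogate("abc"));            // -1
        Console.WriteLine(FindUnpairedSurrogate("a\uD83D\uDE00b")); // -1
        Console.WriteLine(FindUnpairedSurrogate("a\uDDDDb"));       // 1
        Console.WriteLine(FindUnpairedSurrogate("ab\uD83D"));       // 2
    }
}
```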

Upvotes: 10

DPenner1

Reputation: 10452

Yes, it is possible. According to Microsoft's documentation, a .NET String is simply

A String object is a sequential collection of System.Char objects that represent a string.

while a .NET Char

Represents a character as a UTF-16 code unit.

Taken together, this means that a .NET String is just a sequence of UTF-16 code units, whether or not they form a valid string according to the Unicode standard. There are many ways this can occur; some of the more common ones I can think of are:

  • A non-UTF-16 byte stream being mistakenly put into a String object without proper conversion.
  • A String object being split in the middle of a surrogate pair.
  • Someone purposely included such a String to test the system's robustness.

As a result, the following C# code is completely legal and will compile:

class Test
{
    static void Main()
    {
        string s =
            "\uEEEE" + // A private use character
            "\uDDDD" + // An unpaired surrogate character
            "\uFFFF" + // A Unicode noncharacter
            "\u0888";  // Unassigned when this was written (later Unicode versions may assign it)
        System.Console.WriteLine(s); // Output is highly console dependent
    }
}
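The second cause listed earlier, splitting a string in the middle of a surrogate pair, is easy to reproduce by accident. The sketch below shows one way it might happen with Substring; SafePrefix is a hypothetical helper written for this illustration, not a framework API.

```csharp
using System;

static class SplitPair
{
    // Illustrative helper: returns a prefix of at most `length` code units,
    // backing off by one if the cut would fall inside a surrogate pair.
    public static string SafePrefix(string s, int length)
    {
        if (length > 0 && length < s.Length
            && char.IsHighSurrogate(s[length - 1]) && char.IsLowSurrogate(s[length]))
        {
            length--;
        }
        return s.Substring(0, length);
    }

    public static void Main()
    {
        // "a" followed by U+1F600, which UTF-16 stores as the pair D83D DE00.
        string s = "a\U0001F600";
        Console.WriteLine(s.Length); // 3 (code units, not characters)

        // Cutting by code-unit count splits the pair and orphans the high half.
        string cut = s.Substring(0, 2);
        Console.WriteLine(char.IsHighSurrogate(cut[1])); // True

        // The helper keeps the pair intact by dropping it entirely.
        Console.WriteLine(SafePrefix(s, 2).Length); // 1
    }
}
```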

Upvotes: 6

szKarlen

Reputation: 66

All strings in .NET and C# are encoded using UTF-16, but with an exception (taken from Jon Skeet's blog):

...there are two different representations: most of the time, UTF-16 is used, but attribute constructor arguments use UTF-8...

Upvotes: 1

brighty

Reputation: 402

Well, I think invalid code points within a .NET String can only occur if someone sets an individual element to a high or low surrogate. It can also happen that someone removes a high or low surrogate from a valid surrogate pair, either by deleting the element or by changing its value. In my opinion, the answer is "yes": it can happen, and the only cause is an orphaned high or low surrogate within the string. Do you have a real example string? Post it here and I can check what's wrong.

By the way, this is true for UTF-16 files as well. For a UTF-16LE file with the BOM 0xFFFE (bytes FF FE), make sure that your first character isn't U+0000, because then your first 4 bytes are 0xFFFE0000, which will most likely be interpreted as a UTF-32LE BOM instead of a UTF-16LE BOM!
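The BOM ambiguity described above can be sketched with a toy detector. GuessFromBom is a hypothetical function written for this illustration; it covers only the three byte patterns shown and does not reflect how any particular library sniffs encodings.

```csharp
using System;

static class BomSniffer
{
    // Hypothetical helper: guesses an encoding name from the leading bytes.
    // The 4-byte UTF-32LE pattern is tested first, as longest-match detectors
    // typically do, which is exactly what creates the ambiguity.
    public static string GuessFromBom(byte[] b)
    {
        if (b.Length >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
            return "UTF-32LE"; // also matches a UTF-16LE BOM followed by U+0000
        if (b.Length >= 2 && b[0] == 0xFF && b[1] == 0xFE)
            return "UTF-16LE";
        if (b.Length >= 2 && b[0] == 0xFE && b[1] == 0xFF)
            return "UTF-16BE";
        return "unknown";
    }

    public static void Main()
    {
        // UTF-16LE BOM followed by "A" (41 00).
        Console.WriteLine(GuessFromBom(new byte[] { 0xFF, 0xFE, 0x41, 0x00 })); // UTF-16LE

        // UTF-16LE BOM followed by U+0000 (00 00): same bytes as a UTF-32LE BOM.
        Console.WriteLine(GuessFromBom(new byte[] { 0xFF, 0xFE, 0x00, 0x00 })); // UTF-32LE
    }
}
```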

Upvotes: 0
