CD DelRio

Reputation: 71

How to check for invalid UTF-8 characters?

There are many supported hexadecimal (UTF-8) entities, with decimal values ranging from 0 to 10175. Is there a fast way to check whether a value held in a variable is one of these supported hexadecimal (UTF-8) entities?

e.g.

var something = "some string value";
char[] validCharacters = new char[] { /* all 10175 UTF-8 hexadecimal characters */ };
if (something.All(c => validCharacters.Contains(c)))
{
    // do something
}

How can I do this check the fastest way possible?

Upvotes: 3

Views: 15239

Answers (2)

ispiro

Reputation: 27633

UTF8Encoding.GetString(byteArray) will throw an ArgumentException (more precisely a DecoderFallbackException, which derives from it) if error detection is enabled.

Source: https://msdn.microsoft.com/en-us/library/kzb9f993(v=vs.110).aspx

But if you're testing something that is already a string, then as far as I know it will almost always be valid UTF-8 (see below). As far as I know, all C# strings are encoded in UTF-16, which is an encoding for all Unicode characters. UTF-8 is just a different encoding of the same set, i.e. of all Unicode characters.

(This might exclude some Unicode characters that are new, etc. But those will not be in UTF-16 either, so that won't matter here.)

As someone has commented, a string may contain "halves" of UTF-16 characters (unpaired surrogates). Such a string is a valid C# string but cannot be converted to valid UTF-8. To verify, you can call GetBytes() on a UTF8Encoding that has error detection enabled; it throws on an unpaired surrogate. But those will probably be quite rare.
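A minimal sketch of such a check, assuming a UTF8Encoding constructed with error detection turned on. The class and method names (SurrogateCheck, IsEncodableAsUtf8) are hypothetical and exist only for illustration:

```csharp
using System;
using System.Text;

static class SurrogateCheck
{
    // Strict encoding: no BOM, throw on invalid input instead of
    // silently substituting the replacement character.
    private static readonly UTF8Encoding StrictUtf8 = new UTF8Encoding(false, true);

    // Hypothetical helper: returns true if the string can be encoded as
    // UTF-8, i.e. it contains no unpaired surrogates.
    public static bool IsEncodableAsUtf8(string s)
    {
        try
        {
            // GetByteCount walks the string without allocating the byte array.
            StrictUtf8.GetByteCount(s);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }
}
```

With this sketch, SurrogateCheck.IsEncodableAsUtf8("abc") should return true, while a string that starts with a lone high surrogate such as "\uD83D" + "X" should return false. Note that this only detects malformed UTF-16; it says nothing about whether the code points are assigned.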

EDIT

Enabling error detection: use the UTF8Encoding(Boolean, Boolean) constructor and pass true as the second argument (throwOnInvalidBytes).
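As a rough sketch of how this could be used to validate a byte array, assuming the two-argument constructor described above. The names Utf8Validation and IsValidUtf8 are made up for this example:

```csharp
using System;
using System.Text;

static class Utf8Validation
{
    // Second argument (throwOnInvalidBytes) = true enables error detection.
    private static readonly UTF8Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Hypothetical helper: returns true if the bytes decode cleanly as UTF-8.
    public static bool IsValidUtf8(byte[] bytes)
    {
        try
        {
            StrictUtf8.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            // DecoderFallbackException derives from ArgumentException.
            return false;
        }
    }
}
```

For example, the two bytes 0xC3 0xA9 (the UTF-8 form of "é") should pass, whereas a lone 0xC3 or a lone 0xFF should fail. If the decoded string is not needed, GetCharCount could be used instead of GetString to avoid the allocation.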

Upvotes: 4

xanatos

Reputation: 111810

This should return what you asked for. It checks both that there are no unpaired high/low surrogates and that there are no undefined code points (where "defined" depends on the Unicode tables present in the version of .NET you are using and on the version of the operating system).

// Requires: using System.Globalization; (for UnicodeCategory)
static bool IsLegalUnicode(string str)
{
    for (int i = 0; i < str.Length; i++)
    {
        var uc = char.GetUnicodeCategory(str, i);

        if (uc == UnicodeCategory.Surrogate)
        {
            // Unpaired surrogate, like  "😵"[0] + "A" or  "😵"[1] + "A"
            return false;
        }
        else if (uc == UnicodeCategory.OtherNotAssigned)
        {
            // Unassigned code point, like \uFF00 or, in the tables used here, \U00030000
            return false;
        }

        // Correct high-low surrogate, we must skip the low surrogate
        // (it is correct because otherwise it would have been a 
        // UnicodeCategory.Surrogate)
        if (char.IsHighSurrogate(str, i))
        {
            i++;
        }
    }

    return true;
}

Note that Unicode keeps expanding. UTF-8 can encode every Unicode code point, including the ones that have not been assigned yet.

Some examples:

var test1 = IsLegalUnicode("abcdeàèéìòù"); // true
var test2 = IsLegalUnicode("⭐ White Medium Star"); // true, Unicode 5.1
var test3 = IsLegalUnicode("😁 Beaming Face With Smiling Eyes"); // true, Unicode 6.0
var test4 = IsLegalUnicode("🙂 Slightly Smiling Face"); // true, Unicode 7.0
var test5 = IsLegalUnicode("🤗 Hugging Face"); // true, Unicode 8.0
var test6 = IsLegalUnicode("🤣 Rolling on the Floor Laughing"); // false, Unicode 9.0 (2016)

var test7 = IsLegalUnicode("🤩 Star-Struck"); // false, Unicode 10.0 (2017)

var test8 = IsLegalUnicode("\uFF00"); // false, unassigned BMP code point

var test9 = IsLegalUnicode("😀"[0] + "X"); // false, unpaired high surrogate
var test10 = IsLegalUnicode("😀"[1] + "X"); // false, unpaired low surrogate

Note that well-formed but "unknown" Unicode code points, like 🤩 Star-Struck, can still be encoded in UTF-8.

Results taken with .NET 4.7.2 under Windows 10.
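To illustrate the point about "unknown" code points, here is a small sketch that encodes a string with a strict UTF8Encoding and prints the bytes as hex. The names UnknownCodePointDemo and ToUtf8Hex are hypothetical:

```csharp
using System;
using System.Text;

static class UnknownCodePointDemo
{
    // Hypothetical helper: returns the UTF-8 bytes of s as a hex string.
    // Throws EncoderFallbackException if s contains an unpaired surrogate.
    public static string ToUtf8Hex(string s)
    {
        var strictUtf8 = new UTF8Encoding(false, true);
        return BitConverter.ToString(strictUtf8.GetBytes(s));
    }
}
```

Calling UnknownCodePointDemo.ToUtf8Hex("\U0001F929") (the Star-Struck emoji) should give "F0-9F-A4-A9" regardless of whether the runtime's Unicode tables know that code point, because the encoder only looks at whether the UTF-16 is well-formed, not at whether the code point is assigned.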

Upvotes: 7
