Reputation: 2057

Return code point of characters in C#

How can I return the Unicode Code Point of a character? For example, if the input is "A", then the output should be "U+0041". Ideally, a solution should take care of surrogate pairs.

With code point I mean the actual code point according to Unicode, which is different from code unit (UTF8 has 8-bit code units, UTF16 has 16-bit code units and UTF32 has 32-bit code units, in the latter case the value is equal to the code point, after taking endianness into account).

Upvotes: 27

Answers (7)

DigitalDan

Reputation: 2637

In .NET Core 3.0 or later, you can use the Rune Struct:

// Note that 😉 and 👍 are encoded using surrogate pairs
// but A, B, C and ✋ are not
var runes = "ABC✋😉👍".EnumerateRunes();

foreach (var r in runes)
    Console.Write($"U+{r.Value:X4} ");
        
// Writes: U+0041 U+0042 U+0043 U+270B U+1F609 U+1F44D

Upvotes: 15

Călin Darie

Reputation: 6237

Actually there is some merit in @Yogendra Singh 's answer, currently the only one with negative voting. The job can be done like this

    public static IEnumerable<int> Utf8ToCodePoints(this string s)
    {
        var utf32Bytes = Encoding.UTF32.GetBytes(s);
        var bytesPerCharInUtf32 = 4;
        Debug.Assert(utf32bytes.Length % bytesPerCharInUtf32 == 0);
        for (int i = 0; i < utf32bytes.Length; i+= bytesPerCharInUtf32)
        {
            yield return BitConverter.ToInt32(utf32bytes, i);
        }
    }

Tested with

    var surrogatePairInput = "abc💩";
    Debug.Assert(surrogatePairInput.Length == 5);
    var pointsAsString = string.Join(";" , 
        surrogatePairInput
        .Utf8ToCodePoints()
        .Select(p => $"U+{p:X4}"));
    Debug.Assert(pointsAsString == "U+0061;U+0062;U+0063;U+1F4A9");

Example is relevant because the pile of poo is represented as a surrogate pair.

Upvotes: 2

Mahmoud Al-Qudsi

Reputation: 29529

C# cannot store unicode codepoints in a char, as char is only 2 bytes and unicode codepoints routinely exceed that length. The solution is to either represent a codepoint as a sequence of bytes (either as a byte array or "flattened" into a 32-bit primitive) or as a string. The accepted answer converts to UTF32, but that's not always ideal.

This is the code we use to split a string into its unicode codepoint components, but preserving the native UTF-16 encoding. The result is an enumerable that can be used to compare (sub)strings natively in C#/.NET:

    public class InvalidEncodingException : System.Exception
    { }

    public static IEnumerable<string> UnicodeCodepoints(this string s)
    {
        for (int i = 0; i < s.Length; ++i)
        {
            if (Char.IsSurrogate(s[i]))
            {
                if (s.Length < i + 2)
                {
                    throw new InvalidEncodingException();
                }
                yield return string.Format("{0}{1}", s[i], s[++i]);
            }
            else
            {
                yield return string.Format("{0}", s[i]);
            }
        }
    }
}

Upvotes: 4

driis

Reputation: 164291

Easy, since chars in C# is actually UTF16 code points:

char x = 'A';
Console.WriteLine("U+{0:x4}", (int)x);

To address the comments, A char in C# is a 16 bit number, and holds a UTF16 code point. Code points above 16 the bit space cannot be represented in a C# character. Characters in C# is not variable width. A string however can have 2 chars following each other, each being a code unit, forming a UTF16 code point. If you have a string input and characters above the 16 bit space, you can use char.IsSurrogatePair and Char.ConvertToUtf32, as suggested in another answer:

string input = ....
for(int i = 0 ; i < input.Length ; i += Char.IsSurrogatePair(input,i) ? 2 : 1)
{
    int x = Char.ConvertToUtf32(input, i);
    Console.WriteLine("U+{0:X4}", x);
}

Upvotes: 14

dtb

Reputation: 217283

The following code writes the codepoints of a string input to the console:

string input = "\uD834\uDD61";

for (var i = 0; i < input.Length; i += char.IsSurrogatePair(input, i) ? 2 : 1)
{
    var codepoint = char.ConvertToUtf32(input, i);

    Console.WriteLine("U+{0:X4}", codepoint);
}

Output:

U+1D161

Since strings in .NET are UTF-16 encoded, the char values that make up the string need to be converted to UTF-32 first.

Upvotes: 16

Esailija

Reputation: 140220

public static string ToCodePointNotation(char c)
{

    return "U+" + ((int)c).ToString("X4");
}

Console.WriteLine(ToCodePointNotation('a')); //U+0061

Upvotes: -2

Yogendra Singh

Reputation: 34367

I found a little method on msdn forum. Hope this helps.

    public int get_char_code(char character){ 
        UTF32Encoding encoding = new UTF32Encoding(); 
        byte[] bytes = encoding.GetBytes(character.ToString().ToCharArray()); 
        return BitConverter.ToInt32(bytes, 0); 
    }

Upvotes: -1

Return code point of characters in C#

Answers (7)

Related Questions