Reputation: 2057
How can I return the Unicode code point of a character? For example, if the input is "A", then the output should be "U+0041". Ideally, a solution should take care of surrogate pairs.
By code point I mean the actual code point according to Unicode, which is different from a code unit (UTF-8 has 8-bit code units, UTF-16 has 16-bit code units, and UTF-32 has 32-bit code units; in the latter case the value is equal to the code point, after taking endianness into account).
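To illustrate the distinction, a minimal C# sketch (the emoji is one code point but two UTF-16 code units):
// "😉" is U+1F609, which UTF-16 encodes as the surrogate
// pair 0xD83D 0xDE09, i.e. two 16-bit code units.
string s = "😉";
Console.WriteLine(s.Length);                               // 2 (code units)
Console.WriteLine("U+{0:X4}", char.ConvertToUtf32(s, 0));  // U+1F609 (one code point)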
Upvotes: 27
Views: 15216
Reputation: 2637
In .NET Core 3.0 or later, you can use the Rune struct:
// Note that 😉 and 👍 are encoded using surrogate pairs
// but A, B, C and ✋ are not
var runes = "ABC✋😉👍".EnumerateRunes();
foreach (var r in runes)
    Console.Write($"U+{r.Value:X4} ");
// Writes: U+0041 U+0042 U+0043 U+270B U+1F609 U+1F44D
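As a side note, a Rune also reports how many UTF-16 code units it occupies, which makes the surrogate-pair cases visible directly; a small sketch:
foreach (var r in "A✋😉".EnumerateRunes())
    Console.WriteLine($"U+{r.Value:X4} occupies {r.Utf16SequenceLength} UTF-16 code unit(s)");
// U+0041 and U+270B occupy 1 code unit each; U+1F609 occupies 2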
Upvotes: 15
Reputation: 6237
Actually, there is some merit in @Yogendra Singh's answer, currently the only one with a negative score. The job can be done like this:
public static IEnumerable<int> Utf8ToCodePoints(this string s)
{
    // Re-encode the UTF-16 string as UTF-32, where every code point
    // occupies exactly four bytes.
    var utf32Bytes = Encoding.UTF32.GetBytes(s);
    const int bytesPerCharInUtf32 = 4;
    Debug.Assert(utf32Bytes.Length % bytesPerCharInUtf32 == 0);
    for (int i = 0; i < utf32Bytes.Length; i += bytesPerCharInUtf32)
    {
        yield return BitConverter.ToInt32(utf32Bytes, i);
    }
}
Tested with
var surrogatePairInput = "abc💩";
Debug.Assert(surrogatePairInput.Length == 5);
var pointsAsString = string.Join(";",
    surrogatePairInput
        .Utf8ToCodePoints()
        .Select(p => $"U+{p:X4}"));
Debug.Assert(pointsAsString == "U+0061;U+0062;U+0063;U+1F4A9");
The example is relevant because the pile of poo emoji is represented as a surrogate pair.
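One caveat: BitConverter.ToInt32 reads bytes in the machine's native byte order, while Encoding.UTF32 is always little-endian. A defensive variant (a sketch, not part of the original answer) picks the UTF-32 byte order to match the host:
// Arguments are (bigEndian, byteOrderMark); match the host's endianness
// so BitConverter decodes each 4-byte code point correctly.
var utf32 = new UTF32Encoding(!BitConverter.IsLittleEndian, false);
var bytes = utf32.GetBytes("💩");
Console.WriteLine($"U+{BitConverter.ToInt32(bytes, 0):X4}"); // U+1F4A9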
Upvotes: 2
Reputation: 29529
C# cannot store Unicode code points in a char, as a char is only 2 bytes and Unicode code points routinely exceed that range. The solution is to either represent a code point as a sequence of bytes (either as a byte array or "flattened" into a 32-bit primitive) or as a string. The accepted answer converts to UTF-32, but that's not always ideal.
This is the code we use to split a string into its Unicode code point components, while preserving the native UTF-16 encoding. The result is an enumerable that can be used to compare (sub)strings natively in C#/.NET:
public class InvalidEncodingException : System.Exception
{ }

public static IEnumerable<string> UnicodeCodepoints(this string s)
{
    for (int i = 0; i < s.Length; ++i)
    {
        if (Char.IsSurrogate(s[i]))
        {
            // A surrogate must be the high half of a pair followed by
            // its low half; anything else is malformed UTF-16.
            if (s.Length < i + 2 || !Char.IsSurrogatePair(s[i], s[i + 1]))
            {
                throw new InvalidEncodingException();
            }
            yield return string.Format("{0}{1}", s[i], s[++i]);
        }
        else
        {
            yield return string.Format("{0}", s[i]);
        }
    }
}
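A possible usage sketch (assuming the extension method above is in scope): each yielded element can be converted back to a numeric code point with char.ConvertToUtf32:
foreach (var cp in "ABC💩".UnicodeCodepoints())
    Console.Write($"U+{char.ConvertToUtf32(cp, 0):X4} ");
// Writes: U+0041 U+0042 U+0043 U+1F4A9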
Upvotes: 4
Reputation: 164291
Easy, since a char in C# is actually a UTF-16 code unit, and for characters in the Basic Multilingual Plane the code unit's value equals the code point:
char x = 'A';
Console.WriteLine("U+{0:x4}", (int)x);
To address the comments: a char in C# is a 16-bit number, and holds a UTF-16 code unit. Code points above the 16-bit space cannot be represented in a single C# char. Characters in C# are not variable-width. A string, however, can have two chars following each other, each being a code unit, that together form a single code point (a surrogate pair). If you have a string input and characters above the 16-bit space, you can use char.IsSurrogatePair and Char.ConvertToUtf32, as suggested in another answer:
string input = ....
for (int i = 0; i < input.Length; i += Char.IsSurrogatePair(input, i) ? 2 : 1)
{
    int x = Char.ConvertToUtf32(input, i);
    Console.WriteLine("U+{0:X4}", x);
}
Upvotes: 14
Reputation: 217283
The following code writes the code points of a string input to the console:
string input = "\uD834\uDD61";
for (var i = 0; i < input.Length; i += char.IsSurrogatePair(input, i) ? 2 : 1)
{
    var codepoint = char.ConvertToUtf32(input, i);
    Console.WriteLine("U+{0:X4}", codepoint);
}
Output:
U+1D161
Since strings in .NET are UTF-16 encoded, the char values that make up the string need to be converted to UTF-32 first.
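For reference, the conversion that char.ConvertToUtf32 performs on a surrogate pair is plain arithmetic; a hand-rolled sketch that produces the same result:
// code point = 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)
char high = '\uD834', low = '\uDD61';
int codepoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
Console.WriteLine("U+{0:X4}", codepoint); // U+1D161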
Upvotes: 16
Reputation: 140220
// Note: a lone char can only hold a character from the Basic
// Multilingual Plane, so this does not handle surrogate pairs.
public static string ToCodePointNotation(char c)
{
    return "U+" + ((int)c).ToString("X4");
}
Console.WriteLine(ToCodePointNotation('a')); //U+0061
Upvotes: -2
Reputation: 34367
I found a little method on an MSDN forum. Hope this helps.
public static int get_char_code(char character)
{
    // Encode the single character as UTF-32 (four bytes) and read it
    // back as an int. Like the answer above, this only works for BMP
    // characters, since a lone char cannot represent a surrogate pair.
    UTF32Encoding encoding = new UTF32Encoding();
    byte[] bytes = encoding.GetBytes(character.ToString().ToCharArray());
    return BitConverter.ToInt32(bytes, 0);
}
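For completeness, a quick usage sketch:
Console.WriteLine("U+{0:X4}", get_char_code('A')); // U+0041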
Upvotes: -1