Reputation: 339
I am trying to convert a String object containing a string representing an emoticon's Unicode format into a String with the same emoticon represented by the Unicode as its only character, e.g. converting "\u1F34E" to 🍎.
I attempted the following under the supposition the string's escape sequence would be properly processed:
String str = "\u1F34E";
Console.WriteLine("'{0}' to '{1}'", str, str.ToCharArray()[0]);
Output:
'\u1F34E' to '\'
Outputting the string directly to a text file yields the same result, so it is not just the debugger I am using. I am unsure what to do. Any help would be greatly appreciated.
EDIT:
I realize my original question was not clear; my intent was to produce a properly formatted UTF-16 string from a UTF-32 code point held in a string, as an API I was sending this value to required this formatting. I have successfully resolved the problem with the following:
String str = "1F34E"; //removed \u with prior parsing
int unicode_utf32 = int.Parse(str, System.Globalization.NumberStyles.HexNumber);
String unicode_utf16_str = Char.ConvertFromUtf32(unicode_utf32);
Console.WriteLine("'{0}' to '{1}'", str, unicode_utf16_str);
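The "prior parsing" step is not shown above, so here is a minimal sketch of how the whole conversion might look in one place. The class and method names (EmojiDecoder, FromEscapedHex) are hypothetical, and it assumes the input is either bare hex digits or hex digits preceded by a literal backslash and "u":

```csharp
using System;
using System.Globalization;

static class EmojiDecoder
{
    // Hypothetical helper: accepts "1F34E" or the six-plus character text \u1F34E
    // (a literal backslash followed by 'u') and returns the matching UTF-16 string.
    public static string FromEscapedHex(string escaped)
    {
        // Drop the leading \u (or \U) if it is present.
        string hex = escaped.StartsWith(@"\u", StringComparison.OrdinalIgnoreCase)
            ? escaped.Substring(2)
            : escaped;

        // Parse the remaining hex digits as a UTF-32 code point.
        int codePoint = int.Parse(hex, NumberStyles.HexNumber, CultureInfo.InvariantCulture);

        // ConvertFromUtf32 emits a surrogate pair for code points above U+FFFF.
        return char.ConvertFromUtf32(codePoint);
    }
}
```

Calling EmojiDecoder.FromEscapedHex(@"\u1F34E") should give a two-char string holding 🍎. Malformed input is not handled here; int.Parse and ConvertFromUtf32 will throw on bad hex or an invalid code point.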
Upvotes: 3
Views: 645
Reputation: 81493
This is not what it seems
string str = "\u1F34E";
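.NET uses UTF-16 to encode its strings, so each char is a 16-bit (two-byte) code unit, and a code point above U+FFFF takes two of them. The Unicode \u escape sequence therefore takes exactly four hex digits, U+0000 to U+FFFF (16-bit); the extended \U version takes eight, U+00000000 to U+FFFFFFFF in form (32-bit), although Unicode code points end at U+10FFFF. As a result, "\u1F34E" is read as U+1F34 followed by the letter E, not as a single character.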
The emoji 🍎 has a code point above U+FFFF (U+0001F34E), so it has to be encoded as a surrogate pair, two UTF-16 code units "\uD83C\uDF4E", or written in the combined form "\U0001F34E" (see footnote 1).
Example
string str = "\uD83C\uDF4E";
// or
string str = "\U0001F34E";
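To see where the pair D83C/DF4E comes from, the standard UTF-16 arithmetic can be sketched as follows. The class and method names are made up for illustration, and the method assumes its argument is already a code point above U+FFFF:

```csharp
using System;

static class SurrogateDemo
{
    // Splits a supplementary code point (U+10000 to U+10FFFF) into its
    // UTF-16 high and low surrogates. No range validation is done here.
    public static (char High, char Low) Split(int codePoint)
    {
        int offset = codePoint - 0x10000;              // 20-bit value
        char high = (char)(0xD800 + (offset >> 10));   // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return (high, low);
    }
}
```

For 0x1F34E this gives 0xD83C and 0xDF4E, the same two chars that char.ConvertFromUtf32(0x1F34E) returns. In real code the built-in method is the safer choice, since it also rejects invalid code points.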
If your goal is to separate actual text elements as opposed to characters, you could make use of StringInfo.GetTextElementEnumerator
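// requires: using System.Collections.Generic; using System.Globalization;
public static IEnumerable<string> ToElements(string source)
{
    var enumerator = StringInfo.GetTextElementEnumerator(source);
    // Each text element is returned whole, so a surrogate pair stays together.
    while (enumerator.MoveNext())
        yield return enumerator.GetTextElement();
}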
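The difference between chars and text elements is easiest to see by counting both. This is only a small sketch with hypothetical names (TextElementDemo, CodeUnits, TextElements); it relies on StringInfo.LengthInTextElements from System.Globalization:

```csharp
using System;
using System.Globalization;

static class TextElementDemo
{
    // Number of UTF-16 code units (what string.Length reports).
    public static int CodeUnits(string s) => s.Length;

    // Number of text elements; a surrogate pair counts as one.
    public static int TextElements(string s) => new StringInfo(s).LengthInTextElements;
}
```

For the string "a\U0001F34Eb", CodeUnits returns 4 while TextElements returns 3, because the emoji occupies two chars but forms a single text element.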
Note: My use of terminology might not be the most common or accurate; if you think it can be tightened up, feel free to edit.
1 Thanks to Mark Tolonen for pointing out that the Unicode escape sequence actually supports both 16-bit and 32-bit variants, \uXXXX and \UXXXXXXXX. More information can be found in a blog post by Jon Skeet, "Strings in C# and .NET".
Upvotes: 4