Reputation: 107
I'm trying to convert a converted UTF-8 string to UTF-16, because I'm going to read a file and it comes like the var strUTF8
below.
For example, the entry would be the string "Não é possÃvel equipar"
and the return I needed is "Não é possível equipar"
.
static void Main(string[] args)
{
test3();
Console.ReadKey();
}
static void test3()
{
string str = "Não é possÃvel equipar";
string strUTF16 = Utf8ToUtf16(str);
Console.WriteLine(str);
Console.WriteLine(strUTF16);
}
static string Utf8ToUtf16(string utf8String)
{
byte[] utf8Bytes = Encoding.UTF8.GetBytes(utf8String);
byte[] unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);
return Encoding.Unicode.GetString(unicodeBytes);
}
I really don't know how to solve this. Any tips?
Upvotes: 0
Views: 955
Reputation: 595971
Your Utf8ToUtf16()
function is effectively a no-op. You are taking an arbitrary UTF-16 string as input, encoding it into UTF-8 bytes, then decoding those bytes as UTF-8 back into UTF-16. So, you effectively end up with the same string
value you started with. You may as well have just written the following, the result would be the same:
static string Utf8ToUtf16(string utf8String)
{
return utf8String;
}
That being said, Não é possÃvel equipar
is what you get when the UTF-8 encoded form of Não é possível equipar
is mis-interpreted as Latin (probably ISO-8859-1) or Windows-125x etc, instead of being properly interpreted as UTF-8 to begin with.
If you have a C# string
that contains such UTF-8 bytes which were up-scaled as-is to UTF-16 (why???), then you need to down-scale those characters as-is back into 8-bit bytes, and then you can decode those bytes as UTF-8, eg:
static void test3()
{
string str = "Não é possÃvel equipar";
string strUTF16 = Utf8ToUtf16(str);
Console.WriteLine(str);
Console.WriteLine(strUTF16);
}
static string Utf8ToUtf16(string utf8String)
{
byte[] utf8Bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(utf8String); // or: GetEncoding(28591)
return Encoding.UTF8.GetString(utf8Bytes);
}
Upvotes: 0
Reputation: 3182
If you want to read a file then you should read a file. When you read the file, specify the encoding of that file. If I'm not mistaken UTF8 is the default, so reading files encoded with UTF8 doesn't require the encoding to be specified. If you want to save that text to a file with a specific encoding, specify that encoding when saving the file.
var text = File.ReadAllText(filePath, Encoding.UTF8);
File.WriteAllText(filePath, text, Encoding.Unicode);
That will effectively convert a file from UTF8 encoding to UTF16. A more verbose version would be:
var data = File.ReadAllBytes(filePath);
var text = Encoding.UTF8.GetString(data);
data = Encoding.Unicode.GetBytes(text);
File.WriteAllBytes(filePath, data);
Upvotes: 1