parag

Reputation: 2573

Thai character issues in Unicode string

I have a string containing a few Thai characters. The string uses Unicode characters, but I don't see the Thai characters in the IDE, or even when I write the string to a text file. To see the Thai characters properly, I have to write the following code:

 var text = "M_M-150 150CC. เดี่ยว (2 For 18 Save 2)";
 var ascii = Encoding.Default.GetBytes(text);           
 text = Encoding.UTF8.GetString(ascii);

After applying the above logic, I can see the string correctly, with Thai characters. Here is the output:

// notice the Thai characters เดี่ยว in the string M_M-150 150CC. เดี่ยว (2 For 18 Save 2)

I am not sure why I need to apply the above logic to see the Thai characters even though the string is Unicode. What exactly is Encoding.Default doing in this case?

Upvotes: 0

Views: 14440

Answers (1)

Kaj

Reputation: 814

From MSDN, here is what the Encoding.Default property is:

Different computers can use different encodings as the default, and the default encoding can even change on a single computer. Therefore, data streamed from one computer to another or even retrieved at different times on the same computer might be translated incorrectly. In addition, the encoding returned by the Default property uses best-fit fallback to map unsupported characters to characters supported by the code page. For these two reasons, using the default encoding is generally not recommended. To ensure that encoded bytes are decoded properly, you should use a Unicode encoding, such as UTF8Encoding or UnicodeEncoding, with a preamble. Another option is to use a higher-level protocol to ensure that the same format is used for encoding and decoding.

The string is encoded to bytes with Encoding.Default, but then decoded using Encoding.UTF8. So the deciding step is not Encoding.Default; it is Encoding.UTF8, which takes the bytes and converts them to a string correctly.
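To make the encode-then-decode step concrete, here is a minimal sketch. It assumes the incoming string was produced by decoding UTF-8 bytes with a single-byte code page; the question does not confirm this. ISO-8859-1 stands in for Encoding.Default, whose real value depends on the machine and runtime, and the class and method names (MojibakeDemo, Garble, Repair) are made up for illustration.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // ISO-8859-1 stands in for a single-byte default code page (an assumption;
    // the real Encoding.Default depends on the machine and runtime).
    static readonly Encoding SingleByte = Encoding.GetEncoding("iso-8859-1");

    // Simulates text whose UTF-8 bytes were decoded with the wrong encoding.
    public static string Garble(string text) =>
        SingleByte.GetString(Encoding.UTF8.GetBytes(text));

    // Mirrors the question's fix: recover the original bytes, then decode them as UTF-8.
    public static string Repair(string garbled) =>
        Encoding.UTF8.GetString(SingleByte.GetBytes(garbled));

    static void Main()
    {
        string original = "เดี่ยว";
        string garbled = Garble(original);
        Console.WriteLine(garbled == original);          // False: the text is mis-decoded
        Console.WriteLine(Repair(garbled) == original);  // True: the bytes survived intact
    }
}
```

The repair only works because ISO-8859-1 maps every byte to exactly one character and back, so no bytes are lost along the way. A default code page without that property could lose data in the same round trip.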

The same applies if you print the string to the console. Take a look at both cases: [image] The second line was printed with the UTF-8 configuration. You can configure your console to support UTF-8 by adding this line:

Console.OutputEncoding = Encoding.UTF8;

Even with your code, the result in a file will look like this: [image]

Converting the string to bytes with Encoding.UTF8 instead:

var text = "M_M-150 150CC. เดี่ยว (2 For 18 Save 2)";
var ascii = Encoding.UTF8.GetBytes(text);
text = Encoding.UTF8.GetString(ascii);

the result will be:

[image]

If you take a look at Supported Scripts, you'll see that UTF-8 supports all Unicode characters, including Thai.

Note that Encoding.Default may not be able to handle Chinese or Japanese text, for example. Take this case:

var text = "漢字";
var ascii = Encoding.Default.GetBytes(text);
text = Encoding.UTF8.GetString(ascii);

Here is the output from a text file:

[image]

Here, if you try to write it to a text file, it will not be converted successfully.
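As a rough illustration of why the conversion fails, the sketch below pushes text through an encoding that has no CJK characters. Encoding.ASCII is used only as a stand-in, since the actual default encoding varies by machine; LossyEncodingDemo and ThroughAscii are hypothetical names.

```csharp
using System;
using System.Text;

static class LossyEncodingDemo
{
    // Encodes with ASCII, which has no CJK characters, then decodes the result.
    public static string ThroughAscii(string text) =>
        Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(text));

    static void Main()
    {
        Console.WriteLine(ThroughAscii("漢字"));  // ??
        Console.WriteLine(ThroughAscii("abc"));   // abc
    }
}
```

Once an unsupported character has been replaced with '?' in the byte array, no later decoding step can bring it back.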

So you have to read and write it using UTF-8:

 var text = "漢字";
 var ascii = Encoding.UTF8.GetBytes(text);
 text = Encoding.UTF8.GetString(ascii);

and you'll get this:

[image]

So, as I said, the whole process depends on UTF-8, not on the default encoding.
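For the text-file case, a small sketch of reading and writing with an explicit UTF-8 encoding could look like this. The file name and the Utf8FileDemo and RoundTrip names are hypothetical; File.WriteAllText and File.ReadAllText are the standard System.IO overloads that take an Encoding.

```csharp
using System;
using System.IO;
using System.Text;

static class Utf8FileDemo
{
    // Writes text as UTF-8 and reads it back with the same encoding.
    public static string RoundTrip(string path, string text)
    {
        File.WriteAllText(path, text, Encoding.UTF8);
        return File.ReadAllText(path, Encoding.UTF8);
    }

    static void Main()
    {
        // Hypothetical file name in the temp directory.
        string path = Path.Combine(Path.GetTempPath(), "utf8-demo.txt");
        string text = "漢字 เดี่ยว";
        Console.WriteLine(RoundTrip(path, text) == text); // True
        File.Delete(path);
    }
}
```

Passing the encoding on both sides means the result does not depend on whatever the machine's default happens to be.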

Upvotes: 6
