user2537911

Why is the Default encoding in C# not recommended?

I Googled about encoding and found that the Default encoding is not recommended in C#. The full message is:

Different computers can use different encodings as the default, and the default encoding can even change on a single computer. Therefore, data streamed from one computer to another or even retrieved at different times on the same computer might be translated incorrectly. In addition, the encoding returned by the Default property uses best-fit fallback to map unsupported characters to characters supported by the code page. For these two reasons, using the default encoding is generally not recommended. To ensure that encoded bytes are decoded properly, your application should use a Unicode encoding, such as UTF8Encoding or UnicodeEncoding, with a preamble. Another option is to use a higher-level protocol to ensure that the same format is used for encoding and decoding.

Source MSDN

But how can a computer's default encoding be changed? I am not clear about the part "Different computers can use different encodings as the default".

Upvotes: 7

Views: 2532

Answers (2)

Hans Passant

Reputation: 941585

A lot of software from the previous century uses a single byte to store a character, agnostic of the demands of Unicode. A byte can only provide 256 distinct values, so such software can only handle text with a limited number of distinct characters.

Just about everybody agrees on what characters are represented by byte values 0 through 127: they are the characters in the ASCII character set, a standard from the early 1960s that assigned values to the letters and symbols of the English alphabet.

Which left another 128 unassigned values. Therein lies the rub: they can represent different characters in different places, used to represent non-English glyphs. Such as is necessary in languages like Greek and Russian, languages that don't use a Latin alphabet. Or Vietnamese and Polish, languages that have a Latin alphabet but use lots of diacritics to mark distinct sounds. And it gets especially convoluted for languages with very large alphabets, like Chinese, Korean and Japanese. Such languages require a double-byte encoding trick to squeeze the alphabet into 128 values.

The mapping of byte values to characters is called a code page. There are many code pages, even for a single language. English can be encoded, for example, in code page 437, the old IBM-PC character set. Distinctive for having box-drawing characters, commonly used in old DOS software and still the default for console mode programs. And code page 1252, an ANSI code page that's the default for Windows programs in Western Europe and the Americas. And code page 28591, ISO's lovely contribution to Babel's tower. And I ought to mention code page 37, used for IBM's EBCDIC encoding, a non-ASCII encoding that survived through IBM's prowess at selling mainframe computers. Otherwise a notable accident in history that standardized the size of a byte to 8 bits. And code page 65001, the one to end them all, the code page for UTF-8, a Unicode encoding that uses a variable-length 8-bit encoding.

This is bad. There is no way to tell from a text file which code page was used to encode the text in the file. You have to make an educated guess at it. If you guess wrong then you just get nonsense.
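A short sketch of that wrong guess in C#. It encodes text with one code page and deliberately decodes the bytes with another; the character names and the 0xE9 mapping come from the standard 1252 and 437 tables. (On .NET Core and .NET 5+, these legacy code pages require the `System.Text.Encoding.CodePages` package and a call to `Encoding.RegisterProvider` first.)

```csharp
using System;
using System.Text;

class CodePageDemo
{
    static void Main()
    {
        // On .NET Core / .NET 5+, legacy code pages must be registered
        // first (they ship in the System.Text.Encoding.CodePages package).
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // "café" encoded with code page 1252: 'é' becomes the single byte 0xE9.
        byte[] bytes = Encoding.GetEncoding(1252).GetBytes("café");

        // Decode the same bytes with the wrong guess, code page 437,
        // where byte 0xE9 maps to a different character (the Greek letter Θ).
        string wrongGuess = Encoding.GetEncoding(437).GetString(bytes);

        // Same bytes, different code page: nonsense instead of "café".
        Console.WriteLine(wrongGuess);
    }
}
```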

Encoding.Default will use the default ANSI encoding of the machine, configured in the Region and Language applet in Control Panel under the "Language for non-Unicode programs" setting. Changing it from the default is very unwise; it significantly increases the odds that old programs will produce nonsense from text files. It is code page 1252 in Western Europe and the Americas, 1251 for languages that use the Cyrillic alphabet, 1253 for Greek, 1256 for Arabic, etcetera. A list is here.
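You can inspect what Encoding.Default resolves to on your own machine. Note that this machine-dependence is specific to .NET Framework; on .NET Core and .NET 5+, Encoding.Default is defined to always be UTF-8.

```csharp
using System;
using System.Text;

class DefaultEncodingDemo
{
    static void Main()
    {
        // On .NET Framework this prints the machine's ANSI code page,
        // e.g. "windows-1252" / 1252 in Western Europe and the Americas.
        // On .NET Core and .NET 5+ it always prints "utf-8" / 65001.
        Console.WriteLine(Encoding.Default.WebName);
        Console.WriteLine(Encoding.Default.CodePage);
    }
}
```

Two machines running the same code can print different values here, which is exactly why bytes written with Encoding.Default on one machine may not decode correctly on another.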

You avoid this misery by avoiding Encoding.Default whenever you can. And favor UTF-8, a Unicode encoding that works very well with .NET's support for Unicode. And is the default for classes like StreamWriter and File. And is capable of writing a BOM, 3 distinct bytes at the start of the file that indicate the encoding used for the text, so that other programs can see what encoding you used. Only ever consider another encoding when you've got your back to the wall and are forced to work with legacy software.

Upvotes: 8

MrFox

Reputation: 5116

Encoding usually means which charset you are using. Most of the time UTF-8 is used, but Chinese characters, for example, need more than one byte in UTF-8, or a two-byte UTF-16 code unit, to be represented.
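A small sketch of that size difference, using the single Chinese character 中 (U+4E2D):

```csharp
using System;
using System.Text;

class ByteCountDemo
{
    static void Main()
    {
        string han = "中"; // one Chinese character, code point U+4E2D

        // UTF-8 needs three bytes for this character...
        Console.WriteLine(Encoding.UTF8.GetByteCount(han));

        // ...while UTF-16 (Encoding.Unicode in .NET) stores it
        // in a single two-byte code unit.
        Console.WriteLine(Encoding.Unicode.GetByteCount(han));
    }
}
```

Both encodings can represent the character; they just use a different number of bytes to do it.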

So what that message is saying: you should specify the charset you want to use, instead of assuming the client will be using UTF-8. For example, the first line of an XML file:

<?xml version="1.0" encoding="utf-8"?>

Upvotes: 1
