Reputation: 193
This may sound like a stupid question. I typed some Chinese characters into an empty text file in the VS Code text editor (default UTF-8). Then I saved the file in an encoding for Japanese, Shift JIS, which apparently doesn't cover all the characters I had typed.
However, before I closed the file, all the Chinese characters were displayed properly in VS Code. After I closed the file and reopened it using the Shift JIS encoding, several characters are displayed as a question mark (?). I guess these are the Chinese characters not covered by the Japanese encoding?
What happened in the process? Is there any way I can 'get back' the Chinese characters that now show up as question marks? I don't really understand how encoding works in this scenario...
Upvotes: 2
Views: 685
Reputation: 14558
Unicode/UTF-8 is not locale-independent!
You didn't even have to save as Shift JIS; just opening the file on a machine with a different locale can change how it is displayed, as happened when your coworker opened the file (per your comment on another answer).
See https://tonsky.me/blog/unicode/ (there are no anchor links, so search for the heading "Unicode is locale-dependent").
That article explains that if you save a UTF-8 text file containing the single character U+8FD4, it will be rendered in five or more different ways depending on whether the file is opened on an OS with an English, Chinese, Simplified Chinese, Japanese, or Korean locale. And since most developers assume UTF-8 is locale-independent, there are few features for changing the locale per file; in some places it is outright impossible: if you have two filenames containing that character, one saved on a Japanese-locale computer and the other on a Chinese-locale computer, both files will have the same filename regardless.
Upvotes: 0
Reputation: 10643
Not all encodings cover all characters. (Unicode encodings, in principle, do, but even they don't have quite everything yet.) If you save some text in an encoding which does not include all characters in that text, something has to give.
Options:
- Refuse to save and report an error.
- Silently drop the characters the encoding cannot represent.
- Replace each unrepresentable character with a substitute, typically a question mark ? (which is what happened here).
Once that conversion is done, the data is lost and cannot be recovered. Why not use UTF-8 or another Unicode encoding? (GB 18030 might be the best for large amounts of Chinese text.)
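As a rough sketch of that lossy conversion (Python, standard codecs only; VS Code's exact behaviour may differ), encoding to Shift JIS with a replacement policy turns unmappable characters into literal question marks, and decoding the saved bytes afterwards cannot bring the originals back:

```python
text = "返事 简体字"  # '简' is a simplified Chinese character with no Shift JIS mapping

# Strict conversion fails outright: Shift JIS has no byte sequence for '简'.
try:
    text.encode("shift_jis")
except UnicodeEncodeError as err:
    print(err)

# With a replacement policy, the unmappable character is written as a literal '?'.
data = text.encode("shift_jis", errors="replace")
print(data.decode("shift_jis"))  # 返事 ?体字 -- the original '简' is gone for good
```

Round-tripping the same text through UTF-8 (or GB 18030) instead preserves every character.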
Upvotes: 2