Reputation: 7506
While experimenting with code units under UTF-8 in Visual Studio, I encountered many pitfalls:
By default, VS saves the source file with the system locale's encoding; for me that is GB2312 (code page 936, a Chinese encoding).
Solution: I use Save As and save the file as UTF-8 without signature (i.e. without a BOM).
Then I found that, by default, the compiler also interprets the source file with the system locale's encoding, which is still GB2312, so I got puzzling warnings and syntax errors.
Solution: I compile with /source-charset:utf-8
and the warnings and errors disappear. But size() returns 2 ('知' is encoded with 2 code units in GB2312), while it should be 3 under UTF-8.
'知' Unicode reference https://unicode-table.com/en/77E5/
(I think one can use any character that exists both in your current system encoding and in UTF-8, but with a different code unit count, to make a similar test.)
Code:
#include <iostream>
#include <string>
using namespace std;
int main() {
    string s = "知";
    cout << s.size() << endl;
    cout << s << endl;
}
Moreover, the Windows cmd as well as PowerShell use the system locale's encoding too (type chcp
in cmd to see the current code page). So I can't print characters like ə
.
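One possible workaround (behavior varies between Windows versions and console fonts, so treat this as an untested sketch) is to switch the console code page to UTF-8 before running the program:

```
chcp 65001
```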
So there are three things I need to take care of: the encoding the file is saved with, the encoding the compiler assumes for the source, and the console's encoding.
Besides, I have some confusion arising from this experience:
Why does Windows act like this? Can't it just set everything to UTF-8? I copied the same file to a Mac and everything worked as expected. And it's very easy to set the Mac terminal's encoding.
Some posts I found say the reason is that some encoding standards (like GB2312) were created before UTF-8 came out, and many of them are not compatible with UTF-8, so they continue to be used for compatibility.
But I wonder how the incompatibility would actually show up. E.g. I downloaded Notepad++ and installed all the language packs. My system's encoding is GB2312, but I can still change Notepad++'s display language to Japanese and it displays fine, with no ????
garbage.
Upvotes: 1
Views: 2751
Reputation: 179779
The term "source charset" is no coincidence here. The C++ standard explicitly differentiates between the (basic) source character set (96 common characters, all found in plain ASCII) and the execution character set.
Since you used UTF-8 as the source character set, 知 is mapped to \u77E5.
At runtime, however, you're using the execution character set. The VC++ /source-charset
option does not affect VC++'s execution character set; for that there is an /execution-charset
option. But as @Matteo Italia already notes, the VC++ runtime is known to be more than a little bit flaky when it comes to UTF-8 I/O. std::string::size
should work, but std::cout
might not.
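For completeness, the invocations would look like this (a sketch; the /utf-8 shorthand, which sets both charsets at once, is available in newer MSVC toolsets):

```
cl /source-charset:utf-8 /execution-charset:utf-8 main.cpp

rem /utf-8 is shorthand for both options:
cl /utf-8 main.cpp
```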
Upvotes: 2