Reputation: 11
I'm currently working on a C++ project on macOS, using Clang as my compiler. I've encountered a problem related to the locale settings when dealing with wide characters. Here is a simplified version of my code:
#include <iostream>
#include <locale>
#include <string>
using namespace std;

int main() {
    locale zhLocale("");
    wcin.imbue(zhLocale);
    wcout.imbue(zhLocale);
    wstring input;
    getline(wcin, input);
    wcout << input << endl;
    return 0;
}
The input is:
你好
The output is:
你你你好
While debugging, I found that the input variable contains L"\U00000002\U00000002你你你好".
The input is wrong both when I launch the program normally and when I run it under the debugger.
This is my environment:
$ clang++ --version
Apple clang version 16.0.0 (clang-1600.0.26.6)
Target: arm64-apple-darwin24.3.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
I would appreciate it if anyone could help me figure out what's going wrong and how to fix it. Is this a bug in Clang's handling of locale settings on macOS, or am I doing something wrong in my code?
I tried what I believe is the correct code, and I expected the output to equal the input.
When I remove imbue, this piece of code works just like cin.
#include <iostream>
#include <locale>
#include <string>
using namespace std;

int main() {
    wstring input;
    getline(wcin, input);
    wcout << input << endl;
    return 0;
}
Input:
你好
Output:
你好
However, when I open the debugger and check the contents of input, its data array is [L'\U0000fffd', L'\U00000001', L'\U00000006', L'\0', L'\n'] instead of ['你', '好'] as I expected. In this case, I can't iterate over the individual Chinese characters. The same happens with cin: I can't iterate over individual Chinese characters there either.
Upvotes: 1
Views: 90
Reputation: 52612
Wide characters on macOS are four bytes; you might be expecting two bytes.
Switch to UTF-8 if that is at all possible.
Upvotes: 0