Craven Mueller

Reputation: 11

Clang locale issue on macOS when handling wide characters

I'm currently working on a C++ project on macOS, using Clang as my compiler. I've encountered a problem related to the locale settings when dealing with wide characters. Here is a simplified version of my code:

#include <iostream>
#include <locale>
#include <string>
using namespace std;
int main() {
    locale zhLocale("");
    wcin.imbue(zhLocale);
    wcout.imbue(zhLocale);

    wstring input;
    getline(wcin, input);
    wcout << input << endl;

    return 0;
}

The input is:

你好

The output is:

你你你好

While debugging, I found that the input variable contains L"\U00000002\U00000002你你你好".


This is my environment:

$ clang++ --version
Apple clang version 16.0.0 (clang-1600.0.26.6)
Target: arm64-apple-darwin24.3.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

I would appreciate it if anyone could help me figure out what's going wrong and how to fix it. Is this a bug in Clang's handling of locale settings on macOS, or am I doing something wrong in my code?

As far as I can tell the code is correct, so I expected the output to be identical to the input.


When I remove the imbue calls, the program echoes the input correctly, just as it would with cin:

#include <iostream>
#include <locale>
#include <string>
using namespace std;
int main() {
    wstring input;
    getline(wcin, input);
    wcout << input << endl;

    return 0;
}

Input:

你好

Output:

你好

However, when I open the debugger and inspect input, its data array holds [L'\U0000fffd', L'\U00000001', L'\U00000006', L'\0', L'\n'] instead of ['你', '好'] as I expected, so I can't iterate over the individual Chinese characters. The same is true with cin: I can't iterate over individual Chinese characters there either.

Upvotes: 1

Views: 90

Answers (1)

gnasher729

Reputation: 52612

Wide characters on macOS are four bytes (wchar_t is 32 bits wide); you might be expecting two bytes.
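
To make the size difference concrete, here is a small sketch. The helper name wchar_units_for is made up for illustration, and it assumes that a 2-byte wchar_t holds UTF-16 (as on Windows) and a 4-byte wchar_t holds UTF-32 (as on macOS), which the C++ standard itself does not guarantee.

```cpp
#include <cstddef>

// Hypothetical helper: how many wchar_t units one Unicode code point needs
// on the platform this is compiled for.
// Assumption: 2-byte wchar_t means UTF-16, 4-byte wchar_t means UTF-32.
constexpr std::size_t wchar_units_for(char32_t code_point) {
    // A 32-bit wchar_t holds any code point in a single unit. A 16-bit one
    // needs a surrogate pair for code points outside the Basic Multilingual
    // Plane (U+10000 and above).
    return (sizeof(wchar_t) >= 4 || code_point < 0x10000) ? 1 : 2;
}
```

Under those assumptions, 你 (U+4F60) and 好 (U+597D) each occupy a single wchar_t whether wchar_t is two or four bytes wide; only characters above U+FFFF, such as most emoji, differ between the two layouts.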

Switch to UTF-8 if that is at all possible.
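
One way to do that and still iterate over individual characters is to read the line as plain bytes and decode the UTF-8 yourself. The following is a minimal sketch, not a full validator: decode_utf8 is a hypothetical helper written for this answer, it assumes the terminal delivers UTF-8 (as with the en_US.UTF-8 locale shown in the question), and it does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into Unicode code points.
// Malformed or truncated sequences produce U+FFFD, one per offending byte.
// Simplification: overlong encodings and surrogates are not rejected.
std::u32string decode_utf8(const std::string& bytes) {
    const char32_t replacement = 0xFFFD;
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(replacement); ++i; continue; }

        // All continuation bytes must exist and look like 10xxxxxx.
        bool ok = i + extra < bytes.size();
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) { out.push_back(replacement); ++i; continue; }

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

A program would call std::getline(std::cin, line) on an ordinary std::string and then loop over decode_utf8(line); each element of the result is one code point, so 你好 yields two elements. No imbue call is needed for this, because the narrow streams pass the bytes through unchanged.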

Upvotes: 0
