Reputation: 117
I've been using the function below to convert from the decimal representation of a Unicode code point to the UTF-8 encoding of the character itself in C++. The function I have at the moment works well on Linux / Unix systems, but it keeps returning the wrong characters on Windows.
void GetUnicodeChar(unsigned int code, char chars[5]) {
    if (code <= 0x7F) {
        chars[0] = (code & 0x7F); chars[1] = '\0';
    } else if (code <= 0x7FF) {
        // one continuation byte
        chars[1] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[0] = 0xC0 | (code & 0x1F); chars[2] = '\0';
    } else if (code <= 0xFFFF) {
        // two continuation bytes
        chars[2] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[1] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[0] = 0xE0 | (code & 0xF); chars[3] = '\0';
    } else if (code <= 0x10FFFF) {
        // three continuation bytes
        chars[3] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[2] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[1] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[0] = 0xF0 | (code & 0x7); chars[4] = '\0';
    } else {
        // unicode replacement character U+FFFD (bytes EF BF BD)
        chars[0] = 0xEF; chars[1] = 0xBF; chars[2] = 0xBD;
        chars[3] = '\0';
    }
}
Can anyone provide an alternative function or a fix for the current function I'm using that will work on Windows?
--UPDATE--
INPUT: 225
OUTPUT ON OSX: á
OUTPUT ON WINDOWS: ├í
Upvotes: 1
Views: 2918
Reputation: 88155
You don't show your code for printing, but presumably you're doing something like this:
char s[5];
GetUnicodeChar(225, s);
std::cout << s << '\n';
The reason you're getting okay output on OS X and bad output on Windows is because OS X uses UTF-8 as the default encoding and Windows uses some legacy encoding. So when you output UTF-8 on OS X, OS X assumes (correctly) that it's UTF-8 and displays it as such. When you output UTF-8 on Windows, Windows assumes (incorrectly) that it's some other encoding.
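One way to see that the function itself is fine is to dump the bytes it produces in hex instead of printing them as text; the console's code page then can't change what you see. A small self-contained sketch (a condensed copy of the question's function, with the invalid-range branch omitted for brevity, plus a hypothetical HexBytes helper):

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Condensed copy of the question's function so this example is self-contained.
void GetUnicodeChar(unsigned int code, char chars[5]) {
    if (code <= 0x7F) {
        chars[0] = (code & 0x7F); chars[1] = '\0';
    } else if (code <= 0x7FF) {
        chars[1] = 0x80 | (code & 0x3F); code >>= 6;
        chars[0] = 0xC0 | (code & 0x1F); chars[2] = '\0';
    } else if (code <= 0xFFFF) {
        chars[2] = 0x80 | (code & 0x3F); code >>= 6;
        chars[1] = 0x80 | (code & 0x3F); code >>= 6;
        chars[0] = 0xE0 | (code & 0xF); chars[3] = '\0';
    } else {
        chars[3] = 0x80 | (code & 0x3F); code >>= 6;
        chars[2] = 0x80 | (code & 0x3F); code >>= 6;
        chars[1] = 0x80 | (code & 0x3F); code >>= 6;
        chars[0] = 0xF0 | (code & 0x7); chars[4] = '\0';
    }
}

// Format each byte of a NUL-terminated string as two hex digits, e.g. "C3 A1 ".
std::string HexBytes(const char* s) {
    std::string out;
    char buf[4];
    for (; *s; ++s) {
        std::snprintf(buf, sizeof buf, "%02X ", static_cast<unsigned char>(*s));
        out += buf;
    }
    return out;
}
```

Calling GetUnicodeChar(225, s) and printing HexBytes(s) shows C3 A1 on both platforms, which is the correct UTF-8 for U+00E1; the difference is purely in how each console then interprets those bytes.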
You can simulate the problem on OS X using the iconv program, with the following command in Terminal.app:
iconv -f cp437 -t utf8 <<< "á"
This takes the UTF-8 string, reinterprets it as a string encoded in Windows code page 437, and converts that to UTF-8 for display. The output on OS X is ├í.
For testing small things you can do the following to properly display UTF-8 data on Windows.
#include <windows.h>
#include <cstdio>
char s[5];
GetUnicodeChar(225, s);
SetConsoleOutputCP(CP_UTF8);
std::printf("%s\n", s);
Also, parts of Windows' implementation of the standard library don't support output of UTF-8, so even after you change the output code page, code like std::cout << s still won't work.
On a side note, taking an array as a parameter like this:
void GetUnicodeChar(unsigned int code, char chars[5]) {
is a bad idea. This will not catch mistakes such as:
char *s; GetUnicodeChar(225, s);
char s[1]; GetUnicodeChar(225, s);
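These slip through because, in a parameter list, an array type is adjusted to a pointer, so the declared size is never part of the function's type. A small sketch illustrating the decay (ParamSize is a hypothetical helper, not from the question):

```cpp
#include <cassert>
#include <cstddef>

// Sketch: an array parameter such as 'char chars[5]' is adjusted to 'char*',
// so inside the function the declared size 5 is gone.
std::size_t ParamSize(char chars[5]) {
    return sizeof(chars);  // sizeof(char*), not 5
}
```

Inside the function, sizeof(chars) is sizeof(char*), and at the call site most compilers will at best warn about passing a wrong-sized array.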
You can avoid these specific problems by changing the function to take a reference to an array instead:
void GetUnicodeChar(unsigned int code, char (&chars)[5]) {
However, in general I'd recommend just avoiding raw arrays altogether. You can use std::array if you really want an array. You can use std::string if you want text, which IMO is a good choice here:
std::string GetUnicodeChar(unsigned int code);
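As a sketch, a std::string version with the same encoding logic as the question's function might look like this:

```cpp
#include <string>

// Sketch of a std::string-returning version; same bit logic as the question's
// function, returning U+FFFD for code points beyond the Unicode range.
std::string GetUnicodeChar(unsigned int code) {
    std::string s;
    if (code <= 0x7F) {
        s += static_cast<char>(code);
    } else if (code <= 0x7FF) {
        s += static_cast<char>(0xC0 | (code >> 6));
        s += static_cast<char>(0x80 | (code & 0x3F));
    } else if (code <= 0xFFFF) {
        s += static_cast<char>(0xE0 | (code >> 12));
        s += static_cast<char>(0x80 | ((code >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (code & 0x3F));
    } else if (code <= 0x10FFFF) {
        s += static_cast<char>(0xF0 | (code >> 18));
        s += static_cast<char>(0x80 | ((code >> 12) & 0x3F));
        s += static_cast<char>(0x80 | ((code >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (code & 0x3F));
    } else {
        s = "\xEF\xBF\xBD";  // unicode replacement character U+FFFD
    }
    return s;
}
```

The caller no longer needs to supply a buffer of the right size, and the result composes naturally with other std::string operations.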
Upvotes: 5
Reputation: 179897
The function is correct. The output presumably isn't, which means the bug is in your printing routine, which you don't show. I'll bet that you're assuming that Windows can print UTF-8.
Upvotes: 2