Reputation: 3661
I have an array of uint32_t elements that each store a codepoint for a non-latin Unicode character. How do I print them on the console or store them in a file as UTF-8 encoded characters? I understand that they may fail to render properly on a console, but they should display fine if I open them in a compatible editor.
I have tried using wprintf(L"%lc", UINT32_T_VARIABLE) and fwprintf(FILE_STREAM, L"%lc", UINT32_T_VARIABLE), but to no avail.
Upvotes: 5
Views: 2590
Reputation: 153338
Best to use existing code when available.
Rolling your own Unicode code point to UTF-8 conversion is simple, yet easy to mess up. This answer took 2 edits to fix (@Jonathan Leffler, @chqrlie), so rigorous testing is recommended for any self-coded solution. What follows is lightly tested code that converts a code point to a byte array.
Note that the result is not a string: the bytes written are not null-terminated.
// Populate utf8 with 0-4 bytes
// Return length used in utf8[]
// 0 implies bad codepoint
unsigned Unicode_CodepointToUTF8(uint8_t *utf8, uint32_t codepoint) {
    if (codepoint <= 0x7F) {
        utf8[0] = codepoint;
        return 1;
    }
    if (codepoint <= 0x7FF) {
        utf8[0] = 0xC0 | (codepoint >> 6);
        utf8[1] = 0x80 | (codepoint & 0x3F);
        return 2;
    }
    if (codepoint <= 0xFFFF) {
        // detect surrogates
        if (codepoint >= 0xD800 && codepoint <= 0xDFFF) return 0;
        utf8[0] = 0xE0 | (codepoint >> 12);
        utf8[1] = 0x80 | ((codepoint >> 6) & 0x3F);
        utf8[2] = 0x80 | (codepoint & 0x3F);
        return 3;
    }
    if (codepoint <= 0x10FFFF) {
        utf8[0] = 0xF0 | (codepoint >> 18);
        utf8[1] = 0x80 | ((codepoint >> 12) & 0x3F);
        utf8[2] = 0x80 | ((codepoint >> 6) & 0x3F);
        utf8[3] = 0x80 | (codepoint & 0x3F);
        return 4;
    }
    return 0;
}

// Sample usage
uint32_t cp = foo();
uint8_t utf8[4];
unsigned len = Unicode_CodepointToUTF8(utf8, cp);
if (len == 0) Handle_BadCodePoint();
size_t y = fwrite(utf8, 1, len, stream_opened_in_binary_mode);
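Since the question starts from a whole array of uint32_t code points, the per-code-point usage above could be wrapped in a loop. The following is only a sketch under assumptions: encode_utf8 is a compact restatement of the encoder so the snippet compiles on its own, and write_codepoints_utf8 is a hypothetical helper name, not part of any library.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Compact restatement of the encoder so this sketch compiles on its own.
   Returns the number of bytes stored in out[] (1-4), or 0 for a bad code point. */
unsigned encode_utf8(uint8_t out[4], uint32_t cp) {
    if (cp <= 0x7F) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp <= 0x7FF) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogates are rejected */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: writes n code points to out as UTF-8.
   Returns how many code points were written; a result smaller than n
   means it stopped at a bad code point or a write error. */
size_t write_codepoints_utf8(FILE *out, const uint32_t *cps, size_t n) {
    assert(out != NULL && (cps != NULL || n == 0));
    for (size_t i = 0; i < n; i++) {
        uint8_t buf[4];
        unsigned len = encode_utf8(buf, cps[i]);
        if (len == 0 || fwrite(buf, 1, len, out) != len) return i;
    }
    return n;
}
```

A caller would typically open the destination with fopen(path, "wb") and compare the return value with n to detect a bad code point or a failed write.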
Upvotes: 3
Reputation: 144550
You must first select the proper locale with:
#include <locale.h>
setlocale(LC_ALL, "C.UTF-8");
or
setlocale(LC_ALL, "en_US.UTF-8");
And then use printf or fprintf with the %lc format, which expects a wint_t argument:
printf("%lc", (wint_t)UINT32_T_VARIABLE);
This will work only for Unicode code points small enough to fit in a wchar_t. For a more complete and portable solution, you may need to implement the Unicode to UTF-8 conversion yourself, which is not very difficult.
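The locale-based steps can be put together with basic error checking, roughly as below. This is a sketch, not a definitive implementation: select_utf8_locale and print_codepoint_lc are hypothetical helper names, which locale names exist depends on the system, and the range check only confirms that the value fits in wchar_t, not that the platform's wchar_t actually holds Unicode code points.

```c
#include <assert.h>
#include <locale.h>
#include <stdint.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: tries the two locale names mentioned above.
   Returns 1 if one of them was accepted, 0 if neither is installed. */
int select_utf8_locale(void) {
    static const char *const names[] = { "C.UTF-8", "en_US.UTF-8" };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++) {
        if (setlocale(LC_ALL, names[i]) != NULL) return 1;
    }
    return 0;
}

/* Hypothetical helper: prints one code point via the locale's multibyte
   conversion. Returns 0 on success, -1 if cp does not fit in wchar_t
   or if fprintf reports a failure (for example an unconvertible value). */
int print_codepoint_lc(FILE *out, uint32_t cp) {
    assert(out != NULL);
    if ((uintmax_t)cp > (uintmax_t)WCHAR_MAX) return -1;
    return fprintf(out, "%lc", (wint_t)cp) < 0 ? -1 : 0;
}
```

Checking the return value of setlocale matters here: if no UTF-8 locale was selected, the program stays in the default "C" locale, where %lc is not expected to produce UTF-8 for non-ASCII values.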
Upvotes: 3