hazrmard

Reputation: 3661

How to print Unicode codepoints as characters in C?

I have an array of uint32_t elements, each of which stores the codepoint of a non-Latin Unicode character. How do I print them on the console or store them in a file as UTF-8 encoded characters? I understand that they may fail to render properly on a console, but they should display fine if I open the file in a compatible editor.

I have tried using wprintf(L"%lc", UINT32_T_VARIABLE) and fwprintf(FILE_STREAM, L"%lc", UINT32_T_VARIABLE), but to no avail.

Upvotes: 5

Views: 2590

Answers (2)

chux

Reputation: 153338

Best to use existing code when available.

Rolling one's own Unicode code point to UTF-8 conversion is simple, yet easy to mess up. This answer took 2 edits to fix (thanks to @Jonathan Leffler and @chqrlie), so rigorous testing is recommended for any self-coded solution. What follows is lightly tested code that converts a code point to an array of bytes.
Note that the result is not a string: no null terminator is appended.

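#include <stdint.h>

// Populate utf8[] with the UTF-8 encoding of codepoint (0 to 4 bytes, no null terminator)
// Return the number of bytes stored in utf8[]
// 0 implies a bad codepoint (a surrogate or a value above 0x10FFFF)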
unsigned Unicode_CodepointToUTF8(uint8_t *utf8, uint32_t codepoint) {
  if (codepoint <= 0x7F) {
    utf8[0] = codepoint;
    return 1;
  }
  if (codepoint <= 0x7FF) {
    utf8[0] = 0xC0 | (codepoint >> 6);
    utf8[1] = 0x80 | (codepoint & 0x3F);
    return 2;
  }
  if (codepoint <= 0xFFFF) {
    // detect surrogates
    if (codepoint >= 0xD800 && codepoint <= 0xDFFF) return 0;
    utf8[0] = 0xE0 | (codepoint >> 12);
    utf8[1] = 0x80 | ((codepoint >> 6) & 0x3F);
    utf8[2] = 0x80 | (codepoint & 0x3F);
    return 3;
  }
  if (codepoint <= 0x10FFFF) {
    utf8[0] = 0xF0 | (codepoint >> 18);
    utf8[1] = 0x80 | ((codepoint >> 12) & 0x3F);
    utf8[2] = 0x80 | ((codepoint >> 6) & 0x3F);
    utf8[3] = 0x80 | (codepoint & 0x3F);
    return 4;
  }
  return 0;
}

// Sample usage
uint32_t cp = foo();
uint8_t utf8[4];
unsigned len = Unicode_CodepointToUTF8(utf8, cp);
if (len == 0) Handle_BadCodePoint();
size_t y = fwrite(utf8, 1, len, stream_opened_in_binary_mode);
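
Since the question starts from an array of uint32_t and the encoder's output is not a string, here is a rough sketch of how a whole array might be turned into a null-terminated UTF-8 buffer. The names encode_one and codepoints_to_utf8 are hypothetical, not part of any library; encode_one is just a compact restatement of the function above, and skipping invalid code points is an assumption you may want to change.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Compact restatement of the encoder above (hypothetical name).
   Returns the number of bytes stored, or 0 for an invalid code point. */
static unsigned encode_one(uint8_t *utf8, uint32_t cp) {
  if (cp <= 0x7F) {
    utf8[0] = (uint8_t)cp;
    return 1;
  }
  if (cp <= 0x7FF) {
    utf8[0] = (uint8_t)(0xC0 | (cp >> 6));
    utf8[1] = (uint8_t)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp <= 0xFFFF) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogate */
    utf8[0] = (uint8_t)(0xE0 | (cp >> 12));
    utf8[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    utf8[2] = (uint8_t)(0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp <= 0x10FFFF) {
    utf8[0] = (uint8_t)(0xF0 | (cp >> 18));
    utf8[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    utf8[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    utf8[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;
}

/* Hypothetical helper: encode n code points into out (capacity cap bytes)
   and append a null terminator. Invalid code points are skipped (assumption).
   Returns the byte count excluding the terminator, or (size_t)-1 if out
   is too small. */
size_t codepoints_to_utf8(char *out, size_t cap, const uint32_t *cps, size_t n) {
  assert(out != NULL && cps != NULL);
  size_t used = 0;
  for (size_t i = 0; i < n; i++) {
    uint8_t tmp[4];
    unsigned len = encode_one(tmp, cps[i]);
    /* keep one byte in reserve for the terminator */
    if (used + len + 1 > cap) return (size_t)-1;
    memcpy(out + used, tmp, len);
    used += len;
  }
  if (used + 1 > cap) return (size_t)-1;
  out[used] = '\0';
  return used;
}
```

A buffer of 4 * n + 1 bytes is always large enough. The result can then be passed to fputs() or fwrite() like any other byte string.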

Upvotes: 3

chqrlie

Reputation: 144550

You must first select the proper locale with:

#include <locale.h>

setlocale(LC_ALL, "C.UTF-8");

or

setlocale(LC_ALL, "en_US.UTF-8");

And then use printf or fprintf with the %lc format, which expects an argument of type wint_t (declared in <wchar.h>):

printf("%lc", (wint_t)UINT32_T_VARIABLE);

This will work only for Unicode code points small enough to fit in a wchar_t (on Windows, for example, wchar_t is 16 bits wide, so code points above 0xFFFF do not fit). For a more complete and portable solution, you may need to implement the Unicode to UTF-8 conversion yourself, which is not very difficult.
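
Putting the locale steps together, a minimal sketch could look like the following. The helper names select_utf8_locale and print_codepoints are made up for illustration, and whether either locale name is installed depends on the system, so the return values should be checked.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: try the two locale names mentioned above.
   Returns 1 if one of them could be selected, 0 otherwise. */
int select_utf8_locale(void) {
  static const char *const names[] = { "C.UTF-8", "en_US.UTF-8" };
  for (size_t i = 0; i < sizeof names / sizeof names[0]; i++) {
    if (setlocale(LC_ALL, names[i]) != NULL) return 1;
  }
  return 0;
}

/* Hypothetical helper: write n code points to fp with the %lc format.
   Stops at the first code point that does not fit in a wchar_t or that
   fprintf cannot convert in the current locale.
   Returns the number of code points written. */
size_t print_codepoints(FILE *fp, const uint32_t *cps, size_t n) {
  assert(fp != NULL && cps != NULL);
  size_t i;
  for (i = 0; i < n; i++) {
    if ((uintmax_t)cps[i] > (uintmax_t)WCHAR_MAX) break;
    if (fprintf(fp, "%lc", (wint_t)cps[i]) < 0) break;
  }
  return i;
}
```

Call select_utf8_locale() once at startup, then print_codepoints(stdout, array, count). If the return value is smaller than count, the remaining code points need a hand-written conversion.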

Upvotes: 3
