Alexander Jonsson

Reputation: 179

Printing a string with UTF-8 characters in C

I want to print blå using UTF-8, but I do not know how to do it. In UTF-8, b is 0x62, l is 0x6c, and å is the two bytes 0xc3 0xa5. I am not sure what to do with the å character. Here is my code:

#include <stdio.h>

int main(void) {

    char myChar1 = 0x62;  //b
    char myChar2 = 0x6C;  //l
    char myChar3 = ??     //å

    printf("%c", myChar1);
    printf("%c", myChar2);
    printf("%c", myChar3);

    return 0;
}

I also tried this:

#include <stdio.h>

#define SIZE 100

int main(void) {

    char myWord[SIZE] = "\x62\x6c\xc3\xa5\x00";

    printf("%s", myWord);

    return 0;
}

However, the output was:

bl├Ñ

Finally, I tried this:

#include <stdio.h>
#include <locale.h>

#define SIZE 100

int main(void) {

    setlocale(LC_ALL, ".UTF8");
    char myWord[SIZE] = "\x62\x6c\xc3\xa5\x00";

    printf("%s", myWord);

    return 0;
}

Same output as before.

I am not sure I fully understand Unicode. As I understand it, UTF-16 and UTF-32 use wide characters, where each character takes a fixed number of bytes (2 or 4 for UTF-16). UTF-8, on the other hand, uses characters whose size varies (1-4 bytes). I know the first 128 characters take 1 byte, almost all of Latin-1 can be encoded in 2 bytes, and so on. Since UTF-8 does not require wide characters, I should not need the wchar functions in my code, so I do not see why my second and third attempts do not work. The only other idea I have is to use setmode to change the encoding of stdin and stdout, but I am not sure whether that would work or how to implement it.

Summary:

Why doesn't my code work?

I am on Windows, using VS Code with MinGW32 as the compiler.

Upvotes: 2

Views: 949

Answers (1)

Rob Napier

Reputation: 299265

Your second attempt is correct and does output UTF-8, as you wanted. The problem is that your terminal is not interpreting that output as UTF-8. See "Displaying Unicode in PowerShell" and "Using UTF-8 Encoding (CHCP 65001) in Command Prompt / Windows Powershell (Windows 10)" for discussion of displaying UTF-8 in Windows terminals.
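
One way to convince yourself that the program really emits UTF-8 is to look at the bytes themselves rather than at how the terminal renders them. The following is only a minimal sketch: hex_dump is a hypothetical helper name, not something from the question or from any library.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the question): writes each byte of s into
   out as two lowercase hex digits, separated by single spaces.
   Stops early if out is too small. Returns the length of s in bytes. */
size_t hex_dump(const char *s, char *out, size_t out_size) {
    size_t n = strlen(s);
    size_t pos = 0;

    if (out_size > 0) {
        out[0] = '\0';
    }
    /* Each byte needs at most 3 characters (space + 2 digits) plus the NUL. */
    for (size_t i = 0; i < n && pos + 4 <= out_size; i++) {
        /* Cast to unsigned char so bytes >= 0x80 are not sign-extended. */
        pos += (size_t)snprintf(out + pos, out_size - pos, "%s%02x",
                                i ? " " : "", (unsigned char)s[i]);
    }
    return n;
}
```

Calling hex_dump on the string from the question with a buffer of, say, 64 bytes fills the buffer with "62 6c c3 a5" and returns 4: four bytes for three characters, which is exactly the UTF-8 encoding of blå.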

Your terminal is currently using an encoding in which 0xc3 is ├. That is probably CP850, which I believe is the default for some of the mingw-based terminals (MSYS, git bash). It's been a very long time since I've used mingw, but you may also want to see "How to set console encoding in MSYS?".
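
If you would rather switch the code page from inside the program than with chcp, the Win32 call SetConsoleOutputCP can do that for the output side. This is a hedged sketch, not a guaranteed fix: use_utf8_console is a hypothetical helper name, and the call only affects a native Windows console (cmd.exe, PowerShell). I would assume MSYS-style terminals keep their own character-set setting, so it may fail or have no visible effect there.

```c
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: asks the Windows console to interpret this
   program's output bytes as UTF-8 (code page 65001).
   Returns 1 on success, 0 on failure. On non-Windows builds it does
   nothing and returns 1, assuming the terminal is already UTF-8. */
int use_utf8_console(void) {
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return 1;
#endif
}
```

You would call use_utf8_console() once at the start of main, before the printf in your second attempt. The console font also needs a glyph for å, which the usual defaults have.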

Upvotes: 4
