NomeQueEuLembro

Reputation: 112

How to handle UTF-8 encoded source when compiling on Windows?

I'm currently writing a small C program, using MinGW's gcc to compile it on Windows. I'm also hosting it on GitHub (and using GitHub Desktop for Windows). GitHub, however, appears to enforce UTF-8 encoding in the files, and the Windows terminal has trouble dealing with UTF-8.

After some searching I found a few solutions, but they all require manual steps on the end user's machine, which I want to avoid (I'm not planning on distributing it or anything, but I wonder what I would do if I were).

What currently works is changing the encoding to ANSI and manually fixing everything before compilation, but I would rather avoid having to do that every damn time I want to work on Windows.

So the question is: How to handle UTF-8 encoded source when compiling on Windows?


Here's some sample output:

[Screenshot]

The compilation process is exactly the same; the only difference is the encoding of the source code.

Upvotes: 0

Views: 2085

Answers (1)

NomeQueEuLembro

Reputation: 112

The issue is that the Windows terminal does not display UTF-8 encoded characters correctly by default.

To solve it, you need to tell the terminal to use the UTF-8 code page. You do not need to call setlocale() after changing the code page; doing so will probably mess things up.

To tell Windows which code page to use for displaying output, call the SetConsoleOutputCP function, passing the UTF-8 code page identifier (65001) as its parameter (for more information, check "Code Page Identifiers" on MSDN).
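The call described above can be wrapped in a pair of small helpers, so the switch and the restore stay together. This is only a sketch: the names console_use_utf8, console_restore and CP_UTF8_ID are made up for illustration, and the #ifdef _WIN32 guard is an assumption so the same file still compiles on other systems, where the helpers do nothing.

```c
#ifdef _WIN32
#include <windows.h>
#endif

/* Code page identifier for UTF-8 (see "Code Page Identifiers"). */
#define CP_UTF8_ID 65001u

/*
 * Hypothetical helper: switch console output to the UTF-8 code page.
 * Returns the previous code page so the caller can restore it later,
 * or 0 if nothing was changed (non-Windows build, or a call failed).
 */
unsigned int console_use_utf8(void)
{
#ifdef _WIN32
    UINT previous = GetConsoleOutputCP();      /* 0 means the query failed */
    if (previous != 0 && SetConsoleOutputCP(CP_UTF8_ID))
        return previous;
#endif
    return 0;
}

/* Hypothetical helper: restore the code page saved by console_use_utf8(). */
void console_restore(unsigned int previous)
{
#ifdef _WIN32
    if (previous != 0)
        SetConsoleOutputCP(previous);
#else
    (void)previous;                            /* nothing to restore here */
#endif
}
```

The intended use would be to call console_use_utf8() at the start of main, keep the returned value, and pass it to console_restore() just before returning.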

Here is a test program:

#include <stdio.h>
#include <locale.h>
#include <windows.h>

int main(void)
{
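    UINT CODEPAGE_UTF8 = 65001;
    /* Saved so the console can be put back the way it was on exit. */
    UINT CODEPAGE_ORIGINAL = GetConsoleOutputCP();

    /* Case 1: original code page, default "C" locale. */
    printf("DEFAULT CODEPAGE, DEFAULT LOCALE: ¶\n");
    /* Case 2: original code page, locale taken from the system. */
    setlocale(LC_ALL, "");
    printf("DEFAULT CODEPAGE, SYSTEM LOCALE: ¶\n");

    /* Switch console output to UTF-8. */
    SetConsoleOutputCP(CODEPAGE_UTF8);

    /* Case 3: UTF-8 code page, back to the "C" locale. */
    setlocale(LC_ALL, "C");
    printf("UTF-8 CODEPAGE, DEFAULT LOCALE: ¶\n");

    /* Case 4: UTF-8 code page, system locale. */
    setlocale(LC_ALL, "");
    printf("UTF-8 CODEPAGE, SYSTEM LOCALE: ¶\n");

    /* Restore the code page the console started with. */
    SetConsoleOutputCP(CODEPAGE_ORIGINAL);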
    return 0;
}

And here's the program output, compiled with source code encoded in ANSI, UTF-8 without BOM (Byte Order Mark) and UTF-8 with BOM, respectively:

[TEST OUTPUT]
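Since the result depends on how the source file was saved, one way to take the editor out of the picture is to spell the character as explicit UTF-8 bytes, so the compiled string is the same whatever the file's encoding. The sketch below does that for the pilcrow sign (U+00B6, which is the two bytes 0xC2 0xB6 in UTF-8). The names PILCROW_UTF8 and print_with_pilcrow are made up for illustration, and the output is only expected to display correctly if the console is already using code page 65001.

```c
#include <stdio.h>

/* U+00B6 PILCROW SIGN written as its UTF-8 byte sequence: 0xC2 0xB6. */
static const char PILCROW_UTF8[] = "\xC2\xB6";

/*
 * Hypothetical helper: print a label followed by the pilcrow sign.
 * Returns whatever printf returns (the number of bytes written,
 * or a negative value on error).
 */
int print_with_pilcrow(const char *label)
{
    return printf("%s: %s\n", label, PILCROW_UTF8);
}
```

GCC also documents the -finput-charset and -fexec-charset options for controlling source and execution encodings; whether they behave well with a particular MinGW build is not something the test above covers.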

Caveat: Some sources online say this only works with certain fonts, notably Lucida Console. Also, this only works on Windows 2000 Professional and above. I don't think you will need to touch anything older than that nowadays, though.

Upvotes: 2
