RdlP

Reputation: 1396

Input encoding issue in Windows C++

I am developing a simple console application with Visual Studio 2013:

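#include <cstdlib>   // system
#include <iostream>  // std::wcin, std::wcout
#include <string>    // std::wstring
#include <tchar.h>   // _tmain, _TCHAR

int _tmain(int argc, _TCHAR* argv[])
{
    std::wstring name;
    std::wcout << L"Enter your name: ";
    std::wcin >> name;
    std::wcout << L"Hello, " << name << std::endl;
    system("pause");
    return 0;
}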

If I enter Ángel as input, the application works well and the output is:

Hello, Ángel

The problem is that if I put a breakpoint on

std::wcout << L"Hello, " << name << std::endl;

the Visual Studio debugger shows

+       name    L"µngel"    std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> >

Although the console output is correct, another part of the program calls the Win32 API function CopyFileW(), and it always fails because the path contains the substring Ángel, which reaches the function as µngel.

Upvotes: 1

Views: 233

Answers (1)

rodrigo

Reputation: 98328

The problem is that Windows consoles are broken by default.

The problem arises from Windows using a different 8-bit codepage in console applications than in GUI applications. In Western Windows versions, the default 8-bit codepage (called ANSI) is Windows-1252, while the console 8-bit codepage (called OEM) is CP850.

Since your program doesn't know if it is reading from the console or from a redirected file, it simply assumes ANSI input. But when you type Á, what arrives is its code in CP850: 0xB5. That byte is then interpreted using Windows-1252 as µ, that is, Unicode character U+00B5. The funny thing is that when you print it to the console, the inverse transformation happens, and you see an Á again. Two wrongs make one right!

But when you want to use those characters in a non-console context, what you have is actually a µ.

You may think that you can convert from OEM to ANSI and then from ANSI to Unicode, and that would seem to work... until you run your program as:

c:\> myprogram < input.txt

And you wrote that input.txt using Notepad, so it is in ANSI, and then you are doing a conversion you do not need.

You might then say that you could detect whether you are reading from the actual console or from a redirection, and do the OEM to ANSI conversion only when there is no redirection... until you do:

c:\> echo Ángel | myprogram

And you are doing it wrong again!

There are a lot of alternatives, but none of them works perfectly. At the very least, you should use a Unicode font and a more conventional codepage: something like chcp 1252, to change the console (OEM) codepage to match the ANSI one. You can even make that the default with a bit of registry foo:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\OEMCP=1252

Upvotes: 4
