Zebrafish

Reputation: 14320

C++ Visual Studio Unicode confusion

I've been looking at the Unicode chart, and I know that the first 128 code points (0 through 127) are the same in almost all encoding schemes: ASCII (probably the original), UCS-2, ANSI, UTF-8, UTF-16, UTF-32 and others.

I wrote a loop to go through the characters starting from decimal 122, which is lowercase "z". After that there are a few more characters such as {, |, } and ~. After that it gets into no-man's land, which is basically 33 "control characters" (127 through 159) followed by a non-breaking space at 160, and then the characters begin again at 161 with an inverted exclamation mark, 162 which is the cent sign with a stroke through it, and so on.

The problem is that my results don't correspond to the Unicode chart, or to a UTF-8 or UCS-2 chart; the symbols seem random. By the way, the reason I made the "character" variable a four-byte int is that when I was using "char" (which is essentially a one-byte signed data type), it cycled back to -128 after 127, and I thought this might be messing it up.

I know I'm doing something wrong, can anyone figure out what's happening? This happens whether I set the character set to Unicode or Multibyte characters in the project settings. Here is the code you can run.

#include <cstdlib>  // system
#include <iostream>

using namespace std;

int main()
{
    unsigned int character = 122; // Starting at "z"
    for (int i = 0; i < 100; i++)
    {
        // Truncate the value to a single byte and print it as a character
        cout << (char)character << endl;
        cout << "decimal code point = " << character << endl;
        cout << "size of character = " << sizeof(character) << endl;
        character++;
        system("pause"); // Windows-only: wait for a key press
        cout << endl;
    }

    return 0;
}

By the way, here is the Unicode chart

http://unicode-table.com/en/#control-character

Upvotes: 0

Views: 226

Answers (1)

roeland

Reputation: 5751

Very likely the bytes you're printing are displayed using the console code page (sometimes referred to as OEM), which may differ from the local single- or double-byte character set used by Windows applications (called ANSI).

For instance, on my English-language Windows install, ANSI means windows-1252, while a console uses code page 850 by default.
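To make the mismatch concrete, here is a minimal sketch that tabulates what the same byte means in each of those two code pages. The struct and function names are made up for illustration, and only three sample bytes are covered; it is not a full code page table.

```cpp
#include <cassert>

// Unicode code points that two Windows code pages assign to the same byte.
struct CodePagePair {
    char32_t windows1252; // "ANSI" code page in the setup described above
    char32_t cp850;       // console (OEM) code page in the setup described above
};

// Hypothetical lookup covering only three sample bytes, 0xA1 through 0xA3.
CodePagePair decode_sample_byte(unsigned char byte) {
    assert(byte >= 0xA1 && byte <= 0xA3 && "only 0xA1..0xA3 are tabulated");
    static const CodePagePair table[] = {
        {0x00A1, 0x00ED}, // 0xA1: inverted exclamation mark vs. i with acute
        {0x00A2, 0x00F3}, // 0xA2: cent sign vs. o with acute
        {0x00A3, 0x00FA}, // 0xA3: pound sign vs. u with acute
    };
    return table[byte - 0xA1];
}
```

So a program that sends byte 162 expecting a cent sign would, on a code page 850 console, show an o with an acute accent instead, which looks "random" when compared against the Unicode chart.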

There are a few ways to write arbitrary Unicode characters to the console; see "How to Output Unicode Strings on the Windows Console".
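One of those ways, as a rough sketch: switch the console output code page to UTF-8 with SetConsoleOutputCP(CP_UTF8) and write each code point as UTF-8 bytes. The helper names encode_utf8 and print_code_points are my own, and this assumes the console font actually has glyphs for the characters you print; it is not necessarily the approach the linked article recommends.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h> // SetConsoleOutputCP, CP_UTF8
#endif

// Encode one Unicode code point as a UTF-8 byte sequence.
std::string encode_utf8(char32_t cp) {
    // Valid code points end at U+10FFFF and exclude the surrogate range.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                 // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Print each code point in [first, last] with its decimal value.
void print_code_points(char32_t first, char32_t last) {
#ifdef _WIN32
    SetConsoleOutputCP(CP_UTF8); // have the console interpret output as UTF-8
#endif
    for (char32_t cp = first; cp <= last; ++cp) {
        std::string bytes = encode_utf8(cp);
        std::printf("%u: %s\n", static_cast<unsigned>(cp), bytes.c_str());
    }
}
```

Calling print_code_points(122, 221) from main covers the same 100 values as the loop in the question. The values 127 through 159 are control characters, so expect nothing visible for those even when everything is set up correctly.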

Upvotes: 1
