Vincent

Reputation: 1537

Why is the console not printing the characters I am expecting?

I'm currently trying to educate myself about the different encoding types. I made a simple console app to show me the differences between them.

byte[] byteArray = new byte[] { 125, 126, 127, 128, 129, 130, 250, 254, 255 };
string s = Encoding.Default.GetString(byteArray);
Console.OutputEncoding = Encoding.Default;
Console.WriteLine("Default: " + s);

s = Encoding.ASCII.GetString(byteArray);
Console.OutputEncoding = Encoding.ASCII;
Console.WriteLine("ASCII: " + s);

s = Encoding.UTF8.GetString(byteArray);
Console.OutputEncoding = Encoding.UTF8;
Console.WriteLine("UTF8: " + s);

The output however is nothing like I expected it to be.

Default: }~€‚úûüýþÿ
ASCII: }~?????????
UTF8: }~���������

Hmm... the characters do not copy well from the console output to here either, so here's a screenshot.

[Screenshot: console output]

What I do expect to see is the extended ASCII characters. The default encoding is almost correct, but it cannot display 251, 252 and 253. That might be a shortcoming of Console.WriteLine(), though I'd not expect that.


The representation of the variable when debugging is as follows:

Default encoded string = "}~€‚úûüýþÿ"
ASCII encoded string = "}~?????????"
UTF8 encoded string = "}~���������"

Can someone tell me what I'm doing wrong? I expect one of the encoding types to properly display the extended ASCII table but apparently none can...

A bit of context:
I am trying to determine which encoding would be best as a standard in our company. I personally think UTF-8 will do, but my supervisor would like to see some examples before we decide.

Obviously we will need to use other encoding types every now and then (serial communication, for example, uses 7 bits, so we can't use UTF-8 there), but in general we would like to stick with one encoding type. Currently we are using Default, ASCII and UTF-8 at random, which is not a good thing.

EDIT
The output according to:

Console.WriteLine("Default: {0} for {1}", s, Console.OutputEncoding.CodePage);

[Screenshot: output with code page]

Edit 2:
Since I thought there might not be an encoding in which the extended ASCII characters correspond to the decimal numbers in the table I linked to, I turned it around and tried this:

char specialChar = '√';
int charNumber = (int)specialChar;

gives me the number 8730, while the table lists that character as 251.
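
To make the round trip explicit, here is a minimal sketch (assuming the table I linked is the classic OEM code page 437; on .NET Core the code-pages encoding provider has to be registered first):

char specialChar = '√';                                // U+221A
Console.WriteLine((int)specialChar);                   // 8730, the Unicode code point

// Assumption: the linked table is code page 437, where '√' sits at position 251.
// On .NET Core/5+ register the provider first:
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding oem = Encoding.GetEncoding(437);
byte[] oemBytes = oem.GetBytes(new[] { specialChar });
Console.WriteLine(oemBytes[0]);                        // 251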

Upvotes: 0

Views: 3615

Answers (3)

Cecilia Colley

Reputation: 1

Here's a Bluesky thread I wrote about it in C++, maybe it'll be useful to you!

https://bsky.app/profile/cecisharp.bsky.social/post/3ld2bpp5qj22h

Thread in full:

Why do some characters print on the console, and others don’t? Your immediate thought might be, "Oh, the console’s font doesn’t support those characters." And yeah, that makes sense. Except... it’s not always true.

For example, the default font for most Windows consoles is Consolas. If you open the Character Map, you’ll see that Consolas supports a ton of characters. Including the square symbol, ■. So... why isn’t it showing up?

Your next guess might be, "Maybe it’s because I’m using an extended ASCII character, and I need to declare it as a wide character." Hmm. Nope, that didn’t work either.

Okay, forget ASCII for a second. What if we assign the character using its Unicode code? Hmm... still nothing.

Fine. What if we skip all that and just look up the ASCII value for the character, assign that number to a char, and print it that way? Oh, now it works! Why?

Well, the answer involves bytes, encoding, and how your program interprets text. Let’s break it down.

Why Assigning the Number Directly Works

When you assign a char like this:

char ch = 254;
cout << ch;

It works. Why? Because a char in C takes up exactly 1 byte—that’s 8 bits. And 254 fits perfectly into those 8 bits.

Here’s what happens:

You assign 254 to the char. Internally, the program stores it as the binary value 11111110. The console reads this byte, looks it up in its active code page (like CP437), and renders it as ■. This works because there’s no interpretation or decoding involved. You’re giving the program exactly what it needs, so it just prints the symbol without any fuss.

But what about this code?

char ch = '■';
cout << ch;

Why doesn’t that work? After all, it’s the same character, right? Well, here’s where encoding comes into play.

Remember that our code is nothing more than a text file that we're giving to some IDE to translate into binary. The encoding we use to save our source file determines how that translation is done.

Encoding is essentially the "translation system" that tells the computer how to interpret, store, and display text symbols. It’s important because most of what we see on a computer screen is text. You’ll even see it when saving something in notepad... And since our source file is nothing more than a text file at the end of the day, we also save it with a specific encoding.

Most people probably encode their source files as UTF-8 without even knowing it. This is the standard. So, what is UTF-8 encoding? Well, it's short for "Unicode Transformation Format - 8-bit", and it's a variable-length character encoding.

Basically it’s a kind of encoding that understands all Unicode symbols, and stores them in variables of different lengths.

Can you see where I'm going with this? In C++, a char is always only one byte. But with UTF-8 encoding, characters can have varying lengths. In fact, with UTF-8, characters in the ASCII range (0–127) are encoded in 1 byte and have the same binary values as ASCII, while less common characters, like our square, use 2–4 bytes.

So when we write this code here:

char ch = '■';
cout << ch;

... and save the source file with UTF-8 encoding, then run the program, we end up trying to fit multiple bytes into one byte, which the program realizes isn’t gonna work, and defaults to a question mark.

Alright, so what if we use a wchar_t instead? Like this:

wchar_t ch = L'■';
wcout << ch;

That gives wchar_t enough space to store the character, so it should work, right? Nope. Not yet.

The issue here isn’t the storage space—it’s the locale.

By default, C++ uses the "C" locale. This is a minimal locale that only understands basic ASCII characters. It doesn’t know what ■ is, even if you’ve stored it correctly.

To fix this, you need to tell your program to use a locale that understands Unicode. For example:

locale::global(locale("en_US.UTF-8"));
wchar_t ch = L'■';
wcout << ch;

This one will work.

With this line, you’re switching to the English (US) locale with UTF-8 encoding, which can handle Unicode characters. Now the program knows how to interpret L'■' and display it properly.

So, let’s go back to everything we tried:

Assigning the Number Directly: Worked because we skipped all encoding and just gave the program the byte 254. The console knew how to render it.

Using a Literal: Failed because the source file was saved as UTF-8. The program couldn’t fit the 3-byte UTF-8 sequence for ■ into a single char.

Using a Wide Character: Failed until we set the locale. Even though wchar_t could store the character, the default "C" locale didn’t understand Unicode.

Setting the Locale: Worked because it allowed the program to interpret wide characters as Unicode.

Upvotes: -2

Wernfried Domscheit

Reputation: 59456

Strange, with this code

byte[] byteArray = new byte[] { 125, 126, 127, 128, 129, 130, 250, 254, 255 };  // same bytes as in the question
string s = Encoding.Default.GetString(byteArray);
Console.OutputEncoding = Encoding.Default;
Console.WriteLine("Default: {0} for {1}", s, Console.OutputEncoding.HeaderName);
s = Encoding.ASCII.GetString(byteArray);
Console.OutputEncoding = Encoding.ASCII;
Console.WriteLine("ASCII: {0} for {1}", s, Console.OutputEncoding.HeaderName);
s = Encoding.UTF8.GetString(byteArray);
Console.OutputEncoding = Encoding.UTF8;
Console.WriteLine("UTF8: {0} for {1}", s, Console.OutputEncoding.HeaderName);

I get this one:

Default: }~€‚úþÿ for Windows-1252
ASCII: }~?????? for us-ascii
UTF8: }~ ������ for utf-8

This is what I would expect. The default code page is CP1252, not CP850 as your table shows. Try another font for your console, e.g. "Consolas" or "Lucida Console", and check the output.
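
If you want to verify which code pages are in play without screenshots, a quick sketch (this assumes classic .NET Framework on Windows; on .NET Core, Encoding.Default is UTF-8 rather than the ANSI code page):

Console.WriteLine("ANSI code page:    " + Encoding.Default.CodePage);        // e.g. 1252
Console.WriteLine("Console code page: " + Console.OutputEncoding.CodePage);  // e.g. 850 (the OEM code page)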

Upvotes: 1

Joey

Reputation: 354506

The output encoding in your case should be mostly irrelevant since you're not even working with Unicode. Furthermore, you need to change your console window settings from Raster fonts to a TrueType font, like Lucida Console or Consolas. When the console is set to raster fonts, you can only have the OEM encoding (CP850 in your case), which means Unicode doesn't really work at all.

However, all that is moot as well, since your code is ... weird, at best. First, as to what is happening here: You have a byte array, interpret that in various encodings and get a (Unicode) string back. When writing that string to the console, the Unicode characters are converted to their closest equivalent in the codepage of the console (850 here). If there is no equivalent, not even close, then you'll get a question mark ?. This happens most prominently with ASCII and characters above 127, because they simply don't exist in ASCII.
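
A small sketch to make both substitution points visible (the result of the second line depends on the console's active code page):

// 1) Decoding: ASCII defines nothing above 127, so GetString() already
//    substitutes '?' while building the Unicode string.
Console.WriteLine(Encoding.ASCII.GetString(new byte[] { 200 }));   // prints "?"

// 2) Output: a valid Unicode character without a counterpart in the console's
//    code page is replaced when the string is written.
Console.WriteLine('\u221A');   // '√' on CP437, '?' on a code page that lacks it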

If you want to see the characters you expect, then either use correct encodings throughout instead of meddling around until it somewhat works, or just use the right characters to begin with.

Console.WriteLine("√ⁿ²")

should actually work because it runs through the encoding translation processes described above.
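
For the first option, a minimal sketch (assuming the bytes really are meant as CP850 data, the console's OEM code page here):

byte[] byteArray = new byte[] { 125, 126, 127, 128, 129, 130, 250, 254, 255 };
Encoding oem = Encoding.GetEncoding(850);   // the code page the bytes came from
string s = oem.GetString(byteArray);        // now a proper Unicode string
Console.WriteLine(s);                       // translated back to the console's code page on output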

Upvotes: 3
