Rob
Rob

Reputation: 555

Trouble with chr and encoding issues

I am wondering why the output for the following code is changing:

N = 128

print(chr(N))

file = open('output.txt', 'w')
file.write(chr(N))
file.close()

In the output.txt the output is: (<- character not showing up but its a box with two zeros on top row and an 8 and a 0 on the bottom row..) however in my IDE the output is an empty square: □ . Can someone explain why these two outputs are not matching?

I am using Ubuntu 16.04 and my IDE is PyCharm CE. Also, the situation does not change if I try encoding:

file = open('output.txt', 'w', encoding = 'utf-8')

Upvotes: 0

Views: 1208

Answers (2)

abarnert
abarnert

Reputation: 365747

There’s nothing wrong with your code, or the file, or anything else.

You are correctly writing chr(128), aka U+0080, aka a Unicode control character, as UTF-8. The file will have the UTF-8 encoding of that character (the two bytes \xc2\x80).

When you view it in the unspecified first program (maybe you’re just catting it to whatever your terminal is?), it’s correctly reading those two bytes as the UTF-8 for the character U+0800 and displaying whatever image its selected font has for that character.

When you view it in PyCharm, it’s also correctly reading U+0800 and displaying it using whatever its selected font is.

The only difference is that they’re using different fonts. Different fonts do different things for non-printable control characters. (There's no standard rendering for this character—it has no specific meaning in Unicode, but is mapped to the Latin-1 Supplement character 0x80, which is defined as control character "PAD", short for "Padding Character".1) Different things could be useful, so different fonts do different things:

  • Showing you the hex value of the control character could be useful for, e.g., the kind of people who work with Unicode at the shell, so your terminal (or whatever) is configured to use a font that shows them the way.
  • Just showing you that this is something you probably didn’t want to print by using the generic replacement box2 could also be reasonable, so PyCharm is configured with a font that does that.
  • Just displaying it as a space could also be reasonable, especially in a fixed-width font. That's what I get when I cat it, or print it from my Python REPL, on my terminal.
  • Displaying the traditional Latin-1 name for the control character (PAD) in a box could also be useful. This is what Unifont has.
  • Displaying it as a Euro sign could be useful for cases where you're dealing with a bunch of old Java or Win32 code, for backward compatibility reasons.3

1. Technically, that's no longer quite true; Unicode defines it in terms of ISO-15924 code 998, "Zyyy: Code for undetermined script", not as part of ISO-8859 at all. But practically, it's either PAD, or it's an indeterminate meaningless character, which isn't exactly more useful.

2. What you actually pasted into your question is neither U+0080 nor U+FFFD but U+25A1, aka "White Square". Presumably either PyCharm recognized that its font didn't have a glyph for U+0080 and manually substituted U+25A1, or something on the chain from your clipboard to your browser to Stack Overflow did the same thing…

3. After the Euro sign was created, but before Unicode 2.1 added U+20AC and ISO-8859 added the Latin-9 encoding, people had to have some way of displaying Euros. And one of the two most common non-standard encodings was to use Latin-1 80/Unicode U+0080. (The other was A4/U+00A4). And there are a few Java and Win32 code applications written for Unicode 2.0, using this hack, still being used in the wild, and fonts to support them.

Upvotes: 2

MTMD
MTMD

Reputation: 1232

Python uses UTF-8 for its encoding. The functin chr returns the corresponding character for each input value. However, not all characters can be shown; some characters are only for control purposes. In your case, 128 is the Padding Character. Since it cannot be shown, each environment treats it differently. Hence, your file editor shows its value in hex and your IDE simply doesn't show it. Nevertheless, both editor and IDE realize what character it is.

Upvotes: 1

Related Questions