Reputation: 555
I am wondering why the output for the following code is changing:
N = 128
print(chr(N))
file = open('output.txt', 'w')
file.write(chr(N))
file.close()
In the output.txt the output is: (<- character not showing up but its a box with two zeros on top row and an 8 and a 0 on the bottom row..) however in my IDE the output is an empty square: □ . Can someone explain why these two outputs are not matching?
I am using Ubuntu 16.04 and my IDE is PyCharm CE. Also, the situation does not change if I try encoding:
file = open('output.txt', 'w', encoding = 'utf-8')
Upvotes: 0
Views: 1208
Reputation: 365747
There’s nothing wrong with your code, or the file, or anything else.
You are correctly writing chr(128)
, aka U+0080, aka a Unicode control character, as UTF-8. The file will have the UTF-8 encoding of that character (the two bytes \xc2\x80
).
When you view it in the unspecified first program (maybe you’re just cat
ting it to whatever your terminal is?), it’s correctly reading those two bytes as the UTF-8 for the character U+0800 and displaying whatever image its selected font has for that character.
When you view it in PyCharm, it’s also correctly reading U+0800 and displaying it using whatever its selected font is.
The only difference is that they’re using different fonts. Different fonts do different things for non-printable control characters. (There's no standard rendering for this character—it has no specific meaning in Unicode, but is mapped to the Latin-1 Supplement character 0x80
, which is defined as control character "PAD", short for "Padding Character".1) Different things could be useful, so different fonts do different things:
cat
it, or print
it from my Python REPL, on my terminal.PAD
) in a box could also be useful. This is what Unifont
has.1. Technically, that's no longer quite true; Unicode defines it in terms of ISO-15924 code 998, "Zyyy: Code for undetermined script", not as part of ISO-8859 at all. But practically, it's either PAD
, or it's an indeterminate meaningless character, which isn't exactly more useful.
2. What you actually pasted into your question is neither U+0080
nor U+FFFD
but U+25A1
, aka "White Square". Presumably either PyCharm recognized that its font didn't have a glyph for U+0080
and manually substituted U+25A1
, or something on the chain from your clipboard to your browser to Stack Overflow did the same thing…
3. After the Euro sign was created, but before Unicode 2.1 added U+20AC and ISO-8859 added the Latin-9 encoding, people had to have some way of displaying Euros. And one of the two most common non-standard encodings was to use Latin-1 80
/Unicode U+0080
. (The other was A4
/U+00A4
). And there are a few Java and Win32 code applications written for Unicode 2.0, using this hack, still being used in the wild, and fonts to support them.
Upvotes: 2
Reputation: 1232
Python uses UTF-8 for its encoding. The functin chr
returns the corresponding character for each input value. However, not all characters can be shown; some characters are only for control purposes. In your case, 128 is the Padding Character. Since it cannot be shown, each environment treats it differently. Hence, your file editor shows its value in hex and your IDE simply doesn't show it. Nevertheless, both editor and IDE realize what character it is.
Upvotes: 1