Accents with python

Question

I would like to know how to keep accents in Python, and I would appreciate it if someone could explain a little about how this works. I have spent hours searching and I still do not understand anything x)

Example 1:

text = "Danay Suarèz hablé"
print(text)

output:

Danay Suar▒z habl▒

Example 2:

print(text.encode('utf-8'))

output:

b'Danay Suar\xc3\xa8z habl\xc3\xa9'

I would just like the output to be: Danay Suarèz hablé

syntonym · Accepted Answer

Computers work in bits, that is, sequences of ones and zeros (how they are physically stored is another story). An integer can be saved as, say, 16 ones and zeros, so 51 = 00000000 00110011. Because that is pretty long we normally write it in hexadecimal, so 51 dec = 00 33 hex. But not only numbers are saved as bits; characters (and basically everything else) are too. While we can "naturally" encode integers in bits (binary), other datatypes are harder. For characters the "normal" way is ASCII, which simply assigns byte values to characters by convention. In ASCII 00 33 = "3".
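To make the numbers above concrete, here is a small sketch using only Python built-ins. The variable names are just for illustration.

```python
n = 51

# The same integer written as 16 binary digits and as 4 hex digits.
bits = format(n, "016b")        # '0000000000110011'
hex_digits = format(n, "04x")   # '0033'

# ASCII maps the value 0x33 (51) to the character "3", and back.
char = chr(0x33)                # '3'
code = ord("3")                 # 51

print(bits, hex_digits, char, code)
```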

But ASCII only defines 128 (7 bit) different characters. That is just enough for English plus some extra characters, but not enough for other languages. So people created lots of encodings, mostly covering the characters used in their own language. So while ASCII says 00 33 = "3", another encoding could say 00 33 = "ü" or whatever. Most encodings that one encounters actually agree with ASCII on the first 128 characters and only extend it.
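As a rough illustration, the sketch below encodes the same character with three encodings that ship with Python. They disagree on è but agree on a plain ASCII letter.

```python
ch = "è"

# One character, three different byte representations.
as_utf8 = ch.encode("utf-8")      # b'\xc3\xa8'
as_latin1 = ch.encode("latin-1")  # b'\xe8'
as_cp850 = ch.encode("cp850")     # b'\x8a'

# Inside the ASCII range all of these encodings produce the same byte.
ascii_bytes = {"A".encode(enc) for enc in ("ascii", "utf-8", "latin-1", "cp850")}

print(as_utf8, as_latin1, as_cp850, ascii_bytes)
```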

Your sys.stdout.encoding says it is UTF8, so python takes your è and translates it to the bytes C3 A8. Your command line, however, uses codepage 850, a Western European codepage that is often loosely called latin1 (strictly speaking it is not identical to ISO-8859-1). Read as latin1, C3 A8 would be Ã¨ (which is not what you see, so maybe I made an error in the translation somewhere, or maybe your terminal does not have a font which can display that), which is different from what UTF8 meant.
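The mismatch can be reproduced without a terminal: encode with UTF8, then decode the same bytes the way a terminal using another encoding would. This is only a sketch of the mechanism, not a reproduction of your exact console output.

```python
# What python sends when stdout is UTF8.
sent = "è".encode("utf-8")            # b'\xc3\xa8'

# What a terminal shows if it reads those two bytes with a different encoding.
seen_latin1 = sent.decode("latin-1")  # 'Ã¨'
seen_cp850 = sent.decode("cp850")     # '├¿'

# Only a matching decoder gives the original character back.
seen_utf8 = sent.decode("utf-8")      # 'è'

print(seen_latin1, seen_cp850, seen_utf8)
```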

But how does one fix that? Either tell your command line to use UTF8, or tell python to use the codepage of your command line. You should be able to change the command line encoding to UTF8 by typing chcp 65001 before you execute your script.
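For the other direction (telling python which encoding to write), here is a minimal sketch. The helper write_with_encoding is hypothetical and writes into an in-memory buffer so the resulting bytes can be inspected. On a real console, Python 3.7+ also offers sys.stdout.reconfigure(encoding=...), and the PYTHONIOENCODING environment variable can be set before starting the script. Whether "cp850" is the right value depends on what chcp reports on your machine.

```python
import io

def write_with_encoding(text, encoding):
    """Hypothetical helper: print text through a stream that uses the
    given encoding and return the raw bytes that were written."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    print(text, file=stream)
    stream.flush()
    data = raw.getvalue()
    stream.detach()  # leave the buffer alone when the wrapper goes away
    return data

# A UTF8 terminal expects these bytes for "hablé" ...
utf8_bytes = write_with_encoding("hablé", "utf-8")    # b'habl\xc3\xa9\n'
# ... while a codepage 850 terminal expects these.
cp850_bytes = write_with_encoding("hablé", "cp850")   # b'habl\x82\n'

print(utf8_bytes, cp850_bytes)
```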

If you use print(text.encode('utf-8')), python shows a human readable version of the bytes that this object consists of. It displays each byte as its ASCII character where possible and as a hexadecimal escape where not, so \xc3\xa8 means the bytes c3 a8. Of course, when you actually print that, python still transmits these characters to your terminal in UTF8, but because the escaped form consists only of ASCII characters, and UTF8 and latin1 agree on the ASCII range, your terminal interprets them correctly.
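A short sketch of that last point: the printed form of a bytes object is an ordinary string made only of ASCII characters, which is why it survives any of these terminal encodings.

```python
text = "Danay Suarèz hablé"
data = text.encode("utf-8")

# print(data) writes repr(data), which is a normal str.
shown = repr(data)

# Every character of that str is in the ASCII range.
only_ascii = all(ord(c) < 128 for c in shown)

# Decoding with the right encoding recovers the original text.
roundtrip = data.decode("utf-8")

print(shown, only_ascii, roundtrip == text)
```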
