Reputation: 1771
I am trying to load a text file, which contains some German letters, with
content=open("file.txt","r").read()
which results in this error message
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 26: ordinal not in range(128)
If I modify the file to contain only ASCII characters, everything works as expected.
Apparently, using
content=open("file.txt","rb").read()
or
content=open("file.txt","r",encoding="utf-8").read()
both do the job.
Why is it possible to read with "binary" mode and get the same result as with utf-8 encoding?
Upvotes: 2
Views: 6541
Reputation: 30250
ASCII is limited to characters in the range [0, 128). If you try to decode a byte outside that range as ASCII, you get that error.
When you read the file in as bytes, you're "widening" the acceptable range of values to [0, 256), and nothing is decoded at all. So your 0xc3 byte (Ã, if interpreted as Latin-1)
is now read in without error. But despite seeming to work, it's still not "correct".
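A quick sketch of those two points (the two-byte sequence below is just an illustrative value; 0xc3 is the byte from the traceback):
data = b"\xc3\xa4"          # a two-byte UTF-8 sequence; 0xc3 is the byte from the traceback
print(list(data))           # [195, 164] -- as raw bytes, any value in [0, 256) is acceptable
try:
    data.decode("ascii")    # 0xc3 is outside [0, 128), so the ASCII codec refuses it
except UnicodeDecodeError as err:
    print(err)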
If your strings are indeed Unicode encoded (e.g. as UTF-8), then one of them may well contain a multibyte character, that is, a character whose byte representation spans more than one byte.
This is where the difference between reading a file as a byte string and properly decoding it becomes quite apparent.
A character like this: č
will be read in as two bytes but, properly decoded, is a single character:
encoded = 'č'.encode('utf-8')          # two bytes in UTF-8
print(len(encoded))                    # 2
print(len(encoded.decode('utf-8')))    # 1 -- a single character once decoded
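To relate this to the question: German letters such as ä, ö, ü and ß all encode to two-byte UTF-8 sequences that begin with 0xc3, the very byte in the traceback. A quick sketch, using ä purely as an example:
raw = "ä".encode("utf-8")      # b'\xc3\xa4' -- 0xc3 plus a continuation byte
print(raw.decode("utf-8"))     # 'ä'  -- decoded with the right codec
print(raw.decode("latin-1"))   # 'Ã¤' -- a wrong codec raises no error, it just produces garbage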
Upvotes: 5
Reputation: 123
In Python 3, using 'r' mode without specifying an encoding uses the platform's default encoding (locale.getpreferredencoding()), which in your case is ASCII. Using 'rb' mode reads the file as raw bytes and makes no attempt to interpret them as a string of characters.
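A small sketch of the difference, assuming file.txt from the question is UTF-8 on disk:
import locale
print(locale.getpreferredencoding(False))               # the encoding 'r' mode falls back to (ASCII here)
raw = open("file.txt", "rb").read()                     # bytes -- no decoding attempted
text = open("file.txt", "r", encoding="utf-8").read()   # str   -- decoded as UTF-8
print(type(raw), type(text))                            # <class 'bytes'> <class 'str'>
assert raw.decode("utf-8") == text                      # decoding the bytes yourself yields the same string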
Upvotes: 6