Unexpected output using read() with textfiles and different encodings

Question

I was experimenting with simple code to see how read() behaves on text files. So I made a simple txt file with the following:

AB

BA

I tried to print the first 2 characters to the console.

With the encoding set to "ansi" for both the txt file and open(), the output is correct.

With the encoding set to "utf-8" for both the txt file and open(), the output is A.

With the txt file saved as "utf-8" and open() left at its default encoding, the output is ο».

What is going on? locale.getpreferredencoding() returns cp1253. Could the ο» characters be interfering with my utf-8 encoding? How can I get rid of them?

My code:

import os

current_dir = "some_directory"  # doesn't really matter
file_name = "name_of_text.txt"
# os.path.join inserts the path separator that plain concatenation would omit
full_path = os.path.join(current_dir, file_name)
file_mode = "rt"

# add encoding="utf_8" or encoding="ansi" to open() to replicate each case
with open(full_path, mode=file_mode) as f:
    reader = f.read(2)

print(reader)

snakecharmerb · Accepted Answer

The file has been encoded with the utf-8-sig codec, used by some Microsoft applications when UTF-8 encoding is required. This codec writes three marker bytes (the UTF-8 encoded byte order mark, b'\xef\xbb\xbf') at the beginning of the file (described in this section of the codecs docs).

When you decode with UTF-8, the three marker bytes are read as a single, invisible character, U+FEFF (a UTF-8 character may be encoded as more than one byte). read(2) therefore returns that character followed by 'A', so you only see 'A'.
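
A minimal sketch of that behaviour, assuming the file holds the text from the question ("AB", newline, "BA") saved with a leading marker. It builds the bytes in memory rather than reading the real file, and the variable names are only illustrative:

```python
# Bytes that a "UTF-8 with BOM" save would produce for the text "AB\nBA".
raw = "AB\nBA".encode("utf-8-sig")

# The first three bytes are the byte order mark.
bom = raw[:3]

# Plain UTF-8 decoding keeps the mark as one character, U+FEFF,
# so the first two characters are the mark and 'A'.
first_two = raw.decode("utf-8")[:2]

print(bom)              # b'\xef\xbb\xbf'
print(repr(first_two))  # '\ufeffA'
```

Printing first_two without repr() shows just A on most consoles, because U+FEFF has no visible glyph.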

When you decode with no encoding specified, cp1253 is used, and it treats each marker byte as a separate, ordinary character, hence the output that you see:

>>> 'AB'.encode('utf-8-sig').decode('cp1253')[:2]
'ο»'
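
To get rid of the marker when reading, one option is to pass encoding="utf-8-sig" to open(); on decoding, that codec skips the marker if it is present at the start of the file. The sketch below is only an illustration: it writes a temporary stand-in file (the path and contents are assumptions mirroring the question) and then reads it back.

```python
import os
import tempfile

# Hypothetical stand-in for the question's file: "AB\nBA" saved as UTF-8 with a BOM.
path = os.path.join(tempfile.mkdtemp(), "name_of_text.txt")
with open(path, "wb") as f:
    f.write("AB\nBA".encode("utf-8-sig"))

# utf-8-sig drops a leading BOM if there is one, so read(2) returns real text only.
with open(path, mode="rt", encoding="utf-8-sig") as f:
    first_two = f.read(2)

print(first_two)  # AB
```

The same call also reads UTF-8 files that have no marker, so it is a reasonable default when the files may come from either kind of editor.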
