Reputation: 319
I'm trying to understand Unicode and all associated things. I have made a utf-8.txt file, which is obviously encoded in UTF-8. It has "Hello world!" inside. Here's what I do:
f = open('utf8.txt', mode = 'r', encoding = 'utf8')
f.read()
What I get is: '\ufeffHello world!'. Where did the prefix come from?
Another try:
f = open('utf8.txt', 'rb')
byte = f.read()
Printing byte gives: b'\xef\xbb\xbfHello world!'. I assume the prefix is shown here as hex.
byte.decode('utf8')
The above code again gives me: '\ufeffHello world!'
What am I doing wrong? How do I retrieve the text from a UTF-8 file in Python?
Thanks for the feedback!
Upvotes: 5
Views: 2461
Reputation: 665
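Your utf-8.txt is encoded as UTF-8 with a BOM (byte order mark), which is different from plain UTF-8. In such a file, the character '\uFEFF' is written at the beginning of the file, stored as the bytes b'\xef\xbb\xbf'. Instead of using encoding = 'utf8', try encoding = 'utf-8-sig', which strips the BOM when reading: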
with open('utf8.txt', mode = 'r', encoding = 'utf-8-sig') as f:
    print(f.read())
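To see the difference side by side, here is a minimal sketch. It assumes a throwaway file created in a temporary directory (the path and variable names are only for illustration) with the same BOM-prefixed content as in the question, and reads it back with both codecs.

```python
import codecs
import os
import tempfile

# Hypothetical sample file, created here so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'utf8.txt')
with open(path, 'wb') as f:
    # codecs.BOM_UTF8 is the byte sequence b'\xef\xbb\xbf'
    f.write(codecs.BOM_UTF8 + 'Hello world!'.encode('utf-8'))

# Raw bytes: the BOM shows up as three bytes at the start.
with open(path, 'rb') as f:
    raw = f.read()

# Plain 'utf-8' decodes the BOM as the character U+FEFF and keeps it.
with open(path, mode='r', encoding='utf-8') as f:
    with_bom = f.read()

# 'utf-8-sig' drops a leading BOM if one is present.
with open(path, mode='r', encoding='utf-8-sig') as f:
    without_bom = f.read()

print(raw)
print(repr(with_bom))
print(repr(without_bom))

# The same codec also works when decoding bytes directly.
print(repr(raw.decode('utf-8-sig')))
```

'utf-8-sig' also reads files that have no BOM, so it should be a safe choice when you don't know how the file was saved.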
Upvotes: 7