Reputation: 79
I made a pig latin translator that takes input from the user, translates it, and returns it. I want to add the ability to read the text from a file, but I'm running into an issue: the file's contents don't come out as I expect. Here is my code:
from sys import argv
script, filename = argv
file = open(filename, "r")
sentence = file.read()
print sentence
file.close()
The problem is that when I print out the information inside the file it looks like this:
■T h i s i s s o m e t e x t i n a f i l e
Instead of this:
This is some text in a file
I know I could work around the spaces and the odd square character with slicing, but I feel like that is treating a symptom, and I want to understand why the text is formatted weirdly so I can fix the cause.
Upvotes: 3
Views: 4032
Reputation: 79
At first, when I saw everyone responding with stuff about Unicode and UTF, I shied away from reading and trying to fix it, but I'm persistent about learning to program in Python, so I did some research, primarily this article: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".
That was really helpful. So what I can gather is that Notepad++, which I used to write the text file, wrote it in UTF-8, and Python's plain open() read the raw bytes without decoding them. The solution was to import codecs and use codecs.open like this (as Will said above):
from sys import argv
import codecs
script, filename = argv
file = codecs.open(filename, encoding = "utf-8")
sentence = file.read()
print sentence
file.close()
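One detail worth checking with this approach: if the file really is UTF-8 and was saved with a byte-order mark (an assumption, since Notepad++ can save UTF-8 with or without one), decoding as "utf-8" leaves the mark in the text as a leading U+FEFF character, while the "utf-8-sig" codec drops it. A minimal sketch, using io.open (which accepts the same encoding argument) and a temporary file as a stand-in for the real input:

```python
import codecs
import io
import os
import tempfile

expected = u"This is some text in a file"

# Stand-in for the real input: a UTF-8 file that starts with a BOM.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + expected.encode("utf-8"))

# "utf-8" keeps the BOM as a leading U+FEFF character.
with io.open(path, "r", encoding="utf-8") as f:
    with_bom = f.read()

# "utf-8-sig" removes it while decoding.
with io.open(path, "r", encoding="utf-8-sig") as f:
    without_bom = f.read()

print(with_bom.startswith(u"\ufeff"))
print(without_bom == expected)
```

If the file has no BOM, "utf-8-sig" behaves like "utf-8" when reading, so it is a reasonably safe default for files of this kind.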
Upvotes: 1
Reputation: 24689
I believe this is a Unicode UTF-16 encoded file, and the odd square character is the "Unicode Byte Order Mark" (BOM). It could also be another encoding with a byte-order mark, but it definitely appears to be a multi-byte encoding.
This is also why you're seeing the whitespace between characters. UTF-16 effectively represents each character as two bytes, but for standard ASCII characters like you're using, the other half of the character is empty (second byte is 0).
Try this instead:
from sys import argv
import codecs
script, filename = argv
file = codecs.open(filename, encoding='utf-16')
sentence = file.read()
print sentence
file.close()
Replace encoding='utf-16' with whatever encoding this actually is. You might just need to try a few and experiment.
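Rather than trying encodings by hand, the byte-order mark itself can narrow the choice. The following is only a sketch: guess_encoding and read_text are hypothetical helper names, the fallback to "utf-8" is an assumption, and a file without a BOM cannot be identified this way. It uses the BOM constants from the standard codecs module and should run on both Python 2 and 3.

```python
import codecs
import io

# Known byte-order marks and the codec that consumes each one.
# UTF-32 is checked first because the UTF-32-LE mark starts with
# the same two bytes as the UTF-16-LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(path, default="utf-8"):
    """Return a codec name based on the file's BOM, or `default` if there is none."""
    with io.open(path, "rb") as f:
        head = f.read(4)  # the longest BOM is four bytes
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default  # assumption: BOM-less files are treated as UTF-8


def read_text(path):
    """Read the whole file as text using the BOM-based guess."""
    with io.open(path, "r", encoding=guess_encoding(path)) as f:
        return f.read()
```

In the script above, sentence = read_text(filename) would then stand in for the open, read, and close lines.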
Upvotes: 4
Reputation: 177461
The original file is UTF-16. Here's an example that writes a UTF-16 file and reads it with open vs. io.open, which takes an encoding parameter:
#!python2
import io
sentence = u'This is some text in a file'
with io.open('file.txt','w',encoding='utf16') as f:
    f.write(sentence)
with open('file.txt') as f:
    print f.read()
with io.open('file.txt','r',encoding='utf16') as f:
    print f.read()
Output on US Windows 7 console:
■T h i s i s s o m e t e x t i n a f i l e
This is some text in a file
As a guess, I'd say the OP created the text file in Windows Notepad and saved it as "Unicode", which is Microsoft's misnomer for UTF-16 encoding.
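To see where the square and the gaps in the first output line come from, it can help to look at the bytes directly. This is a sketch under two assumptions: the file is little-endian UTF-16 with a BOM (the layout Notepad's "Unicode" option writes), and the console uses code page 437, the usual default for a US Windows console. It runs on Python 2 and 3.

```python
import codecs

sentence = u"This"

# Bytes of a little-endian UTF-16 file with a BOM for this text.
raw = codecs.BOM_UTF16_LE + sentence.encode("utf-16-le")

# 2 BOM bytes + 2 bytes per character; each ASCII character
# is paired with a zero byte.
zero_bytes = raw.count(b"\x00")
print("%d bytes, %d of them zero" % (len(raw), zero_bytes))  # 10 bytes, 4 of them zero

# Assumption: code page 437. Read that way, the BOM bytes 0xFF 0xFE become
# a no-break space and a black square, and the zero bytes stay NUL
# characters, which the console displays as blanks.
as_console_sees_it = raw.decode("cp437")

# Decoding with the right codec consumes the BOM and recovers the text.
decoded = raw.decode("utf-16")
print(decoded == sentence)  # True
```

The second character of as_console_sees_it is U+25A0, the black square seen at the start of the garbled output.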
Upvotes: 2
Reputation: 5279
Well, the most likely explanation is that your code is reading the data correctly.
As to why the output looks weird, there could be many reasons.
However, it looks like you are using Python 2 (print statement), and as the text appears as a character, a blank, a character, a blank,
I would assume that the file you are reading is Unicode-encoded text, so that ABC is written as \u0041\u0042\u0043.
Either decode the byte string into a Unicode string, or use Python 3 and look into the Unicode issue.
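The first option, decoding the byte string yourself, might look like the sketch below. It assumes the file is UTF-16 with a byte-order mark; read_utf16 is a hypothetical helper name, and the code runs on Python 2 and 3.

```python
import io


def read_utf16(path):
    """Read raw bytes from `path` and decode them as UTF-16 text."""
    # Binary mode returns the undecoded bytes, BOM included.
    with io.open(path, "rb") as f:
        data = f.read()
    # The "utf-16" codec uses the BOM to pick the byte order and drops it.
    return data.decode("utf-16")
```

The result is a Unicode string, so the translator would work on real characters instead of on bytes with zero bytes in between.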
Upvotes: 0