Supetorus

Reputation: 79

Python Read String from File with Strange Encoding

I made a Pig Latin translator that takes input from the user, translates it, and returns it. I want to add the ability to read the text from a file, but I'm running into an issue: the file's contents don't come out the way I expect. Here is my code:

from sys import argv
script, filename = argv

file = open(filename, "r")

sentence = file.read()

print sentence

file.close()

The problem is that when I print out the information inside the file it looks like this:

■T h i s   i s   s o m e   t e x t   i n   a   f i l e

Instead of this:

This is some text in a file

I know I could work around the spaces and the odd square character with slicing, but I feel like that is treating a symptom, and I want to understand why the text is formatted so strangely so that I can fix the cause.

Upvotes: 3

Views: 4032

Answers (4)

Supetorus

Reputation: 79

At first, when I saw everyone responding with stuff about Unicode and UTF, I shied away from reading and trying to fix it, but I'm persistent about learning to program in Python, so I did some research, primarily this article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

That was really helpful. So what I can gather is that Notepad++, which I used to write the text file, wrote it in UTF-8, and Python read it in UTF-16. The solution was to import codecs and use codecs.open like this (as Will said in his answer):

from sys import argv
import codecs

script, filename = argv

file = codecs.open(filename, encoding = "utf-8")

sentence = file.read()

print sentence

file.close()

Upvotes: 1

Will

Reputation: 24689

I believe this is a UTF-16 encoded file, and the odd square at the start is the Unicode byte order mark (BOM). It could also be another encoding with a byte order mark, but it definitely appears to be a multi-byte encoding.

This is also why you're seeing the whitespace between characters. UTF-16 effectively represents each character as two bytes, but for standard ASCII characters like you're using, the other half of the character is empty (second byte is 0).
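To make the byte layout concrete, here is a minimal sketch that is not part of the original answer. It assumes little-endian UTF-16, which is what Windows tools typically write; the sample string 'Hi' is just an illustration.

```python
import codecs

text = u'Hi'

# 'utf-16-le' encodes without a BOM: every ASCII character
# is followed by a zero byte.
payload = text.encode('utf-16-le')

# A little-endian UTF-16 file normally starts with the BOM bytes FF FE.
raw = codecs.BOM_UTF16_LE + payload

# Shows the BOM, then H, 00, i, 00.
print(repr(raw))

# The generic 'utf-16' codec uses the BOM to pick the byte order and strips it.
print(raw.decode('utf-16') == text)
```

Printed without decoding, those extra bytes are what show up as the square and the gaps. My reading is that a legacy console code page such as CP437 renders FE as ■ and the zero bytes as blanks, but that part is an assumption about the asker's console.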

Try this instead:

from sys import argv
import codecs
script, filename = argv

file = codecs.open(filename, encoding='utf-16')
sentence = file.read()
print sentence
file.close()

Replace encoding='utf-16' with whatever encoding this actually is. You might just need to try a few and experiment.
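If you would rather not guess, one option is to look at the first few bytes of the file for a known BOM. This is only a sketch: guess_encoding and read_text are hypothetical helper names, the 'utf-8' fallback is an assumption, and a file without a BOM can't be identified this way.

```python
import codecs
import io

# Standard BOM constants from the codecs module, checked in order.
# UTF-32 LE comes before UTF-16 LE because its BOM starts with the
# same two bytes (FF FE).
BOMS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]


def guess_encoding(path, default='utf-8'):
    """Hypothetical helper: pick a codec name from the file's leading bytes."""
    with io.open(path, 'rb') as f:
        head = f.read(4)
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    # No BOM found; falling back to the default is only an assumption.
    return default


def read_text(path):
    """Hypothetical helper: read a file as text using the guessed codec."""
    with io.open(path, 'r', encoding=guess_encoding(path)) as f:
        return f.read()
```

The 'utf-16', 'utf-32' and 'utf-8-sig' codecs all consume the BOM while decoding, so the returned text does not start with a stray marker character.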

Upvotes: 4

Mark Tolonen

Reputation: 177461

The original file is UTF-16. Here's an example that writes a UTF-16 file and reads it with open vs. io.open, which takes an encoding parameter:

#!python2
import io

sentence = u'This is some text in a file'

with io.open('file.txt','w',encoding='utf16') as f:
    f.write(sentence)

with open('file.txt') as f:
    print f.read()

with io.open('file.txt','r',encoding='utf16') as f:
    print f.read()

Output on US Windows 7 console:

 ■T h i s   i s   s o m e   t e x t   i n   a   f i l e
This is some text in a file

As a guess, I'd say the OP created the text file in Windows Notepad and saved it as "Unicode", which is Microsoft's misnomer for UTF-16 encoding.

Upvotes: 2

Tim Seed

Reputation: 5279

Well - the most striking explanation is that your code is reading the file's data correctly.

As to why the output looks weird - it could be due to many reasons.

However it looks like you are using Python 2 (print statement) - And as the text is appearing as

CHARCHAR

I would assume that the file you are reading is Unicode-encoded text - so that ABC is written \u0041\u0042\u0043

Either decode the byte string into a Unicode string, or use Python 3 and look into the Unicode issue.
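For the first option, a minimal sketch of decoding by hand might look like this. It assumes the file really is UTF-16 with a BOM, and read_utf16 is a hypothetical helper name, not something from the question.

```python
import io


def read_utf16(path):
    """Hypothetical helper: read raw bytes, then decode them explicitly."""
    with io.open(path, 'rb') as f:
        raw = f.read()  # a byte string, BOM and zero bytes included
    # 'utf-16' reads the BOM to determine byte order and drops it.
    return raw.decode('utf-16')
```

The result is a Unicode string in both Python 2 and Python 3, so slicing and printing it behaves the way the question expected.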

Upvotes: 0
